Support for small files in _split_file #448

Closed
kmpaul opened this issue Apr 10, 2024 · 5 comments · Fixed by #451
Comments

@kmpaul
Contributor

kmpaul commented Apr 10, 2024

I've noticed that the kerchunk.grib2._split_file() function does not work if the file being scanned is smaller than 1024 bytes. The early break on a short read (if len(head) < 1024: break  # EOF) is the culprit:

while f.tell() < size:
    logger.debug(f"extract part {part + 1}")
    head = f.read(1024)
    if len(head) < 1024:
        break  # EOF
    if b"GRIB" not in head:
        f.seek(-4, 1)
        continue
    ind = head.index(b"GRIB")
    start = f.tell() - 1024 + ind
    part_size = int.from_bytes(head[ind + 12 : ind + 16], "big")
    f.seek(start)
    yield start, part_size, f.read(part_size)
    part += 1
    if skip and part >= skip:
        break

Is it safe to remove this early break so that files smaller than 1024 bytes can be handled? (Note that the start offset then also needs to use len(head) rather than the fixed 1024, since the final read can be short.) Essentially, the final while block would look like:

while f.tell() < size:
    logger.debug(f"extract part {part + 1}")
    head = f.read(1024)
    # if len(head) < 1024:
    #     break  # EOF
    if b"GRIB" not in head:
        f.seek(-4, 1)
        continue
    ind = head.index(b"GRIB")
    # start = f.tell() - 1024 + ind
    start = f.tell() - len(head) + ind
    part_size = int.from_bytes(head[ind + 12 : ind + 16], "big")
    f.seek(start)
    yield start, part_size, f.read(part_size)
    part += 1
    if skip and part >= skip:
        break

It is unclear to me whether this would cause problems for GRIB files other than the datasets I've handled myself.
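As a quick sanity check, here is a minimal, self-contained sketch (not kerchunk's test suite) that runs the modified loop over a synthetic, GRIB2-shaped buffer well under 1024 bytes; the fake message only mimics the layout that the head[ind + 12 : ind + 16] read assumes (a GRIB magic followed by a big-endian total length whose low four bytes sit at offsets 12-15):

import io

def scan_parts(f, size, skip=0):
    # The proposed loop, wrapped in a generator for testing; it no longer
    # breaks on reads shorter than 1024 bytes.
    part = 0
    while f.tell() < size:
        head = f.read(1024)
        if b"GRIB" not in head:
            f.seek(-4, 1)
            continue
        ind = head.index(b"GRIB")
        start = f.tell() - len(head) + ind
        part_size = int.from_bytes(head[ind + 12 : ind + 16], "big")
        f.seek(start)
        yield start, part_size, f.read(part_size)
        part += 1
        if skip and part >= skip:
            break

# One fake 48-byte "message": a 16-byte section-0-like header plus padding.
total_len = 48
msg = b"GRIB" + bytes(4) + total_len.to_bytes(8, "big") + bytes(total_len - 16)

for start, part_size, data in scan_parts(io.BytesIO(msg), len(msg)):
    print(start, part_size, len(data))   # expected: 0 48 48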

@martindurant
Member

It could maybe be if not head (i.e., we unexpectedly reached EOF), although it looks like that case would be caught by the GRIB check and the outer while loop anyway. Your suggestion is very probably fine.
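For illustration only (not kerchunk code), here is the difference between the two guards on a short but non-empty read, such as the last chunk of a sub-1024-byte file:

# A short, non-empty read: pretend f.read(1024) returned only 48 bytes.
head = b"GRIB" + bytes(44)

print(len(head) < 1024)  # True  -> the current guard breaks and the message is skipped
print(not head)          # False -> an "if not head" guard would keep scanning
print(not b"")           # True  -> it would still stop on a genuinely empty read (EOF)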

Out of interest, what possible information is there in such a small GRIB?

@kmpaul
Contributor Author

kmpaul commented Apr 11, 2024

@martindurant: Ha! You would be surprised how much is in my tiny GRIB message. 😄 The actual file size is less than 200 bytes, which originally made me think that my download had failed and the file was incomplete. However, it turns out to be a completely valid GRIB file. It's a testament to the fact that GRIB is not really a fully self-describing data format; you need other libraries (like eccodes and cfgrib) to "fill in the gaps." The information in those fewer than 200 bytes is enough to construct a 72-million-member point cloud of all zeros, and even the lat-lon values are specified by a code value that encodes the grid information (it is a regular lat-lon grid). The result is a file smaller than 200 bytes that can be used to procedurally generate (with cfgrib) an Xarray DataArray of about 275 MB.
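For context, the expansion described above can be reproduced with xarray's cfgrib engine; the file name below is hypothetical, and cfgrib plus eccodes must be installed:

import xarray as xr

# Hypothetical file name: any small-but-valid GRIB2 message like the one
# described above will do.
ds = xr.open_dataset("tiny_message.grib2", engine="cfgrib")
print(ds)         # a ~72-million-point all-zero field on a regular lat-lon grid
print(ds.nbytes)  # on the order of the ~275 MB described above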

I will put together a PR with the changes offered above so they can be reviewed.

@kmpaul
Contributor Author

kmpaul commented Apr 11, 2024

I just created #451 to address this.

@kmpaul
Contributor Author

kmpaul commented Apr 12, 2024

@martindurant: Thanks for your help and feedback on #446, #448, #450, and #451!

@martindurant
Member

Thanks for your efforts :)
