Support for small files in _split_file #448

Closed
kmpaul opened this issue Apr 10, 2024 · 5 comments · Fixed by #451
Comments

@kmpaul
Contributor

kmpaul commented Apr 10, 2024

I've noticed that the kerchunk.grib2._split_file() function does not work if the file being scanned is smaller than 1024 bytes. The early break on a short read (if len(head) < 1024: break  # EOF) is the culprit:

while f.tell() < size:
    logger.debug(f"extract part {part + 1}")
    head = f.read(1024)
    if len(head) < 1024:
        break  # EOF
    if b"GRIB" not in head:
        f.seek(-4, 1)
        continue
    ind = head.index(b"GRIB")
    start = f.tell() - 1024 + ind
    part_size = int.from_bytes(head[ind + 12 : ind + 16], "big")
    f.seek(start)
    yield start, part_size, f.read(part_size)
    part += 1
    if skip and part >= skip:
        break

Is it safe to remove this early break so that files smaller than 1024 bytes can be handled? (Note that the start offset then also needs to use len(head) rather than the fixed 1024, since the final read can be short.) Essentially, the final while block would look like:

while f.tell() < size:
    logger.debug(f"extract part {part + 1}")
    head = f.read(1024)
    # if len(head) < 1024:
    #     break  # EOF
    if b"GRIB" not in head:
        f.seek(-4, 1)
        continue
    ind = head.index(b"GRIB")
    # start = f.tell() - 1024 + ind
    start = f.tell() - len(head) + ind
    part_size = int.from_bytes(head[ind + 12 : ind + 16], "big")
    f.seek(start)
    yield start, part_size, f.read(part_size)
    part += 1
    if skip and part >= skip:
        break

It is unclear to me whether this would cause problems for GRIB files other than the datasets I've handled myself.
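As a quick sanity check, here is a minimal, self-contained sketch (not kerchunk's test suite) that runs the modified loop over a synthetic, GRIB2-shaped buffer well under 1024 bytes; the fake message only mimics the layout that the head[ind + 12 : ind + 16] read assumes (a GRIB magic followed by a big-endian total length whose low four bytes sit at offsets 12-15):

import io

def scan_parts(f, size, skip=0):
    # The proposed loop, wrapped in a generator for testing; it no longer
    # breaks on reads shorter than 1024 bytes.
    part = 0
    while f.tell() < size:
        head = f.read(1024)
        if b"GRIB" not in head:
            f.seek(-4, 1)
            continue
        ind = head.index(b"GRIB")
        start = f.tell() - len(head) + ind
        part_size = int.from_bytes(head[ind + 12 : ind + 16], "big")
        f.seek(start)
        yield start, part_size, f.read(part_size)
        part += 1
        if skip and part >= skip:
            break

# One fake 48-byte "message": a 16-byte section-0-like header plus padding.
total_len = 48
msg = b"GRIB" + bytes(4) + total_len.to_bytes(8, "big") + bytes(total_len - 16)

for start, part_size, data in scan_parts(io.BytesIO(msg), len(msg)):
    print(start, part_size, len(data))   # expected: 0 48 48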

@martindurant
Member

It could maybe be if not head (i.e., we unexpectedly reached EOF), although it looks like that case would be caught by the GRIB check and the outer while loop anyway. Your suggestion is very probably fine.
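For illustration only (not kerchunk code), here is the difference between the two guards on a short but non-empty read, such as the last chunk of a sub-1024-byte file:

# A short, non-empty read: pretend f.read(1024) returned only 48 bytes.
head = b"GRIB" + bytes(44)

print(len(head) < 1024)  # True  -> the current guard breaks and the message is skipped
print(not head)          # False -> an "if not head" guard would keep scanning
print(not b"")           # True  -> it would still stop on a genuinely empty read (EOF)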

Out of interest, what possible information is there in such a small GRIB?

@kmpaul
Contributor Author

kmpaul commented Apr 11, 2024

@martindurant: Ha! You would be surprised how much is in my tiny GRIB message. 😄 The actual file size is less than 200 bytes, which originally made me think that my download had failed and the file was incomplete. However, it turns out to be a completely valid GRIB file. It's a testament to the fact that GRIB is not really a fully self-describing data format; you need other libraries (like eccodes and cfgrib) to "fill in the gaps." The information in those fewer than 200 bytes is enough to construct a 72-million-member point cloud of all zeros, and even the lat-lon values are specified by a code value that encodes the grid information (it is a regular lat-lon grid). The result is a file smaller than 200 bytes that can be used to procedurally generate (with cfgrib) an Xarray DataArray of about 275 MB.
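For context, the expansion described above can be reproduced with xarray's cfgrib engine; the file name below is hypothetical, and cfgrib plus eccodes must be installed:

import xarray as xr

# Hypothetical file name: any small-but-valid GRIB2 message like the one
# described above will do.
ds = xr.open_dataset("tiny_message.grib2", engine="cfgrib")
print(ds)         # a ~72-million-point all-zero field on a regular lat-lon grid
print(ds.nbytes)  # on the order of the ~275 MB described above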

I will put together a PR with the changes offered above so they can be reviewed.

@kmpaul
Contributor Author

kmpaul commented Apr 11, 2024

I just created #451 to address this.

@kmpaul
Contributor Author

kmpaul commented Apr 12, 2024

@martindurant: Thanks for your help and feedback on #446, #448, #450, and #451!

@martindurant
Member

Thanks for your efforts :)
