Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Byte range support #166

Closed
wants to merge 2 commits into from

Conversation

dopplershift
Copy link
Member

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

This turns on the external server tests (needed to run byte range tests). Also force byte range support on just to be sure (but it should be on by default).

Based on my testing, while byte range support is enabled, it's not working.

This is needed to run the byte range tests. Also force byte range
support on just to be sure (but it should be on by default).
@conda-forge-webservices
Copy link

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@dopplershift
Copy link
Member Author

@WardF @DennisHeimbigner While the tests pass (you can ignore the spurious Windows failure), I've downloaded the built package and at least ncdump and netCDF4-python fail with the second test URL (https://noaa-goes16.s3.amazonaws.com/ABI-L1b-RadC/2017/059/03/OR_ABI-L1b-RadC-M3C13_G16_s20170590337505_e20170590340289_c20170590340316.nc#mode=bytes), but works with the first (https://remotetest.unidata.ucar.edu/thredds/fileServer/testdata/2004050300_eta_211.nc#bytes).

@dopplershift
Copy link
Member Author

@DennisHeimbigner That works.

@dopplershift
Copy link
Member Author

It's not clear to me what the difference between those is, since I don't see anything like forwarding going on if I hit it using cURL. I'll also note that it takes almost 10 seconds to open that second URL, almost like it's downloading the whole file, which would indicate it's not actually using byte ranges? Or could that be a consequence of how metadata is stored in the file?

@DennisHeimbigner
Copy link

The difference is that the second url explicitly states the Amazon region: us-east-1.
I have code to convert all S3 urls to a canonical form termed "path", where the bucket is not
part of the host, but rather the first segment of the path part of the URL. The above
two urls are both in what is called "virtual" format where the bucket is the first segment
of the host part of the url.
It is supposed to be the case that if the region is not specified, then it defaults to us-east-1.
It looks like I my code missed the case of a virtual path with default region.

@dopplershift
Copy link
Member Author

@DennisHeimbigner Right, I saw structurally the difference in the URL. What I don't understand is: why does it matter? In this case we shouldn't be using the S3 API at all (it's not even enabled in these builds). Those URLs should be accessed directly using generic HTTP support for reading using byte ranges. I also have another, non-S3 url that fails but should work: https://crudata.uea.ac.uk/cru/data/temperature/HadCRUT.4.6.0.0.median.nc#mode=bytes

@DennisHeimbigner
Copy link

The reason is that HDF5 has its own S3 reader that it uses for byte-range reading independent
of whether or not netcdf-c enables S3. That HDF5 code does not depend on aws-sdk-cpp.
Because of that, I process a path that is a URL and looks like some kind of S3 related url,
to convert the url to a canonical form before attempting to use it.
This will occur even if S3 is disabled in netcdf.

In looking at that original URL, it occurs to me that I assumed that the segment "noaa-goes16"
is a bucket, but perhaps it is really a special region instead. If so, then it appears that whatever
server is processing that URL has extended the set of legal S3 urls to be of the form

.s3.amazonsaws.com

instead of what I thought was the legal form:

s3.region.amazonaws.com

I need to review the set of legal URL formats for S3, if I can find it again.

@DennisHeimbigner
Copy link

@dopplershift
Copy link
Member Author

No, noaa-goes16 is the bucket.

The segment s3.amazonaws.com is documented as working for the us-east-1 region in the docs you linked to:

image

No region necessary.

Regardless, I'm not sure why it doesn't fall back and try the normal byte-range access that works for the remotetest URL above? Also, I'm not sure why tst_byterange isn't failing?

@dopplershift
Copy link
Member Author

I've opened Unidata/netcdf-c#2642 and Unidata/netcdf-c#2643 to track the issues with the noaa-goes16 bucket and the HadCRUT dataset, respectively.

@@ -12,3 +12,5 @@ provider:
test_on_native_only: true
conda_build:
pkg_format: '2'
azure:
store_build_artifacts: True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this is the source of the spurious failures on windows

conda-forge/conda-smithy#1724
conda-forge/conda-smithy#1725

@zklaus
Copy link

zklaus commented Jul 6, 2023

Byte range support had been activated in #107, but (accidentally?) only for Windows and the static builds on Unix. I have now activated it in #178 as part of the alignment of the Windows and Unix build scripts. As such, I guess this PR could be closed, however, it contains a lot of information about additional testing and possible upstream problems.

Looking at the upstream issues Unidata/netcdf-c#2642 and Unidata/netcdf-c#2643, they are allegedly addressed (see Unidata/netcdf-c#2642 (comment) and Unidata/netcdf-c#2643 (comment), respectively), but have yet to be closed.

@dopplershift, how do you want to proceed?

@dopplershift
Copy link
Member Author

I think we can close this.

@dopplershift dopplershift deleted the test-byte-range branch July 6, 2023 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants