Use of bashdatacatalog to find invalid files #428
Comments
As a follow-up, I was able to write a Python script that renamed the MERRA2 files based on the real file name listed under global attributes in the netCDF metadata. Unfortunately, the metadata in the HEMCO netCDF files does not provide the file name in a consistent format, so renaming these files will be more difficult.
Thanks for pointing this out @cbutenhoff. I didn't encounter this issue with xargs -P curl before. Could you let us know how many streams you used to download the data? We use MD5 checksums, which only verify the content of the file, not the file name. Unfortunately, the metadata format is different across collections as they are from different sources. Perhaps you can try extracting the key information with regular expressions.
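Because an MD5 digest depends only on a file's bytes, an intact file saved under the wrong name should still hash to the value recorded for its true name. The sketch below uses that to propose renames. It assumes a plain `md5sum`-style checksum list (`<md5>  <relative path>` per line), which is not necessarily the format of the bashdatacatalog catalogs, and the function names `md5_of`, `load_expected`, and `propose_renames` are hypothetical helpers, not part of bashdatacatalog.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_expected(checksum_file):
    """Parse md5sum-style lines ('<md5>  <path>') into {md5: normalized path}.

    The input format is an assumption made for this sketch.
    """
    expected = {}
    for line in Path(checksum_file).read_text().splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            name = parts[1].strip().lstrip("*")  # '*' marks binary mode in md5sum
            expected[parts[0].lower()] = Path(name).as_posix()
    return expected


def propose_renames(root, expected):
    """Yield (current, intended) relative paths for files whose content
    matches a checksum that is recorded under a different path."""
    root = Path(root)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        intended = expected.get(md5_of(path))
        if intended is not None and intended != rel:
            yield rel, intended
```

Two caveats: files with identical content (and therefore identical digests) cannot be told apart this way, and a file whose digest matches nothing in the list is either truncated or absent from the list and would still need to be downloaded again.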
Thanks @yidant. At different times I used 4 and 8 streams. I'm not positive the parallel download caused the problem, but files I downloaded using 'wget' seem fine. In some (most?) of the HEMCO nc files, there is a 'history' attribute that contains the actual file name, though it's not consistently located. I'm trying to do some checks based on this. In the end I'll probably spend more time trying to rename corrupt filenames than I would have just redownloading all the input data :).
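A rough sketch of the regular-expression approach mentioned above: pull every token that looks like a netCDF file name out of a 'history' string and keep the base names as rename candidates. The pattern and the `candidate_names` helper are assumptions for illustration. Actual 'history' contents vary between collections, so the candidates would still need checking by hand.

```python
import re
from pathlib import PurePosixPath

# Path-like tokens ending in .nc or .nc4 (an assumed naming pattern).
NC_NAME = re.compile(r"[\w./+-]+\.nc4?\b")


def candidate_names(history):
    """Return unique base names of netCDF-looking tokens, in order of appearance."""
    seen = []
    for token in NC_NAME.findall(history):
        name = PurePosixPath(token).name
        if name not in seen:
            seen.append(name)
    return seen
```

Reading the attribute itself needs a netCDF library. With the `netCDF4` package, for instance, `netCDF4.Dataset(path).getncattr("history")` returns the string to pass in. When several candidates come back, the last one is not guaranteed to be the file's own name.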
This comment may be better placed as its own issue, but I believe the data catalog for GCHPv14.3.0 incorrectly includes 2022 data in the GFED4/v2023-03/2023 folder:
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202201.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202202.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202203.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202204.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202205.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202206.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202207.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202208.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202209.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202210.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202211.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202212.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202201.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202202.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202203.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202204.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202205.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202206.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202207.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202208.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202209.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202210.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202211.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202212.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202201.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202202.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202203.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202204.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202205.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202206.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202207.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202208.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202209.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202210.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202211.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202212.nc
Hi @cbutenhoff! I think your first issue is similar to this issue (#438 (comment)). After looking into it, we found this issue results from the
Thanks for reporting the second issue! A new checksum file has been generated, so it should be fixed.
Your name
Chris Butenhoff
Your affiliation
Portland State University
Please provide a clear and concise description of your question or discussion topic.
I recently used bashdatacatalog to download input files for GCHP v14.3.1 for a multi-year simulation. The download took a couple of days, and when it finished I noticed that some (many?) of the file names were corrupted.
For example, HEMCO/OFFLINE_BIOVOC/v2021-12/0.5x0.625/2006/01/biovoc_05.20060103.nc actually is biovoc_05.20060102.nc according to the nc header info; MERRA2.20070101.I3.05x0625.nc4 is actually MERRA2.20070101.A3mstE.05x0625.nc4, and so on.
I believe this happened because I used the parallel option in xargs -P curl to download the files, and some communication/timing error occurred.
I would like to not download all the files again. I notice that bashdatacatalog-list has the -w option to identify files with incorrect checksums. I tried this to identify files that I know are invalid but bashdatacatalog-list was unable to identify those files.
Here is my usage to find the corrupt biovoc files:
> bashdatacatalog-list -aw -p "OFFLINE_BIOVOC/v2021-12/0.5x0.625/2006/01" InputDataCatalogs/**/*.csv
I run it in my ExtData directory, as I did when I downloaded the files. I have also tried running it with the pattern "biovoc", but it didn't return any file names either.
I don't know much about how checksums work. In the case where the file is intact but has the wrong filename, would the checksum still match?
Thanks for any help you can provide.