Use of bashdatacatalog to find invalid files #428

Open
cbutenhoff opened this issue Jul 15, 2024 · 5 comments
Labels
category: Question Further information is requested topic: Input Data Related to input data

Comments

@cbutenhoff

Your name

Chris Butenhoff

Your affiliation

Portland State University

Please provide a clear and concise description of your question or discussion topic.

I recently used bashdatacatalog to download input files for GCHP v14.3.1 for a multi-year simulation. The download took a couple of days, and when it finished I noticed that some (many?) of the file names were corrupted.

For example, according to the netCDF header info, HEMCO/OFFLINE_BIOVOC/v2021-12/0.5x0.625/2006/01/biovoc_05.20060103.nc is actually biovoc_05.20060102.nc, MERRA2.20070101.I3.05x0625.nc4 is actually MERRA2.20070101.A3mstE.05x0625.nc4, and so on.

I believe this happened because I downloaded the files in parallel with xargs -P curl and some communication or timing error occurred.

I would like to avoid downloading all the files again. I notice that bashdatacatalog-list has the -w option to identify files with incorrect checksums. I tried it on files that I know are invalid, but bashdatacatalog-list did not flag them.

Here is my usage to find the corrupt biovoc files:

> bashdatacatalog-list -aw -p "OFFLINE_BIOVOC/v2021-12/0.5x0.625/2006/01" InputDataCatalogs/**/*.csv

I ran it in my ExtData directory, as I did when I downloaded the files. I also tried the pattern "biovoc", but that didn't return any file names either.

I don't know much about how checksums work. If a file is intact but has the wrong file name, would the checksum still match?

Thanks for any help you can provide.

@cbutenhoff cbutenhoff added the category: Question Further information is requested label Jul 15, 2024
@yantosca yantosca added the topic: Input Data Related to input data label Jul 16, 2024
@cbutenhoff
Author

As a follow-up, I was able to write a Python script that renamed the MERRA2 files based on the real file name listed under global attributes in the netCDF metadata. Unfortunately, the metadata in the HEMCO netCDF files does not provide the file name in a consistent format so renaming these files will be more difficult.
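For anyone attempting a similar rename, one pitfall is worth sketching: when names have been swapped among files, renaming in place can overwrite a file that has not been moved yet. The example below is a minimal sketch, not the script mentioned above. The file names, the `map.txt` mapping file, and the staging directory are all made up for illustration; in practice the "true" names would come from the netCDF metadata.

```shell
# Made-up example: two files whose names were swapped during download.
work=$(mktemp -d)
mkdir "$work/data" "$work/staged"
printf 'content of day 1\n' > "$work/data/day2.nc"   # misnamed
printf 'content of day 2\n' > "$work/data/day1.nc"   # misnamed

# Hypothetical mapping "current-name true-name", e.g. recovered from metadata.
cat > "$work/map.txt" <<'EOF'
day2.nc day1.nc
day1.nc day2.nc
EOF

# Pass 1: move every file into a staging directory under its true name.
# Renaming in place would overwrite day1.nc before it has been moved.
while read -r current true_name; do
    mv "$work/data/$current" "$work/staged/$true_name"
done < "$work/map.txt"

# Pass 2: move the correctly named files back into the data directory.
mv "$work/staged/"* "$work/data/"

day1_content=$(cat "$work/data/day1.nc")
echo "$day1_content"
rm -rf "$work"
```

Staging in a separate directory keeps the two passes simple. It assumes the true names are unique, so it is worth checking the mapping for duplicates first.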

@yidant
Contributor

yidant commented Jul 16, 2024

Thanks for pointing this out, @cbutenhoff. I haven't encountered this issue with xargs -P curl before. Could you let us know how many streams you used to download the data?

We use MD5 checksums, which only verify the content of the file, not the file name.
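To illustrate the point about content versus name, here is a minimal sketch using `md5sum` from GNU coreutils (an assumption; macOS ships `md5` instead). The file names are invented. It only shows that the hash is computed from the bytes; it does not model which expected value bashdatacatalog compares against.

```shell
# An MD5 checksum depends only on a file's bytes, not on its name.
# The file names below are made up for illustration.
tmpdir=$(mktemp -d)
printf 'same bytes\n' > "$tmpdir/right_name.nc"
cp "$tmpdir/right_name.nc" "$tmpdir/wrong_name.nc"

# md5sum prints "<hash>  <path>"; keep only the hash field.
sum_right=$(md5sum "$tmpdir/right_name.nc" | cut -d ' ' -f 1)
sum_wrong=$(md5sum "$tmpdir/wrong_name.nc" | cut -d ' ' -f 1)

if [ "$sum_right" = "$sum_wrong" ]; then
    echo "identical checksums: the copy differs in name only"
fi
rm -rf "$tmpdir"
```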

Unfortunately, the metadata format is different across collections as they are from different sources. Perhaps you can try extracting the key information with regular expressions.

@cbutenhoff
Author

Thanks @yidant. At different times I used 4 and 8 streams. I'm not positive the parallel download caused the problem, but files I downloaded using 'wget' seem fine.

In some (most?) of the HEMCO nc files, there is a 'history' attribute that contains the actual file name, though it's not consistently located. I'm trying to do some checks based on this.
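A check along these lines could be sketched with a regular expression, as suggested earlier in the thread. The `history` text below is a hypothetical example, not copied from a real file; real attributes vary between collections, so the pattern would need adjusting for each one. In practice the text would come from the file header (for example via `ncdump -h`).

```shell
# Hypothetical 'history' attribute text; real HEMCO files vary in format.
history_attr='Mon Jan  3 12:00:00 2022: ncks -O in.nc biovoc_05.20060102.nc'

# Pull out the first token that looks like biovoc_05.YYYYMMDD.nc.
true_name=$(printf '%s\n' "$history_attr" \
    | grep -oE 'biovoc_05\.[0-9]{8}\.nc' \
    | head -n 1)

echo "$true_name"
```

Comparing `true_name` with the name on disk would then flag files that need renaming.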

In the end I'll probably spend more time trying to rename corrupt file names than I would have spent just redownloading all the input data :).

@cbutenhoff
Author

cbutenhoff commented Jul 17, 2024

This comment may be better placed as its own issue, but I noticed that the data catalog for GCHP v14.3.0 includes 2022 data in the GFED4/v2023-03/2023 folder, which I believe is incorrect:

./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202201.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202202.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202203.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202204.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202205.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202206.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202207.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202208.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202209.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202210.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202211.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202212.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202201.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202202.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202203.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202204.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202205.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202206.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202207.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202208.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202209.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202210.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202211.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202212.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202201.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202202.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202203.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202204.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202205.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202206.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202207.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202208.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202209.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202210.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202211.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202212.nc
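
One way to spot this kind of mismatch is to compare the year in the folder name with the year in the file name. The sketch below assumes the layout seen in the list above (`.../<YYYY>/<name>.<YYYYMM>.nc`). The first path is taken from that list; the second is hypothetical and included only as a matching case.

```shell
# First path is from the list above; the second is a hypothetical matching case.
paths='./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202201.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202301.nc'

mismatched=$(printf '%s\n' "$paths" | while read -r p; do
    # Year taken from the parent folder, e.g. 2023.
    dir_year=$(basename "$(dirname "$p")")
    # Year taken from the trailing YYYYMM stamp in the file name.
    file_year=$(basename "$p" .nc | grep -oE '[0-9]{6}$' | cut -c 1-4)
    if [ "$dir_year" != "$file_year" ]; then
        echo "$p"
    fi
done)

echo "$mismatched"
```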

@yidant
Contributor

yidant commented Sep 17, 2024

Hi @cbutenhoff!

I think your first issue is similar to another one (#438 (comment)). After looking into it, we found that the problem comes from the xargs curl command failing to process the multi-line download list generated by bashdatacatalog. In addition to -P for parallel streams, you can use -L 1 so that only one line of input is passed to curl at a time, for example xargs -L 1 -P 4 curl.
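
The effect of -L 1 can be seen without downloading anything by substituting `echo curl` for curl. This is a rough sketch: the list lines are hypothetical and may not match the exact format bashdatacatalog emits, and -P is left out so the output order stays predictable.

```shell
# Hypothetical download list: each line holds the curl arguments for one file.
list='-o a.nc https://example.org/a.nc
-o b.nc https://example.org/b.nc
-o c.nc https://example.org/c.nc'

# 'echo curl' stands in for curl so nothing is downloaded.
# Without -L, xargs packs arguments from several lines into one call.
packed=$(printf '%s\n' "$list" | xargs echo curl)

# With -L 1, each input line becomes exactly one invocation.
per_line=$(printf '%s\n' "$list" | xargs -L 1 echo curl)

packed_calls=$(printf '%s\n' "$packed" | wc -l | tr -d ' ')
per_line_calls=$(printf '%s\n' "$per_line" | wc -l | tr -d ' ')
echo "without -L 1: $packed_calls call(s); with -L 1: $per_line_calls call(s)"
```

With one invocation per line, each output path stays paired with its own URL.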

Thanks for reporting the second issue! A new checksum file has been generated, so it should be fixed.
