Report corrupt TIFF files, filter load_data where images are actually missing #76

Open
hanslovsky opened this issue Aug 10, 2023 · 9 comments
Labels
bug Something isn't working cpg0016

Comments

@hanslovsky

hanslovsky commented Aug 10, 2023

I found a few corrupt tiff files in the JUMP production dataset. So far, I have only seen corrupt tiff files in sources 1 and 7 (4 files each). I will report back any additional corrupt tiff files that I may find during my download/conversion.

Here is what I have so far:

s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r03c04f01p01-ch1sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c18f02p01-ch4sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c19f02p01-ch1sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c37f04p01-ch4sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_3/images/CP60/images/BR5876c3__2022-04-29T20_47_20-Measurement 1/Images/r11c22f08p01-ch3sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif

How to confirm that these files are corrupt:

$ urls=(
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r03c04f01p01-ch1sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c18f02p01-ch4sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c19f02p01-ch1sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c37f04p01-ch4sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_3/images/CP60/images/BR5876c3__2022-04-29T20_47_20-Measurement\ 1/Images/r11c22f08p01-ch3sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif
)

$ for url in "${urls[@]}"; do aws s3 --no-sign-request cp "$url" .; done
download: s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r03c04f01p01-ch1sk1fk1fl1.tiff to ./r03c04f01p01-ch1sk1fk1fl1.tiff
download: s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c18f02p01-ch4sk1fk1fl1.tiff to ./r04c18f02p01-ch4sk1fk1fl1.tiff
download: s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c19f02p01-ch1sk1fk1fl1.tiff to ./r04c19f02p01-ch1sk1fk1fl1.tiff
download: s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c37f04p01-ch4sk1fk1fl1.tiff to ./r04c37f04p01-ch4sk1fk1fl1.tiff
download: s3://cellpainting-gallery/cpg0016-jump/source_3/images/CP60/images/BR5876c3__2022-04-29T20_47_20-Measurement 1/Images/r11c22f08p01-ch3sk1fk1fl1.tiff to ./r11c22f08p01-ch3sk1fk1fl1.tiff
download: s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif to ./CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif
download: s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif to ./CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif
download: s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif to ./CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif
download: s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif to ./CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif

$ du -hs *tif *tiff
2.7M    CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif
2.7M    CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif
2.7M    CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif
2.7M    CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif
3.1M    r03c04f01p01-ch1sk1fk1fl1.tiff
2.8M    r04c18f02p01-ch4sk1fk1fl1.tiff
3.1M    r04c19f02p01-ch1sk1fk1fl1.tiff
2.6M    r04c37f04p01-ch4sk1fk1fl1.tiff
0       r11c22f08p01-ch3sk1fk1fl1.tiff

$ identify *tif *tiff
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `r03c04f01p01-ch1sk1fk1fl1.tiff' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `r04c18f02p01-ch4sk1fk1fl1.tiff' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `r04c19f02p01-ch1sk1fk1fl1.tiff' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `r04c37f04p01-ch4sk1fk1fl1.tiff' @ error/tiff.c/TIFFErrors/599.
identify: Cannot read TIFF header. `r11c22f08p01-ch3sk1fk1fl1.tiff' @ error/tiff.c/TIFFErrors/599.

Notes:

  1. These files seem to have the expected file size (except for the one from source 3, which is 0 bytes), but their magic number is invalid.
  2. I updated the list with one corrupt file from source 3.
  3. I finished downloading all other sources except source 11 and have not found any other corrupt files.
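
The identify errors above all concern the first bytes of each file. As a lighter check that does not need ImageMagick, here is a minimal sketch that inspects only the 4-byte header. The function name tiff_status and its output labels are made up for illustration. It does not validate anything past the header, so a file that passes can still be damaged further in.

```bash
#!/usr/bin/env bash
# Classify a file by its first four bytes.
# Classic TIFF starts with "II*\0" (49 49 2a 00) or "MM\0*" (4d 4d 00 2a);
# BigTIFF uses version 43 instead of 42 (49 49 2b 00 / 4d 4d 00 2b).
tiff_status() {
  local file=$1 magic
  # -s is false for files that are missing or zero-length.
  if [ ! -s "$file" ]; then
    echo "empty"
    return
  fi
  # od prints the first 4 bytes as hex; tr strips spaces and newlines.
  magic=$(od -An -tx1 -N4 "$file" | tr -d ' \n')
  case "$magic" in
    49492a00|4d4d002a|49492b00|4d4d002b) echo "ok" ;;
    *) echo "bad-magic:$magic" ;;
  esac
}

# Report every file in the current directory that does not look like a TIFF.
for f in *.tif *.tiff; do
  [ -e "$f" ] || continue   # skip glob patterns that matched nothing
  status=$(tiff_status "$f")
  [ "$status" = "ok" ] || printf '%s\t%s\n' "$status" "$f"
done
```

A header of all zero bytes, which is what identify reports above as "bad magic number 0 (0x0)", would show up here as bad-magic:00000000.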
@niranjchandrasekaran
Contributor

niranjchandrasekaran commented Aug 10, 2023

Thanks @hanslovsky for flagging these. ccing @shntnu to bring this to his attention.

@hanslovsky
Author

I downloaded all sources except source 11 (still working on that) and found only one additional corrupt file in source 3. All other sources (except 11) did not have corrupt files.

@shntnu
Collaborator

shntnu commented Dec 8, 2023

Thank you so much for reporting this @hanslovsky

  • Could you let us know if source_11 had any corrupt files?
  • Do you have any thoughts on how we should report this? I was thinking we could create a new top-level folder https://github.com/jump-cellpainting/datasets/tree/main/errors and a CSV file within for each data component (e.g. images.csv to report missing/corrupt images).

@shntnu shntnu changed the title Some corrupt tiff files Report corrupt TIFF files Dec 8, 2023
@shntnu shntnu added bug Something isn't working cpg0016 labels Dec 8, 2023
@Arkkienkeli

Arkkienkeli commented Feb 14, 2024

I ran identify on all sources and compiled a list of every image that the utility reports as corrupted.
If the Channel, Well, or Site value is missing for an entry, the image is not in the metadata. For example, all corrupted images from source_10 in this list are absent from the metadata (this is probably what #61 describes).

@shntnu @hanslovsky

Corrupted_images.csv

@hanslovsky
Author

@Arkkienkeli your findings are consistent with mine (I did not report any corrupted images that are not in the metadata), with the exception of the one image of source 11. I did not report anything for source 11 in this issue because I was still working on it at that time. I will double-check my records to see if I have any notes on corrupted files for source 11.

I know that I reported missing images for source 11 in #78 but I don't know if that includes any corrupted images.

cc @shntnu

@hanslovsky
Author

@Arkkienkeli I just double-checked the images I reported missing in source 11 (source_11-404.txt) and found the image you reported as corrupted in there as well. I can now say conclusively that our reports are consistent.

Please note that I also found some images in source 11 that were simply not present, in plates EC000038 and EC000066.

@shntnu
Collaborator

shntnu commented Mar 1, 2024

I will drop in some notes for now

$ cat ~/Downloads/source_11-404.txt | cut -d"/" -f6 | sort | uniq -c
6064 EC000038__2021-06-04T17_37_00-Measurement1
   2 EC000066__2021-06-06T12_36_15-Measurement1
   1 EC000070__2021-06-09T23_50_19-Measurement1
   1 failed-paths

$ csvcut -c Source,Batch,Plate ~/Downloads/Corrupted_images.csv | sort | uniq | wc -l
      19

$ csvcut -c Source,Batch ~/Downloads/Corrupted_images.csv | sort | uniq | wc -l
      15

$ csvcut -c Source ~/Downloads/Corrupted_images.csv | sort | uniq | wc -l
      6

$ csvcut -c Source ~/Downloads/Corrupted_images.csv | sort | uniq -c
   5 1
  23 10
   1 11
   1 3
   4 7
   1 Source
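
Note that the header row is counted in these outputs (the "1 Source" line), so each wc -l total is one higher than the number of distinct data values: 18 plates, 14 batches, and 5 sources. Below is a small sketch of the same per-column count with the header skipped. count_column is a hypothetical helper, and it assumes a simple CSV with no quoted fields containing commas. For anything messier, keeping csvcut and piping its output through tail -n +2 is the safer route.

```bash
#!/usr/bin/env bash
# Print "<count> <value>" for one column of a simple CSV, skipping the header.
# $1: CSV file, $2: 1-based column number
# Assumption: fields contain no quoted commas, so cut -d, splits correctly.
count_column() {
  local file=$1 col=$2
  tail -n +2 "$file" |        # drop the header row
    cut -d, -f"$col" |        # keep only the requested column
    sort | uniq -c |          # count each distinct value
    awk '{print $1, $2}'      # normalize the padding that uniq -c adds
}
```

Called as count_column ~/Downloads/Corrupted_images.csv 1, this would list the per-source counts without the header line, provided Source is the first column of that file (an assumption; the column order is not shown in this thread).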

Internal notes

  1. EC000038 on batch2. This plate has the metadata (xml file) and a significant number of images missing. I checked with XXX and she says they are also missing on the microscopy server. Should this be skipped?
  2. Order-of-magnitude, how many images are missing: 10, 100, 1000, or 10000? I assume with no Index.idx.xml file you weren't able to run pe2loaddata, but it's pretty trivial to just make the load_data and load_data_with_illum CSVs from another plate in the batch with a find-and-replace on the plate name (and removing missing files from the load_data CSV per above). I think as long as you have at least, say, half the plate still present, there is no reason to throw out this data.
  3. EC000038 on batch2. I checked it and found that we had more than 2000 usable image sets. I copied over the xml file from another plate and processed it.
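
The find-and-replace step described in note 2 could look roughly like the sketch below. This is not the procedure that was actually used. clone_load_data and the placeholder names in the usage comment are hypothetical, and the sketch assumes plate names contain no characters that are special to sed (the names in this thread are plain alphanumerics).

```bash
#!/usr/bin/env bash
# Derive a load_data CSV for one plate from a sibling plate's CSV by
# replacing the plate name everywhere it occurs, and print it to stdout.
# $1: template CSV from another plate in the same batch
# $2: plate name used in the template
# $3: plate name to substitute
clone_load_data() {
  local template_csv=$1 old_plate=$2 new_plate=$3
  sed "s/${old_plate}/${new_plate}/g" "$template_csv"
}

# Usage (placeholder names, not real paths):
#   clone_load_data TEMPLATE_PLATE/load_data.csv TEMPLATE_PLATE EC000038 > load_data.csv
```

If the paths in the CSV embed the full measurement folder (for example EC000038__2021-06-04T17_37_00-Measurement1, where the timestamp differs per plate), that whole folder name would need to be replaced as well, not just the plate name.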

@shntnu
Collaborator

shntnu commented Mar 1, 2024

Alright, overall

  • EC000038: the files are missing here because we created the load_data file by hand (see internal notes in the previous comment). We should edit the load_data to filter out the sites that have a missing image.
  • EC000066 and EC000070: it turns out these two plates are also among those where we created the load_data file by hand, so we should do the same here.
  • Here's the full list of source_11 plates missing load_data files: EC000038, EC000066, EC000070, EC000156, EC000157. We should expect similar issues with all of these.

@hanslovsky @Arkkienkeli -- thank you so much for reporting this! You can proceed by simply ignoring these images. Our task is to update the load_data files to remove the discrepancy.
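
A minimal sketch of that filtering step, under several assumptions: the load_data CSV has one row per site with one file-name column per channel, it covers a single plate (image file names repeat across plates, so matching on file name alone is only safe per plate), and no field contains a quoted comma. filter_load_data and the column names in the usage comment are illustrative, not taken from the actual files.

```bash
#!/usr/bin/env bash
# Drop every load_data row that references at least one missing image.
# $1: text file with one missing image file name per line
# $2: load_data CSV for a single plate
filter_load_data() {
  awk -F, '
    # First file: remember each missing file name.
    # (FILENAME == ARGV[1] stays correct even if the list is empty.)
    FILENAME == ARGV[1] { missing[$0] = 1; next }
    # Second file: always keep the header row.
    FNR == 1 { print; next }
    # Skip the row if any field is a missing file name.
    {
      for (i = 1; i <= NF; i++) if ($i in missing) next
      print
    }
  ' "$1" "$2"
}

# Usage (placeholder file names):
#   filter_load_data missing_names.txt load_data.csv > load_data_filtered.csv
```

Because the whole row is dropped, a site disappears as soon as one of its channels is missing. The same function would work for the corrupted files, given a list of their names.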

@shntnu
Collaborator

shntnu commented Mar 1, 2024

> I did run identify on all sources and created a list of all corrupted images according to this utility. If a value from Channel \ Well \ Site is missing, it means that the image is not in the metadata, for example, all corrupted images in this list from source_10 are actually not in the metadata (probably it is described here #61).
>
> @shntnu @hanslovsky
>
> Corrupted_images.csv

Regarding the corrupted files, we should likely follow the same strategy and drop them from load_data. @Arkkienkeli -- You can proceed by ignoring these images because we no longer have access to the originals (thankfully that's only 34 images out of the gazillion).

@shntnu shntnu changed the title Report corrupt TIFF files Report corrupt TIFF files, filter load_data where images are actually missing Mar 1, 2024