Provide information about missing files #78

Open
hanslovsky opened this issue Aug 11, 2023 · 2 comments

hanslovsky commented Aug 11, 2023

I am trying to download all images for source_11 that are listed in the respective load_data_with_illum.parquet files. I found that for these parquet files,

['cpg0016-jump/source_11/workspace/load_data_csv/Batch2/EC000038/load_data_with_illum.parquet',
 'cpg0016-jump/source_11/workspace/load_data_csv/Batch2/EC000066/load_data_with_illum.parquet',
 'cpg0016-jump/source_11/workspace/load_data_csv/Batch2/EC000070/load_data_with_illum.parquet']

there are 1216 fields/sites with at least one missing image, for a total of 6068 missing images. I attached the full list of missing paths as a CSV. Locally the file is named source_11-404.txt; I had to change the extension from txt to csv to attach it to this comment:

[source_11-404.csv](https://github.com/jump-cellpainting/datasets/files/12325106/source_11-404.csv)

This is what the CSV looks like:

$ head source_11-404.txt
failed-paths
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch2sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch4sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch3sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch5sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f04p01-ch1sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f04p01-ch2sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f04p01-ch4sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f04p01-ch3sk1fk1fl1.tiff
cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f04p01-ch5sk1fk1fl1.tiff

For example, aws s3 ls on the first file in the above snippet exits with code 1, i.e. the key does not exist:

$ aws s3 --no-sign-request ls s3://cellpainting-gallery/cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch2sk1fk1fl1.tiff

$ echo $?
1

When I use the same key but change the channel from ch2 to ch1, that file exists:

$ aws s3 --no-sign-request ls s3://cellpainting-gallery/cpg0016-jump/source_11/images/Batch2/images/EC000038__2021-06-04T17_37_00-Measurement1/Images/r11c10f03p01-ch1sk1fk1fl1.tiff
2022-12-21 21:35:43    2058750 r11c10f03p01-ch1sk1fk1fl1.tiff

$ echo $?
0

I will double-check that I inferred the correct file names from the parquet files. The existence of ch1 in this example suggests that I inferred the correct names, at least for that field/site.
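The same check (swap the channel token, then test whether the key exists) could be scripted roughly as follows. This is only a sketch under assumptions: the `BUCKET_URL` endpoint and the helper names `with_channel` and `key_exists` are hypothetical and not part of this report, and it assumes that an anonymous HEAD request on a missing key returns HTTP 404, which matches the behaviour of `aws s3 --no-sign-request ls` above but has not been verified here.

```python
import re
import urllib.error
import urllib.parse
import urllib.request

# Assumption: the public bucket is reachable anonymously at this
# virtual-hosted-style endpoint.
BUCKET_URL = "https://cellpainting-gallery.s3.amazonaws.com"


def with_channel(key: str, channel: int) -> str:
    """Return the key with its -chN token replaced by the given channel."""
    return re.sub(r"-ch\d+(?=sk\d+fk\d+fl\d+\.tiff$)", f"-ch{channel}", key)


def key_exists(key: str, opener=urllib.request.urlopen) -> bool:
    """Send an anonymous HEAD request; a 404 is read as 'key does not exist'.

    Any other HTTP error is re-raised, because it does not tell us
    whether the object is present.
    """
    url = f"{BUCKET_URL}/{urllib.parse.quote(key)}"
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request) as response:
            return response.status == 200
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise
```

For a path from the CSV, calling `key_exists(with_channel(path, 1))` would then mirror the manual ch2 to ch1 comparison. The `opener` parameter exists only so the function can be exercised without network access.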

To find the number of missing fields/sites, I removed the channel sub-string:

$ cat notebooks/data/source_11-404.txt | sed 's/-ch[0-9]sk1fk1fl1.tiff//' | sort | uniq -c | wc -l
1217

Subtract 1 for the CSV header, which leaves 1216 fields/sites.
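A rough Python equivalent of the pipeline above also records which channels are missing for each field/site, which makes it easy to spot sites where only some channels are absent. This is a sketch, not a verified script: the function name `missing_channels_by_site` is hypothetical, and it assumes every data line ends in a `-chNsk1fk1fl1.tiff` token as in the excerpt. Lines that do not match, such as the `failed-paths` header, are skipped, so no manual subtraction is needed.

```python
import re
from collections import defaultdict

# Channel token at the end of each path, e.g. "-ch2sk1fk1fl1.tiff".
CHANNEL_TOKEN = re.compile(r"-ch(\d+)sk1fk1fl1\.tiff$")


def missing_channels_by_site(lines):
    """Map each field/site prefix to the set of channel numbers listed for it.

    Lines without a channel token (the CSV header, blank lines) are ignored.
    """
    sites = defaultdict(set)
    for line in lines:
        path = line.strip()
        match = CHANNEL_TOKEN.search(path)
        if match is None:
            continue
        sites[path[:match.start()]].add(int(match.group(1)))
    return dict(sites)


# Usage sketch (path taken from the command above):
# with open("notebooks/data/source_11-404.txt") as handle:
#     sites = missing_channels_by_site(handle)
# print(len(sites))
```

If the file matches the excerpt, `len(sites)` should agree with the pipeline's count minus the header line.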

@hanslovsky
Author

Note: I originally stated that I found 1216 wells with at least one missing image, but this is incorrect. I found 1216 fields/sites with at least one missing image.

@hanslovsky changed the title from "Images missing for 1216 wells in source_11" to "Images missing for 1216 fields/sites in source_11" on Aug 14, 2023
@niranjchandrasekaran
Contributor

Thank you @hanslovsky for the detailed report!

These images in source_11 are indeed missing (internal notes: https://github.com/jump-cellpainting/aws/issues/81#issuecomment-1266405250). I will keep this issue open so that we can think of ways to inform the users of the dataset that these files are missing.

@shntnu changed the title from "Images missing for 1216 fields/sites in source_11" to "Provide information about missing files" on Dec 8, 2023