Only .parquet files in profiles directory #71
Comments
Hi Lewis, I am tagging @shntnu, who should be able to tell you what our current plans are.
Thanks @niranjchandrasekaran and @shntnu. We ask because we can only access one of the (full plate) parquet files at the moment, and are missing the *_feature_select_negcon_plate.csv.gz, *_normalized_feature_select_plate.csv.gz etc. files.
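For anyone wanting to check which of these files a given plate directory actually contains, here is a minimal sketch using only the Python standard library. The function name `summarize_plate_dir` and the returned keys are hypothetical, and the suffix list contains only the two patterns quoted in the comment above (the "etc." files are not covered), so treat it as an illustration rather than a description of the gallery's layout.

```python
from pathlib import Path

# Suffixes quoted in the comment above; this list is an assumption
# and is not exhaustive (the "etc." files are not included).
EXPECTED_SUFFIXES = (
    "_feature_select_negcon_plate.csv.gz",
    "_normalized_feature_select_plate.csv.gz",
)


def summarize_plate_dir(plate_dir):
    """Report which profile files exist in one plate directory.

    Returns a dict with the parquet file names that were found and
    the expected suffixes for which no file was found.
    """
    names = sorted(p.name for p in Path(plate_dir).iterdir() if p.is_file())
    return {
        "parquet": [n for n in names if n.endswith(".parquet")],
        "missing_suffixes": [
            s for s in EXPECTED_SUFFIXES
            if not any(n.endswith(s) for n in names)
        ],
    }
```

Calling `summarize_plate_dir` on a locally downloaded plate directory would list any parquet files present and flag which of the two suffixes have no matching file.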
Hi Lewis, thanks for the additional context. Generating those additional files will require data alignment and normalization across all the sources, which we are still working on. Once we settle on the approach we will take, we will provide either per-plate parquet versions of those files or a single parquet file with all the plates (to be decided).
Hi @niranjchandrasekaran, we were wondering whether there is a decision on how these files should look and whether this issue should be closed. Many thanks for your help.
@lewismervin1 thanks for checking in. We're still working on a data processing pipeline for getting all the JUMP data aligned.
We will eventually provide per-plate parquet files, but the first few versions of the aligned data will be a single PyArrow Dataset. Once we've finished implementing our new data validation system + schema (in progress here: https://github.com/broadinstitute/cpg), we will distribute them as per-plate parquets (very likely using the same folder structure).
@lewismervin1
We noticed that the profiles in the expected workspace folder structure (https://github.com/broadinstitute/cellpainting-gallery/blob/main/folder_structure.md#profiles-folder-structure) are actually directories of single parquet files (similar to the ones expected in workspace_dl, https://github.com/broadinstitute/cellpainting-gallery/blob/main/folder_structure.md#profiles-folder-structure-1). Is this expected, or does folder_structure.md need updating?
Many thanks for any help!