Only .parquet files in profiles directory #71

Open
lewismervin1 opened this issue Jun 26, 2023 · 6 comments

Comments

@lewismervin1

lewismervin1 commented Jun 26, 2023

We noticed that the profiles directories do not match the expected workspace folder structure (https://github.com/broadinstitute/cellpainting-gallery/blob/main/folder_structure.md#profiles-folder-structure), i.e.:

└── profiles
    └── 2021_04_26_Batch1
        ├── BR00117035
        │   ├── BR00117035.csv.gz
        │   ├── BR00117035_augmented.csv.gz
        │   ├── BR00117035_normalized.csv.gz
        │   ├── BR00117035_normalized_feature_select_negcon_plate.csv.gz
        │   ├── BR00117035_normalized_feature_select_plate.csv.gz
        │   └── BR00117035_normalized_negcon.csv.gz
        └── BR00117036

Instead, each plate directory contains a single parquet file (similar to the layout expected in workspace_dl: https://github.com/broadinstitute/cellpainting-gallery/blob/main/folder_structure.md#profiles-folder-structure-1). Is this expected, or does folder_structure.md need updating?

Many thanks for any help!
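
For anyone wanting to reproduce the comparison on a local copy of a batch, here is a minimal sketch. It assumes the `profiles/<batch>/<plate>/` layout shown in the tree above; the function names `audit_plate_dir` and `audit_batch` are hypothetical helpers, not part of any cellpainting-gallery tooling.

```python
from pathlib import Path

# File-name suffixes taken from the tree documented in folder_structure.md,
# each appended to the plate name (e.g. BR00117035 + "_normalized.csv.gz").
EXPECTED_SUFFIXES = (
    ".csv.gz",
    "_augmented.csv.gz",
    "_normalized.csv.gz",
    "_normalized_feature_select_negcon_plate.csv.gz",
    "_normalized_feature_select_plate.csv.gz",
    "_normalized_negcon.csv.gz",
)


def audit_plate_dir(plate_dir: Path) -> dict:
    """Report which documented files are missing and which parquet files exist."""
    plate = plate_dir.name
    expected = {plate + suffix for suffix in EXPECTED_SUFFIXES}
    present = {p.name for p in plate_dir.iterdir() if p.is_file()}
    return {
        "missing": sorted(expected - present),
        "parquet": sorted(name for name in present if name.endswith(".parquet")),
    }


def audit_batch(batch_dir: Path) -> dict:
    """Audit every plate directory inside one batch directory."""
    return {
        d.name: audit_plate_dir(d)
        for d in sorted(batch_dir.iterdir())
        if d.is_dir()
    }
```

Calling `audit_batch(Path("profiles/2021_04_26_Batch1"))` on a local download returns, per plate, the documented `.csv.gz` files that are absent and any `.parquet` files found in their place.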

@niranjchandrasekaran
Contributor

Hi Lewis, I am tagging @shntnu who should be able to tell you what our current plans are.

@lewismervin1
Author

Thanks @niranjchandrasekaran and @shntnu. The reason we ask is that we can only access one of the (full-plate) parquet files at the moment, and are missing the *_feature_select_negcon_plate.csv.gz, *_normalized_feature_select_plate.csv.gz, etc. files.

@niranjchandrasekaran
Contributor

Hi Lewis, thanks for the additional context. Generating those additional files will require data alignment and normalization across all the sources, which we are still working on. Once we settle on an approach, we will either have per-plate parquet versions of those files or a single parquet file with all the plates (to be decided).

@lewismervin1
Author

Hi @niranjchandrasekaran, we were wondering whether a decision has been made on how these files should look, and whether this issue should be closed. Many thanks for your help.

@shntnu
Contributor

shntnu commented Nov 8, 2023

@lewismervin1 thanks for checking in. We're still working on a data processing pipeline for getting all the JUMP data aligned.

> we will either have per-plate parquet versions of those files or a single parquet file with all the plates (to be decided).

We will eventually provide per-plate parquet, but the first few versions of the aligned data will be a single PyArrow Dataset.

Once we've finished implementing our new data validation system and schema (in progress here: https://github.com/broadinstitute/cpg), we will distribute them as per-plate parquets (very likely using the same folder structure).
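
As a rough illustration of what "the same folder structure" could mean for per-plate parquet files, the sketch below derives a path from a batch, a plate, and a profile level. The `.parquet` file names and the helper `per_plate_parquet_path` are assumptions made for illustration only; the thread does not settle on a naming scheme.

```python
from pathlib import PurePosixPath

# Profile levels as they appear in the documented .csv.gz file names;
# the empty string stands for the unprocessed per-plate profile.
LEVELS = (
    "",
    "augmented",
    "normalized",
    "normalized_feature_select_negcon_plate",
    "normalized_feature_select_plate",
    "normalized_negcon",
)


def per_plate_parquet_path(batch: str, plate: str, level: str = "") -> PurePosixPath:
    """Build a hypothetical per-plate parquet path under profiles/<batch>/<plate>/."""
    if level not in LEVELS:
        raise ValueError(f"unknown profile level: {level!r}")
    stem = plate if not level else f"{plate}_{level}"
    # Assumption: same directory layout as the .csv.gz files, with a .parquet extension.
    return PurePosixPath("profiles") / batch / plate / f"{stem}.parquet"
```

For example, `per_plate_parquet_path("2021_04_26_Batch1", "BR00117035", "normalized")` gives a path inside the same plate directory as in the documented tree, with only the extension changed.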

@shntnu
Contributor

shntnu commented Feb 21, 2024

> We will eventually provide per-plate parquet, but the first few versions of the aligned data will be a single PyArrow Dataset.

@lewismervin1
This is now available in #99 (the PR is still open, but you can already take a look).
