Update profile_index.csv with integrated profiles (#127)
* Update docs

* Add ALL

* Fix urls

* Update docs

* Clarify profiles

* update URL

* Update README.md

* Update README.md
shntnu authored Oct 4, 2024
1 parent 50cd2ab commit 1c2e38f
Showing 3 changed files with 12 additions and 18 deletions.
24 changes: 8 additions & 16 deletions README.md
@@ -10,18 +10,20 @@ All the data is hosted on the Cell Painting Gallery on the Registry of Open Data

## Details about the data

Currently, this collection comprises 4 datasets:
This collection comprises 4 datasets:

- The principal dataset of 116k chemical and >15k genetic perturbations the partners created in tandem (`cpg0016`), split across 12 data-generating centers. Human U2OS osteosarcoma cells are used.
- 3 pilot datasets created to test: different perturbation conditions (`cpg0000`, including different cell types), staining conditions (`cpg0001`), and microscopes (`cpg0002`).

### What’s available now

- All data [components](https://github.com/broadinstitute/cellpainting-gallery/blob/main/folder_structure.md) of the three pilots.
- All data [components](https://github.com/broadinstitute/cellpainting-gallery/blob/main/documentation/data_structure.md) of the three pilots.
- Most data components (images, raw CellProfiler output, single-cell profiles, aggregated CellProfiler profiles) from 12 sources for the principal dataset. Each source corresponds to a unique data generating center (except `source_7` and `source_13`, which were from the same center).
- All key [metadata](metadata/README.md) files.
- A [notebook](https://github.com/jump-cellpainting/datasets/blob/update-readme/sample_notebook.ipynb) to load and inspect the data currently available in the principal dataset.
- A [tutorial](https://broadinstitute.github.io/2023_12_JUMP_data_only_vignettes/howto/tutorial_basic.html) to load the different subsets of data in the principal dataset, each available as a single dataframe. The URLs to the subsets are [here](https://github.com/jump-cellpainting/datasets/blob/main/profile_index.csv). The corresponding folders for each contain all the data levels (e.g. this [folder](https://cellpainting-gallery.s3.amazonaws.com/index.html#cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/ORF/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/)). Snakemake workflows for producing these assembled profiles are available [here](https://github.com/broadinstitute/jump-profiling-recipe/releases/tag/v0.1.0).
- A [notebook](https://github.com/jump-cellpainting/datasets/blob/main/sample_notebook.ipynb) to load and inspect the data currently available in the principal dataset.
- Different subsets of data in the principal dataset, assembled into single parquet files. The URLs to the subsets are [here](https://github.com/jump-cellpainting/datasets/blob/main/manifests/profile_index.csv). The corresponding folders for each contain all the data levels (e.g. this [folder](https://cellpainting-gallery.s3.amazonaws.com/index.html#cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/ORF/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/)). Snakemake workflows for producing these assembled profiles are available [here](https://github.com/broadinstitute/jump-profiling-recipe/releases/tag/v0.1.0). We recommend working with the `all` or `all_interpretable` subsets, which contain all three data modalities in a single dataframe. Note that cross-modality matching is still poor (ORF-CRISPR, COMPOUND-CRISPR, COMPOUND-ORF), but within-modality matching generally works well.
- A [tutorial](https://broadinstitute.github.io/2023_12_JUMP_data_only_vignettes/howto/1_retrieve_profiles.html) to load these subsets of data.
- Other [tutorials](https://broad.io/jump) to work with `cpg0016`.
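
As a rough illustration of how the subset URLs in `profile_index.csv` might be used, the sketch below maps subset names to parquet URLs using the two rows added in this commit. It assumes each row is laid out as name, URL, ETag and reads columns by position, because the file's header row is not shown in this diff. The helper names `subset_urls` and `load_subset` are hypothetical, not part of the repository.

```python
import csv
import io

# Two rows copied from the manifests/profile_index.csv hunk in this commit.
# Assumption: each row is (subset name, parquet URL, S3 ETag). The real file's
# header row is not shown in the diff, so columns are read by position; a
# header row, if present, would show up as one extra entry in the mapping.
MANIFEST_ROWS = '''"all","https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_0224e0f/ALL/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet","71d03c195e41739af0f1ba64b4f6be73-324"
"all_interpretable","https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_0224e0f/ALL/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect.parquet","023d74cbf007bb6d837724ac8aa78fb4-324"
'''


def subset_urls(manifest_text):
    """Map each subset name to its parquet URL (hypothetical helper)."""
    urls = {}
    for row in csv.reader(io.StringIO(manifest_text)):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        name, url, _etag = row[:3]
        urls[name] = url
    return urls


def load_subset(name, manifest_text=MANIFEST_ROWS):
    """Download one subset as a DataFrame (hypothetical helper).

    Needs pandas with a parquet engine such as pyarrow, plus network access.
    The files are large, so this is defined but not called in this sketch.
    """
    import pandas as pd

    return pd.read_parquet(subset_urls(manifest_text)[name])


if __name__ == "__main__":
    for name, url in subset_urls(MANIFEST_ROWS).items():
        # Print only the file name to keep the output short.
        print(name, "->", url.rsplit("/", 1)[-1])
```

Calling `load_subset("all")` would then fetch the recommended combined subset; the linked tutorial remains the reference for how the project itself loads these files.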

### What’s coming up

@@ -32,19 +34,9 @@ Currently, this collection comprises 4 datasets:

## How to load the data: notebooks and folder structure

See the [sample notebook](sample_notebook.ipynb) to learn more about how to load the data in the principal dataset.
This new resource <https://broad.io/jump> includes vignettes demonstrating how to work with JUMP data.

To get set up to run the notebook, first install the python dependencies and activate the virtual environment

```bash
# install pipenv if you don't have it already https://pipenv.pypa.io/en/latest/#install-pipenv-today
pipenv install
pipenv shell
```

See the typical [folder structure](https://github.com/broadinstitute/cellpainting-gallery/blob/main/folder_structure.md) for datasets in the Cell Painting Gallery.

This new resource <https://broad.io/jump> will include vignettes demonstrating how to work with JUMP data. Currently, it contains one [tutorial](https://broadinstitute.github.io/2023_12_JUMP_data_only_vignettes/howto/tutorial_basic.html) which demonstrates how to load the different subsets of data within `cpg0016`.
See the typical [folder structure](https://github.com/broadinstitute/cellpainting-gallery/blob/main/documentation/data_structure.md) for datasets in the Cell Painting Gallery.

## Citation/license

2 changes: 2 additions & 0 deletions manifests/profile_index.csv
@@ -5,3 +5,5 @@
"orf_interpretable","https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/ORF/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier.parquet","97b0c31d7d678ca2a5e2353df5799fd8-217"
"crispr_interpretable","https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/CRISPR/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected/profiles_wellpos_cc_var_mad_outlier.parquet","90b08b824c06bcf16dfc5e788e74f099-135"
"compound_interpretable","https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/COMPOUND/profiles_var_mad_int_featselect_harmony/profiles_var_mad_int.parquet","b638fa24310db569bc869af92e16f69c-1444"
"all","https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_0224e0f/ALL/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet","71d03c195e41739af0f1ba64b4f6be73-324"
"all_interpretable","https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_0224e0f/ALL/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect.parquet","023d74cbf007bb6d837724ac8aa78fb4-324"
4 changes: 2 additions & 2 deletions manifests/src/README.md
@@ -13,10 +13,10 @@ If necessary, update the associated names for new dataset types.
After updating a URL, the ETag (provided by S3) will no longer match. To update the ETags, run the following command from the home folder:

```bash
bash manifests/src/update_etags.sh | sponge > profile_index.csv
bash manifests/src/update_etags.sh manifests/profile_index.csv | sponge manifests/profile_index.csv
```

Note: If using Nix, all dependencies are already included in the flake at the root folder. Simply run `nix develop` before the above command.
Note: If using Nix, all dependencies are already included in the flake at the root folder. Simply run `nix develop --extra-experimental-features nix-command --extra-experimental-features flakes` before the above command.
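
To make the ETag mismatch concrete, here is a minimal sketch of comparing a recorded ETag with the value S3 currently reports. It is not how `update_etags.sh` works (that script is not shown in this diff), and the function names are hypothetical. It relies on two general S3 behaviours: the `ETag` response header is wrapped in double quotes, and a `-N` suffix gives the number of parts in a multipart upload.

```python
import urllib.request


def normalize_etag(value):
    """Strip whitespace and the double quotes S3 puts around the ETag header."""
    return value.strip().strip('"')


def split_etag(etag):
    """Return (hex digest, part count); part count is None without a '-N' suffix."""
    digest, sep, parts = normalize_etag(etag).partition("-")
    return digest, (int(parts) if sep else None)


def etag_matches(recorded, header_value):
    """True if the manifest's ETag equals the one reported by the server."""
    return normalize_etag(recorded) == normalize_etag(header_value)


def fetch_etag(url):
    """Ask the server for the current ETag with a HEAD request.

    Needs network access, so it is defined but not called in this sketch.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("ETag")


if __name__ == "__main__":
    recorded = "71d03c195e41739af0f1ba64b4f6be73-324"  # the `all` row in this commit
    print(split_etag(recorded))
    print(etag_matches(recorded, '"71d03c195e41739af0f1ba64b4f6be73-324"'))
```

A row whose recorded ETag fails `etag_matches` against `fetch_etag(url)` would be one that needs the update command above.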

## Commit changes

