`Dataset.genes` not "dataset specific" #304

ymahlich · 2025-01-21T18:05:41Z

Importing a dataset via cd.load(name=<DATASET_NAME>) loads genes.tsv in full, so Dataset.genes may contain more gene entries than are actually represented in the dataset (e.g. in transcriptomics).

In turn, that means that loading multiple datasets potentially creates unnecessary memory overhead.

Is this a desired behaviour?


sgosline · 2025-03-24T22:21:35Z

I think so - it's known that most datasets do not have the same genes in them, so using the same genes file as a superset helps ensure that we are doing our best to compare apples to apples. That being said, how significant is the memory overhead? Don't you mean disk space?

ymahlich · 2025-03-24T22:28:30Z

My understanding was that any data stored in a cd.Dataset object (e.g. ds.transcriptomics, ds.drugs, etc.) always relates to the data in ds.experiments and does not contain information unrelated to the drug / cell line combinations in there. The fact that ds.genes contains all genes, and not just the ones that pertain to the cell lines in the ds.experiments object, is a bit counterintuitive to me.

I would have to profile how much memory (RAM) is being allocated for each ds.genes dataframe; I'm not sure about that yet. This mostly came to my attention when I was loading multiple datasets into memory while writing the IMPROVE wrapper script and analyzing the datasets.
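For what it's worth, a rough way to get that number could look like the sketch below. It assumes ds.genes is a plain pandas DataFrame; the helper name `frame_megabytes` and the toy table are made up for illustration and are not part of coderdata.

```python
import pandas as pd


def frame_megabytes(df: pd.DataFrame) -> float:
    """Approximate RAM held by a DataFrame, in megabytes.

    deep=True also counts the Python string objects behind object columns,
    which matters for columns such as gene symbols.
    """
    return df.memory_usage(deep=True).sum() / 1e6


# Toy stand-in for ds.genes (hypothetical column names and values).
genes = pd.DataFrame(
    {
        "entrez_id": range(1000),
        "gene_symbol": [f"G{i}" for i in range(1000)],
    }
)

full_mb = frame_megabytes(genes)
# Pretend only the first 100 genes actually occur in the dataset.
small_mb = frame_megabytes(genes[genes["entrez_id"] < 100])

print(f"full: {full_mb:.4f} MB, subset: {small_mb:.4f} MB")
```

Running the same helper on ds.genes for each loaded dataset would show whether the overhead is large enough to matter.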

sgosline · 2025-04-24T18:27:08Z

Fix: when loading a dataset, subset genes to those found in the union of the transcriptomics, proteomics, mutations, and cnv files.
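A minimal sketch of that subsetting, assuming each omics table and the genes table share an `entrez_id` column (an assumption here, as are the helper name `subset_genes` and the toy data; this is not the actual implementation in the package):

```python
import pandas as pd


def subset_genes(genes, omics_tables, key="entrez_id"):
    """Keep only the rows of `genes` whose key occurs in at least one omics table.

    Tables that are None or lack the key column are skipped, so datasets
    without, say, proteomics data can still be handled.
    """
    used = set()
    for table in omics_tables:
        if table is not None and key in table.columns:
            used.update(table[key].dropna().unique())
    return genes[genes[key].isin(used)].reset_index(drop=True)


# Toy stand-ins for the real tables (hypothetical values).
genes = pd.DataFrame(
    {"entrez_id": [1, 2, 3, 4], "gene_symbol": ["G1", "G2", "G3", "G4"]}
)
transcriptomics = pd.DataFrame({"entrez_id": [1, 2], "transcriptomics": [0.5, 1.2]})
mutations = pd.DataFrame({"entrez_id": [2, 3]})

# Proteomics and cnv are absent in this toy dataset, hence None.
subset = subset_genes(genes, [transcriptomics, None, mutations, None])
print(subset["entrez_id"].tolist())  # [1, 2, 3]
```

Gene 4 is dropped because none of the omics tables reference it, while genes that appear in only one table are kept.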

ymahlich added the invalid and question labels Jan 21, 2025

ymahlich added this to CoderData Jan 21, 2025

sgosline added the data update, bug, and package labels and removed the invalid, question, and data update labels Apr 24, 2025

sgosline assigned ymahlich Apr 24, 2025

ymahlich linked a pull request May 7, 2025 that will close this issue

304 datasetgenes not dataset specific #381

Open

jjacobson95 added this to the v2.2 milestone May 8, 2025
