Dataset.genes not "dataset specific" #304

Open
ymahlich opened this issue Jan 21, 2025 · 3 comments · May be fixed by #381
Assignees
Labels
bug Something isn't working package
Milestone

Comments

@ymahlich
Collaborator

Importing a dataset via cd.load(name=<DATASET_NAME>) loads genes.tsv in its entirety, so ds.genes can contain more gene entries than are actually represented in the dataset (e.g. in transcriptomics).

In turn, that means loading multiple datasets can create unnecessary memory overhead.

Is this a desired behaviour?
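
One way to quantify the mismatch described above is to count the gene rows that no measurement table refers to. The following is only a sketch on toy pandas DataFrames standing in for ds.genes and ds.transcriptomics; the column name entrez_id and the helper count_unreferenced are assumptions for illustration, not confirmed parts of the package.

```python
import pandas as pd

# Toy stand-in for ds.genes: the full gene table shared by every dataset.
# The column name "entrez_id" is an assumption made for this sketch.
genes = pd.DataFrame({
    "entrez_id": [1, 2, 3, 4, 5],
    "gene_symbol": ["G1", "G2", "G3", "G4", "G5"],
})

# Toy stand-in for ds.transcriptomics: only genes 1, 2 and 3 are measured.
transcriptomics = pd.DataFrame({
    "entrez_id": [1, 2, 2, 3],
    "sample_id": [10, 10, 11, 11],
    "value": [0.5, 1.2, 0.9, 3.1],
})


def count_unreferenced(genes, omics, key="entrez_id"):
    """Count gene rows whose key never occurs in the omics table."""
    referenced = omics[key].unique()
    return int((~genes[key].isin(referenced)).sum())


extra = count_unreferenced(genes, transcriptomics)
print(f"{extra} of {len(genes)} gene rows are not referenced by the omics table")
```

On real data the same check would be run against each omics table of the loaded dataset, assuming they share a gene identifier column.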

@ymahlich added the invalid and question labels Jan 21, 2025
@sgosline
Member

I think so - it's known that most datasets do not have the same genes in them, so using the same genes file as a superset helps ensure that we are doing our best to compare apples to apples. That being said, how significant is the memory overhead? Don't you mean disk space?

@ymahlich
Collaborator Author

My understanding was that any data stored in a cd.Dataset object (e.g. ds.transcriptomics, ds.drugs, etc.) always relates to the data in ds.experiments and does not contain information unrelated to the drug / cell line combinations in there. The fact that ds.genes contains all genes, and not just the ones that pertain to the cell lines in the ds.experiments object, is a bit counterintuitive to me.

I would have to profile how much memory (RAM) is allocated for each ds.genes dataframe; I'm not sure about that yet. This mostly came to my attention when I was loading multiple datasets into memory while writing the IMPROVE wrapper script and analysing the datasets.
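
A minimal sketch of how that profiling could be done, assuming ds.genes is a pandas DataFrame (the thread calls it a dataframe but does not name the library). The helper frame_mib and the toy table are hypothetical; DataFrame.memory_usage(deep=True) is the standard pandas call for measuring in-memory size, including string columns.

```python
import pandas as pd


def frame_mib(df):
    """Return the in-memory size of a DataFrame in MiB."""
    # deep=True measures the actual string payload of object columns
    # instead of only counting the pointers to them.
    return df.memory_usage(deep=True).sum() / 1024**2


# Toy gene table; on real data this would be ds.genes of a loaded dataset.
n_genes = 100_000
genes = pd.DataFrame({
    "entrez_id": range(n_genes),
    "gene_symbol": [f"G{i}" for i in range(n_genes)],
})

full = frame_mib(genes)
# Pretend only the first 20% of the genes are used by the dataset.
subset = frame_mib(genes.iloc[:20_000])
print(f"full table: {full:.2f} MiB, subset: {subset:.2f} MiB")
```

Summing frame_mib over the genes table of every loaded dataset would give an estimate of the total RAM spent on duplicated gene entries.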

@sgosline added the data update, bug, and package labels and removed the invalid, question, and data update labels Apr 24, 2025
@sgosline
Member

sgosline commented Apr 24, 2025

fix: when loading a dataset, subset genes to the union of the genes that appear in the transcriptomics, proteomics, mutations, and cnv files.
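
The proposed fix could look roughly like the sketch below. It is not the implementation from the linked pull request; the function subset_genes, the column name entrez_id, and the toy tables are all assumptions for illustration.

```python
import pandas as pd


def subset_genes(genes, omics_tables, key="entrez_id"):
    """Keep only gene rows whose key occurs in at least one omics table.

    Tables that are None or lack the key column are skipped.
    """
    columns = [
        table[key]
        for table in omics_tables
        if table is not None and key in table.columns
    ]
    if not columns:
        # No omics table references any gene: return an empty gene table.
        return genes.iloc[0:0].copy()
    present = pd.concat(columns, ignore_index=True).unique()
    return genes[genes[key].isin(present)].reset_index(drop=True)


# Toy stand-ins for the gene table and the four omics tables.
genes = pd.DataFrame({
    "entrez_id": [1, 2, 3, 4, 5],
    "gene_symbol": ["G1", "G2", "G3", "G4", "G5"],
})
transcriptomics = pd.DataFrame({"entrez_id": [1, 2]})
proteomics = pd.DataFrame({"entrez_id": [2, 3]})
mutations = pd.DataFrame({"entrez_id": [3]})
copy_number = None  # a dataset may lack some data types

subset = subset_genes(
    genes, [transcriptomics, proteomics, mutations, copy_number]
)
print(subset)
```

Taking the union rather than the intersection keeps every gene that is measured in at least one data type, so no omics table ends up pointing at a gene that is missing from the subsetted table.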

@ymahlich ymahlich linked a pull request May 7, 2025 that will close this issue
@jjacobson95 jjacobson95 added this to the v2.2 milestone May 8, 2025

3 participants