Importing a dataset via cd.load(name=<DATASET_NAME>) loads genes.tsv in full, so ds.genes can contain more gene entries than are actually represented in the dataset (e.g. in transcriptomics).
In turn, loading multiple datasets potentially creates unnecessary memory overhead, since each one carries the full gene table.
Is this a desired behaviour?
I think so - it's known that most datasets do not have the same genes in them, so using the same genes file as a superset helps ensure that we are doing our best to compare apples to apples. That being said, how significant is the memory overhead? Don't you mean disk space?
My understanding was that any data stored in a cd.Dataset object (e.g. ds.transcriptomics, ds.drugs, etc.) is always in relation to the data in ds.experiments, and does not contain information unrelated to the drug / cell line combinations in there. The fact that ds.genes contains all genes, and not just the ones that pertain to the cell lines in the ds.experiments object, is a bit counterintuitive to me.
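To illustrate what I would have expected, here is a rough sketch of trimming a gene table down to the genes that actually occur in a dataset's transcriptomics table. It uses plain pandas on toy data; the helper `genes_in_use` and the column names (`entrez_id`, `gene_symbol`, `improve_sample_id`) are assumptions for illustration and not necessarily what coderdata uses.

```python
import pandas as pd


def genes_in_use(genes: pd.DataFrame,
                 transcriptomics: pd.DataFrame,
                 key: str = "entrez_id") -> pd.DataFrame:
    """Keep only the rows of `genes` whose key occurs in `transcriptomics`.

    Hypothetical helper: `key` is an assumed shared identifier column.
    """
    used = transcriptomics[key].unique()
    return genes[genes[key].isin(used)].reset_index(drop=True)


# Toy stand-ins for ds.genes and ds.transcriptomics (made-up values).
genes = pd.DataFrame({
    "entrez_id": [1, 2, 3, 4],
    "gene_symbol": ["A", "B", "C", "D"],
})
transcriptomics = pd.DataFrame({
    "entrez_id": [2, 2, 4],
    "improve_sample_id": [10, 11, 10],
    "transcriptomics": [0.5, 1.2, 3.1],
})

subset = genes_in_use(genes, transcriptomics)
print(subset)
```

This only covers one omics table; a real version would presumably need the union of gene identifiers across all omics tables attached to the dataset.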
I would have to profile how much memory (RAM) each ds.genes dataframe actually allocates; I don't have numbers for that yet. I mostly noticed this when loading multiple datasets into memory while writing the IMPROVE wrapper script and doing dataset analysis.
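For the profiling itself, something along these lines should give a first estimate. It is a minimal sketch that assumes ds.genes is an ordinary pandas DataFrame; the helper `frame_mebibytes` is hypothetical and the toy frame below only stands in for the real gene table.

```python
import pandas as pd


def frame_mebibytes(df: pd.DataFrame) -> float:
    """Approximate RAM held by a DataFrame in MiB.

    deep=True also counts the contents of object (string) columns,
    which a shallow estimate would miss.
    """
    return df.memory_usage(index=True, deep=True).sum() / 1024**2


# Toy stand-in for ds.genes; column names are assumptions.
genes = pd.DataFrame({
    "entrez_id": range(50_000),
    "gene_symbol": [f"GENE{i}" for i in range(50_000)],
})

genes_mib = frame_mebibytes(genes)
print(f"genes table: {genes_mib:.2f} MiB")
```

Running the same helper on ds.genes for each loaded dataset, and summing the results, would show whether the overhead is large enough to matter.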