This repository was archived by the owner on Oct 28, 2022. It is now read-only.

RNA expression data structure is inefficient #832

Open
@kozbo


From email chain with Rob:

Here’s a few stories based on [Treehouse’s](https://treehousegenomics.soe.ucsc.edu/) current Jupyter analysis [notebooks](https://github.com/UCSC-Treehouse/jupyterhub):

1) I have a mapping {feature -> threshold value} and a sample. I'd like to get a vector of (feature, expression value) for all features in that sample where expression value > threshold[feature].

2) I'd like to get the features in a sample sorted by expression value.

3) I'd like to generate a set of samples based on phenotype information (not just "disease = ALAL" but more complex queries, able to include or exclude samples by identifier)

4) I have a set of sample identifiers. For each feature in samples in that set, I'd like to get statistics on its expression values within that set (median, quartiles, interquartile range).
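
To make the four stories concrete, here is a minimal client-side sketch of what each one asks for. It is not the GA4GH server API or the Treehouse notebook code. The in-memory layouts (`expression` as {sample id -> {feature -> value}}, `phenotypes` as {sample id -> attributes}), the gene names, the values, and all function names are assumptions made up for illustration. It also assumes Python 3.8+ for `statistics.quantiles`.

```python
from statistics import quantiles

# Hypothetical data: {sample_id: {feature: expression_value}}
expression = {
    "S1": {"TP53": 12.0, "MYCN": 3.5, "ALK": 8.0},
    "S2": {"TP53": 10.0, "MYCN": 7.5, "ALK": 2.0},
    "S3": {"TP53": 14.0, "MYCN": 1.5, "ALK": 6.0},
    "S4": {"TP53": 8.0, "MYCN": 5.5, "ALK": 4.0},
}

# Hypothetical phenotype records: {sample_id: {attribute: value}}
phenotypes = {
    "S1": {"disease": "ALL", "age": 4},
    "S2": {"disease": "ALL", "age": 7},
    "S3": {"disease": "AML", "age": 12},
    "S4": {"disease": "AML", "age": 3},
}


def features_above_threshold(sample, thresholds):
    """Story 1: (feature, value) pairs whose value exceeds that feature's threshold.

    Features without a threshold are skipped.
    """
    return [(f, v) for f, v in sample.items()
            if f in thresholds and v > thresholds[f]]


def features_sorted_by_expression(sample, descending=True):
    """Story 2: (feature, value) pairs ordered by expression value."""
    return sorted(sample.items(), key=lambda fv: fv[1], reverse=descending)


def select_samples(phenotypes, predicate, include=(), exclude=()):
    """Story 3: ids matching an arbitrary predicate, plus forced includes, minus excludes."""
    selected = {sid for sid, attrs in phenotypes.items() if predicate(attrs)}
    return sorted((selected | set(include)) - set(exclude))


def cohort_statistics(expression, sample_ids):
    """Story 4: per-feature median, quartiles and IQR within a set of samples."""
    values = {}
    for sid in sample_ids:
        for feature, value in expression[sid].items():
            values.setdefault(feature, []).append(value)

    stats = {}
    for feature, vals in values.items():
        if len(vals) < 2:
            continue  # quantiles() needs at least two data points
        q1, q2, q3 = quantiles(vals, n=4, method="inclusive")
        stats[feature] = {"q1": q1, "median": q2, "q3": q3, "iqr": q3 - q1}
    return stats


above = features_above_threshold(expression["S1"], {"TP53": 11.0, "MYCN": 4.0})
ranked = features_sorted_by_expression(expression["S1"])
cohort = select_samples(
    phenotypes,
    lambda p: p["disease"] == "ALL" and p["age"] < 10,
    include=["S4"],
    exclude=["S2"],
)
stats = cohort_statistics(expression, ["S1", "S2", "S3", "S4"])
```

Every function here assumes the whole expression matrix is already on the client, which is exactly the cost a server-side query would avoid.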


For #1 I tried to code up something similar that would work on 1kgenomes:

https://jupyter.medbook.io/user/rcurrie/notebooks/readonly/rcurrie/Treehouse%20GA4GH%20User%20Stories.ipynb

# All males from the GBR population
biosample_ids = [b.id for b in c.search_biosamples(dataset_id=dataset.id)
                 if b.info["Population"].values
                 and "GBR" in [v.string_value for v in b.info["Population"].values]
                 and "male" in c.get_individual(b.individual_id).sex.term]
print "Samples found:", len(biosample_ids)

Takes about 8 seconds to run. The same type of query against the RNASeq database, but looking for “Thyroid” and “Female”, took over 3 minutes.

But currently, as far as I can tell, it is impossible to get the expression levels for an individual or biosample due to curation, so it’s kind of a dead end. My gut feeling at this point is that we can’t use the server for Treehouse because it lacks server-side queries: for most of the stories we’d end up walking through every individual, biosample, and expression level on the client to get what we need.
