Data Processing Requirements #17
I'd like to get an idea of what, at least initially, the data processing requirements are for cognoma.

Here are some questions I think we'll need to answer. Let's discuss here.

How "big" is the data? Number of rows, columns, and bytes?
Can a classifier search run on a single multicore machine?
Will we need distributed computing for each classifier search?
Where will the cancer data live? How big is it? (this feeds into the first question)
Where will the results live? How big is an individual result set?

Comments
I posted the preliminary datasets created by dhimmel/cancer-data@ffe66ab. Both datasets are sample × gene matrices, with one containing mutation status (for constructing the outcomes) and the other containing gene expression (for constructing the features).

Currently, the datasets are compressed TSVs (flat text files). On my machine, reading the TSV into a pandas DataFrame takes a while, and there's no way to read just a subset. So what would be ideal is a storage format for matrix-like files that supports indexed reading of specific rows and columns in Python. Ideally, for convenience, the data could be stored and read directly from a file -- no need to have a separate application running to serve it.

@awm33, do you have any ideas on what technologies may fit here? I've seen a project use pytables for a similar application. Does anyone have experience with pytables?
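For what it's worth, here's a minimal sketch of what indexed reading could look like with pandas' HDF5 backend, which uses pytables under the hood (requires the tables package; file names and gene symbols are placeholders):

```python
import pandas as pd

# One-time conversion: read the compressed TSV and store it as HDF5.
# format='table' makes the on-disk store queryable, at some write cost.
df = pd.read_csv('expression-matrix.tsv.bz2', sep='\t', index_col=0)
df.to_hdf('expression-matrix.h5', key='expression', format='table')

# Later, read only the columns (genes) a job needs,
# without loading the full matrix into memory.
subset = pd.read_hdf('expression-matrix.h5', 'expression',
                     columns=['TP53', 'KRAS'])
```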
@dhimmel So you have wide tables, but few rows. This is a preliminary set? Will it be expanded beyond the initial 7,706 samples? If it stays relatively small, say under 100k samples, I think files could work.

I assume that any one classifier search would only select 2-25ish genes? Or at least that's what I've heard for the frontend. Then you're really only selecting a very small subset of those matrices.

Both pytables and memmap look like they may fit. Here are the IO options for both:
http://pandas.pydata.org/pandas-docs/stable/io.html
http://docs.scipy.org/doc/numpy/reference/routines.io.html

Are you guys using a mix of pandas and numpy structures? Something to be cognizant of is the performance and multithreading abilities of each structure.

One option would be to prepare the matrices in pytables and/or memmap format, upload them to S3/GCS, and load them onto the workers' ephemeral storage as a cache. This load could happen during instance deploy, application deploy, or via a script on the machine.

Here's another approach. Uncompressed, these matrices are about 1.2 GB, which is small enough to keep cached in a server's memory (assuming 15+ GB of RAM). Maybe you could store an indexed copy of the matrices in memory? You could keep a copy of the matrices in a pandas DataFrame, select the columns (genes) you want, and copy them into a new DataFrame. It looks like DataFrames support indexed querying (http://pandas.pydata.org/pandas-docs/version/0.15.0/indexing.html), but someone more familiar with pandas may know a better structure -- as long as operations against that structure are thread safe. This also assumes we would recycle processes across jobs.
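A rough sketch of the memmap option (dtype is a placeholder, and a real version would need to persist the sample/gene labels separately, since memmap stores only raw values):

```python
import numpy as np

n_samples, n_genes = 7706, 20501  # current dimensions; shrink for a quick test

# One-time conversion: write the dense matrix to a raw binary file.
matrix = np.zeros((n_samples, n_genes), dtype='float32')  # stand-in data
writer = np.memmap('expression.dat', dtype='float32', mode='w+',
                   shape=(n_samples, n_genes))
writer[:] = matrix
writer.flush()
del writer

# Workers open the file read-only and slice out what they need
# without an explicit full read.
mm = np.memmap('expression.dat', dtype='float32', mode='r',
               shape=(n_samples, n_genes))
gene_subset = mm[:, [5, 42, 1337]]  # a few gene columns as a regular ndarray
```

One design note: with the default C (row-major) layout, selecting a few columns still touches every row's pages, so storing the matrix transposed (genes as rows) would keep column reads contiguous.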
I don't expect the sample number to grow substantially. @gwaygenomics, is TCGA still collecting data and expanding?
The classifier will be trained on many genes (often the full set of 20,501). Depending on the algorithm, the resulting model may only actually use a small number of genes. The machine learning portion of the project will need to take large subsets of the expression matrix. As for the results viewer, I'm not sure whether the raw data will be incorporated.
I like this idea. Having the file stored locally provides lots of versatility for code running on the instance.
Not a huge fan of this, as it creates a dependency on having everything operate through a single Python process. Threading in Python, although fun, can be cumbersome and problematic due to the GIL. The subsetting of pandas DataFrames is easy -- it's communicating the copied subset to independent Python instances that will be hard.
Unclear at this point. pandas is more user-friendly and uses numpy as its backend. numpy is less friendly but may have performance benefits.
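For reference, the in-memory subsetting itself is trivial -- a hypothetical sketch with stand-in data; as noted above, the hard part is getting the copy to an independent process:

```python
import numpy as np
import pandas as pd

# Stand-in for the cached sample × gene expression DataFrame.
expression = pd.DataFrame(
    np.random.rand(100, 4),
    columns=['TP53', 'KRAS', 'EGFR', 'PTEN'])

# Select the genes a job needs; .copy() decouples the result from the
# cached frame, so it can be pickled and handed to another process.
subset = expression[['TP53', 'KRAS']].copy()
```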
TCGA has completed sample collection, but more data will be made public eventually. The updated data will include more complete mutation calls, more in-depth clinical information, and a complete gene expression matrix. I think it should end up at around 15,000 total samples. We're working on the cutting edge of cancer genomic data right now with these sets, but cognoma should be flexible enough to handle data updates.
Ok, assuming it's a somewhat linear trend, the dataset will end up being ~3 GB.

@dhimmel Yeah, I've never done multithreading in Python, and the two people I've discussed it with were like "if you need multithreading, don't use Python". I think I'll miss how easy Clojure makes threading/concurrency.

You could still use the caching approach if a process only executes one task at a time but is kept around to pick up the next task, which I'm assuming would need the same matrices.

Cycling processes vs. long-running processes -- I'm not sure what will end up being used. I think a long-running worker daemon is ideal, if memory leaks are not an issue in this code: one or more processes per instance, each independently communicating with the task queue, running one job at a time. Spawning new processes or child processes would clear out the memory of the previous task, but adds the overhead of managing the processes.
If we're using Python 3, https://docs.python.org/3/library/multiprocessing.html looks promising: "effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads".
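A minimal sketch of that pattern (the task function here is a hypothetical stand-in for a classifier search):

```python
from multiprocessing import Pool

def run_classifier_search(gene_subset):
    # Hypothetical stand-in: each search runs in its own subprocess,
    # so the GIL is not contended across jobs.
    return len(gene_subset)

if __name__ == '__main__':
    jobs = [['TP53'], ['KRAS', 'EGFR'], ['PTEN']]
    with Pool(processes=4) as pool:
        results = pool.map(run_classifier_search, jobs)
    print(results)  # [1, 2, 1]
```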
For applications that are job/task based, multiprocessing escapes the Global Interpreter Lock, but it has to pickle (serialize) all data being sent to the subprocess, which will cause a big overhead. If we can find a format that allows for quick/indexed reading from disk, that's preferable I think. If not, we should consider the in-memory approach -- threading would probably be the way to go, but we should test it out. Lots of the machine learning functions execute non-Python code and are thus likely to release the GIL.
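A sketch of the threading idea, assuming the heavy numerical work releases the GIL (scikit-learn is used here as an example; whether the GIL is actually released depends on the estimator):

```python
from concurrent.futures import ThreadPoolExecutor

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Toy stand-in for the real expression matrix and mutation labels.
X, y = make_classification(n_samples=1000, n_features=50)

def fit_model(alpha):
    # fit() spends most of its time in compiled code, which may
    # release the GIL and let the threads genuinely overlap.
    return SGDClassifier(alpha=alpha).fit(X, y)

with ThreadPoolExecutor(max_workers=4) as executor:
    models = list(executor.map(fit_model, [1e-4, 1e-3, 1e-2]))
```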
concurrent.futures looks neat. concurrent.futures and/or multiprocessing could be a way for a daemon process to manage per-task child processes.

Yeah, it doesn't look like there is an easy way to share large amounts of data across threads/processes. The in-memory cache may be a premature optimization anyway. numpy uses libblas and liblapack; if pandas can tap into those through numpy, it should have good multicore performance.

It looks like experimenting with the indexed file formats is the way to go, maybe trying a few formats.
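A rough sketch of the daemon pattern (the task queue here is an in-process stand-in for the real queue):

```python
import queue
from concurrent.futures import ProcessPoolExecutor

def run_task(task_id):
    # Hypothetical per-task work, executed in a child process.
    return {'task': task_id, 'status': 'done'}

def daemon_loop(task_queue):
    # Long-running parent: pull tasks and delegate each to the worker pool.
    # Recycling the pool between tasks would clear the children's memory,
    # at the cost of process-management overhead.
    with ProcessPoolExecutor(max_workers=1) as executor:
        while True:
            task_id = task_queue.get()
            if task_id is None:  # sentinel to shut down
                return
            print(executor.submit(run_task, task_id).result())

if __name__ == '__main__':
    q = queue.Queue()
    for item in [1, 2, 3, None]:
        q.put(item)
    daemon_loop(q)
```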