sparse from_pandas ambiguity #1170
Comments
Hi @royassis, I would drop the rows containing the 0 values:
Example run:
Let me know if this answers your question. Thanks. |
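A minimal sketch of the row-dropping idea for a single-column dataframe (not the original example run). The column name `value`, the sample data, and the helper names `drop_zero_rows` and `write_sparse` are hypothetical. The TileDB import is kept inside `write_sparse` so the filtering step can be tried without TileDB installed.

```python
import pandas as pd


def drop_zero_rows(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Keep only the rows whose value in `column` is non-zero."""
    return df[df[column] != 0]


def write_sparse(uri: str, df: pd.DataFrame) -> None:
    """Write `df` as a sparse TileDB array; the dataframe index becomes the dimension."""
    import tiledb  # imported lazily; only needed when actually writing

    tiledb.from_pandas(uri, df, sparse=True)


# Hypothetical single-column frame that is mostly zeros.
df = pd.DataFrame({"value": [0, 3, 0, 0, 7]})
nonzero = drop_zero_rows(df, "value")
print(nonzero)
```

Calling `write_sparse("my_array", nonzero)` (the URI is a placeholder) would then store only the non-zero cells, so the zeros do not need to be converted to `np.nan` first.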
Hey :) Actually my df has multiple columns. |
I think I understand now.
However, since all your coordinates ( If all the columns in your dataframe are the same datatype, I'm curious if you are looking for something more like this? It does not use
|
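For a wide dataframe where every column has the same datatype, one possible way to get sparse coordinates is to reshape it into (row, column, value) records and drop the zeros. This is only a sketch under assumptions, not necessarily the approach suggested above: the names `barcode`, `gene`, and `value`, the sample data, and both helper functions are hypothetical.

```python
import pandas as pd


def wide_to_coo(df: pd.DataFrame) -> pd.DataFrame:
    """Reshape a wide frame into (barcode, gene, value) records, dropping zeros."""
    long = (
        df.rename_axis("barcode")  # name the index so it becomes a column
        .reset_index()
        .melt(id_vars="barcode", var_name="gene", value_name="value")
    )
    return long[long["value"] != 0].reset_index(drop=True)


def write_coo(uri: str, coo: pd.DataFrame) -> None:
    """Write the records as a 2D sparse array with barcode and gene as dimensions."""
    import tiledb  # imported lazily; only needed when actually writing

    tiledb.from_pandas(uri, coo.set_index(["barcode", "gene"]), sparse=True)


# Hypothetical wide frame: barcodes as the index, genes as the columns.
wide = pd.DataFrame(
    {"GeneA": [0, 5, 0], "GeneB": [2, 0, 0]},
    index=["c1", "c2", "c3"],
)
coo = wide_to_coo(wide)
print(coo)
```

Only the two non-zero cells remain in `coo`, so the array stores one record per non-zero value instead of one attribute per column.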
Oh, I understand now. I was hoping from_pandas would handle the conversion by itself. I have genomic data with thousands of columns. |
Thanks @royassis for checking TileDB out. If you could provide some information about your use case and the schema of the raw data, we can get back to you with an optimized array schema and optimized ingestion scripts. We have a lot of experience with genomics data. Cheers! |
Hey @stavrospapadopoulos, I would love that. I'll first check with my colleagues what data I can share. |
Hey again @stavrospapadopoulos. We work with many formats, but mainly with .h5ad. We have many projects; one of them is a small Streamlit app that reads data from .h5ad files and does some visualizations with Scanpy. As time went by and we added a few larger datasets, we looked for a solution that would give better performance, and we found TileDB. A typical query pulls the values for all cell barcodes but only for a small number of genes (no more than 10). When working with pandas, a typical dataframe has the genes plus some added features as the column names, and the cell barcodes as the index. The data is mostly sparse (many zeros, but all rows are present). We want to do aggregations on a small subset of genes, but across all barcodes. |
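To make the query pattern concrete, here is a small pandas-only sketch of aggregating a few genes across all barcodes. It assumes the data is held as (barcode, gene, value) records with the zeros omitted; the column names, the sample data, and the helper `gene_means` are hypothetical. Because the zeros are not stored, the mean has to divide by the total number of barcodes rather than by the number of stored records.

```python
import pandas as pd


def gene_means(coo: pd.DataFrame, genes: list, n_barcodes: int) -> pd.Series:
    """Mean expression per gene over all barcodes, counting omitted cells as zero."""
    subset = coo[coo["gene"].isin(genes)]
    totals = subset.groupby("gene")["value"].sum()
    # Genes with no stored records have a total of zero.
    return totals.reindex(genes, fill_value=0) / n_barcodes


# Hypothetical non-zero records for 4 barcodes (c1..c4) and 3 genes.
coo = pd.DataFrame(
    {
        "barcode": ["c1", "c3", "c2", "c1"],
        "gene": ["GeneA", "GeneA", "GeneB", "GeneC"],
        "value": [4, 2, 9, 1],
    }
)
means = gene_means(coo, ["GeneA", "GeneB"], n_barcodes=4)
print(means)
```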
Oh, you are working with single-cell data. You are in luck! We are working closely with the Chan Zuckerberg Initiative to define a unified single-cell data model and API that is interoperable with Seurat, Bioconductor and Scanpy. Please check the API spec and ongoing TileDB implementation of the spec below: |
@nguyenv @stavrospapadopoulos Thank you both. I have more questions to ask. What is the best place for that? |
You can join our Slack community or post questions on our forum. |
Hey there
I'm using tiledb.from_pandas to create a tiledb array from a pandas dataframe.
My question regards the sparse parameter of the from_pandas function.
My dataframe mostly consists of "0" values. Should I convert them to np.nan?