sparse from_pandas ambiguity #1170

Open
royassis opened this issue Jun 11, 2022 · 10 comments
Comments

@royassis

Hey there

I'm using tiledb.from_pandas to create a tiledb array from a pandas dataframe.
My question regards the sparse parameter of the from_pandas function.

My dataframe mostly consists of "0" values. Should I convert them to np.nan?

@nguyenv
Collaborator

nguyenv commented Jun 13, 2022

Hi @royassis,

I would drop the rows containing the 0 values:

import tiledb
import numpy as np
import pandas as pd

# dataframe with a lot of zero values
print("original data with zeros")
data = pd.DataFrame(np.random.randint(0, 2, size=(10)))
print(data)

# keep only the rows with non-zero values
data = data[data[0] != 0]

uri = "example_pd_sparse.tdb"

# remove the array if it already exists
vfs = tiledb.VFS()
if vfs.is_dir(uri):
    vfs.remove_dir(uri)

# create a tiledb array from a pandas dataframe
tiledb.from_pandas(uri, data, sparse=True)

# read the tiledb array back
with tiledb.open(uri, "r") as A:
    print("resulting sparse array")
    print(A.df[:])
    print(data.equals(A.df[:]))

Example run:

(tiledb-3.10) vivian@mangonada:~/tiledb-bugs$ python pd-df-zeros.py
original data with zeros
   0
0  0
1  0
2  0
3  0
4  1
5  0
6  1
7  0
8  1
9  1
resulting sparse array
   0
4  1
6  1
8  1
9  1
True

Let me know if this answers your question.

Thanks.

@royassis
Author

royassis commented Jun 13, 2022

Hey :)

Actually my df has multiple columns.
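
For instance (a quick pandas sketch with made-up column names), dropping whole rows based on one column discards nonzero values in the other columns:

```python
import pandas as pd

# small multi-column dataframe where zeros land in different columns per row
data = pd.DataFrame({"gene_a": [0, 1, 0], "gene_b": [1, 0, 0]})

# dropping rows where gene_a is 0 also drops row 0, losing its gene_b value of 1
dropped = data[data["gene_a"] != 0]
print(dropped)  # only one row survives, though two cells held nonzero data
```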

@nguyenv
Collaborator

nguyenv commented Jun 13, 2022

I think I understand now. tiledb.from_pandas does recognize Pandas nullable dtypes (note the pd.UInt8Dtype()) and will accordingly set the TileDB attribute to be nullable (note the resulting schema). You can set nullable values with np.nan, pd.NA, or None in the dataframe and that will be reflected in the TileDB array.

import tiledb
import numpy as np
import pandas as pd

print("original data with zeros")
data = pd.DataFrame(np.random.randint(0, 2, size=(10, 3)))
print(data)

# convert to a pandas nullable dtype and replace the 0s with the nullable value
# (pd.NA is used explicitly; replace(0, None) can trigger pad-fill behavior in
# older pandas versions)
data = data.astype(pd.UInt8Dtype())
data = data.replace(0, pd.NA)

uri = "example_pd_sparse.tdb"

# remove the array if it already exists
vfs = tiledb.VFS()
if vfs.is_dir(uri):
    vfs.remove_dir(uri)

# create a tiledb array from a pandas dataframe
tiledb.from_pandas(uri, data, sparse=True)

# read the tiledb array
with tiledb.open(uri, "r") as A:
    print("resulting array")
    print(A.df[:])
    print(data.equals(A.df[:]))

Example run:

original data with zeros
   0  1  2
0  1  0  1
1  1  0  0
2  0  1  0
3  0  0  0
4  0  0  1
5  1  0  1
6  0  1  1
7  1  1  1
8  1  0  0
9  1  0  0
resulting array
ArraySchema(
  domain=Domain(*[
    Dim(name='__tiledb_rows', domain=(0, 9), tile=9, dtype='int64', filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    Attr(name='0', dtype='uint8', var=False, nullable=True, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='1', dtype='uint8', var=False, nullable=True, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='2', dtype='uint8', var=False, nullable=True, filters=FilterList([ZstdFilter(level=-1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=True,
  allows_duplicates=True,
)
      0     1     2
0     1  <NA>     1
1     1  <NA>  <NA>
2  <NA>     1  <NA>
3  <NA>  <NA>  <NA>
4  <NA>  <NA>     1
5     1  <NA>     1
6  <NA>     1     1
7     1     1     1
8     1  <NA>  <NA>
9     1  <NA>  <NA>
True

However, since all your coordinates (__tiledb_rows) contain data, you won't get any of the benefits of a sparse array, and you might even be better off creating this as a dense array.

If all the columns in your dataframe have the same datatype, I'm curious whether you are looking for something more like this. It does not use tiledb.from_pandas: we create the schema ourselves with two dimensions and one attribute, convert the dataframe to a NumPy array, and grab the nonzero coordinates and data to write to the TileDB array. You end up with an array that is sparsely populated.

import tiledb
import numpy as np
import pandas as pd

print("original data with zeros")
original_data = pd.DataFrame(np.random.randint(0, 2, size=(10, 3)))
print(original_data)

uri = "example_normal_sparse.tdb"
# remove the array if it already exists
vfs = tiledb.VFS()
if vfs.is_dir(uri):
    vfs.remove_dir(uri)

arr = original_data.to_numpy()

# note: np.nonzero returns (row, column) indices, so the first dimension
# here indexes the dataframe's rows (0-9) and the second its columns (0-2)
dom = tiledb.Domain(
    tiledb.Dim("col", domain=(0, 9), dtype=np.uint8),
    tiledb.Dim("row", domain=(0, 2), dtype=np.uint8),
)
att = tiledb.Attr(dtype=np.uint8)
schema = tiledb.ArraySchema(domain=dom, attrs=(att,), sparse=True)

tiledb.Array.create(uri, schema)

with tiledb.open(uri, "w") as A:
    A[np.nonzero(arr)] = arr[np.nonzero(arr)]

with tiledb.open(uri, "r") as A:
    print("resulting array")
    print(A.schema)
    print(A.df[:])

Example run:

original data with zeros
   0  1  2
0  0  0  0
1  1  0  1
2  0  1  0
3  0  0  0
4  0  1  1
5  0  1  0
6  0  1  0
7  1  1  0
8  0  0  1
9  0  0  1
resulting array
ArraySchema(
  domain=Domain(*[
    Dim(name='col', domain=(0, 9), tile=10, dtype='uint8'),
    Dim(name='row', domain=(0, 2), tile=3, dtype='uint8'),
  ]),
  attrs=[
    Attr(name='', dtype='uint8', var=False, nullable=False),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=True,
  allows_duplicates=False,
)

    col  row
0     1    0  1
1     1    2  1
2     2    1  1
3     4    1  1
4     4    2  1
5     5    1  1
6     6    1  1
7     7    0  1
8     7    1  1
9     8    2  1
10    9    2  1
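
The nonzero-coordinate write above is essentially a COO (coordinate) sparse representation; a NumPy-only sketch of the round trip (the small matrix here is just an illustration):

```python
import numpy as np

# dense matrix that is mostly zeros
arr = np.array([[0, 0, 1],
                [1, 0, 0],
                [0, 1, 1]], dtype=np.uint8)

# np.nonzero yields the (row, col) coordinates of every nonzero cell,
# which is exactly what gets written into the sparse TileDB array
rows, cols = np.nonzero(arr)
values = arr[rows, cols]

# storing only (rows, cols, values) is enough to rebuild the dense matrix
recon = np.zeros_like(arr)
recon[rows, cols] = values
print(np.array_equal(recon, arr))  # True
```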

@royassis
Author

Ohh, I understand now. I was hoping from_pandas would handle the conversion by itself.

I got genomic data with thousands of columns.

@stavrospapadopoulos
Member

Thanks @royassis for checking TileDB out. If you could provide some information about your use case and the schema of the raw data, we can get back to you with an optimized array schema and optimized ingestion scripts. We have a lot of experience with genomics data. Cheers!

@royassis
Author

Hey @stavrospapadopoulos, I would love that. I'll check first with my colleagues on what data I can share.

@royassis
Author

royassis commented Jun 13, 2022

Hey again @stavrospapadopoulos

We work with many formats, but mainly with .h5ad.
We have datasets with up to 30,000 genes and up to millions of cell barcodes.
Our data is usually very sparse.

We have many projects; one of them is a small Streamlit app that reads data from .h5ad files and does some visualizations with Scanpy. As time went by and we added a few larger datasets, we looked for a solution that would give better performance, and we found TileDB.

A usual query will pull the values of all cell barcodes, but only for a small number of genes (no more than 10).
In the query we will also pull some feature data (e.g. cell type), do some aggregation per gene for each cell type, and visualize the results.

When working with pandas, a usual dataframe has genes plus some added features as the column names, and cell barcodes as the index. The data is mostly sparse (many zeros, but all rows are present), and we want to do aggregations on a small subset of genes but for all barcodes.
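
A rough pandas sketch of that access pattern (gene names, cell types, and sizes are made up), aggregating a small subset of genes per cell type:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# expression matrix: cell barcodes as the index, genes as columns (mostly zeros)
df = pd.DataFrame(
    rng.integers(0, 2, size=(6, 4)),
    index=[f"cell_{i}" for i in range(6)],
    columns=["gene_a", "gene_b", "gene_c", "gene_d"],
)
# added feature column
df["cell_type"] = ["T", "B", "T", "B", "T", "B"]

# pull a small subset of genes for all barcodes, then aggregate per cell type
genes = ["gene_a", "gene_b"]
summary = df.groupby("cell_type")[genes].mean()
print(summary)
```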

@stavrospapadopoulos
Member

Oh, you are working with single-cell data. You are in luck! We are working closely with the Chan Zuckerberg Initiative to define a unified single-cell data model and API that is interoperable with Seurat, Bioconductor, and Scanpy. Please check the API spec and the ongoing TileDB implementation of the spec below:

@royassis
Author

@nguyenv @stavrospapadopoulos Thank you both.

I have more questions to ask; what is the best place for that?

@stavrospapadopoulos
Member

You can join our Slack community or post questions on our forum.

@ihnorton ihnorton changed the title spare from_pandas ambiguity sparse from_pandas ambiguity Jun 14, 2022