sparse from_pandas ambiguity #1170

Open
royassis opened this issue Jun 11, 2022 · 10 comments
Comments

@royassis

Hey there

I'm using tiledb.from_pandas to create a tiledb array from a pandas dataframe.
My question regards the sparse parameter of the from_pandas function.

My dataframe mostly consists of "0" values. Should I convert them to np.nan?

@nguyenv
Collaborator

nguyenv commented Jun 13, 2022

Hi @royassis,

I would drop the rows containing the 0 values:

import tiledb
import numpy as np
import pandas as pd

# dataframe with a lot of zero values
print("original data with zeros")
data = pd.DataFrame(np.random.randint(0, 2, size=(10)))
print(data)

# keep only the rows with non-zero values
data = data[data[0] != 0]

uri = "example_pd_sparse.tdb"

# remove the array if it already exists
vfs = tiledb.VFS()
if vfs.is_dir(uri):
    vfs.remove_dir(uri)

# create a tiledb array from a pandas dataframe
tiledb.from_pandas(uri, data, sparse=True)

# read the tiledb array back
with tiledb.open(uri, "r") as A:
    print("resulting sparse array")
    print(A.df[:])
    print(data.equals(A.df[:]))

Example run:

(tiledb-3.10) vivian@mangonada:~/tiledb-bugs$ python pd-df-zeros.py
original data with zeros
   0
0  0
1  0
2  0
3  0
4  1
5  0
6  1
7  0
8  1
9  1
resulting sparse array
   0
4  1
6  1
8  1
9  1
True

Let me know if this answers your question.

Thanks.

@royassis
Author

royassis commented Jun 13, 2022

Hey :)

Actually my df has multiple columns.
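
For instance (a quick pandas sketch with made-up column names), dropping whole rows based on one column discards nonzero values in the other columns:

```python
import pandas as pd

# small multi-column dataframe where zeros land in different columns per row
data = pd.DataFrame({"gene_a": [0, 1, 0], "gene_b": [1, 0, 0]})

# dropping rows where gene_a is 0 also drops row 0, losing its gene_b value of 1
dropped = data[data["gene_a"] != 0]
print(dropped)  # only one row survives, though two cells held nonzero data
```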

@nguyenv
Collaborator

nguyenv commented Jun 13, 2022

I think I understand now. tiledb.from_pandas does recognize Pandas nullable dtypes (note the pd.UInt8Dtype()) and will accordingly set the TileDB attribute to be nullable (note the resulting schema). You can set nullable values with np.nan, pd.NA, or None in the dataframe and that will be reflected in the TileDB array.

import tiledb
import numpy as np
import pandas as pd

print("original data with zeros")
data = pd.DataFrame(np.random.randint(0, 2, size=(10, 3)))
print(data)

# convert to a pandas nullable dtype and replace the 0s with the nullable value
# (pd.NA is used explicitly; replace(0, None) can trigger pad-fill behavior in
# older pandas versions)
data = data.astype(pd.UInt8Dtype())
data = data.replace(0, pd.NA)

uri = "example_pd_sparse.tdb"

# remove the array if it already exists
vfs = tiledb.VFS()
if vfs.is_dir(uri):
    vfs.remove_dir(uri)

# create a tiledb array from a pandas dataframe
tiledb.from_pandas(uri, data, sparse=True)

# read the tiledb array
with tiledb.open(uri, "r") as A:
    print("resulting array")
    print(A.df[:])
    print(data.equals(A.df[:]))

Example run:

original data with zeros
   0  1  2
0  1  0  1
1  1  0  0
2  0  1  0
3  0  0  0
4  0  0  1
5  1  0  1
6  0  1  1
7  1  1  1
8  1  0  0
9  1  0  0
resulting array
ArraySchema(
  domain=Domain(*[
    Dim(name='__tiledb_rows', domain=(0, 9), tile=9, dtype='int64', filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    Attr(name='0', dtype='uint8', var=False, nullable=True, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='1', dtype='uint8', var=False, nullable=True, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='2', dtype='uint8', var=False, nullable=True, filters=FilterList([ZstdFilter(level=-1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=True,
  allows_duplicates=True,
)
      0     1     2
0     1  <NA>     1
1     1  <NA>  <NA>
2  <NA>     1  <NA>
3  <NA>  <NA>  <NA>
4  <NA>  <NA>     1
5     1  <NA>     1
6  <NA>     1     1
7     1     1     1
8     1  <NA>  <NA>
9     1  <NA>  <NA>
True

However, since all your coordinates (__tiledb_rows) contain data, you won't get any of the benefits of a sparse array, and you might even be better off creating this as a dense array.

If all the columns in your dataframe have the same datatype, I'm curious whether you are looking for something more like this. It does not use tiledb.from_pandas: we create the schema ourselves with two dimensions and one attribute, convert the dataframe to a NumPy array, and grab the nonzero coordinates and data to write to the TileDB array. You end up with an array that is sparsely populated.

import tiledb
import numpy as np
import pandas as pd

print("original data with zeros")
original_data = pd.DataFrame(np.random.randint(0, 2, size=(10, 3)))
print(original_data)

uri = "example_normal_sparse.tdb"
# remove the array if it already exists
vfs = tiledb.VFS()
if vfs.is_dir(uri):
    vfs.remove_dir(uri)

arr = original_data.to_numpy()

# note: np.nonzero returns (row, column) indices, so the first dimension
# here indexes the dataframe's rows (0-9) and the second its columns (0-2)
dom = tiledb.Domain(
    tiledb.Dim("col", domain=(0, 9), dtype=np.uint8),
    tiledb.Dim("row", domain=(0, 2), dtype=np.uint8),
)
att = tiledb.Attr(dtype=np.uint8)
schema = tiledb.ArraySchema(domain=dom, attrs=(att,), sparse=True)

tiledb.Array.create(uri, schema)

with tiledb.open(uri, "w") as A:
    A[np.nonzero(arr)] = arr[np.nonzero(arr)]

with tiledb.open(uri, "r") as A:
    print("resulting array")
    print(A.schema)
    print(A.df[:])

Example run:

original data with zeros
   0  1  2
0  0  0  0
1  1  0  1
2  0  1  0
3  0  0  0
4  0  1  1
5  0  1  0
6  0  1  0
7  1  1  0
8  0  0  1
9  0  0  1
resulting array
ArraySchema(
  domain=Domain(*[
    Dim(name='col', domain=(0, 9), tile=10, dtype='uint8'),
    Dim(name='row', domain=(0, 2), tile=3, dtype='uint8'),
  ]),
  attrs=[
    Attr(name='', dtype='uint8', var=False, nullable=False),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=True,
  allows_duplicates=False,
)

    col  row
0     1    0  1
1     1    2  1
2     2    1  1
3     4    1  1
4     4    2  1
5     5    1  1
6     6    1  1
7     7    0  1
8     7    1  1
9     8    2  1
10    9    2  1
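
The nonzero-coordinate write above is essentially a COO (coordinate) sparse representation; a NumPy-only sketch of the round trip (the small matrix here is just an illustration):

```python
import numpy as np

# dense matrix that is mostly zeros
arr = np.array([[0, 0, 1],
                [1, 0, 0],
                [0, 1, 1]], dtype=np.uint8)

# np.nonzero yields the (row, col) coordinates of every nonzero cell,
# which is exactly what gets written into the sparse TileDB array
rows, cols = np.nonzero(arr)
values = arr[rows, cols]

# storing only (rows, cols, values) is enough to rebuild the dense matrix
recon = np.zeros_like(arr)
recon[rows, cols] = values
print(np.array_equal(recon, arr))  # True
```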

@royassis
Author

Ohh, I understand now. I was hoping from_pandas would handle the conversion by itself.

I got genomic data with thousands of columns.

@stavrospapadopoulos
Member

Thanks @royassis for checking TileDB out. If you could provide some information about your use case and the schema of the raw data, we can get back to you with an optimized array schema and optimized ingestion scripts. We have a lot of experience with genomics data. Cheers!

@royassis
Author

Hey @stavrospapadopoulos, I would love that. I'll check first with my colleagues on what data I can share.

@royassis
Author

royassis commented Jun 13, 2022

Hey again @stavrospapadopoulos

We work with many formats, but mainly with .h5ad.
We have datasets with up to 30,000 genes and up to millions of cell barcodes.
Our data is usually very sparse.

We have many projects; one of them is a small Streamlit app that reads data from .h5ad files and does some visualizations with Scanpy. As time went by and we added a few larger datasets, we looked for a solution that would give better performance, and we found TileDB.

A usual query will pull the values of all cell barcodes, but only for a small number of genes (no more than 10).
In the query we will also pull some feature data (e.g. cell type), do some aggregation per gene for each cell type, and visualize the results.

When working with pandas, a usual dataframe has genes plus some added features as the column names, and cell barcodes as the index. The data is mostly sparse (many zeros, but all rows are present), and we want to do aggregations on a small subset of genes but for all barcodes.
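
A rough pandas sketch of that access pattern (gene names, cell types, and sizes are made up), aggregating a small subset of genes per cell type:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# expression matrix: cell barcodes as the index, genes as columns (mostly zeros)
df = pd.DataFrame(
    rng.integers(0, 2, size=(6, 4)),
    index=[f"cell_{i}" for i in range(6)],
    columns=["gene_a", "gene_b", "gene_c", "gene_d"],
)
# added feature column
df["cell_type"] = ["T", "B", "T", "B", "T", "B"]

# pull a small subset of genes for all barcodes, then aggregate per cell type
genes = ["gene_a", "gene_b"]
summary = df.groupby("cell_type")[genes].mean()
print(summary)
```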

@stavrospapadopoulos
Member

Oh, you are working with single-cell data. You are in luck! We are working closely with the Chan Zuckerberg Initiative to define a unified single-cell data model and API that is interoperable with Seurat, Bioconductor, and Scanpy. Please check the API spec and the ongoing TileDB implementation of the spec below:

@royassis
Author

@nguyenv @stavrospapadopoulos Thank you both.

I have more questions to ask; what is the best place for that?

@stavrospapadopoulos
Member

You can join our Slack community or post questions on our forum.

@ihnorton ihnorton changed the title spare from_pandas ambiguity sparse from_pandas ambiguity Jun 14, 2022