
multi_index causes TileDBError (std::bad_alloc) #296

Open
hnra opened this issue Apr 1, 2020 · 2 comments

hnra commented Apr 1, 2020

I'm trying to write and read tons of randomly distributed values in a large sparse array. Writing to a TileDB sparse array is awesome; it's easy and super fast. However, I have not found a way to read the values back.

My use case involves updating the values of the sparse array by summing it with another sparse array in COO format. Using multi_index results in uncaught errors. Here's an example:

import tiledb as t
import numpy as np
import random

d1 = t.Dim(name="d1", domain=(1, 3_000_000), dtype=np.int64, tile=3000)
d2 = t.Dim(name="d2", domain=(1, 3_000_000), dtype=np.int64, tile=3000)
schema = t.ArraySchema(domain=t.Domain(d1, d2), sparse=True, attrs=[t.Attr(name="a", dtype=np.int64)])
t.SparseArray.create("test_sparse", schema)

# Just for testing
row = random.choices(range(1, 3_000_000), k=10_000_000)
col = random.choices(range(1, 3_000_000), k=10_000_000)
row, col = tuple(zip(*set(zip(row, col))))
cnt = random.choices(range(1, 3_000_000), k=len(row))
row, col = np.array(row), np.array(col)
cnt = np.array(cnt)

# Write the data to the array (super fast, awesome!)
with t.SparseArray("test_sparse", "w") as A:
    A[row, col] = cnt

# Read the data just written
with t.SparseArray("test_sparse", "r") as A:
    data = A.multi_index[row.tolist(), col.tolist()]
    print(data["a"])

The above multi_index call results in:

---------------------------------------------------------------------------
TileDBError                               Traceback (most recent call last)
<ipython-input-12-4a24ca219859> in <module>
      1 with t.SparseArray("test_sparse", "r") as A:
----> 2     data = A.multi_index[row.tolist(), col.tolist()]
      3     print(data["a"])

~/liu/anchor/.venv/lib/python3.8/site-packages/tiledb/multirange_indexing.py in __getitem__(self, idx)
    131 
    132         # TODO order
--> 133         result_dict = multi_index(
    134             self.array,
    135             attr_names,

tiledb/indexing.pyx in tiledb.libtiledb.multi_index()

tiledb/indexing.pyx in tiledb.libtiledb.multi_index()

tiledb/indexing.pyx in tiledb.libtiledb.execute_multi_index()

tiledb/libtiledb.pyx in tiledb.libtiledb._raise_ctx_err()

tiledb/libtiledb.pyx in tiledb.libtiledb._raise_tiledb_error()

TileDBError: Error: Internal TileDB uncaught exception; std::bad_alloc
@ihnorton ihnorton self-assigned this Apr 1, 2020

ihnorton commented Apr 1, 2020

@hnra thanks for the bug report, investigating.


hnra commented Apr 7, 2020

This may still be considered a bug, but I clearly misunderstood the functionality of multi_index. I thought it allowed for coordinate selection, similar to vindex or get_coordinate_selection in Zarr, but it appears to select the cross product of the two coordinate lists instead. The query I'm submitting in the code above is therefore massive (I'm guessing it's 10,000,000^2 cells) and cannot be allocated, since it would require hundreds of terabytes.
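The back-of-envelope math, assuming each dimension's list is selected independently so the result covers their Cartesian product:

```python
# Rough size estimate under the cross-product interpretation of the
# query above (assumption: the two coordinate lists are treated as
# independent per-dimension selections).
n = 10_000_000              # coordinates passed per dimension
cells = n * n               # 10**14 cells in the selected region
attr_bytes = cells * 8      # one int64 attribute value per cell
print(attr_bytes / 10**12)  # -> 800.0, i.e. ~800 TB before coordinates
```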

Is there any way to access the elements of a SparseArray through coordinates, similar to when writing values? Right now I have resorted to splitting my selection into multi_index queries, one for each row.
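Roughly, the splitting looks like this (just a sketch; `group_by_row` is my own helper, and the snippet in its docstring assumes the same `test_sparse` array and `row`/`col` arrays as above):

```python
import numpy as np

def group_by_row(row, col):
    """Group column coordinates by their row coordinate.

    Returns a dict mapping each row index to the sorted column
    indices selected in that row, so the caller can issue one
    multi_index query per row instead of one giant cross product:

        with t.SparseArray("test_sparse", "r") as A:
            for r, cols in group_by_row(row, col).items():
                part = A.multi_index[r, cols.tolist()]["a"]
    """
    row, col = np.asarray(row), np.asarray(col)
    return {int(r): np.sort(col[row == r]) for r in np.unique(row)}
```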
