Comparison with HDF5 (PyTables) #346
Hi @graykode, thanks for posting this issue! Quick note: in your example above, you may want to make the following modifications:
**How TileDB performs writes**

TileDB follows a general algorithm that applies to both dense and sparse arrays, with any arbitrary filter (e.g., compression, encryption, etc.) and all layouts. It works as follows:
Everything internally is heavily parallelized with TBB, but the algorithm still needs to perform the two full copies I mention above.

**What happens in your example**

Your example is a special scenario: you have a single tile, so the cells in the numpy buffer already have the same layout as the one that will be written to disk. Moreover, you are not specifying any filter. Consequently, it is possible to avoid the two extra copies, which is what I assume HDF5 does. Hence the difference in performance.

**The solution**

We need to optimize for special scenarios like this. It is fairly easy, before executing the query, to determine whether this is a special scenario where no cell shuffling and no filtering is involved. In those cases we should write directly from the user buffers (e.g., numpy arrays) to disk, bypassing the tile preparation and filtering steps. Thanks for putting this on our radar. We'll implement the optimizations and provide updates soon.
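The difference between the general write path and the single-tile, no-filter special case can be sketched as follows. This is a hypothetical numpy/zlib illustration of the two-copy idea, not TileDB's actual internals; `tiled_filtered_write` and `direct_write` are made-up names for this sketch:

```python
import zlib
import numpy as np

def tiled_filtered_write(data, tile_rows):
    """General path: re-lay cells into per-tile buffers, then filter each tile."""
    out = []
    for start in range(0, data.shape[0], tile_rows):
        tile = data[start:start + tile_rows].copy()   # copy 1: cells staged into a tile buffer
        out.append(zlib.compress(tile.tobytes()))     # copy 2: filter pipeline (compression here)
    return out

def direct_write(data):
    """Special case: one tile, no filters -- the user buffer goes to disk as-is."""
    return [data.tobytes()]

data = np.arange(12, dtype=np.float64).reshape(4, 3)
chunks = tiled_filtered_write(data, 2)
assert zlib.decompress(chunks[0]) == data[:2].tobytes()
assert direct_write(data)[0] == data.tobytes()
```

The special case skips both allocations entirely, which is why it can approach raw-I/O speed.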
Thanks for your reply. I read your solution and understand that it simply means saving the data in the form of a buffer. Thanks @stavrospapadopoulos
The schema:

```python
schema = tiledb.ArraySchema(domain=dom,
                            sparse=False,
                            attrs=[
                                tiledb.Attr(name="a2", dtype=np.float64)  # <- this is the change here
                            ])
```

Regarding the solution, you do not need to do anything :-). We need to implement the solution I suggested inside TileDB core, and once we merge a patch you will get the performance boost without changing your code. I hope this helps.
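To see why the attribute dtype matters here, note that a numpy buffer can be handed off without a conversion copy only when its dtype already matches the attribute's. A small illustration, independent of TileDB:

```python
import numpy as np

data_f64 = np.random.rand(4, 3)           # float64, matches dtype=np.float64 above
data_f32 = data_f64.astype(np.float32)    # would need converting before a float64 write

# Matching dtype: asarray hands back the same buffer, no copy.
assert np.asarray(data_f64, dtype=np.float64) is data_f64
# Mismatched dtype: a new, converted array must be allocated.
assert np.asarray(data_f32, dtype=np.float64) is not data_f32
```

Declaring the attribute as `np.float64` lets the default-`float64` numpy array pass straight through.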
I understand what you're saying. Then I will wait for your patch. Thanks!
I compared writing a numpy array of shape (200000, 784) as a dense array in TileDB against PyTables' `create_array`.
However, TileDB's I/O is significantly slower than PyTables'.
[tiledb and pytable benchmark snippets not captured in this copy]
I'd like to know the cause.
I look forward to any suggested solutions. @ihnorton