
TileDB writer performance improvements#39

Open
DavidStirling wants to merge 5 commits intoglencoesoftware:mainfrom
DavidStirling:remote-tables-2

Conversation

@DavidStirling
Contributor

We previously used a chunked writing approach to keep things as close to the pytables writing API as possible. In practice this had a few unintended consequences: each chunked write into a TileDB array creates a fragment, and accumulated fragments hurt performance when reading the table back. Our previous default chunk size was chosen arbitrarily and ended up being small (1,000 rows).

This PR makes a few revisions to improve things:

  • Default chunk size to 10,000
  • Add the Zstd compression filter which is applied by the native omero-plus generator.
  • Use the native chunking from tiledb's from_pandas method instead of doing so manually. This gives us better performance at the expense of losing granular progress tracking (there is no progress hook).
  • Add an optional consolidation/vacuum step after TileDB construction to clean up fragments (cleanup=True). Not sure if we want to expose this in the main entry point yet, but it's available when calling the remote module directly. It's disabled by default to match previous behaviour; consolidation is expensive, so we should evaluate whether it's necessary for typical workflows.

There's a broader question of whether we should try to determine the chunk size automatically. It's not clear what provides optimal performance with TileDB, so we might need to investigate further.
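To make that open question concrete, here is one hypothetical heuristic (not from this PR): pick the chunk size so each tile holds roughly a fixed number of bytes, clamped to sane bounds. The `auto_chunk_size` name, the 64 MiB target, and the clamp bounds are illustrative assumptions only.

```python
def auto_chunk_size(row_nbytes, target_tile_bytes=64 << 20,
                    minimum=10_000, maximum=1_000_000):
    """Hypothetical heuristic: size chunks so one tile is ~target_tile_bytes.

    row_nbytes is the approximate size of one row in bytes; the result is
    clamped so very wide or very narrow tables stay within sane bounds.
    """
    if row_nbytes <= 0:
        return minimum
    return max(minimum, min(maximum, target_tile_bytes // row_nbytes))

# A 64-byte row targets 64 MiB / 64 B = 1,048,576 rows, clamped to 1,000,000.
assert auto_chunk_size(64) == 1_000_000
# A 64 KiB row targets only 1,024 rows, clamped up to the 10,000 floor.
assert auto_chunk_size(64 << 10) == 10_000
```

Whether a byte-based target actually tracks TileDB read performance would still need the benchmarking mentioned above.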

For testing, try uploading a remote table starting with both a pandas dataframe in memory and a csv file on disk. Both should register without issues.
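A rough sketch of the combined flow described above (the helper name `write_tiledb` and `DEFAULT_CHUNK` are hypothetical; the kwargs mirror the tiledb-py `from_csv`/`from_pandas` calls discussed in this PR, and cleanup uses `tiledb.consolidate`/`tiledb.vacuum`):

```python
from pathlib import Path

DEFAULT_CHUNK = 10_000  # new default chunk size from this PR


def write_tiledb(output_path, source, chunk_size=DEFAULT_CHUNK, cleanup=False):
    """Write `source` (a CSV path or a pandas DataFrame) to a sparse TileDB array."""
    import tiledb  # deferred import so this module loads without tiledb installed

    # Zstd compression on the dimension, matching the omero-plus generator
    filters = tiledb.FilterList([tiledb.ZstdFilter()])
    if isinstance(source, (str, Path)):
        # CSV on disk: let from_csv do the chunked ingestion natively
        tiledb.from_csv(str(output_path), str(source), sparse=True,
                        full_domain=True, tile=chunk_size,
                        dim_filters=filters, attr_filters=None,
                        chunksize=chunk_size, allows_duplicates=False)
    else:
        # DataFrame already in memory
        tiledb.from_pandas(str(output_path), source, sparse=True,
                           full_domain=True, tile=chunk_size,
                           dim_filters=filters, attr_filters=None,
                           chunksize=chunk_size, allows_duplicates=False)
    if cleanup:
        # Optional: merge the fragments that chunked writes leave behind.
        # Expensive, so disabled by default.
        tiledb.consolidate(str(output_path))
        tiledb.vacuum(str(output_path))
```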

@DavidStirling DavidStirling requested review from kkoz and mabruce October 17, 2025 15:15
```python
bar.update(1)
if isinstance(source, (str, Path)):
    tiledb.from_csv(output_path, source, sparse=True, full_domain=True,
                    tile=10000, dim_filters=filters, attr_filters=None,
```
Member

I believe we need tile to match chunksize. chunksize is just how many rows are read at a time. tile is what determines the number of files written. See https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/latest/python-api.html#tiledb.from_csv

Contributor Author

The tile attribute was set to match the omero-plus tiledb implementation. I did test this using different chunksize arguments and that altered the number of fragments in the resulting table even though tile was unchanged. Was there any reason that 10000 was selected?

Member

Ran create_tiledb with various chunk sizes and confirmed what you said - the chunk size seems to be the determining factor in the number of fragments. Not sure what tile is doing here. @chris-allan Do you know the significance of tile in from_csv here or why you chose 10000 as the default in https://github.com/glencoesoftware/omero-plus/blob/master/omero_plus/tables_tiledb.py#L420?

Cleanup and chunksize working as expected.

Member

The tile attribute is setting the tile extent of the singular dimension from_csv will apply to the TileDB array it will create. Essentially, the chunk size. I chose 10000 because that was the size from_csv applied when I was doing my initial testing and I had no data to back up selecting any other number. Most pytables tables also use 10000; for that engine it is called "chunkshape" or "chunksize". If you want to dive into the details you can read about it here:

There is no right value; it depends heavily on the use case. Increase it and the decompression and read overhead for individual or random-access row retrieval grows substantially. Decrease it and bulk retrieval becomes comparatively more expensive.
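A toy model of that trade-off, assuming a read of a row range must decompress every tile-extent-sized chunk it overlaps (the `tiles_touched` helper is illustrative, not part of tiledb-py):

```python
def tiles_touched(row_start, row_stop, tile_extent):
    """How many tile-extent-sized chunks a read of [row_start, row_stop) overlaps."""
    first = row_start // tile_extent
    last = (row_stop - 1) // tile_extent
    return last - first + 1

# A single-row read always decompresses one whole tile, so a larger tile
# extent means more wasted decompression per random access.
assert tiles_touched(5, 6, 10_000) == 1
# A bulk read of 100,000 rows stitches together fewer chunks as the tile grows:
assert tiles_touched(0, 100_000, 10_000) == 10
assert tiles_touched(0, 100_000, 100_000) == 1
```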

/cc @erindiel, @mabruce, @emilroz, @sbesson

```python
tiledb.from_pandas(output_path, source, sparse=True,
                   full_domain=True, dim_filters=filters,
                   attr_filters=None, chunksize=chunk_size,
                   allow_duplciates=False)
```
Contributor
@mabruce mabruce Nov 24, 2025

Suggested change:

```diff
-                   allow_duplciates=False)
+                   allows_duplicates=False)
```

Contributor
Co-authored-by: Marc Bruce <6548052+mabruce@users.noreply.github.com>
Contributor
@mabruce mabruce left a comment

Tested using tiledb==0.34.0 to avoid triggering #41.

Tested csv input with and without cleanup, and with different chunk_size parameters. All scenarios tested produced identical DataFrames as read by tiledb.open(...).df[:].

Testing pandas input and re-reading the table via tiledb.open(...).df[:] similarly produced identical DataFrames. Interestingly, both chunk_size=100000 and chunk_size=10000 with cleanup=False still produced a single TileDB fragment, instead of the 2 or 12 fragments respectively seen in the csv test (~115000 rows). Presumably this is down to whatever tiledb.from_csv does before handing over to tiledb.from_pandas internally.
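The fragment counts observed for the csv path are consistent with one fragment per write batch, i.e. ceil(rows / chunk_size) — a sketch of that arithmetic (the `expected_fragments` helper is illustrative only):

```python
import math

def expected_fragments(n_rows, chunk_size):
    """If each chunked write batch lands as one fragment: ceil(rows / chunk)."""
    return math.ceil(n_rows / chunk_size)

# The ~115,000-row CSV from the test above:
assert expected_fragments(115_000, 10_000) == 12
assert expected_fragments(115_000, 100_000) == 2
# from_pandas writing a single fragment would be consistent with one
# batch covering every row of the in-memory DataFrame:
assert expected_fragments(115_000, 115_000) == 1
```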
