
TileDB writer performance improvements#39

Open
DavidStirling wants to merge 5 commits intoglencoesoftware:mainfrom
DavidStirling:remote-tables-2

Conversation

@DavidStirling
Contributor

We previously used a chunked writing approach to keep things as close to the pytables writing API as possible. In practice this had a few unintended consequences: each chunked write into a TileDB array creates a fragment, and accumulated fragments hurt performance when reading the table back. Our previous default chunk size was chosen arbitrarily and ended up being small (1,000 rows).

This PR makes a few revisions to improve things:

  • Default chunk size to 10,000
  • Add the Zstd compression filter which is applied by the native omero-plus generator.
  • Use the native chunking from tiledb's from_pandas method instead of doing so manually. This gives us better performance at the expense of losing granular progress tracking (there is no progress hook).
  • Add an optional consolidation/vacuum step after TileDB construction to clean up fragments (cleanup=True). Not sure if we want to expose this in the main entry point yet, but it's available when calling the remote module directly. It's disabled by default to match previous behaviour; consolidation is expensive, so we should evaluate whether it's necessary for typical workflows.

There's a broader question of whether we should try to determine the chunk size automatically. It's not clear what provides optimal performance with TileDB, so we might need to investigate further.
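To make that open question concrete, here is one hypothetical heuristic (not from this PR): pick the chunk size so each tile holds roughly a fixed number of bytes, clamped to sane bounds. The `auto_chunk_size` name, the 64 MiB target, and the clamp bounds are illustrative assumptions only.

```python
def auto_chunk_size(row_nbytes, target_tile_bytes=64 << 20,
                    minimum=10_000, maximum=1_000_000):
    """Hypothetical heuristic: size chunks so one tile is ~target_tile_bytes.

    row_nbytes is the approximate size of one row in bytes; the result is
    clamped so very wide or very narrow tables stay within sane bounds.
    """
    if row_nbytes <= 0:
        return minimum
    return max(minimum, min(maximum, target_tile_bytes // row_nbytes))

# A 64-byte row targets 64 MiB / 64 B = 1,048,576 rows, clamped to 1,000,000.
assert auto_chunk_size(64) == 1_000_000
# A 64 KiB row targets only 1,024 rows, clamped up to the 10,000 floor.
assert auto_chunk_size(64 << 10) == 10_000
```

Whether a byte-based target actually tracks TileDB read performance would still need the benchmarking mentioned above.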

For testing, try uploading a remote table starting with both a pandas dataframe in memory and a csv file on disk. Both should register without issues.
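A rough sketch of the combined flow described above (the helper name `write_tiledb` and `DEFAULT_CHUNK` are hypothetical; the kwargs mirror the tiledb-py `from_csv`/`from_pandas` calls discussed in this PR, and cleanup uses `tiledb.consolidate`/`tiledb.vacuum`):

```python
from pathlib import Path

DEFAULT_CHUNK = 10_000  # new default chunk size from this PR


def write_tiledb(output_path, source, chunk_size=DEFAULT_CHUNK, cleanup=False):
    """Write `source` (a CSV path or a pandas DataFrame) to a sparse TileDB array."""
    import tiledb  # deferred import so this module loads without tiledb installed

    # Zstd compression on the dimension, matching the omero-plus generator
    filters = tiledb.FilterList([tiledb.ZstdFilter()])
    if isinstance(source, (str, Path)):
        # CSV on disk: let from_csv do the chunked ingestion natively
        tiledb.from_csv(str(output_path), str(source), sparse=True,
                        full_domain=True, tile=chunk_size,
                        dim_filters=filters, attr_filters=None,
                        chunksize=chunk_size, allows_duplicates=False)
    else:
        # DataFrame already in memory
        tiledb.from_pandas(str(output_path), source, sparse=True,
                           full_domain=True, tile=chunk_size,
                           dim_filters=filters, attr_filters=None,
                           chunksize=chunk_size, allows_duplicates=False)
    if cleanup:
        # Optional: merge the fragments that chunked writes leave behind.
        # Expensive, so disabled by default.
        tiledb.consolidate(str(output_path))
        tiledb.vacuum(str(output_path))
```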

@DavidStirling DavidStirling requested review from kkoz and mabruce October 17, 2025 15:15
```python
bar.update(1)
if isinstance(source, (str, Path)):
    tiledb.from_csv(output_path, source, sparse=True, full_domain=True,
                    tile=10000, dim_filters=filters, attr_filters=None,
```
Member

I believe we need tile to match chunksize. chunksize is just how many rows are read at a time. tile is what determines the number of files written. See https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/latest/python-api.html#tiledb.from_csv

Contributor Author

The tile attribute was set to match the omero-plus tiledb implementation. I did test this using different chunksize arguments and that altered the number of fragments in the resulting table even though tile was unchanged. Was there any reason that 10000 was selected?

Member

Ran create_tiledb with various chunk sizes and confirmed what you said - the chunk size seems to be the determining factor in the number of fragments. Not sure what tile is doing here. @chris-allan Do you know the significance of tile in from_csv here or why you chose 10000 as the default in https://github.com/glencoesoftware/omero-plus/blob/master/omero_plus/tables_tiledb.py#L420?

Cleanup and chunksize working as expected.

Member

The tile attribute is setting the tile extent of the singular dimension from_csv will apply to the TileDB array it will create. Essentially, the chunk size. I chose 10000 because that was the size from_csv applied when I was doing my initial testing and I had no data to back up selecting any other number. Most pytables tables also use 10000; for that engine it is called "chunkshape" or "chunksize". If you want to dive into the details you can read about it here:

There is no right value; it depends heavily on the use case. Increase it and the decompression and read overhead for individual or random-access row retrieval grows substantially. Decrease it and bulk retrieval becomes comparatively more expensive.
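A toy model of that trade-off, assuming a read of a row range must decompress every tile-extent-sized chunk it overlaps (the `tiles_touched` helper is illustrative, not part of tiledb-py):

```python
def tiles_touched(row_start, row_stop, tile_extent):
    """How many tile-extent-sized chunks a read of [row_start, row_stop) overlaps."""
    first = row_start // tile_extent
    last = (row_stop - 1) // tile_extent
    return last - first + 1

# A single-row read always decompresses one whole tile, so a larger tile
# extent means more wasted decompression per random access.
assert tiles_touched(5, 6, 10_000) == 1
# A bulk read of 100,000 rows stitches together fewer chunks as the tile grows:
assert tiles_touched(0, 100_000, 10_000) == 10
assert tiles_touched(0, 100_000, 100_000) == 1
```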

/cc @erindiel, @mabruce, @emilroz, @sbesson

```python
tiledb.from_pandas(output_path, source, sparse=True,
                   full_domain=True, dim_filters=filters,
                   attr_filters=None, chunksize=chunk_size,
                   allow_duplciates=False)
```
Contributor
@mabruce mabruce Nov 24, 2025

Suggested change:

```diff
-                   allow_duplciates=False)
+                   allows_duplicates=False)
```

Contributor
Co-authored-by: Marc Bruce <6548052+mabruce@users.noreply.github.com>
Contributor
@mabruce mabruce left a comment

Tested using tiledb==0.34.0 to avoid triggering #41.

Tested csv input with and without cleanup, and with different chunk_size parameters. All scenarios tested produced identical DataFrames as read by tiledb.open(...).df[:].

Testing pandas input and re-reading the table via tiledb.open(...).df[:] similarly produced identical DataFrames. Interestingly, both chunk_size=100000 and chunk_size=10000 with cleanup=False still produced a single TileDB fragment, instead of the 2 or 12 fragments respectively seen in the csv test (~115000 rows). Presumably this is down to whatever tiledb.from_csv does before handing over to tiledb.from_pandas internally.
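The fragment counts observed for the csv path are consistent with one fragment per write batch, i.e. ceil(rows / chunk_size) — a sketch of that arithmetic (the `expected_fragments` helper is illustrative only):

```python
import math

def expected_fragments(n_rows, chunk_size):
    """If each chunked write batch lands as one fragment: ceil(rows / chunk)."""
    return math.ceil(n_rows / chunk_size)

# The ~115,000-row CSV from the test above:
assert expected_fragments(115_000, 10_000) == 12
assert expected_fragments(115_000, 100_000) == 2
# from_pandas writing a single fragment would be consistent with one
# batch covering every row of the in-memory DataFrame:
assert expected_fragments(115_000, 115_000) == 1
```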
