-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[c++/python] Add SOMAArray::stats
to get query stats
#1927
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marking as 'request changes' solely to put a 'pause' on this while discussing with @KiterLuc
My concerns -- not just about this PR; more to make sure this is really what the limitations are with core and that we can't do anything about it in TileDB-SOMA:
- My mental model of how stats are supposed to work is
- (a)
stats_enable
- (b) do something
- (c) do something else
- (d,e,f) make all sorts of nested whatever
- (g)
stats_dump
- This produces one dump for whatever core accesses were done through all of points b-f
- (a)
What appears to be the case now is this is broken, in the core API, in three ways:
- The final
stats_dump
does not includequery->stats()
output for various instances ofquery
-- some things appear in the final dump, others in the "side data" from a single query - We have to do
exp.ms["RNA"].X["data"].stats()
separately - Even then: on this PR as-is, we get the results of one query -- so in the case of
X.read().concat()
that takes three paginations, the stats will only have the results of the last query
If the core stats API is really such that we have no choice in TileDB-SOMA, we need a "caveat emptor" somewhere that stats
calls will need to be inserted inside each iteration of X.read()
, in particular, code will need to be modified in order to be profiled correctly.
At the risk of getting this wrong, there are actually (AFAIK) two APIs. The one you describe where we globally enable and collect. But there is also a second one, just on the query engine. TileDB-R calls the 'getter' if a toggle is on, and this functions independently of the global state. Mini example below the fold. And it seems to me that the PR hits the same accessor and should therefor work the same way. > uri <- "/tmp/tiledb/penguins"
> res <- tiledb_array(uri, return_as="data.table", query_statistics=TRUE)[] # ask for query stats
> stats <- attr(res, "query_statistics") # we return JSON string as attribute of the return object
> str( jsonlite::fromJSON(stats) )
List of 2
$ timers :List of 36
..$ Context.StorageManager.Query.Reader.unfilter_coord_tiles.sum : num 0.000301
..$ Context.StorageManager.Query.Reader.unfilter_coord_tiles.avg : num 1e-04
..$ Context.StorageManager.Query.Reader.unfilter_attr_tiles.sum : num 0.00264
..$ Context.StorageManager.Query.Reader.unfilter_attr_tiles.avg : num 0.00033
..$ Context.StorageManager.Query.Reader.tile_offset_sizes.sum : num 4.41e-06
..$ Context.StorageManager.Query.Reader.tile_offset_sizes.avg : num 4.41e-06
..$ Context.StorageManager.Query.Reader.resize_fixed_results_to_copy.sum: num 1.85e-06
..$ Context.StorageManager.Query.Reader.resize_fixed_results_to_copy.avg: num 1.85e-06
..$ Context.StorageManager.Query.Reader.read_tiles.sum : num 0.000289
..$ Context.StorageManager.Query.Reader.read_tiles.avg : num 7.23e-05
..$ Context.StorageManager.Query.Reader.read_coordinate_tiles.sum : num 0.000115
..$ Context.StorageManager.Query.Reader.read_coordinate_tiles.avg : num 5.74e-05
..$ Context.StorageManager.Query.Reader.read_attribute_tiles.sum : num 0.000183
..$ Context.StorageManager.Query.Reader.read_attribute_tiles.avg : num 9.17e-05
..$ Context.StorageManager.Query.Reader.read_and_unfilter_coords.sum : num 0.00045
..$ Context.StorageManager.Query.Reader.read_and_unfilter_coords.avg : num 0.00045
..$ Context.StorageManager.Query.Reader.read_and_unfilter_attributes.sum: num 0.00284
..$ Context.StorageManager.Query.Reader.read_and_unfilter_attributes.avg: num 0.00284
..$ Context.StorageManager.Query.Reader.process_tiles.sum : num 0.00323
..$ Context.StorageManager.Query.Reader.process_tiles.avg : num 0.00323
..$ Context.StorageManager.Query.Reader.load_tile_var_sizes.sum : num 3.21e-06
..$ Context.StorageManager.Query.Reader.load_tile_var_sizes.avg : num 3.21e-06
..$ Context.StorageManager.Query.Reader.load_tile_offsets.sum : num 0.000583
..$ Context.StorageManager.Query.Reader.load_tile_offsets.avg : num 0.000194
..$ Context.StorageManager.Query.Reader.load_initial_data.sum : num 0.000266
..$ Context.StorageManager.Query.Reader.load_initial_data.avg : num 0.000266
..$ Context.StorageManager.Query.Reader.dowork.sum : num 0.00469
..$ Context.StorageManager.Query.Reader.dowork.avg : num 0.00469
..$ Context.StorageManager.Query.Reader.create_result_tiles.sum : num 3.55e-06
..$ Context.StorageManager.Query.Reader.create_result_tiles.avg : num 3.55e-06
..$ Context.StorageManager.Query.Reader.copy_fixed_data_tiles.sum : num 0.000338
..$ Context.StorageManager.Query.Reader.copy_fixed_data_tiles.avg : num 3.76e-05
..$ Context.StorageManager.Query.Reader.compute_tile_bitmaps.sum : num 8.19e-05
..$ Context.StorageManager.Query.Reader.compute_tile_bitmaps.avg : num 8.19e-05
..$ Context.StorageManager.Query.Reader.apply_query_condition.sum : num 5.58e-07
..$ Context.StorageManager.Query.Reader.apply_query_condition.avg : num 5.58e-07
$ counters:List of 12
..$ Context.StorageManager.Query.Reader.result_num : int 344
..$ Context.StorageManager.Query.Reader.read_unfiltered_byte_num: int 16856
..$ Context.StorageManager.Query.Reader.num_tiles_read : int 9
..$ Context.StorageManager.Query.Reader.loop_num : int 1
..$ Context.StorageManager.Query.Reader.internal_loop_num : int 1
..$ Context.StorageManager.Query.Reader.ignored_tiles : int 0
..$ Context.StorageManager.Query.Reader.dim_num : int 1
..$ Context.StorageManager.Query.Reader.dim_fixed_num : int 1
..$ Context.StorageManager.Query.Reader.cell_num : int 4128
..$ Context.StorageManager.Query.Reader.attr_num : int 8
..$ Context.StorageManager.Query.Reader.attr_nullable_num : int 5
..$ Context.StorageManager.Query.Reader.attr_fixed_num : int 8
> |
Thanks @eddelbuettel We're agreed there are two APIs, one on the global state, and the other on the query Above, I enumerated three concerns with that status quo (a status quo which I don't much like) My articulated first choice/hope is that we can find a way to simply get all the stats through one interface My articulated second choice/hope is that we need to communicate to all users of this that there are two paths now, and furthermore, that |
I will hold off on making additional changes for now. |
Codecov Report
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #1927 +/- ##
=======================================
Coverage 89.76% 89.76%
=======================================
Files 34 34
Lines 3597 3597
=======================================
Hits 3229 3229
Misses 368 368
Flags with carried forward coverage won't be shown. Click here to find out more.
|
@nguyenv I think what we talked about in Slack will work, namely:
Also: this is a non-urgent PR. However, tiledb stats are basic to have, and important for supportability, so at some point we need to figure out a way to let people access query stats. |
I like option 3 because it's most similar to how the global stats work where first you enable stats and then get the stats in another call. |
Had the some instinctive reaction yesterday when that came up: one query statistic dump per read attempt during the iterations, resulting in one json string per read attempt. If the user is iterating over reads, the burden may be in the reader to fetch the individual strings which should work both R and Python. And just how |
@nguyenv new info on #1932 ... I think there's a straight-up bug here (I don't know where) and we should not be having to create a separate per-query stats accessor. Details on #1932 but TL;DR query stats do appear in global stats for the same data & same query in TileDB-Py and TileDB-SOMA-R, just not in TileDB-SOMA-Py ... |
Some observations: If we open a
If we open a
This also does not get us the query stats (where
|
Issue and/or context:
Requested by @johnkerl. Context: #1932.
Changes:
SOMAArray:;stats
which callstiledb::Query::stats
to get stats of query runSOMAArray.stats
Python binding