Last value Cache - Increase in outliers beyond 16 threads for 100 series #25562
Comments
Hey @MaduMitha-Ravi - I'm wondering if we have observed a similar breakdown in performance at higher thread counts when issuing regular queries, i.e., not to the last cache? I want to rule out that this is related to something systemic vs. the last cache specifically before digging into what might be wrong in the cache.
I will do some quick runs and update here. We can modify the issue based on evidence.
@hiltontj Your suspicion is right. More outliers spike with the increase in concurrency.
@hiltontj We encountered a problem with concurrency in IOx before that required moving query planning off of the IO threadpool and onto the DF threadpool. The PR is influxdata/influxdb_iox#11029, which has pointers to related PRs and issues that are worth reading through. Basically, we weren't able to take advantage of all the cores of a larger machine because we have two threadpools: one for tokio IO and one for DF query execution. Too much happening in the IO threadpool would cause IO stalls and make it so we couldn't effectively utilize all cores. Might be the case again, but might not. Thought it was worth highlighting.
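For context, here is a minimal sketch of the dedicated-executor pattern that PR describes - this is not the actual IOx code, and the names and structure are assumptions: CPU-heavy DataFusion work runs on its own tokio runtime so it cannot starve the IO runtime that serves network traffic.

```rust
// Hypothetical sketch of a "dedicated executor": a second tokio runtime,
// running on its own OS threads, that CPU-heavy query planning/execution is
// handed off to so the IO runtime's workers stay free to serve requests.
use std::thread;
use tokio::runtime::Builder;
use tokio::sync::oneshot;

#[derive(Clone)]
struct DedicatedExecutor {
    handle: tokio::runtime::Handle,
}

impl DedicatedExecutor {
    /// Spin up a separate multi-threaded runtime and hold onto its handle.
    fn new(worker_threads: usize) -> Self {
        let (tx, rx) = std::sync::mpsc::channel();
        thread::spawn(move || {
            let rt = Builder::new_multi_thread()
                .worker_threads(worker_threads)
                .thread_name("df-executor")
                .enable_all()
                .build()
                .expect("failed to build executor runtime");
            tx.send(rt.handle().clone()).unwrap();
            // Keep the runtime alive for the life of the process.
            rt.block_on(std::future::pending::<()>());
        });
        Self { handle: rx.recv().unwrap() }
    }

    /// Run a future on the dedicated runtime and await its result from the
    /// IO runtime without tying up an IO worker thread.
    async fn spawn_on_executor<F, T>(&self, fut: F) -> T
    where
        F: std::future::Future<Output = T> + Send + 'static,
        T: Send + 'static,
    {
        let (tx, rx) = oneshot::channel();
        self.handle.spawn(async move {
            let _ = tx.send(fut.await);
        });
        rx.await.expect("dedicated executor task panicked")
    }
}
```

The real implementation is more involved (shutdown, panic handling), but the key point matches the comment above: two runtimes, with anything CPU-bound belonging on the second one.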
Thanks for confirming @MaduMitha-Ravi and for the pointer @pauldix - @MaduMitha-Ravi, is this a major blocker? If so, I can start looking into it; otherwise, I will dive into this next week once I am through with #25539
@hiltontj Not a blocker, just a concern. We can take it up next week.
Working on the re-runs with the latest build (with fix); will update once I am done.
Thanks for the update @MaduMitha-Ravi - wasn't expecting that, but clearly this warrants more investigation. I will open a separate issue to write out a plan for investigating this further. Can you provide the command line arguments that are being used to run the
Trace Analysis of 4 concurrent threads:
Note: this is just the available trace spans; it is missing a few. @hiltontj Attaching two traces from a similar query, one took 7ms and the other 70ms. The query is a last value cache query with count of 1 for 100 series.
@MaduMitha-Ravi are the 7ms and 70ms traces for the same SQL query, but different samples? Or are they from different SQL queries?
Same query, but the tag value (device_id) differs.
@hiltontj Below are traces from similar queries that differ only by the tag value in the WHERE clause.
c4-oss-multipleseries-count1.json
Note: a small correction - the data shared yesterday is not for 100 series, just multiple series. I am re-doing the runs with exactly 100 series, though it does not change the regression.
Thanks @MaduMitha-Ravi - so, to clarify:
For No concurrency - 14 series, that had an
For Concurrency 4 - 17 series, that was four concurrent queries, each with an
Yes, that is right.
@hiltontj Results with exactly 100 series in the WHERE clause for the last value cache.
Query:
Query latency and corresponding QPS - regression is observed even with no concurrency:
Traces for comparison of latency in components: c4-oss-100series-count1.json
Configuration details
Thanks @MaduMitha-Ravi - I have opened #25615, where we are investigating ways of fixing the concurrency issues. It looks like it is still a systemic issue, so hopefully once we sort that out the last cache will go back to performing well. I'll notify you on any PRs.
Latest Last value cache results on PRO - Dec 12th
Large machine configuration
[charts omitted]
8XL machine configuration
To investigate: @hiltontj
[charts omitted]
At concurrent query load of 72 threads,
That is odd. The last cache doesn't do any partitioning, so individual queries cannot leverage multiple cores; however, for many concurrent queries I would think the runtime would be fanning things out.
Really hard to say why. I think we need to take a close look at how we are setting up the tokio runtimes and experiment with some different configurations. I don't think we have it right.
@hiltontj If you do have any configurations in mind (tokio, IO threads, etc.), I can try them and share the numbers.
@MaduMitha-Ravi - the only control available right now is to specify the number of threads for the DataFusion runtime, via:
Which defaults to the number of cores on the system, so with these higher-core machine tests it should be leveraging those cores. The server also runs a separate IO runtime, which also spins up as many threads as there are CPUs; however, there is no config to control that at the moment. I plan to discuss this with the team in our sync meeting today, and hopefully we can land on a path forward and get something out for y'all to test against.
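As an illustration only (assumed function name and option handling, not the actual influxdb3 configuration code), the "defaults to the number of cores" behaviour described above amounts to something like:

```rust
// Sketch: build the DataFusion executor runtime with a configurable thread
// count, falling back to the number of logical cores when unset.
use tokio::runtime::{Builder, Runtime};

fn build_datafusion_runtime(configured_threads: Option<usize>) -> std::io::Result<Runtime> {
    let threads = configured_threads.unwrap_or_else(|| {
        std::thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1)
    });
    Builder::new_multi_thread()
        .worker_threads(threads)
        .thread_name("datafusion")
        .enable_all()
        .build()
}
```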
It would be interesting to see this overlaid with the queries per second as well. Showing the throughput in addition to the latencies is useful.
@MaduMitha-Ravi is this still an issue?
@MaduMitha-Ravi - I opened #26077 to add more tracing spans to the last cache query execution. As for how we improve the above, I think having those spans could inform that decision.
Thanks @hiltontj, I would like to keep this open until then.
Agreed. I suspect some sort of partitioning/parallelization will be needed to make the LVC go faster for larger series sets, but more specifically, the spans I described in that issue can inform how much of the time is spent on evaluating the
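For a rough idea of what such a span could look like (the tracing crate usage and the type/function names here are assumptions, not the actual influxdb3 instrumentation from #26077), wrapping the predicate-evaluation step might be as simple as:

```rust
// Hypothetical instrumentation: time how long the last value cache spends
// evaluating WHERE-clause predicates for a query, so traces can show whether
// that step dominates for large series sets.
use tracing::info_span;

#[derive(Clone)]
struct Row {
    device_id: String,
}

struct Predicate {
    device_id: String,
}

impl Predicate {
    fn matches(&self, row: &Row) -> bool {
        row.device_id == self.device_id
    }
}

fn evaluate_predicates(rows: &[Row], predicates: &[Predicate]) -> Vec<Row> {
    // The span records how many predicates were evaluated; its duration shows
    // up in the trace alongside the rest of the query execution. The predicate
    // list is treated as an IN-style match over tag values.
    let span = info_span!("lvc_predicate_eval", n_predicates = predicates.len() as u64);
    let _guard = span.enter();
    rows.iter()
        .filter(|row| predicates.iter().any(|p| p.matches(row)))
        .cloned()
        .collect()
}
```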
Increase in Outliers beyond 16 thread concurrency (Last value cache)
Could there be a wait happening on some internal resources?
Evidence
Note: We capture latency (P95 reported) by running background threads (12, 14, 16, etc.) and collecting metrics from just one of them. This shows how a particular user observes performance under concurrent load. That said, QPS could have been impacted by the observed outliers.
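For clarity, a rough sketch of that measurement approach is below (the `run_query` helper is a placeholder, not the actual benchmark harness): N background threads generate concurrent load, while the reported P95 is computed from a single thread's samples.

```rust
// Sketch: concurrent load from background threads, with P95 latency reported
// from only one thread, modelling what a single user observes under load.
// Assumes `samples` > 0.
use std::thread;
use std::time::{Duration, Instant};

fn run_query() -> Duration {
    // Placeholder for issuing one last value cache query and timing it.
    let start = Instant::now();
    thread::sleep(Duration::from_millis(5));
    start.elapsed()
}

fn p95_under_concurrency(background_threads: usize, samples: usize) -> Duration {
    // Background threads exist only to generate concurrent query traffic.
    let load: Vec<_> = (0..background_threads)
        .map(|_| {
            thread::spawn(move || {
                for _ in 0..samples {
                    run_query();
                }
            })
        })
        .collect();

    // The measured thread: only its latencies are reported.
    let mut latencies: Vec<Duration> = (0..samples).map(|_| run_query()).collect();
    latencies.sort();
    let idx = ((samples as f64) * 0.95) as usize;
    let p95 = latencies[idx.min(samples.saturating_sub(1))];

    for handle in load {
        handle.join().unwrap();
    }
    p95
}
```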