
Last value Cache - Increase in outliers beyond 16 threads for 100 series #25562

Open
MaduMitha-Ravi opened this issue Nov 18, 2024 · 28 comments

@MaduMitha-Ravi

Increase in Outliers beyond 16 thread concurrency (Last value cache)

  • Beyond 16 concurrent threads, we observe many more outliers, roughly 5x-10x the typical latency, which inflates the P95 numbers
  • CPU usage was below 20% and memory consumption was below 25%
  • This pattern suggests some internal restriction or limitation is producing the latency outliers

Could there be a wait happening on some internal resources?

Evidence

[screenshots: latency charts]

Note: We capture latency (the reported P95) by running background threads (12, 14, 16, etc.) and collecting metrics from just one of them. This shows how a particular user observes performance under concurrent load. That said, QPS could have been impacted by the observed outliers.
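The measurement approach described in the note can be sketched as a small harness. This is a minimal, std-only illustration (not the actual benchmark code): `issue_query` is a hypothetical placeholder for a real timed query against influxdb3, and `percentile` uses the nearest-rank method.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for one request; the real harness issues a query
// against influxdb3 and times the round trip.
fn issue_query() -> Duration {
    let start = Instant::now();
    thread::sleep(Duration::from_millis(1)); // placeholder for real work
    start.elapsed()
}

/// Nearest-rank p-th percentile of a non-empty set of latency samples.
fn percentile(samples: &mut [Duration], p: f64) -> Duration {
    samples.sort();
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1)]
}

fn main() {
    // e.g. 16 total threads: 15 generate background load, 1 is measured.
    let background = 15;
    let queries_per_thread = 20;

    let handles: Vec<_> = (0..background)
        .map(|_| {
            thread::spawn(move || {
                for _ in 0..queries_per_thread {
                    let _ = issue_query(); // load only; samples discarded
                }
            })
        })
        .collect();

    // Only the measured thread's samples feed the reported P95.
    let mut samples: Vec<Duration> = (0..queries_per_thread).map(|_| issue_query()).collect();
    println!("P95 over {} samples: {:?}", samples.len(), percentile(&mut samples, 95.0));

    for h in handles {
        h.join().unwrap();
    }
}
```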

@hiltontj
Contributor

Hey @MaduMitha-Ravi - I'm wondering if we have observed a similar breakdown in performance at higher thread counts when issuing regular queries, i.e., not against the last cache? I want to rule out that this is something systemic vs. specific to the last cache before digging into what might be wrong in the cache.

@MaduMitha-Ravi
Author

I will do some quick runs and update here. We can modify the issue based on the evidence.

@MaduMitha-Ravi
Author

@hiltontj Your suspicion is right. Outliers spike more as concurrency increases.
image

@pauldix
Member

pauldix commented Nov 19, 2024

@hiltontj We encountered a problem with concurrency in IOx before that required moving query planning off of the IO threadpool and onto the DF threadpool. The pr is influxdata/influxdb_iox#11029 which has pointers to related PRs and issues that are worth reading through.

Basically, we weren't able to take advantage of all the cores of a larger machine because we have two threadpools: one for tokio IO and one for DF query execution. Too much happening in the IO threadpool would cause IO stalls and make it so we couldn't effectively utilize all cores.

Might be the case again, but might not. Thought it was worth highlighting.
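The split Paul describes can be illustrated with a std-only sketch: requests arrive on an "IO" side, and CPU-heavy planning/execution is handed to a dedicated pool over a channel so it cannot stall the IO threads. The types and names here are illustrative, not the actual IOx internals.

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative types only -- not the actual IOx/influxdb3 internals.
struct PlanTask {
    sql: String,
    reply: mpsc::Sender<String>,
}

// Spawn a dedicated "execution pool" thread, run one request through it,
// and return the result to the submitting (IO-side) caller.
fn run_on_dedicated_pool(sql: &str) -> String {
    let (tx, rx) = mpsc::channel::<PlanTask>();

    // Dedicated pool: CPU-bound planning/execution runs here, so it cannot
    // stall the IO threads (the gist of moving planning onto the DF pool).
    let pool = thread::spawn(move || {
        for task in rx {
            let plan = format!("plan for `{}`", task.sql); // stand-in for DataFusion work
            let _ = task.reply.send(plan);
        }
    });

    // "IO side": hand the heavy work off and wait for the reply.
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(PlanTask { sql: sql.to_string(), reply: reply_tx }).unwrap();
    let result = reply_rx.recv().unwrap();

    drop(tx); // close the channel so the pool thread exits
    pool.join().unwrap();
    result
}

fn main() {
    let plan = run_on_dedicated_pool("SELECT * FROM last_cache('senec_data', 'ntag1-lv-1')");
    println!("{plan}");
}
```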

@hiltontj
Contributor

Thanks for confirming @MaduMitha-Ravi, and for the pointer @pauldix - @MaduMitha-Ravi, is this a major blocker? If so, I can start looking into it; otherwise, I will dive into this next week once I am through with #25539

@MaduMitha-Ravi
Author

@hiltontj Not a blocker, just a concern. We can take it up next week.

@MaduMitha-Ravi
Author

Working on the re-runs with the latest build (with fix), will update once I am done.

@MaduMitha-Ravi
Author

Results after the merge of the DF threadpool change.

  • Latencies have spiked compared to the previous experiment
  • CPU usage has increased significantly, reaching 80% at a concurrency of 4 (previously under 25%)
  • Latency now varies noticeably across runs
[screenshots: latency and resource usage charts]

Note:

  • Re-runs are in progress to be sure of the results
  • This performance impact could be an effect of the DF thread pool change or of some other change in the software.

@hiltontj
Contributor

hiltontj commented Dec 4, 2024

Thanks for the update @MaduMitha-Ravi - wasn't expecting that, but clearly this warrants more investigation. I will open a separate issue to write out a plan for investigating this further.

Can you provide the command line arguments that are being used to run the influxdb3 serve process?

@MaduMitha-Ravi
Author

Trace Analysis of 4 concurrent threads,

| Measurement | Compare Item | Query Name | execute_stream_partitioned (ms) | sql_to_logical_plan (ms) | get_namespace (ms) | flight_planner (ms) | sql_to_physical_plan (ms) | query_planning (ms) | acquire (ms) | permit (ms) | create_physical_plan (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| P95 | oss | lastcachevalue_10_100s | 1.363 | 1.081 | 0.037 | 6.435 | 4.785 | 9.824 | 0.007 | 1.956 | 3.968 |
| P95 | oss | lastcachevalue_1_100s | 2.183 | 1.513 | 0.068 | 7.489 | 4.997 | 11.193 | 0.008 | 2.088 | 4.109 |
| P95 | pro | lastcachevalue_10_100s | 1.322 | 1.130 | 0.084 | 6.921 | 4.981 | 9.624 | 0.006 | 1.655 | 4.369 |
| P95 | pro | lastcachevalue_1_100s | 1.616 | 1.332 | 0.123 | 6.748 | 4.501 | 8.660 | 0.008 | 1.602 | 3.972 |

Note: These are just the available trace spans; a few are missing.

@hiltontj Attaching two traces from similar queries; one took 7 ms and the other 70 ms. The query is a last value cache query with a count of 1 over 100 series.

trace_7ms.json
trace_70ms.json

@hiltontj
Contributor

hiltontj commented Dec 4, 2024

@MaduMitha-Ravi are the 7ms and 70ms traces for the same SQL query, but different samples? or are they from different SQL queries?

@MaduMitha-Ravi
Author

Same query, but the tag value (device_id) differs.
SELECT * FROM last_cache('senec_data', 'ntag1-lv-1') where device_id in (....);

@MaduMitha-Ravi
Author

@hiltontj Below are traces from similar queries that differ only in the tag values in the where clause.

SELECT * FROM last_cache('senec_data', 'ntag1-lv-1') where device_id in (....);

  • No concurrency - 14 series (query took 8.9 ms)
  • Concurrency 4 - 17 series (query took 22.54 ms)
  • Concurrency 48 - 18 series (query took 598.72 ms)

c4-oss-multipleseries-count1.json
c48-oss-multipleseries-count1.json
noconcurrency-oss-multipleseries-count1.json

Note: A small correction - the data shared yesterday is not for exactly 100 series, just multiple series. I am re-doing the runs with exactly 100 series, though that does not change the regression.

@hiltontj
Contributor

hiltontj commented Dec 4, 2024

Thanks @MaduMitha-Ravi - so, to clarify:

For No concurrency - 14 series, that had an in (...) clause that contained 14 device ID values?

For Concurrency 4 - 17 series, was that four concurrent queries, each with an in (...) clause containing a different set of 17 device IDs?

@MaduMitha-Ravi
Author

yes, that is right.

@MaduMitha-Ravi
Author

MaduMitha-Ravi commented Dec 5, 2024

@hiltontj Results with exactly 100 series in the where clause for the last value cache.

Query:
SELECT * FROM last_cache('senec_data', 'ntag1-lv-1') where device_id in (....); where the IN list has 100 values.

Query latency and corresponding QPS - Regression is observed even with no concurrency:
image
image

Resource Usage:
image

Traces for comparison of latency in components:
No concurrency - 19.75 ms
Concurrency of 4 - 34.7 ms
Concurrency of 32 - 399.4 ms

c4-oss-100series-count1.json
c32-oss-100series-count1.json
noconcurrency-oss-100series-count1.json

Configuration details
Machine: m6gd.large, 2 cores and 8 GiB RAM
Serve

serve \
            --host-id {{.Machine.Name}} \
            --bucket {{.Args.BucketName}} \
            {{if .Cluster.Monolith.Spec.ProVersion}} \
                --mode {{.Args.Mode}} \
                {{if and .Args.ReplicatingHosts ( not .Args.IsCompactor ) }} --replicas {{.Args.ReplicatingHosts}} {{end}} \
                {{if .Args.RunCompactions }} --run-compactions {{end}} \
                {{if .Args.CompactionId}} --compactor-id {{.Args.CompactionId}} {{end}} \
                {{if .Args.IsCompactor}} --compaction-hosts {{.Args.ReplicatingHosts}} {{end}} \
            {{end}} \
            {{if eq .Args.Provider "aws"}} \
                --object-store s3 \
                --aws-default-region {{.Cluster.Spec.Region}} \
                --aws-access-key-id "{{.Args.AwsAccessKeyId}}" \
                --aws-secret-access-key "{{.Args.AwsSecretAcessKey}}"
            {{else if eq .Args.Provider "gcp"}} \
                --object-store google
            {{end}}

@hiltontj
Contributor

hiltontj commented Dec 6, 2024

Thanks @MaduMitha-Ravi - I have opened #25615 where we are investigating ways of fixing the concurrency issues. It looks like it is a systemic issue still, so hopefully once we sort that out the last cache will go back to performing well. I'll notify you on any PRs

@MaduMitha-Ravi
Author

Latest Last value cache results on PRO - Dec 12th

Large machine configuration

[screenshots]

8XL machine configuration

To investigate: @hiltontj

  • Why are we limited to 3 of 32 cores (about 10% CPU usage)?
  • Variability increases with concurrency; could CPU restrictions be the reason?
[screenshots]

At concurrent query load of 72 threads,
image

At concurrent query load of 16 threads,
image

At concurrent query load of 4 threads,
image

@hiltontj
Contributor

Why are we limited to 3 of 32 cores (about 10% CPU usage)?

That is odd. The last cache doesn't do any partitioning, so individual queries cannot leverage multiple cores; however, with many concurrent queries I would expect the runtime to fan the work out.

Variability increases with concurrency; could CPU restrictions be the reason?

Really hard to say why. I think we need to take a close look at how we are setting up the tokio runtimes and experiment with some different configurations. I don't think we have it right.

@MaduMitha-Ravi
Author

@hiltontj If you do have any configurations in mind (tokio, IO threads etc.), I can try them and share the numbers.

@hiltontj
Contributor

@MaduMitha-Ravi - the only control available right now is to specify the number of threads for the DataFusion runtime, via:

      --datafusion-num-threads <datafusion_runtime_num_threads>
          Set the maximum number of Datafusion runtime threads to use.

          Defaults to the number of logical cores on the system.

          [env: INFLUXDB_IOX_DATAFUSION_NUM_THREADS=]

It defaults to the number of logical cores on the system, so the tests on these higher-core machines should be leveraging those cores.

The server also runs a separate IO runtime, which also spins up as many threads as there are CPUs; however, there is no config to control that at the moment. I plan to discuss this with the team in our sync meeting today, and hopefully we can land on a path forward and get something out for y'all to test against.
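Putting that together, a run that pins the DataFusion runtime's thread count might look like the following. This is a hypothetical sketch assembled from flags already shown in this thread; the host ID, object store, and thread count are arbitrary example values.

```shell
# Pin the DataFusion runtime to 8 threads via the flag:
influxdb3 serve \
  --host-id perf-test-host \
  --object-store s3 \
  --datafusion-num-threads 8

# ...or equivalently via the environment variable from the help text above:
INFLUXDB_IOX_DATAFUSION_NUM_THREADS=8 influxdb3 serve --host-id perf-test-host --object-store s3
```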

@MaduMitha-Ravi
Author

Latest results of Last value cache with concurrency - Jan 7th

Key Observations

  • The graph shows there is headroom in CPU and memory, but it is not utilized beyond 32 threads
  • Latency increases and QPS stagnates beyond 32 threads
image

The graphs below show (1) the increase in latency and (2) the increase in query variability for the same queries under higher concurrent loads.

Latencies at Concurrency 16:

image

Latencies at Concurrency 32:
image
Without the outliers (less than 200ms),
image

Latencies at Concurrency 48:
image
Without the outliers (less than 800ms),
image

@pauldix
Member

pauldix commented Jan 8, 2025

It would be interesting to see this overlaid with the queries per second as well. Showing the throughput in addition to the latencies is useful.

@barbaranelson

@MaduMitha-Ravi is this still an issue?

@MaduMitha-Ravi
Author

Yes, the concern still exists; below are the latest numbers.

  • The last value cache concurrency runs were executed across two datasets (17 rows/sec and 1k rows/sec) on 4ENT, where each node is a 4XLarge
  • QPS hits a bottleneck even though resource usage under concurrency stays below 30%
  • Variability increases for the multi-series last value cache as concurrency rises; we do not have spans that would explain the variability
  • A few RPC/query errors were observed as concurrency increased

query error: rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: read tcp 172.31.23.234:32832->54.241.101.49:8086: use of closed network connection"

Query performance and QPS graphs,
Image

Image

The graphs below show the resource usage:

Image

60K dataset on Single and Multi-series Last value cache with Concurrency as 16 threads:
Image

60K dataset on Single and Multi-series Last value cache with Concurrency as 1 thread:

Image

@hiltontj
Contributor

@MaduMitha-Ravi - I opened #26077 to add more tracing spans to the last cache query execution.

As for how we improve the above, I think having those spans could inform that decision.

@MaduMitha-Ravi
Author

Thanks @hiltontj, I would like to keep this open until then.

@hiltontj
Contributor

I would like to keep this open until then.

Agreed. I suspect some sort of partitioning/parallelization will be needed to make the LVC go faster for larger series sets; more specifically, the spans I described in that issue can show how much of the time is spent on evaluating the IN (...) predicate vs. scanning the cache leaf nodes.
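As a rough illustration of the kind of partitioning that could help, here is a hypothetical std-only sketch that chunks cached series keys across scoped threads to evaluate an IN (...)-style membership predicate in parallel. None of these types correspond to the actual LVC data structures; `filter_parallel` and the device-key names are invented for the example.

```rust
use std::collections::HashSet;
use std::thread;

// Illustrative sketch: evaluate an `IN (...)` membership predicate over
// cached series keys in parallel by chunking the entries across threads.
fn filter_parallel(entries: &[String], wanted: &HashSet<String>, workers: usize) -> Vec<String> {
    // Split the entries into roughly equal chunks, one per worker.
    let chunk = ((entries.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = entries
            .chunks(chunk)
            .map(|part| {
                // Each scoped thread filters its own slice of the cache keys.
                s.spawn(move || {
                    part.iter()
                        .filter(|k| wanted.contains(*k))
                        .cloned()
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        // Joining in spawn order preserves the original entry order.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let entries: Vec<String> = (0..8).map(|i| format!("device-{i}")).collect();
    let wanted: HashSet<String> = ["device-2", "device-5"].iter().map(|s| s.to_string()).collect();
    println!("{:?}", filter_parallel(&entries, &wanted, 4)); // prints ["device-2", "device-5"]
}
```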
