Concurrency issues investigation #25615

Open
hiltontj opened this issue Dec 4, 2024 · 10 comments

@hiltontj
Contributor

hiltontj commented Dec 4, 2024

Problem statement

The performance team has found that the influxdb3 process does not handle concurrency well. In #25604 we made a change very similar to what was done in IOx to move query planning off of the main IO thread pool, but this did not improve things and instead seems to have led to a regression (see #25562 (comment)).

This issue is for investigating why the system is breaking down under concurrent loads.

Ideas

Not configuring the main tokio runtime

Here we initialize the tokio runtime using the defaults:

let tokio_runtime = get_runtime(None)?;

This means that the tokio runtime is initialized with as many worker threads as there are CPU cores, so it could be competing with the DataFusion runtime for those cores as a result.
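
For comparison, here is a minimal sketch of an explicitly capped IO runtime using the standard tokio builder; the worker thread count and thread name are illustrative assumptions, not our actual configuration:

use tokio::runtime::{Builder, Runtime};

// Hypothetical sketch: cap the main IO runtime rather than taking the default
// of one worker thread per core, leaving the remaining cores to DataFusion.
fn build_io_runtime() -> std::io::Result<Runtime> {
    Builder::new_multi_thread()
        .worker_threads(2) // illustrative value, not tuned
        .thread_name("influxdb3-io")
        .enable_all()
        .build()
}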

Hard-coding the query concurrency semaphore to 10

Here we hard-code the limit used for the query concurrency semaphore:

concurrent_query_limit: 10,

This sets the semaphore limit here:

let query_execution_semaphore =
    Arc::new(semaphore_metrics.new_semaphore(concurrent_query_limit));

Which is supposedly used by the flight service here:

async fn acquire_semaphore(&self, span: Option<Span>) -> InstrumentedAsyncOwnedSemaphorePermit {
    Arc::clone(&self.query_execution_semaphore)
        .acquire_owned(span)
        .await
        .expect("Semaphore should not be closed by anyone")
}
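
For reference, the gating behavior boils down to the plain tokio semaphore pattern sketched below. This is only an illustration of why a limit of 10 caps concurrent queries (and why a query blocked on IO still holds a permit), not the actual instrumented implementation:

use std::sync::Arc;
use tokio::sync::Semaphore;

// Illustration only: with 10 permits, at most 10 queries are in flight at
// once; an 11th request parks here until an earlier query drops its permit,
// even if that query is merely waiting on object store IO.
async fn run_query_gated(semaphore: Arc<Semaphore>) {
    let _permit = Arc::clone(&semaphore)
        .acquire_owned()
        .await
        .expect("semaphore should not be closed");
    // ... plan and execute the query while holding the permit ...
} // permit dropped here, releasing a slot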

@hiltontj hiltontj added the v3 label Dec 4, 2024
@hiltontj hiltontj changed the title Concurrency investigation Concurrency issues investigation Dec 4, 2024
@pauldix
Member

pauldix commented Dec 4, 2024

Should we do away with the query concurrency semaphore? I worry that it unnecessarily slows things down when a query is waiting on IO (i.e., another query could be executing, but won't because of the semaphore).

There's definitely a point where we'll want to park incoming requests because of maxed-out resource utilization; I just doubt that a semaphore is the best primitive for this. I'd rather just remove it for now and address query throttling later.

@hiltontj
Contributor Author

hiltontj commented Dec 4, 2024

Should we do away with the query concurrency semaphore?

I think so. It looks like IOx does this in the querier by setting the limit to u16::MAX

@mgattozzi
Contributor

This might be relevant: I think we synced this change over from Pro, but we did enable IO on the dedicated executor because without it we had crashes; the default of not having it enabled comes from IOx. This was part of their split into separate executors. I wonder if it makes sense to have just one executor, in Monolith at least, to avoid contention.

https://github.com/influxdata/influxdb_pro/pull/174
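
For context, the split being described is roughly the two-runtime pattern sketched below (the names and thread split are assumptions for illustration, not the actual DedicatedExecutor API); the contention question is whether the two pools oversubscribe the same cores:

use std::thread::available_parallelism;
use tokio::runtime::{Builder, Runtime};

// Hypothetical sketch of a dedicated CPU pool for DataFusion work, separate
// from the main IO runtime; if both pools default to one thread per core,
// they contend for the same CPUs.
fn build_datafusion_runtime() -> std::io::Result<Runtime> {
    let cores = available_parallelism().map(|n| n.get()).unwrap_or(4);
    Builder::new_multi_thread()
        // Reserve a couple of cores for the IO runtime (illustrative split).
        .worker_threads(cores.saturating_sub(2).max(1))
        .thread_name("influxdb3-datafusion")
        .enable_all()
        .build()
}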

@pauldix
Member

pauldix commented Dec 4, 2024

Let's remove the semaphore completely. One less thing to confound the investigation.

@pauldix
Member

pauldix commented Dec 4, 2024

@mgattozzi it would be an interesting experiment to put everything on a single pool. Maybe we should give that a try too?

@hiltontj
Contributor Author

hiltontj commented Dec 4, 2024

Let's remove the semaphore completely. One less thing to confound the investigation.

We can try to remove it completely, but I'm not sure it can be, because it is relied on by core traits used in the Flight service. But I think setting it to u16::MAX should have the desired effect.
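
A minimal sketch of that workaround, reusing the new_semaphore call shown above (assuming the limit is just a permit count; the constant comes from the IOx maximum mentioned earlier):

// Keep the semaphore, since the Flight service traits expect one, but make
// the limit large enough that it effectively never throttles.
let concurrent_query_limit = usize::from(u16::MAX); // 65_535 permits
let query_execution_semaphore =
    Arc::new(semaphore_metrics.new_semaphore(concurrent_query_limit));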

@praveen-influx
Contributor

Can we run it through a profiler to see where it's spending the time (maybe on the box that the perf team is using)?

@hiltontj
Contributor Author

hiltontj commented Dec 6, 2024

It looks like IOx does this in the querier by setting the limit to u16::MAX

That was incorrect: IOx doesn't set u16::MAX as the limit; that is just the maximum limit that it imposes. IOx uses a default of 10; however, their scenario is different, given that the default applies per querier. For us, it may make sense to remove the limit by default but still allow it to be configurable. I will open an issue to deal with that specifically.

@hiltontj
Contributor Author

hiltontj commented Dec 6, 2024

I opened #25627

@MaduMitha-Ravi

@hiltontj fyi only - the concurrency issue is global across the product and not just for the Last Value Cache. The details I have observed are:

  • Even with a concurrent query load of 2, the CPU cores are utilized at 95-99% on a Large machine (2 cores and 8 GiB RAM) and an increase in latency is observed
  • To ensure this is not a limitation of the machine used, I tried the same experiment on a 2XLarge machine (8 cores and 32 GiB RAM) and still observed CPU usage of 95-99% with an increase in latency

As the next step, I am taking the latest build of Pro and retrying the concurrency experiments with Large, 2XL, and 8XL machine configurations.
