Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-17160: Time based tracking of core admin requests with Caffeine cache #2304

Merged
merged 16 commits into from
Jul 15, 2024

Conversation

psalagnac
Copy link
Contributor

https://issues.apache.org/jira/browse/SOLR-17160

This is a simpler version of #2271, that now uses Caffeine cache instead of manually dropping old requests.

Description

Core admin completed/failed requests are tracked in-memory, but we only track a maximum of 100 requests by dropping the oldest ones. If for any reason we drop a request before the client (an other Solr node) fetched the status, we will get unexpected errors at the top level command.

Solution

Instead of the hard limit of 100 tracked requests, we track all requests in a Caffeine cache, configured to drop old requests.
This is a trade-off to not keep memory forever and to be sure we won't lose status in the case of many incoming requests.

The change adds two different timeouts:

  • an hour for every requests (overridable with property "solr.admin.requests.running.timeout.minutes")
  • 5 minutes for completed/failed requests once the status is already retrieved by client (overridable with property "solr.admin.requests.completed.timeout.minutes")

Tests

Added a unit test in CoreAdminHandlerTest to check expired requests are not tracked in-memory anymore.

Also checks we don't have any issue for collection backup with very high numbers of shards.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

Comment on lines +65 to +66
requestStatusResponse.response = taskObject.getRspObject();
requestStatusResponse.response = taskObject.getOperationRspObject();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused; you are setting the response twice?

Copy link
Contributor Author

@psalagnac psalagnac Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good catch!
Something is definitely wrong here, but this is existing code. I did not notice that.
I think this deserves a a fix by itself. Values for msg and response in the V2 API response are not consistent depending on the status. And it seems to me they're not consistent with V1 either.
I think this was introduced with #2144.

Maybe I should put this PR on hold and fix this first?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah; didn't know it was existing code. Do in whichever order you want; it doesn't really matter.

@@ -528,4 +534,39 @@ public void testNonexistentCoreReload() throws Exception {
e.getMessage());
admin.close();
}

@Test
public void testTrackedRequestExpiration() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wonderful test

Copy link

github-actions bot commented May 8, 2024

This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the [email protected] mailing list. Thank you for your contribution!

@github-actions github-actions bot added the stale PR not updated in 60 days label May 8, 2024
@github-actions github-actions bot removed the stale PR not updated in 60 days label Jun 12, 2024
@@ -436,7 +436,8 @@ public static class CoreAdminAsyncTracker {
* we're not supposed to hit it. This is just a protection to grow in memory too much when
* receiving an abusive number of admin requests.
*/
private static final int MAX_TRACKED_REQUESTS = 10_000;
private static final int MAX_TRACKED_REQUESTS =
EnvUtils.getPropertyAsInteger("solr.admin.requests.running.max", 10_000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is only for "async" requests, maybe the name should incorporate this limit like 'solr.admin.async.max". ("requests" and "running" being kind of implicit maybe)?

@dsmiley
Copy link
Contributor

dsmiley commented Jul 9, 2024

WDYT of this proposed CHANGES.txt under Improvements:

SOLR-17160: Core Admin "async" request status tracking is no longer capped at 100; it's 10k. Statuses are now removed 5 minutes after the read of a completed/failed status. (Pierre Salagnac)

@dsmiley
Copy link
Contributor

dsmiley commented Jul 9, 2024

Perhaps an additional sentence to highlight that this helps collection backup/restore and other operations scale to 100+ shards.

@dsmiley
Copy link
Contributor

dsmiley commented Jul 12, 2024

I handled CHANGES.txt myself. I'll merge this sometime this weekend. Looking forward to seeing this one done!

@dsmiley dsmiley merged commit d3b4c2e into apache:main Jul 15, 2024
3 of 4 checks passed
dsmiley pushed a commit that referenced this pull request Jul 16, 2024
)

Core Admin "async" request status tracking is no longer capped at 100; it's 10k.
Statuses are now removed 5 minutes after the read of a completed/failed status.
Helps collection async backup/restore and other operations scale to 100+ shards.

Co-authored-by: David Smiley <[email protected]>
(cherry picked from commit d3b4c2e)
@psalagnac psalagnac deleted the max-requests-caffeine branch July 22, 2024 08:41
@psalagnac
Copy link
Contributor Author

I handled CHANGES.txt myself. I'll merge this sometime this weekend. Looking forward to seeing this one done!

Thanks for handling this @dsmiley. I was off for some time and unable to do any change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants