[Remote Store] Add segment transfer timeout dynamic setting #13679

Merged
merged 12 commits into from
May 23, 2024

@linuxpi linuxpi commented May 15, 2024

Description

  • Segment uploads for Remote Store run without any timeout, so the thread can block on latch.await() forever if an error is silently swallowed by the code.
  • Because the future never completes, the shard stays stuck and can never retry the upload.
  • As a result, the remote time lag keeps increasing and does not come down until the process is restarted.
  • This PR adds a timeout to segment uploads so that the shard recovers automatically from such situations once the timeout elapses.
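
For illustration, the difference between an unbounded and a bounded wait can be sketched as follows. This is a minimal, hypothetical example and not the code changed in this PR: the class name, the method names, and the 200 ms default are assumptions, and an AtomicLong stands in for a dynamically updatable setting.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; not the code changed in this PR.
class SegmentUploadTimeoutSketch {

    // Stand-in for a dynamically updatable timeout setting (name and default are assumptions).
    private static final AtomicLong TIMEOUT_MILLIS = new AtomicLong(200);

    // Simulates a dynamic settings update; later waits pick up the new value.
    static void setTimeoutMillis(long millis) {
        TIMEOUT_MILLIS.set(millis);
    }

    // Waits for the upload latch, but only up to the configured timeout.
    // Returns true if the latch reached zero, false if the wait timed out
    // or the thread was interrupted, so the caller can fail the attempt and retry.
    static boolean awaitUpload(CountDownLatch latch) {
        try {
            // Unlike latch.await() with no arguments, this call cannot block forever.
            return latch.await(TIMEOUT_MILLIS.get(), TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }
}
```

With a latch that is never counted down (the swallowed-error case described above), awaitUpload returns false after the timeout instead of holding the thread indefinitely, which gives the caller a chance to mark the upload as failed and retry.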

Related Issues

Resolves #13783

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

❌ Gradle check result for 8f59a8d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for 85416eb: FAILURE

❌ Gradle check result for a4ac198: FAILURE

❌ Gradle check result for 62e73e5: FAILURE

❌ Gradle check result for cbca4ee: FAILURE

Signed-off-by: Varun Bansal <[email protected]>

✅ Gradle check result for a6417b4: SUCCESS

codecov bot commented May 21, 2024

Codecov Report

Attention: Patch coverage is 86.66667%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 71.58%. Comparing base (b15cb0c) to head (251ecd0).
Report is 299 commits behind head on main.

Files                                                  Patch %   Lines
...va/org/opensearch/indices/RemoteStoreSettings.java  75.00%    2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13679      +/-   ##
============================================
+ Coverage     71.42%   71.58%   +0.16%     
- Complexity    59978    61255    +1277     
============================================
  Files          4985     5063      +78     
  Lines        282275   287917    +5642     
  Branches      40946    41691     +745     
============================================
+ Hits         201603   206108    +4505     
- Misses        63999    64770     +771     
- Partials      16673    17039     +366     

@skumawat2025 skumawat2025 left a comment

LGTM

@sachinpkale

Please add changelog entry, fix the gradle build and complete the check-list in PR description.

Signed-off-by: Varun Bansal <[email protected]>

linuxpi commented May 22, 2024

Please add changelog entry, fix the gradle build and complete the check-list in PR description.

Thanks @sachinpkale for approving. I've added the necessary details and addressed your comment.

@linuxpi linuxpi self-assigned this May 22, 2024
@github-actions github-actions bot added the enhancement, Storage, Storage:Resiliency, and v2.15.0 labels on May 22, 2024
@linuxpi linuxpi added the backport 2.x label and removed the enhancement and v2.15.0 labels on May 22, 2024

❕ Gradle check result for 251ecd0: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@sachinpkale sachinpkale merged commit b3049fb into opensearch-project:main May 23, 2024
32 of 35 checks passed
@opensearch-trigger-bot

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-13679-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 b3049fb5ec860f63fdfe33c5b176869b1e4255e6
# Push it to GitHub
git push --set-upstream origin backport/backport-13679-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-13679-to-2.x.

@linuxpi linuxpi deleted the segment-upload-timeout branch May 23, 2024 07:29
linuxpi added a commit to linuxpi/OpenSearch that referenced this pull request May 23, 2024
…ch-project#13679)

* [Remote Store] Add segment transfer timeout dynamic setting

Signed-off-by: Varun Bansal <[email protected]>
(cherry picked from commit b3049fb)
linuxpi added a commit to linuxpi/OpenSearch that referenced this pull request May 24, 2024
…ch-project#13679)

* [Remote Store] Add segment transfer timeout dynamic setting

Signed-off-by: Varun Bansal <[email protected]>
(cherry picked from commit b3049fb)
sachinpkale pushed a commit that referenced this pull request May 24, 2024
…13793)

* [Remote Store] Add segment transfer timeout dynamic setting

Signed-off-by: Varun Bansal <[email protected]>
(cherry picked from commit b3049fb)
Labels
backport 2.x, backport-failed, Storage:Resiliency, Storage
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

[Remote Store] Add support to timeout segment uploads
4 participants