Lightning: Make sure we are using default block size of 16KB if user does not specify one. #60097

Merged
merged 11 commits into from
Mar 20, 2025

OliverS929
Contributor

@OliverS929 OliverS929 commented Mar 16, 2025

What problem does this PR solve?

Issue Number: close #59947

Problem Summary:
Make sure we are using a sufficiently large default block size. Ref #49514

What changed and how does it work?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

    To test the specific issue addressed in this PR, I used a ~1.4TB dataset consisting mostly of duplicate data. Before the fix, memory usage spiked during the ingest phase due to the large index metadata loaded by Pebble, causing OOM kills on a 16c64g VM. With the fix, memory consumption remained stable, staying below 17GB and leading to no disastrous memory spikes.

  • No need to test
    • I checked and no code files have been changed.
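
A rough way to relate the manual test above to block size: an SSTable index holds roughly one entry per data block, so a smaller block size means more index entries for the same volume of data. The sketch below is a back-of-envelope estimate only. `estimateDataBlocks` is a hypothetical helper that is not part of this PR, and it ignores compression, key sizes, and how Pebble actually lays out or caches its index.

```go
package main

import "fmt"

// estimateDataBlocks is a hypothetical helper (not from the PR). It returns
// the number of fixed-size blocks needed to hold dataBytes, rounding up.
// A non-positive blockSize yields 0 to avoid dividing by zero.
func estimateDataBlocks(dataBytes, blockSize int64) int64 {
	if blockSize <= 0 {
		return 0
	}
	return (dataBytes + blockSize - 1) / blockSize
}

func main() {
	const oneTiB = int64(1) << 40
	// Compare Pebble's 4KB default with the 16KB default chosen in this PR.
	for _, bs := range []int64{4 << 10, 16 << 10} {
		fmt.Printf("block size %5d B -> ~%d data blocks per TiB\n",
			bs, estimateDataBlocks(oneTiB, bs))
	}
}
```

Under these assumptions, moving from 4KB to 16KB blocks cuts the estimated block count, and with it the number of index entries, by a factor of four.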

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/needs-triage-completed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 16, 2025

tiprow bot commented Mar 16, 2025

Hi @OliverS929. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

@OliverS929
Contributor Author

/ok-to-test

@ti-chi-bot ti-chi-bot bot added the ok-to-test Indicates a PR is ready to be tested. label Mar 16, 2025
@OliverS929 OliverS929 changed the title from "Lightning: Make sure we are using a block size that is larger than 16KB." to "Lightning: Make sure we are using a block size that is larger than 16KB by default." Mar 16, 2025
@Benjamin2037
Collaborator

Benjamin2037 commented Mar 16, 2025

Please make sure to add enough test cases to avoid regression later.

codecov bot commented Mar 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.9514%. Comparing base (77f118f) to head (b519cb8).
Report is 20 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #60097        +/-   ##
================================================
+ Coverage   73.1493%   73.9514%   +0.8020%     
================================================
  Files          1706       1738        +32     
  Lines        471415     483520     +12105     
================================================
+ Hits         344837     357570     +12733     
+ Misses       105415     104036      -1379     
- Partials      21163      21914       +751     
Flag Coverage Δ
integration 45.8637% <57.1428%> (?)
unit 72.5181% <100.0000%> (-0.0793%) ⬇️

Components Coverage Δ
dumpling 52.6910% <ø> (∅)
parser ∅ <ø> (∅)
br 46.9612% <ø> (-1.1339%) ⬇️

@ti-chi-bot ti-chi-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 17, 2025
@OliverS929 OliverS929 changed the title from "Lightning: Make sure we are using a block size that is larger than 16KB by default." to "Lightning: Make sure we are using default block size of 16KB if user does not specify one." Mar 17, 2025
@ti-chi-bot ti-chi-bot bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Mar 17, 2025
@OliverS929
Contributor Author

/retest

@@ -59,6 +59,9 @@ var (
normalIterStartKey = []byte{1}
)

// DefaultBlockSize ensures we are using a block size larger than 16KB, whereas 4KB is the default block size of Pebble.
Collaborator

Can we also explain why Pebble's 4KB default is not a good choice, and what problem it causes?

Contributor

@D3Hunter D3Hunter left a comment

Can you write up the details of the manual test?

@@ -1423,6 +1426,12 @@ func newSSTWriter(path string, blockSize int) (*sstable.Writer, error) {
	if err != nil {
		return nil, errors.Trace(err)
	}

	// Logic to check the block size we are using is 16KB by default.
	if blockSize <= 0 {
Collaborator

Should we also check whether there is still a risk of OOM even when this block size is set?

@OliverS929
Contributor Author

/retest

require.True(t, blockSizeField.IsValid(), "blockSize field should be valid")
require.Equal(t, config.DefaultBlockSize, int(blockSizeField.Int()))

// Clean up
Collaborator

Why does the comment above start with a lowercase letter while this one starts with an uppercase letter? Please make them consistent.
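
The assertions under review read a `blockSize` field through reflection (`IsValid`, then `Int`). A minimal sketch of that pattern follows, assuming the field is an unexported `int` on a struct. `fakeWriter` and `readIntField` are hypothetical names for illustration and do not come from the PR or from Pebble.

```go
package main

import (
	"fmt"
	"reflect"
)

// fakeWriter is a hypothetical stand-in for a writer type that keeps its
// block size in an unexported field.
type fakeWriter struct {
	blockSize int
}

// readIntField reads an int field by name from a pointer to a struct.
// It reports false when ptr is not a non-nil struct pointer, when the
// field does not exist, or when the field is not of kind int.
// Reading via Int() works for unexported fields; only Interface() and
// the setters are restricted.
func readIntField(ptr interface{}, name string) (int, bool) {
	v := reflect.ValueOf(ptr)
	if v.Kind() != reflect.Ptr || v.IsNil() {
		return 0, false
	}
	v = v.Elem()
	if v.Kind() != reflect.Struct {
		return 0, false
	}
	f := v.FieldByName(name)
	if !f.IsValid() || f.Kind() != reflect.Int {
		return 0, false
	}
	return int(f.Int()), true
}

func main() {
	w := &fakeWriter{blockSize: 16 * 1024}
	size, ok := readIntField(w, "blockSize")
	fmt.Println(size, ok)
}
```

Reflection ties such a test to a private field name, so the check needs updating if the field is renamed.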

@D3Hunter
Contributor

> > Can you write up the details of the manual test?
>
> Sure, I can provide a brief overview of the dataset size and test structure. However, I’m concerned that sharing further details in this PR might not be appropriate, as they could involve confidential information related to specific customer use cases.

You can ignore the customer part; just describe the steps and results from a technical point of view.

Contributor

@D3Hunter D3Hunter left a comment

rest lgtm

	// potentially causing a memory spike and leading to an Out of Memory (OOM) scenario.
	// If the user specifies a smaller block size, respect their choice.
	if blockSize <= 0 {
		blockSize = config.DefaultBlockSize
Contributor

Can you also replace the literal inside NewConfig (BlockSize: 16 * 1024) with this constant?
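
The fallback discussed in this thread can be sketched as below. This is an illustration under stated assumptions, not the merged code: `resolveBlockSize`, `engineConfig`, and `newEngineConfig` are hypothetical names, and only the 16KB value and the `blockSize <= 0` check come from the PR. The config constructor reuses the same constant instead of repeating the literal, as the review comment requests.

```go
package main

import "fmt"

// DefaultBlockSize mirrors the 16KB default named in the PR title.
const DefaultBlockSize = 16 * 1024

// engineConfig is a hypothetical config type used only for illustration.
type engineConfig struct {
	BlockSize int
}

// newEngineConfig (hypothetical) uses the shared constant rather than a
// repeated 16 * 1024 literal, so the default lives in one place.
func newEngineConfig() engineConfig {
	return engineConfig{BlockSize: DefaultBlockSize}
}

// resolveBlockSize (hypothetical) falls back to the default only when the
// caller did not set a block size. Any positive value, even one smaller
// than the default, is respected.
func resolveBlockSize(blockSize int) int {
	if blockSize <= 0 {
		return DefaultBlockSize
	}
	return blockSize
}

func main() {
	fmt.Println(resolveBlockSize(0))        // prints 16384
	fmt.Println(resolveBlockSize(8 * 1024)) // prints 8192
	fmt.Println(newEngineConfig().BlockSize) // prints 16384
}
```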

@OliverS929
Contributor Author

> > > Can you write up the details of the manual test?
> >
> > Sure, I can provide a brief overview of the dataset size and test structure. However, I’m concerned that sharing further details in this PR might not be appropriate, as they could involve confidential information related to specific customer use cases.
>
> You can ignore the customer part; just describe the steps and results from a technical point of view.

Sure. To test the specific issue addressed in this PR, I used a ~1.4TB dataset consisting mostly of duplicate data. Before the fix, memory usage spiked during the ingest phase due to the large index metadata loaded by Pebble, causing OOM kills on a 16c64g VM. With the fix, memory consumption remained stable, staying below 17GB and leading to no disastrous memory spikes.

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Mar 19, 2025

ti-chi-bot bot commented Mar 19, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-03-17 10:19:43.397593011 +0000 UTC m=+264477.081829106: ☑️ agreed by lance6716.
  • 2025-03-19 09:16:05.011375656 +0000 UTC m=+433458.695611751: ☑️ agreed by wjhuang2016.

Collaborator

@Benjamin2037 Benjamin2037 left a comment

LGTM

ti-chi-bot bot commented Mar 19, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Benjamin2037, lance6716, wjhuang2016

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@D3Hunter
Contributor

> > > > Can you write up the details of the manual test?
> > >
> > > Sure, I can provide a brief overview of the dataset size and test structure. However, I’m concerned that sharing further details in this PR might not be appropriate, as they could involve confidential information related to specific customer use cases.
> >
> > You can ignore the customer part; just describe the steps and results from a technical point of view.
>
> Sure. To test the specific issue addressed in this PR, I used a ~1.4TB dataset consisting mostly of duplicate data. Before the fix, memory usage spiked during the ingest phase due to the large index metadata loaded by Pebble, causing OOM kills on a 16c64g VM. With the fix, memory consumption remained stable, staying below 17GB and leading to no disastrous memory spikes.

Please add it to the PR description, under the [ ] manual test section.

@Benjamin2037
Collaborator

Please remember to add an integration test.

@Benjamin2037
Collaborator

/retest

@OliverS929
Contributor Author

> Please add it to the PR description, under the [ ] manual test section.

Done.

@OliverS929
Contributor Author

/retest

@ti-chi-bot ti-chi-bot bot merged commit 514204e into pingcap:master Mar 20, 2025
25 checks passed
@OliverS929
Contributor Author

/cherry-pick release-8.5

@OliverS929
Contributor Author

/cherry-pick release-8.1

@ti-chi-bot
Member

@OliverS929: new pull request created to branch release-8.5: #60184.

@ti-chi-bot
Member

@OliverS929: new pull request created to branch release-8.1: #60185.

Labels
approved lgtm ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lightning OOM during write/ingest to TiKV when import large mount of data
6 participants