SCT for backup #8752

Open
regevran opened this issue Sep 18, 2024 · 20 comments · Fixed by #9307 · May be fixed by #10112
Labels: type/epic (Complex mission split into a task list) · Milestone: 4.4

Comments

@regevran commented Sep 18, 2024

  • Measure backup process performance: how long a backup takes (a rough timing sketch follows below).
  • Measure user queries during backup: what impact the backup has on the user experience.
  • Measurements inside Scylla.
  • SM (Scylla Manager) tests may or may not be combined with the tests in this issue.
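A minimal sketch of the backup-timing measurement, assuming Scylla Manager's sctool drives the backup from the test; the cluster name, location, and the task-ID/progress parsing below are illustrative assumptions, not existing SCT helpers:

```python
# Hypothetical sketch: time a Manager-driven backup end to end.
# `sctool backup` schedules the task and returns, so we poll `sctool progress`
# until the task reports completion. The output parsing is an assumption
# about the CLI's format and should be checked against the sctool in use.
import re
import subprocess
import time


def measure_backup_duration(cluster: str, location: str,
                            poll_interval: int = 30) -> float:
    out = subprocess.run(
        ["sctool", "backup", "-c", cluster, "--location", location],
        check=True, capture_output=True, text=True,
    ).stdout
    task_id = out.strip().splitlines()[-1]  # assumed: last line holds the task ID

    start = time.monotonic()
    while True:
        progress = subprocess.run(
            ["sctool", "progress", task_id, "-c", cluster],
            check=True, capture_output=True, text=True,
        ).stdout
        if re.search(r"Status:\s*DONE", progress):   # assumed status line
            return time.monotonic() - start
        if re.search(r"Status:\s*ERROR", progress):  # assumed status line
            raise RuntimeError(f"backup task {task_id} failed")
        time.sleep(poll_interval)
```

Together with the dataset size, the returned duration gives an effective backup throughput that can be tracked build by build.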

@soyacz (Contributor) commented Sep 18, 2024

I think this is on @mikliapko's plate as part of the SM tests.

@regevran (Author)

> I think this is on @mikliapko's plate as part of the SM tests.

I updated the description; does it make more sense now?

@soyacz (Contributor) commented Sep 19, 2024

> I think this is on @mikliapko's plate as part of the SM tests.

> I updated the description; does it make more sense now?

@mikliapko ^^

@mikliapko (Contributor)

@regevran Could you please provide some context on this issue?

  • Where does it come from?
  • How soon should we have such a test?
  • Is there any preferred Scylla + dataset configuration to test?

Currently, the Manager test scope has no backup performance tests, only restore ones.

cc: @rayakurl

@regevran (Author)

> Where does it come from?

The plan is to finish Q4 2024 with better backup and restore capabilities.
This issue is to make sure we get there with measurements in place.

> How soon should we have such a test?

@kreuzerkrieg is assigned to this and will start working on the details shortly.

> Is there any preferred Scylla + dataset configuration to test?

Yes. I think the focus should be on large, full-cluster backups, and we should measure heavy user loads during the backup (see the load-comparison sketch below).
The numbers (how large the cluster, what heavy load means, etc.) are to be discussed.
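A minimal sketch of that load comparison, assuming cassandra-stress as the load generator; the node address, workload parameters, and latency parsing are placeholders rather than an agreed configuration:

```python
# Hypothetical sketch: run the same read workload twice, once without and once
# with a backup in flight, and report the p99 latency degradation.
import re
import subprocess

STRESS_CMD = [
    "cassandra-stress", "read", "duration=30m", "cl=QUORUM",
    "-rate", "threads=200",
    "-node", "10.0.0.1",  # placeholder node address
]


def stress_p99_ms() -> float:
    out = subprocess.run(STRESS_CMD, check=True,
                         capture_output=True, text=True).stdout
    # Assumed summary line shape, e.g. "Latency 99th percentile : 5.8 ms"
    match = re.search(r"latency 99th percentile\s*:\s*([\d.]+)", out, re.IGNORECASE)
    if not match:
        raise RuntimeError("could not find p99 latency in cassandra-stress output")
    return float(match.group(1))


baseline_p99 = stress_p99_ms()       # no backup running
# ...start the backup here and keep it running through the second pass...
under_backup_p99 = stress_p99_ms()   # backup in flight
print(f"p99 under backup is {under_backup_p99 / baseline_p99:.2f}x the baseline")
```

The same comparison can be repeated for throughput and mean latency to get a fuller picture of the impact on user traffic.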

@rayakurl (Contributor)

@regevran - can you please add more details regarding how @mikliapko should assist?

@regevran (Author)

> @regevran - can you please add more details regarding how @mikliapko should assist?

@kreuzerkrieg is collecting the information, including the requested support 😸.

@kreuzerkrieg (Contributor) commented Oct 16, 2024

@mikliapko I have a couple of questions.
I took a look at this file, and it looks like we already have a backup/restore test that measures its performance; at least that's what the comment states:

> The test is extensively used for restore benchmarking purposes and consists of the following steps

  1. Do we track the results of these tests? Are they reported and can they be reviewed somehow?
  2. Do we have any benchmarking of read performance during a backup? Something comparable, like bench results with and without a backup running?
  3. Do we have benchmarking of read performance from a table residing on the same cluster where another table is being restored?

We need these three to compare the current state of the system with the one we will have in the near future, the Scylla-based backup, so we need these numbers. If we don't have all or any of the above, what would it take to create these metrics? (A sketch of the third scenario is at the end of this comment.)

CC: @rayakurl
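A minimal sketch of the third scenario, reading from one keyspace while another is being restored on the same cluster; the sctool restore flags, snapshot tag, keyspace names, node address, and stress profile are illustrative assumptions and should be checked against the Manager version in use:

```python
# Hypothetical sketch: start a Manager restore of keyspace_b, then measure
# read latency on keyspace_a while the restore task runs in the background.
import subprocess

CLUSTER = "perf-cluster"         # placeholder cluster name
LOCATION = "s3:backup-bucket"    # placeholder backup location
SNAPSHOT_TAG = "sm_snapshot"     # placeholder snapshot tag

# `sctool restore` schedules the task and returns, so the stress run below
# overlaps with the restore while it is in flight.
subprocess.run(
    ["sctool", "restore", "-c", CLUSTER,
     "--location", LOCATION,
     "--snapshot-tag", SNAPSHOT_TAG,
     "--keyspace", "keyspace_b",
     "--restore-tables"],
    check=True,
)

# Reads against the other keyspace via a cassandra-stress user profile
# (the profile file is a placeholder and must define a query named "read").
subprocess.run(
    ["cassandra-stress", "user", "profile=keyspace_a_read.yaml",
     "ops(read=1)", "duration=30m",
     "-node", "10.0.0.1"],  # placeholder node address
    check=True,
)
```

Comparing the latency collected here against a run with no restore in flight gives the "reads while restoring" number the third question asks about.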

@mikliapko (Contributor)

> 1. Do we track the results of these tests? Are they reported and can they be reviewed somehow?

Yes, we recently introduced two jobs.
The purpose of these jobs is to check restore performance build by build.
An important note: it's only for restore; we don't benchmark backup there.

> 2. Do we have any benchmarking of read performance during a backup? Something comparable, like bench results with and without a backup running?

No, we don't have such benchmarks; at least, I'm not aware of any.
@fruch Perhaps you know more about it?

> 3. Do we have benchmarking of read performance from a table residing on the same cluster where another table is being restored?

Nope.

> We need these three to compare the current state of the system with the one we will have in the near future, the Scylla-based backup, so we need these numbers. If we don't have all or any of the above, what would it take to create these metrics?

Actually, it's hard to give you any precise estimate without digging deeper.
@kreuzerkrieg When would you like to have these metrics? If we need them urgently, we will prioritize this work and start on it.
CC: @rayakurl

@kreuzerkrieg (Contributor)

@mikliapko As for urgency, good question. @regevran, what do you say? What is the expectation?

@regevran (Author)

We plan to start the new approach at the beginning of November 2024. It would be great if the tests are ready by then, with the current implementation already measured.

@mikliapko (Contributor)

> We plan to start the new approach at the beginning of November 2024. It would be great if the tests are ready by then, with the current implementation already measured.

I see. Alright, we will talk with @rayakurl and prioritize it accordingly.

@mikliapko (Contributor)

> • Do we have any benchmarking of read performance during a backup? Something comparable, like bench results with and without a backup running?
> • Do we have benchmarking of read performance from a table residing on the same cluster where another table is being restored?

@karol-kokoszka @Michal-Leszczynski Guys, I want to hear your opinion on the value of these tests for Manager in general, so that we don't implement something that would be used only once. I suppose they might be pretty useful.

@mikliapko (Contributor)

> We plan to start the new approach at the beginning of November 2024. It would be great if the tests are ready by then, with the current implementation already measured.

@regevran We had a discussion with @rayakurl about priorities. Is this something you have the capacity to implement?

@regevran (Author)

> @regevran We had a discussion with @rayakurl about priorities. Is this something you have the capacity to implement?

Yes, we'll do it.

@kreuzerkrieg (Contributor)

@rayakurl Who can assist (guide) with this task?

@rayakurl (Contributor)

> @rayakurl Who can assist (guide) with this task?

@kreuzerkrieg, @mikliapko can assist.

@regevran (Author)

> @rayakurl Who can assist (guide) with this task?

> @kreuzerkrieg, @mikliapko can assist.

Also, please make sure @cezarmoise is in the loop.

@Michal-Leszczynski

> • Do we have any benchmarking of read performance during a backup? Something comparable, like bench results with and without a backup running?
> • Do we have benchmarking of read performance from a table residing on the same cluster where another table is being restored?

> @karol-kokoszka @Michal-Leszczynski Guys, I want to hear your opinion on the value of these tests for Manager in general, so that we don't implement something that would be used only once. I suppose they might be pretty useful.

I would say that the first benchmark is more interesting and important, as backups run all the time and the cluster should work fine while they are running. The second one is less important: for now the main objective is to make full-cluster restore as fast as possible, which assumes the cluster does not handle any user traffic. Optimizing restore on a running cluster is left as a future effort.

@regevran (Author)

We are not done with this issue, as we still haven't managed to create a good baseline for the measurements.

@regevran regevran reopened this Nov 28, 2024
@regevran regevran added the type/epic Complex mission split into a task list label Nov 28, 2024
@regevran regevran added this to the 4.4 milestone Jan 9, 2025