fix(snapshots): optimise memory consumption for restores #3465
base: master
Conversation
The code was stacking closures over the metadata of the whole snapshot before starting to process any files. This PR inverts the logic by processing the directories breadth-first, freeing memory as the files are processed.
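To make the idea concrete, here is a minimal, self-contained sketch of breadth-first directory processing. It is not the PR's actual code: `dirNode` and `bfsRestore` are hypothetical names, and a real restore would walk repository objects rather than an in-memory tree. The point is that each directory's files are handled as soon as the directory is dequeued, so only a frontier of metadata is held at once instead of closures for the whole snapshot.

```go
package main

import "fmt"

// dirNode is a hypothetical stand-in for snapshot directory metadata.
type dirNode struct {
	name  string
	files []string
	subs  []*dirNode
}

// bfsRestore processes directories breadth-first: a directory's files are
// restored when it is dequeued, and its entry is dropped from the queue
// immediately, so memory is freed as files are processed.
func bfsRestore(root *dirNode) []string {
	var processed []string
	queue := []*dirNode{root}
	for len(queue) > 0 {
		d := queue[0]
		queue = queue[1:] // release this directory's slot right away
		for _, f := range d.files {
			processed = append(processed, d.name+"/"+f)
		}
		queue = append(queue, d.subs...)
	}
	return processed
}

// sampleTree builds a tiny illustrative layout: one root with two subdirs.
func sampleTree() *dirNode {
	return &dirNode{
		name:  "root",
		files: []string{"a.txt"},
		subs: []*dirNode{
			{name: "root/d1", files: []string{"b.txt"}},
			{name: "root/d2", files: []string{"c.txt"}},
		},
	}
}

func main() {
	for _, p := range bfsRestore(sampleTree()) {
		fmt.Println(p)
	}
}
```

For a very flat layout (one big directory with many subdirectories) the queue peaks at roughly one level's worth of directory entries, which is far smaller than the full snapshot's metadata.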
Codecov Report: all modified and coverable lines are covered by tests ✅

@@           Coverage Diff           @@
##           master    #3465   +/-   ##
=======================================
  Coverage   75.83%   75.84%
=======================================
  Files         465      465
  Lines       37166    37166
=======================================
+ Hits        28184    28187       +3
+ Misses       7050     7048       -2
+ Partials     1932     1931       -1
In principle I like this. The only concern is progress reporting: it will likely mess up ETA estimation for restores. Can you run this change before/after on some large directory and perhaps capture a couple of screenshots while it's running?
Yes, sure, I will try to get more information / stats / screenshots. For reference, the snapshots in my tests are taken on a 5 TB volume of data (a shared drive of plain files).
As an alternative, can we divide the workers into two sets, one consuming from the front and the other consuming from the back? We could also adjust the number of workers in each set to tune the priority.
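The two-worker-set idea above could be sketched as follows. This is purely illustrative, not kopia code: `deque`, `popFront`, `popBack`, and `run` are hypothetical names, and the worker split (2 front, 1 back in the demo) is an arbitrary example of weighting priority between the two ends.

```go
package main

import (
	"fmt"
	"sync"
)

// deque is a minimal mutex-protected double-ended queue of task names.
type deque struct {
	mu    sync.Mutex
	items []string
}

func (d *deque) popFront() (string, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.items) == 0 {
		return "", false
	}
	it := d.items[0]
	d.items = d.items[1:]
	return it, true
}

func (d *deque) popBack() (string, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.items) == 0 {
		return "", false
	}
	it := d.items[len(d.items)-1]
	d.items = d.items[:len(d.items)-1]
	return it, true
}

// run starts frontWorkers consuming from the front and backWorkers from the
// back of a shared deque; the size of each set controls how much priority
// each end of the queue receives. It returns which end handled each task.
func run(tasks []string, frontWorkers, backWorkers int) map[string]string {
	d := &deque{items: append([]string(nil), tasks...)}
	var mu sync.Mutex
	handled := map[string]string{}
	var wg sync.WaitGroup
	worker := func(end string, pop func() (string, bool)) {
		defer wg.Done()
		for {
			t, ok := pop()
			if !ok {
				return
			}
			mu.Lock()
			handled[t] = end
			mu.Unlock()
		}
	}
	for i := 0; i < frontWorkers; i++ {
		wg.Add(1)
		go worker("front", d.popFront)
	}
	for i := 0; i < backWorkers; i++ {
		wg.Add(1)
		go worker("back", d.popBack)
	}
	wg.Wait()
	return handled
}

func main() {
	res := run([]string{"t1", "t2", "t3", "t4"}, 2, 1)
	fmt.Println(len(res), "tasks processed")
}
```

Because both pops are guarded by the same mutex, each task is consumed exactly once regardless of which set reaches it first.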
@arouene what do the stats shown above (~5 TB, ~26 million files, ~18K directories) refer to? Is that the total content in the repository, or something else? Also, what's the makeup of the directory being restored?
Thanks for your interest in this PR! I'm trying to get better stats...
It's the total content of the directory that is snapshotted; it's also about the size of the S3 repository, as the files don't change much or often.
Number of directories: 18,530. So the layout is very flat, with one big directory that contains a long list of sub-directories that themselves contain a lot of files.
The largest directory has 1,149,564 files (at depth 4). The typical directory has about a thousand files (for about a thousand folders), with some subfolders that occasionally hold many extra files. Would it be possible to get the number of files/directories to restore, along with the size, without caching all the metadata?
Resolves #3460.