New policies for storage archival process #3384
base: staging
Conversation
docs/policy/storage_policy.md
99% of the archival requests were made for two hubs - Datahub and Data 100. We can hypothesize that most users of other hubs are either a) not aware of the archival service or b) do not require it. Even if we improve our outreach so that other hub users are aware of the archival process, we may still run into issues with our capacity to handle such requests. We collectively agree that the manual storage archival process is something we want to move away from via automation.
Less than 1% (~50) of all Datahub users (~10k) make data archival requests. To meet the demand from this small subset of users, we allocate a significant amount of storage (as much as 31 TB), as shown in the first snapshot above. The Google Cloud cost estimator suggests that storing this volume of data may cost closer to $5,000 per year. Here is a [link](https://cloud.google.com/products/calculator/#id=686b9639-ae2e-4a94-a5b9-30aeb1135e6c) to an approximate estimate of cloud costs.
Based on current usage, it's costing us about $140 a month, which is about $1,700 a year - not $5,000. If you look at our billing, logs actually cost us more.
So if the goal is to reduce cost, there are multiple other avenues for us to look at before we start deleting user data.
I think we can also switch these to archive storage - see https://cloud.google.com/products/calculator/#id=60effa0c-499a-4f0b-b2a2-fb4cfa0276d8 for a calculation of how much that would cost for a year: about $280 for the same 20 TB.
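As a rough back-of-the-envelope check of that figure (the per-GB rate below is an assumption based on published GCS archive-class pricing and will vary by region and over time):

```python
# Sanity check of the archive-class estimate above.
# Assumed rate: ~$0.0012 per GB per month for GCS archive storage.
archive_rate_usd_per_gb_month = 0.0012
total_gb = 20 * 1000  # ~20 TB of archived home directories

yearly_cost = total_gb * archive_rate_usd_per_gb_month * 12
print(f"~${yearly_cost:.0f}/year")  # roughly $288/year, in line with the ~$280 estimate
```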
My suggestion here is that we switch our archived objects to archive class rather than 'standard' class, as that matches our use case for them much better. You can see more about the pricing for different storage classes here: https://cloud.google.com/storage/pricing.
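For illustration only - not something we run today - switching existing objects to the ARCHIVE class with the google-cloud-storage Python client could look roughly like this; the bucket name and prefix are placeholders:

```python
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.get_bucket("example-archive-bucket")  # placeholder bucket name

# Rewrite existing archived objects under a prefix into the ARCHIVE storage class.
for blob in client.list_blobs(bucket, prefix="homedir-archives/"):
    if blob.storage_class != "ARCHIVE":
        blob.update_storage_class("ARCHIVE")

# Optionally add a lifecycle rule so future uploads move to ARCHIVE automatically.
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=0)
bucket.patch()
```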
I think a fundamental question to me here is - what is the problem we are trying to solve? Is it extra load on @felder in servicing these requests? Or the cost of archiving? If cost is the issue, I think switching it to archival will help give us a 94% reduction in cost (I opened #3389 to track). If workload is the issue, we can find ways to do more automation there.
I've also opened #3388 as a draft end-user communication policy, which I think should help a lot with notifications and with informing users how they can be good citizens (thanks to Eric Fraser for the idea).
@yuvipanda Incorporated your feedback and pushed an update to the policy doc. Please review and merge the changes if they make sense!
@balajialg Thank you for making some changes! I'm still struggling to understand the 'what is the problem we are trying to solve here?' question I framed in #3384 (comment). And I'm not entirely sure what part of the changes addresses that. Our policies should look radically different based on what it is that we are trying to solve, so I'd love to frame our conversation around that.
@yuvipanda - Let me know if I am coming across clearly with the purpose of this policy proposal. The purpose is to build transparency about our storage policy and process with all users of Datahub. "All users" is the keyword here. I assume our goal (and probably the problem we as the infrastructure team want to solve for ourselves) at the start of the process is to revisit the storage policy from first principles, with the objective of making it more user-centric and reducing the effort and cloud costs involved (if possible). I know that this is a broad statement with multiple objectives.

I see this proposal as us documenting our exploration of the multiple policy options and finalizing the policy pathway forward, both for users and for our future reference. Articulating what our policy is, storing it in a place that is accessible to our users, and communicating this policy change to them at different stages of their engagement with Datahub - a) when they first log in to Datahub, b) if and when their storage needs exceed the threshold limit set, and c) when their data is about to get archived - is important for building transparency with our users. From a user perspective, this proposal seeks to be the single source of truth for our finalized storage policy. It should go hand in hand with the communication proposal you outlined as part of PR #3388.

Given this context, let me know if you have input on how I can reframe your question based on the rationale outlined above (supposing that the rationale makes sense from your lens).
@yuvipanda questions whether we are solving any real problem by having a policy proposal built around the 100 GB storage threshold. On that point, John highlighted that cloud costs are not a big concern at this juncture, and initiatives like #3389 would bring down cloud costs over a longer duration. His suggestion would be to focus on communicating our storage policies instead of adding more policy guardrails around storage!
@yuvipanda note that it's not just cloud storage we're concerned with here. About half of the "compute engine" costs are for the persistent disks, which are a concern for this policy.
Folks, it would be great if we could iterate on this proposal and finalize our policy by the end of next week.
@balajialg @felder how about we automate running the archiver so it runs every week, and then for people with >100GB, we archive on 3 months of inactivity? That should help take it off the more expensive POSIX storage. |
As for deletion, I'd say we can do something like 'your files will be deleted 18 months after they are archived' or something of that sort, and enforce that consistently - along with the automated messages mentioned in #3388 so users are aware. I don't want us to delete user directories automatically just because they exceeded some threshold, and not archiving them because they're big actually costs us more money.
So if the goal is to save more expensive on-disk storage, I propose that we run the archiver continuously (I'll have to redesign it slightly, but doable), and if your homedir is >100GB your cutoff is 3 months rather than 6. How does that sound?
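To make the proposed rule concrete, here is a minimal sketch of the selection logic - this is not how the current archiver is implemented, just an illustration of the 100 GB / 3-month / 6-month thresholds discussed above (paths and helper names are made up):

```python
from datetime import datetime, timedelta
from pathlib import Path

SIZE_CUTOFF_BYTES = 100 * 1024**3            # 100 GB threshold from the proposal
LARGE_DIR_INACTIVITY = timedelta(days=90)    # ~3 months for big homedirs
DEFAULT_INACTIVITY = timedelta(days=180)     # ~6 months for everyone else

def size_and_last_activity(homedir: Path) -> tuple[int, datetime]:
    """Total size in bytes and the most recent file mtime under homedir."""
    total, latest = 0, homedir.stat().st_mtime
    for f in homedir.rglob("*"):
        if f.is_file() and not f.is_symlink():
            st = f.stat()
            total += st.st_size
            latest = max(latest, st.st_mtime)
    return total, datetime.fromtimestamp(latest)

def should_archive(homedir: Path, now: datetime | None = None) -> bool:
    """Apply the proposed rule: >100 GB means a 3-month cutoff, else 6 months."""
    now = now or datetime.now()
    size, last_active = size_and_last_activity(homedir)
    cutoff = LARGE_DIR_INACTIVITY if size > SIZE_CUTOFF_BYTES else DEFAULT_INACTIVITY
    return now - last_active > cutoff
```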
@yuvipanda I'm definitely open to the idea of running the archiver continuously. However, yeah, we'd need to consider that carefully. For instance, I'd like archived directories (on disk) to get removed at some point. As things stand now, removing a directory from disk also removes the ability for the owner of that data to know where it went. I figure we could either remove the data from archival storage at the same time (which implies waiting at least 12 months) or provide another method of retrieval if we want to keep the data in perpetuity.
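One hypothetical way to keep that pointer around after the on-disk copy is removed (purely illustrative - no such tool exists today) would be to append a small manifest entry for every archived homedir, so the owner or a future retrieval tool can still find where the data went:

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

def record_archive(manifest: Path, username: str, archive_uri: str) -> None:
    """Append one row: who was archived, where the data went, and when."""
    new_file = not manifest.exists()
    with manifest.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["username", "archive_uri", "archived_at"])
        writer.writerow([username, archive_uri,
                         datetime.now(timezone.utc).isoformat()])

# e.g. record_archive(Path("archive-manifest.csv"), "jane_doe",
#                     "gs://example-archive-bucket/homedir-archives/jane_doe.tar.gz")
```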
Cool, I included the policy suggestion to archive files with size > 100 GB at the 90-day cut-off as part of the proposal. I assume the operational details @felder talked about are not within the scope of this policy proposal but should be discussed as part of the relevant GitHub issues. We can revisit this proposal if there is any updated information. Can one of you merge this PR if this seems like a reasonable policy proposal?
@felder's 3 key reasons why defining this policy is extremely important NOW:
Thanks for the response, @balajialg. I agree these are all important problems to solve. I personally don't think we should be writing actual policy that treats users differently based on their home directory storage before they're even aware of current policies. My suggestion is that we try to get #3388 implemented this coming semester, and see how that goes - and table this particular policy until the next semester.
@yuvipanda Seems reasonable to me. I will let @felder take the final call on this as he will be most affected by this decision. @felder - What do you think about holding off on this policy and reviewing it at the end of the semester (let's assume that #2 is not a big headache during Fall 22)? I can postpone the scheduled meeting to sometime in December.
@balajialg seems reasonable.
Let's iterate on this proposal through this PR, which is a follow-up to #3377!