New policies for storage archival process #3384
base: staging
Conversation
docs/policy/storage_policy.md
99% of the archival requests were made for two hubs - Datahub and Data 100. We can hypothesize that most users of other hubs are either a) not aware of the archival service or b) do not require it. Even if we improve our outreach so that other hub users are aware of the archival process, we may still run into issues with our capacity to handle such requests. We collectively agree that the manual storage archival process is something we want to move away from via automation.
Less than 1% (~50) of all Datahub users (~10k) make data archival requests. To meet the demand from this small subset of users, we allocate a significant amount of storage (as much as 31 TB), as shown in the first snapshot above. The Google Cloud cost estimator suggests that storing this volume of data may cost closer to $5,000 per year. Here is a [link](https://cloud.google.com/products/calculator/#id=686b9639-ae2e-4a94-a5b9-30aeb1135e6c) to an approximate estimate of cloud costs.
Based on current usage, it's costing us about $140 a month, which is about $1,700 a year - not $5,000. If you look at our billing, logs actually cost us more.
So if the goal is to reduce cost, there are multiple other avenues for us to look at before we start deleting user data.
I think we can also switch these to archive storage - see https://cloud.google.com/products/calculator/#id=60effa0c-499a-4f0b-b2a2-fb4cfa0276d8 for a calculation of how much that would cost for a year: about $280 for the same 20 TB.
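As a rough back-of-the-envelope check of that figure (the per-GB rate below is an assumption based on published GCS archive-class pricing and will vary by region and over time):

```python
# Sanity check of the archive-class estimate above.
# Assumed rate: ~$0.0012 per GB per month for GCS archive storage.
archive_rate_usd_per_gb_month = 0.0012
total_gb = 20 * 1000  # ~20 TB of archived home directories

yearly_cost = total_gb * archive_rate_usd_per_gb_month * 12
print(f"~${yearly_cost:.0f}/year")  # roughly $288/year, in line with the ~$280 estimate
```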
My suggestion here is that we switch our archived objects to archive class rather than 'standard' class, as that matches our use case for them much better. You can see more about the pricing for different storage classes here: https://cloud.google.com/storage/pricing.
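For illustration only - not something we run today - switching existing objects to the ARCHIVE class with the google-cloud-storage Python client could look roughly like this; the bucket name and prefix are placeholders:

```python
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.get_bucket("example-archive-bucket")  # placeholder bucket name

# Rewrite existing archived objects under a prefix into the ARCHIVE storage class.
for blob in client.list_blobs(bucket, prefix="homedir-archives/"):
    if blob.storage_class != "ARCHIVE":
        blob.update_storage_class("ARCHIVE")

# Optionally add a lifecycle rule so future uploads move to ARCHIVE automatically.
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=0)
bucket.patch()
```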
I think a fundamental question to me here is - what is the problem we are trying to solve? Is it extra load on @felder in servicing these requests? Or the cost of archiving? If cost is the issue, I think switching it to archival will help give us a 94% reduction in cost (I opened #3389 to track). If workload is the issue, we can find ways to do more automation there.
I've also opened #3388 as a draft end-user communication policy, which I think should help a lot with notifications and with informing users how they can be good citizens (thanks to Eric Fraser for the idea).
@yuvipanda Incorporated your feedback and pushed an update to the policy doc. Please review and merge the changes if they make sense!
@balajialg Thank you for making some changes! I'm still struggling to understand the 'what is the problem we are trying to solve here?' question I framed in #3384 (comment). And I'm not entirely sure what part of the changes addresses that. Our policies should look radically different based on what it is that we are trying to solve, so I'd love to frame our conversation around that.
@yuvipanda - Let me know if I am coming across clearly with the purpose of this policy proposal. The purpose is to build transparency about our storage policy and process with all users of Datahub. "All users" is the keyword here. I assume our goal (and probably the problem we as the infrastructure team want to solve for ourselves) at the start of the process is to revisit the storage policy from first principles, with the objective of making it more user-centric and reducing the effort and cloud costs involved (if possible). I know that this is a broad statement with multiple objectives.

I see this proposal as us documenting our exploration of the multiple policy options and finalizing the policy pathway forward, both for users and for our future reference. Articulating what our policy is, storing it in a place that is accessible to our users, and communicating this policy change to them at different stages of their engagement with Datahub - a) when they first log in to Datahub, b) if and when their storage needs exceed the threshold limit set, and c) when their data is about to get archived - is important for building transparency with our users. From a user perspective, this proposal seeks to be the single source of truth for our finalized storage policy. It should go hand in hand with the communication proposal you outlined as part of PR #3388.

Given this context, let me know if you have input on how I can reframe your question based on the rationale outlined above (supposing that the rationale makes sense from your lens).
@yuvipanda questions whether we are solving any real problem by having a policy proposal built around the 100 GB storage threshold. On that point, John highlighted that cloud costs are not a big concern at this juncture, and initiatives like #3389 would bring down cloud costs over a longer duration. His suggestion would be to focus on communicating our storage policies instead of adding more policy guardrails around storage!
@yuvipanda note that it's not just cloud storage we're concerned with here. About half of the "compute engine" costs are for the persistent disks, which are a concern for this policy.
Folks, it would be great if we could iterate on this proposal and finalize our policy by the end of next week.
@balajialg @felder how about we automate running the archiver so it runs every week, and then for people with >100GB, we archive on 3 months of inactivity? That should help take it off the more expensive POSIX storage. |
As for deletion, I'd say we can do something like 'your files will be deleted 18 months after they are archived' or something of that sort, and enforce that consistently - along with the automated messages mentioned in #3388 so users are aware. I don't want us to delete user directories automatically just because they exceeded some threshold, and not archiving them because they're big actually costs us more money.
So if the goal is to save more expensive on-disk storage, I propose that we run the archiver continuously (I'll have to redesign it slightly, but doable), and if your homedir is >100GB your cutoff is 3 months rather than 6. How does that sound?
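To make the proposed rule concrete, here is a minimal sketch of the selection logic - this is not how the current archiver is implemented, just an illustration of the 100 GB / 3-month / 6-month thresholds discussed above (paths and helper names are made up):

```python
from datetime import datetime, timedelta
from pathlib import Path

SIZE_CUTOFF_BYTES = 100 * 1024**3            # 100 GB threshold from the proposal
LARGE_DIR_INACTIVITY = timedelta(days=90)    # ~3 months for big homedirs
DEFAULT_INACTIVITY = timedelta(days=180)     # ~6 months for everyone else

def size_and_last_activity(homedir: Path) -> tuple[int, datetime]:
    """Total size in bytes and the most recent file mtime under homedir."""
    total, latest = 0, homedir.stat().st_mtime
    for f in homedir.rglob("*"):
        if f.is_file() and not f.is_symlink():
            st = f.stat()
            total += st.st_size
            latest = max(latest, st.st_mtime)
    return total, datetime.fromtimestamp(latest)

def should_archive(homedir: Path, now: datetime | None = None) -> bool:
    """Apply the proposed rule: >100 GB means a 3-month cutoff, else 6 months."""
    now = now or datetime.now()
    size, last_active = size_and_last_activity(homedir)
    cutoff = LARGE_DIR_INACTIVITY if size > SIZE_CUTOFF_BYTES else DEFAULT_INACTIVITY
    return now - last_active > cutoff
```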
@yuvipanda I'm definitely open to the idea of running the archiver continuously. However, yeah, we'd need to consider that carefully. For instance, I'd like archived directories (on disk) to get removed at some point. As things stand now, removing a directory from disk also removes the ability for the owner of that data to know where it went. I figure we could either remove the data from archival storage at the same time (which implies waiting at least 12 months) or provide another method of retrieval if we want to keep the data in perpetuity.
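One hypothetical way to keep that pointer around after the on-disk copy is removed (purely illustrative - no such tool exists today) would be to append a small manifest entry for every archived homedir, so the owner or a future retrieval tool can still find where the data went:

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

def record_archive(manifest: Path, username: str, archive_uri: str) -> None:
    """Append one row: who was archived, where the data went, and when."""
    new_file = not manifest.exists()
    with manifest.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["username", "archive_uri", "archived_at"])
        writer.writerow([username, archive_uri,
                         datetime.now(timezone.utc).isoformat()])

# e.g. record_archive(Path("archive-manifest.csv"), "jane_doe",
#                     "gs://example-archive-bucket/homedir-archives/jane_doe.tar.gz")
```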
Cool, I included the policy suggestion to archive files with size > 100 GB at the 90-day cut-off as part of the proposal. I assume the operational details @felder talked about are not within the scope of this policy proposal but should be discussed as part of the relevant GitHub issues. We can revisit this proposal if there is any updated information. Can one of you merge this PR if this seems like a reasonable policy proposal?
@felder's 3 key reasons why defining this policy is extremely important NOW:
Thanks for the response, @balajialg. I agree these are all important problems to solve. I personally don't think we should be writing actual policy that treats users differently based on their home directory storage before they're even aware of current policies. My suggestion is that we try to get #3388 implemented this coming semester, and see how that goes - and table this particular policy until the next semester.
@yuvipanda Seems reasonable to me. I will let @felder take the final call on this as he will be most affected by this decision. @felder - What do you think about holding off on this policy and reviewing it at the end of the semester (let's assume that #2 is not a big headache during Fall 22)? I can postpone the scheduled meeting to sometime in December.
@balajialg seems reasonable.
Let's iterate on this proposal through this PR, which is a follow-up to #3377!