[Spike] [2h] Explore and understand 'Split Cost Allocation Data' on EKS #4465

Closed · Tracked by #4453

yuvipanda opened this issue Jul 20, 2024 · 10 comments

yuvipanda commented Jul 20, 2024

AWS has recently enabled better kubernetes integration for its cost data exports. We should explore it to see if it will serve our needs - I suspect it may.

I've enabled it for the openscapes cluster, so we will have data to work with shortly.

Things to investigate:

  1. Setting up an AWS Athena DB with this information via cloudformation (https://docs.aws.amazon.com/cur/latest/userguide/use-athena-cf.html)
  2. The Grafana AWS Athena datasource https://grafana.com/grafana/plugins/grafana-athena-datasource/

Things we wanna track

| Thing | Trackable with split cost allocation? | Trackable with AWS tags (multiple hubs sharing nodes)? | Trackable with AWS tags (hubs having dedicated nodes)? |
| --- | --- | --- | --- |
| EC2 memory / CPU cost | yes | no | yes |
| EC2 GPU | ? | NA | yes |
| EC2 base disk | no | no | yes |
| EFS | no | yes | yes |
| Network egress (via browser) | no | no | no |
| Network egress (via user programmatically sending out data) | no | no | yes |
| Persistent and Scratch Bucket S3 use | NA | yes | yes |
| Requestor Pays S3 use | NA | no | no |

Spike outcome

Based on my exploration here, and on what was determined to be the things that would be valuable to admins right now (per #4384), I've made the following choices:

  1. We can use AWS Athena for these queries, so yay.
  2. We cannot use the split cost allocation feature, because it doesn't cover a couple of resources that are important to us (disk, primarily).
  3. For clusters where we want to offer 'per-hub cost tracking', this means each hub must be on its own tagged nodepool.

yuvipanda commented:

I tried to follow https://docs.aws.amazon.com/cur/latest/userguide/use-athena-cf.html but couldn't find the .yml file with the CloudFormation template. Also, I had set the object prefix to raw/, and it turns out the extra / meant there were two /s in the object name.

I set up another export in the meantime.

yuvipanda commented:

I can't do the manual Athena attempt either, because https://docs.aws.amazon.com/cur/latest/userguide/create-manual-table.html asks for a .sql file I don't have in the export. I wonder if the extra / is related. I'll take a look once the new export sets up again and delivers data.

I've spent 15min on this so far.

yuvipanda commented:

Aaaah, looking at https://docs.aws.amazon.com/cur/latest/userguide/dataexports-processing.html, I see:

Currently, Data Exports doesn't provide the SQL file for setting up Athena to query your exports like Cost and Usage Reports (CUR) does.

yuvipanda commented:

I've now enabled this export via the Cost and Usage Reports 'legacy' page.

yuvipanda commented:

I can run SQL queries with Athena now!

SELECT line_item_product_code,
sum(line_item_blended_cost) AS cost, month
FROM athenacurcfn_2i2c_cost_export."2i2c_cost_export"
WHERE year='2024'
GROUP BY  line_item_product_code, month
HAVING sum(line_item_blended_cost) > 0
ORDER BY  line_item_product_code;

which gives as output:

(screenshot of query output: per-product monthly costs)

So that's great.

Next is to examine the split output columns to see if we can use those.

yuvipanda commented:

I can verify that individual pod names actually do make it in here, as part of line_item_resource_id. It looks like arn:aws:eks:us-west-2:783616723547:pod/openscapeshub/prod/jupyter-<username>/591c2cc5-6d89-455f-9c90-c8ddced97357, which is exciting but out of scope for right now.
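
A minimal sketch of how that could be used later, assuming Athena's regexp_extract and the ARN layout above (pod/&lt;cluster&gt;/&lt;namespace&gt;/&lt;pod&gt;/&lt;uid&gt;); untested, so treat it as illustrative only:

-- Sketch: extract namespace and pod name from the EKS pod ARN in
-- line_item_resource_id, then sum split costs per pod.
SELECT regexp_extract(line_item_resource_id, 'pod/[^/]+/([^/]+)/([^/]+)/', 1) AS namespace,
regexp_extract(line_item_resource_id, 'pod/[^/]+/([^/]+)/([^/]+)/', 2) AS pod_name,
sum(split_line_item_split_cost) AS cost
FROM athenacurcfn_2i2c_cost_export."2i2c_cost_export"
WHERE line_item_resource_id LIKE '%:pod/%'
GROUP BY 1, 2
ORDER BY cost DESC
LIMIT 20;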

yuvipanda commented:

Running this query to see what kind of costs get allocated:

SELECT line_item_usage_type,
sum(split_line_item_split_cost) AS "cost",
resource_tags_aws_eks_namespace
FROM "athenacurcfn_2i2c_cost_export"."2i2c_cost_export"
WHERE split_line_item_split_cost > 0.0
GROUP BY line_item_usage_type, resource_tags_aws_eks_namespace
LIMIT 100;

I see:

  • USW2-EKS-EC2-vCPU-Hours
  • USW2-EKS-EC2-GB-Hours

And unfortunately, only that. This means the following costs are unattributed:

  1. Disks (for hub db disks, prometheus, for sure - not sure about instance base disks)
  2. Network egress costs

While we could tag the hub db disks and prometheus disks, I'm not sure we can do the same for instance base disks.

Not being able to tag network requests presents both a smaller and a bigger challenge. Smaller, because almost all our egress goes through the proxy and ingress pods anyway, so per-namespace networking would be somewhat 'off' regardless (everything would get attributed to nginx-ingress). Bigger, because it's possible for this to get really expensive, and we need to be careful to make sure we can track this information.
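
Even if egress can't be attributed per hub, we should at least be able to watch it at the cluster level. A rough sketch (the LIKE pattern for data-transfer usage types is an assumption and would need checking against actual line items):

-- Sketch: cluster-level egress cost per month. The usage-type pattern
-- is an assumption about how data transfer shows up in this export.
SELECT month,
line_item_usage_type,
sum(line_item_blended_cost) AS cost
FROM athenacurcfn_2i2c_cost_export."2i2c_cost_export"
WHERE line_item_usage_type LIKE '%DataTransfer-Out%'
AND year = '2024'
GROUP BY month, line_item_usage_type
ORDER BY month;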

yuvipanda commented:

I've activated the tags kubernetes.io/created-for/pvc/name and kubernetes.io/created-for/pvc/namespace (set by the kubernetes automatic provisioner) to see if we can incorporate those into cost calculations.
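
If those tags flow through, a query along these lines should attribute EBS volume costs per namespace. The resource_tags_* column name below is a guess at how Athena maps the tag key, and the EBS usage-type filter is an assumption; both need verifying once tagged data shows up in the export:

-- Sketch: EBS cost per PVC namespace. Assumes Athena exposes the tag
-- kubernetes.io/created-for/pvc/namespace as the column used below, and
-- that EBS charges have 'EBS' in line_item_usage_type.
SELECT resource_tags_user_kubernetes_io_created_for_pvc_namespace AS namespace,
sum(line_item_blended_cost) AS cost
FROM athenacurcfn_2i2c_cost_export."2i2c_cost_export"
WHERE line_item_product_code = 'AmazonEC2'
AND line_item_usage_type LIKE '%EBS%'
AND resource_tags_user_kubernetes_io_created_for_pvc_namespace != ''
GROUP BY 1
ORDER BY cost DESC;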

I'll explore our networking situation, as well as our 'requestor pays' situation.

yuvipanda commented:

For requestor pays, we can look at line_item_resource_id to see which bucket the operations are against. This would allow us to account for charges against our buckets vs. buckets elsewhere. However, it doesn't mention which entity made the request that caused the charge, so we can't really distinguish this per hub.
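
For the record, a sketch of the per-bucket breakdown that line_item_resource_id makes possible (bucket-level only - it can't tell us which hub or user triggered the charge):

-- Sketch: S3 cost per bucket and operation, using the bucket name in
-- line_item_resource_id.
SELECT line_item_resource_id AS bucket,
line_item_operation,
sum(line_item_blended_cost) AS cost
FROM athenacurcfn_2i2c_cost_export."2i2c_cost_export"
WHERE line_item_product_code = 'AmazonS3'
AND year = '2024'
GROUP BY line_item_resource_id, line_item_operation
ORDER BY cost DESC
LIMIT 50;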

yuvipanda commented:

Based on my exploration here, and on what was determined to be the things that would be valuable to admins right now (per #4384), I've made the following choices:

  1. We can use AWS Athena for these queries, so yay.
  2. We cannot use the split cost allocation feature, because it doesn't cover a couple of resources that are important to us (disk, primarily).
  3. For clusters where we want to offer 'per-hub cost tracking', this means each hub must be on its own tagged nodepool.

I'll proceed to refine more tasks based on this.

Looking at my time tracking, this has taken about 90 minutes spread out over 3 days, which isn't so bad :)
