Skip to content

fix(power): correct power attribution by separating used vs idle power #2125

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

sthaha
Copy link
Collaborator

@sthaha sthaha commented Jun 3, 2025

This PR fixes a critical power attribution bug where workloads (processes/containers/VMs)
were being attributed the full node power consumption instead of only the power
corresponding to their actual CPU utilization.

Key Changes:

  • Introduced used vs idle power split: Node power is now divided between "used" (active workload) and "idle" (system overhead) portions based on actual CPU usage ratio from /proc/stat
  • Fixed attribution algorithm: Processes/containers now receive power attribution only from the "used" power portion rather than total node power
  • Enhanced metrics granularity: Added new kepler_node_cpu_[active|idle]_[watts|joules_total] kepler_node_cpu_usage_ratio metric

Impact: Provides significantly more accurate power attribution that correctly reflects the relationship between CPU utilization and energy consumption, eliminating over-attribution to workloads.

@github-actions github-actions bot added the fix A bug fix label Jun 3, 2025
@sthaha sthaha requested a review from vimalk78 June 4, 2025 06:25
@sthaha sthaha force-pushed the fix-power-attribution branch 2 times, most recently from 32eedaa to 1c74d29 Compare June 4, 2025 07:45
@sthaha sthaha force-pushed the fix-power-attribution branch from 1c74d29 to cf402c6 Compare June 5, 2025 06:56
Copy link

codecov bot commented Jun 5, 2025

Codecov Report

Attention: Patch coverage is 96.34146% with 9 lines in your changes missing coverage. Please review.

Project coverage is 92.65%. Comparing base (5a7e093) to head (913a0be).
Report is 3 commits behind head on reboot.

Files with missing lines Patch % Lines
internal/resource/procfs_reader.go 81.25% 4 Missing and 2 partials ⚠️
internal/resource/informer.go 72.72% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           reboot    #2125      +/-   ##
==========================================
+ Coverage   92.46%   92.65%   +0.19%     
==========================================
  Files          33       33              
  Lines        2587     2779     +192     
==========================================
+ Hits         2392     2575     +183     
- Misses        154      160       +6     
- Partials       41       44       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@sthaha sthaha force-pushed the fix-power-attribution branch 2 times, most recently from 962c3c2 to f461dfa Compare June 5, 2025 08:24
@sthaha sthaha changed the title WIP: fix power attribution fix(power): correct power attribution by separating used vs idle power Jun 5, 2025
@sthaha sthaha marked this pull request as ready for review June 5, 2025 08:26
@sthaha sthaha force-pushed the fix-power-attribution branch 4 times, most recently from 38d74ab to 2b1b99b Compare June 5, 2025 13:00
@github-actions github-actions bot added the chore Routine tasks or maintenance label Jun 5, 2025
@sthaha sthaha force-pushed the fix-power-attribution branch from 2b1b99b to c1d1b52 Compare June 5, 2025 13:09
sthaha added 2 commits June 5, 2025 09:46
…ibution

Previously, workloads were attributed power from total node consumption,
causing over-attribution during low CPU usage periods. This commit implements
proportional power attribution using CPU utilization ratios.

Changes:
- Add NodeUsage type with active/idle power breakdown
- Calculate CPU usage ratio from /proc/stat
- Attribute only active power to workloads based on CPU time
- Export separate active/idle power metrics

Fixes power attribution accuracy by ensuring workloads are only attributed
power proportional to their actual CPU utilization.

Signed-off-by: Sunil Thaha <[email protected]>
@sthaha sthaha force-pushed the fix-power-attribution branch from c1d1b52 to 913a0be Compare June 5, 2025 13:46
Comment on lines +105 to +106
// active over total, where active = total - idle
// and total = user + nice + system + idle + iowait + irq + softirq + steal
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update comment to say active = total - (idle + iowait)

@@ -14,32 +14,37 @@ import (
"k8s.io/utils/clock"
)

type Node struct {
CPUTimeDelta float64
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this field contains sum(proc.CPUTimeDelta) , not related to node

Comment on lines +222 to +236
ch <- prometheus.MustNewConstMetric(
c.nodeCPUActiveJoulesDesc,
prometheus.CounterValue,
energy.ActiveEnergy.Joules(),
zoneName, path,
)

ch <- prometheus.MustNewConstMetric(
c.nodeCPUIdleJoulesDesc,
prometheus.CounterValue,
energy.IdleEnergy.Joules(),
zoneName, path,
)

// watts
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the node active/idle joules goes up and down, it is more like a gauge than a counter
image

Copy link
Collaborator Author

@sthaha sthaha Jun 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are you plotting? rate ?

You are right ... I am not incrementing (prev + current) the counter and only calculating the irate.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, not with rate

@sthaha
Copy link
Collaborator Author

sthaha commented Jun 7, 2025

@vimalk78 I am merging this so that pod can be rebased on this. I will continue to fix the incorrect implementation of ActiveEnergy counter as a follow up commit.

@sthaha sthaha merged commit 11f34cd into sustainable-computing-io:reboot Jun 7, 2025
16 checks passed
@vprashar2929
Copy link
Collaborator

also:
Screenshot 2025-06-07 at 5 05 44 PM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chore Routine tasks or maintenance fix A bug fix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants