Skip to content

feat(idle-power): Update Idle Energy Estimation Calculation #1934

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

KaiyiLiu1234
Copy link
Collaborator

Changes:

  • Idle Energy Calculation has been changed to track the minimum known cpu time and absolute energy and maximum known cpu time and absolute energy. Using these two points, kepler calculates the slope and y intercept (idle estimated energy or energy when cpu time == 0).
  • Idle Energy is only exposed when kepler has a sufficiently wide difference between the minimum and maximum known cpu times, the calculated idle energy converge, and negative idle energy estimations are ignored.

Pending Issue (must address):

  • CPU Time is provided per node, but idle power and absolute power is provided in per socket.

Before this PR is ready to merge, the above Pending Issue needs to be addressed.

Replaced MinIdlePower calculation with a Linear Regressor model which
calculates Idle Power every scrape interval.

Signed-off-by: Kaiyi Liu <[email protected]>
Included checks for sample size (via spread) and history which stores the previous
n number of calculated idle energy.

Signed-off-by: Kaiyi Liu <[email protected]>
Updated kepler to expose idle power via prometheus using the new
idle power calculation with Linear Regression.

Signed-off-by: Kaiyi Liu <[email protected]>
@KaiyiLiu1234
Copy link
Collaborator Author

@sthaha @sunya-ch @rootfs Please take a look before the upcoming Community Meeting. Thanks!

// Modify Manually
spreadDiff = 0.3
historyLength = 10
energyTypeToMinSlope = map[string]float64{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a comment on how these are used?

result *IdleEnergyResult
}

func (ic *IdleEnergyCalculator) UpdateIdleEnergy(newResutilization float64, newEnergyDelta float64, maxTheoreticalCPUTime float64) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (ic *IdleEnergyCalculator) UpdateIdleEnergy(newResutilization float64, newEnergyDelta float64, maxTheoreticalCPUTime float64) {
func (ic *IdleEnergyCalculator) UpdateIdleEnergy(newResUtilization float64, newEnergyDelta float64, maxTheoreticalCPUTime float64) {

klog.V(5).Infof("Excess Datapoint: (%f, %f)", newResutilization, newEnergyDelta)
// Record History
klog.V(5).Infof("Push Idle Energy to history")
appendToSliceWithSizeRestriction(&ic.result.history, historyLength, ic.result.calculatedIdleEnergy)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please explain why this is needed? Would this be removed in the final draft ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tracks consistency. One of the requirements is to make sure the idle power is consistent. So keeping track of the recent history (say the past 10 calculations) and comparing the average of the history with the current idle power can indicate if idle power is consistent (ex. percentage error difference should be less than 0.1). Note, spread and history are tracked and once the idle power passes both, the idle power is freely reported without restrictions (let me know if this should be changed)

history []float64
}

type IdleEnergyCalculator struct {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KaiyiLiu1234 , could you please add a unit test for this? It will help us understand the behaviour better.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. I can add this unit test in node_stats_test.go

Added dev/latest dashboards showing differences in idle power calculations and
included bug fixes in idle power calculation.

Signed-off-by: Kaiyi Liu <[email protected]>
@KaiyiLiu1234
Copy link
Collaborator Author

These images show the idle power calculation between kepler latest and kepler dev. kepler dev contains the new way of calculating idle and dynamic power.
Screenshot From 2025-03-03 19-07-51
Screenshot From 2025-03-03 19-08-26

Copy link
Collaborator

@sunya-ch sunya-ch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic to get intercept from two points which spread more than x(50)% relative to the theoretical CPU time makes sense to me.
Just leave some comment that we can reduce some unused calculation.

// note minutilization == maxutilization only occurs when we only have one value at the very beginning
// in that case, we can rely on the default values provided by NewIdleEnergyCalculator
//if ic.minUtilization.X < ic.maxUtilization.X {
if newMinUtilizationX < newMaxUtilizationX {
Copy link
Collaborator

@sunya-ch sunya-ch Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can skip calculating intercept by checking

diff := math.Abs(newMaxUtilization.X/maxTheoreticalCPUTime-newMinUtilization.X/maxTheoreticalCPUTime )
if diff < spreadDiff {
   klog.V(5).Infof("data spread too small %f", diff) <- please feel free to change log message
   return 
}

And then we can use only when ic.result.calculatedIdleEnergy is set to >= ic.minIntercept (which it seems to set to 0).

Simplified UpdateIdleEnergy function to remove excessive
if statements. Divided CalcIdleEnergy LR into multiple
functions in order to satisfy Single Responsibility Principle (SRP).

Signed-off-by: Kaiyi Liu <[email protected]>
@KaiyiLiu1234
Copy link
Collaborator Author

Note Unit test fails due to the removal of CalcDynEnergy from stats.go. I believe it's best to in a separate PR remove this functionality and modify follow up unit tests. This means this PR will not fix the bug of idle energy influencing dynamic energy. Additionally, UpdateIdleEnergy can also be divided into multiple functions to satisfy SRP.

Added Unit Tests for UpdateIdleEnergy and UpdateIdleEnergyWithLinearRegression.

Signed-off-by: Kaiyi Liu <[email protected]>
Fixed UpdateIdleEnergyWithLinearRegression typo.

Signed-off-by: Kaiyi Liu <[email protected]>
Add Documentation for Idle Energy Calculator, Linear Model Regression
Calculations, and UpdateIdleEnergy functions.

Signed-off-by: Kaiyi Liu <[email protected]>
Reorganized Idle Energy Validation Grafana Dashboards into
rows with consistent coloring.

Signed-off-by: Kaiyi Liu <[email protected]>
Copy link

codecov bot commented Mar 17, 2025

Codecov Report

Attention: Patch coverage is 77.55102% with 66 lines in your changes missing coverage. Please review.

Project coverage is 53.86%. Comparing base (cf807b6) to head (2f595d2).
Report is 25 commits behind head on main.

Files with missing lines Patch % Lines
pkg/collector/stats/node_stats.go 85.37% 35 Missing and 2 partials ⚠️
pkg/node/node.go 0.00% 19 Missing ⚠️
pkg/collector/stats/utils.go 52.38% 7 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##           main    #1934       +/-   ##
=========================================
+ Coverage      0   53.86%   +53.86%     
=========================================
  Files         0       39       +39     
  Lines         0     3791     +3791     
=========================================
+ Hits          0     2042     +2042     
- Misses        0     1592     +1592     
- Partials      0      157      +157     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Resolved errors produced by golangci-lint.

Signed-off-by: Kaiyi Liu <[email protected]>
Resolve golangci-lint errors by:
- Fixing variable naming issues.
- Adding missing documentation.
- Correcting function signatures.

Signed-off-by: Kaiyi Liu <[email protected]>
@KaiyiLiu1234 KaiyiLiu1234 marked this pull request as ready for review March 17, 2025 08:32
@KaiyiLiu1234 KaiyiLiu1234 added the kind/feature New feature or request label Mar 17, 2025
-enable-gpu=false
-v "6" \
-enable-gpu=false \
-expose-estimated-idle-power=true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can set env variable in here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not mean to push the compose files. I will revert them. That said, I don't think it works actually. It's bad that we have two sections where config flags are defined. In the case of the expose estimated idle power, the env variable (pkg/config) gets overriden by cmd/exporter (aka expose-estimated-idle-power=true).

@sthaha
Copy link
Collaborator

sthaha commented Mar 18, 2025

@KaiyiLiu1234 Could you also please report the result of the following test

  • have your machine run minimal amount of processes (init 3?)
  • measure the power .. (min W)
  • Increase the load (stress-ng )
  • start kepler
  • see if the idle estimator estimates lower than the min W

Removed -expose-estimated-idle-power=true and reverted -v=6 to -v=8
in compose files.

Signed-off-by: Kaiyi Liu <[email protected]>
Added documentation for CPUCount explaining how the function is
used and how it retrieves the number of CPUs/logical processors
on the node.

Signed-off-by: Kaiyi Liu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants