feat(idle-power): Update Idle Energy Estimation Calculation #1934

KaiyiLiu1234 · 2025-03-03T12:29:32Z

Changes:

The idle energy calculation now tracks two points: the minimum known CPU time with its absolute energy, and the maximum known CPU time with its absolute energy. From these two points, Kepler calculates the slope and the y-intercept (the estimated idle energy, i.e. the energy when CPU time == 0).
Idle energy is only exposed once the difference between the minimum and maximum known CPU times is sufficiently wide and the calculated idle energy has converged. Negative idle energy estimates are ignored.
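
For illustration, the two-point intercept calculation described above could be sketched as follows. This is a minimal sketch, not the code in this PR: the `point` type and `interceptFromTwoPoints` are hypothetical names, and the spread and convergence checks are left out.

```go
package main

import "fmt"

// point is a hypothetical (CPU time, absolute energy) sample.
type point struct {
	X float64 // CPU time observed in the interval
	Y float64 // absolute energy observed in the interval
}

// interceptFromTwoPoints fits a line through lo and hi and returns its
// y-intercept, i.e. the estimated energy at CPU time == 0.
// ok is false when the points do not differ in X or the intercept is negative.
func interceptFromTwoPoints(lo, hi point) (intercept float64, ok bool) {
	if hi.X <= lo.X {
		return 0, false
	}
	slope := (hi.Y - lo.Y) / (hi.X - lo.X)
	intercept = lo.Y - slope*lo.X
	if intercept < 0 {
		// Negative idle energy estimates are ignored.
		return 0, false
	}
	return intercept, true
}

func main() {
	lo := point{X: 2, Y: 14}
	hi := point{X: 8, Y: 26}
	idle, ok := interceptFromTwoPoints(lo, hi)
	fmt.Println(idle, ok) // prints "10 true"
}
```

Here the slope is (26 - 14) / (8 - 2) = 2, so the intercept is 14 - 2*2 = 10.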

Pending Issue (must address):

CPU time is provided per node, but idle power and absolute power are provided per socket.

Before this PR is ready to merge, the above Pending Issue needs to be addressed.

Replaced MinIdlePower calculation with a Linear Regressor model which calculates Idle Power every scrape interval. Signed-off-by: Kaiyi Liu <[email protected]>

Signed-off-by: Kaiyi Liu <[email protected]>

Included checks for sample size (via spread) and history which stores the previous n number of calculated idle energy. Signed-off-by: Kaiyi Liu <[email protected]>

Updated kepler to expose idle power via prometheus using the new idle power calculation with Linear Regression. Signed-off-by: Kaiyi Liu <[email protected]>

KaiyiLiu1234 · 2025-03-03T12:30:31Z

@sthaha @sunya-ch @rootfs Please take a look before the upcoming Community Meeting. Thanks!

sthaha · 2025-03-03T23:28:34Z

pkg/collector/stats/node_stats.go

+	// Modify Manually
+	spreadDiff           = 0.3
+	historyLength        = 10
+	energyTypeToMinSlope = map[string]float64{


Could you please add a comment on how these are used?

sthaha · 2025-03-03T23:29:41Z

pkg/collector/stats/node_stats.go

+	result *IdleEnergyResult
+}
+
+func (ic *IdleEnergyCalculator) UpdateIdleEnergy(newResutilization float64, newEnergyDelta float64, maxTheoreticalCPUTime float64) {


Suggested change

-func (ic *IdleEnergyCalculator) UpdateIdleEnergy(newResutilization float64, newEnergyDelta float64, maxTheoreticalCPUTime float64) {
+func (ic *IdleEnergyCalculator) UpdateIdleEnergy(newResUtilization float64, newEnergyDelta float64, maxTheoreticalCPUTime float64) {

sthaha · 2025-03-03T23:31:20Z

pkg/collector/stats/node_stats.go

+		klog.V(5).Infof("Excess Datapoint: (%f, %f)", newResutilization, newEnergyDelta)
+		// Record History
+		klog.V(5).Infof("Push Idle Energy to history")
+		appendToSliceWithSizeRestriction(&ic.result.history, historyLength, ic.result.calculatedIdleEnergy)


Could you please explain why this is needed? Will this be removed in the final draft?

This tracks consistency. One of the requirements is to make sure the idle power is consistent. Keeping track of the recent history (say, the past 10 calculations) and comparing the average of that history with the current idle power indicates whether the idle power is consistent (e.g. the percentage error should be less than 0.1). Note that both spread and history are tracked, and once the idle power passes both checks it is reported freely without further restrictions (let me know if this should be changed).
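
As a rough illustration of the history check described above, here is a minimal sketch. It is not the code in this PR: `pushBounded` and `isConsistent` are hypothetical names, and the choice to treat an empty history as not yet consistent is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// pushBounded appends v to history and drops the oldest entries so that
// at most limit values remain. (Hypothetical helper.)
func pushBounded(history []float64, limit int, v float64) []float64 {
	history = append(history, v)
	if len(history) > limit {
		history = history[len(history)-limit:]
	}
	return history
}

// isConsistent reports whether current lies within maxRelErr (relative error)
// of the mean of history. An empty history is treated as not yet consistent.
func isConsistent(history []float64, current, maxRelErr float64) bool {
	if len(history) == 0 {
		return false
	}
	sum := 0.0
	for _, h := range history {
		sum += h
	}
	mean := sum / float64(len(history))
	if mean == 0 {
		return false
	}
	return math.Abs(current-mean)/math.Abs(mean) < maxRelErr
}

func main() {
	var history []float64
	for _, v := range []float64{100, 102, 98, 100} {
		history = pushBounded(history, 3, v)
	}
	fmt.Println(history)                         // [102 98 100]
	fmt.Println(isConsistent(history, 105, 0.1)) // true
	fmt.Println(isConsistent(history, 150, 0.1)) // false
}
```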

sthaha · 2025-03-03T23:33:32Z

pkg/collector/stats/node_stats.go

+	history              []float64
+}
+
+type IdleEnergyCalculator struct {


@KaiyiLiu1234 , could you please add a unit test for this? It will help us understand the behaviour better.

Understood. I can add this unit test in node_stats_test.go

Added dev/latest dashboards showing differences in idle power calculations and included bug fixes in idle power calculation. Signed-off-by: Kaiyi Liu <[email protected]>

KaiyiLiu1234 · 2025-03-04T00:20:19Z

These images show the idle power calculation between kepler latest and kepler dev. kepler dev contains the new way of calculating idle and dynamic power.

sunya-ch

The logic of taking the intercept from two points whose spread is more than x (50)% of the theoretical CPU time makes sense to me.
I'm just leaving a comment on where we can skip some unused calculation.

sunya-ch · 2025-03-04T03:09:51Z

pkg/collector/stats/node_stats.go

+	// note minutilization == maxutilization only occurs when we only have one value at the very beginning
+	// in that case, we can rely on the default values provided by NewIdleEnergyCalculator
+	//if ic.minUtilization.X < ic.maxUtilization.X {
+	if newMinUtilizationX < newMaxUtilizationX {


I think we can skip calculating intercept by checking

	diff := math.Abs(newMaxUtilization.X/maxTheoreticalCPUTime - newMinUtilization.X/maxTheoreticalCPUTime)
	if diff < spreadDiff {
		klog.V(5).Infof("data spread too small %f", diff) // please feel free to change the log message
		return
	}

We can then use the result only when ic.result.calculatedIdleEnergy is >= ic.minIntercept (which seems to be set to 0).
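
A self-contained version of this early check might look like the sketch below. `spreadWideEnough` is a hypothetical name, the guard against a non-positive theoretical CPU time is an added assumption, and `spreadDiff = 0.3` is the value quoted earlier in this review.

```go
package main

import (
	"fmt"
	"math"
)

// spreadDiff is the minimum normalized spread required before an
// intercept is calculated (value taken from the constant quoted above).
const spreadDiff = 0.3

// spreadWideEnough reports whether the min and max CPU times, normalized by
// the maximum theoretical CPU time, differ by at least spreadDiff.
func spreadWideEnough(minX, maxX, maxTheoreticalCPUTime float64) bool {
	if maxTheoreticalCPUTime <= 0 {
		return false // assumption: cannot normalize, so skip the calculation
	}
	diff := math.Abs(maxX-minX) / maxTheoreticalCPUTime
	return diff >= spreadDiff
}

func main() {
	fmt.Println(spreadWideEnough(10, 20, 100)) // false: spread is 0.1
	fmt.Println(spreadWideEnough(10, 60, 100)) // true: spread is 0.5
}
```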

Simplified UpdateIdleEnergy function to remove excessive if statements. Divided CalcIdleEnergy LR into multiple functions in order to satisfy Single Responsibility Principle (SRP). Signed-off-by: Kaiyi Liu <[email protected]>

KaiyiLiu1234 · 2025-03-10T06:53:51Z

Note: the unit test fails due to the removal of CalcDynEnergy from stats.go. I believe it is best to remove this functionality and update the affected unit tests in a separate PR. This means this PR will not fix the bug of idle energy influencing dynamic energy. Additionally, UpdateIdleEnergy could also be divided into multiple functions to satisfy SRP.

Added Unit Tests for UpdateIdleEnergy and UpdateIdleEnergyWithLinearRegression. Signed-off-by: Kaiyi Liu <[email protected]>

Fixed UpdateIdleEnergyWithLinearRegression typo. Signed-off-by: Kaiyi Liu <[email protected]>

Add Documentation for Idle Energy Calculator, Linear Model Regression Calculations, and UpdateIdleEnergy functions. Signed-off-by: Kaiyi Liu <[email protected]>

Reorganized Idle Energy Validation Grafana Dashboards into rows with consistent coloring. Signed-off-by: Kaiyi Liu <[email protected]>

codecov · 2025-03-17T08:17:30Z

Codecov Report

Attention: Patch coverage is 77.55102% with 66 lines in your changes missing coverage. Please review.

Project coverage is 53.86%. Comparing base (cf807b6) to head (2f595d2).
Report is 25 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/collector/stats/node_stats.go	85.37%	35 Missing and 2 partials ⚠️
pkg/node/node.go	0.00%	19 Missing ⚠️
pkg/collector/stats/utils.go	52.38%	7 Missing and 3 partials ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##           main    #1934       +/-   ##
=========================================
+ Coverage      0   53.86%   +53.86%     
=========================================
  Files         0       39       +39     
  Lines         0     3791     +3791     
=========================================
+ Hits          0     2042     +2042     
- Misses        0     1592     +1592     
- Partials      0      157      +157

Resolved errors produced by golangci-lint. Signed-off-by: Kaiyi Liu <[email protected]>

Resolve golangci-lint errors by: - Fixing variable naming issues. - Adding missing documentation. - Correcting function signatures. Signed-off-by: Kaiyi Liu <[email protected]>

vprashar2929 · 2025-03-17T13:13:32Z

manifests/compose/compose.yaml

-          -enable-gpu=false
+          -v "6" \
+          -enable-gpu=false \
+          -expose-estimated-idle-power=true


We can set env variable in here

I did not mean to push the compose files; I will revert them. That said, I don't think it actually works. It's bad that we have two places where config flags are defined. In the case of expose-estimated-idle-power, the env variable (pkg/config) gets overridden by cmd/exporter (i.e. expose-estimated-idle-power=true).

sthaha · 2025-03-18T01:01:08Z

@KaiyiLiu1234 Could you also please report the result of the following test

1. Have your machine run a minimal number of processes (init 3?)
2. Measure the power (min W)
3. Increase the load (stress-ng)
4. Start kepler
5. See if the idle estimator estimates lower than the min W

Removed -expose-estimated-idle-power=true and reverted -v=6 to -v=8 in compose files. Signed-off-by: Kaiyi Liu <[email protected]>

Added documentation for CPUCount explaining how the function is used and how it retrieves the number of CPUs/logical processors on the node. Signed-off-by: Kaiyi Liu <[email protected]>

KaiyiLiu1234 added 4 commits February 18, 2025 17:17

feat(idle-power): Add LR Estimation of Idle Power

c6c7636

Replaced MinIdlePower calculation with a Linear Regressor model which calculates Idle Power every scrape interval. Signed-off-by: Kaiyi Liu <[email protected]>

feat(idle-power): Remove aggr idle calculation

619147c

Signed-off-by: Kaiyi Liu <[email protected]>

feat(idle-energy): Added reliability checks for idle energy

916809c

Included checks for sample size (via spread) and history which stores the previous n number of calculated idle energy. Signed-off-by: Kaiyi Liu <[email protected]>

feat(idle-power): Incorporate idle power into kepler

698e3ad

Updated kepler to expose idle power via prometheus using the new idle power calculation with Linear Regression. Signed-off-by: Kaiyi Liu <[email protected]>

KaiyiLiu1234 requested review from sthaha, rootfs and sunya-ch March 3, 2025 12:29

KaiyiLiu1234 requested a review from marceloamaral March 3, 2025 13:19

sthaha reviewed Mar 3, 2025

feat(idle-power): Added Dev/Latest Dashboards for Comparing Idle power

c0667b3

Added dev/latest dashboards showing differences in idle power calculations and included bug fixes in idle power calculation. Signed-off-by: Kaiyi Liu <[email protected]>

sunya-ch reviewed Mar 4, 2025

feat(idle-energy): Refactor UpdateIdleEnergy

81f9c48

Simplified UpdateIdleEnergy function to remove excessive if statements. Divided CalcIdleEnergy LR into multiple functions in order to satisfy Single Responsibility Principle (SRP). Signed-off-by: Kaiyi Liu <[email protected]>

KaiyiLiu1234 added 4 commits March 17, 2025 03:13

feat(idle-energy): Add Idle Energy Unit Tests

b2b9c51

Added Unit Tests for UpdateIdleEnergy and UpdateIdleEnergyWithLinearRegression. Signed-off-by: Kaiyi Liu <[email protected]>

feat(idle-energy): Fix typo

a1fead1

Fixed UpdateIdleEnergyWithLinearRegression typo. Signed-off-by: Kaiyi Liu <[email protected]>

feat(idle-energy): Add Documentation for Idle Energy Calculator

01a49af

Add Documentation for Idle Energy Calculator, Linear Model Regression Calculations, and UpdateIdleEnergy functions. Signed-off-by: Kaiyi Liu <[email protected]>

feat(idle-energy): Reorganize Idle Energy Dev Dashboards

554adc7

Reorganized Idle Energy Validation Grafana Dashboards into rows with consistent coloring. Signed-off-by: Kaiyi Liu <[email protected]>

KaiyiLiu1234 added 2 commits March 17, 2025 04:26

feat(lint): fix Golint errors

4460e8f

Resolved errors produced by golangci-lint. Signed-off-by: Kaiyi Liu <[email protected]>

feat(lint): resolve golangci-lint errors

410abd8

Resolve golangci-lint errors by: - Fixing variable naming issues. - Adding missing documentation. - Correcting function signatures. Signed-off-by: Kaiyi Liu <[email protected]>

KaiyiLiu1234 force-pushed the add-idle-power branch from 81e964d to 410abd8 Compare March 17, 2025 08:29

KaiyiLiu1234 marked this pull request as ready for review March 17, 2025 08:32

KaiyiLiu1234 requested review from sthaha and vprashar2929 March 17, 2025 08:32

KaiyiLiu1234 added the kind/feature New feature or request label Mar 17, 2025

vprashar2929 reviewed Mar 17, 2025

fix(compose): remove idle energy flag and revert verbosity level

b9d3a10

Removed -expose-estimated-idle-power=true and reverted -v=6 to -v=8 in compose files. Signed-off-by: Kaiyi Liu <[email protected]>

KaiyiLiu1234 requested a review from vprashar2929 March 24, 2025 11:31

feat(idle-energy): document CPUCount

2f595d2

Added documentation for CPUCount explaining how the function is used and how it retrieves the number of CPUs/logical processors on the node. Signed-off-by: Kaiyi Liu <[email protected]>
