-
Notifications
You must be signed in to change notification settings - Fork 204
feat(idle-power): Update Idle Energy Estimation Calculation #1934
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(idle-power): Update Idle Energy Estimation Calculation #1934
Conversation
Replaced MinIdlePower calculation with a Linear Regressor model which calculates Idle Power every scrape interval. Signed-off-by: Kaiyi Liu <[email protected]>
Signed-off-by: Kaiyi Liu <[email protected]>
Included checks for sample size (via spread) and history which stores the previous n number of calculated idle energy. Signed-off-by: Kaiyi Liu <[email protected]>
Updated kepler to expose idle power via prometheus using the new idle power calculation with Linear Regression. Signed-off-by: Kaiyi Liu <[email protected]>
pkg/collector/stats/node_stats.go
Outdated
// Modify Manually | ||
spreadDiff = 0.3 | ||
historyLength = 10 | ||
energyTypeToMinSlope = map[string]float64{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a comment on how these are used?
pkg/collector/stats/node_stats.go
Outdated
result *IdleEnergyResult | ||
} | ||
|
||
func (ic *IdleEnergyCalculator) UpdateIdleEnergy(newResutilization float64, newEnergyDelta float64, maxTheoreticalCPUTime float64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func (ic *IdleEnergyCalculator) UpdateIdleEnergy(newResutilization float64, newEnergyDelta float64, maxTheoreticalCPUTime float64) { | |
func (ic *IdleEnergyCalculator) UpdateIdleEnergy(newResUtilization float64, newEnergyDelta float64, maxTheoreticalCPUTime float64) { |
pkg/collector/stats/node_stats.go
Outdated
klog.V(5).Infof("Excess Datapoint: (%f, %f)", newResutilization, newEnergyDelta) | ||
// Record History | ||
klog.V(5).Infof("Push Idle Energy to history") | ||
appendToSliceWithSizeRestriction(&ic.result.history, historyLength, ic.result.calculatedIdleEnergy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please explain why this is needed? Would this be removed in the final draft ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tracks consistency. One of the requirements is to make sure the idle power is consistent. So keeping track of the recent history (say the past 10 calculations) and comparing the average of the history with the current idle power can indicate if idle power is consistent (ex. percentage error difference should be less than 0.1). Note, spread and history are tracked and once the idle power passes both, the idle power is freely reported without restrictions (let me know if this should be changed)
history []float64 | ||
} | ||
|
||
type IdleEnergyCalculator struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KaiyiLiu1234 , could you please add a unit test for this? It will help us understand the behaviour better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. I can add this unit test in node_stats_test.go
Added dev/latest dashboards showing differences in idle power calculations and included bug fixes in idle power calculation. Signed-off-by: Kaiyi Liu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic to get intercept from two points which spread more than x(50)% relative to the theoretical CPU time makes sense to me.
Just leave some comment that we can reduce some unused calculation.
// note minutilization == maxutilization only occurs when we only have one value at the very beginning | ||
// in that case, we can rely on the default values provided by NewIdleEnergyCalculator | ||
//if ic.minUtilization.X < ic.maxUtilization.X { | ||
if newMinUtilizationX < newMaxUtilizationX { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can skip calculating intercept by checking
diff := math.Abs(newMaxUtilization.X/maxTheoreticalCPUTime-newMinUtilization.X/maxTheoreticalCPUTime )
if diff < spreadDiff {
klog.V(5).Infof("data spread too small %f", diff) <- please feel free to change log message
return
}
And then we can use only when ic.result.calculatedIdleEnergy is set to >= ic.minIntercept (which it seems to set to 0).
Simplified UpdateIdleEnergy function to remove excessive if statements. Divided CalcIdleEnergy LR into multiple functions in order to satisfy Single Responsibility Principle (SRP). Signed-off-by: Kaiyi Liu <[email protected]>
Note Unit test fails due to the removal of CalcDynEnergy from stats.go. I believe it's best to in a separate PR remove this functionality and modify follow up unit tests. This means this PR will not fix the bug of idle energy influencing dynamic energy. Additionally, UpdateIdleEnergy can also be divided into multiple functions to satisfy SRP. |
Added Unit Tests for UpdateIdleEnergy and UpdateIdleEnergyWithLinearRegression. Signed-off-by: Kaiyi Liu <[email protected]>
Fixed UpdateIdleEnergyWithLinearRegression typo. Signed-off-by: Kaiyi Liu <[email protected]>
Add Documentation for Idle Energy Calculator, Linear Model Regression Calculations, and UpdateIdleEnergy functions. Signed-off-by: Kaiyi Liu <[email protected]>
Reorganized Idle Energy Validation Grafana Dashboards into rows with consistent coloring. Signed-off-by: Kaiyi Liu <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1934 +/- ##
=========================================
+ Coverage 0 53.86% +53.86%
=========================================
Files 0 39 +39
Lines 0 3791 +3791
=========================================
+ Hits 0 2042 +2042
- Misses 0 1592 +1592
- Partials 0 157 +157 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Resolved errors produced by golangci-lint. Signed-off-by: Kaiyi Liu <[email protected]>
Resolve golangci-lint errors by: - Fixing variable naming issues. - Adding missing documentation. - Correcting function signatures. Signed-off-by: Kaiyi Liu <[email protected]>
81e964d
to
410abd8
Compare
manifests/compose/compose.yaml
Outdated
-enable-gpu=false | ||
-v "6" \ | ||
-enable-gpu=false \ | ||
-expose-estimated-idle-power=true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can set env variable in here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not mean to push the compose files. I will revert them. That said, I don't think it works actually. It's bad that we have two sections where config flags are defined. In the case of the expose estimated idle power, the env variable (pkg/config) gets overriden by cmd/exporter (aka expose-estimated-idle-power=true).
@KaiyiLiu1234 Could you also please report the result of the following test
|
Removed -expose-estimated-idle-power=true and reverted -v=6 to -v=8 in compose files. Signed-off-by: Kaiyi Liu <[email protected]>
Added documentation for CPUCount explaining how the function is used and how it retrieves the number of CPUs/logical processors on the node. Signed-off-by: Kaiyi Liu <[email protected]>
Changes:
Pending Issue (must address):
Before this PR is ready to merge, the above Pending Issue needs to be addressed.