adding full env vars to README, improving documentation, and improving log output and exception tracing
AndrewFarley committed Jan 23, 2022
1 parent 71a553f commit 2df6aca
Showing 2 changed files with 40 additions and 17 deletions.
28 changes: 24 additions & 4 deletions README.md
@@ -146,7 +146,7 @@ Additionally, since cloud providers don't let you just constantly resize disks,
<img src="./.github/screenshot.recently-scaled.png" alt="Screenshot of usage">


## Per-Volume Configuration / Annotations
## Per-Volume Configuration via Annotations

This controller also supports tweaking your volume-autoscaler configuration per-PVC with annotations. The annotations supported are...

@@ -190,6 +190,8 @@ spec:
### [Release: 1.0.3 - Jan 23, 2022](https://github.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/releases/tag/1.0.3)
```
Handle signal from Kubernetes to kill/restart properly/quickly
Add full env vars as documentation markdown table, inside notes for development below
Add better exception logging via traceback, and more readable/reasonable log output, especially when VERBOSE is enabled
```

### [Release: 1.0.2 - Jan 15, 2022](https://github.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/releases/tag/1.0.2)
@@ -218,7 +220,6 @@ Current Release: 1.0.2
This todo list is mostly for the Author(s), but any contributions are also welcome. Please [submit an Issue](https://github.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/issues) for issues or requests, or a [Pull Request](https://github.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/pulls) if you added some code.

* Add full helm chart values documentation markdown table
* Add full env vars as documentation markdown table, inside notes for development below
* Push to helm repo in a Github Action and push the static yaml as well
* Add tests coverage to ensure the software works as intended moving forward
* Do some load testing to see how well this software deals with scale, document how much resources needed for each interval. (10 PVCs, 100+ PVC, 500 PVC)
@@ -229,7 +230,8 @@ This todo list is mostly for the Author(s), but any contributions are also welco
* Make per-PVC annotations to (re)direct Slack to different webhooks and/or different channel(s)
* Discuss what the ideal "default" amount of time before scaling should be. Currently it is 5 minutes (5 intervals of 60 seconds)
* Discuss what the ideal "default" scale-up size is; currently it is 50%. A suggestion has been made to lower this to around 20%
* Auto-detect (or let user) choose a different provider (eg: AWS/Google) and set different per-provider defaults (eg: wait time, max disk size, etc)
* Auto-detect (or let user choose) a different provider (eg: AWS/Google) and set different per-provider defaults (eg: wait time, min/max disk size, min disk increment, etc)
* Check if the storage class has ALLOWVOLUMEEXPANSION enabled, to help ensure the expansion will succeed (a rough sketch of such a check follows this list)
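
As a rough illustration of that last item, a check along these lines could use the official `kubernetes` Python client; the helper below is hypothetical and not part of the current codebase:

```python
# Hypothetical helper (not in the current codebase) showing how a storage class's
# allowVolumeExpansion flag could be checked with the official kubernetes Python client.
from kubernetes import client, config

def storage_class_allows_expansion(storage_class_name):
    config.load_incluster_config()  # or config.load_kube_config() when running locally
    storage_api = client.StorageV1Api()
    sc = storage_api.read_storage_class(storage_class_name)
    # allow_volume_expansion may be None if the field is unset; treat that as False
    return bool(sc.allow_volume_expansion)
```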

# Notes for Development

@@ -246,6 +248,24 @@ curl http://10.100.57.102
curl https://prometheus.mycompany.com
# Once you have established a functioning URL to prometheus, put it in the following command
# and you'll be off and running in a safe way that won't affect anything because of DryRun
VERBOSE=true SCALE_AFTER_INTERVALS=1 DRY_RUN=true PROMETHEUS_URL=http://10.100.57.102 ./main.py
VERBOSE=true DRY_RUN=true PROMETHEUS_URL=http://10.100.57.102 ./main.py
# Of course, remove DRY_RUN above if you want to actually have the software try to scale your disks by patching the PVC desired storage resources
```

The following environment variables can be set during development to alter the default logic. They are also settable via the Helm Chart values, and overridable [per-PVC in Annotations](#per-volume-configuration-via-annotations). A small sketch of how the sizing variables combine follows the table below.

| Variable Name | Default | Description |
|------------------------|----------------|-------------|
| INTERVAL_TIME          | 60             | How often (in seconds) to scan Prometheus to check whether any volumes need resizing |
| SCALE_ABOVE_PERCENT    | 80             | What percentage (out of 100) of the volume must be in use before we consider scaling it |
| SCALE_AFTER_INTERVALS  | 5              | How many intervals of INTERVAL_TIME a volume must be above SCALE_ABOVE_PERCENT before we scale |
| SCALE_UP_PERCENT       | 50             | What percent of the current volume size to scale up by (eg: 100 == a 10GB disk scales to 20GB, 50 == a 10GB disk scales to 15GB) |
| SCALE_UP_MIN_INCREMENT | 1000000000     | The minimum number of bytes we will resize up by; the default is 1GB (1000000000 bytes) |
| SCALE_UP_MAX_INCREMENT | 16000000000000 | The maximum number of bytes we will resize up by; the default is 16TB (16000000000000 bytes) |
| SCALE_UP_MAX_SIZE      | 16000000000000 | The maximum disk size (in bytes) we will resize up to; the default is 16TB, the limit for EBS volumes in AWS |
| SCALE_COOLDOWN_TIME    | 22200          | How long (in seconds) to wait before scaling this volume again. For AWS EBS this is 6 hours (21600 seconds); we add an extra 10 minutes for good measure, giving 22200 |
| PROMETHEUS_URL         | `auto-detect`  | Where Prometheus is; if not provided, it is auto-detected when Prometheus runs in the same namespace as this Volume Autoscaler |
| DRY_RUN                | false          | If true, take no actions and only report what would happen (for dev/testing/PoC purposes) |
| PROMETHEUS_LABEL_MATCH |                | A PromQL label match (without braces) to restrict which volumes this sees and scales, eg: 'namespace="dev"' |
| HTTP_TIMEOUT           | 15             | Timeout (in seconds) for calls to Prometheus and Kubernetes. Increase this if your Prometheus or Kubernetes is over a remote WAN link with high latency and/or is heavily loaded |
| VERBOSE                | false          | If true, enable verbose mode, which prints the raw data and status/state of each PVC instead of the default summary output |
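
As a rough illustration (not the controller's actual `calculateBytesToScaleTo` implementation; the function name and defaults below are only a sketch of the behaviour the table describes):

```python
# Hypothetical sketch of how the sizing variables above might combine when
# picking a new PVC size; illustrative only, not the controller's exact code.
def sketch_calculate_bytes_to_scale_to(current_bytes,
                                       scale_up_percent=50,
                                       min_increment=1000000000,          # 1GB
                                       max_increment=16000000000000,      # 16TB
                                       maximum_size=16000000000000):      # 16TB
    # Grow by SCALE_UP_PERCENT of the current size...
    increment = int(current_bytes * (scale_up_percent / 100))
    # ...but never by less than SCALE_UP_MIN_INCREMENT or more than SCALE_UP_MAX_INCREMENT...
    increment = max(min_increment, min(increment, max_increment))
    # ...and never beyond SCALE_UP_MAX_SIZE overall
    return min(current_bytes + increment, maximum_size)

# Example: a 10GB volume with the defaults grows to 15GB
print(sketch_calculate_bytes_to_scale_to(10000000000))  # 15000000000
```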
29 changes: 16 additions & 13 deletions main.py
@@ -5,6 +5,7 @@
from helpers import convert_bytes_to_storage, scale_up_pvc, testIfPrometheusIsAccessible, describe_all_pvcs
from helpers import fetch_pvcs_from_prometheus, printHeaderAndConfiguration, calculateBytesToScaleTo, GracefulKiller
import slack
import sys, traceback

# Other globals
IN_MEMORY_STORAGE = {}
@@ -37,19 +38,19 @@
# In every loop, fetch all our pvcs state from Kubernetes
try:
pvcs_in_kubernetes = describe_all_pvcs(simple=True)
except Exception as e:
except Exception:
print("Exception while trying to describe all PVCs")
print(e)
traceback.print_exc()
time.sleep(MAIN_LOOP_TIME)
continue

# Fetch our volume usage from Prometheus
try:
pvcs_in_prometheus = fetch_pvcs_from_prometheus(url=PROMETHEUS_URL)
print("Querying and found {} valid PVCs to asses in prometheus".format(len(pvcs_in_prometheus)))
except Exception as e:
except Exception:
print("Exception while trying to fetch PVC metrics from prometheus")
print(e)
traceback.print_exc()
time.sleep(MAIN_LOOP_TIME)
continue

@@ -68,7 +69,9 @@

if VERBOSE:
print("Volume {} is {}% in-use of the {} available".format(volume_description,volume_used_percent,pvcs_in_kubernetes[volume_description]['volume_size_status']))
print(pvcs_in_kubernetes[volume_description])
print(" VERBOSE DETAILS:")
for key in pvcs_in_kubernetes[volume_description]:
print(" {}: {}".format(key, pvcs_in_kubernetes[volume_description][key]))

# Check if we are NOT in an alert condition
if volume_used_percent < pvcs_in_kubernetes[volume_description]['scale_above_percent']:
@@ -92,20 +95,20 @@

# Check if we are NOT in a possible scale condition
if IN_MEMORY_STORAGE[volume_description] < pvcs_in_kubernetes[volume_description]['scale_after_intervals']:
print(" AND need to wait {} seconds to scale".format( abs(pvcs_in_kubernetes[volume_description]['last_resized_at'] + pvcs_in_kubernetes[volume_description]['scale_cooldown_time']) - int(time.mktime(time.gmtime())) ))
print(" HAS desired_size {} and current size {}".format( convert_bytes_to_storage(pvcs_in_kubernetes[volume_description]['volume_size_spec_bytes']), convert_bytes_to_storage(pvcs_in_kubernetes[volume_description]['volume_size_status_bytes'])))
print(" BUT need to wait for {} intervals in alert before considering to scale".format( pvcs_in_kubernetes[volume_description]['scale_after_intervals'] ))
print(" FYI this has desired_size {} and current size {}".format( convert_bytes_to_storage(pvcs_in_kubernetes[volume_description]['volume_size_spec_bytes']), convert_bytes_to_storage(pvcs_in_kubernetes[volume_description]['volume_size_status_bytes'])))
continue

# If we are in a possible scale condition, check if we recently scaled it and handle accordingly
if pvcs_in_kubernetes[volume_description]['last_resized_at'] + pvcs_in_kubernetes[volume_description]['scale_cooldown_time'] >= int(time.mktime(time.gmtime())):
print(" AND we recently scaled it {} seconds ago so we will not resize it".format(pvcs_in_kubernetes[volume_description]['last_resized_at'] + pvcs_in_kubernetes[volume_description]['scale_cooldown_time']))
print(" BUT need to wait {} seconds to scale since the last scale time {} seconds ago".format( abs(pvcs_in_kubernetes[volume_description]['last_resized_at'] + pvcs_in_kubernetes[volume_description]['scale_cooldown_time']) - int(time.mktime(time.gmtime())), abs(pvcs_in_kubernetes[volume_description]['last_resized_at'] - int(time.mktime(time.gmtime()))) ))
continue

# If we reach this far then we will be scaling the disk, all preconditions were passed from above
if pvcs_in_kubernetes[volume_description]['last_resized_at'] == 0:
print(" AND we need to scale it, it has never been scaled previously")
print(" AND we need to scale it immediately, it has never been scaled previously")
else:
print(" AND we need to scale it, it last scaled {} seconds ago".format( abs((pvcs_in_kubernetes[volume_description]['last_resized_at'] + pvcs_in_kubernetes[volume_description]['scale_cooldown_time']) - int(time.mktime(time.gmtime()))) ))
print(" AND we need to scale it immediately, it last scaled {} seconds ago".format( abs((pvcs_in_kubernetes[volume_description]['last_resized_at'] + pvcs_in_kubernetes[volume_description]['scale_cooldown_time']) - int(time.mktime(time.gmtime()))) ))

# Calculate how many bytes to resize to based on the parameters provided globally and per-this pv annotations
resize_to_bytes = calculateBytesToScaleTo(
@@ -115,7 +118,7 @@
max_increment = pvcs_in_kubernetes[volume_description]['scale_up_max_increment'],
maximum_size = pvcs_in_kubernetes[volume_description]['scale_up_max_size'],
)
# TODO: Check if storage class has the ALLOWVOLUMEEXPANSION flag set to true, read the SC from pvcs_in_kubernetes[volume_description]['storage_class'] ?
# TODO: Check here if storage class has the ALLOWVOLUMEEXPANSION flag set to true, read the SC from pvcs_in_kubernetes[volume_description]['storage_class'] ?

# If our resize bytes failed for some reason, eg putting invalid data into the annotations on the PV
if resize_to_bytes == False:
@@ -165,10 +168,10 @@
if SLACK_WEBHOOK_URL and len(SLACK_WEBHOOK_URL) > 0:
slack.send(status_output, severity="error")

except Exception as e:
except Exception:
print("Exception caught while trying to process record")
print(item)
print(e)
traceback.print_exc()

# Wait until our next interval
time.sleep(MAIN_LOOP_TIME)