adding full env vars to README, improving documentation, and improving log output and exception tracing
AndrewFarley committed Jan 23, 2022
1 parent 71a553f commit 2df6aca
Showing 2 changed files with 40 additions and 17 deletions.
28 changes: 24 additions & 4 deletions README.md
@@ -146,7 +146,7 @@ Additionally, since cloud providers don't let you just constantly resize disks,
<img src="./.github/screenshot.recently-scaled.png" alt="Screenshot of usage">


## Per-Volume Configuration / Annotations
## Per-Volume Configuration via Annotations

This controller also supports tweaking your volume-autoscaler configuration per-PVC with annotations. The annotations supported are...

@@ -190,6 +190,8 @@ spec:
### [Release: 1.0.3 - Jan 23, 2022](https://github.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/releases/tag/1.0.3)
```
Handle signal from Kubernetes to kill/restart properly/quickly
Add full env vars as documentation markdown table, inside notes for development below
Add better exception logging via traceback, and more readable/reasonable log output, especially when VERBOSE is enabled
```

### [Release: 1.0.2 - Jan 15, 2022](https://github.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/releases/tag/1.0.2)
@@ -218,7 +220,6 @@ Current Release: 1.0.2
This todo list is mostly for the Author(s), but any contributions are also welcome. Please [submit an Issue](https://github.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/issues) for issues or requests, or a [Pull Request](https://github.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/pulls) if you added some code.

* Add full helm chart values documentation markdown table
* Add full env vars as documentation markdown table, inside notes for development below
* Push to helm repo in a Github Action and push the static yaml as well
* Add tests coverage to ensure the software works as intended moving forward
* Do some load testing to see how well this software deals with scale, document how much resources needed for each interval. (10 PVCs, 100+ PVC, 500 PVC)
@@ -229,7 +230,8 @@ This todo list is mostly for the Author(s), but any contributions are also welco
* Make per-PVC annotations to (re)direct Slack to different webhooks and/or different channel(s)
* Discuss what the ideal "default" amount of time before scaling should be. Currently it is 5 minutes (5 intervals of 60 seconds)
* Discuss what the ideal "default" scale-up size is; currently it is 50%. A suggestion has been made to lower this to around 20%
* Auto-detect (or let user) choose a different provider (eg: AWS/Google) and set different per-provider defaults (eg: wait time, max disk size, etc)
* Auto-detect (or let user choose) a different provider (eg: AWS/Google) and set different per-provider defaults (eg: wait time, min/max disk size, min disk increment, etc)
* Check if the storage class has ALLOWVOLUMEEXPANSION enabled, to help ensure the expansion will succeed (a rough sketch of such a check follows this list)
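
As a rough illustration of that last item, a check along these lines could use the official `kubernetes` Python client; the helper below is hypothetical and not part of the current codebase:

```python
# Hypothetical helper (not in the current codebase) showing how a storage class's
# allowVolumeExpansion flag could be checked with the official kubernetes Python client.
from kubernetes import client, config

def storage_class_allows_expansion(storage_class_name):
    config.load_incluster_config()  # or config.load_kube_config() when running locally
    storage_api = client.StorageV1Api()
    sc = storage_api.read_storage_class(storage_class_name)
    # allow_volume_expansion may be None if the field is unset; treat that as False
    return bool(sc.allow_volume_expansion)
```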

# Notes for Development

@@ -246,6 +248,24 @@ curl http://10.100.57.102
curl https://prometheus.mycompany.com
# Once you have established a functioning URL to prometheus, put it in the following command
# and you'll be off and running in a safe way that won't affect anything because of DryRun
VERBOSE=true SCALE_AFTER_INTERVALS=1 DRY_RUN=true PROMETHEUS_URL=http://10.100.57.102 ./main.py
VERBOSE=true DRY_RUN=true PROMETHEUS_URL=http://10.100.57.102 ./main.py
# Of course, remove DRY_RUN above if you want to actually have the software try to scale your disks by patching the PVC desired storage resources
```

The following environment variables can be set during development to alter the default logic. They are also settable via the Helm Chart values, and overridable [per-PVC in Annotations](#per-volume-configuration-via-annotations). A small sketch of how the sizing variables combine follows the table below.

| Variable Name | Default | Description |
|------------------------|----------------|-------------|
| INTERVAL_TIME          | 60             | How often (in seconds) to scan Prometheus to check whether any volumes need resizing |
| SCALE_ABOVE_PERCENT    | 80             | What percentage (out of 100) of the volume must be in use before we consider scaling it |
| SCALE_AFTER_INTERVALS  | 5              | How many intervals of INTERVAL_TIME a volume must be above SCALE_ABOVE_PERCENT before we scale |
| SCALE_UP_PERCENT       | 50             | What percent of the current volume size to scale up by (eg: 100 == a 10GB disk scales to 20GB, 50 == a 10GB disk scales to 15GB) |
| SCALE_UP_MIN_INCREMENT | 1000000000     | The minimum number of bytes we will resize up by; the default is 1GB (1000000000 bytes) |
| SCALE_UP_MAX_INCREMENT | 16000000000000 | The maximum number of bytes we will resize up by; the default is 16TB (16000000000000 bytes) |
| SCALE_UP_MAX_SIZE      | 16000000000000 | The maximum disk size (in bytes) we will resize up to; the default is 16TB, the limit for EBS volumes in AWS |
| SCALE_COOLDOWN_TIME    | 22200          | How long (in seconds) to wait before scaling this volume again. For AWS EBS this is 6 hours (21600 seconds); we add an extra 10 minutes for good measure, giving 22200 |
| PROMETHEUS_URL         | `auto-detect`  | Where Prometheus is; if not provided, it is auto-detected when Prometheus runs in the same namespace as this Volume Autoscaler |
| DRY_RUN                | false          | If true, take no actions and only report what would happen (for dev/testing/PoC purposes) |
| PROMETHEUS_LABEL_MATCH |                | A PromQL label match (without braces) to restrict which volumes this sees and scales, eg: 'namespace="dev"' |
| HTTP_TIMEOUT           | 15             | Timeout (in seconds) for calls to Prometheus and Kubernetes. Increase this if your Prometheus or Kubernetes is over a remote WAN link with high latency and/or is heavily loaded |
| VERBOSE                | false          | If true, enable verbose mode, which prints the raw data and status/state of each PVC instead of the default summary output |
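
As a rough illustration (not the controller's actual `calculateBytesToScaleTo` implementation; the function name and defaults below are only a sketch of the behaviour the table describes):

```python
# Hypothetical sketch of how the sizing variables above might combine when
# picking a new PVC size; illustrative only, not the controller's exact code.
def sketch_calculate_bytes_to_scale_to(current_bytes,
                                       scale_up_percent=50,
                                       min_increment=1000000000,          # 1GB
                                       max_increment=16000000000000,      # 16TB
                                       maximum_size=16000000000000):      # 16TB
    # Grow by SCALE_UP_PERCENT of the current size...
    increment = int(current_bytes * (scale_up_percent / 100))
    # ...but never by less than SCALE_UP_MIN_INCREMENT or more than SCALE_UP_MAX_INCREMENT...
    increment = max(min_increment, min(increment, max_increment))
    # ...and never beyond SCALE_UP_MAX_SIZE overall
    return min(current_bytes + increment, maximum_size)

# Example: a 10GB volume with the defaults grows to 15GB
print(sketch_calculate_bytes_to_scale_to(10000000000))  # 15000000000
```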
29 changes: 16 additions & 13 deletions main.py
@@ -5,6 +5,7 @@
from helpers import convert_bytes_to_storage, scale_up_pvc, testIfPrometheusIsAccessible, describe_all_pvcs
from helpers import fetch_pvcs_from_prometheus, printHeaderAndConfiguration, calculateBytesToScaleTo, GracefulKiller
import slack
import sys, traceback

# Other globals
IN_MEMORY_STORAGE = {}
@@ -37,19 +38,19 @@
# In every loop, fetch all our pvcs state from Kubernetes
try:
pvcs_in_kubernetes = describe_all_pvcs(simple=True)
except Exception as e:
except Exception:
print("Exception while trying to describe all PVCs")
print(e)
traceback.print_exc()
time.sleep(MAIN_LOOP_TIME)
continue

# Fetch our volume usage from Prometheus
try:
pvcs_in_prometheus = fetch_pvcs_from_prometheus(url=PROMETHEUS_URL)
print("Querying and found {} valid PVCs to asses in prometheus".format(len(pvcs_in_prometheus)))
except Exception as e:
except Exception:
print("Exception while trying to fetch PVC metrics from prometheus")
print(e)
traceback.print_exc()
time.sleep(MAIN_LOOP_TIME)
continue

@@ -68,7 +69,9 @@

if VERBOSE:
print("Volume {} is {}% in-use of the {} available".format(volume_description,volume_used_percent,pvcs_in_kubernetes[volume_description]['volume_size_status']))
print(pvcs_in_kubernetes[volume_description])
print(" VERBOSE DETAILS:")
for key in pvcs_in_kubernetes[volume_description]:
print(" {}: {}".format(key, pvcs_in_kubernetes[volume_description][key]))

# Check if we are NOT in an alert condition
if volume_used_percent < pvcs_in_kubernetes[volume_description]['scale_above_percent']:
@@ -92,20 +95,20 @@

# Check if we are NOT in a possible scale condition
if IN_MEMORY_STORAGE[volume_description] < pvcs_in_kubernetes[volume_description]['scale_after_intervals']:
print(" AND need to wait {} seconds to scale".format( abs(pvcs_in_kubernetes[volume_description]['last_resized_at'] + pvcs_in_kubernetes[volume_description]['scale_cooldown_time']) - int(time.mktime(time.gmtime())) ))
print(" HAS desired_size {} and current size {}".format( convert_bytes_to_storage(pvcs_in_kubernetes[volume_description]['volume_size_spec_bytes']), convert_bytes_to_storage(pvcs_in_kubernetes[volume_description]['volume_size_status_bytes'])))
print(" BUT need to wait for {} intervals in alert before considering to scale".format( pvcs_in_kubernetes[volume_description]['scale_after_intervals'] ))
print(" FYI this has desired_size {} and current size {}".format( convert_bytes_to_storage(pvcs_in_kubernetes[volume_description]['volume_size_spec_bytes']), convert_bytes_to_storage(pvcs_in_kubernetes[volume_description]['volume_size_status_bytes'])))
continue

# If we are in a possible scale condition, check if we recently scaled it and handle accordingly
if pvcs_in_kubernetes[volume_description]['last_resized_at'] + pvcs_in_kubernetes[volume_description]['scale_cooldown_time'] >= int(time.mktime(time.gmtime())):
print(" AND we recently scaled it {} seconds ago so we will not resize it".format(pvcs_in_kubernetes[volume_description]['last_resized_at'] + pvcs_in_kubernetes[volume_description]['scale_cooldown_time']))
print(" BUT need to wait {} seconds to scale since the last scale time {} seconds ago".format( abs(pvcs_in_kubernetes[volume_description]['last_resized_at'] + pvcs_in_kubernetes[volume_description]['scale_cooldown_time']) - int(time.mktime(time.gmtime())), abs(pvcs_in_kubernetes[volume_description]['last_resized_at'] - int(time.mktime(time.gmtime()))) ))
continue

# If we reach this far then we will be scaling the disk, all preconditions were passed from above
if pvcs_in_kubernetes[volume_description]['last_resized_at'] == 0:
print(" AND we need to scale it, it has never been scaled previously")
print(" AND we need to scale it immediately, it has never been scaled previously")
else:
print(" AND we need to scale it, it last scaled {} seconds ago".format( abs((pvcs_in_kubernetes[volume_description]['last_resized_at'] + pvcs_in_kubernetes[volume_description]['scale_cooldown_time']) - int(time.mktime(time.gmtime()))) ))
print(" AND we need to scale it immediately, it last scaled {} seconds ago".format( abs((pvcs_in_kubernetes[volume_description]['last_resized_at'] + pvcs_in_kubernetes[volume_description]['scale_cooldown_time']) - int(time.mktime(time.gmtime()))) ))

# Calculate how many bytes to resize to based on the parameters provided globally and per-this pv annotations
resize_to_bytes = calculateBytesToScaleTo(
@@ -115,7 +118,7 @@
max_increment = pvcs_in_kubernetes[volume_description]['scale_up_max_increment'],
maximum_size = pvcs_in_kubernetes[volume_description]['scale_up_max_size'],
)
# TODO: Check if storage class has the ALLOWVOLUMEEXPANSION flag set to true, read the SC from pvcs_in_kubernetes[volume_description]['storage_class'] ?
# TODO: Check here if storage class has the ALLOWVOLUMEEXPANSION flag set to true, read the SC from pvcs_in_kubernetes[volume_description]['storage_class'] ?

# If our resize bytes failed for some reason, eg putting invalid data into the annotations on the PV
if resize_to_bytes == False:
@@ -165,10 +168,10 @@
if SLACK_WEBHOOK_URL and len(SLACK_WEBHOOK_URL) > 0:
slack.send(status_output, severity="error")

except Exception as e:
except Exception:
print("Exception caught while trying to process record")
print(item)
print(e)
traceback.print_exc()

# Wait until our next interval
time.sleep(MAIN_LOOP_TIME)