Sending kubernetes events on modification
AndrewFarley committed Jan 24, 2022
1 parent 2df6aca commit 004cbf9
Showing 3 changed files with 108 additions and 40 deletions.
22 changes: 12 additions & 10 deletions README.md
@@ -190,8 +190,9 @@ spec:
### [Release: 1.0.3 - Jan 23, 2022](https://github.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/releases/tag/1.0.3)
```
Handle signal from Kubernetes to kill/restart properly/quickly
Add full env vars as a documentation markdown table, inside the notes for development section below
Add better exception logs via traceback, and more readable/reasonable log output, especially when VERBOSE is enabled
Generate Kubernetes events so everyone viewing the event stream knows when actions occur. AKA, be a responsible controller
```
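The release notes mention better exception logs via traceback. A minimal sketch of that pattern, using only the standard library (`risky_operation` is a hypothetical stand-in, not a function from this project):

```python
import traceback

def risky_operation():
    # Hypothetical stand-in for any call that may raise
    return 1 / 0

tb = ""
try:
    risky_operation()
except Exception:
    # format_exc() captures the full stack trace as a string, far more
    # useful in logs than printing the bare exception object
    tb = traceback.format_exc()
    print(tb)
```

The trace includes the exception type, message, and the full call chain, which is what makes failures debuggable from logs alone.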

### [Release: 1.0.2 - Jan 15, 2022](https://github.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/releases/tag/1.0.2)
@@ -217,21 +218,22 @@ Some basic documentation and installation help/guides

Current Release: 1.0.2

This todo list is mostly for the author(s), but any contributions are also welcome. Please [submit an Issue](https://github.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/issues) for issues/requests, or a [Pull Request](https://github.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/pulls) if you added some code. These items are generally put in order of most-important first.

* Catch WHY resizing failed (try?) and make sure to log/send the reason to Slack/Kubernetes events
* Add a full helm chart values documentation markdown table (tie into adding docs for Universal Helm Charts)
* Push to a helm repo in a Github Action and push the static yaml as well, make things "easier" and automated
* Add badges to the README regarding Github Actions success/failures
* Add test coverage to ensure the software works as intended moving forward
* Do some load testing to see how well this software deals with scale, and document how many resources are needed for each interval (10 PVCs, 100+ PVCs, 500 PVCs)
* Listen/watch for events on the PV/PVC, or listen/read from Prometheus, to monitor and ensure the resizing happens; log and/or Slack it accordingly
* Make per-PVC annotations to (re)direct Slack to different webhooks and/or different channel(s)?
* Discuss what the ideal "default" amount of time before scaling is. Currently 5 minutes (five 60-second intervals)
* Discuss what the ideal "default" scale-up size is, currently 50%. A suggestion has been made to lower this to around 20%
* Add better examples of "best practices" when using Helm (aka, as a subchart)
* Test it and add working examples of using this on other cloud providers (Azure / Google Cloud)
* Auto-detect (or let the user choose) a different provider (eg: AWS/Google) and set different per-provider defaults (eg: wait time, min/max disk size, min disk increment, etc)
* Check if the storage class has ALLOWVOLUMEEXPANSION set to (help?) ensure the expansion will succeed
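The last todo item, checking a storage class for expansion support, could look roughly like the sketch below. Here `sc` is a plain dict standing in for a parsed StorageClass, not the real kubernetes client object, and `storage_class_allows_expansion` is a hypothetical helper, not code from this project:

```python
def storage_class_allows_expansion(sc):
    # `sc` is a simplified dict stand-in for the real StorageClass API
    # object; allowVolumeExpansion defaults to false when absent
    return bool(sc.get("allowVolumeExpansion", False))

expandable = {"metadata": {"name": "gp3"}, "allowVolumeExpansion": True}
legacy = {"metadata": {"name": "legacy"}}
print(storage_class_allows_expansion(expandable))  # True
print(storage_class_allows_expansion(legacy))      # False
```

Resizing a PVC whose storage class lacks this flag fails at the API level, so checking it up front lets the controller warn instead of silently failing.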

# Notes for Development

82 changes: 69 additions & 13 deletions helpers.py
@@ -1,9 +1,13 @@
from os import getenv # Environment variable handling
import time # Sleep/time
import datetime
import requests # For making HTTP requests to Prometheus
import kubernetes # For talking to the Kubernetes API
from kubernetes.client import ApiException
from packaging import version # For checking if prometheus version is new enough to use a new function present_over_time()
import signal # For sigkill handling
import random # Random string generation
import traceback # Debugging/trace outputs

# Used below in init variables
def detectPrometheusURL():
@@ -12,7 +16,7 @@ def detectPrometheusURL():
prometheus_ip_address = getenv('PROMETHEUS_SERVER_SERVICE_HOST')
prometheus_port = getenv('PROMETHEUS_SERVER_SERVICE_PORT_HTTP')

# TODO: If there are other ways to also detect prometheus (eg: try http://prometheus-server) please put them here...?

if not prometheus_ip_address or not prometheus_port:
print("ERROR: PROMETHEUS_URL was not set, and can not auto-detect where prometheus is")
@@ -35,7 +39,7 @@ def detectPrometheusURL():
PROMETHEUS_VERSION = "Unknown" # Used to detect the availability of a new function called present_over_time only available on Prometheus v2.30.0 or newer, this is auto-detected and updated, not set by a user
VERBOSE = True if getenv('VERBOSE', "false").lower() == "true" else False # If we want verbose mode

# This handler helps handle sigint/term gracefully (not in the middle of a runloop)
class GracefulKiller:
kill_now = False
def __init__(self):
@@ -69,7 +73,7 @@ def printHeaderAndConfiguration():
print(" Volume Autoscaler - Configuration ")
print("---------------------------------------------------------------")
print(" Prometheus URL: {}".format(PROMETHEUS_URL))
print(" Prometheus Version: {}{}".format(PROMETHEUS_VERSION, " (upgrade to >= 2.30.0 to prevent some false positives)" if version.parse(PROMETHEUS_VERSION) < version.parse("2.30.0") else ""))
print(" Prometheus Labels: {{{}}}".format(PROMETHEUS_LABEL_MATCH))
print(" Interval to query usage: every {} seconds".format(INTERVAL_TIME))
print(" Scale up after: {} intervals ({} seconds total)".format(SCALE_AFTER_INTERVALS, SCALE_AFTER_INTERVALS * INTERVAL_TIME))
@@ -247,6 +251,14 @@ def convert_pvc_to_simpler_dict(pvc):
return_dict['storage_class'] = pvc.spec.storage_class_name
except:
return_dict['storage_class'] = ""
try:
return_dict['resource_version'] = pvc.metadata.resource_version
except:
return_dict['resource_version'] = ""
try:
return_dict['uid'] = pvc.metadata.uid
except:
return_dict['uid'] = ""

# Set our defaults
return_dict['last_resized_at'] = 0
@@ -324,6 +336,7 @@ def scale_up_pvc(namespace, name, new_size):
print(e)
return False


# Test if prometheus is accessible, and gets the build version so we know which function(s) are available or not, primarily for present_over_time below
def testIfPrometheusIsAccessible(url):
global PROMETHEUS_VERSION
@@ -337,6 +350,7 @@ def testIfPrometheusIsAccessible(url):
print(e)
exit(-1)


# Get a list of PVCs from Prometheus with their metrics of disk usage
def fetch_pvcs_from_prometheus(url, label_match=PROMETHEUS_LABEL_MATCH):

@@ -357,15 +371,57 @@ def fetch_pvcs_from_prometheus(url, label_match=PROMETHEUS_LABEL_MATCH):
return response_object['data']['result']


# Describe a specific PVC
def describe_pvc(namespace, name, simple=False):
api_response = kubernetes_core_api.list_namespaced_persistent_volume_claim(namespace, limit=1, field_selector="metadata.name=" + name, timeout_seconds=HTTP_TIMEOUT)
# print(api_response)
for item in api_response.items:
# If the user wants pre-parsed, making it a bit easier to work with than a huge map of map of maps
if simple:
return convert_pvc_to_simpler_dict(item)
return item
raise Exception("No PVC found for {}:{}".format(namespace,name))


# Convert a PVC to an involved object for Kubernetes events
def get_involved_object_from_pvc(pvc):
return kubernetes.client.V1ObjectReference(
api_version="v1",
kind="PersistentVolumeClaim",
name=pvc.metadata.name,
namespace=pvc.metadata.namespace,
resource_version=pvc.metadata.resource_version,
uid=pvc.metadata.uid,
)

# Send events to Kubernetes. This is used when we modify PVCs
def send_kubernetes_event(namespace, name, reason, message, type="Normal"):

try:
# Lookup our PVC
pvc = describe_pvc(namespace, name)

# Generate our metadata and object relation for this event
involved_object = get_involved_object_from_pvc(pvc)
source = kubernetes.client.V1EventSource(component="volume-autoscaler")
metadata = kubernetes.client.V1ObjectMeta(
namespace=namespace,
name=name + ''.join([random.choice('123456789abcdef') for n in range(16)]),
)

# Generate our event body with the reason and message set
body = kubernetes.client.CoreV1Event(
involved_object=involved_object,
metadata=metadata,
reason=reason,
message=message,
type=type,
source=source,
first_timestamp=datetime.datetime.now(datetime.timezone.utc).isoformat()
)

api_response = kubernetes_core_api.create_namespaced_event(namespace, body, field_manager="volume_autoscaler")
except ApiException as e:
print("Exception when calling CoreV1Api->create_namespaced_event: %s\n" % e)
except:
traceback.print_exc()
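The event metadata in `send_kubernetes_event` above appends a random 16-character suffix to the PVC name so each event gets a unique object name in the namespace. In isolation, that naming scheme looks like this (`unique_event_name` is a hypothetical helper name, not one from the diff):

```python
import random

def unique_event_name(base_name):
    # Mirror the suffix scheme used in send_kubernetes_event: 16 random
    # characters drawn from '123456789abcdef' appended to the PVC name
    suffix = ''.join(random.choice('123456789abcdef') for _ in range(16))
    return base_name + suffix

name = unique_event_name("my-pvc")
print(name)
```

Without a unique suffix, creating a second event for the same PVC would collide on the object name and fail with a 409 Conflict.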
44 changes: 27 additions & 17 deletions main.py
@@ -2,7 +2,7 @@
import os
import time
from helpers import INTERVAL_TIME, PROMETHEUS_URL, DRY_RUN, VERBOSE
from helpers import convert_bytes_to_storage, scale_up_pvc, testIfPrometheusIsAccessible, describe_all_pvcs, send_kubernetes_event
from helpers import fetch_pvcs_from_prometheus, printHeaderAndConfiguration, calculateBytesToScaleTo, GracefulKiller
import slack
import sys, traceback
@@ -143,28 +143,38 @@

# If we aren't dry-run, lets resize
print(" RESIZING disk from {} to {}".format(convert_bytes_to_storage(pvcs_in_kubernetes[volume_description]['volume_size_status_bytes']), convert_bytes_to_storage(resize_to_bytes)))
status_output = "to scale up `{}` by `{}%` from `{}` to `{}`, it was using more than `{}%` disk space over the last `{} seconds`".format(
volume_description,
pvcs_in_kubernetes[volume_description]['scale_up_percent'],
convert_bytes_to_storage(pvcs_in_kubernetes[volume_description]['volume_size_status_bytes']),
convert_bytes_to_storage(resize_to_bytes),
pvcs_in_kubernetes[volume_description]['scale_above_percent'],
IN_MEMORY_STORAGE[volume_description] * INTERVAL_TIME
)
# Send event that we're starting to request a resize
send_kubernetes_event(
name=volume_name, namespace=volume_namespace, reason="VolumeResizeRequested",
message="Requesting {}".format(status_output)
)

if scale_up_pvc(volume_namespace, volume_name, resize_to_bytes):
status_output = "Successfully requested {}".format(status_output)
# Print success to console
print(status_output)
# Intentionally skipping sending an event to Kubernetes on success, the above event is enough for now until we detect if resize succeeded
# Print success to Slack
if slack.SLACK_WEBHOOK_URL and len(slack.SLACK_WEBHOOK_URL) > 0:
slack.send(status_output)
else:
status_output = "FAILED requesting {}".format(status_output)
# Print failure to console
print(status_output)
# Print failure to Kubernetes Events
send_kubernetes_event(
name=volume_name, namespace=volume_namespace, reason="VolumeResizeRequestFailed",
message=status_output, type="Warning"
)
# Print failure to Slack
if slack.SLACK_WEBHOOK_URL and len(slack.SLACK_WEBHOOK_URL) > 0:
slack.send(status_output, severity="error")

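The resize path above calls `calculateBytesToScaleTo`, which is imported but not shown in this diff. A simplified stand-in that grows the current size by `scale_up_percent` (the real helper may also round to provider increments and clamp to configured min/max sizes) might look like:

```python
def simple_scale_target(current_bytes, scale_up_percent):
    # Simplified, hypothetical stand-in for calculateBytesToScaleTo;
    # grows the current volume size by the configured percentage
    return int(current_bytes * (1 + scale_up_percent / 100))

ten_gib = 10 * 1024**3
print(simple_scale_target(ten_gib, 50))  # 16106127360 bytes, i.e. 15GiB
```

With the default 50% scale-up, a 10GiB volume is resized to 15GiB, which matches the "from X to Y" values reported in the Slack/event messages.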
