This folder provides a simple prototype demonstrating online machine learning on streaming data: it trains on, and predicts, performance-bottleneck events based on feature-engineered performance traces. For this purpose, we use scikit-multiflow and implement an Adaptive Random Forest classifier that is updated for each event in the data stream.
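Updating the model for each event follows the usual test-then-train (prequential) pattern: each arriving record is first scored by the current model and then used to update it. A minimal sketch of that loop, using a trivial majority-class learner as a stand-in for scikit-multiflow's AdaptiveRandomForestClassifier (the stand-in class and the toy stream below are illustrative, not the project's code):

```python
from collections import Counter

class MajorityClassLearner:
    """Illustrative stand-in for an online model such as
    scikit-multiflow's AdaptiveRandomForestClassifier."""
    def __init__(self):
        self.counts = Counter()

    def predict(self, features):
        # Predict the most frequent label seen so far (None before any update).
        return self.counts.most_common(1)[0][0] if self.counts else None

    def partial_fit(self, features, label):
        # Incremental update -- the hook scikit-multiflow classifiers expose.
        self.counts[label] += 1

def prequential(stream, model):
    """Test-then-train: score each event before learning from it."""
    correct = total = 0
    for features, label in stream:
        if model.predict(features) == label:
            correct += 1
        total += 1
        model.partial_fit(features, label)
    return correct / total

# Toy stream of (features, bottleneck-event label) pairs.
stream = [({"cpu": 0.9}, "bottleneck"), ({"cpu": 0.2}, "ok"),
          ({"cpu": 0.8}, "bottleneck"), ({"cpu": 0.85}, "bottleneck")]
print(prequential(stream, MajorityClassLearner()))  # prequential accuracy
```

In the real pipeline, the stand-in is replaced by the Adaptive Random Forest, which exposes the same `partial_fit`/`predict` interface.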
The incoming data stream is serialized in Apache Avro, with schemas registered at a Schema Registry. As Kafka Streams is not available for Python, we use Faust to consume hopping windows of those performance traces. Data are deserialized using the Dataclasses Avro Schema Generator and the Python Schema Registry Client.
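Hopping windows are fixed-size windows that advance by a step smaller than their size, so consecutive windows overlap and each event belongs to several of them. Faust handles this internally (via hopping tables), but the assignment logic can be sketched as follows; the window size and step values here are illustrative:

```python
def hopping_windows(timestamp, size, step):
    """Return the [start, end) ranges of every hopping window
    that contains the given event timestamp."""
    # Windows start on multiples of `step`; the earliest window that can
    # still contain the timestamp starts at most `size - step` before
    # the step-aligned position of the event.
    first_start = (int(timestamp) // step) * step - size + step
    return [(start, start + size)
            for start in range(max(first_start, 0), int(timestamp) + 1, step)
            if start <= timestamp < start + size]

# An event at t=25 with 20 s windows hopping every 10 s
# falls into the two overlapping windows [10, 30) and [20, 40).
print(hopping_windows(25, size=20, step=10))
```

This overlap is why a single performance trace contributes to multiple windowed feature aggregates downstream.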
The easiest way to run the project is with Python 3 and/or Docker Compose.
To generate test data, have a look at the integration test.
Please make sure to have the following technologies installed:
- Docker (>= 3.2, so that Docker Compose supports profiles)
To build our app, run:
docker-compose build
To start all default containers, that is, Zookeeper, Kafka, Schema Registry and this project, run the following command:
docker-compose --profile ml --profile infrastructure up
To stop and remove everything, we recommend using the following command to prevent future errors with Apache Kafka:
docker-compose rm -sfv
Please make sure to have Python 3 and pip installed.
Start the dependent services (Zookeeper, Kafka, and the Schema Registry), either natively or using Docker Compose:
docker-compose --profile infrastructure up
Install dependencies:
pip3 install -e .
Export settings:
export SIMPLE_SETTINGS=settings
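simple-settings resolves `SIMPLE_SETTINGS=settings` to a `settings.py` module on the Python path whose upper-case attributes become the configuration. A hedged sketch of what such a module might contain; all names and values below are illustrative assumptions, not the project's actual settings:

```python
# settings.py -- illustrative only; the real module ships with the project.
# simple-settings only picks up upper-case attributes.
KAFKA_BOOTSTRAP_SERVERS = "kafka://localhost:9092"  # assumed broker address
SCHEMA_REGISTRY_URL = "http://localhost:8081"       # assumed registry address
WINDOW_SIZE_SECONDS = 20                            # assumed hopping-window size
WINDOW_STEP_SECONDS = 10                            # assumed hopping-window step
```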
Run using Faust:
faust -A pg_streaming_machine_learning.app worker -l INFO