- The data used is available here. Provide an accessible path to the CSV file in `data_url` in the `config.yaml` file. Please ensure that the file can be downloaded using `curl`. Alternatively, you can provide the URL in the environment variable `DATA_URL`.
- Update the Python environment in the `.env` file.
- Install `poetry` if not already installed.
- Install the dependencies using poetry: `poetry install`
- Update the config and model parameters in the `config.yaml` file.
- Add `./src` to the `PYTHONPATH`: `export PYTHONPATH="${PYTHONPATH}:./src"`
- Run `poetry run python src/main.py`
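Putting the steps above together, a typical local run might look like the following, assuming you are at the repository root and choose to pass the data URL via the environment rather than editing `config.yaml`:

```bash
# Install dependencies into the poetry-managed environment
poetry install

# Make the source package importable
export PYTHONPATH="${PYTHONPATH}:./src"

# Optionally override the data location instead of editing config.yaml
export DATA_URL="https://raw.githubusercontent.com/renjith-digicat/random_file_shares/main/HousingData.csv"

# Run the ingestion entry point
poetry run python src/main.py
```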
The manual steps below are automated by the data ingestion DAG in the DAGs repo.
- Run `dvc init` from the root of the repo to set it up as a DVC repo, if this is not already done.
- Add a DVC remote: `dvc remote add -f <dvc-remote-name> <dvc-remote-path>`
- Add the files that need to be tracked to DVC: `dvc add artefacts/test_data.csv artefacts/train_data.csv artefacts/val_data.csv`
- Add the DVC files to git: `git add artefacts/test_data.csv.dvc artefacts/train_data.csv.dvc artefacts/val_data.csv.dvc`
- Push the data to the DVC remote: `dvc push -r <dvc-remote-name>`
- Git push and tag the repo with the version of the data for future use (see the sketch below).
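The final push-and-tag step has no command listed above; a minimal sketch, assuming the target branch is `main` and using a hypothetical tag name of `data-v1.0.0`:

```bash
# Commit the .dvc pointer files created by `dvc add`
git commit -m "Track new data version with DVC"

# Tag the commit with the data version and push the branch together with the tag
git tag -a data-v1.0.0 -m "Data version 1.0.0"
git push origin main --tags
```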
- Build the docker image: `docker build -t data-ingestion .`
- Run the container with the correct `DATA_URL` and `DVC_REMOTE` set as environment variables (refer to the Environment Variables table below for the complete list): `docker run -e DVC_REMOTE=s3:some/remote -e DATA_URL=https://raw.githubusercontent.com/renjith-digicat/random_file_shares/main/HousingData.csv --rm data-ingestion`
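If the DVC remote requires credentials and a custom S3-compatible endpoint, the remaining variables from the Environment Variables table below can be passed the same way; the bucket path and credentials here are placeholders, not real values:

```bash
docker run --rm \
  -e DATA_URL="https://raw.githubusercontent.com/renjith-digicat/random_file_shares/main/HousingData.csv" \
  -e DVC_REMOTE="s3://my-bucket/dvc-remote" \
  -e DVC_REMOTE_NAME="regression-model-remote" \
  -e DVC_ENDPOINT_URL="http://minio" \
  -e DVC_ACCESS_KEY_ID="<access-key-id>" \
  -e DVC_SECRET_ACCESS_KEY="<secret-access-key>" \
  -e AWS_DEFAULT_REGION="eu-west-2" \
  data-ingestion
```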
- Set up the Kubernetes cluster and the required infrastructure using the Infrastructure repo.
- Access the Airflow UI made available by the above infra repo.
- Update the Airflow variables accordingly.
- Trigger the `data_ingestion_dag`.

Once the DAG execution has completed, the data ingestion repo will be updated with a new data version in the specified branch of the repo.
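If you have shell access to the Airflow deployment, the DAG can also be triggered from the CLI instead of the UI; this assumes the standard `airflow` command is available in the relevant pod:

```bash
# Trigger the ingestion DAG by its dag_id
airflow dags trigger data_ingestion_dag

# Optionally, check the state of recent runs
airflow dags list-runs -d data_ingestion_dag
```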
The following environment variables can be set to configure the training:
| Variable | Default Value | Description |
|---|---|---|
| DATA_URL | https://raw.githubusercontent.com/renjith-digicat/random_file_shares/main/HousingData.csv | URL to the raw CSV data used for training |
| CONFIG_PATH | ./config.yaml | File path to the data cleansing, versioning and other configuration file |
| LOG_LEVEL | INFO | The logging level for the application. Valid values are DEBUG, INFO, WARNING, ERROR, and CRITICAL. |
| DVC_REMOTE | /tmp/test-dvc-remote | A DVC remote path |
| DVC_ENDPOINT_URL | http://minio | The URL endpoint for the DVC storage backend. This is typically the URL of an S3-compatible service, such as MinIO, used to store and manage datasets and model files. |
| DVC_REMOTE_NAME | regression-model-remote | The name for the DVC remote |
| DVC_ACCESS_KEY_ID | None | The access key ID for the DVC remote endpoint URL (default value is embedded in the infra repo) |
| DVC_SECRET_ACCESS_KEY | None | The secret access key for the DVC remote endpoint URL (default value is embedded in the infra repo) |
| AWS_DEFAULT_REGION | eu-west-2 | The DVC remote S3 bucket region |
| GITHUB_USERNAME | None | GitHub username with which new data version files will be pushed to GitHub (default value is embedded in the infra repo) |
| GITHUB_PASSWORD | None | GitHub token for the above username (default value is embedded in the infra repo) |
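When running the pipeline locally rather than in Docker, the same variables can be exported before launching the entry point; values that differ from the documented defaults are placeholders:

```bash
export DATA_URL="https://raw.githubusercontent.com/renjith-digicat/random_file_shares/main/HousingData.csv"
export CONFIG_PATH="./config.yaml"
export LOG_LEVEL="DEBUG"                       # more verbose than the default INFO
export DVC_REMOTE="s3://my-bucket/dvc-remote"  # placeholder bucket path
export DVC_REMOTE_NAME="regression-model-remote"
export DVC_ENDPOINT_URL="http://minio"
export AWS_DEFAULT_REGION="eu-west-2"

poetry run python src/main.py
```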
Ensure that you have the project requirements already set up by following the Data Ingestion and versioning instructions.
- Ensure `pytest` is installed. `poetry install` will install it as a dependency.
- Run the tests with `poetry run pytest ./tests`
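For quicker iteration, standard `pytest` options can be passed through as well; the keyword expression below is only an example:

```bash
# Verbose output for the whole suite
poetry run pytest ./tests -v

# Run only tests whose names match a keyword expression
poetry run pytest ./tests -k "ingestion"
```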