shamik edited this page Apr 25, 2024 · 1 revision

DVC

User Guide Tutorial

Installation

Creating a new conda env and activating it

conda create --name dvc python=3.8.2 -y
conda activate dvc 
conda install dvc scikit-learn scikit-image pandas numpy 

Installing with pip

pip install dvc scikit-learn scikit-image pandas numpy

Cloning an already existing repo for the dvc tutorial

git clone [email protected]:shamik-biswas-rft/data-version-control-tutorial.git

Downloading the data

curl -O https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
# or download it directly in the browser from
# https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz

Extract and move the train and validation data under the <data/raw> path in the repo

tar -xzvf imagenette2-160.tgz 
mv imagenette2-160/train data/raw/train 
mv imagenette2-160/val data/raw/val

Removing the unnecessary files

rm -rf imagenette2-160 
rm imagenette2-160.tgz

Create a new branch in the repo and get started 

git checkout -b dev

Initialising dvc in the root folder of the repo.

Note: Initialisation HAS TO BE run in the ROOT DIR of the REPO (the same folder that contains .git)

dvc init

Disable collection of usage analytics by DVC

dvc config core.analytics false

Large files and data must be stored outside git. DVC can be configured to store them in a different directory or in remote storage, so that path must be added as a DVC remote

dvc remote add -d remote_storage path/to/your/dvc_remote
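For reference, these settings (together with the analytics option above) end up in the plain-text .dvc/config file, which git tracks; it should look roughly like this:

```ini
[core]
    remote = remote_storage
    analytics = false
['remote "remote_storage"']
    url = path/to/your/dvc_remote
```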

Adding data to be tracked with DVC, which adds the files to the cache

dvc add data/raw/train data/raw/val

Doing the above puts the folders under DVC control and adds the corresponding .dvc and .gitignore files to git; this way the data itself is kept out of git and managed by DVC.
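For reference, each generated .dvc file is a small YAML pointer into the DVC cache; for a tracked directory it looks roughly like this (the hash is an illustrative placeholder, and the exact fields vary by DVC version):

```yaml
outs:
- md5: 1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c.dir
  path: train
```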

One can add directly to the remote location instead of adding the files to the cache.

dvc add --to-remote data/raw/train 
dvc add --to-remote data/raw/val

To undo dvc add, use dvc remove; for the above example it will be

dvc remove data/raw/train.dvc data/raw/val.dvc 

Staging all the changes to git

git add --all

Commit the changes

git commit -m "whatever message"

Uploading files from cache to remote storage

dvc push

Pushing the changes to git; this doesn't push the data

git push -u origin dev

Since there was no dev branch in the remote, this command uploads the branch and sets it to be tracked from the remote

Restoring validation data

If you have accidentally removed the data, but it had already been added to DVC and the cache isn't empty (which can be checked with <ls .dvc/cache>), then it can be restored by

dvc checkout <filename or directory path> 
dvc checkout data/raw/val.dvc 

To restore everything in the entire repo

dvc checkout

If the cache is empty and the repo has just been cloned from git, then the data can be downloaded by

dvc fetch data/raw/val.dvc

To retrieve all data for the entire repo

dvc fetch 

To do the above in a single command, i.e. pull from the remote into the cache and check out into the local workspace

dvc pull

To pull a single folder or file, mention it after dvc pull

dvc pull <file/dir>

Pushing changes to remote

After performing a set of experiments that has created multiple new files, all the code files and model/data files must be pushed to the remotes by

git push 
dvc push 

Commit DVC

If there are changes to already tracked files, then we can commit the necessary files

dvc commit

DVC commit is different from git commit, as it will not overwrite the existing cached file but will store a new version alongside it

DVC pipeline

All tasks such as preparing the data, training the model and running the evaluation can be fused into a pipeline for easy execution.

A pipeline is built and executed with dvc run, and each stage has 3 parts

  • Inputs
  • Outputs
  • Command

DVC uses the term dependencies for inputs and outs for outputs.

The below code is for preparing the train and test datasets 

dvc run -n prepare \ 
-d src/prepare.py -d data/raw \ 
-o data/prepared/train.csv -o data/prepared/test.csv \ 
python src/prepare.py 

The -n switch gives the stage a name.
The -d switch passes the dependencies to the command.
The -o switch defines the outputs of the command.
Line 1: You want to run a pipeline stage and call it prepare.
Line 2: The pipeline needs the prepare.py file and the data/raw folder.
Line 3: The pipeline will produce the train.csv and test.csv files.
Line 4: The command to execute is python src/prepare.py.
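Besides running the stage, the command above also records it in a dvc.yaml file that git tracks; for the prepare stage the generated entry should look roughly like this:

```yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - src/prepare.py
    - data/raw
    outs:
    - data/prepared/train.csv
    - data/prepared/test.csv
```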

Remove Stage

To remove a stage in the pipeline

dvc remove <stage name>

Post creation of each pipeline stage, the files tracked by git (dvc.yaml, dvc.lock and .gitignore) will change, and the changes must be committed before proceeding with a new stage.

For the evaluation pipeline there is a different switch for the outputs: -M instead of -o

dvc run -n evaluate \ 
-d src/evaluate.py -d model/model.joblib \ 
-M metrics/accuracy.json \ 
python src/evaluate.py
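In the generated dvc.yaml, the -M switch records the file under a metrics section with caching disabled, roughly:

```yaml
  evaluate:
    cmd: python src/evaluate.py
    deps:
    - src/evaluate.py
    - model/model.joblib
    metrics:
    - metrics/accuracy.json:
        cache: false
```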

This tells DVC that the file is a metrics file, and the metrics can be displayed with

dvc metrics show 

After running the pipeline, commit all the changes and push them to the DVC remote

dvc commit 
dvc push

Running the pipeline with a different model

After changing the model and the training script, wherein the model is instantiated, reproduce the pipeline by

dvc repro

Compare the metrics between commits

dvc metrics diff 
dvc metrics diff --md --targets metrics/accuracy.json -- 542813b bc9ddb2 

Show the metrics for all the models by git tags

dvc metrics show -T

The same as above but for specific metrics files

dvc metrics show nlp/resources/evaluation/output/dvc_gold_metrics.json nlp/resources/evaluation/output/dvc_test_metrics.json -T -q

The same as above but for specific metrics with their commits

dvc metrics show nlp/resources/evaluation/output/dvc_gold_metrics.json nlp/resources/evaluation/output/dvc_test_metrics.json -A -q

After running the pipeline, commit all the changes and push them to the DVC remote

dvc commit 
dvc push

Add/modify dependencies and outputs of the already existing pipeline without executing it by passing the flag --no-exec

Dependencies

Adding raw.csv as dependency and validate as output

dvc run -n prepare \ 
-f --no-exec \ 
-d src/prepare.py \ 
-d data/raw.csv \ 
-o data/train \ 
-o data/validate \ 
python src/prepare.py data/raw.csv 
dvc commit

Setting up remote storage and installing the optional dependencies for each storage type

Remote Storage setup

If you installed DVC via pip and plan to use cloud services as remote storage, you might need to install these optional dependencies: [s3], [azure], [gdrive], [gs], [oss], [ssh]. Alternatively, use [all] to include them all.

In our case we need S3, therefore pip install "dvc[s3]" or conda install -c conda-forge dvc-s3

To be able to use an S3 bucket, one needs to add it as the remote

dvc remote add -d s3_remote s3://summ-models/dvc/ 

The remote URI can be retrieved from AWS

For accessing S3 we further need to add the access key and secret access key in the following manner

dvc remote modify s3_remote access_key_id <AWS_ACCESS_KEY_ID> 
dvc remote modify s3_remote secret_access_key <AWS_SECRET_ACCESS_KEY> 
dvc push 
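For reference, these commands write the remote settings into the DVC config, roughly as below (with dvc remote modify --local the keys go into .dvc/config.local instead, which is gitignored, keeping them out of git):

```ini
['remote "s3_remote"']
    url = s3://summ-models/dvc/
    access_key_id = <AWS_ACCESS_KEY_ID>
    secret_access_key = <AWS_SECRET_ACCESS_KEY>
```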

EASIER WAY TO PULL and PUSH CHANGES THROUGH AWS CLI

  • Install AWS CLI
  • Use the credentialpath instead of the access keys in dvc
dvc remote modify s3_remote credentialpath ~/.aws/credentials

Before pushing the changes to git, remove the remote access keys; they need to be added by each user when setting up the project

dvc remote modify s3_remote access_key_id None 
dvc remote modify s3_remote secret_access_key None