DVC
Create a new conda environment and activate it:
conda create --name dvc python=3.8.2 -y
conda activate dvc
conda install dvc scikit-learn scikit-image pandas numpy
# Or, with pip
pip install dvc scikit-learn scikit-image pandas numpy
git clone git@github.com:shamik-biswas-rft/data-version-control-tutorial.git
curl -O https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
# Or download it directly from
https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
tar -xzvf imagenette2-160.tgz
mv imagenette2-160/train data/raw/train
mv imagenette2-160/val data/raw/val
rm -rf imagenette2-160
rm imagenette2-160.tgz
git checkout -b dev
Note: dvc init must be run from the root directory of the git repository.
dvc init
dvc config core.analytics false
Large files and datasets must be stored outside git. DVC has to be told where to keep them, either a local directory or a remote storage path:
dvc remote add -d remote_storage path/to/your/dvc_remote
dvc add data/raw/train data/raw/val
This puts the folders under DVC control and creates .dvc and .gitignore files for git to track; the data itself is kept out of git and managed by DVC.
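For reference, a generated .dvc file is a small YAML stub that records the hash and location of the tracked data. It looks roughly like this (the hash, size, and file count below are made-up illustrative values, not taken from the repo):

```yaml
# data/raw/train.dvc (illustrative values)
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d.dir
  size: 98304000
  nfiles: 9469
  path: train
```

Because only this tiny stub is committed to git, the history stays small while the data remains reproducible.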
One can also transfer the files straight to the remote storage instead of copying them into the local cache:
dvc add --to-remote data/raw/train
dvc add --to-remote data/raw/val
To stop tracking them again, remove the corresponding .dvc files:
dvc remove data/raw/train.dvc data/raw/val.dvc
git add --all
git commit -m "whatever message"
dvc push
git push -u origin dev
Since there was no dev branch on the remote, this command uploads the branch and sets it up to be tracked from the remote.
If you have accidentally removed the data, but it had already been added to DVC and the cache is not empty (check with ls .dvc/cache), it can be restored with:
dvc checkout <file or directory .dvc target>
dvc checkout data/raw/val.dvc
dvc checkout
# If the cache is empty, first download the data from the remote into the cache
dvc fetch data/raw/val.dvc
dvc fetch
To do both in a single command, i.e. pull from the remote into the cache and check the files out into the workspace:
dvc pull
dvc pull <file/dir>
After performing a set of experiments that created new files, both the code (via git) and the model/data files (via DVC) must be pushed to their remotes:
git push
dvc push
If an already tracked file has changed, record the new version with:
dvc commit
dvc commit differs from git commit: it does not overwrite the cached copy of a file but stores the new version alongside the old one in the cache.
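The idea behind this can be sketched outside DVC: the cache is content-addressed, so a modified file hashes to a new value and therefore gets a new cache entry next to the old one. The directory name and hashing scheme below are illustrative, not DVC's exact cache layout:

```shell
# Minimal sketch of a content-addressed cache (not DVC itself):
# each version of a file is stored under its own content hash,
# so committing a changed file adds an entry instead of overwriting.
mkdir -p demo_cache
echo "version 1" > data.txt
h1=$(md5sum data.txt | cut -d' ' -f1)
cp data.txt "demo_cache/$h1"

echo "version 2" > data.txt
h2=$(md5sum data.txt | cut -d' ' -f1)
cp data.txt "demo_cache/$h2"

ls demo_cache | wc -l   # two entries: both versions survive
```

This is why older versions of data and models stay recoverable after a dvc commit.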
All tasks, such as preparing the data, training the model, and running the evaluation, can be combined into a pipeline for easy execution.
A pipeline stage is created with dvc run, and each stage has three parts:
- Inputs
- Outputs
- Command
DVC uses the term dependencies for inputs and outs for outputs.
The following creates a stage that prepares the train and test datasets:
dvc run -n prepare \
-d src/prepare.py -d data/raw \
-o data/prepared/train.csv -o data/prepared/test.csv \
python src/prepare.py
The -n switch gives the stage a name.
The -d switch passes the dependencies to the command.
The -o switch defines the outputs of the command.
Line 1: You want to run a pipeline stage and call it prepare.
Line 2: The pipeline needs the prepare.py file and the data/raw folder.
Line 3: The pipeline will produce the train.csv and test.csv files.
Line 4: The command to execute is python src/prepare.py.
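The dvc run command above records the stage in dvc.yaml. The generated entry would look roughly like this (reconstructed from the flags shown, not copied from the repo):

```yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - data/raw
    - src/prepare.py
    outs:
    - data/prepared/test.csv
    - data/prepared/train.csv
```

Git tracks this file, so the pipeline definition itself is versioned alongside the code.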
A stage can be deleted with
dvc remove <stage name>
After creating each stage, the files tracked by git (such as dvc.yaml and .gitignore) change, and these changes must be committed before adding a new stage.
dvc run -n evaluate \
-d src/evaluate.py -d model/model.joblib \
-M metrics/accuracy.json \
python src/evaluate.py
dvc metrics show
dvc commit
dvc push
After changing the model and the training script in which the model is instantiated, reproduce the pipeline with:
dvc repro
dvc metrics diff
dvc metrics diff --md --targets metrics/accuracy.json -- 542813b bc9ddb2
dvc metrics show -T
dvc metrics show nlp/resources/evaluation/output/dvc_gold_metrics.json nlp/resources/evaluation/output/dvc_test_metrics.json -T -q
dvc metrics show nlp/resources/evaluation/output/dvc_gold_metrics.json nlp/resources/evaluation/output/dvc_test_metrics.json -A -q
dvc commit
dvc push
To add or modify dependencies and outputs of an existing stage without executing it, overwrite the stage with the -f flag and skip execution with --no-exec:
dvc run -n prepare \
-f --no-exec \
-d src/prepare.py \
-d data/raw.csv \
-o data/train \
-o data/validate \
python src/prepare.py data/raw.csv
dvc commit
If you installed DVC via pip and plan to use cloud services as remote storage, you might need to install these optional dependencies: [s3], [azure], [gdrive], [gs], [oss], [ssh]. Alternatively, use [all] to include them all.
In our case we need s3, therefore
pip install "dvc[s3]"
# Or
conda install -c conda-forge dvc-s3
dvc remote add -d s3_remote s3://summ-models/dvc/
The remote URI can be retrieved from AWS
To access S3, further add the access key and secret access key as follows:
dvc remote modify s3_remote access_key_id <AWS_ACCESS_KEY_ID>
dvc remote modify s3_remote secret_access_key <AWS_SECRET_ACCESS_KEY>
dvc push
Alternatively:
- Install the AWS CLI and configure it with your credentials
- Point DVC at the credentials file instead of storing the keys in its config
dvc remote modify s3_remote credentialpath ~/.aws/credentials
Before pushing the changes to git, remove the access keys from the DVC config; each user must add their own keys when setting up the project:
dvc remote modify s3_remote access_key_id None
dvc remote modify s3_remote secret_access_key None
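Another option, assuming the installed DVC version supports the --local flag on dvc remote modify, is to write the keys to .dvc/config.local, which DVC keeps out of git automatically, so nothing has to be removed before pushing:

```shell
dvc remote modify --local s3_remote access_key_id <AWS_ACCESS_KEY_ID>
dvc remote modify --local s3_remote secret_access_key <AWS_SECRET_ACCESS_KEY>
```

Each user then runs these two commands once with their own credentials after cloning the project.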