DVC
Create a new conda environment and activate it:
conda create --name dvc python=3.8.2 -y
conda activate dvc
conda install dvc scikit-learn scikit-image pandas numpy
# Or
pip install dvc scikit-learn scikit-image pandas numpy
git clone [email protected]:shamik-biswas-rft/data-version-control-tutorial.git
curl -O https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
# Or download directly from
# https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
tar -xzvf imagenette2-160.tgz
mv imagenette2-160/train data/raw/train
mv imagenette2-160/val data/raw/val
rm -rf imagenette2-160
rm imagenette2-160.tgz
git checkout -b dev
Note: initialisation HAS TO BE done in the top-level (parent) directory of the repo.
dvc init
dvc config core.analytics false
Large files and data must be stored remotely (or in a separate local directory), so the remote storage path must be registered with DVC:
dvc remote add -d remote_storage path/to/your/dvc_remote
dvc add data/raw/train data/raw/val
The commands above put the folders under DVC control and add the .dvc pointer files and .gitignore entries to Git; this way the data itself is stored by DVC, away from Git.
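What Git ends up tracking is only a small `.dvc` pointer file per dataset. The sketch below shows the rough shape of such a pointer (the `outs`/`md5`/`path` field names are real DVC conventions, but the hash value is made up) and a minimal hand-rolled parse of it, since the standard library has no YAML parser:

```python
# Illustrative shape of a .dvc pointer file created by `dvc add`
# (field names follow DVC conventions; the hash value here is invented).
POINTER = """\
outs:
- md5: 1f2d3c4b5a6978695a4b3c2d1e0f9a8b.dir
  path: train
"""

def parse_pointer(text):
    """Minimal line-based extraction of the md5 and path fields."""
    fields = {}
    for line in text.splitlines():
        line = line.strip().lstrip("- ")
        key, _, value = line.partition(":")
        if key in ("md5", "path"):
            fields[key] = value.strip()
    return fields

fields = parse_pointer(POINTER)
```

Because the pointer is tiny and text-based, Git can version it cheaply while DVC moves the actual gigabytes to the cache or remote.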
One can add directly to the remote location instead of adding the files to cache.
dvc add --to-remote data/raw/train
dvc add --to-remote data/raw/val
dvc remove data/raw/train.dvc data/raw/val.dvc
git add --all
git commit -m "whatever message"
dvc push
git push -u origin dev
Since there was no dev branch on the remote, the last command uploads the branch and sets it up to be tracked from the remote.
If you have accidentally removed the data, but it had already been added to DVC and the cache isn't empty (check with <ls .dvc/cache>), it can be restored by
dvc checkout <filename or directory path>
dvc checkout data/raw/val.dvc
dvc checkout
dvc fetch data/raw/val.dvc
dvc fetch
To do the above in a single command, i.e. pull from the remote into the cache and then into the local workspace:
dvc pull
dvc pull <file/dir>
After performing a set of experiments that create multiple new files, all the code files and model/data files must be pushed to the remotes:
git push
dvc push
If there are changes to an already tracked file, then we can commit the necessary files:
dvc commit
dvc commit is different from git commit: it will not overwrite the existing cached file but will create a new one instead.
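The reason nothing is overwritten is that DVC's cache is content-addressed: cached files are named after the MD5 hash of their contents, so a changed file simply gets a new cache entry alongside the old one. A minimal sketch of the idea, using only `hashlib` (the helper name is ours, not DVC's):

```python
import hashlib

def cache_key(data: bytes) -> str:
    """Content address: DVC names cache entries after the MD5 of their contents."""
    return hashlib.md5(data).hexdigest()

v1 = cache_key(b"label,score\ncat,0.91\n")
v2 = cache_key(b"label,score\ncat,0.95\n")
# Different contents hash to different keys, so the old version of a file
# stays in the cache instead of being overwritten by the new one.
```

This is also what makes `dvc checkout` against an old commit possible: the pointer file names a hash, and that hash still resolves to the old bytes in the cache.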
All tasks, such as preparing the data, training the model and running the evaluation, can be fused into a pipeline for easy execution.
A pipeline is built with dvc run, and each stage has three parts:
- Inputs
- Outputs
- Command
DVC uses the term dependencies for inputs and outs for outputs.
The code below creates the stage that prepares the train and test datasets:
dvc run -n prepare \
-d src/prepare.py -d data/raw \
-o data/prepared/train.csv -o data/prepared/test.csv \
python src/prepare.py
The -n switch gives the stage a name.
The -d switch passes the dependencies to the command.
The -o switch defines the outputs of the command.
Line 1: You want to run a pipeline stage and call it prepare.
Line 2: The pipeline needs the prepare.py file and the data/raw folder.
Line 3: The pipeline will produce the train.csv and test.csv files.
Line 4: The command to execute is python src/prepare.py.
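The tutorial does not show the contents of src/prepare.py, but a minimal sketch of what such a script might do is below: collect (filename, label) pairs from the class folders under data/raw, shuffle them deterministically, and write the train.csv and test.csv outputs the stage declares. The glob pattern, column names and split ratio are assumptions, not the tutorial's actual code:

```python
import csv
import random
from pathlib import Path

def split_records(records, test_ratio=0.2, seed=42):
    """Shuffle records deterministically and split into (train, test) lists."""
    records = list(records)
    random.Random(seed).shuffle(records)
    n_test = int(len(records) * test_ratio)
    return records[n_test:], records[:n_test]

def write_csv(path, rows):
    """Write (filename, label) rows with a header, creating parent dirs."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "label"])
        writer.writerows(rows)

if __name__ == "__main__":
    # Hypothetical layout: data/raw/<split>/<class>/<image>.JPEG,
    # where the class folder name serves as the label.
    raw = Path("data/raw")
    records = [(str(p), p.parent.name) for p in raw.glob("*/*/*.JPEG")]
    train, test = split_records(records)
    write_csv(Path("data/prepared/train.csv"), train)
    write_csv(Path("data/prepared/test.csv"), test)
```

Because the stage declares src/prepare.py and data/raw as dependencies, DVC reruns it only when either changes.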
dvc remove <stage name>
After creating each pipeline stage, the files tracked by Git will change, and those changes must be committed before proceeding with a new stage.
dvc run -n evaluate \
-d src/evaluate.py -d model/model.joblib \
-M metrics/accuracy.json \
python src/evaluate.py
dvc metrics show
dvc commit
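The -M switch marks metrics/accuracy.json as a metrics file, which is what `dvc metrics show` and `dvc metrics diff` read. The tutorial's src/evaluate.py is not shown; the sketch below only illustrates the ending such a script might have (computing an accuracy and writing the JSON), with the model loading replaced by placeholder predictions:

```python
import json
from pathlib import Path

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

def write_metrics(path, acc):
    """Write the JSON file that `dvc metrics show` reads."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"accuracy": acc}))

if __name__ == "__main__":
    # In the real script these would come from model/model.joblib
    # and the prepared test set; here they are placeholders.
    y_true = [0, 1, 1, 0]
    y_pred = [0, 1, 0, 0]
    write_metrics(Path("metrics/accuracy.json"), accuracy(y_true, y_pred))
```

Keeping the metric in a small JSON file is what lets DVC diff it across commits, as the metrics commands below demonstrate.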
dvc push
After changing the model and the training script in which the model is instantiated, reproduce the pipeline by
dvc repro
dvc metrics diff
dvc metrics diff --md --targets metrics/accuracy.json -- 542813b bc9ddb2
dvc metrics show -T
dvc metrics show nlp/resources/evaluation/output/dvc_gold_metrics.json nlp/resources/evaluation/output/dvc_test_metrics.json -T -q
dvc metrics show nlp/resources/evaluation/output/dvc_gold_metrics.json nlp/resources/evaluation/output/dvc_test_metrics.json -A -q
dvc commit
dvc push
To add or modify dependencies and outputs of an already existing stage without executing it, use the --no-exec flag:
dvc run -n prepare \
-f --no-exec \
-d src/prepare.py \
-d data/raw.csv \
-o data/train \
-o data/validate \
python src/prepare.py data/raw.csv
dvc commit
If you installed DVC via pip and plan to use cloud services as remote storage, you might need to install these optional dependencies: [s3], [azure], [gdrive], [gs], [oss], [ssh]. Alternatively, use [all] to include them all.
In our case we need s3, therefore
pip install "dvc[s3]"
# Or
conda install -c conda-forge dvc-s3
dvc remote add -d s3_remote s3://summ-models/dvc/
The remote URI can be retrieved from AWS.
To access S3 we further need to add the access key and secret access key in the following manner:
dvc remote modify s3_remote access_key_id <AWS_ACCESS_KEY_ID>
dvc remote modify s3_remote secret_access_key <AWS_SECRET_ACCESS_KEY>
dvc push
Alternatively:
- Install the AWS CLI
- Use credentialpath instead of the access keys in DVC
dvc remote modify s3_remote credentialpath ~/.aws/credentials
Before pushing the changes to Git, remove the remote access keys; each user needs to add them when setting up the project:
dvc remote modify s3_remote access_key_id None
dvc remote modify s3_remote secret_access_key None