Updated archiving script #109

Open · wants to merge 4 commits into main
28 changes: 28 additions & 0 deletions scripts/odc-product-archive/README.md
@@ -0,0 +1,28 @@
# Archive ODC Product/Datasets (in beta)

- This repository contains scripts for archiving ODC datasets. According to the
[ODC API reference docs](https://datacube-core.readthedocs.io/en/latest/api/indexed-data/dataset-querying.html), ODC datasets can be archived and later restored.

## WARNING
- ODC datasets are restored by **dataset id**, which means the **dataset ids** must be stored before the archiving process (a restore sketch follows below).
- The script in this repository stores the archived **dataset ids** in a ***CSV file*** named `PRODUCT_NAME_archived.csv`.
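
A minimal restore sketch (hypothetical, not part of this PR) that assumes the `PRODUCT_NAME_archived.csv` written by `archive_odc_product.py` is available locally:

``` python
# restore_odc_product.py (hypothetical file name)
import sys

import datacube
import pandas as pd

PRODUCT_NAME = sys.argv[1]

dc = datacube.Datacube()

# The archive script writes the dataset ids as the last column of the CSV.
df = pd.read_csv(PRODUCT_NAME + "_archived.csv")
dataset_ids = df.iloc[:, -1].astype(str).tolist()

# Mark the datasets as active (not archived) again.
dc.index.datasets.restore(dataset_ids)
print("Restored " + str(len(dataset_ids)) + " datasets")
```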


## Running the archive_product script
- Runs in an ODC environment. [How to configure an ODC environment](https://datacube-core.readthedocs.io/en/latest/installation/setup/ubuntu.html#)
- Input: an existing **PRODUCT_NAME**
- Output: `PRODUCT_NAME_archived.csv` containing the list of archived **dataset ids**
- S3 bucket: `archive-deafrica-datasets`
- ***NOTE***: In `s3.Bucket("archive-deafrica-datasets").upload_file(Filename=PRODUCT_NAME + "_archived.csv", Key="crop-mask/" + PRODUCT_NAME + "_archived.csv")`, the `Key` names the folder that exists inside the S3 bucket, so the file ends up at `s3://archive-deafrica-datasets/crop-mask/PRODUCT_NAME_archived.csv`.

## Environment variables
``` bash
export DB_DATABASE=dbname
export DB_HOSTNAME=localhost
export DB_USERNAME=example
export DB_PASSWORD=secretexample

os.environ["AWS_ACCESS_KEY_ID"] = ""
os.environ["AWS_SECRET_ACCESS_KEY"] = ""
os.environ["AWS_DEFAULT_REGION"] = "af-south-1"
```
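
If these variables are exported in the shell as above, `boto3` can also pick them up automatically through its standard environment-variable credential chain, so passing them explicitly to `boto3.resource` is optional. A minimal sketch:

``` python
import boto3

# boto3 reads AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION
# from the environment when no explicit credentials are passed.
s3 = boto3.resource("s3")
```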
53 changes: 53 additions & 0 deletions scripts/odc-product-archive/archive_odc_product.py
@@ -0,0 +1,53 @@
import datacube
import pandas as pd
import sys
import boto3
import os

dc = datacube.Datacube()

# S3 resource configured from the AWS_* environment variables
s3 = boto3.resource(
    service_name="s3",
    region_name=os.environ["AWS_DEFAULT_REGION"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# list of datasets returned by the index
datasets_list = []

# list of dataset ids to archive
dataset_ids = []

# input product name
PRODUCT_NAME = sys.argv[1]

# search datasets using the product name
try:
    datasets_list = dc.find_datasets(product=PRODUCT_NAME)

    # store the dataset ids
    for dataset in datasets_list:
        dataset_ids.append(dataset.id)
except Exception:
    print("Product name " + PRODUCT_NAME + " does not exist")
    sys.exit(1)

# check whether there is anything to archive
if not dataset_ids:
    print("No datasets to archive")
else:
    df = pd.DataFrame(dataset_ids)
    df.to_csv(PRODUCT_NAME + "_archived.csv")

    # upload the id list to the AWS S3 bucket
    s3.Bucket("archive-deafrica-datasets").upload_file(
        Filename=PRODUCT_NAME + "_archived.csv",
        Key="crop-mask/" + PRODUCT_NAME + "_archived.csv",
Contributor


How can we avoid the specific reference to "crop-mask"?

Author


To avoid referencing the "crop-mask" folder, the key must contain only the file name. For example, if
Key=PRODUCT_NAME + "_archived.csv", the file will be uploaded to the root of the S3 bucket. I created the crop-mask folder to keep related products together in one folder.

    )
    dc.index.datasets.archive(dataset_ids)

print("Done")
16 changes: 16 additions & 0 deletions scripts/odc-product-archive/archive_product.sh
@@ -0,0 +1,16 @@
#!/bin/bash


#check database connection
datacube system check

#for local test
#eval "$(conda shell.bash hook)"
#conda activate odc_env

echo "Product Name"
read PRODUCT_NAME

echo "start archiving datasets"
python3 archive_odc_product.py $PRODUCT_NAME