Three datasets not in Wasabi #2012

Closed
pli888 opened this issue Aug 20, 2024 · 4 comments

Comments

@pli888
Member

pli888 commented Aug 20, 2024

Datasets 100157 (9.6 TB; #1271), 100242 (4.3 TB) and 100443 (3.7 TB) are not in Wasabi because it was agreed to keep them in cold storage in Tencent, as they receive few or no downloads.

We have been asked to stop using Tencent. Therefore, these 3 datasets will be copied to Wasabi.

@pli888
Member Author

pli888 commented Feb 20, 2025

All files in dataset 100157 in the Tencent backup have been transferred to AWS Glacier:

# Check total size of files in the Tencent backup
$ rclone size cos:cngbdb-share-backup-2-1255501786/cngbdb/giga/gigadb/pub/10.5524/100001_101000/100157/
Total objects: 396
Total size: 9.522 TiB (10470025946289 Byte)

# Check total size of files in Glacier
$ rclone --s3-profile aws-transfer size gigadb-datasetfiles:gigadb-datasetfiles-backup/live/pub/10.5524/100001_101000/100157/
Total objects: 396
Total size: 9.522 TiB (10470025946289 Byte)
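
The two totals above were compared by eye. As a rough sketch, the same check could be scripted using the `--json` output of `rclone size`. The helper names `size_bytes` and `same_size` and the `SRC_REMOTE:path` / `DST_REMOTE:path` placeholders are assumptions for illustration, not part of the actual transfer procedure:

```shell
#!/bin/sh
# Sketch only: helper names and remote placeholders are hypothetical.

# Read `rclone size --json` output on stdin and print the value of "bytes".
size_bytes() {
    sed -n 's/.*"bytes":[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Succeed only when the two byte counts are identical.
same_size() {
    [ "$1" = "$2" ]
}

# Intended use (not run here, since it needs configured remotes):
#   src=$(rclone size --json SRC_REMOTE:path | size_bytes)
#   dst=$(rclone size --json DST_REMOTE:path | size_bytes)
#   same_size "$src" "$dst" && echo "sizes match"
```

Matching totals only show that the overall byte counts agree; they do not prove that each individual file is identical.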

Confirm that every dataset 100157 file in the Tencent backup is also present in AWS Glacier:

# Get list of files in Glacier
$ rclone --s3-profile aws-transfer lsf -R --files-only gigadb-datasetfiles:gigadb-datasetfiles-backup/live/pub/10.5524/100001_101000/100157/ | sort > glacier_source_100157

# Get list of files in the Tencent backup
$ rclone lsf -R --files-only cos:cngbdb-share-backup-2-1255501786/cngbdb/giga/gigadb/pub/10.5524/100001_101000/100157/ | sort > cos_source_100157

# Compare file lists; no output means every file in the Tencent backup is also in Glacier
$ comm -23 cos_source_100157 glacier_source_100157

Dataset 100157 has been removed from the Tencent backup.
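
Note that `comm -23` only reports names that are in the Tencent listing and missing from the Glacier listing. A minimal sketch of checking both directions is shown below. The function names are hypothetical, and it compares file names only, not sizes or checksums:

```shell
#!/bin/sh
# Sketch only: function names are hypothetical helpers.
# Both arguments must be sorted listings, e.g. produced with `rclone lsf ... | sort`.

# Print names present in the first listing but absent from the second.
only_in_first() {
    comm -23 "$1" "$2"
}

# Print names present in the second listing but absent from the first.
only_in_second() {
    comm -13 "$1" "$2"
}

# Succeed only when neither listing has a name the other lacks.
lists_match() {
    [ -z "$(only_in_first "$1" "$2")" ] && [ -z "$(only_in_second "$1" "$2")" ]
}
```

For example, `lists_match cos_source_100157 glacier_source_100157 && echo "listings match"` would print a message only when the two listings contain exactly the same names.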

@pli888
Member Author

pli888 commented Feb 20, 2025

Files in dataset 100443 from the Tencent backup have been transferred to AWS Glacier. However, the files in Glacier are not identical to those in the Tencent backup because the contents of this dataset have been changed.

# List dataset 100443 files in Tencent backup
$ rclone tree --size --human-readable cos:cngbdb-share-backup-2-1255501786/cngbdb/giga/gigadb/pub/10.5524/100001_101000/100443/
[7.1T]  /
├── [1.3K]  100443.md5
├── [4.5G]  Annotations
│   ├── [1.6G]  Transcriptome_GTFs
│   │   ├── [809M]  Homo_sapiens.GRCh37.75.gtf
│   │   ├── [809M]  Homo_sapiens.GRCh37.75.noSelenocysteine.gtf
│   │   └── [383K]  Homo_sapiens.GRCh37.75.noSelenocysteine.rRNA.MTrRNA.MTtRNA.gtf
│   └── [2.9G]  genome.fa
├── [7.1T]  SequencingData
│   ├── [6.9T]  VPC
│   │   ├── [3.5T]  CuffLINKS
│   │   │   └── [3.5T]  CuffLINKS.tar
│   │   ├── [3.5T]  CuffLINKS.tar.gz
│   │   ├── [1.0G]  CuffMERGE
│   │   │   └── [1.0G]  merged.gtf
│   │   ├── [970M]  CuffNORM
│   │   │   └── [970M]  CuffNORM.tar.gz
│   │   └── [4.3G]  CuffQUANT
│   │       └── [4.3G]  CuffQUANT.tar.gz
│   └── [129G]  WCM
│       ├── [127G]  CuffLINKS
│       │   └── [127G]  CuffLINKS.tar.gz
│       ├── [816M]  CuffMERGE
│       │   └── [816M]  merged.gtf
│       ├── [380M]  CuffNORM
│       │   └── [380M]  CuffNORM.tar.gz
│       └── [1.1G]  CuffQUANT
│           └── [1.1G]  CuffQUANT.tar.gz
├── [ 21M]  images.tar.gz
├── [124K]  prostate.png
└── [ 11K]  readme.txt

13 directories, 17 files

Identify file differences between Glacier and the Tencent backup:

$ rclone --s3-profile aws-transfer lsf -R --files-only gigadb-datasetfiles:gigadb-datasetfiles-backup/live/pub/10.5524/100001_101000/100443/ | sort > glacier_source_100443

$ rclone lsf -R --files-only cos:cngbdb-share-backup-2-1255501786/cngbdb/giga/gigadb/pub/10.5524/100001_101000/100443/ | sort > cos_source_100443

# Identify files in the Tencent backup that are not in Glacier
$ comm -23 cos_source_100443 glacier_source_100443
SequencingData/VPC/CuffLINKS.tar.gz
SequencingData/VPC/CuffLINKS/CuffLINKS.tar

Two files were not transferred into Glacier:

  • SequencingData/VPC/CuffLINKS.tar.gz - this is not required [see Chris email 25 Nov 2024]
  • SequencingData/VPC/CuffLINKS/CuffLINKS.tar - this archive has been unpacked and the V*.tar.gz files it contained are stored individually in Glacier [see Chris email 4 Dec 2024]
      • These V*.tar.gz files still need to be added to GigaDB: see ticket "Add dataset 100443 file names into GigaDB" #2213
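
To make sure that nothing other than these two known exceptions is missing, the `comm -23` output could be filtered against an allow-list. The following is a minimal sketch. The function name `unexpected_missing` and the two input files are hypothetical:

```shell
#!/bin/sh
# Sketch only: unexpected_missing is a hypothetical helper.
# $1: file holding the `comm -23` output (names missing from the destination)
# $2: file holding the names that are known and accepted to be missing

# Print every missing name that is NOT on the allow-list.
# -F treats allow-list entries as fixed strings, -x requires whole-line matches,
# -v inverts the match. `|| true` keeps the exit status zero when nothing is printed.
unexpected_missing() {
    grep -vxF -f "$2" "$1" || true
}
```

Empty output would mean the only files absent from Glacier are the ones already accounted for.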

Total size of dataset 100443 files in Glacier:

$ rclone --s3-profile aws-transfer size gigadb-datasetfiles:gigadb-datasetfiles-backup/live/pub/10.5524/100001_101000/100443/
Total objects: 98
Total size: 3.599 TiB (3957020398296 Byte)

@pli888
Member Author

pli888 commented Feb 20, 2025

All files in dataset 100242 in the Tencent backup have been transferred to Wasabi:

# Check total size of files in Wasabi
$ rclone size wasabi:gigadb-datasets/live/pub/10.5524/100001_101000/100242/
Total objects: 118
Total size: 4.199 TiB (4616348898057 Byte)

# Check total size of files in the Tencent backup
$ rclone size cos:cngbdb-share-backup-2-1255501786/cngbdb/giga/gigadb/pub/10.5524/100001_101000/100242/
Total objects: 117
Total size: 4.199 TiB (4616348883135 Byte)
  • readme_100242.txt is the extra file in Wasabi, which accounts for the additional object (118 versus 117)
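
As a quick sanity check, the difference between the two byte totals reported above can be computed directly. If readme_100242.txt is the only extra file, the result should equal its size. That is an inference from the totals, not something verified in this ticket:

```shell
#!/bin/sh
# Byte totals copied from the rclone size output above.
wasabi_bytes=4616348898057
cos_bytes=4616348883135

# Difference between the Wasabi total and the Tencent backup total.
diff_bytes=$(( $wasabi_bytes - $cos_bytes ))
echo "$diff_bytes"   # prints 14922
```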

Compare files in the Tencent backup and Wasabi:

# List 100242 files in Wasabi
$ rclone lsf -R --files-only wasabi:gigadb-datasets/live/pub/10.5524/100001_101000/100242/  | sort > wasabi_source_100242

# List 100242 files in the Tencent backup
$ rclone lsf -R --files-only cos:cngbdb-share-backup-2-1255501786/cngbdb/giga/gigadb/pub/10.5524/100001_101000/100242/ | sort > cos_source_100242

# No output: all 100242 files in the Tencent backup are in Wasabi
$ comm -23 cos_source_100242 wasabi_source_100242

Dataset 100242 has been removed from the Tencent backup.

@pli888
Member Author

pli888 commented Feb 20, 2025

Closing this ticket because:

  • Dataset 100157 has been copied into AWS S3 Glacier
  • Dataset 100443 has been copied into AWS S3 Glacier
  • Dataset 100242 has been copied into Wasabi

@pli888 pli888 closed this as completed Feb 20, 2025
@github-project-automation github-project-automation bot moved this from Move GigaDB to cloud to Done in Portfolio Backlog Feb 20, 2025
@github-project-automation github-project-automation bot moved this from 👷 Work In Progress 👷‍♀️ to 🛑 Blocked in Current sprint board Feb 20, 2025
@pli888 pli888 moved this from 🛑 Blocked to 👏 Tasks Done in Current sprint board Feb 20, 2025
@pli888 pli888 removed the Peter label Feb 20, 2025