Purpose: This series of test datasets has been prepared to emulate real Illumina and Nanopore sequencing data. These datasets are designed for use in:
a) Evaluating software changes prior to implementation
b) Proficiency testing of personnel performing NGS analysis of SARS-CoV-2
Note: These test datasets focus solely on the analysis and interpretation of sequence data and exclude contextual data.
For information and test datasets on training and quality control of contextual data, please see here. This resource contains "Gold Standard", "Common Errors", and a "Scenarios dataset".
Resources
Sequencing Test Datasets
- Location of in-house training set: 10.5281/zenodo.13748104 (soon to be released)
  - Datasets included:
    - Illumina
      - Training Set A
        - Contains dehosted Illumina paired-end sequencing data generated using the Freed primer scheme for 25 samples and 3 controls
      - Training Set B
        - Contains dehosted Illumina paired-end sequencing data generated using the Freed primer scheme for 25 samples and 5 controls
      - Training Set C
        - Contains dehosted Illumina paired-end sequencing data generated using the freed_V2_nml primer scheme for 92 samples and 5 controls
      - Training Set D
        - Contains dehosted Illumina paired-end sequencing data generated using the freed_V2_nml primer scheme for 92 samples and 7 controls

      Permission to use these datasets was provided by the Roy Romanow Provincial Laboratory (Saskatchewan). They should not be used for applications outside those outlined in this document without the express written consent of this group.

      Authors: Anna Majer, Shari Tyson, Grace Seo, Philip Mabon, Elsie Grudeski, Rhiannon Huzarewich, Russell Mandes, Anneliese Landgraff, Jennifer Tanner, Natalie Knox, Morag Graham, Gary Van Domselaar, Ryan McDonald, Keith MacKenzie, Meredith Faires, Kara Loos, Stefani Kary, Rachel DePaulo, Alanna Senecal, Amanda Lang, Jessica Minion, Nathalie Bastien, Yan Li, Timothy Booth, Darian Hole, Madison Chapel, Kirsten Biggar, CanCOGeN's metadata curation team, Public Health Agency of Canada CanCOGeN team
    - Nanopore
      - Contains dehosted Nanopore fastq files (and possibly fast5 files?) generated using the freed_V2_nml primer scheme for 93 samples and 3 controls
- CDC public dataset repo git link / publication
  - What's included:
    - QC failures
    - host contamination
    - outbreak data
    - samples generated using different wet-lab approaches
    - datasets across different lineages
Test datasets coming soon!
Recipes - how datasets are created
- Illumina Training Set A is composed of clinical samples from Saskatchewan sequenced by the National Microbiology Laboratory in April 2021 on a MiSeq instrument with the Freed primer scheme.
  - A run with clean negative controls was selected, and a subset of its samples representing a range of Ct values and final genome completeness scores was chosen.
  - A clinical positive control sample was also included.
  - An artificial mixture sample was produced by combining reads from two existing samples to ensure that at least one sample would generate an EXCESS_AMBIGUITY flag.
- Illumina Training Set B was created by combining Training Set A with a series of additional negative controls, some of which will fail QC checks.
- Illumina Training Set C is composed of clinical samples from Saskatchewan sequenced by the National Microbiology Laboratory in October 2022 on a MiSeq instrument with the freed_V2_nml primer scheme. This dataset also contains the same 2 negative controls found in Training Set A, as well as 3 additional controls: (i) a no template control, (ii) a SARS-CoV-2 Alpha variant positive control, and (iii) a SARS-CoV-2 Delta variant positive control.
- Illumina Training Set D was created by combining the samples from Training Set C with the 4 negative controls found in Training Set B, as well as 3 additional controls: (i) a no template control, (ii) a SARS-CoV-2 Alpha variant positive control, and (iii) a SARS-CoV-2 Delta variant positive control (the same 3 additional controls found in Training Set C).
- The Nanopore Training Set was generated by the Roy Romanow Provincial Laboratory in Saskatchewan on an Mk1C device in January 2022.
  - The run was selected to match the Illumina run as closely as possible in composition: clean negative controls and a variety of Ct values and final genome completeness scores.
Because Illumina Training Sets A and B were generated during major SARS-CoV-2 variant waves, the lineage composition of these datasets is homogeneous. The Nanopore dataset and Illumina Training Sets C and D contain samples with a variety of SARS-CoV-2 Omicron sublineages.
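The recipe for Training Set A mentions an artificial mixture built by combining reads from two samples, but does not describe how this was done. One plausible approach, shown here only as an illustrative sketch, is to concatenate the two gzipped FASTQ files. The function name `combine_fastq` and all file names are hypothetical and are not part of the published datasets or tooling.

```python
import gzip  # only needed to read the result back; the merge itself copies raw bytes
import shutil
from pathlib import Path


def combine_fastq(sample_a: Path, sample_b: Path, mixed_out: Path) -> None:
    """Concatenate two gzipped FASTQ files into one mixed file.

    Assumption: a plain concatenation is enough to create a mixture.
    Concatenated gzip members form a valid gzip stream, so the inputs
    are copied byte for byte without decompression.
    """
    with open(mixed_out, "wb") as out:
        for src in (sample_a, sample_b):
            with open(src, "rb") as handle:
                shutil.copyfileobj(handle, out)


def count_reads(fastq_gz: Path) -> int:
    """Count reads in a gzipped FASTQ file (four lines per record)."""
    with gzip.open(fastq_gz, "rt") as handle:
        return sum(1 for _ in handle) // 4
```

For paired-end data, the function would be called once for the R1 files and once for the R2 files, with the two samples given in the same order both times so that read pairs stay in sync.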
Assessing operator proficiency:
- Data interpretation criteria must address the considerations below:
  - Do the negative controls (negative-ctrl-#) pass the specified internal QC metrics? If you detect warnings, what are the reasons and how will they affect the run?
  - Does the positive control (positive-ctrl-1) pass the specified internal QC metrics?
  - Based on the above, is the run valid, and can interpretations be made about the samples included? Provide a rationale.
  - How many samples meet minimum quality standards? List which do and which do not.
  - Which (if any) samples display excess ambiguity? Should the data from these samples be excluded?
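As a starting point for the last two questions, the sketch below sorts samples from a QC summary table into passing, failing, and excess-ambiguity groups. It is an illustration only: the column names `sample` and `qc_pass`, the `PASS` value, and the comma-separated flag format are assumptions about the layout of the summary file and should be checked against your own output. Only the `EXCESS_AMBIGUITY` flag name comes from this document.

```python
import csv
import io


def triage_samples(summary_tsv_text, sample_col="sample", flag_col="qc_pass"):
    """Split samples into (passed, failed, ambiguous) lists.

    Column names and flag values are assumptions; adjust them to match
    the header of your own summary file.
    """
    passed, failed, ambiguous = [], [], []
    reader = csv.DictReader(io.StringIO(summary_tsv_text), delimiter="\t")
    for row in reader:
        name = row[sample_col]
        flags = [flag.strip() for flag in row[flag_col].split(",") if flag.strip()]
        if flags == ["PASS"]:
            passed.append(name)
        else:
            failed.append(name)
        if "EXCESS_AMBIGUITY" in flags:
            ambiguous.append(name)
    return passed, failed, ambiguous


# Made-up rows for illustration; these are not values from the training sets.
example = "sample\tqc_pass\nS1\tPASS\nS2\tEXCESS_AMBIGUITY\n"
print(triage_samples(example))  # (['S1'], ['S2'], ['S2'])
```

Whether a flagged sample should actually be excluded remains a judgment for the operator; the sketch only lists candidates.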
SIGNAL_output_comparison_v2.Rmd can be used to compare SIGNAL pipeline results generated by two different users on the same Illumina test dataset.
This script requires the following seven files (from both analyses) as input:
- SIGNAL results file:
  - .qc.csv
- ncov-tools quality control files:
  - _tree.nwk
  - _ambiguous_position_report.tsv
  - _mixture_report.tsv
  - _ncov_watch_variants.tsv
  - _negative_control_report.tsv
  - _summary_qc.tsv
The current version of the script requires that the arrangement of these files adhere to a specific directory structure. For example:
Training_Set_A
├── User_1_SIGNAL_A
│   └── SIGNAL_analyses
│       ├── ncov-tools
│       │   ├── qc_analysis
│       │   │   └── nml_tree.nwk
│       │   └── qc_reports
│       │       ├── nml_ambiguous_position_report.tsv
│       │       ├── nml_mixture_report.tsv
│       │       ├── nml_ncov_watch_variants.tsv
│       │       ├── nml_negative_control_report.tsv
│       │       └── nml_summary_qc.tsv
│       └── nml.qc.csv
└── User_2_SIGNAL_A
    └── SIGNAL_analyses
        ├── ncov-tools
        │   ├── qc_analysis
        │   │   └── nml_tree.nwk
        │   └── qc_reports
        │       ├── nml_ambiguous_position_report.tsv
        │       ├── nml_mixture_report.tsv
        │       ├── nml_ncov_watch_variants.tsv
        │       ├── nml_negative_control_report.tsv
        │       └── nml_summary_qc.tsv
        └── nml.qc.csv
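Before knitting, it can help to confirm that each user's folder holds all seven input files in the layout shown above. The following is a minimal sketch of such a check, written as a standalone Python helper rather than as part of the R Markdown script. The function names are hypothetical, and the `nml` prefix is taken from the example tree; pass a different prefix if your run is named differently.

```python
from pathlib import Path

# Report names as they appear in the example directory tree.
REPORT_NAMES = [
    "ambiguous_position_report",
    "mixture_report",
    "ncov_watch_variants",
    "negative_control_report",
    "summary_qc",
]


def expected_inputs(prefix="nml"):
    """Return the seven expected files, relative to one user's folder."""
    base = Path("SIGNAL_analyses")
    ncov = base / "ncov-tools"
    files = [base / f"{prefix}.qc.csv", ncov / "qc_analysis" / f"{prefix}_tree.nwk"]
    files += [ncov / "qc_reports" / f"{prefix}_{name}.tsv" for name in REPORT_NAMES]
    return files


def missing_inputs(user_dir, prefix="nml"):
    """List expected files that are absent under user_dir (POSIX-style paths)."""
    user_dir = Path(user_dir)
    return [p.as_posix() for p in expected_inputs(prefix) if not (user_dir / p).is_file()]
```

Calling `missing_inputs("Training_Set_A/User_1_SIGNAL_A")` and the equivalent for the second user returns an empty list when both folders are complete.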
A configuration (.yaml) file is used to identify the SIGNAL results directories that are to be compared, and to load the necessary R packages for the R markdown script.
The configuration file requires the following:
- Base_path: absolute path to the directory that contains the Data_folder (see below)
- Data_folder: parent directory of the per-user analysis folders, Name_of_folder_one and Name_of_folder_two (see below)
- Name_of_folder_one: parent directory for user 1 SIGNAL analyses on a particular Illumina dataset
- Name_of_folder_two: parent directory for user 2 SIGNAL analyses on a particular Illumina dataset
Using the directory structure example above, the configuration file entries would be as follows:
Base_path: <ABSOLUTE/PATH/TO/Training_Set_A>
- e.g. "C:/Zenodo_data/SARS_CoV_2/Illumina" (the directory that contains Training_Set_A, not Training_Set_A itself)
Data_folder: "Training_Set_A"
Name_of_folder_one: "User_1_SIGNAL_A"
Name_of_folder_two: "User_2_SIGNAL_A"
The following R packages must also be listed in the configuration file as follows:
Packages:
- dplyr
- data.table
- readxl
- tidyverse
- stringr
- R.utils
- writexl
- lubridate
- stringi
- ggplot2
- gridExtra
- tinytex
- kableExtra
- DT
- extraoperators
- hablar
- ape
- phylogram
- dendextend
- data.tree
- here
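To avoid typing errors in the key names, the configuration text can be generated rather than written by hand. The sketch below is a hypothetical Python helper, not part of the R Markdown script; it uses only the keys described above (`Base_path`, `Data_folder`, `Name_of_folder_one`, `Name_of_folder_two`, `Packages`). The function name `render_config` and the two-space indentation of the package list are choices made for this example.

```python
def render_config(base_path, data_folder, folder_one, folder_two, packages):
    """Build the YAML text for the configuration file.

    Paths are written inside double quotes, so use forward slashes
    (as in the example above); backslashes would be read as YAML escapes.
    """
    lines = [
        f'Base_path: "{base_path}"',
        f'Data_folder: "{data_folder}"',
        f'Name_of_folder_one: "{folder_one}"',
        f'Name_of_folder_two: "{folder_two}"',
        "Packages:",
    ]
    lines += [f"  - {package}" for package in packages]
    return "\n".join(lines) + "\n"


# Example values taken from this document; the package list is shortened here.
config_text = render_config(
    "C:/Zenodo_data/SARS_CoV_2/Illumina",
    "Training_Set_A",
    "User_1_SIGNAL_A",
    "User_2_SIGNAL_A",
    ["dplyr", "data.table", "readxl"],
)
```

The resulting `config_text` can then be saved to a `.yaml` file of your choosing, with the full package list substituted for the shortened one.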
The absolute path to the configuration file must be entered on line 24 of SIGNAL_output_comparison_v2.Rmd.
config <- yaml::yaml.load_file("<ABSOLUTE/PATH/TO/Configuration_file.yaml>")
Once the configuration file has been created and its absolute path has been added to the script on line 24, knit the script in RStudio to generate the HTML output, which contains a detailed comparison of the per-user SIGNAL/ncov-tools output.