Emergency department indoor-air hybrid-capture metagenomics complements wastewater by adding a human-focused respiratory virus perspective
Thank you for your interest in our study!
Preprint: A Tale of Two Lenses: Emergency department indoor-air hybrid-capture metagenomics complements wastewater by adding a human-focused respiratory virus perspective https://www.medrxiv.org/content/10.64898/2026.03.13.26348311v1
This repository contains processed datasets and R scripts to reproduce the analyses and figures from the study. It also documents how the raw reads were processed so the workflow can be reproduced end-to-end.
BioProject: PRJNA1431177 (SRA). Human-associated reads were removed before deposition.
Raw reads can be downloaded using the SRA Toolkit (e.g., prefetch / fasterq-dump). See:
https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/
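As a rough sketch, the download commands can be assembled in R and then handed to the shell. The helper name `sra_download_commands` is hypothetical, and the run accessions are placeholders rather than real runs from this BioProject; the actual accessions would need to be looked up under PRJNA1431177.

```r
# Build SRA Toolkit command lines for a set of run accessions.
# NOTE: the helper name and the accessions below are illustrative assumptions.
sra_download_commands <- function(runs, outdir = "raw_reads", threads = 4) {
  stopifnot(is.character(runs), length(runs) > 0)
  vapply(runs, function(run) {
    paste(
      "prefetch", shQuote(run), "&&",
      "fasterq-dump --split-files",
      "--threads", threads,
      "--outdir", shQuote(outdir),
      shQuote(run)
    )
  }, character(1), USE.NAMES = FALSE)
}

# Placeholder accessions; replace with the runs listed for the BioProject.
cmds <- sra_download_commands(c("SRRXXXXXXX1", "SRRXXXXXXX2"))
cat(cmds, sep = "\n")

# To execute (requires the SRA Toolkit on PATH):
# for (cmd in cmds) system(cmd)
```

Printing the commands before running them makes it easy to check the output directory and thread count first.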
Processed inputs for figure reproduction are included in data/ so you can regenerate figures without re-running the raw-read pipeline.
Raw reads were processed with EsViritu v0.2.3 using the Virus Pathogen Database v2.0.2 (GenBank content through Nov 2022; Zenodo 7876309) as the reference.
EsViritu links:
- GitHub: https://github.com/cmmr/EsViritu/
- Paper (Tisza et al., Nat Commun 2023): https://www.nature.com/articles/s41467-023-42064-1
If you use EsViritu, please cite the related paper.
Workflow summary:
- Quality filtering and adapter trimming with fastp, with deduplication enabled.
- Reference mapping at >=90% nucleotide identity and >=90% read coverage.
- Consensus sequence generation with samtools; near-duplicate consensus sequences removed at >95% similarity.
- The pipeline reports all findings (EsViritu's internal reporting cutoff was removed); downstream filtering was done in R/RStudio using the cutoffs described below.
These steps produce the processed datasets in data/ that are used by the figure scripts.
The pre-filtering step is implemented in scripts/filter_data.R. It removes known wet-lab contaminants and low-confidence hits to improve certainty of detections and downstream genomic assignments.
A single set of filtering criteria is applied to the unfiltered EsViritu output:
- Baseline Requirements: >= 500 bp coverage OR >= 50% genome completeness.
- Minimum Abundance: > 10 reads aligned AND > 1 read per million (RPM).
- Large Genomes: References with > 100 kb genomes must have > 3000 bp coverage.
- Contaminant Removal: Specific contaminants (e.g., Parvovirus NIH-CQV, Alphamesonivirus) and known artifactual signals (e.g., low-coverage Human mastadenovirus C) are excluded.
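The criteria above can be sketched in base R as follows. This is not the code in scripts/filter_data.R: the function name, all column names, the toy data, and the choice of total reads as the RPM denominator are assumptions made for illustration. It also assumes that the four criteria are combined with AND and that completeness is stored as a fraction. The low-coverage Human mastadenovirus C rule is left out because its threshold is not stated here.

```r
# Sketch of the detection filter; every column name here is hypothetical.
filter_detections <- function(df,
                              contaminants = c("Parvovirus NIH-CQV",
                                               "Alphamesonivirus")) {
  # Reads per million, assuming `total_reads` is the per-sample read total.
  rpm <- df$read_count / df$total_reads * 1e6

  # Baseline: >= 500 bp covered OR >= 50% genome completeness.
  baseline <- df$covered_bases >= 500 | df$completeness >= 0.5
  # Minimum abundance: > 10 aligned reads AND > 1 RPM.
  abundance <- df$read_count > 10 & rpm > 1
  # Large references (> 100 kb) need > 3000 bp covered; smaller ones pass.
  large_ok <- df$ref_length <= 100000 | df$covered_bases > 3000
  # Flag named contaminants by case-insensitive match on the virus name.
  is_contam <- Reduce(`|`, lapply(contaminants, function(p) {
    grepl(p, df$virus_name, ignore.case = TRUE)
  }), FALSE)

  df[baseline & abundance & large_ok & !is_contam, , drop = FALSE]
}

# Toy input, invented purely to exercise each rule.
toy <- data.frame(
  virus_name    = c("Influenza A virus", "Parvovirus NIH-CQV",
                    "Large DNA virus", "Rhinovirus A"),
  read_count    = c(250, 900, 40, 8),
  total_reads   = c(2e6, 2e6, 2e6, 2e6),
  covered_bases = c(1800, 4000, 2500, 600),
  completeness  = c(0.13, 0.80, 0.01, 0.08),
  ref_length    = c(13500, 5000, 150000, 7100),
  stringsAsFactors = FALSE
)

kept <- filter_detections(toy)
print(kept$virus_name)
```

In the toy data only the first row survives: the second is a listed contaminant, the third is a large genome with too little coverage, and the fourth has too few reads.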
Clinical datasets used in the study are included in this repository for reproducibility. They are publicly available; sources and access details are described in the manuscript. The files are located under data/ (e.g., uzleuven_pathogens_weekly_long.csv).
For Influenza A resistance analysis (Figure 4), we implemented a specific pipeline (scripts/figure4_2_pipeline.R).
- Pipeline: scripts/figure4_2_pipeline.R
- Methodology:
  - Step 1: Strict consensus generation (min depth >100x, min allele frequency >0.9).
  - Step 2: Resistance mutation calling (H1N1pdm09 reference).
- Filtering: Only primary alignments with mapping quality > 30 are used. Alleles below 90% frequency are not incorporated into consensus sequences, so that only high-confidence calls are reported.
- Input: BAM files (not included in repo due to size) or cached intermediate files (included in repo).
- Output: Mutation analysis and coverage plots.
See README_figure4_2_pipeline.md for detailed documentation.
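To illustrate the strict consensus rule in Step 1, here is a minimal base R sketch that works on per-position base counts. It is not taken from scripts/figure4_2_pipeline.R: the function name, the count-matrix input, and the toy numbers are assumptions. The sketch also assumes the counts were tallied only from primary alignments with mapping quality > 30, and it marks positions that fail either threshold as "N".

```r
# Call a strict consensus from per-position base counts.
# `counts`: numeric matrix, one row per position, columns named A, C, G, T.
call_strict_consensus <- function(counts, min_depth = 100, min_af = 0.9) {
  depth     <- rowSums(counts)
  top_count <- apply(counts, 1, max)
  top_base  <- colnames(counts)[max.col(counts, ties.method = "first")]
  top_af    <- ifelse(depth > 0, top_count / depth, 0)
  # Strict inequalities, following the stated ">100x" and ">0.9" thresholds.
  ifelse(depth > min_depth & top_af > min_af, top_base, "N")
}

# Toy counts, invented for illustration.
counts <- matrix(
  c(480,  10,   5,   5,   # depth 500, major allele 0.96: called A
     60,   0,   0,   0,   # depth 60: below the depth threshold
    150,   0, 150,   0,   # depth 300, major allele 0.50: ambiguous
      0,   2,   0, 298),  # depth 300, major allele ~0.99: called T
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("A", "C", "G", "T"))
)

consensus <- paste(call_strict_consensus(counts), collapse = "")
print(consensus)
```

Masking low-confidence positions rather than guessing a base keeps mixed or poorly covered sites from producing spurious resistance calls in the following step.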
.
├── data/ # Processed datasets (TSV, CSV, etc.)
│ ├── air_wastewater2.detected_virus.combined.tax.tsv
│ ├── metadata_matched.csv
│ ├── figure4_2_data.csv
│ ├── uzleuven_pathogens_weekly_long.csv
│ └── ...
├── scripts/ # R scripts for generating figures
│ ├── figure1_panels.R # Overview analysis
│ ├── figure2_panels.R # Comparative analysis
│ ├── figure3_panels.R # Time series analysis
│ ├── figure4_panels.R # Influenza subtype analysis
│ ├── figure4_2_pipeline.R # Influenza resistance analysis
│ ├── figure5_panels.R # Genus-specific composition
│ └── filter_data.R # Data filtering logic
├── figures_pdf/ # Output figures in PDF format (generated by scripts)
└── README.md # Project documentation
System requirements:
- Figure reproduction (this repository): R >= 4.0.0 on a standard local machine (analysis scripts were run on macOS).
- Raw-read metagenomics processing (EsViritu): HPC environment with a SLURM scheduler and required bioinformatics software/modules.
- No special hardware is required for figure reproduction.
R dependencies:
install.packages(c(
"tidyverse",
"lubridate",
"ggplot2",
"patchwork",
"RColorBrewer",
"gridExtra",
"scales",
# "grid" ships with base R and does not need to be installed
"ggpubr",
"ggrepel",
"viridis",
"zoo",
"cowplot", # For figure 4 pipeline
"jsonlite" # For figure 4 pipeline
))

- Clone the repository:
  git clone https://github.com/Matthijnssenslab/air_wastewater_leuven.git
  cd air_wastewater_leuven
- Set the working directory in R/RStudio to the repository root:
  setwd("/path/to/air_wastewater_leuven")
- Run the analysis scripts:
  # First, ensure data is filtered (if not loading pre-filtered)
  source("scripts/filter_data.R")
  # Run figure generation scripts
  source("scripts/figure1_panels.R")
  source("scripts/figure2_panels.R")
  # ... and so on
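For convenience, the scripts could also be run in one pass. The helper below is a hypothetical sketch, not part of the repository. It assumes the script names shown in the directory listing, that filter_data.R should run first, and that the figure scripts can be sourced in numeric order. figure4_2_pipeline.R is left out because it needs BAM or cached inputs.

```r
# Source the analysis scripts in a fixed order, skipping any that are missing.
# NOTE: helper name and run order are assumptions, not repository code.
run_figure_scripts <- function(script_dir = "scripts",
                               scripts = c("filter_data.R",
                                           "figure1_panels.R",
                                           "figure2_panels.R",
                                           "figure3_panels.R",
                                           "figure4_panels.R",
                                           "figure5_panels.R")) {
  paths <- file.path(script_dir, scripts)
  found <- file.exists(paths)
  for (p in paths[found]) {
    message("Running ", p)
    source(p)  # evaluated in the global environment, as with a manual source()
  }
  if (any(!found)) {
    warning("Scripts not found: ", paste(scripts[!found], collapse = ", "))
  }
  invisible(setNames(found, scripts))
}

# From the repository root:
# status <- run_figure_scripts()
```

The returned logical vector records which scripts were found, which helps spot a wrong working directory.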
- air_wastewater2.detected_virus.combined.tax.tsv: Main dataset containing taxonomically assigned viral reads.
- metadata_matched.csv: Sample metadata linking air and wastewater samples.
- uzleuven_pathogens_weekly_long.csv: Clinical surveillance data for correlation analysis.
MIT License. See LICENSE.