Emergency department indoor-air hybrid-capture metagenomics complements wastewater by adding a human-focused respiratory virus perspective

Thank you for your interest in our study!

Preprint: A Tale of Two Lenses: Emergency department indoor-air hybrid-capture metagenomics complements wastewater by adding a human-focused respiratory virus perspective https://www.medrxiv.org/content/10.64898/2026.03.13.26348311v1

This repository contains processed datasets and R scripts to reproduce the analyses and figures from the study. It also documents how the raw reads were processed so the workflow can be reproduced end-to-end.

Dataset (raw reads)

BioProject: PRJNA1431177 (SRA). Human-associated reads were removed before deposition.

Raw reads can be downloaded using the SRA Toolkit (e.g., prefetch / fasterq-dump). See: https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/

Processed inputs for figure reproduction are included in data/ so you can regenerate figures without re-running the raw-read pipeline.

Metagenomics analyses (EsViritu v0.2.3)

Raw reads were processed with EsViritu v0.2.3 using the Virus Pathogen Database v2.0.2 (GenBank content through Nov 2022; Zenodo 7876309) as the reference. EsViritu links:

GitHub: https://github.com/cmmr/EsViritu/
Paper (Tisza et al., Nat Commun 2023): https://www.nature.com/articles/s41467-023-42064-1
If you use EsViritu, please cite this paper.

Workflow summary:

Quality filtering and adapter trimming with fastp, with deduplication enabled.
Reference mapping at >=90% nucleotide identity and >=90% read coverage.
Consensus sequence generation with samtools; near-duplicate consensus sequences removed at >95% similarity.
The pipeline reports all findings (we removed EsViritu's internal reporting cutoff); downstream filtering was done in R (scripts/filter_data.R) with the cutoffs described below.

These steps produce the processed datasets in data/ that are used by the figure scripts.

Pre-filtering in R (filter_data.R)

The pre-filtering step is implemented in scripts/filter_data.R. It removes known wet-lab contaminants and low-confidence hits to improve certainty of detections and downstream genomic assignments.

A single set of filtering criteria is applied to the unfiltered EsViritu output:

Baseline Requirements: >= 500 bp coverage OR >= 50% genome completeness.
Minimum Abundance: > 10 reads aligned AND > 1 read per million (RPM).
Large Genomes: References with > 100kb genomes must have > 3000 bp coverage.
Contaminant Removal: Specific contaminants (e.g., Parvovirus NIH-CQV, Alphamesonivirus) and known artifactual signals (e.g., low-coverage Human mastadenovirus C) are excluded.
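
As a rough illustration of how the criteria above combine, here is a minimal base-R sketch. It is not the code in scripts/filter_data.R. The column names (taxon, genome_length, covered_bp, read_count, total_reads), the helper name filter_detections, and the toy rows are all hypothetical. RPM is assumed to be aligned reads divided by total reads per sample, times one million. The coverage-dependent Human mastadenovirus C rule is not reproduced because its threshold is not stated here.

```r
# Hypothetical sketch of the pre-filtering criteria; column names are assumptions.
filter_detections <- function(df,
                              contaminant_patterns = c("Parvovirus NIH-CQV",
                                                       "Alphamesonivirus")) {
  # Baseline: >= 500 bp covered OR >= 50% genome completeness
  completeness <- df$covered_bp / df$genome_length
  baseline <- df$covered_bp >= 500 | completeness >= 0.5

  # Minimum abundance: > 10 aligned reads AND > 1 read per million (RPM)
  rpm <- df$read_count / df$total_reads * 1e6
  abundant <- df$read_count > 10 & rpm > 1

  # Large genomes (> 100 kb) must additionally have > 3000 bp covered
  large_ok <- df$genome_length <= 100000 | df$covered_bp > 3000

  # Name-based contaminant exclusion (fixed-string match on the taxon name)
  hits <- lapply(contaminant_patterns,
                 function(p) grepl(p, df$taxon, fixed = TRUE))
  is_contaminant <- Reduce(`|`, hits, rep(FALSE, nrow(df)))

  df[baseline & abundant & large_ok & !is_contaminant, , drop = FALSE]
}

# Toy input (invented values, for illustration only)
toy <- data.frame(
  taxon         = c("Influenza A virus", "Parvovirus NIH-CQV",
                    "Hypothetical giant virus", "Rhinovirus A"),
  genome_length = c(13500, 5000, 150000, 7200),
  covered_bp    = c(9000, 4000, 2000, 300),
  read_count    = c(250, 400, 80, 5),
  total_reads   = c(2e6, 2e6, 2e6, 2e6),
  stringsAsFactors = FALSE
)

kept <- filter_detections(toy)
print(kept$taxon)
```

In this toy input only the first row survives: the second is a listed contaminant, the third is a large genome with too little covered sequence, and the fourth fails both the baseline and the abundance requirements.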

Clinical datasets

Clinical datasets used in the study are included in this repository for reproducibility. These datasets are publicly available; sources and access details are described in the manuscript. Data files are available under data/ (e.g., uzleuven_pathogens_weekly_long.csv).

Influenza A resistance analysis

For the Influenza A resistance analysis (Figure 4), we implemented a dedicated pipeline.

Pipeline: scripts/figure4_2_pipeline.R
Methodology:
- Step 1: Strict consensus generation (min depth >100x, min allele frequency >0.9).
- Step 2: Resistance mutation calling (H1N1pdm09 reference).
Filtering: Only primary alignments with mapping quality > 30 are used. Variants below 90% allele frequency are excluded from consensus sequences to ensure high-confidence calls.
Input: BAM files (not included in repo due to size) or cached intermediate files (included in repo).
Output: Mutation analysis and coverage plots.
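
The strict consensus rule in Step 1 can be sketched as follows. This is a simplified illustration, not the code in scripts/figure4_2_pipeline.R. It assumes per-position base counts have already been extracted from alignments that passed the mapping-quality and primary-alignment filters. The function name call_strict_consensus and the toy count matrix are hypothetical. Positions that fail either threshold are written as N.

```r
# Hypothetical sketch: strict consensus from per-position base counts.
# counts: numeric matrix, one row per reference position, columns A, C, G, T.
call_strict_consensus <- function(counts, min_depth = 100, min_af = 0.9) {
  stopifnot(is.matrix(counts), !is.null(colnames(counts)))

  depth <- rowSums(counts)
  # Index of the most frequent base at each position (first one wins on ties)
  top_idx <- max.col(counts, ties.method = "first")
  top_count <- counts[cbind(seq_len(nrow(counts)), top_idx)]
  top_af <- ifelse(depth > 0, top_count / depth, 0)

  # Both thresholds are strict (>), following the wording above
  pass <- depth > min_depth & top_af > min_af
  bases <- ifelse(pass, colnames(counts)[top_idx], "N")
  paste(bases, collapse = "")
}

# Toy counts (invented values, for illustration only)
counts <- matrix(
  c(150,   2,  1,   0,   # deep, A at ~98%        -> A
      0, 120, 30,   0,   # deep, but C only at 80% -> N
      0,   0, 60,   0,   # depth 60, too shallow   -> N
      1,   0,  0, 200),  # deep, T at ~99.5%      -> T
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("A", "C", "G", "T"))
)

consensus <- call_strict_consensus(counts)
print(consensus)
```

Masking failed positions with N rather than falling back to the reference base means a resistance call at a poorly covered site is reported as unknown rather than as wild type.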

See README_figure4_2_pipeline.md for detailed documentation.

Repository structure

.
├── data/                       # Processed datasets (TSV, CSV, etc.)
│   ├── air_wastewater2.detected_virus.combined.tax.tsv
│   ├── metadata_matched.csv
│   ├── figure4_2_data.csv
│   ├── uzleuven_pathogens_weekly_long.csv
│   └── ...
├── scripts/                    # R scripts for generating figures
│   ├── figure1_panels.R        # Overview analysis
│   ├── figure2_panels.R        # Comparative analysis
│   ├── figure3_panels.R        # Time series analysis
│   ├── figure4_panels.R        # Influenza subtype analysis
│   ├── figure4_2_pipeline.R    # Influenza resistance analysis
│   ├── figure5_panels.R        # Genus-specific composition
│   └── filter_data.R           # Data filtering logic
├── figures_pdf/                # Output figures in PDF format (generated by scripts)
└── README.md                   # Project documentation

Prerequisites

System requirements:

Figure reproduction (this repository): R >= 4.0.0 on a standard local machine (analysis scripts were run on macOS).
Raw-read metagenomics processing (EsViritu): HPC environment with a SLURM scheduler and required bioinformatics software/modules.
No special hardware is required for figure reproduction.

R dependencies:

# "grid" is loaded by the scripts but ships with base R, so it is not installed here
install.packages(c(
  "tidyverse",
  "lubridate",
  "ggplot2",
  "patchwork",
  "RColorBrewer",
  "gridExtra",
  "scales",
  "ggpubr",
  "ggrepel",
  "viridis",
  "zoo",
  "cowplot",    # For figure 4 pipeline
  "jsonlite"    # For figure 4 pipeline
))
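
If some of these packages are already installed, a small check like the one below can list only the missing ones before installing. This is an optional convenience sketch, not part of the repository; the helper names is_installed and missing_packages are hypothetical.

```r
# Hypothetical helper: report which required packages are not yet installed.
required <- c("tidyverse", "lubridate", "ggplot2", "patchwork", "RColorBrewer",
              "gridExtra", "scales", "ggpubr", "ggrepel", "viridis", "zoo",
              "cowplot", "jsonlite")

# system.file() returns "" when a package cannot be found
is_installed <- function(pkg) nzchar(system.file(package = pkg))

missing_packages <- function(pkgs) {
  pkgs[!vapply(pkgs, is_installed, logical(1))]
}

# The figure scripts expect R >= 4.0.0
if (getRversion() < "4.0.0") {
  warning("R >= 4.0.0 is recommended; found ", as.character(getRversion()))
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
} else {
  message("All required packages are installed.")
}
```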

Usage (figure reproduction)

Clone the repository:

git clone https://github.com/Matthijnssenslab/air_wastewater_leuven.git
cd air_wastewater_leuven

Set the working directory in R/RStudio to the repository root:
```
setwd("/path/to/air_wastewater_leuven")
```

Run the analysis scripts:

# First, ensure data is filtered (if not loading pre-filtered)
source("scripts/filter_data.R")

# Run figure generation scripts
source("scripts/figure1_panels.R")
source("scripts/figure2_panels.R")
# ... and so on
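
To run all panel scripts in one go, a loop along these lines may be convenient. It is a sketch that assumes the script names listed under "Repository structure" (figure1_panels.R to figure5_panels.R) and that the working directory is the repository root. The helper name run_figure_scripts is hypothetical. scripts/figure4_2_pipeline.R is left out because it needs BAM files or the cached intermediate files.

```r
# Hypothetical helper: source every figureN_panels.R script that exists.
run_figure_scripts <- function(script_dir = "scripts", figures = 1:5) {
  scripts <- file.path(script_dir, sprintf("figure%d_panels.R", figures))
  found <- scripts[file.exists(scripts)]
  for (s in found) {
    message("Running ", s)
    source(s)  # evaluated in the global environment, like a manual source() call
  }
  invisible(basename(found))
}

# Usage, from the repository root:
# source("scripts/filter_data.R")
# ran <- run_figure_scripts()
```

Scripts that are not found are skipped silently, so the returned vector of file names is the simplest way to confirm which figures were actually regenerated.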

Data description

air_wastewater2.detected_virus.combined.tax.tsv: Main dataset containing taxonomically assigned viral reads.
metadata_matched.csv: Sample metadata linking air and wastewater samples.
uzleuven_pathogens_weekly_long.csv: Clinical surveillance data for correlation analysis.

License

MIT License. See LICENSE.
