Emergency department indoor-air hybrid-capture metagenomics complements wastewater by adding a human-focused respiratory virus perspective

Thank you for your interest in our study!

Preprint: A Tale of Two Lenses: Emergency department indoor-air hybrid-capture metagenomics complements wastewater by adding a human-focused respiratory virus perspective https://www.medrxiv.org/content/10.64898/2026.03.13.26348311v1

This repository contains processed datasets and R scripts to reproduce the analyses and figures from the study. It also documents how the raw reads were processed so the workflow can be reproduced end-to-end.

Dataset (raw reads)

BioProject: PRJNA1431177 (SRA). Human-associated reads were removed before deposition.

Raw reads can be downloaded using the SRA Toolkit (e.g., prefetch / fasterq-dump). See: https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/

Processed inputs for figure reproduction are included in data/ so you can regenerate figures without re-running the raw-read pipeline.

Metagenomics analyses (EsViritu v0.2.3)

Raw reads were processed with EsViritu v0.2.3, using the Virus Pathogen Database v2.0.2 (GenBank content through Nov 2022; Zenodo 7876309) as the reference.

Workflow summary:

  • Quality filtering and adapter trimming with fastp, with deduplication enabled.
  • Reference mapping at >=90% nucleotide identity and >=90% read coverage.
  • Consensus sequence generation with samtools; near-duplicate consensus sequences removed at >95% similarity.
  • The pipeline reports all findings (EsViritu's internal reporting cutoff was removed); downstream filtering was done in R with the cutoffs listed below.

These steps produce the processed datasets in data/ that are used by the figure scripts.

Pre-filtering in R (filter_data.R)

The pre-filtering step is implemented in scripts/filter_data.R. It removes known wet-lab contaminants and low-confidence hits to improve certainty of detections and downstream genomic assignments.

A single set of filtering criteria is applied to the unfiltered pipeline output:

  • Baseline Requirements: >= 500 bp coverage OR >= 50% genome completeness.
  • Minimum Abundance: > 10 reads aligned AND > 1 read per million (RPM).
  • Large Genomes: References with genomes > 100 kb must have > 3000 bp coverage.
  • Contaminant Removal: Specific contaminants (e.g., Parvovirus NIH-CQV, Alphamesonivirus) and known artifactual signals (e.g., low-coverage Human mastadenovirus C) are excluded.
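
As a rough illustration, the numeric criteria above could be expressed in base R as follows. This is a minimal sketch, not the contents of scripts/filter_data.R: the column names (covered_bases, reference_length, read_count, rpm) and the function name passes_filters are hypothetical, the three rules are assumed to be combined with AND, and contaminant removal is left out.

```r
# Sketch only: column names are hypothetical and may not match the real table.
passes_filters <- function(hits) {
  completeness <- hits$covered_bases / hits$reference_length

  # Baseline: >= 500 bp covered OR >= 50% of the reference covered
  baseline <- hits$covered_bases >= 500 | completeness >= 0.5

  # Minimum abundance: > 10 aligned reads AND > 1 read per million
  abundance <- hits$read_count > 10 & hits$rpm > 1

  # Large genomes (> 100 kb) additionally need > 3000 bp covered
  large_ok <- hits$reference_length <= 100000 | hits$covered_bases > 3000

  # Assumption: a hit must satisfy all three rules to be kept
  baseline & abundance & large_ok
}

# Toy input, invented for this example
hits <- data.frame(
  name             = c("small_ok", "low_reads", "large_low_cov", "short_complete"),
  covered_bases    = c(1200, 1200, 2500, 300),
  reference_length = c(7500, 7500, 150000, 500),
  read_count       = c(40, 8, 60, 25),
  rpm              = c(5.2, 0.9, 7.5, 3.1),
  stringsAsFactors = FALSE
)

filtered <- hits[passes_filters(hits), ]
print(filtered$name)
```

In this toy table, "low_reads" fails the abundance rule and "large_low_cov" fails the large-genome rule, while "short_complete" passes the baseline rule through completeness alone.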

Clinical datasets

Clinical datasets used in the study are included in this repository for reproducibility. These datasets are publicly available; sources and access details are described in the manuscript. Data files are available under data/ (e.g., uzleuven_pathogens_weekly_long.csv).

Influenza A Resistance Analysis

The Influenza A resistance analysis (Figure 4) uses a dedicated pipeline:

  • Pipeline: scripts/figure4_2_pipeline.R
  • Methodology:
    • Step 1: Strict consensus generation (min depth >100x, min allele frequency >0.9).
    • Step 2: Resistance mutation calling (H1N1pdm09 reference).
  • Filtering: Only primary alignments with mapping quality > 30 are used. Alleles below 90% frequency are excluded from consensus sequences to ensure high-confidence calls.
  • Input: BAM files (not included in repo due to size) or cached intermediate files (included in repo).
  • Output: Mutation analysis and coverage plots.

See README_figure4_2_pipeline.md for detailed documentation.
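
To make the Step 1 thresholds concrete, the sketch below applies them to a small matrix of per-position base counts. It is illustrative only and is not taken from scripts/figure4_2_pipeline.R: the function name call_consensus, the A/C/G/T count-matrix input, and the choice to mask failing positions with "N" are all assumptions.

```r
# Sketch only: input layout and "N" masking are assumptions, not the pipeline's code.
call_consensus <- function(counts, min_depth = 100, min_af = 0.9) {
  # counts: numeric matrix, one row per position, columns named A, C, G, T
  depth  <- rowSums(counts)
  top    <- max.col(counts, ties.method = "first")   # index of the most frequent base
  top_af <- counts[cbind(seq_len(nrow(counts)), top)] / depth

  # Keep a base only if depth > min_depth and its allele frequency > min_af
  ok <- depth > min_depth & top_af > min_af
  ifelse(ok, colnames(counts)[top], "N")
}

# Toy pileup, invented for this example
counts <- matrix(
  c(150,  2, 1,   0,   # deep, clear majority
     60,  0, 0,   0,   # too shallow
     70, 60, 0,   0,   # deep but mixed
      0,  0, 0, 200),  # deep, clear majority
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("A", "C", "G", "T"))
)

consensus <- paste(call_consensus(counts), collapse = "")
print(consensus)
```

Here the shallow position and the mixed position are both masked, so only the two well-supported positions contribute a base.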

Repository structure

.
├── data/                       # Processed datasets (TSV, CSV, etc.)
│   ├── air_wastewater2.detected_virus.combined.tax.tsv
│   ├── metadata_matched.csv
│   ├── figure4_2_data.csv
│   ├── uzleuven_pathogens_weekly_long.csv
│   └── ...
├── scripts/                    # R scripts for generating figures
│   ├── figure1_panels.R        # Overview analysis
│   ├── figure2_panels.R        # Comparative analysis
│   ├── figure3_panels.R        # Time series analysis
│   ├── figure4_panels.R        # Influenza subtype analysis
│   ├── figure4_2_pipeline.R    # Influenza resistance analysis
│   ├── figure5_panels.R        # Genus-specific composition
│   └── filter_data.R           # Data filtering logic
├── figures_pdf/                # Output figures in PDF format (generated by scripts)
└── README.md                   # Project documentation

Prerequisites

System requirements:

  • Figure reproduction (this repository): R >= 4.0.0 on a standard local machine (analysis scripts were run on macOS).
  • Raw-read metagenomics processing (EsViritu): HPC environment with a SLURM scheduler and required bioinformatics software/modules.
  • No special hardware is required for figure reproduction.

R dependencies:

install.packages(c(
  "tidyverse",
  "lubridate",
  "ggplot2",
  "patchwork",
  "RColorBrewer",
  "gridExtra",
  "scales",
  # "grid" is omitted: it ships with base R and is not installed from CRAN
  "ggpubr",
  "ggrepel",
  "viridis",
  "zoo",
  "cowplot",    # For figure 4 pipeline
  "jsonlite"    # For figure 4 pipeline
))
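
If some of these packages are already installed, it may be convenient to check which ones are missing first. The helper below is a small sketch, and missing_packages is a hypothetical name that does not appear in this repository. "grid" is left out of the list because it is bundled with R.

```r
# Packages listed above, minus "grid" (bundled with R)
required <- c(
  "tidyverse", "lubridate", "ggplot2", "patchwork", "RColorBrewer",
  "gridExtra", "scales", "ggpubr", "ggrepel", "viridis", "zoo",
  "cowplot", "jsonlite"
)

# Hypothetical helper: return the packages whose namespace cannot be loaded
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```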

Usage (figure reproduction)

  1. Clone the repository:

    git clone https://github.com/Matthijnssenslab/air_wastewater_leuven.git
    cd air_wastewater_leuven
  2. Set the working directory in R/RStudio to the repository root:

    setwd("/path/to/air_wastewater_leuven")
  3. Run the analysis scripts:

    # First, ensure data is filtered (if not loading pre-filtered)
    source("scripts/filter_data.R")
    
    # Run figure generation scripts
    source("scripts/figure1_panels.R")
    source("scripts/figure2_panels.R")
    # ... and so on
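
The figure scripts can also be run in a loop. The sketch below assumes the five figureN_panels.R files listed under "Repository structure" and that their order of execution does not matter. It first checks that the files are reachable from the current working directory, which is a common failure when the working directory is not the repository root.

```r
# Paths follow the repository structure shown above
figure_scripts <- file.path("scripts", sprintf("figure%d_panels.R", 1:5))

not_found <- figure_scripts[!file.exists(figure_scripts)]
if (length(not_found) > 0) {
  # Most likely the working directory is not the repository root
  message("Scripts not found: ", paste(not_found, collapse = ", "))
} else {
  for (script in figure_scripts) {
    message("Running ", script)
    source(script)
  }
}
```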

Data description

  • air_wastewater2.detected_virus.combined.tax.tsv: Main dataset containing taxonomically assigned viral reads.
  • metadata_matched.csv: Sample metadata linking air and wastewater samples.
  • uzleuven_pathogens_weekly_long.csv: Clinical surveillance data for correlation analysis.

License

MIT License. See LICENSE.
