This repository contains a set of software tools to support meta-analyses of 16S rRNA gene survey count data from studies of the human microbiota.
These tools facilitate the creation of a relational database of microbial count data and associated metadata. The design of these tools focused on time series 16S rRNA gene survey count data (and associated metadata), but they could be adapted to work with count data from other gene markers and from studies that do not have time series data.
The format of input data is based on that available from studies found on Qiita (free account registration required). Data of a similar format from other data sources should work.
All tools are written in Python. PostgreSQL is used as the relational database management system (RDBMS). SQLAlchemy's object-relational mapper (ORM) is used to perform database manipulations.
Here is a brief overview of the different software and data components that can be found in this repository:
- main.py: A script used to informally test various parts of data cleaning, parsing and database functionality. Functions in this script can be used to insert (Qiita-derived) studies into the database.
- model.py: The current SQLAlchemy model for the PostgreSQL database.
- config.py: A small script that parses database and Qiita configuration from database.ini.
- creator/: A package containing tools to parse data into appropriate objects and create/manipulate database tables and entries. Newer implementations of some scripts found in this package can be found in the wip/ package, but they still need to be fully integrated with the rest of the system.
- creator/bib_parser.py: A script to parse bibliographic information from XML files (downloaded from Qiita).
- creator/count_parser.py: A script to parse count, lineage and sequence variant (ASV) data found in BIOM files into Count objects.
- creator/prep_parser.py: A script to parse sample preparation and processing metadata from data files.
- creator/sample_parser.py: A script to parse sample and subject metadata from data files.
- creator/transact.py: Utility script to create and remove tables from the database.
- creator/csv_cleaner.py: Utility script to clean data from CSV files containing sample, subject and preparation metadata.
- downloader/qiita_downloader.py: A web scraper to search Qiita, collect data files, scrape processing metadata and download bibliographic data for studies of interest. This script has been adapted for command-line use and is independent of any functionality in other code in this repository. For further information, see downloader/README.md.
- test/: A package to support testing of various software components (for use with pytest).
- data/: A directory containing example input data (organised by type). These data are sometimes used in tests (found in test/).
- debug_tools/: Tools that may be helpful in debugging some scripts (e.g. to inspect files containing heterogeneous metadata and for parser profiling).
- wip/: Package containing code that is work in progress (wip).
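To illustrate the kind of parsing that config.py performs on database.ini, here is a minimal sketch using Python's standard configparser module. It is not the repository's actual implementation: the function name `read_section`, the section name `postgresql` and the keys shown in the usage comment are assumptions made for illustration only.

```python
from configparser import ConfigParser


def read_section(path, section):
    """Return the key/value pairs of one section of an INI file as a dict.

    Hypothetical helper: the real config.py may be organised differently.
    """
    parser = ConfigParser()
    # ConfigParser.read returns the list of files it managed to parse,
    # so an empty list means the file was missing or unreadable.
    if not parser.read(path):
        raise FileNotFoundError(f"Could not read configuration file: {path}")
    if not parser.has_section(section):
        raise KeyError(f"Section [{section}] not found in {path}")
    # All values are returned as strings; convert them (e.g. the port) as needed.
    return dict(parser.items(section))


# Example usage, assuming database.ini contains a section such as:
#   [postgresql]
#   host = localhost
#   port = 5432
#
# params = read_section("database.ini", "postgresql")
```

Keeping connection details in an INI file rather than in the code lets each user point the tools at their own PostgreSQL instance without editing any scripts.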
There is currently no way to install code in this repository as a Python package. A user is advised to create a virtual environment, e.g. using conda, and install the following Python packages before testing functionality of the provided scripts:
- selenium
- biom-format
- pandas
- sqlalchemy
- pint
- biopython
- networkx
To use the Qiita Downloader, only the selenium package is required. For further details please refer to Qiita Downloader documentation (downloader/README.md).
To execute test scripts, pytest is also required.
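Since there is no installer to verify dependencies, a quick check of the environment can help before running any scripts. The following is a minimal sketch, not part of the repository: the function name `missing_packages` and the `REQUIRED` mapping are hypothetical. Note that two of the packages are imported under a different name than the one used to install them (biom-format as `biom`, biopython as `Bio`).

```python
from importlib.util import find_spec

# Maps each package name listed above to the module name used to import it.
REQUIRED = {
    "selenium": "selenium",
    "biom-format": "biom",
    "pandas": "pandas",
    "sqlalchemy": "sqlalchemy",
    "pint": "pint",
    "biopython": "Bio",
    "networkx": "networkx",
    "pytest": "pytest",  # only needed to execute the test scripts
}


def missing_packages(required=REQUIRED):
    """Return the names of packages whose modules cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Running this inside the activated virtual environment lists any packages that still need to be installed.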
- Integrate perturbation fact parsing and population.
- Integrate wip code.
This code was written to accompany a master's thesis and is currently maintained by William Roberts-Sengier.