To run the whole pipeline you will need the following programs installed:
During the workflow, contamination is removed using a combination of
centrifuge and a custom script, remove_contamination.py. This pipeline
assumes that a centrifuge database already exists. If you don't have one,
you can build or download one by following the instructions here.
The centrifuge database I use internally for this was built on 28/07/2019 as part of a
pipeline for another project. The code used to generate this database can be
found here.
remove_contamination.py takes a --taxtree parameter. I have provided one of
these files, resources/taxonomy/mtbc.taxonlist, for the Mycobacterium
tuberculosis Complex (NCBI:txid77643). To create your own taxtree, either
write a file by hand that lists, one per line, each taxon ID you do not class
as contamination, or generate one using taxonkit. For example, to create the
MTBC one, I ran the following command:
mtbc_taxid=77643
taxonkit list --show-name --show-rank --ids "$mtbc_taxid" > mtbc.taxonlist
For more information about this command, refer to the documentation.
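As a rough illustration of how a taxon list like this can be used, here is a minimal Python sketch. It is not the code in remove_contamination.py. It assumes each non-blank line of the file starts with a numeric taxon ID, optionally indented and optionally followed by rank and name columns, which covers both a hand-written list and the taxonkit output above. The function names read_taxon_ids and is_contamination are hypothetical.

```python
from pathlib import Path


def read_taxon_ids(path):
    """Return the set of taxon IDs found at the start of each line of a taxon list."""
    taxon_ids = set()
    for line in Path(path).read_text().splitlines():
        # split() discards the leading indentation that taxonkit adds for child taxa.
        fields = line.split()
        # Skip blank lines and any line that does not begin with a numeric ID.
        if fields and fields[0].isdigit():
            taxon_ids.add(int(fields[0]))
    return taxon_ids


def is_contamination(taxon_id, keep_ids):
    """Treat a classification as contamination if its taxon ID is not in keep_ids."""
    return taxon_id not in keep_ids
```

With these helpers, read_taxon_ids("mtbc.taxonlist") would give the set of taxon IDs to keep, and any read classified to a taxon outside that set would be flagged by is_contamination.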
Refer to the assessment notebook for plots.
Ultimately, for the work in this project, we use the PacBio assemblies generated by Flye, without any polishing with Illumina.