Project for the Genomics Course 2021 (MSc Bioinformatics for Computational Genomics), held by Prof. Aureliano Gomez Bombarely at Università degli Studi di Milano.
The project aims to investigate and compare de novo assemblies of the chloroplast genome built from Illumina short reads and from PacBio and Oxford Nanopore long reads.
The chloroplast genome of Prunus avium was chosen for the assembly.
The raw sequencing data were retrieved from the NCBI SRA repository under the following accession numbers:
- SRR10362958: Illumina
- SRR4280451: PacBio SMRT
- SRR7786091: Oxford Nanopore
The reference chloroplast genomes of Prunus avium (MK622380.1) and Prunus apetala (NC_053693.1) were taken from the NCBI Nucleotide database.
The raw data were downloaded and converted into FASTQ files using `fastq-dump`.
The statistics of the FASTQ files were obtained using `fastq-stats`.
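As a rough illustration of the kind of summary this step produces, the sketch below computes a few basic read statistics in plain Python. It is not the `fastq-stats` implementation: the function name `fastq_stats` is hypothetical, and it assumes strict four-line FASTQ records (no wrapped sequences).

```python
from io import StringIO


def fastq_stats(handle):
    """Return read count, total bases, and min/mean/max read length.

    Assumption: every record spans exactly four lines.
    """
    lengths = []
    for i, line in enumerate(handle):
        # In a four-line FASTQ record the sequence is the second line.
        if i % 4 == 1:
            lengths.append(len(line.strip()))
    if not lengths:
        return {"reads": 0, "bases": 0, "min_len": 0, "mean_len": 0.0, "max_len": 0}
    return {
        "reads": len(lengths),
        "bases": sum(lengths),
        "min_len": min(lengths),
        "mean_len": sum(lengths) / len(lengths),
        "max_len": max(lengths),
    }


# Tiny in-memory example with two made-up reads.
example = StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
stats = fastq_stats(example)
print(stats)
```

For real data the same function could be given an open file handle instead of the in-memory example.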
Reads were mapped to both the Prunus avium and the Prunus apetala references using:
Illumina reads were pre-processed with `fastq-mcf` to remove the adapters before mapping.
The reads mapping to the chloroplast genome were extracted using `samtools view`.
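The exact `samtools view` options used are not recorded here. One common way to keep only mapped reads is to drop records whose SAM flag has the "unmapped" bit (0x4) set, which is what `samtools view -F 4` does. The sketch below shows that logic on plain SAM text; the helper names `is_mapped` and `mapped_records` are hypothetical, and the two records are made up.

```python
def is_mapped(sam_line):
    """True if the SAM record does not have the 0x4 (unmapped) flag bit set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # the FLAG is the second SAM column
    return (flag & 0x4) == 0


def mapped_records(lines):
    """Yield alignment lines for mapped reads, skipping '@' header lines."""
    for line in lines:
        if line.startswith("@"):
            continue
        if is_mapped(line):
            yield line


# Two made-up records: the first is mapped (flag 0), the second unmapped (flag 4).
sam_lines = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tMK622380.1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
]
kept = list(mapped_records(sam_lines))
print([rec.split("\t")[0] for rec in kept])
```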
The mapping statistics were evaluated with `samtools stats`.
The mapped reads were sorted by position using `samtools sort`.
`bedtools genomecov` with the option `-d` was used to obtain the depth at each genome position.
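With `-d`, `bedtools genomecov` writes one tab-separated line per position: sequence name, 1-based position, and depth. A minimal sketch of how such output could be summarised is shown below; the function name `depth_summary` and the example values are illustrative assumptions, not results from this project.

```python
def depth_summary(lines):
    """Summarise per-position depth lines (name, position, depth; tab-separated)."""
    depths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _name, _pos, depth = line.split("\t")
        depths.append(int(depth))
    if not depths:
        return {"positions": 0, "mean_depth": 0.0, "covered_fraction": 0.0}
    covered = sum(1 for d in depths if d > 0)
    return {
        "positions": len(depths),
        "mean_depth": sum(depths) / len(depths),
        "covered_fraction": covered / len(depths),
    }


# Made-up depth lines for four positions.
example_depth = ["chrPt\t1\t10", "chrPt\t2\t0", "chrPt\t3\t20", "chrPt\t4\t30"]
summary = depth_summary(example_depth)
print(summary)
```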
The `BAM` files obtained from the mapping to Prunus avium were converted into FASTQ with `bedtools bamtofastq`.
To find the best assembly, several subsets of reads were generated with `seqtk sample`.
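The idea behind this step, drawing a reproducible random subset of reads, can be sketched in plain Python as below. This is not how `seqtk sample` is implemented and will not pick the same reads; the helper names, the seed value, and the example reads are assumptions made for illustration.

```python
import random


def read_fastq_records(text):
    """Split FASTQ text into 4-line records (assumes unwrapped sequences)."""
    lines = text.splitlines()
    return [tuple(lines[i:i + 4]) for i in range(0, len(lines) - 3, 4)]


def subsample(records, n, seed=11):
    """Return n records chosen at random; the same seed gives the same subset."""
    if n >= len(records):
        return list(records)
    rng = random.Random(seed)
    return rng.sample(records, n)


# Ten made-up reads, from which three are drawn.
fastq_text = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(10))
records = read_fastq_records(fastq_text)
subset = subsample(records, 3, seed=11)
print([rec[0] for rec in subset])
```

Fixing the seed matters when subsets need to be regenerated later, and for paired-end data both mate files would need the same seed so that pairs stay in sync.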
The assemblies were obtained choosing:
The statistics of each assembly were obtained with `FastaSeqStats`.
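Which metrics were compared is not listed here, but contig count, total length, longest contig, and N50 are typical for this kind of comparison. The sketch below shows one way to compute them from a list of contig lengths; the function names and the example lengths are hypothetical.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def assembly_summary(lengths):
    """Basic assembly metrics from contig lengths."""
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "longest": max(lengths) if lengths else 0,
        "n50": n50(lengths),
    }


# Made-up contig lengths.
contig_lengths = [100, 80, 50, 30, 20, 20]
metrics = assembly_summary(contig_lengths)
print(metrics)
```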
The longest contigs for each dataset were selected using `FastaExtract` and aligned to the reference using BLASTN.
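Selecting the longest contig from an assembly can be sketched as follows. This is a plain-Python stand-in for the selection step, not the `FastaExtract` tool itself; the function names and the example contigs are assumptions.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping sequence ID to sequence."""
    chunks = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


def longest_contig(records):
    """Return (ID, sequence) of the longest record."""
    return max(records.items(), key=lambda item: len(item[1]))


# Made-up assembly with two contigs; the first is wrapped over two lines.
fasta_text = ">contig_1\nACGT\nAC\n>contig_2 len=3\nACG\n"
contigs = parse_fasta(fasta_text)
best_id, best_seq = longest_contig(contigs)
print(best_id, len(best_seq))
```

The selected sequence could then be written to its own FASTA file and used as the BLASTN query against the reference.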
The annotation was performed using GeSeq.