Project for the Genomics course 2021 (MSc Bioinformatics for Computational Genomics), held by Prof. Aureliano Gomez Bombarely at Università degli Studi di Milano.
The project aims to investigate and compare de novo assemblies of the chloroplast genome built from Illumina short reads and from PacBio and Oxford Nanopore long reads.
The chloroplast genome of Prunus avium was chosen for the assembly.
The raw sequencing data were retrieved from the SRA repository at NCBI under the following accession numbers:
- SRR10362958: Illumina
- SRR4280451: PacBio SMRT
- SRR7786091: Oxford Nanopore
The reference chloroplast genomes of Prunus avium (MK622380.1) and Prunus apetala (NC_053693.1) were taken from the NCBI Nucleotide database.
The raw data were downloaded and converted into FASTQ files using fastq-dump.
The statistics of the FASTQ files were obtained using fastq-stats.
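As an illustration of the kind of summary such statistics provide, here is a minimal Python sketch that computes read count, total bases and read-length figures from a FASTQ stream. It is not the fastq-stats implementation. It assumes plain four-line FASTQ records, and the reads in `toy_fastq` are hypothetical.

```python
from io import StringIO
from statistics import mean


def fastq_lengths(handle):
    """Yield the sequence length of each record in a 4-line-per-record FASTQ stream."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of every record holds the bases
            yield len(line.strip())


def summarize(lengths):
    """Return read count, total bases, and min/mean/max read length."""
    lengths = list(lengths)
    return {
        "reads": len(lengths),
        "bases": sum(lengths),
        "min_len": min(lengths),
        "mean_len": mean(lengths),
        "max_len": max(lengths),
    }


# Toy input standing in for a real FASTQ file (hypothetical reads).
toy_fastq = StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nACGT\n+\nIIII\n"
    "@read3\nACGTAC\n+\nIIIIII\n"
)
stats = summarize(fastq_lengths(toy_fastq))
print(stats)
```

For a real file, the `StringIO` object could be replaced by an open file handle.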
The reads were mapped to both the Prunus avium and the Prunus apetala reference using:
Illumina reads were pre-processed with fastq-mcf to remove the adapters before mapping.
The reads mapping to the chloroplast genome were extracted using samtools view.
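The exact options used for this step are not given here. One common way to keep only mapped reads is to drop records whose SAM FLAG has the "read unmapped" bit (0x4) set, which is what the `-F 4` filter of samtools view does. The Python sketch below shows that logic on toy SAM text. The read names, the reference name `chloroplast` and the coordinates are hypothetical.

```python
UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped


def mapped_records(sam_lines):
    """Yield alignment lines whose FLAG field does not have the unmapped bit set."""
    for line in sam_lines:
        if line.startswith("@"):  # header lines carry no alignment
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # FLAG is the second SAM column
        if not flag & UNMAPPED:
            yield line


# Toy SAM records (hypothetical read names, reference name, and coordinates).
toy_sam = [
    "@SQ\tSN:chloroplast\tLN:1000",
    "r1\t0\tchloroplast\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchloroplast\t50\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
kept = [line.split("\t")[0] for line in mapped_records(toy_sam)]
print(kept)
```

In the toy input, `r2` carries flag 4 and is the only record dropped.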
The mapping stats were evaluated with samtools stats.
The mapped reads were sorted by position using samtools sort.
bedtools genomecov with the -d option was used to obtain the depth at each genome position.
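To make the per-position depth concrete, the following Python sketch computes it from a list of alignment intervals and prints one line per position with 1-based coordinates, similar in layout to the `-d` report. It is an illustration of the idea only, not the bedtools code. The sequence name `toy_chloroplast`, its length and the intervals are hypothetical.

```python
def per_base_depth(genome_length, intervals):
    """Return a list with the depth at every position of one sequence.

    `intervals` holds (start, end) alignments, 0-based and end-exclusive as in BED.
    """
    # Difference array: +1 where an alignment starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in intervals:
        diff[start] += 1
        diff[end] -= 1
    depth, running = [], 0
    for change in diff[:genome_length]:
        running += change  # running sum of the changes gives the depth
        depth.append(running)
    return depth


# Toy alignments on a hypothetical 10 bp sequence.
toy_depth = per_base_depth(10, [(0, 4), (2, 6), (3, 10)])
for position, d in enumerate(toy_depth, start=1):
    print("toy_chloroplast", position, d, sep="\t")
```

Positions with a depth of zero are kept in the output, which makes drops in coverage along the genome easy to spot.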
The BAM files obtained from the mapping to Prunus avium were converted into FASTQ with bedtools bamtofastq.
In order to find the best assembly, several subsets of reads were generated with seqtk sample.
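The subset sizes and seeds used are not listed here. As a sketch of the underlying idea, the Python snippet below draws a fixed-size random subset of reads in a single pass (reservoir sampling) with a fixed seed, so that the same subset can be regenerated. It is not seqtk's code. The function name `sample_reads`, the seed value and the read identifiers are assumptions made for the example.

```python
import random


def sample_reads(records, k, seed):
    """Reservoir-sample k records from an iterable in a single pass."""
    rng = random.Random(seed)  # a fixed seed makes the subset reproducible
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)  # fill the reservoir with the first k records
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record  # replace a kept record with probability k / (i + 1)
    return reservoir


# Hypothetical read identifiers standing in for FASTQ records.
toy_reads = [f"read{n}" for n in range(1, 101)]
subset_a = sample_reads(toy_reads, 10, seed=11)
subset_b = sample_reads(toy_reads, 10, seed=11)
print(subset_a == subset_b, len(subset_a))
```

For paired-end data, using the same seed on both files keeps the mates together, provided the two files list the reads in the same order.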
The assemblies were obtained choosing:
The statistics of each assembly were obtained with FastaSeqStats.
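Which figures FastaSeqStats reports is not listed here. The Python sketch below assumes a typical set (number of contigs, total length, longest contig and N50) and computes them from FASTA text. It is an independent illustration, and the contigs in `toy_fasta` are hypothetical.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []  # keep the ID up to the first space
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n50(lengths):
    """Return the largest length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy assembly with four hypothetical contigs of 8, 6, 4 and 2 bp.
toy_fasta = ">contig1\nACGTACGT\n>contig2\nACGTAC\n>contig3\nACGT\n>contig4\nAC\n"
lengths = [len(seq) for _, seq in parse_fasta(toy_fasta.splitlines())]
assembly_stats = {
    "contigs": len(lengths),
    "total_length": sum(lengths),
    "longest": max(lengths),
    "n50": n50(lengths),
}
print(assembly_stats)
```

For a chloroplast assembly, a longest contig close to the reference length with few other contigs is usually the pattern to look for when comparing subsets.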
The longest contigs for each dataset were selected using FastaExtract and aligned to the reference using BLASTN.
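The selection step can be sketched in Python as follows: pick the longest sequence from a set of contigs and format it as a FASTA record, ready to be used as a query in an alignment. This does not reproduce the FastaExtract interface, and the BLASTN step itself is not shown. The helper names and the contigs are hypothetical.

```python
def longest_contig(contigs):
    """Return the (name, sequence) pair with the longest sequence."""
    return max(contigs, key=lambda item: len(item[1]))


def to_fasta(name, sequence, width=60):
    """Format one record as FASTA text with sequence lines of at most `width` bases."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Toy contigs (hypothetical names and sequences).
toy_contigs = [
    ("contig1", "ACGT" * 5),
    ("contig2", "ACGT" * 40),
    ("contig3", "AC"),
]
best_name, best_seq = longest_contig(toy_contigs)
fasta_text = to_fasta(best_name, best_seq)
print(fasta_text, end="")
```

If two contigs share the maximum length, `max` returns the first one encountered.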
The annotation was performed using GeSeq.