Instructions, python and R code for generating lava lamp plots of kmer coverage as the kmers themselves, or based on kmer-coverage averages on the raw reads.
This tool was used in the analysis of the Hoilungia hongkongensis genome (Figure S2). It is unlikely I will make a dedicated paper, so if these scripts are used, please cite the paper:
Eitel, M., Francis, W.R., Varoqueaux, F., Daraspe, J., Osigus, H-J., Krebs, S., Vargas, S., Blum, H., Williams, G.A., Schierwater, B., Wörheide, G. (2018) Comparative genomics and the nature of placozoan species. PLoS Biology 16(7):e2005359.
- Dependencies
- Operation to make kmer or read-based lavalamp plots
- Blob plots for metagenomes
- Shiny app for interactive binning
- Usage tips including suggestions on data display
- Troubleshooting
This includes a few scripts to generate a "lava lamp" plot of kmer coverage separated by GC content of the kmers. This is a heat-map of frequency of reads or kmers with a particular coverage and GC content, giving a sense of the coverage of the target species and any symbionts or contaminants.
Such plots were intended for use on metagenomic data (whole-genome shotgun, not amplicon), and can be generated with raw data, meaning that no assembly or other processing of the reads is needed. This is usually sufficient to reveal heterozygosity (for eukaryotes) or the presence of symbionts with very different GC content than the host, and instruct whether to even use the data or not. For instance, distinct "spots" suggest an good extraction, while a smear would suggest that there were some biases in the extraction or sequencing and that assembly may be highly problematic.
Below is an example plot from Hydra vulgaris SRR1032106 using a kmer of 31. The putative symbiont is visible in the plot with a high GC content. The second plot counts based on GC of the full read and median coverage kmers in that read, rather than the kmers alone.
This is conceptually similar to what is done in the "blob" plots by blobtools, (paper by Kumar et al 2013) though that process makes use of assembled contigs while this one considers the raw reads directly. However, instructions for generating such plots are also below. Some assemblers output contig coverage information with the contigs, and these can be used without mapping the reads.
When making such plots for metagenomics, be aware of the implications of contig coverage for many potential analyses. The coverage is calculated by reads per base, but for the blobs, should reflect the relative number of copies of contigs or chromosomes. For something like a bacterial sample, this could be considered to be a relative counts of cells.
Mapping the original reads back to the assembled contigs should make a mostly even coverage within each scaffold. The raw total counts per gene or per contig will vary by the length, and therefore must be normalized in order to be meaningful (such normalization is already expressed by the coverage). For example, if two bacterial species had equal abundance, but one has a genome 2x larger than the other, this will account for 2x the number of reads, even if the coverage is the same.
This means longer genes, contigs, or chromosomes will have more counts, but also means that total counts to homologous operons between species will vary by both coverage and length. If the objective is to make a statement about the abundance of particular genes or pathways in the sample, the contig coverage could be simplistically applied to the operon. However, it may be better to quantify the actual expression (qPCR or metatranscriptomics) or protein activity instead.
Download the jellyfish kmer counter (or another preferred kmer counter). This used to be supplied with Trinity for transcriptome assembly, and was previously in the trinity-plugins/
folder.
-
Run jellyfish on the raw genomic data, i.e. paired end reads.
jellyfish count -m 31 -s 2G -C -o fastq.counts -U 1000 -t 8 all_reads.fastq
data can also be piped, in which case you must provide the input
/dev/fd/0
(or/dev/stdin
):cat *.fastq | jellyfish count -m 31 -s 2G -C -o fastq.counts -U 1000 -t 8 /dev/fd/0
or unzipped in real time:
gzip -dc *.fastq.gz | jellyfish count -m 31 -s 2G -C -o fastq.counts -U 1000 -t 8 /dev/fd/0
-s
is the memory buffer, 2G (2 billion, or 2000000000) is safe for many analyses, though the array will expand if it is maxed out. This can consume a lot of memory (30-50Gb), so it is not advisable to run on a laptop.-U
is the max coverage to count. 1000 will catch mostly transposons.-t
is the thread count / number of CPUs.-C
refers to canonical counting, meaning both + and - strands.-m
is the kmer size. If downstream steps involving Trinity will be used (Step 4 and after) then do not select a kmer greater than 32, as this is the upper limit for thefastaToKmerCoverageStats
program.
Optionally, a lower limit can be set with
-L
. For generating kmer plots (step 3 below) this would affect the data and not be advised, but for the read plots (steps 4 and 5), the Trinity programfastaToKmerCoverageStats
assumes that any values below 2 are 1. Since the bulk of the kmers have a count of 1 (66% in the Hydra sample), this makes a much smaller file and will run faster at subsequent steps. -
Write out kmers with coverage counts. This file can be massive (easily 50Gb+, 83G for Hydra SRR1032106) so it is instead piped through the python script.
-k
and-u
are kmer size and max coverage from jellyfish. This step is a single process, so can take some time for very large files with 1 billion+ kmers (20-30 minutes). This generates a matrix of GC vs coverage across all kmers, where the value at each position is the number of unique kmers with that GC and coverage. This is run through python so that R does not have to deal with the counting in addition to the graphing.jellyfish dump fastq.counts | fastqdumps2histo.py -k 31 -u 1000 - > fastq.gc_cov.histo.csv
If the Trinity mode steps (4 and 5) will eventually be carried out, then the jellyfish kmer dump is required. A filtered version where kmers with a count of 1 have been removed can be generated using the
-j
option.jellyfish dump fastq.counts | fastqdumps2histo.py -k 31 -u 1000 -j fastq.dumps - > fastq.gc_cov.histo.csv
-
Configure the R script for this .csv file, and pdf output. Add text to annotate the plot as needed with the
text(x,y,"important things")
command. The X and Y values correspond to the center position of the X and the GC count (not the percentage). Thus a GC of 25% would be something like 7 or 8 for a kmer of 31. This script can be run interactively in an R environment (such as RKWard or RCommander) or in the command line. For the command line, an optional 3rd argument can be given to change the X-axis scale (default is 1000).Rscript jellyfish_gc_coverage_blob_plot_v2.R fastq.gc_cov.histo.csv counts_k31_gc_cov.pdf
-
Slice out sections of reads based on kmer coverage. This requires finally generating the jellyfish dump file, so that will be redone first if it was not done at the earlier step.
jellyfish dump fastq.counts > fastq.dumps
Then
-T
(Trinity mode) is called with kmersorter. This generates a table of median coverage per read, which would be used in Trinity to randomly normalize the data based on coverage. Instead, here it uses the same method to deterministically keep reads with coverage above or below some specified value (-a
and-b
, respectively). Specify the kmer used with-k
. The Trinity base directory is given with-D
. Some of the subprocesses can use multithreads, which is specified by-p
. Note that the most memory intensive step, fastaToKmerCoverageStats, has no intrinsic memory limit as it reads the entire fastq.dumps file into memory; this is why the reduced dump is generated with the-j
option in step 2.kmersorter.py -T -k 31 -p 8 -a 100 -b 200 -D ~/trinityrnaseq/ -1 reads_1.fq -2 reads_2.fq fastq.dumps
-
Generate a more precise coverage to GC map using the entire read rather than kmers. This is run similarly as before with some alternate options in fastqdumps2histo. As above, Trinity mode is specified with
-T
. The intermediate stats files from kmersorter are then used with the raw reads to count the coverage and GC. The read length should be detected automatically, though can be specified with the-r
option.fastqdumps2histo.py -s reads_1.stats reads_2.stats -f reads_1.fq reads_2.fq -u 1000 -T - > reads.gc_cov.histo.csv
-
Run the R script on this .csv file as above.
-
With the stats files already generated from step 5, cut out sections of reads from the plot by running kmersorter again with different options. Use
-s
and-w
to set the lower and upper bounds of GC percent cutoff. For example:kmersorter.py -T -k 31 -p 8 -a 50 -b 100 -s 0.5 -w 0.7 -D ~/trinityrnaseq/ -1 reads_1.fq -2 reads_2.fq fastq.dumps
kmersorter.py -T -k 31 -p 8 -a 45 -b 80 -s 0.35 -w 0.45 -D ~/trinityrnaseq/ -1 reads_1.fq -2 reads_2.fq fastq.dumps
-
To count kmer coverage of long reads (such as Sanger, PacBio, Moleculo, etc.) or contigs, use the
-L
option in kmersorter. This will calculate the median kmer coverage of the Illumina reads on each long read, that is, even though the coverage might be lower in the long read library, individual spots should become even more distinct on the plot because they are using the Illumina kmer coverage. In this example, the-S
option is used to only calculate the stats at this step, however this can be excluded and used with filtering options. The original paired reads can be ignored for this step.kmersorter.py -T -k 31 -p 8 -S -D ~/trinityrnaseq/ -L long_reads.fq fastq.dumps
After stats are generated, the filter options can be applied as above to select groups of reads by coverage or GC.
kmersorter.py -T -k 31 -p 8 -a 45 -b 80 -s 0.35 -w 0.45 -D ~/trinityrnaseq/ -L long_reads.fq fastq.dumps
-
Generate a long-read coverage to GC map, similar to the above step, except with the addition of the
-p
option. This option indicates that reads are not a standard length, and percentage GC should be calculated for each read.fastqdumps2histo.py -s long_reads.stats -f long_reads.fq -u 1000 -T -p - > long_reads.gc_cov.histo.csv
This script assumes that fastq files are always 4 lines and fasta files are always 2 lines. If your fasta format files (say for assembled contigs) are more than 2 lines per sequence (which would be header line and sequence line), then convert them first using the fasta2twoline.py script.
fasta2twoline.py contigs.fasta > twoline_contigs.fasta
The above steps 8 and 9 can be applied to assembled contigs/scaffolds, however, it may be conceptually easier to generate something like a blob-plot to view the contigs and/or contamination directly.
-
Map reads with a normal read mapper, like hisat2. Convert
.sam
file to a sorted.bam
, using samtools~/samtools-1.8/samtools view -Su twilhelma_2014_vs_scaffolds_v1.sam | ~/samtools-1.8/samtools sort - -o twilhelma_2014_vs_scaffolds_v1.sorted.bam
-
Extract the counts using the
sort_reads_from_bam.py
script.~/samtools-1.8/samtools view twilhelma_2014_vs_scaffolds_v1.sorted.bam | ~/git/lavaLampPlot/sort_reads_from_bam.py -i - > twilhelma_2014_vs_scaffolds_v1.sorted.hits_from_bam.txt
-
Collect the coverage, length, and GC content using the
hits_to_coverage.py
script.~/git/lavaLampPlot/hits_to_coverage.py -f twilhelma_scaffolds_v1.fasta -b twilhelma_2014_vs_scaffolds_v1.sorted.hits_from_bam.txt > twilhelma_2014_vs_scaffolds_v1.coverage.tab
-
Generate the plot with the R script
contig_gc_coverage.R
, which will automatically name the output file as.pdf
.Rscript ~/git/lavaLampPlot/contig_gc_coverage.R twilhelma_2014_vs_scaffolds_v1.coverage.tab
Some assemblers report the contig coverage directly within the fasta header, such as SPAdes. Another script can convert this directly into a compatible table.
-
Run spadescontigstocovgc.py directly on the assembly fasta file. This should work for assembly files from SPAdes, platanus, and MEGAHIT, possibly others.
spadescontigstocovgc.py scaffolds.fasta > scaffolds.stats.tab
-
Generate the plot as above with the R script
contig_gc_coverage.R
:Rscript ~/git/lavaLampPlot/contig_gc_coverage.R scaffolds.stats.tab
This is the first attempt to convert the plotting into an interactive ShinyApp, with 3 slide bars to control what scaffolds are shown. Here, using the example data from Ephydatia muelleri, the pink spot on the left shows a low-coverage bacterial partial-chromosome. Mouse-clicks displays the stats for that scaffold, and others nearby, below the plot.
The input file is declared in the first lines of the app source. This should be changed to either the output of hits_to_coverage.py
or spadescontigstocovgc.py
to view the contigs interactively.
Due to the connection between kmer length and coverage, there is necessarily a balance between longer kmers, which will resolve the y-axis better, and higher coverage, which will resolve the x-axis better. Kmers between 31 and 41 tend to perform fairly well. Because the Trinity steps cannot use a kmer larger than 32, setting the kmer to 31 is advisable (odd numbers are preferable to prevent double counting from palindromic sequences).
Sequencing coverage is also important. Environmental metagenomes (not amplicon sequencing) of fairly low coverage (5 gigbases) did not display any useful complexity. For metazoan genomes of 100-200Mb, around 20-30Gb of sequence data usually produced high quality plots displaying blobs for both animal and potential bacterial symbionts (see the example plot of Hydra, using one SRA of 25Gb of raw sequence).
Jellyfish can easily max out the memory on a system. For a hash size of 1G (1 billion kmers), for a k-mer of 31, it takes 6Gb of memory on my system; 2G takes 12Gb and 3-4G takes 23Gb (probably scales binarily). The hash buffer (-s
) will expand if jellyfish counts more kmers than the number initially provided. Some systems have safeguards to kill any process demanding 99% of the memory, but otherwise this may crash a computer. In those cases, it is advisable to split the dataset or just run jellyfish count
on a faction of the reads, such as the left or right reads alone. In both cases, the counts are then combined using the jellyfish merge
command:
jellyfish merge -o combined.counts left.counts right.counts
followed by running jellyfish dump
and the python script as above:
jellyfish dump combined.counts | fastqdumps2histo.py -k 31 -u 2000 - > combined.gc_cov.histo.csv
Note that when using merge, the -u
option for fastqdumps2histo must be changed accordingly, probably multiplied for each file that is merged. If this was two files with -U 1000
, then set the value to 2000 rather than 1000. Also, the hash size (-s
) must be the same for all files in the merge, so pick something large enough, maybe 4G (4 billion) if you have the memory.
The merging generates counts that are slightly different from counting them as a single set, on the order of 0.001% more when separate sets were merged than when counted as a single set. This is after excluding those above the counts cutoff for the single set (which might account for 0.01%). I have no explanation for this.
The manual from jellyfish v1 implies that if the output file is not specified with -o
, jellyfish should create an extra output file with the default name ("output_") when the hash size is full. I have not tested if this works for sparing memory usage.
The previous version of Trinity mode in kmersorter had a problem with the SRA headers generated from the SRA Toolkit fastq-dump (due to problems in the called Trinity scripts). The stats file generation would work normally, but then the stats cannot be sorted correctly and ultimately the process will fail merging the stats. Brian Haas suggested this alternate command to generate the header (which one would also use during Trinity normalization).
fastq-dump --split-files --defline-seq '@$sn[_$rn]/$ri' SRR1032106.sra
This may no longer be a problem since sort
is no longer called in the Trinity mode for kmersorter. This program was removed since the purpose was to organize the reads in a pair so they could be merged. The merging caused problems for extraction because the merge takes the average of the two coverage medians in a pair, so read coverage of 1 and 99 in a pair would make 50. This does not generate a count on the lava lamp plot, since the stats are counted separately in fastqdumps2histo.py
. However, during extraction of regions in kmersorter, additional/junk reads outside of the defined extraction boundary were collected when neither read was in the defined region.
Two problems have come up a few times, and sample plots are shown in the error_plots folder. For the case of Acropora, there is a strip at the bottom. This was caused by the sequence in the read spanning multiple lines per read, instead of just one. For this situation, convert the reads to two-line fasta with the fasta2twoline.py
script. For the other case of Aiptasia, the plot is distorted since the reads were trimmed and are not the same length. To solve this, use the -p
option in fastqdumps2histo.py
to calculate the length and GC for each sequence.