Skip to content

Commit b0d7aef

Browse files
ESPRESSO alpha1.2.2
Co-authored-by: gy-chop <[email protected]>
0 parents  commit b0d7aef

37 files changed

+7979
-0
lines changed

LICENSE

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
Use of this software is available to academic and non-profit institutions for research purposes subject to the terms of the 2-Clause BSD License (see copy below). For use or transfers of the software to commercial entities, please inquire with Dr. Yi Xing at [email protected]. © 2021 CHOP.
2+
3+
The FreeBSD Copyright
4+
5+
Copyright 1992-2021 The FreeBSD Project.
6+
7+
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
8+
9+
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
10+
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
11+
THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
12+
13+
The views and conclusions contained in the software and documentation are those of the authors and should not be interpreted as representing official policies, either expressed or implied, of the FreeBSD Project.

README.md

Lines changed: 355 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,355 @@
1+
# ESPRESSO (version alpha1.2.2)
2+
3+
[![Latest Release](https://img.shields.io/github/release/Xinglab/espresso.svg?label=Latest%20Release)](https://github.com/Xinglab/espresso/releases/latest)
4+
[![Total GitHub Downloads](https://img.shields.io/github/downloads/Xinglab/espresso/total.svg?label=Total%20GitHub%20Downloads)](https://github.com/Xinglab/espresso/releases)
5+
6+
## About
7+
8+
ESPRESSO (Error Statistics PRomoted Evaluator of Splice Site Options) is a novel method for processing alignment of long read RNA-seq data, which can effectively improve splice junction accuracy and isoform quantification.
9+
10+
## Table of contents
11+
12+
* [Dependencies](#dependencies)
13+
* [Usage](#usage)
14+
+ [Snakemake](#snakemake)
15+
+ [Basic Usage](#basic-usage)
16+
+ [Example](#example)
17+
+ [Preparing Input Files](#preparing-input-files)
18+
+ [All Arguments](#all-arguments)
19+
* [Output](#output)
20+
* [Visualization](#visualization)
21+
+ [Visualization Arguments](#visualization-arguments)
22+
+ [IGV](#igv)
23+
* [References](#references)
24+
25+
## Dependencies
26+
27+
ESPRESSO requires the following to be installed and available on $PATH:
28+
29+
* [Perl](https://www.perl.org/) >= 5.8 built with threading enabled
30+
+ Check for thread support with `perl -e 'use threads; print("ok\n")'`
31+
* [hmmer](http://www.hmmer.org/) >= 3.3.1
32+
* [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi) >= 2.8.1
33+
* [samtools](http://www.htslib.org/) >= 1.6
34+
35+
ESPRESSO requires a sorted SAM or BAM file as input. If the data is in another format then other tools can be used to create the sorted alignment file:
36+
37+
* [minimap2](https://github.com/lh3/minimap2)
38+
* ONT guppy basecaller
39+
40+
The [visualization](visualization/) scripts require
41+
42+
* [UCSC](http://hgdownload.soe.ucsc.edu/admin/exe/) tools
43+
+ bedGraphToBigWig
44+
+ faToTwoBit
45+
+ twoBitInfo
46+
* Python 3
47+
+ [NumPy](https://numpy.org/)
48+
49+
If using the [Snakemake](#snakemake) then its installation script can install the dependencies (except for guppy which requires a login).
50+
51+
## Usage
52+
53+
Please ensure that enough memory is available for the input data and number of threads. The memory usage (in GB) is estimated to be:
54+
55+
* `ESPRESSO_S`: `num_threads * (4 + read_count_of_largest_input/1,000,000)`
56+
* `ESPRESSO_C`: `num_threads * (4 + read_count/2,500,000)`
57+
* `ESPRESSO_Q`: `2 + total_read_count/4,000,000`
58+
59+
### Snakemake
60+
61+
A Snakemake workflow is provided for running ESPRESSO. See the [README](snakemake/README.md) in [snakemake/](snakemake/) for more details.
62+
63+
### Basic Usage
64+
65+
ESPRESSO has three steps. The basic usage is:
66+
67+
* `perl ESPRESSO_S.pl -L samples.tsv -F ref.fasta -A anno.gtf -O work_dir`
68+
+ samples.tsv is a tab separated file like:
69+
```
70+
/path/to/first.sorted.sam first_sample_name
71+
/path/to/second.sorted.sam second_sample_name
72+
```
73+
* `perl ESPRESSO_C.pl -I work_dir -F ref.fasta -X 0`
74+
+ If there are multiple inputs then `ESPRESSO_C` needs to be run once for each input. The -X parameter identifies which input is being processed (-X 0, -X 1, ...)
75+
* `perl ESPRESSO_Q.pl -L work_dir/samples.tsv.updated -A anno.gtf`
76+
77+
### Example
78+
79+
A small test data set is provided in [test_data/test_data_espresso_sirv.tar.gz](test_data/test_data_espresso_sirv.tar.gz). The unpacked files are:
80+
81+
* SIRV2_3.sort.sam: sorted aligned reads
82+
* SIRV2.fasta: reference sequence for SIRV2
83+
* SIRV_C.gtf: annotation of SIRV isoforms
84+
85+
Create a tab separated file that describes the ESPRESSO input. Call that file samples.tsv:
86+
```
87+
/path/to/SIRV2_3.sort.sam test_sample_name
88+
```
89+
90+
In this example there is only one SAM file. If there were multiple SAM files then each would be on a separate line in samples.tsv. The second column is the sample name and can be used to group multiple input SAM files together in the final output.
91+
92+
93+
Run `ESPRESSO_S` to pre-process the inputs to determine high confidence splice junctions and other info needed for later steps:
94+
```
95+
perl ESPRESSO_S.pl -A SIRV_C.gtf -L samples.tsv -F SIRV2.fasta -O test_sirv
96+
```
97+
98+
This will produce the file `test_sirv/samples.tsv.updated` which assigns target IDs to SAM files. In this example there is only one SAM with ID 0:
99+
```
100+
/path/to/SIRV2_3.sort.sam test_sample_name 0
101+
```
102+
103+
Run `ESPRESSO_C` to correct and recover splice junctions for each read:
104+
```
105+
perl ESPRESSO_C.pl -I test_sirv -F SIRV2.fasta -X 0 -T 5
106+
```
107+
108+
The -X parameter identifies which input is being processed. In this example there is only a single target ID (0). If there were multiple inputs then `ESPRESSO_C` would be run separately for each ID in samples.tsv.updated.
109+
110+
111+
Run `ESPRESSO_Q` to identify and quantify all isoforms from the reads:
112+
```
113+
perl ESPRESSO_Q.pl -A SIRV_C.gtf -L test_sirv/samples.tsv.updated -V test_sirv/samples_N2_R0_compatible_isoform.tsv
114+
```
115+
116+
The three main output files are in `test_sirv/` and provide details of the three detected isoforms (SIRV201,202,203). See [Output](#output) for a description of the files. The output file `test_sirv/samples_N2_R0_abundance.esp` should be similar to [test_data/expected_sirv_abundance.esp](test_data/expected_sirv_abundance.esp).
117+
118+
The visualization files can be generated with
119+
```
120+
python3 visualization/visualize.py --genome-fasta SIRV2.fasta --updated-gtf test_sirv/samples_N2_R0_updated.gtf --abundance-esp test_sirv/samples_N2_R0_abundance.esp --target-gene SIRV2 --minimum-count 1 --descriptive-name SIRV --output-dir test_sirv/visualization
121+
```
122+
123+
![SIRV result visualization](test_data/visualization_sirv.png)
124+
125+
### Preparing Input Files
126+
127+
#### Base Calling
128+
129+
The following is one possible procedure for base calling. Other base calling procedures may be preferable for different data sets. If using Nanopore data then the base calling could be done using guppy with a command similar to:
130+
```
131+
guppy_basecaller --input_path /path/to/fast5 --save_path /path/to/fastq/ --cpu_threads_per_caller 20 --config dna_r9.4.1_450bps_fast.cfg
132+
```
133+
134+
That command assumes that the data is from a Nanopore r9.4.1 cDNA library and it uses the fast mode of guppy with 20 threads. For a Nanopore r9.4.1 direct RNA library `--config rna_r9.4.1_70bps_fast.cfg` could be used.
135+
136+
137+
All the separate .fastq files can be combined into a single file with:
138+
```
139+
cat /path/to/fastq/*.fastq > combined.fastq
140+
```
141+
142+
#### Alignment
143+
144+
One possible alignment tool is Minimap2[1]. This command can output a SAM file without secondary alignments:
145+
```
146+
minimap2 -ax splice -ub --secondary=no ref.fasta combined.fastq > in.sam
147+
```
148+
149+
For noisy 1D Nanopore data the developer of Minimap2 suggests adding -k 14 and -w 4
150+
```
151+
minimap2 -ax splice -ub -k14 -w 4 --secondary=no ref.fasta combined.fastq > in.sam
152+
```
153+
154+
A BED file can be provided to assist the mapping:
155+
```
156+
paftools.js gff2bed anno.gtf > junctions.bed
157+
minimap2 -ax splice -ub -k14 -w 4 --junc-bed junctions.bed --secondary=no ref.fasta combined.fastq > in.sam
158+
```
159+
160+
Only the common format of CIGAR values is accepted by ESPRESSO, so please do **NOT** use parameters to output a specific format of CIGAR values (e.g. --cs in minimap2).
161+
162+
#### Sorted Alignment
163+
164+
The input SAM files need to be sorted by samtools before running ESPRESSO:
165+
```
166+
samtools sort -o sorted.sam unsorted.sam
167+
```
168+
169+
### All Arguments
170+
171+
```
172+
perl ESPRESSO_S.pl --help
173+
174+
Arguments:
175+
176+
-L, --list_samples
177+
tsv list of sample(s) (each file in a line with 1st column as sorted
178+
BAM/SAM file and 2nd column as sample name; required)
179+
-F, --fa
180+
FASTA file of all reference sequences. Please make sure this file is
181+
the same one provided to mapper. (required)
182+
-A, --anno
183+
input annotation file in GTF format (optional)
184+
-B, --SJ_bed
185+
input custom reliable splice junctions in BED format (optional; each
186+
reliable SJ in one line, with the 1st column as chromosome, the 2nd
187+
column as upstream splice site 0-base coordinate, the 3rd column as
188+
downstream splice site and 6th column as strand)
189+
-O, --out
190+
work directory (existing files in this directory may be OVERWRITTEN;
191+
default: ./)
192+
193+
-H, --help
194+
show this help information
195+
196+
-N, --read_num_cutoff
197+
min perfect read count for denovo detected candidate splice junctions
198+
(default: 2)
199+
-R, --read_ratio_cutoff
200+
min perfect read ratio for denovo detected candidate splice junctions:
201+
Set this as 1 for completely GTF-dependent processing (default: 0)
202+
203+
-C, --cont_del_max
204+
max continuous deletion allowed; intron will be identified if longer
205+
(default: 50)
206+
-M, --chrM
207+
tell ESPRESSO the ID of mitochondrion in reference file (default:
208+
chrM)
209+
210+
-T, --num_thread
211+
thread number (default: minimum of 5 and sam file number)
212+
-Q, --mapq_cutoff
213+
min mapping quality for processing (default: 1)
214+
```
215+
216+
```
217+
perl ESPRESSO_C.pl --help
218+
219+
Arguments:
220+
221+
-I, --in
222+
work directory (generated by ESPRESSO_S)
223+
-F, --fa
224+
FASTA file of all reference sequences. Please make sure this file is
225+
the same one provided to mapper. (required)
226+
-X, --target_ID
227+
ID of sample to process (required)
228+
229+
-H, --help
230+
show this help information
231+
232+
-T, --num_thread
233+
thread number (default: 5)
234+
```
235+
236+
```
237+
perl ESPRESSO_Q.pl --help
238+
239+
Arguments:
240+
241+
-L, --list_samples
242+
tsv list of multiple samples (each bam in a line with 1st column as
243+
sorted bam file, 2nd column as sample name in output, 3rd column as
244+
directory of ESPRESSO_C results; this list can be generated by
245+
ESPRESSO_S according to the initially provided tsv list; required)
246+
-A, --anno
247+
input annotation file in GTF format (optional)
248+
-O, --out_dir
249+
output directory (default: directory of -L)
250+
-V, --tsv_compt
251+
output tsv for compatible isoform(s) of each read (optional)
252+
253+
-H, --help
254+
show this help information
255+
256+
-N, --read_num_cutoff
257+
min perfect read count for all splice junctions of novel isoform
258+
(default: 2)
259+
-R, --read_ratio_cutoff
260+
min perfect read ratio for all splice junctions of novel isoform
261+
(default: 0)
262+
```
263+
264+
## Output
265+
266+
The three main output files are:
267+
* `sample_N2_R0_updated.gtf` is an updated GTF annotation for detected isoforms.
268+
+ Each detected isoform is reported as a transcript.
269+
+ The source column indicates for each isoform whether it is a `"novel_isoform"` or an `"annotated_isoform"`.
270+
* `sample_N2_R0_abundance.esp` is a tsv file for expression of detected isoforms.
271+
+ Each detected isoform is reported on a separate line.
272+
+ The first columns are `transcript_ID`, `transcript_name`, `gene_ID`.
273+
+ There is an additional column for each sample name provided in samples.tsv. Those columns show the number of reads from that sample which were counted toward this isoform.
274+
+ The counts are assigned by expectation maximization. Each input read contributes at most 1 count, either to a single isoform or distributed as fractional counts to multiple isoforms.
275+
* `sample_N2_R0_compatible_isoform.tsv` is a tsv file for compatible isoforms of each read.
276+
+ This file is only produced if the -V parameter of ESPRESSO_Q is used.
277+
+ The columns are `read_id`, `sample_name`, `read_classification`, `compatible_isoforms`.
278+
+ The possible classifications are NIC/NNC, NCD, ISM, FSM, single-exon.
279+
280+
## Visualization
281+
282+
Visualizations can be created for the ESPRESSO output files similar to [test_data/visualization_cd44.png](test_data/visualization_cd44.png) and [test_data/visualization_sirv.png](test_data/visualization_sirv.png). The script [visualization/visualize.py](visualization/visualize.py) can generate files that can then be viewed in a genome browser
283+
284+
### Visualization Arguments
285+
286+
```
287+
python3 visualization/visualize.py --help
288+
289+
usage: visualize.py [-h] --genome-fasta GENOME_FASTA --updated-gtf UPDATED_GTF
290+
--abundance-esp ABUNDANCE_ESP --target-gene TARGET_GENE
291+
--minimum-count MINIMUM_COUNT [--normalize-counts-to-cpm]
292+
--descriptive-name DESCRIPTIVE_NAME --output-dir
293+
OUTPUT_DIR
294+
295+
Generate files for visualizing ESPRESSO output
296+
297+
optional arguments:
298+
-h, --help show this help message and exit
299+
--genome-fasta GENOME_FASTA
300+
the fasta input to ESPRESSO
301+
--updated-gtf UPDATED_GTF
302+
the *_updated.gtf file output by ESPRESSO
303+
--abundance-esp ABUNDANCE_ESP
304+
the *_abundance.esp file output by ESPRESSO
305+
--target-gene TARGET_GENE
306+
the name of the gene to visualize. transcripts with
307+
name like {target-gene}-{number} or gene_id like
308+
{target-gene}.* will have output generated. Use the
309+
gene_id to match novel isoforms output by ESPRESSO
310+
--minimum-count MINIMUM_COUNT
311+
only isoforms where the (normalized) count for a
312+
sample meets the minimum count are considered
313+
--normalize-counts-to-cpm
314+
convert raw counts to counts per million
315+
--descriptive-name DESCRIPTIVE_NAME
316+
used as a label in the visualization and for filenames
317+
--output-dir OUTPUT_DIR
318+
where to write visualization files
319+
```
320+
321+
### IGV
322+
323+
* The visualization output can be viewed in IGV: [https://igv.org](https://igv.org)
324+
+ Download the desktop app.
325+
* In IGV:
326+
+ Genomes -> Load Genome from File -> {file}.fasta and {file}.fasta.fai
327+
+ File -> Load from File -> {file}.gtf
328+
- Right click gtf track.
329+
- Select "squished".
330+
- Set track height to a large enough value.
331+
+ File -> Load from File -> {sample_name}.bw for each sample
332+
- Right click track.
333+
- Select "Autoscale".
334+
- Select "Maximum" under "Windowing Function".
335+
- "Change Track Color".
336+
+ View -> Add New Panel
337+
- Click and drag panel borders to show the new panel.
338+
+ File -> Load from File -> target_genes/{file}.bed
339+
- Click name of new track and drag to the appropriate panel.
340+
+ Adjust the coordinates to show the data.
341+
+ File -> Save Image ->
342+
- Change file extension to .svg
343+
* Some changes were made manually to the example images.
344+
+ [https://inkscape.org/](https://inkscape.org/) can be used to edit .svg files.
345+
+ Remove extra details such as borders and header info.
346+
+ Change the color of the baseline for each density plot.
347+
+ Add a horizontal center line for each isoform.
348+
+ Move one expression value for each isoform to the left axis and remove the duplicate values.
349+
+ Add horizontal lines between panels.
350+
+ Edit the text on the left axis.
351+
+ Add a text title at the bottom.
352+
353+
## References
354+
355+
1. Li H. Minimap2: pairwise alignment for nucleotide sequences[J]. Bioinformatics, 2018, 34(18): 3094-3100.

snakemake/.gitignore

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
/.snakemake/
2+
/alignment/
3+
/conda_env/
4+
/espresso_out/
5+
/fastq_dir/
6+
/references/
7+
/visualization/
8+
__pycache__/

0 commit comments

Comments
 (0)