##Obtaining and installing StringTie
The current version of StringTie can be downloaded from http://ccb.jhu.edu/software/stringtie/ In order to build StringTie from the source package, the following steps should be taken:
- Unpack the downloaded StringTie source archive in a directory of your choice, e.g.:
cd ~/src/
tar xvfz ~/Downloads/stringtie-N.NN.tar.gz
A directory called stringtie-N.NN (where N.NN is the current numeric version of the program) will be created in the current directory.
- Change to that directory and build the stringtie executable:
cd stringtie-N.NN
make release
- Optionally, the stringtie executable can be copied to one of the shell's PATH directories for easy access, e.g.:
cp stringtie ~/bin/
##Running StringTie
Run stringtie from the command line like this: stringtie <aligned_reads.bam> [options] The main input of the program is a SAMtools BAM file with RNA-Seq mappings sorted by genomic location (for example the accepted_hits.bam file produced by TopHat).
The following optional parameters can be specified (use -h/--help to get the usage message):
-G reference annotation to use for guiding the assembly process (GTF/GFF3)
-l name prefix for output transcripts (default: STRG)
-f minimum isoform fraction (default: 0.1)
-m minimum assembled transcript length to report (default 200bp)
-o output path/file name for the assembled transcripts GTF (default: stdout)
-a minimum anchor length for junctions (default: 10)
-j minimum junction coverage (default: 1)
-t disable trimming of predicted transcripts based on coverage
(default: coverage trimming is enabled)
-c minimum reads per bp coverage to consider for transcript assembly (default: 2.5)
-s coverage saturation threshold; further read alignments will be
ignored in a region where a local coverage depth of <maxcov>
is reached (default: 1,000,000);
-v verbose (log bundle processing details)
-g gap between read mappings triggering a new bundle (default: 50)
-C output file with reference transcripts that are covered by reads
-M fraction of bundle allowed to be covered by multi-hit reads (default:0.95)
-p number of threads (CPUs) to use (default: 1)
-B enable output of Ballgown table files which will be created in the
same directory as the output GTF (requires -G, -o recommended)
-b enable output of Ballgown table files but these files will be
created under the directory path given as <dir_path>
-e only estimates the abundance of given reference transcripts (requires -G)
##Input files
StringTie takes as input a binary SAM (BAM) file which must be sorted by reference position. This file contains spliced read alignments such as the ones produced by TopHat. If you have a text file in SAM format you should first convert it to the BAM format using the samtools view command:
samtools view -S -b input.sam > input.bam
Any SAM spliced read alignment (a read alignment across at least one junction) needs to contain the tag XS to indicate which strand of the RNA that produced this read came from. TopHat alignments already include this tag, but if you use a different read mapper you should check that this tag is also included for all spliced alignment records. You would also need to check that the reads are properly sorted. For a SAM file, this can be done with the command:
sort -k 3,3 -k 4,4n input.sam > input.sorted.sam
For BAM files, the samtools program can be used to sort the alignments. Optionally, a reference annotation file in GTF/GFF3 format can be provided to StringTie. In this case, StringTie will check to see if the reference transcripts are expressed in the RNA-Seq data, and for the ones that are expressed it will compute coverage and FPKM values. Note that the reference transcripts need to be fully covered by reads in order to be included in StringTie's output. Other transcripts assembled from the data by StringTie and not present in the reference file will be printed as well.