add RASTtk tutorials

PATRIC3 · Nov 2, 2017 · 17c1b72 · 17c1b72
1 parent 721c7c4
commit 17c1b72
Showing 7 changed files with 689 additions and 3 deletions.
diff --git a/docroot/Makefile b/docroot/Makefile
@@ -17,4 +17,4 @@ help:
 # Catch-all target: route all unknown targets to Sphinx using the new
 # "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
 %: Makefile
-	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+	$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/docroot/cli_tutorial/images/Job_Details.png b/docroot/cli_tutorial/images/Job_Details.png
diff --git a/docroot/cli_tutorial/images/RASTtk_options.jpg b/docroot/cli_tutorial/images/RASTtk_options.jpg
diff --git a/docroot/cli_tutorial/rasttk_batch_mode.rst b/docroot/cli_tutorial/rasttk_batch_mode.rst
@@ -0,0 +1,211 @@
+.. _rasttk-batch-mode:
+
+Using the RAST tool kit in batch mode
+=====================================
+
+Welcome to the tutorial on submitting genomes to the RASTtk service in
+PATRIC in batch mode.  You can follow this tutorial by using the
+PATRIC Command Line Interface. We recommend that you first familiarize
+yourself with the RASTtk commands in the tutorials,
+:ref:`rasttk-getting-started` and :ref:`rasttk-incremental-commands`.
+
+
+How to perform a default pipeline batch submission
+--------------------------------------------------
+
+We will start by logging in to IRIS or by opening the PATRIC Command
+Line Interface.  You will need to log into PATRIC by typing the
+following command::
+
+    p3-login   username
+
+The username and password are your PATRIC account username and password.
+
+Navigate to a directory where you are comfortable working.
+
+In order to demonstrate the submission of a batch of genomes we will
+start by making a new directory. Type::
+
+    mkdir ToSubmit
+
+Please remain in the current working directory. We will populate
+"ToSubmit" momentarily.
+
+Next we will download the contig files for *E. coli* and *B. subtilis*
+from PATRIC into our current working directory. Type::
+
+    p3-genome-fasta --contig 511145.12 > E_coli.contig
+    p3-genome-fasta --contig 224308.43 > B_subtilis.contig
+
+Then we will convert these to genome typed objects and place them
+into "ToSubmit". Type::
+
+    rast-create-genome --scientific-name "Escherichia coli K-12" --genetic-code 11 --domain Bacteria --contigs E_coli.contig > ToSubmit/E_coli 
+
+    rast-create-genome --scientific-name "Bacillus subtilis 168" --genetic-code 11 --domain Bacteria --contigs B_subtilis.contig > ToSubmit/B_subtilis 
+
+Now we have two files called "E\_coli" and "B\_subtilis" which have been
+converted to genome typed objects and are waiting to be submitted. If
+you were annotating your own genomes, the process would be the same. You
+would first need to convert your contigs in fasta format into genome
+typed objects with the appropriate metadata. It is the directory of
+genome typed objects that is then sent to the RAST server.
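+
For a larger set of genomes, the conversion step can be scripted. The following is a minimal sketch, not part of RASTtk: the function name ``make_gtos``, the ``RAST_CREATE`` override variable, and the two-column tab-separated manifest (contig file, then scientific name) are all assumptions made for illustration. It also assumes every genome is bacterial with genetic code 11, as in the examples above.

```shell
#!/bin/sh
# Sketch: build a directory of genome typed objects from a manifest.
# Manifest format (an assumption of this sketch, not a RASTtk convention):
#   <contig-file><TAB><scientific name>
# RAST_CREATE may be set to substitute another command; by default the
# rast-create-genome command from the tutorial is used.

make_gtos() {
    manifest=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    # Split each manifest line on tabs so the name may contain spaces.
    while IFS="$(printf '\t')" read -r contigs name; do
        [ -n "$contigs" ] || continue            # skip empty lines
        base=$(basename "$contigs")
        # stdin is redirected so the command cannot consume the manifest.
        "${RAST_CREATE:-rast-create-genome}" \
            --scientific-name "$name" \
            --genetic-code 11 \
            --domain Bacteria \
            --contigs "$contigs" < /dev/null > "$outdir/${base%.*}" || return 1
    done < "$manifest"
}
```

With a manifest listing ``E_coli.contig`` and ``B_subtilis.contig``, calling ``make_gtos manifest.tsv ToSubmit`` would write ``ToSubmit/E_coli`` and ``ToSubmit/B_subtilis``, matching the file names used in this tutorial.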
+
+Now we will submit the entire directory, "ToSubmit", to be processed by
+RASTtk. This command uses the RASTtk default pipeline::
+
+    rast-process-genome-batch ToSubmit 
+
+This command returns a job id, which you should save. Mine is:
+9f0d6d9a-c686-4b55-af6f-e0df59b0bb01
+
+You can query the status of your job by using the command
+**rast-query-genome-batch**. Try it::
+
+    rast-query-genome-batch your-job-id
+
+If your job is running you will get a response from the server that
+looks like this::
+
+    kb|g.220035 in-progress 2014-07-29T08:48:36.733-05:00   0001-01-01T00:00:00Z            
+    kb|g.220034 in-progress 2014-07-29T08:48:36.758-05:00   0001-01-01T00:00:00Z            
+
+If your job is completed you will get a response from the server that
+looks like this::
+
+    kb|g.220035 completed   2014-07-29T08:48:36.733-05:00   2014-07-29T08:49:54.069-05:00   http://redwood.mcs.anl.gov:7078/node/751a0a8c-e13b-420d-84de-b2acdb79dd67?download  http://redwood.mcs.anl.gov:7078/node/dc44ef34-3855-42ea-bbb6-0294a5c91a47?download  http://redwood.mcs.anl.gov:7078/node/6e2be0e9-d3c9-4cc7-9eae-c821e30c2e01?download
+    kb|g.220034 completed   2014-07-29T08:48:36.758-05:00   2014-07-29T08:50:09.486-05:00   http://redwood.mcs.anl.gov:7078/node/f295435a-247f-45e6-87a2-772915561759?download  http://redwood.mcs.anl.gov:7078/node/d7c7ff6f-7ae4-48a2-aea3-11afe89a1805?download  http://redwood.mcs.anl.gov:7078/node/a7abccc6-1534-42b1-abbc-51ae71390907?download
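+
Rather than re-running the query by hand, the status check can be wrapped in a small polling loop. This is a sketch under stated assumptions: the function names and the ``RAST_QUERY`` override variable are hypothetical, and it relies only on the two statuses shown above ("in-progress" and "completed") appearing in the second column. Any other status the server might report, such as a failure state, is not handled and would keep the loop waiting.

```shell
#!/bin/sh
# Succeeds (exit 0) only when the query output on stdin is non-empty and
# every line reports "completed" in its second whitespace-separated column.
batch_is_complete() {
    awk '$2 != "completed" { pending = 1 }
         END { exit (NR == 0 || pending) ? 1 : 0 }'
}

# Poll the server until the whole batch is finished.
# Usage: wait_for_batch your-job-id [seconds-between-polls]
wait_for_batch() {
    until "${RAST_QUERY:-rast-query-genome-batch}" "$1" | batch_is_complete; do
        sleep "${2:-60}"
    done
}
```

Calling ``wait_for_batch your-job-id 120`` would check every two minutes and return once both genomes are listed as completed.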
+
+We will create a new directory for the output of our RASTtk job. Type::
+
+    mkdir Output
+
+When your job is completed, you can download it into the directory
+"Output" by typing::
+
+    rast-download-genome-batch your-job-id Output
+
+This will download an annotated genome typed object for each genome.
+
+Navigate to the Output directory by typing::
+
+    cd Output
+
+If you type::
+
+    ls
+
+you will see the contents of the output directory::
+
+    B_subtilis.gto     
+    B_subtilis.stderr
+    B_subtilis.stdout  
+    E_coli.gto
+    E_coli.stderr
+    E_coli.stdout
+
+The files with the extension ".gto" are the annotated genome typed
+objects, and the ".stderr" files provide a record of what annotation
+steps the RAST server performed. The ".stdout" files report the temporary
+directory where the computation was performed on the server.
+
+Finally, we can convert the genome typed objects to the format we desire
+by using **rast-export-genome**. For instance, I will convert the *E.
+coli* genome to a genbank file by typing::
+
+    rast-export-genome genbank < E_coli.gto > E_coli.gbk
+
+That's all there is to it. Running the default RASTtk pipeline in batch
+mode centers around the commands **rast-process-genome-batch** and
+**rast-download-genome-batch**. The most difficult parts are creating
+the initial directory of genome typed objects and converting
+the downloaded genome typed objects into useful output using
+**rast-export-genome**. When you are annotating a directory with many
+genomes it may be necessary to write a script that does this pre- and
+post-processing for you.
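+
The post-processing half of such a script could look like the sketch below, which exports every downloaded genome typed object in a directory to a single format. The function name ``export_all`` and the ``RAST_EXPORT`` override variable are illustrative assumptions. The only RASTtk command it calls is **rast-export-genome**, used exactly as shown above.

```shell
#!/bin/sh
# Export every .gto file in a directory to one format, writing each result
# next to its source file with the given extension.
# Usage: export_all Output genbank gbk
export_all() {
    dir=$1
    format=$2
    ext=$3
    for gto in "$dir"/*.gto; do
        [ -e "$gto" ] || continue      # the directory held no .gto files
        "${RAST_EXPORT:-rast-export-genome}" "$format" \
            < "$gto" > "${gto%.gto}.$ext" || return 1
    done
}
```

After the download step, ``export_all Output genbank gbk`` would produce ``Output/E_coli.gbk`` and ``Output/B_subtilis.gbk``.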
+
+How to perform a customized batch submission
+--------------------------------------------
+
+If you are still in the directory "Output/", please move back a level by
+typing::
+
+    cd ../
+
+To demonstrate a custom batch mode submission, we will reannotate the
+*E. coli* and *B. subtilis* genome typed objects that we originally put
+in the "ToSubmit" directory. We will customize this submission by adding
+the prophage finder, PhiSpy.
+
+Customizing a batch submission is nearly the same process as a default
+batch submission, except that you must also submit a special
+file declaring the steps of the custom pipeline that you wish to run. In
+order to do this, we use **rast-process-genome-batch** and we provide
+the workflow document using the ``--workflow`` flag.
+
+Please click the link below to look at the file that we are about to
+use. Notice that it contains a field called "stages" under which
+every step is declared with a key called "name". Special program options
+such as "condition" and "parameters" can also appear in this file under
+each named step. Notice that at the bottom of the file one of the named
+steps is "call\_features\_prophage\_phispy".
+
+`A Sample Workflow File <Workflow.txt>`__
+
+Please download this file by right-clicking (or control-clicking) the
+link, and save it to your current working directory under the name
+Workflow.txt. If you are working in IRIS you should upload it to
+your working directory after you have saved it on your computer.
+
+Now let's submit the custom annotation job. Please type::
+
+    rast-process-genome-batch --workflow Workflow.txt ToSubmit
+
+You can check the status of your job with **rast-query-genome-batch**
+the same as before.
+
+When your job is complete, you can download it. First we will make a
+directory called "Customized"::
+
+    mkdir Customized
+
+Download the genome::
+
+    rast-download-genome-batch  your-job-id Customized
+
+If you look at the new directory, it will have the same file names as
+before.
+
+Now we will export a genome::
+
+    rast-export-genome feature_data < Customized/E_coli.gto > E_coli.tbl
+
+Notice that the new file contains the prophage calls.
+
+Generating workflow documents the easy way
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If generating a RASTtk workflow document seems unpleasant to you, then
+you're in luck. It is possible to generate a workflow document using the
+`RAST <http://rast.nmpdr.org>`__ website.
+
+If you submit a genome on the RAST website using the RASTtk option, and
+you select "Customize RASTtk pipeline" it will bring up a table of
+options that looks like this:
+
+|image0|
+
+You can select from the available steps that you want to have in your
+custom pipeline and click "Finish the upload". When you click "View Job
+Details", you will see a page that looks like this:
+
+|image1|
+
+The workflow file is available by clicking the "Download" button.
+
+.. |image0| image:: images/RASTtk_options.jpg
+   :width: 600px
+.. |image1| image:: images/Job_Details.png
+   :width: 600px
diff --git a/docroot/cli_tutorial/rasttk_getting_started.rst b/docroot/cli_tutorial/rasttk_getting_started.rst
@@ -0,0 +1,162 @@
+.. _rasttk-getting-started:
+
+RASTtk: Getting Started With The Default Pipeline
+=================================================
+
+`RAST <http://rast.nmpdr.org/rast.cgi>`_ is a web-based environment that
+allows users to upload a genome, annotate the genome, edit the
+annotations and compare the genome with other sequenced genomes in the
+SEED database. Since the initial publication of `The RAST Server: rapid
+annotations using subsystems
+technology <http://www.ncbi.nlm.nih.gov/pubmed/18261238>`_ in 2008, over
+150,000 requests for genome annotations have been processed, at a
+current rate of 1,200 jobs per week.
+
+As demand for ever more accurate annotations and the number of
+newly-sequenced genomes increases, there is a growing demand for "the
+next generation" of the RAST technology (RASTtk). This new version of
+the architecture makes it possible to construct custom pipelines,
+integrate new bioinformatic tools, and make the developed pipelines
+easily accessible by a large user community.
+
+In essence, RASTtk is an updated version of the RAST pipeline which users
+can modify and customize, but it is not intended to replace the RAST web
+environment.
+
+Getting Started
+~~~~~~~~~~~~~~~
+
+In order to run the RASTtk tools, you will need to install the PATRIC
+command line utilities package. It is available for macOS and for Ubuntu or Debian
+Linux `here <https://github.com/TheSEED/RASTtk-Distribution/releases/>`_.
+
+If you want to step through the tutorial, you can download the *E. coli*
+K-12 contig in fasta format from PATRIC using the following command::
+
+         p3-genome-fasta --contig 511145.12 > E_coli.contig    
+
+511145.12 is the PATRIC genome identifier for *E. coli* K-12.
+The ``p3-genome-fasta`` command returns the DNA data for the contigs
+of the given genome.
+
+If you want to annotate batches of genomes, please refer to the
+tutorial :ref:`rasttk-batch-mode`.
+
+The Default RASTtk Pipeline
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The RASTtk environment is designed so that users can compose annotation
+pipelines, and then run those pipelines to annotate genomes. There is a
+rich and growing body of annotation tools that we have either built or
+imported from other groups, and these offer a framework for
+incrementally constructing annotations.
+
+In some cases users would rather execute a minimal set of commands
+representing the currently recommended annotation pipeline. This
+pipeline is composed of three easy scripts that:
+
+1. Format the contigs file
+
+2. Annotate the genome
+
+3. Export the genome
+
+The Concept of the *Genome Typed Object*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+All of the individual commands available in the RASTtk pipeline add data
+to a special file type called a genome typed object (GTO). A GTO is a
+JSON file that is compatible with KBase. Annotations are incrementally
+appended to this file until it is ready for export. By creating the GTO
+and adding annotation data to it, it is possible to export the data in
+multiple file formats without having to reannotate the genome.
+
+To create a GTO from scratch we will use the ``rast-create-genome`` command::
+
+        rast-create-genome 
+
+        options:
+            -o --output            file to which the output is to be written
+            -h --help              print usage message and exit
+               --url               URL for the genome annotation service
+               --genome-id         Genome identifier
+               --scientific-name   Scientific name (Genus species strain) for the genome
+               --domain            Domain (Bacteria/Archaea/Virus/Eukaryota) for the genome
+               --genetic-code      Genetic code for the genome (usually 11 for most organisms or 4 Mycoplasmas etc.)
+               --source            Source (external database) name for this genome
+               --source-id         Identifier for this genome in the source (external source)
+               --contigs           Fasta file containing DNA contig data
+
+We will use this command to create a GTO for the *E. coli* contig that
+we downloaded previously by typing::
+
+    rast-create-genome --scientific-name "Escherichia coli K-12" --genetic-code 11 --domain Bacteria --contigs E_coli.contig > E_coli.gto 
+
+Some processes in the pipeline require that you declare the scientific
+name, domain and genetic code, so these are required fields.
+
+To run the default RASTtk pipeline tool, type::
+
+          rast-process-genome < E_coli.gto > E_coli.gto2 
+
+Here, "E\_coli.gto2" is a second genome typed object with all of the
+RAST annotation data.
+
+This is the RASTtk Default Pipeline:
+
+#.  Calls rRNAs with a custom BLAST-based tool
+#.  Calls tRNAs with tRNAscan
+#.  Calls large repeat regions
+#.  Calls selenoproteins
+#.  Calls pyrrolysyl proteins
+#.  Finds Streptococcus repeat regions (only if the genus is Streptococcus)
+#.  Calls CRISPRs
+#.  Calls the protein-encoding genes with Prodigal and Glimmer3
+#.  Annotates protein-encoding genes with k-mers (version 2)
+#.  Annotates remaining hypothetical proteins with k-mers (version 1)
+#.  Attempts to annotate remaining hypothetical proteins by blasting against close relatives (if possible)
+#.  Performs a basic gene overlap removal
+
+To export the genome in a desired format we will use the ``rast-export-genome`` command::
+
+    rast-export-genome 
+      
+        options:
+        -i --input           file from which the input is to be read
+        -o --output          file to which the output is to be written
+        -h --help            print usage message and exit
+        --url                URL for the genome annotation service
+        --feature-type       Include this feature type in output. If no
+                             feature-types specified, include all feature
+                             types
+                           
+        Supported formats: 
+          genbank         Genbank format
+          genbank_merged  Genbank format as single merged locus, suitable for Artemis
+          feature_data    Tabular form of feature data
+          protein_fasta   Protein translations in fasta format
+          contig_fasta    Contig DNA in fasta format
+          feature_dna     Feature DNA sequences in fasta format
+          gff             GFF format
+          embl            EMBL format
+
+To illustrate how ``rast-export-genome`` is used, we will export our
+genome in genbank format. Type::
+
+    rast-export-genome genbank < E_coli.gto2 > E_coli.gbk
+
+Using the ``--feature-type`` option, it is possible to filter the output.
+For instance if we wanted a fasta file of RNA sequences we would type::
+
+    rast-export-genome feature_dna --feature-type rna < E_coli.gto2 > E_coli.rna.fasta
+
+Other feature types include "CDS", "repeat", "crispr\_array",
+"crispr\_repeat", and "crispr\_spacer". We anticipate that the number of
+feature types will continue to grow as we add new functionality.
+
+That's it! Three basic commands (``rast-create-genome``,
+``rast-process-genome`` and ``rast-export-genome``) give you the RASTtk
+default pipeline. However, this is only a subset of the available RASTtk
+functions. We have designed RASTtk so that it is modular and users can
+build custom annotation pipelines. In order to tap into this capability
+and to learn about individual steps please read the tutorial :ref:`rasttk-incremental-commands`.
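+
The three commands can also be chained into one step per genome. The sketch below is illustrative only: the function name ``annotate_genome`` and the ``RAST_CREATE``, ``RAST_PROCESS`` and ``RAST_EXPORT`` override variables are assumptions, and the domain and genetic code are fixed to the values used for *E. coli* in this tutorial.

```shell
#!/bin/sh
# Run the default pipeline end to end for one genome:
# create the GTO, annotate it, then export it in GenBank format.
# Usage: annotate_genome E_coli.contig "Escherichia coli K-12" E_coli
annotate_genome() {
    contigs=$1
    name=$2
    prefix=$3
    # Each step runs only if the previous one succeeded.
    "${RAST_CREATE:-rast-create-genome}" --scientific-name "$name" \
        --genetic-code 11 --domain Bacteria \
        --contigs "$contigs" > "$prefix.gto" &&
    "${RAST_PROCESS:-rast-process-genome}" < "$prefix.gto" > "$prefix.gto2" &&
    "${RAST_EXPORT:-rast-export-genome}" genbank < "$prefix.gto2" > "$prefix.gbk"
}
```

Because the intermediate ``.gto`` and ``.gto2`` files are kept, the annotated genome can later be exported in other formats without repeating the annotation.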