Updated README to doc extract_sr_bc_from_lr

baraaorabi · baraaorabi · commit 1e056b695a65 · 2023-01-20T11:35:51.000-08:00
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/sctagger/README.html)
 
 # scTagger
-scTagger matches barcodes of short- and long-reads of single-cell RNA-seq experiments to achieve the information of both datasets. 
+scTagger matches barcodes of short- and long-reads of single-cell RNA-seq experiments, which makes it possible to relate gene expression (from the short-reads) to RNA splicing (from the long-reads) at the cell level.
 
 ## Installation
 
@@ -23,7 +23,7 @@ scTagger has a single python script containing different functions to match long
 
 The whole pipeline contains three steps that you can run each part separately:
 
-#### Extract long-reads segment
+#### *1) Extract long-reads segment*
 The first step of the scTagger pipeline is to extract a segment where the probability of seeing a barcode is more than in other places.
 To run this step, you can use the following command. 
 
@@ -37,23 +37,23 @@ To run this step, you can use the following command.
 * `-g`: Space separated of the ranges of where SR adapter should be found on the LR's (Optional, Default: Detect from data)
 * `-z`: Indicate input is gzipped (Optional, Default: Assume input is gzipped if it ends with \".gz\")
 * `-t`: Number of threads (Optional, Default: 1)
-* `-sa`: Short-read adapter (Optional, Default: "CTACACGACGCTCTTCCGATCT")
+* `-sa`: Short-read adapter (Optional, Default: `CTACACGACGCTCTTCCGATCT`)
 * `--num-bp-afte`: Number of bases after the end of the SR adapter alignment to generate (Optional, Default: 20)
 * `-o`: Path to output file
 * `-p`: Path to plot file (Optional, Default: No plotting)
 
 **Inputs**
-* A list of fastQ files of long reads
+* A list of FASTQ files of long-reads
 
 **Outputs**
 * A Tsv file: 
   * First column is read-id 
   * Second column is the best edit distance with the short-read adapter
   * Third column is the starting point of long-read that matches with the adapter
   * Fourth column is the long-read segment that was found.
-* A plot of optimal alignment locations of the short read adapter to the long reads. 
+* A plot of optimal alignment locations of the short read adapter to the long-reads. 
 
-#### Extract short-reads barcodes
+#### *2) Extract short-reads barcodes*
 
 The second step is to extract the top short-reads barcodes that cover most of the reads.
 
@@ -78,8 +78,35 @@ The second step is to extract the top short-reads barcodes that cover most of th
   * Second column is the number of appearances of the barcode
 * A cumulative plot of SR coverage with batches of 1,000 barcodes 
 
-#### Match long-reads segment with short-reads barcode
-The last step is to match long read segments with selected barcodes from short reads
+#### *Alt. 2) Extract short-reads barcodes directly from long-reads*
+
+This is an alternative to the second step that avoids using the short-reads altogether and instead builds a whitelist of cellular barcodes directly from the long-reads.
+This is done by looking for exact matches of the 10x Chromium list of cellular barcodes on the long-read barcode segments.
+The barcodes are sorted by frequency, and the most frequent barcodes are kept using the same strategy as the `extract_sr_bc` module.
+
+```
+./scTagger.py extract_sr_bc_from_lr -i "path/to/long-read-segments" -wl "/path/to/10x-barcode-list.txt" -o "path/to/output.txt"
+```
+
+**Arguments**
+* `-i`: Input TSV file of long-read segments generated by the `extract_lr_bc` step
+* `-o`: Path to output file.
+* `-wl`: Path to 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz). Accepts both `.txt.gz` and `.txt` files.
+* `--thresh`: Percentage threshold required per step to continue adding read barcodes (Optional, Default: 0.005)
+* `--step-size`: Number of barcodes processed at a time and whose sum is used to check against the threshold (Optional, Default: 1000)
+* `--max-barcode-cnt`: Max number of barcodes to keep (Optional, Default: 25000)
+
+**Input**
+* The output file of the `extract_lr_bc` step
+* 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz)
+
+**Output**
+* A TSV file
+  * First column is barcodes
+  * Second column is the number of appearances of the barcode
+
+#### *3) Match long-reads segment with short-reads barcodes*
+The last step is to match long-read segments with selected barcodes from short reads
 ```
 ./scTagger.py match_trie -lr "path/to/output/extract/long-read/segment" -sr "path/to/output/extract/top/short-read" -o "path/to/output/file" -t "number of threads"
 ```
@@ -96,15 +123,15 @@ The last step is to match long read segments with selected barcodes from short r
 
 
 **Inputs**
-* Use the output of extracting long read segment and selecting top barcodes part as the inputs of this section 
+* Use the output of extracting long-read segment and selecting top barcodes part as the inputs of this section 
 
 **Outputs**
 * A TSV file
   *  First column is the read id
   *  Second column is the minimum edit distance
-  *  Third column is the number of short reads barcodes that match with the long read
+  *  Third column is the number of short reads barcodes that match with the long-read
   *  Fourth column is the long-read segment, and the Fifth column is a list of all short-read barcodes with minimum edit distance 
-* A bar plot that shows the number of long reads by the minimum edit distance of their match barcode
+* A bar plot that shows the number of long-reads by the minimum edit distance of their match barcode
 
 ## Citing scTagger
 scTagger was first accepted to RECOMB-seq 2022 and is now published by iScience:
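
Note on the `extract_sr_bc_from_lr` step documented in this commit: the README text describes exact matching of whitelisted barcodes on long-read segments, followed by a frequency-based cutoff controlled by `--thresh`, `--step-size`, and `--max-barcode-cnt`. The Python sketch below illustrates one plausible reading of that description; it is not taken from `scTagger.py`. The function names, the 16 bp barcode length, counting a barcode at most once per segment, ignoring reverse complements, and interpreting the threshold as "a batch's share of all counted matches" are all assumptions made for illustration.

```python
from collections import Counter


def count_whitelisted_barcodes(segments, whitelist, bc_len=16):
    """Count exact occurrences of whitelisted barcodes in long-read segments.

    `whitelist` must be a set of barcode strings (hypothetical helper,
    not scTagger's API). `bc_len` is an assumed barcode length.
    """
    counts = Counter()
    for seg in segments:
        # Every k-mer of the segment is a candidate barcode; using a set
        # means a barcode is counted at most once per segment.
        kmers = {seg[i:i + bc_len] for i in range(len(seg) - bc_len + 1)}
        counts.update(kmers & whitelist)
    return counts


def select_top_barcodes(counts, thresh=0.005, step_size=1000,
                        max_barcode_cnt=25000):
    """Keep the most frequent barcodes, one batch of `step_size` at a time.

    Assumed rule: stop as soon as a batch accounts for less than `thresh`
    of all counted matches, or once `max_barcode_cnt` barcodes are kept.
    """
    ranked = counts.most_common()  # sorted by descending frequency
    total = sum(counts.values())
    kept = []
    for start in range(0, len(ranked), step_size):
        batch = ranked[start:start + step_size]
        batch_share = sum(c for _, c in batch) / total
        if batch_share < thresh or len(kept) >= max_barcode_cnt:
            break
        kept.extend(batch)
    return kept[:max_barcode_cnt]
```

In this sketch, the segments would come from the fourth column of the `extract_lr_bc` output and the whitelist from the 10x barcode file loaded into a `set`; each `(barcode, count)` pair returned by `select_top_barcodes` corresponds to one row of the two-column TSV described under **Output**.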