Commit 1e056b6

committed
Updated README to doc extract_sr_bc_from_lr
1 parent 13038a5 commit 1e056b6

1 file changed

+38
-11
lines changed

README.md

Lines changed: 38 additions & 11 deletions
@@ -1,7 +1,7 @@
11
[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/sctagger/README.html)
22

33
# scTagger
4-
scTagger matches barcodes of short- and long-reads of single-cell RNA-seq experiments to achieve the information of both datasets.
4+
scTagger matches barcodes between the short-reads and long-reads of single-cell RNA-seq experiments, making it possible to relate gene expression (from the short-reads) to RNA splicing (from the long-reads) at the cell level.
55

66
## Installation
77

@@ -23,7 +23,7 @@ scTagger has a single python script containing different functions to match long
2323

2424
The pipeline consists of three steps, each of which can be run separately:
2525

26-
#### Extract long-reads segment
26+
#### *1) Extract long-reads segment*
2727
The first step of the scTagger pipeline is to extract, from each long-read, the segment where a barcode is most likely to be found.
2828
To run this step, you can use the following command.
2929

@@ -37,23 +37,23 @@ To run this step, you can use the following command.
3737
* `-g`: Space-separated list of the ranges where the SR adapter should be found on the LRs (Optional, Default: Detect from data)
3838
* `-z`: Indicate input is gzipped (Optional, Default: Assume input is gzipped if it ends with ".gz")
3939
* `-t`: Number of threads (Optional, Default: 1)
40-
* `-sa`: Short-read adapter (Optional, Default: "CTACACGACGCTCTTCCGATCT")
40+
* `-sa`: Short-read adapter (Optional, Default: `CTACACGACGCTCTTCCGATCT`)
4141
* `--num-bp-afte`: Number of bases after the end of the SR adapter alignment to generate (Optional, Default: 20)
4242
* `-o`: Path to output file
4343
* `-p`: Path to plot file (Optional, Default: No plotting)
4444

4545
**Inputs**
46-
* A list of fastQ files of long reads
46+
* A list of FASTQ files of long-reads
4747

4848
**Outputs**
4949
* A TSV file:
5050
* First column is read-id
5151
* Second column is the best edit distance with the short-read adapter
5252
* Third column is the starting point of long-read that matches with the adapter
5353
* Fourth column is the long-read segment that was extracted.
54-
* A plot of optimal alignment locations of the short read adapter to the long reads.
54+
* A plot of optimal alignment locations of the short read adapter to the long-reads.
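For downstream scripting, the four-column TSV described above can be loaded with a few lines of Python. This is only a minimal sketch: it assumes the file is tab-separated with no header row and that the second and third columns are integers, none of which is stated explicitly here. The names `LongReadSegment` and `read_segments` are hypothetical and not part of scTagger.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class LongReadSegment(NamedTuple):
    """One row of the step-1 TSV (column order as listed above)."""
    read_id: str
    edit_distance: int  # best edit distance to the short-read adapter
    start: int          # start of the adapter match on the long-read
    segment: str        # extracted long-read segment


def read_segments(handle: TextIO) -> Iterator[LongReadSegment]:
    """Yield one record per line of a four-column, tab-separated file."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        yield LongReadSegment(row[0], int(row[1]), int(row[2]), row[3])


# Made-up example rows for illustration, not real scTagger output.
example = io.StringIO(
    "read_1\t2\t35\tACGTACGTACGTACGTACGT\n"
    "read_2\t0\t41\tTTGCATTGCATTGCATTGCA\n"
)
records = list(read_segments(example))
print(records[0].read_id, records[0].edit_distance, records[0].segment)
```

To read a real file, pass an open file handle (for example `open(path, newline="")`) instead of the `io.StringIO` object.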
5555

56-
#### Extract short-reads barcodes
56+
#### *2) Extract short-reads barcodes*
5757

5858
The second step is to extract the top short-reads barcodes that cover most of the reads.
5959

@@ -78,8 +78,35 @@ The second step is to extract the top short-reads barcodes that cover most of th
7878
* Second column is the number of appearances of the barcode
7979
* A cumulative plot of SR coverage with batches of 1,000 barcodes
8080

81-
#### Match long-reads segment with short-reads barcode
82-
The last step is to match long read segments with selected barcodes from short reads
81+
#### *Alt. 2) Extract short-reads barcodes directly from long-reads*
82+
83+
This is an alternative to the second step which avoids using the short-reads altogether and instead builds a whitelist of cellular barcodes from the long-reads directly.
84+
This is done by looking for exact matches of the 10x Chromium list of cellular barcodes on the long-read barcode segments.
85+
The barcodes are sorted by frequency and the most frequent barcodes are kept using the same strategy as the `extract_sr_bc` module.
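The exact-match counting described above can be pictured roughly as follows. This is a sketch under assumptions, not scTagger's actual code: it assumes 16 bp barcodes, checks only the forward strand, and counts a barcode at most once per segment. The function name `count_whitelist_hits` and the constant `BARCODE_LEN` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set

BARCODE_LEN = 16  # assumed barcode length (hypothetical constant)


def count_whitelist_hits(segments: Iterable[str], whitelist: Set[str],
                         barcode_len: int = BARCODE_LEN) -> Counter:
    """Count, per whitelist barcode, the segments containing it exactly."""
    counts: Counter = Counter()
    for seg in segments:
        # All distinct substrings of the barcode length in this segment.
        kmers = {seg[i:i + barcode_len]
                 for i in range(len(seg) - barcode_len + 1)}
        # Keep only those on the whitelist; each counts once per segment.
        counts.update(kmers & whitelist)
    return counts


# Made-up data for illustration only.
whitelist = {"AAAACCCCGGGGTTTT", "ACGTACGTACGTACGT"}
segments = [
    "TTAAAACCCCGGGGTTTTAA",
    "AAAACCCCGGGGTTTTCC",
    "GGACGTACGTACGTACGTGG",
    "TTTTTTTTTTTTTTTTTTTT",
]
hits = count_whitelist_hits(segments, whitelist)
print(hits.most_common())  # barcodes sorted by frequency, most frequent first
```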
86+
87+
```
88+
./scTagger.py extract_sr_bc_from_lr -i "path/to/long-read-segments" -wl "path/to/10x-barcode-list.txt" -o "path/to/output.txt"
89+
```
90+
91+
**Arguments**
92+
* `-i`: Input TSV file containing the long-read segments generated by the `extract_lr_bc` step
93+
* `-o`: Path to output file.
94+
* `-wl`: Path to 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz). Accepts both .txt.gz files and .txt files.
95+
* `--thresh`: Percentage threshold required per step to continue adding read barcodes (Optional, Default: 0.005)
96+
* `--step-size`: Number of barcodes processed at a time and whose sum is used to check against the threshold (Optional, Default: 1000)
97+
* `--max-barcode-cnt`: Max number of barcodes to keep (Optional, Default: 25000)
98+
99+
**Input**
100+
* The output file of the `extract_lr_bc` step
101+
* 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz)
102+
103+
**Output**
104+
* A TSV file
105+
* First column is barcodes
106+
* Second column is the number of appearances of the barcode
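One plausible reading of how `--thresh`, `--step-size` and `--max-barcode-cnt` interact is sketched below: barcodes are ranked by count, added in batches of `step_size`, and the process stops once a batch accounts for less than `thresh` of all counted reads or the cap is reached. This interpretation is an assumption based on the argument descriptions, and `select_top_barcodes` is a hypothetical name, not a scTagger function.

```python
from typing import Dict, List, Tuple


def select_top_barcodes(counts: Dict[str, int],
                        thresh: float = 0.005,
                        step_size: int = 1000,
                        max_barcode_cnt: int = 25000) -> List[Tuple[str, int]]:
    """Keep the most frequent barcodes, batch by batch (assumed strategy)."""
    total = sum(counts.values())
    if total == 0:
        return []
    # Most frequent barcodes first.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    kept: List[Tuple[str, int]] = []
    for start in range(0, len(ranked), step_size):
        batch = ranked[start:start + step_size]
        # Share of all counted reads contributed by this batch.
        batch_share = sum(c for _, c in batch) / total
        if batch_share < thresh:
            break  # this batch adds too little; stop here
        kept.extend(batch)
        if len(kept) >= max_barcode_cnt:
            break
    return kept[:max_barcode_cnt]


# Made-up counts; a small step size and high threshold keep the example short.
counts = {"A": 50, "B": 30, "C": 15, "D": 3, "E": 2}
selected = select_top_barcodes(counts, thresh=0.1, step_size=2)
print(selected)
```

In this toy example the third batch (only `E`) contributes 2% of the reads, which is below the 10% threshold, so selection stops after the first four barcodes.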
107+
108+
#### *3) Match long-reads segment with short-reads barcodes*
109+
The last step is to match long-read segments with selected barcodes from short reads
83110
```
84111
./scTagger.py match_trie -lr "path/to/output/extract/long-read/segment" -sr "path/to/output/extract/top/short-read" -o "path/to/output/file" -t "number of threads"
85112
```
@@ -96,15 +123,15 @@ The last step is to match long read segments with selected barcodes from short r
96123

97124

98125
**Inputs**
99-
* Use the output of extracting long read segment and selecting top barcodes part as the inputs of this section
126+
* The output of the long-read segment extraction step and the output of the top barcode selection step
100127

101128
**Outputs**
102129
* A TSV file
103130
* First column is the read id
104131
* Second column is the minimum edit distance
105-
* Third column is the number of short reads barcodes that match with the long read
132+
* Third column is the number of short-read barcodes that match the long-read
106133
* Fourth column is the long-read segment, and the Fifth column is a list of all short-read barcodes with minimum edit distance
107-
* A bar plot that shows the number of long reads by the minimum edit distance of their match barcode
134+
* A bar plot that shows the number of long-reads by the minimum edit distance of their matched barcode
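To make the output columns concrete, the sketch below computes, for one long-read segment, the minimum edit distance to a set of barcodes and the list of barcodes that achieve it. It is a brute-force illustration, not the trie-based method that the `match_trie` name suggests, and the choice of scoring the barcode against the best-matching substring of the segment is an assumption. `infix_edit_distance` and `best_matches` are hypothetical names.

```python
from typing import Iterable, List, Optional, Tuple


def infix_edit_distance(barcode: str, segment: str) -> int:
    """Edit distance between barcode and its best-matching substring of segment."""
    # prev[j]: cost of aligning the barcode prefix seen so far to some
    # substring of segment ending at position j. Row 0 is all zeros because
    # the match may start anywhere in the segment at no cost.
    prev = [0] * (len(segment) + 1)
    for i, b in enumerate(barcode, start=1):
        curr = [i] + [0] * len(segment)
        for j, s in enumerate(segment, start=1):
            curr[j] = min(prev[j] + 1,             # barcode base unmatched
                          curr[j - 1] + 1,         # extra base in the segment
                          prev[j - 1] + (b != s))  # match or substitution
        prev = curr
    # The match may also end anywhere in the segment.
    return min(prev)


def best_matches(segment: str,
                 barcodes: Iterable[str]) -> Tuple[Optional[int], List[str]]:
    """Return the minimum distance and every barcode that achieves it."""
    best: Optional[int] = None
    hits: List[str] = []
    for bc in barcodes:
        d = infix_edit_distance(bc, segment)
        if best is None or d < best:
            best, hits = d, [bc]
        elif d == best:
            hits.append(bc)
    return best, hits


# Made-up sequences for illustration only.
segment = "TTACGTACGTAA"
barcodes = ["ACGTACGT", "ACGAACGT", "GGGGGGGG"]
min_dist, matched = best_matches(segment, barcodes)
print(min_dist, len(matched), matched)
```

The three printed values correspond loosely to the second, third and fifth columns of the output TSV. Comparing every segment against every barcode this way scales poorly, which is presumably why a trie is used in practice.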
108135

109136
## Citing scTagger
110137
scTagger was first accepted to RECOMB-seq 2022 and is now published by iScience:
