Commit c3b8d91

Merge pull request #116 from PNNL-CompBio/functionmotifs
2 parents 155e36c + 64f4fef commit c3b8d91

32 files changed, with 3098 additions and 3264 deletions

.github/workflows/action.yml

Lines changed: 9 additions & 1 deletion
@@ -52,7 +52,7 @@ jobs:
       - shell: bash -l {0}
         run: mamba install -y -c conda-forge snakemake==7.0 tabulate==0.8.10
       - shell: bash -l {0}
-        run: pip install -e git+https://github.com/PNNL-CompBio/Snekmer@kmer-association#egg=snekmer
+        run: pip install -e git+https://github.com/PNNL-CompBio/Snekmer@functionmotifs#egg=snekmer

       #test clustering step
       - name: Snekmer Cluster
@@ -105,3 +105,11 @@ jobs:
           source activate snekmer
           snekmer apply --configfile .test/config_learnapp.yaml -d .test --cores 1
           rm -rf .test/output
+
+      # run Snekmer Motif using previously generated model files
+      - name: Snekmer Motif
+        run: |
+          export PATH="/usr/share/miniconda/bin:$PATH"
+          source activate snekmer
+          snekmer motif --configfile .test/config.yaml -d .test --cores 1
+          rm -rf .test/output

.test/config.yaml

Lines changed: 3 additions & 1 deletion
@@ -48,4 +48,6 @@ score_dir: "output/example-model/"
 learnapp:
   save_apply_associations: False

-
+# motif params
+motif:
+  n: 200

README.md

Lines changed: 5 additions & 1 deletion
@@ -15,7 +15,7 @@ to determine probabilistic annotations.
 <img align="center" src="resources/snekmer_workflow.svg">
 </p>

-There are 5 operation modes for Snekmer: `cluster`, `model`, `search`, `learn`, and `apply`.
+There are six operation modes for Snekmer: `cluster`, `model`, and `search`, `learn`, `apply`, and `motif`.

 **Cluster mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA).
 Snekmer applies the relevant workflow steps and outputs the resulting clustering results in tabular form (.CSV),
@@ -40,6 +40,10 @@ and the outputs received from Learn. Snekmer uses cosine distance to predict the
 sequence from the kmer counts matrix. The output is a table for each file containing sequence annotation
 predictions with confidence levels.

+**Motif mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA)
+and the outputs received from Model. Snekmer performs a feature selection workflow to produce a
+list of motifs ordered by degree of conservation and a classification model using the selected features (.model).
+
 ## How to Use Snekmer

 For installation instructions, documentation, and more, refer to

docs/source/getting_started/cli.rst

Lines changed: 3 additions & 3 deletions
@@ -15,7 +15,7 @@ For an overview of Snekmer usage, reference the help command (``snekmer --help``
 .. code-block:: console

     $ snekmer --help
-    usage: snekmer [-h] [-v] {cluster,model,search,learn,apply} ...
+    usage: snekmer [-h] [-v] {cluster,model,search,learn,apply,motif} ...

     Snekmer: A tool for kmer-based sequence analysis using amino acid reduction (AAR)

@@ -26,7 +26,7 @@ For an overview of Snekmer usage, reference the help command (``snekmer --help``
     mode:
       Snekmer mode

-      {cluster,model,search,learn,apply}
+      {cluster,model,search,learn,apply,motif}

 Tailored references for the individual operation modes can be accessed
 via ``snekmer {mode} --help``.
@@ -49,7 +49,7 @@ files. Snekmer also assumes background files, if any, are stored in
 is shown below:


-Snekmer ``cluster``, ``model``, and ``search`` input
+Snekmer ``cluster``, ``model``, ``search``, and ``motif`` input

 .. code-block:: console

docs/source/getting_started/config.rst

Lines changed: 10 additions & 0 deletions
@@ -131,3 +131,13 @@ General parameters related to Snekmer's learn and apply mode (``snekmer learn``,
 ``seed``                      ``int``               Choose any (random) seed for reproducible fragmentation.
 ============================= ===================== =========================================================================

+
+Motif Parameters
+````````````````
+The following parameters are required for Snekmer's motif mode (``snekmer motif``), wherein feature selection is performed to find functionally relevant kmers.
+
+======================== ===================== ==================================================================================
+Parameter                Type                  Description
+======================== ===================== ==================================================================================
+``n``                    ``int``               Number of label permutation and rescoring iterations to run for each input family.
+======================== ===================== ==================================================================================
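
The `n` parameter documented in this hunk sets how many label-permutation and rescoring iterations run for each family. As a rough illustration of what a permutation test of this kind computes, here is a minimal, dependency-free Python sketch. It is not Snekmer's implementation: the function names are hypothetical, and a simple difference in mean kmer counts stands in for the SVM weights that the docs mention.

```python
import random


def kmer_scores(vectors, labels):
    """Score each kmer as (mean count in family) - (mean count outside it).

    Assumption: this stand-in score replaces the SVM weight used by the
    real workflow, purely to keep the sketch free of dependencies.
    """
    n_kmers = len(vectors[0])
    pos = [v for v, y in zip(vectors, labels) if y]
    neg = [v for v, y in zip(vectors, labels) if not y]
    return [
        sum(v[j] for v in pos) / len(pos) - sum(v[j] for v in neg) / len(neg)
        for j in range(n_kmers)
    ]


def permutation_p_values(vectors, labels, n=200, seed=0):
    """Return (observed scores, empirical p-values) from n label shuffles."""
    rng = random.Random(seed)
    observed = kmer_scores(vectors, labels)
    exceed = [0] * len(observed)
    shuffled = list(labels)
    for _ in range(n):
        # Shuffling keeps the family size fixed but breaks any real
        # association between sequences and labels.
        rng.shuffle(shuffled)
        permuted = kmer_scores(vectors, shuffled)
        for j, (obs, perm) in enumerate(zip(observed, permuted)):
            if perm >= obs:
                exceed[j] += 1
    # Add-one correction so that no p-value is exactly zero.
    return observed, [(count + 1) / (n + 1) for count in exceed]


if __name__ == "__main__":
    # Toy data: six sequences, two kmers; the first three are "in family".
    vectors = [[3, 1], [4, 0], [3, 1], [0, 1], [1, 0], [0, 1]]
    labels = [True, True, True, False, False, False]
    scores, p_values = permutation_p_values(vectors, labels, n=200)
    for j, (score, p) in enumerate(zip(scores, p_values)):
        print(f"kmer {j}: score={score:.2f} p={p:.3f}")
```

A larger `n` gives finer-grained p-values (the smallest attainable value here is `1 / (n + 1)`) at the cost of proportionally more rescoring work per family.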

docs/source/getting_started/usage.rst

Lines changed: 32 additions & 3 deletions
@@ -1,9 +1,9 @@
 Using Snekmer
 =============

-Snekmer has three modeling operations: ``cluster`` (unsupervised clustering),
-``model`` (supervised modeling), and ``search`` (application
-of model to new sequences). We will call the first two modes
+Snekmer has four modeling operations: ``cluster`` (unsupervised clustering),
+``model`` (supervised modeling), ``search`` (application
+of model to new sequences), and ``motif`` (feature selection). We will call the first two modes
 **learning modes** due to their utility in learning relationships
 between protein family input files. Users may choose a mode to best
 suit their specific use case.
@@ -233,3 +233,32 @@ and directories in addition to the files described previously.
     │   │   ├── Seq-Annotation-Scores-D.csv  # (optional) Sequence-annotation cosine similarity scores for D seqs
     │   │   ├── kmer-summary-C.csv  # Results with annotation predictions and confidence for C seqs
     │   │   └── kmer-summary-D.csv  # Results with annotation predictions and confidence for D seqs
+
+Snekmer Motif Output Files
+::::::::::::::::::::::::::
+
+Snekmer's motif mode produces the following output files and directories in addition to the files described previously.
+
+.. code-block:: console
+
+    .
+    ├── output/
+    │   ├── ...
+    │   ├── motif/
+    │   │   ├── kmers/
+    │   │   │   ├── A.csv  # kmers retained for A after recursive feature elimination
+    │   │   │   ├── B.csv  # kmers retained for B after recursive feature elimination
+    │   │   ├── preselection/
+    │   │   │   ├── A.csv  # kmer weights learned for A after recursive feature elimination
+    │   │   │   ├── B.csv  # kmer weights learned for B after recursive feature elimination
+    │   │   │   ├── A.model  # last (A/not A) classification model trained during RFE
+    │   │   │   ├── B.model  # last (B/not B) classification model trained during RFE
+    │   │   ├── sequences/
+    │   │   │   ├── A.csv  # Sequence vectors for A using the kmer subset retained after recursive feature elimination
+    │   │   │   ├── B.csv  # Sequence vectors for B using the kmer subset retained after recursive feature elimination
+    │   │   ├── scores/
+    │   │   │   ├── A.csv  # kmer weight learned for A on each permute/rescore iteration
+    │   │   │   ├── B.csv  # kmer weight learned for B on each permute/rescore iteration
+    │   │   ├── p_values/
+    │   │   │   ├── A.csv  # Tabulated results for A
+    │   │   │   └── B.csv  # Tabulated results for B
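
The directory layout in this hunk lends itself to a quick completeness check after a `snekmer motif` run, for example before a CI step removes `.test/output`. The Python sketch below only encodes the tree shown in the diff. The helper names are hypothetical and not part of Snekmer, and it assumes each family name (such as `A` or `B`) maps directly to the file stem.

```python
from pathlib import Path

# Subdirectories of output/motif/ listed in the documented tree; each
# holds one CSV per input family.
MOTIF_SUBDIRS = ("kmers", "preselection", "sequences", "scores", "p_values")


def expected_motif_outputs(families, root="output"):
    """List the files the documented tree implies for the given families."""
    motif_dir = Path(root) / "motif"
    paths = []
    for family in families:
        for subdir in MOTIF_SUBDIRS:
            paths.append(motif_dir / subdir / f"{family}.csv")
        # preselection/ additionally stores the last model trained during RFE.
        paths.append(motif_dir / "preselection" / f"{family}.model")
    return paths


def missing_motif_outputs(families, root="output"):
    """Return the expected files that do not exist under root."""
    return [p for p in expected_motif_outputs(families, root) if not p.exists()]


if __name__ == "__main__":
    for path in missing_motif_outputs(["A", "B"]):
        print(f"missing: {path}")
```

Each family contributes six files under this layout: five CSVs plus one `.model` file.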

docs/source/index.rst

Lines changed: 5 additions & 1 deletion
@@ -19,7 +19,7 @@ sequences to predict the nearest annotation and generate a confidence score.
     :width: 700
     :alt: Snekmer workflow overview

-There are 5 operation modes for Snekmer: ``cluster``, ``model``, ``search``, ``learn``, and ``apply``.
+There are 6 operation modes for Snekmer: ``cluster``, ``model``, ``search``, ``motif``, ``learn``, and ``apply``.

 **Cluster mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA).
 Snekmer applies the relevant workflow steps and outputs the resulting clustering results in tabular form (.CSV),
@@ -34,6 +34,8 @@ displays K-fold cross validation results in the form of figures (AUC ROC and PR
 and the models they wish to search their sequences against. Snekmer applies the relevant workflow steps
 and outputs a table for each file containing model annotation probabilities for the given sequences.

+**Motif mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA). Snekmer applies the relevant workflow steps and outputs a table (.csv) for each family, which shows the SVM weight and associated p-value for each kmer.
+

 **Learn mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA) as well as an annotation file. Snekmer generates a kmer counts matrix with the summed kmer distribution of each annotation recognized from the sequence ID. Snekmer then performs a self-evaluation to assess confidence levels. There are two outputs, a counts matrix, and a global confidence distribution.

@@ -61,6 +63,8 @@ The output is a table for each file containing sequence annotation predictions w

     tutorial/index
     tutorial/snekmer_demo
+    tutorial/snekmer_learnapp_tutorial
+    tutorial/snekmer_motif_tutorial

 .. toctree::
     :caption: Background

docs/source/tutorial/snekmer_demo.ipynb

Lines changed: 5 additions & 5 deletions
Large diffs are not rendered by default.
