Commit 88b5387

updated docs to include build+query mode; removed leftover build parameter
1 parent c7d7ea7 commit 88b5387

7 files changed

+382
-22
lines changed

dev/make_docs.sh

Lines changed: 7 additions & 6 deletions
@@ -1,9 +1,10 @@
 #!/bin/bash
 
-./metacache help build > docs/mode_build.txt
-./metacache help modify > docs/mode_modify.txt
-./metacache help query > docs/mode_query.txt
-./metacache help merge > docs/mode_merge.txt
-./metacache help info > docs/mode_info.txt
-./metacache help help > docs/mode_help.txt
+./metacache help build > docs/mode_build.txt
+./metacache help modify > docs/mode_modify.txt
+./metacache help query > docs/mode_query.txt
+./metacache help build+query > docs/mode_build_query.txt
+./metacache help merge > docs/mode_merge.txt
+./metacache help info > docs/mode_info.txt
+./metacache help help > docs/mode_help.txt
 

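The script above writes one help file per mode, and the new `build+query` mode is the only one whose output filename differs from the mode name (`+` becomes `_`). A minimal sketch of the same mapping as a loop, in case more modes are added later. The names `mode_to_file`, `generate_docs` and the `METACACHE` variable are hypothetical and introduced only for illustration; they are not part of the repository.

```shell
#!/bin/bash
# Sketch only: derive each help file name from the mode name.
# mode_to_file and generate_docs are hypothetical helpers.

# Map a mode name to its help file, replacing '+' with '_'
# so that 'build+query' becomes 'docs/mode_build_query.txt'.
mode_to_file() {
    printf 'docs/mode_%s.txt\n' "$(printf '%s' "$1" | tr '+' '_')"
}

# Binary to call; defaults to the path used by dev/make_docs.sh.
METACACHE="${METACACHE:-./metacache}"

# Write the help text of every mode listed in the commit to its file.
generate_docs() {
    for mode in build modify query build+query merge info help; do
        "$METACACHE" help "$mode" > "$(mode_to_file "$mode")"
    done
}
```

Calling `generate_docs` from the repository root (with an existing `docs` directory) would produce the same seven files as the committed script.
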
docs/mode_build.txt

Lines changed: 0 additions & 3 deletions
@@ -110,9 +110,6 @@ ADVANCED OPTIONS
                       contains a separate hash table.
                       default: 1
 
-    -query            Run interactive query after building database.
-                      default: off
-
 EXAMPLES
 
     Build database 'mydb' from sequence file 'genomes.fna':

docs/mode_build_query.txt

Lines changed: 367 additions & 0 deletions
@@ -0,0 +1,367 @@
+SYNOPSIS
+
+    metacache build+query -targets <sequence file/directory>... [OPTION]...
+
+    metacache build+query [OPTION]... -targets <sequence file/directory>...
+
+    metacache build+query -targets <sequence file/directory>... -query <sequence file/directory>... [OPTION]...
+
+    metacache build+query -targets <sequence file/directory>... [OPTION]... -query <sequence file/directory>...
+
+    metacache build+query [OPTION]... -targets <sequence file/directory>... -query <sequence file/directory>...
+
+
+DESCRIPTION
+
+    Create a new database of reference sequences (usually genomic sequences) and use it to map (other) sequences to their most likely taxon of origin.
+
+
+REQUIRED PARAMETERS
+
+    <sequence file/directory>...
+                      FASTA or FASTQ files containing genomic sequences
+                      (complete genomes, scaffolds, contigs, ...) that shall
+                      be used as representatives of an organism/taxon.
+                      If directory names are given, they will be searched for
+                      sequence files (at most 10 levels deep).
+
+
+
+BASIC OPTIONS
+
+    -taxonomy <path>  directory with taxonomic hierarchy data (see NCBI's
+                      taxonomic data files)
+
+    -taxpostmap <file>
+                      Files with sequence to taxon id mappings that are used as
+                      alternative source in a post processing step.
+                      default: 'nucl_(gb|wgs|est|gss).accession2taxid'
+
+    -silent|-verbose  information level during build:
+                      silent => none / verbose => most detailed
+                      default: neither => only errors/important info
+
+
+SKETCHING (SUBSAMPLING)
+
+    -kmerlen <k>      number of nucleotides/characters in a k-mer
+                      default: 16
+
+    -sketchlen <s>    number of features (k-mer hashes) per sampling window
+                      default: 16
+
+    -winlen <w>       number of letters in each sampling window
+                      default: 127
+
+    -winstride <l>    distance between window starting positions
+                      default: 112 (w-k+1)
+
+
+ADVANCED OPTIONS
+
+    -reset-taxa       Attempts to re-rank all sequences after the main build
+                      phase using '.accession2taxid' files. This will reset the
+                      taxon id of a reference sequence even if a taxon id could
+                      be obtained from other sources during the build phase.
+                      default: off
+
+    -max-locations-per-feature <#>
+                      maximum number of reference sequence locations to be
+                      stored per feature;
+                      If the value is too high it will significantly impact
+                      querying speed. Note that an upper hard limit is always
+                      imposed by the data type used for the hash table bucket
+                      size (set with compilation macro
+                      '-DMC_LOCATION_LIST_SIZE_TYPE').
+                      default: 254
+
+    -remove-overpopulated-features
+                      Removes all features that have reached the maximum allowed
+                      amount of locations per feature. This can improve querying
+                      speed and can be used to remove non-discriminative
+                      features.
+                      default: off
+
+    -remove-ambig-features <rank>
+                      Removes all features that have more distinct reference
+                      sequence on the given taxonomic rank than set by
+                      '-max-ambig-per-feature'. This can decrease the database
+                      size significantly at the expense of sensitivity. Note
+                      that the lower the given taxonomic rank is, the more
+                      pronounced the effect will be.
+                      Valid values: sequence, form, variety, subspecies,
+                      species, subgenus, genus, subtribe, tribe, subfamily,
+                      family, suborder, order, subclass, class, subphylum,
+                      phylum, subkingdom, kingdom, domain
+                      default: off
+
+    -max-ambig-per-feature <#>
+                      Maximum number of allowed different reference sequence
+                      taxa per feature if option '-remove-ambig-features' is
+                      used.
+
+    -max-load-fac <factor>
+                      maximum hash table load factor;
+                      This can be used to trade off larger memory consumption
+                      for speed and vice versa. A lower load factor will improve
+                      speed, a larger one will improve memory efficiency.
+                      default: 0.800000
+
+    -parts <#>        Splits the database into multiple parts. Each part
+                      contains a separate hash table.
+                      default: 1
+
+    -save-db <database filename>
+                      Save database to disk after querying.
+
+
+QUERY PARAMETERS
+
+    <sequence file/directory>...
+                      FASTA or FASTQ files containing genomic sequences (short
+                      reads, long reads, contigs, complete genomes, ...) that
+                      shall be classified.
+                      * If directory names are given, they will be searched for
+                      sequence files (at most 10 levels deep).
+                      * If no input filenames or directories are given,
+                      MetaCache will run in interactive query mode. This can be
+                      used to load the database into memory only once and then
+                      query it multiple times with different query options.
+
+
+MAPPING RESULTS OUTPUT
+
+    -out <file>       Redirect output to file <file>.
+                      If not specified, output will be written to stdout. If
+                      more than one input file was given all output will be
+                      concatenated into one file.
+
+
+    -split-out <file> Generate output and statistics for each input file
+                      separately. For each input file <in> an output file with
+                      name <file>_<in> will be written.
+
+
+PAIRED-END READ HANDLING
+
+    -pairfiles        Interleave paired-end reads from two consecutive files, so
+                      that the nth read from file m and the nth read from file
+                      m+1 will be treated as a pair. If more than two files are
+                      provided, their names will be sorted before processing.
+                      Thus, the order defined by the filenames determines the
+                      pairing not the order in which they were given in the
+                      command line.
+
+
+    -pairseq          Two consecutive sequences (1+2, 3+4, ...) from each file
+                      will be treated as paired-end reads.
+
+
+    -insertsize <#>   Maximum insert size to consider.
+                      default: sum of lengths of the individual reads
+
+
+CLASSIFICATION
+
+    -lowest <rank>    Do not classify on ranks below <rank>
+                      (Valid values: sequence, form, variety, subspecies,
+                      species, subgenus, genus, subtribe, tribe, subfamily,
+                      family, suborder, order, subclass, class, subphylum,
+                      phylum, subkingdom, kingdom, domain)
+                      default: sequence
+
+    -highest <rank>   Do not classify on ranks above <rank>
+                      (Valid values: sequence, form, variety, subspecies,
+                      species, subgenus, genus, subtribe, tribe, subfamily,
+                      family, suborder, order, subclass, class, subphylum,
+                      phylum, subkingdom, kingdom, domain)
+                      default: domain
+
+    -hitmin <t>       Sets classification threshold to <t>.
+                      A read will not be classified if less than t features from
+                      the database match. Higher values will increase precision
+                      at the expense of sensitivity.
+                      default: 0
+
+    -hitdiff <t>      Sets classification threshold to <t>.
+                      A read will not be classified if less than t features from
+                      the database match. Higher values will increase precision
+                      at the expense of sensitivity.
+                      default: 0
+
+    -maxcand <#>      maximum number of reference taxon candidates to consider
+                      for each query;
+                      A large value can significantly decrease the querying
+                      speed!
+                      default: 2
+
+    -cov-percentile <p>
+                      Remove the p-th percentile of hit reference sequences with
+                      the lowest coverage. Classification is done using only the
+                      remaining reference sequences. This can help to reduce
+                      false positives, especially when your input data has a high
+                      sequencing coverage.
+                      This feature decreases the querying speed!
+                      default: off
+
+
+GENERAL OUTPUT FORMATTING
+
+    -no-summary       Don't show result summary & mapping statistics at the end
+                      of the mapping output
+                      default: off
+
+    -no-query-params  Don't show query settings at the beginning of the mapping
+                      output
+                      default: off
+
+    -no-err           Suppress all error messages.
+                      default: off
+
+
+CLASSIFICATION RESULT FORMATTING
+
+    -no-map           Don't report classification for each individual query
+                      sequence; show summaries only (useful for quick tests).
+                      default: off
+
+    -mapped-only      Don't list unclassified reads/read pairs.
+                      default: off
+
+    -taxids           Print taxon ids in addition to taxon names.
+                      default: off
+
+    -taxids-only      Print taxon ids instead of taxon names.
+                      default: off
+
+    -omit-ranks       Do not print taxon rank names.
+                      default: off
+
+    -separate-cols    Prints *all* mapping information (rank, taxon name, taxon
+                      ids) in separate columns (see option '-separator').
+                      default: off
+
+    -separator <text> Sets string that separates output columns.
+                      default: '\t|\t'
+
+    -comment <text>   Sets string that precedes comment (non-mapping) lines.
+                      default: '# '
+
+    -queryids         Show a unique id for each query.
+                      Note that in paired-end mode a query is a pair of two read
+                      sequences. This option will always be activated if option
+                      '-hits-per-ref' is given.
+                      default: off
+
+    -lineage          Report complete lineage for per-read classification
+                      starting with the lowest rank found/allowed and ending
+                      with the highest rank allowed. See also options '-lowest'
+                      and '-highest'.
+                      default: off
+
+
+ANALYSIS: ABUNDANCES
+
+    -abundances <file>
+                      Show absolute and relative abundance of each taxon.
+                      If a valid filename is given, the list will be written to
+                      this file.
+                      default: off
+
+    -abundance-per <rank>
+                      Show absolute and relative abundances for each taxon on
+                      one specific rank.
+                      Classifications on higher ranks will be estimated by
+                      distributing them down according to the relative
+                      abundances of classifications on or below the given rank.
+                      (Valid values: sequence, form, variety, subspecies,
+                      species, subgenus, genus, subtribe, tribe, subfamily,
+                      family, suborder, order, subclass, class, subphylum,
+                      phylum, subkingdom, kingdom, domain)
+                      If '-abundances <file>' was given, this list will be
+                      printed to the same file.
+                      default: off
+
+
+ANALYSIS: RAW DATABASE HITS
+
+    -tophits          For each query, print top feature hits in database.
+                      default: off
+
+    -allhits          For each query, print all feature hits in database.
+                      default: off
+
+    -locations        Show locations in candidate reference sequences.
+                      Activates option '-tophits'.
+                      default: off
+
+    -hits-per-ref <file>
+                      Shows a list of all hits for each reference sequence.
+                      If this condensed list is all you need, you should
+                      deactivate the per-read mapping output with '-no-map'.
+                      If a valid filename is given after '-hits-per-ref', the
+                      list will be written to a separate file.
+                      Option '-queryids' will be activated and the lowest
+                      classification rank will be set to 'sequence'.
+                      default: off
+
+
+ANALYSIS: ALIGNMENTS
+
+    -align            Show semi-global alignment to best candidate reference
+                      sequence.
+                      Original files of reference sequences must be available.
+                      This feature decreases the querying speed!
+                      default: off
+
+
+ADVANCED: GROUND TRUTH BASED EVALUATION
+
+    -ground-truth     Report correct query taxa if known.
+                      Queries need to have either a 'taxid|<number>' entry in
+                      their header or a sequence id that is also present in the
+                      database.
+                      This feature decreases the querying speed!
+                      default: off
+
+    -precision        Report precision & sensitivity by comparing query taxa
+                      (ground truth) and mapped taxa.
+                      Queries need to have either a 'taxid|<number>' entry in
+                      their header or a sequence id that is also found in the
+                      database.
+                      This feature decreases the querying speed!
+                      default: off
+
+    -taxon-coverage   Report true/false positives and true/false negatives. This
+                      option turns on '-precision', so ground truth data needs
+                      to be available.
+                      This feature decreases the querying speed!
+                      default: off
+
+
+ADVANCED: PERFORMANCE TUNING / TESTING
+
+    -threads <#>      Sets the maximum number of parallel threads to use.
+                      default (on this machine): 88
+
+
+    -batch-size <#>   Process <#> many queries (reads or read pairs) per thread
+                      at once.
+                      default (on this machine): 4096
+
+    -query-limit <#>  Classify at max. <#> queries (reads or read pairs) per
+                      input file.
+                      default: 9223372036854775807
+
+EXAMPLES
+
+    Build database from sequence file 'genomes.fna' and query all sequences in 'myreads.fna':
+        metacache build+query -targets genomes.fna -query myreads.fna
+
+    Build database with latest complete genomes from the NCBI RefSeq and query interactively
+        download-ncbi-genomes refseq/bacteria myfolder
+        download-ncbi-genomes refseq/viruses myfolder
+        download-ncbi-taxonomy myfolder
+        metacache build+query -targets myfolder -taxonomy myfolder
+
+
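
The SKETCHING section of the new help file gives the default `-winstride` as `w-k+1`. A small sketch of that arithmetic with the documented defaults (`-winlen 127`, `-kmerlen 16`). The function name `default_winstride` is hypothetical and only illustrates the formula; it is not a MetaCache command.

```shell
#!/bin/bash
# Sketch only: default window stride from the documented formula w-k+1.
# default_winstride is a hypothetical helper, not part of MetaCache.
default_winstride() {
    winlen=$1    # -winlen  <w>: letters per sampling window
    kmerlen=$2   # -kmerlen <k>: letters per k-mer
    echo $(( winlen - kmerlen + 1 ))
}

# Documented defaults: w = 127, k = 16
default_winstride 127 16
```

With these defaults the stride is 112, so consecutive windows overlap by k-1 = 15 letters. A window of w letters contains w-k+1 k-mer start positions, so a stride of w-k+1 places each k-mer start position in exactly one window.
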
