|
| 1 | +SYNOPSIS |
| 2 | + |
| 3 | + metacache build+query -targets <sequence file/directory>... [OPTION]... |
| 4 | + |
| 5 | + metacache build+query [OPTION]... -targets <sequence file/directory>... |
| 6 | + |
| 7 | + metacache build+query -targets <sequence file/directory>... -query <sequence file/directory>... [OPTION]... |
| 8 | + |
| 9 | + metacache build+query -targets <sequence file/directory>... [OPTION]... -query <sequence file/directory>... |
| 10 | + |
| 11 | + metacache build+query [OPTION]... -targets <sequence file/directory>... -query <sequence file/directory>... |
| 12 | + |
| 13 | + |
| 14 | +DESCRIPTION |
| 15 | + |
| 16 | + Create a new database of reference sequences (usually genomic sequences) and use it to map (other) sequences to their most likely taxon of origin. |
| 17 | + |
| 18 | + |
| 19 | +REQUIRED PARAMETERS |
| 20 | + |
| 21 | + <sequence file/directory>... |
| 22 | + FASTA or FASTQ files containing genomic sequences |
| 23 | + (complete genomes, scaffolds, contigs, ...) that shall |
| 24 | + be used as representatives of an organism/taxon. |
| 25 | + If directory names are given, they will be searched for |
| 26 | + sequence files (at most 10 levels deep). |
| 27 | + |
| 28 | + |
| 29 | + |
| 30 | +BASIC OPTIONS |
| 31 | + |
| 32 | + -taxonomy <path> directory with taxonomic hierarchy data (see NCBI's |
| 33 | + taxonomic data files) |
| 34 | + |
| 35 | + -taxpostmap <file> |
| 36 | + Files with sequence to taxon id mappings that are used as |
| 37 | + alternative source in a post processing step. |
| 38 | + default: 'nucl_(gb|wgs|est|gss).accession2taxid' |
| 39 | + |
| 40 | + -silent|-verbose information level during build: |
| 41 | + silent => none / verbose => most detailed |
| 42 | + default: neither => only errors/important info |
| 43 | + |
| 44 | + |
| 45 | +SKETCHING (SUBSAMPLING) |
| 46 | + |
| 47 | + -kmerlen <k> number of nucleotides/characters in a k-mer |
| 48 | + default: 16 |
| 49 | + |
| 50 | + -sketchlen <s> number of features (k-mer hashes) per sampling window |
| 51 | + default: 16 |
| 52 | + |
| 53 | + -winlen <w> number of letters in each sampling window |
| 54 | + default: 127 |
| 55 | + |
| 56 | + -winstride <l> distance between window starting positions |
| 57 | + default: 112 (w-k+1) |
| 58 | + |
| 59 | + |
| 60 | +ADVANCED OPTIONS |
| 61 | + |
| 62 | + -reset-taxa Attempts to re-rank all sequences after the main build |
| 63 | + phase using '.accession2taxid' files. This will reset the |
| 64 | + taxon id of a reference sequence even if a taxon id could |
| 65 | + be obtained from other sources during the build phase. |
| 66 | + default: off |
| 67 | + |
| 68 | + -max-locations-per-feature <#> |
| 69 | + maximum number of reference sequence locations to be |
| 70 | + stored per feature; |
| 71 | + If the value is too high it will significantly impact |
| 72 | + querying speed. Note that an upper hard limit is always |
| 73 | + imposed by the data type used for the hash table bucket |
| 74 | + size (set with compilation macro |
| 75 | + '-DMC_LOCATION_LIST_SIZE_TYPE'). |
| 76 | + default: 254 |
| 77 | + |
| 78 | + -remove-overpopulated-features |
| 79 | + Removes all features that have reached the maximum allowed |
| 80 | + amount of locations per feature. This can improve querying |
| 81 | + speed and can be used to remove non-discriminative |
| 82 | + features. |
| 83 | + default: off |
| 84 | + |
| 85 | + -remove-ambig-features <rank> |
| 86 | + Removes all features that have more distinct reference |
| 87 | + sequences on the given taxonomic rank than set by |
| 88 | + '-max-ambig-per-feature'. This can decrease the database |
| 89 | + size significantly at the expense of sensitivity. Note |
| 90 | + that the lower the given taxonomic rank is, the more |
| 91 | + pronounced the effect will be. |
| 92 | + Valid values: sequence, form, variety, subspecies, |
| 93 | + species, subgenus, genus, subtribe, tribe, subfamily, |
| 94 | + family, suborder, order, subclass, class, subphylum, |
| 95 | + phylum, subkingdom, kingdom, domain |
| 96 | + default: off |
| 97 | + |
| 98 | + -max-ambig-per-feature <#> |
| 99 | + Maximum number of allowed different reference sequence |
| 100 | + taxa per feature if option '-remove-ambig-features' is |
| 101 | + used. |
| 102 | + |
| 103 | + -max-load-fac <factor> |
| 104 | + maximum hash table load factor; |
| 105 | + This can be used to trade off larger memory consumption |
| 106 | + for speed and vice versa. A lower load factor will improve |
| 107 | + speed, a larger one will improve memory efficiency. |
| 108 | + default: 0.800000 |
| 109 | + |
| 110 | + -parts <#> Splits the database into multiple parts. Each part |
| 111 | + contains a separate hash table. |
| 112 | + default: 1 |
| 113 | + |
| 114 | + -save-db <database filename> |
| 115 | + Save database to disk after querying. |
| 116 | + |
| 117 | + |
| 118 | +QUERY PARAMETERS |
| 119 | + |
| 120 | + <sequence file/directory>... |
| 121 | + FASTA or FASTQ files containing genomic sequences (short |
| 122 | + reads, long reads, contigs, complete genomes, ...) that |
| 123 | + shall be classified. |
| 124 | + * If directory names are given, they will be searched for |
| 125 | + sequence files (at most 10 levels deep). |
| 126 | + * If no input filenames or directories are given, |
| 127 | + MetaCache will run in interactive query mode. This can be |
| 128 | + used to load the database into memory only once and then |
| 129 | + query it multiple times with different query options. |
| 130 | + |
| 131 | + |
| 132 | +MAPPING RESULTS OUTPUT |
| 133 | + |
| 134 | + -out <file> Redirect output to file <file>. |
| 135 | + If not specified, output will be written to stdout. If |
| 136 | + more than one input file was given all output will be |
| 137 | + concatenated into one file. |
| 138 | + |
| 139 | + |
| 140 | + -split-out <file> Generate output and statistics for each input file |
| 141 | + separately. For each input file <in> an output file with |
| 142 | + name <file>_<in> will be written. |
| 143 | + |
| 144 | + |
| 145 | +PAIRED-END READ HANDLING |
| 146 | + |
| 147 | + -pairfiles Interleave paired-end reads from two consecutive files, so |
| 148 | + that the nth read from file m and the nth read from file |
| 149 | + m+1 will be treated as a pair. If more than two files are |
| 150 | + provided, their names will be sorted before processing. |
| 151 | + Thus, the order defined by the filenames determines the |
| 152 | + pairing, not the order in which they were given on the |
| 153 | + command line. |
| 154 | + |
| 155 | + |
| 156 | + -pairseq Two consecutive sequences (1+2, 3+4, ...) from each file |
| 157 | + will be treated as paired-end reads. |
| 158 | + |
| 159 | + |
| 160 | + -insertsize <#> Maximum insert size to consider. |
| 161 | + default: sum of lengths of the individual reads |
| 162 | + |
| 163 | + |
| 164 | +CLASSIFICATION |
| 165 | + |
| 166 | + -lowest <rank> Do not classify on ranks below <rank> |
| 167 | + (Valid values: sequence, form, variety, subspecies, |
| 168 | + species, subgenus, genus, subtribe, tribe, subfamily, |
| 169 | + family, suborder, order, subclass, class, subphylum, |
| 170 | + phylum, subkingdom, kingdom, domain) |
| 171 | + default: sequence |
| 172 | + |
| 173 | + -highest <rank> Do not classify on ranks above <rank> |
| 174 | + (Valid values: sequence, form, variety, subspecies, |
| 175 | + species, subgenus, genus, subtribe, tribe, subfamily, |
| 176 | + family, suborder, order, subclass, class, subphylum, |
| 177 | + phylum, subkingdom, kingdom, domain) |
| 178 | + default: domain |
| 179 | + |
| 180 | + -hitmin <t> Sets classification threshold to <t>. |
| 181 | + A read will not be classified if fewer than t features from |
| 182 | + the database match. Higher values will increase precision |
| 183 | + at the expense of sensitivity. |
| 184 | + default: 0 |
| 185 | + |
| 186 | + -hitdiff <t> Sets classification threshold to <t>. |
| 187 | + A read will not be classified if fewer than t features from |
| 188 | + the database match. Higher values will increase precision |
| 189 | + at the expense of sensitivity. |
| 190 | + default: 0 |
| 191 | + |
| 192 | + -maxcand <#> maximum number of reference taxon candidates to consider |
| 193 | + for each query; |
| 194 | + A large value can significantly decrease the querying |
| 195 | + speed! |
| 196 | + default: 2 |
| 197 | + |
| 198 | + -cov-percentile <p> |
| 199 | + Remove the p-th percentile of hit reference sequences with |
| 200 | + the lowest coverage. Classification is done using only the |
| 201 | + remaining reference sequences. This can help to reduce |
| 202 | + false positives, especially when your input data has a high |
| 203 | + sequencing coverage. |
| 204 | + This feature decreases the querying speed! |
| 205 | + default: off |
| 206 | + |
| 207 | + |
| 208 | +GENERAL OUTPUT FORMATTING |
| 209 | + |
| 210 | + -no-summary Don't show result summary & mapping statistics at the end |
| 211 | + of the mapping output |
| 212 | + default: off |
| 213 | + |
| 214 | + -no-query-params Don't show query settings at the beginning of the mapping |
| 215 | + output |
| 216 | + default: off |
| 217 | + |
| 218 | + -no-err Suppress all error messages. |
| 219 | + default: off |
| 220 | + |
| 221 | + |
| 222 | +CLASSIFICATION RESULT FORMATTING |
| 223 | + |
| 224 | + -no-map Don't report classification for each individual query |
| 225 | + sequence; show summaries only (useful for quick tests). |
| 226 | + default: off |
| 227 | + |
| 228 | + -mapped-only Don't list unclassified reads/read pairs. |
| 229 | + default: off |
| 230 | + |
| 231 | + -taxids Print taxon ids in addition to taxon names. |
| 232 | + default: off |
| 233 | + |
| 234 | + -taxids-only Print taxon ids instead of taxon names. |
| 235 | + default: off |
| 236 | + |
| 237 | + -omit-ranks Do not print taxon rank names. |
| 238 | + default: off |
| 239 | + |
| 240 | + -separate-cols Prints *all* mapping information (rank, taxon name, taxon |
| 241 | + ids) in separate columns (see option '-separator'). |
| 242 | + default: off |
| 243 | + |
| 244 | + -separator <text> Sets string that separates output columns. |
| 245 | + default: '\t|\t' |
| 246 | + |
| 247 | + -comment <text> Sets string that precedes comment (non-mapping) lines. |
| 248 | + default: '# ' |
| 249 | + |
| 250 | + -queryids Show a unique id for each query. |
| 251 | + Note that in paired-end mode a query is a pair of two read |
| 252 | + sequences. This option will always be activated if option |
| 253 | + '-hits-per-ref' is given. |
| 254 | + default: off |
| 255 | + |
| 256 | + -lineage Report complete lineage for per-read classification |
| 257 | + starting with the lowest rank found/allowed and ending |
| 258 | + with the highest rank allowed. See also options '-lowest' |
| 259 | + and '-highest'. |
| 260 | + default: off |
| 261 | + |
| 262 | + |
| 263 | +ANALYSIS: ABUNDANCES |
| 264 | + |
| 265 | + -abundances <file> |
| 266 | + Show absolute and relative abundance of each taxon. |
| 267 | + If a valid filename is given, the list will be written to |
| 268 | + this file. |
| 269 | + default: off |
| 270 | + |
| 271 | + -abundance-per <rank> |
| 272 | + Show absolute and relative abundances for each taxon on |
| 273 | + one specific rank. |
| 274 | + Classifications on higher ranks will be estimated by |
| 275 | + distributing them down according to the relative |
| 276 | + abundances of classifications on or below the given rank. |
| 277 | + (Valid values: sequence, form, variety, subspecies, |
| 278 | + species, subgenus, genus, subtribe, tribe, subfamily, |
| 279 | + family, suborder, order, subclass, class, subphylum, |
| 280 | + phylum, subkingdom, kingdom, domain) |
| 281 | + If '-abundances <file>' was given, this list will be |
| 282 | + printed to the same file. |
| 283 | + default: off |
| 284 | + |
| 285 | + |
| 286 | +ANALYSIS: RAW DATABASE HITS |
| 287 | + |
| 288 | + -tophits For each query, print top feature hits in database. |
| 289 | + default: off |
| 290 | + |
| 291 | + -allhits For each query, print all feature hits in database. |
| 292 | + default: off |
| 293 | + |
| 294 | + -locations Show locations in candidate reference sequences. |
| 295 | + Activates option '-tophits'. |
| 296 | + default: off |
| 297 | + |
| 298 | + -hits-per-ref <file> |
| 299 | + Shows a list of all hits for each reference sequence. |
| 300 | + If this condensed list is all you need, you should |
| 301 | + deactivate the per-read mapping output with '-no-map'. |
| 302 | + If a valid filename is given after '-hits-per-ref', the |
| 303 | + list will be written to a separate file. |
| 304 | + Option '-queryids' will be activated and the lowest |
| 305 | + classification rank will be set to 'sequence'. |
| 306 | + default: off |
| 307 | + |
| 308 | + |
| 309 | +ANALYSIS: ALIGNMENTS |
| 310 | + |
| 311 | + -align Show semi-global alignment to best candidate reference |
| 312 | + sequence. |
| 313 | + Original files of reference sequences must be available. |
| 314 | + This feature decreases the querying speed! |
| 315 | + default: off |
| 316 | + |
| 317 | + |
| 318 | +ADVANCED: GROUND TRUTH BASED EVALUATION |
| 319 | + |
| 320 | + -ground-truth Report correct query taxa if known. |
| 321 | + Queries need to have either a 'taxid|<number>' entry in |
| 322 | + their header or a sequence id that is also present in the |
| 323 | + database. |
| 324 | + This feature decreases the querying speed! |
| 325 | + default: off |
| 326 | + |
| 327 | + -precision Report precision & sensitivity by comparing query taxa |
| 328 | + (ground truth) and mapped taxa. |
| 329 | + Queries need to have either a 'taxid|<number>' entry in |
| 330 | + their header or a sequence id that is also found in the |
| 331 | + database. |
| 332 | + This feature decreases the querying speed! |
| 333 | + default: off |
| 334 | + |
| 335 | + -taxon-coverage Report true/false positives and true/false negatives. This |
| 336 | + option turns on '-precision', so ground truth data needs |
| 337 | + to be available. |
| 338 | + This feature decreases the querying speed! |
| 339 | + default: off |
| 340 | + |
| 341 | + |
| 342 | +ADVANCED: PERFORMANCE TUNING / TESTING |
| 343 | + |
| 344 | + -threads <#> Sets the maximum number of parallel threads to use. |
| 345 | + default (on this machine): 88 |
| 346 | + |
| 347 | + |
| 348 | + -batch-size <#> Process <#> many queries (reads or read pairs) per thread |
| 349 | + at once. |
| 350 | + default (on this machine): 4096 |
| 351 | + |
| 352 | + -query-limit <#> Classify at max. <#> queries (reads or read pairs) per |
| 353 | + input file. |
| 354 | + default: 9223372036854775807 |
| 355 | + |
| 356 | +EXAMPLES |
| 357 | + |
| 358 | + Build database from sequence file 'genomes.fna' and query all sequences in 'myreads.fna': |
| 359 | + metacache build+query -targets genomes.fna -query myreads.fna |
| 360 | + |
| 361 | + Build database with latest complete genomes from the NCBI RefSeq and query interactively: |
| 362 | + download-ncbi-genomes refseq/bacteria myfolder |
| 363 | + download-ncbi-genomes refseq/viruses myfolder |
| 364 | + download-ncbi-taxonomy myfolder |
| 365 | + metacache build+query -targets myfolder -taxonomy myfolder |
| 366 | + |
| 367 | + |