README.md
Lines changed: 56 additions & 29 deletions
@@ -6,22 +6,29 @@ MetaCache is a classification system for mapping genomic sequences (short reads,
6
6
7
7
For an independent comparison to other tools in terms of classification accuracy see the [LEMMI](https://lemmi.ezlab.org) benchmarking site.
8
8
9
-
MetaCache's CPU version classifies around 60 Million reads (of length 100) per minute against all complete bacterial, viral and archaea genomes from NCBI RefSeq Release 97 running with 88 threads on a workstation with 2 Intel(R) Xeon(R) Gold 6238 CPUs.
9
+
**MetaCache's CPU** version classifies around 60 Million reads (of length 100) per minute against all complete bacterial, viral and archaea genomes from NCBI RefSeq Release 97 running with 88 threads on a workstation with 2 Intel(R) Xeon(R) Gold 6238 CPUs.
10
10
11
-
MetaCache's [GPU version](docs/gpu_version.md) classifies around 300 Million reads (of length 100) per minute against all complete bacterial, viral, fungal and archaea genomes from NCBI RefSeq Release 202 running on a workstation with 4 NVIDIA(R) Tesla(R) V100 GPUs (32 GB model).
11
+
**MetaCache's [GPU version](docs/gpu_version.md)** classifies around 300 Million reads (of length 100) per minute against all complete bacterial, viral, fungal and archaea genomes from NCBI RefSeq Release 202 running on a workstation with 4 NVIDIA(R) Tesla(R) V100 GPUs (32 GB model).
12
12
13
13
14
14
15
15
16
16
## Quick Start with NCBI RefSeq
17
-
This will download MetaCache, compile it, download the complete bacterial, viral and archaea genomes from the latest NCBI RefSeq release (this can take some time) and build a classification database from them:
* download the complete bacterial, viral and archaea genomes from the latest NCBI RefSeq release (this can take some time)
31
+
* build a classification database
25
32
26
33
Once the default database is built you can classify reads:
27
34
```
@@ -36,56 +43,69 @@ Once the default database is built you can classify reads:
36
43
37
44
## Detailed Installation Instructions
38
45
39
-
#### Requirements
40
-
MetaCache itself should compile on any platform for which a C++14 conforming compiler is available. The Makefile is written with g++ or clang++ in mind, but could probably be adapted to MSVC or other compilers.
46
+
Visit MetaCache's GitHub [repository] to get the latest sources.
47
+
48
+
* To compile the CPU version: run `make` in the directory containing the Makefile
49
+
* To compile the GPU version, follow the instructions provided [here](docs/gpu_version.md).
50
+
51
+
52
+
### CPU Version Requirements
53
+
54
+
MetaCache itself should compile on any platform for which a C++14 conforming compiler is available. The Makefile is written with g++ or clang++ in mind, but could probably be adapted to (a very recent version of) MSVC or other compilers.
41
55
42
56
The helper scripts (for downloading genomes, taxonomy etc.) require the Bash shell to run. That means you need a working bash executable as well as some common GNU utilities like "awk" and "wget". On Windows you should use the 'Windows Subsystem for Linux' (which gives you an Ubuntu user mode talking to the Windows Kernel).
43
57
44
-
There are no dependencies on third party libraries.
45
-
MetaCache was successfully tested on the following platforms (all 64 bit + 64 bit compilers):
46
-
- Ubuntu 14.04 with g++ 5.4
47
-
- Ubuntu 16.04 with g++ 5.3, g++ 7.2
48
-
- Ubuntu 18.04 with g++ 5.4, g++ 7.4
49
-
- Windows 10 Build 1709 64bit with MinGW-w64 g++ 7.2
50
-
- Windows 10 Build 1909 64bit running Ubuntu 16.04 inside WSL and g++ 7.2
58
+
MetaCache 2.0.0 was successfully tested on the following platforms (all 64 bit + 64 bit compilers):
59
+
- Ubuntu 20.04 with g++ 5.4, g++ 7.4
60
+
- Windows 10 20H2 running Ubuntu 20.04 inside WSL2 and g++ 10.3
51
61
52
62
In order to be able to build the default database (based on NCBI RefSeq Release 97) with default settings your system should have around 64GB of RAM (note that the NCBI RefSeq will still be growing in the near future).
53
63
If you don't have enough RAM, you can use [database partitioning](docs/partitioning.md).
54
64
55
-
#### Get The Latest Sources
56
-
Visit MetaCache's github [repository].
57
65
66
+
### GPU Version Requirements
67
+
The GPU version requires a CUDA-capable device of the Pascal generation or newer and either CUDA >= 11 or CUDA 10.2 and a self-provided version of [CUB](https://github.com/NVlabs/cub).
58
68
59
-
#### Compile
60
-
Run 'make' in the directory containing the Makefile.
61
-
This will compile MetaCache with the default data type settings which support databases with up to 65,535 reference sequences (targets) and k-mer sizes up to 16. This offers a good database space efficiency and is currently sufficient for the complete bacterial, viral and archaea genomes from the NCBI RefSeq.
69
+
See [here](docs/gpu_version.md) for more.
62
70
63
-
If you want MetaCache to be able to process gzipped files make sure you have the zlib library installed on your system and compile with:
64
71
72
+
### Library Requirements (CPU & GPU versions)
73
+
MetaCache requires the zlib compression library to be installed on your system in order to be able to process gzipped FASTA/FASTQ files.
74
+
On Debian/Ubuntu zlib can be installed with
65
75
```
66
-
make MACROS="-DMC_ZLIB"
76
+
sudo apt install -y zlib1g zlib1g-dev
67
77
```
78
+
If you *don't* have zlib installed or cannot do so you can compile with:
79
+
```
80
+
make MC_ZLIB=NO
81
+
```
82
+
which will remove the zlib dependency and disable support for gzipped input files.
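The choice between the two invocations above can be sketched as a small shell helper. This is only an illustration: the function name `metacache_zlib_make_cmd` is hypothetical, and the default header path `/usr/include/zlib.h` is an assumption that fits typical Debian/Ubuntu installations of `zlib1g-dev` but may differ on other systems. The helper prints the command instead of running it.

```shell
# Print the make invocation to use, depending on whether a zlib header is found.
# Assumption: the header lives at /usr/include/zlib.h unless a path is given.
metacache_zlib_make_cmd() {
    header=${1:-/usr/include/zlib.h}
    if [ -f "$header" ]; then
        # zlib header found: build with gzip support (the default)
        echo "make"
    else
        # no zlib header: build without gzip support
        echo "make MC_ZLIB=NO"
    fi
}

metacache_zlib_make_cmd
```

Passing a different path as the first argument lets you point the check at a non-standard zlib location.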
83
+
68
84
69
-
Using the following compilation options you can compile MetaCache with support for more reference sequences and greater k-mer lengths.
85
+
### Custom Configurations
70
86
71
-
##### number of referece sequences (targets)
87
+
If you run 'make' without additional parameters MetaCache will be compiled with the default data type settings which support databases with up to 65,535 reference sequences (targets) and k-mer sizes up to 16. This offers a good database space efficiency and is currently sufficient for the complete bacterial, viral and archaea genomes from the NCBI RefSeq.
72
88
73
-
* support for up to 65,535 reference sequences (default):
89
+
Using the following compilation options you can compile MetaCache with support for more targets and greater k-mer lengths.
90
+
91
+
#### number of reference sequences (targets)
92
+
93
+
* support for up to 65,535 targets (default):
74
94
```
75
95
make MACROS="-DMC_TARGET_ID_TYPE=uint16_t"
76
96
```
77
97
78
-
* support for up to 4,294,967,295 reference sequences (needs more memory):
98
+
* support for up to 4,294,967,295 targets (needs more memory):
79
99
```
80
100
make MACROS="-DMC_TARGET_ID_TYPE=uint32_t"
81
101
```
82
102
83
-
* support for more than 4,294,967,295 reference sequences (needs even more memory)
103
+
* support for more than 4,294,967,295 targets (needs even more memory)
84
104
```
85
105
make MACROS="-DMC_TARGET_ID_TYPE=uint64_t"
86
106
```
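The target limits quoted above are the largest values an unsigned integer of the chosen width can hold. The following sketch recomputes them with shell arithmetic; the function name `max_unsigned` is hypothetical and the code assumes a shell with 64-bit arithmetic, so it is only meaningful for widths below 63 bits.

```shell
# Largest value representable by an unsigned integer with the given bit width.
# Assumption: the shell evaluates arithmetic with 64-bit integers,
# so this is only valid for widths below 63.
max_unsigned() {
    bits=$1
    echo $(( (1 << bits) - 1 ))
}

max_unsigned 16   # limit for MC_TARGET_ID_TYPE=uint16_t
max_unsigned 32   # limit for MC_TARGET_ID_TYPE=uint32_t
```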
87
107
88
-
#####reference sequence lenghts
108
+
#### reference sequence lengths
89
109
* support for targets up to a length of 4,294,967,295 windows (default)
90
110
with default settings (window length, k-mer size) no sequence may be longer than 485.3 billion nucleotides
91
111
```
@@ -98,8 +118,7 @@ Using the following compilation options you can compile MetaCache with support f
98
118
make MACROS="-DMC_WINDOW_ID_TYPE=uint16_t"
99
119
```
100
120
101
-
102
-
##### kmer lengths
121
+
#### kmer lengths
103
122
* support for kmer lengths up to 16 (default):
104
123
```
105
124
make MACROS="-DMC_KMER_TYPE=uint32_t"
@@ -112,14 +131,21 @@ Using the following compilation options you can compile MetaCache with support f
112
131
113
132
You can of course combine these options (don't forget the surrounding quotes):
114
133
```
115
-
make MACROS="-DMC_ZLIB -DMC_TARGET_ID_TYPE=uint32_t -DMC_WINDOW_ID_TYPE=uint32_t"
134
+
make MACROS="-DMC_TARGET_ID_TYPE=uint32_t -DMC_WINDOW_ID_TYPE=uint32_t"
116
135
```
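Because the quoting is easy to get wrong when combining options, a small helper can assemble the command for you. This is a sketch only: the function name `metacache_make_cmd` is hypothetical, it covers just the two macros used in the example above, and it prints the command rather than running it.

```shell
# Print a make command that sets both the target ID type and the window ID type.
# Arguments are C++ integer type names such as uint16_t or uint32_t.
metacache_make_cmd() {
    target_type=$1
    window_type=$2
    printf 'make MACROS="-DMC_TARGET_ID_TYPE=%s -DMC_WINDOW_ID_TYPE=%s"\n' \
        "$target_type" "$window_type"
}

metacache_make_cmd uint32_t uint32_t
```

Printing the command first makes it easy to check the surrounding quotes before running it.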
117
136
118
137
**Note that a database can only be queried with the same variant of MetaCache (regarding data type sizes) that it was built with.**
119
138
120
139
In rare cases databases built on one platform might not work with MetaCache on other platforms due to bit-endianness and data type width differences. Especially mixing MetaCache executables compiled with 32-bit and 64-bit compilers might be problematic.
121
140
122
141
142
+
#### disabling zlib support
143
+
144
+
If you *don't* have the zlib compression library installed and/or *don't* want gzipped input file support you can compile with:
145
+
```
146
+
make MC_ZLIB=NO
147
+
```
148
+
123
149
124
150
125
151
## Building Databases
@@ -160,8 +186,9 @@ Once a database (e.g. the standard 'refseq'), is built you can classify reads.
160
186
161
187
## Documentation of Command Line Parameters
162
188
163
-
*[for mode `build`](docs/mode_build.txt): build database from reference genomes
189
+
* [for mode `build`](docs/mode_build.txt): build database from reference genomes (and write it to disk)
164
190
* [for mode `query`](docs/mode_query.txt): query reads against database
191
+
* [for mode `build+query`](docs/mode_build_query.txt): build reference database and immediately query reads (mainly recommended for GPU version)
165
192
* [for mode `merge`](docs/mode_merge.txt): merge results of independent queries
166
193
* [for mode `modify`](docs/mode_modify.txt): add reference genomes to database or update taxonomy
167
194
* [for mode `info`](docs/mode_info.txt): obtain information about a database
The GPU version of MetaCache requires a CUDA-capable device of the Pascal generation or newer and either:
8
16
9
-
* CUDA >= 11
10
-
* CUDA 10.2 and a self-provided version of [CUB](https://github.com/NVlabs/cub)
11
17
12
-
Make sure to adjust the Makefile to the GPU generation you want to use by setting the `-arch` flag (e.g. `-arch=sm_70` for Quadro GV100). You also have to set the include path for CUB if your CUDA version is below CUDA 11.
18
+
## Requirements
13
19
14
-
MetaCache-GPU depends on the hashtable implementation of [warpcore](https://github.com/sleeepyjack/warpcore) and the sorting algorithm [bb_segsort](https://github.com/Funatiq/bb_segsort). Both repositories are included as submodules and need to be checked out in addition to MetaCache itself. You can do so be calling
20
+
### Hardware Requirements
15
21
16
-
```git submodule update --init --recursive```
22
+
The GPU version of MetaCache requires a CUDA-capable device of the Pascal generation or newer.
17
23
18
-
In order to be able to build the default database (based on NCBI RefSeq Release 97) with default settings your system will need a total of 120 GB of GPU memory (e.g. 4x GPUs with 32 GB each).
24
+
In order to be able to build the default database (based on NCBI RefSeq Release 97) with default settings your system will need a total of 120 GB of GPU memory (e.g. 4x GPUs with 32 GB each).
19
25
If you don't have enough GPU memory, you can use [database partitioning](docs/partitioning.md).
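To relate the 120 GB figure to a number of devices, the sketch below divides the total by the memory per GPU and rounds up. It is a rough estimate only: the function name `gpus_needed` is hypothetical, and the calculation assumes that all of each GPU's memory is available to MetaCache, which may not hold in practice.

```shell
# Estimate how many GPUs are needed to provide a given total amount of memory.
# Arguments: total memory required (GB), memory per GPU (GB).
# Uses integer ceiling division; ignores any per-GPU overhead.
gpus_needed() {
    total_gb=$1
    per_gpu_gb=$2
    echo $(( (total_gb + per_gpu_gb - 1) / per_gpu_gb ))
}

gpus_needed 120 32   # the 4x 32 GB example from the text
gpus_needed 120 16
```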
20
26
21
-
#### Compile
22
-
Run '`make gpu_release`' in the directory containing the Makefile.
23
-
This will compile MetaCache-GPU with support for:
24
27
25
-
* up to 4,294,967,295 reference sequences
26
-
* targets up to a length of 4,294,967,295 windows
27
-
* kmer lengths up to 16
28
+
### Software Dependencies
29
+
30
+
* CUDA SDK
31
+
* CUDA >= 11
32
+
* CUDA 10.2 and a self-provided version of [CUB](https://github.com/NVlabs/cub) (you also need to set the include path for CUB by supplying `INCLUDE=*your_cub_path*` when calling make)
33
+
34
+
* Hashtable library [warpcore](https://github.com/sleeepyjack/warpcore) and sorting library [bb_segsort](https://github.com/Funatiq/bb_segsort). Both repositories are included as submodules and need to be checked out in addition to MetaCache itself. You can do so by calling
35
+
```git submodule update --init --recursive```
36
+
37
+
* Support for gzipped FASTA/FASTQ files requires the zlib compression library to be installed on your system.
38
+
On Debian/Ubuntu zlib can be installed with
39
+
`sudo apt install -y zlib1g zlib1g-dev`. If you *don't* have zlib installed or cannot do so you can compile with `make MC_ZLIB=NO`
40
+
which will remove the zlib dependency and disable support for gzipped input files.
41
+
42
+
43
+
## Installation / Compiling
44
+
45
+
Run `make` in the directory containing the Makefile and set the GPU generation with the `CUDA_ARCH` flag (e.g. `CUDA_ARCH=sm_70` for Quadro GV100):
46
+
```
47
+
make gpu CUDA_ARCH=sm_70
48
+
```
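The `CUDA_ARCH` value follows NVIDIA's `sm_XY` naming for a device with compute capability X.Y. If you know the compute capability of your GPU, a helper like the following can build the matching command. The function names are hypothetical, and how you look up the compute capability (vendor documentation or a device query tool) is left open here.

```shell
# Convert a compute capability such as "7.0" into the matching arch name "sm_70".
cuda_arch_from_capability() {
    cap=$1
    echo "sm_$(printf '%s' "$cap" | tr -d '.')"
}

# Print (not run) the GPU build command for a given compute capability.
metacache_gpu_make_cmd() {
    echo "make gpu CUDA_ARCH=$(cuda_arch_from_capability "$1")"
}

metacache_gpu_make_cmd 7.0
```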
49
+
50
+
If you don't supply additional parameters MetaCache will be compiled with the default data type settings which support databases with
51
+
52
+
* up to 4,294,967,295 targets (= reference sequences)
53
+
* targets with a length of up to 4,294,967,295 windows (which corresponds to approximately 485.3 billion nucleotides with the default window stride of 112)
54
+
* kmers with a length of up to 16
28
55
29
56
This corresponds to the CPU version compiled with `make MACROS="-DMC_TARGET_ID_TYPE=uint32_t"`
30
57
31
-
**Note that a database build by the GPU version can be queried by the corresponding CPU version and vice versa. The only restriction is the available (GPU) memory.**
58
+
**A database built by the GPU version can be queried by the corresponding CPU version and vice versa. The only restriction is the available (GPU) memory.**
59
+
32
60
33
61
34
62
## Differences to CPU version
35
63
36
64
MetaCache-GPU can **build** databases that are distributed across multiple GPUs.
37
-
In difference to the [database partitioning](docs/partitioning.md) approach, the program distributes the reference genomes automatically across the GPUs in a single run. Due to the dynamic distribution scheme and the concurrent execution on the GPUs, two database builds for the same input files will most likely differ. However, this should have only a small impact on classification performance.
65
+
In contrast to the [database partitioning](docs/partitioning.md) approach, the reference genomes are automatically distributed across multiple GPUs in a single run. Due to the dynamic distribution scheme and the concurrent execution on the GPUs, two database builds for the same input files will most likely differ. However, this should only have a negligible impact on classification performance.
66
+
67
+
In order to **query** a multi-GPU database make sure to set the same number of GPUs when using the query mode.
68
+
69
+
### Build+Query Immediate Mode
70
+
Since building databases is significantly faster on the GPU than on the CPU and will often take less than a minute, the [build+query mode](docs/mode_build_query.txt) can be used to build and directly query a database without writing the database to disk.
38
71
39
-
In order to **query** a multi-GPU database make sure to set the same number of GPUs when using the query mode. Note, that only a small number of threads is needed to saturate the GPU query pipeline.
40
72
41
-
####Command Line Options
73
+
### Command Line Options
42
74
43
75
The command line options of the GPU version are similar to the CPU version with a few notable exceptions:
44
76
45
-
#####mode build
77
+
#### mode build
46
78
47
79
* `-parts <#>` sets the number of GPUs to use (default: all available GPUs).
48
80
49
-
#####mode query
81
+
#### mode query
50
82
51
83
* `-replicate <#>` enables multiple GPU pipelines (default: 1). Each pipeline occupies one GPU per database part.
52
84
53
-
#####mode build & mode query
85
+
#### mode build & mode query
54
86
55
87
* `-kmerlen` kmer length is limited to 16 (default: 16).
56
88
* `-sketchlen` sketch length is limited to 16 (default: 16).
57
89
* `-winlen` window length is limited to 127 (default: 127).
58
-
*`-winstride` window stride has to be multiple of 4 (default: 112).
59
-
*`-remove-overpopulated-features` is not supported.
60
-
*`-remove-ambig-features` is not supported.
90
+
* `-winstride` window stride has to be a multiple of 4 (default: 112).
91
+
* `-remove-overpopulated-features` is *not* supported.
92
+
* `-remove-ambig-features` is *not* supported.
61
93
62
-
#####mode info
94
+
#### mode info
63
95
64
-
*feature map is not available.
65
-
*feature counts are not available.
96
+
* submode `locations` is *not* available.
97
+
* submode `featurecounts` is *not* available.
66
98
67
-
#####mode merge
99
+
#### mode merge
68
100
69
-
* merging on GPU is not available and will fall back to CPU version.
101
+
Merging multiple result files will *not* be performed on the GPU and will fall back to the CPU.