thesis/00-intro.Rmd
13 additions & 11 deletions
@@ -12,7 +12,7 @@ the requirements for having indexes with sizes of the same magnitude of the orig
For example,
NCBI provides BLAST search as a service on their website,
but it uses specially prepared databases with a subset of the data stored in GenBank or similar databases.
-While NCBI does offer a similar service for each dataset in the SRA (sequence read archive),
+While NCBI does offer a similar service for each dataset in the SRA (Sequence Read Archive),
there is no service to search across every dataset at once because of its size,
which is on the order of petabytes of data and growing exponentially.
@@ -28,7 +28,7 @@ k-mers can be hashed and stored in integer datatypes,
allowing for fast comparison and many opportunities for compression.
Solomon and Kingsford's solution for the problem,
the Sequence Bloom Tree,
-use these properties to define and store the k-mer composition of a dataset in a Bloom Filter [@bloom_spacetime_1970],
+uses these properties to define and store the k-mer composition of a dataset in a Bloom Filter [@bloom_spacetime_1970],
a probabilistic data structure that allows insertion and checking if a value might be present.
Bloom Filters can be tuned to reach a predefined false positive bound,
trading off memory for accuracy.
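To make those properties concrete, here is a minimal Bloom Filter sketch in Python; the bit-array size, the number of hash functions, and the way positions are derived from a single SHA-256 digest are arbitrary choices made for this illustration, not the SBT implementation.

```python
# Minimal Bloom Filter sketch (illustrative only; not the SBT implementation).
# m bits and k hash functions jointly control the false positive rate
# for a given number of inserted elements.
import hashlib


class BloomFilter:
    def __init__(self, m=1000, k=3):
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest; real implementations
        # use independent hash functions, this is just a compact stand-in.
        digest = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * (i + 1)], "little") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False positives are possible, false negatives are not.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("GATTACA")
print(bf.might_contain("GATTACA"))   # True
print(bf.might_contain("ACGTACGT"))  # almost certainly False
```

Increasing `m` spends more memory to lower the false positive rate, which is the trade-off mentioned above.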
@@ -50,17 +50,18 @@ The downside is the false positive increase,
especially if both original filters are already reaching saturation.
To account for that,
Bloom Filters in a SBT need to be initialized with a size proportional to the cardinality of the combined datasets,
-which can be quite large for big collections.
+which can be quite large for large collections.
Since Bloom Filters only generate false positives,
and not false negatives,
-in the worst case there is degradation of the computational performance,
-(because more internal nodes need to be checked),
+in the worst case there is degradation of the computational performance because more internal nodes need to be checked,
but the final results are unchanged.

While Bloom Filters can be used to calculate similarity of dataset,
-there are data structures more efficient probabilistic data structures for this use case.
-A MinHash sketch [@broder_resemblance_1997] is a representation of a dataset that allows estimating the Jaccard similarity of the original dataset without requiring the original data to be available.
-The Jaccard similarity of two datasets is the size of the intersection of elements in both datasets divided by the size of the union of elements in both datasets.
+there are more efficient probabilistic data structures for this use case.
+A MinHash sketch [@broder_resemblance_1997] is a representation of a dataset allowing
+estimation of the Jaccard similarity between datasets without requiring the original data to be available.
+The Jaccard similarity of two datasets is the size of the intersection of elements in both datasets divided by the size of the union of elements in both datasets:
+$J(A, B)=\frac{\vert A \cap B \vert}{\vert A \cup B \vert}$.
The MinHash sketch uses a subset of the original data as a proxy for the data -- in this case,
hashing each element and taking the smallest values for each dataset.
Broder defines two approaches for taking the smallest values:
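As a hedged illustration of one of those approaches (keeping the k smallest hash values, as opposed to the ModHash variant discussed below), the toy Python sketch here builds bottom-k MinHash sketches and estimates Jaccard similarity from them; the hash function and the choice of k are assumptions made only for this example, not sourmash's implementation.

```python
# Toy "bottom-k" MinHash (illustrative only): hash every element and keep
# the k smallest hash values per dataset as its sketch.
import hashlib


def h(item):
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "little")


def minhash_sketch(items, k=100):
    return set(sorted(h(x) for x in set(items))[:k])


def estimate_jaccard(sk_a, sk_b, k=100):
    # Take the k smallest values of the merged sketches and count what
    # fraction of them appears in both individual sketches.
    union_k = set(sorted(sk_a | sk_b)[:k])
    return len(union_k & sk_a & sk_b) / len(union_k)


A = {f"kmer{i}" for i in range(1000)}
B = {f"kmer{i}" for i in range(500, 1500)}   # true Jaccard = 500 / 1500

sa, sb = minhash_sketch(A), minhash_sketch(B)
print(estimate_jaccard(sa, sb))              # approximately 0.33
```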
@@ -74,7 +75,8 @@ The ModHash approach also allows calculating the containment of two datasets,
how much of a dataset is present in another.
It is defined as the size of the intersection divided by the size of the dataset,
and so is asymmetrical
-(unlike the Jaccard similarity).
+(unlike the Jaccard similarity):
+$C(A, B)=\frac{\vert A \cap B \vert}{\vert A \vert}$.
While the MinHash can also calculate containment,
if the datasets are of distinct cardinalities the errors accumulate quickly.
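To make the asymmetry of containment concrete, the toy Python example below builds ModHash-style sketches (keeping only hash values divisible by a fixed modulus) and estimates containment in both directions; the hash function and the modulus are assumptions for illustration only, not any published implementation.

```python
# Toy ModHash-style sketches (illustrative only): keep every hash value
# h(x) with h(x) % MOD == 0, then estimate containment from the sketches.
import hashlib

MOD = 4  # assumption: small modulus so the toy sets below keep enough hashes


def h(item):
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "little")


def modhash_sketch(items):
    return {h(x) for x in items if h(x) % MOD == 0}


A = {f"kmer{i}" for i in range(1000)}
B = {f"kmer{i}" for i in range(200)}      # B is a subset of A

sa, sb = modhash_sketch(A), modhash_sketch(B)

# Containment is asymmetric: C(B, A) asks how much of B is in A, and vice versa.
print(len(sb & sa) / len(sb))   # close to 1.0, since B is contained in A
print(len(sa & sb) / len(sa))   # close to 0.2, i.e. |B| / |A|
```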
This is relevant for biological use cases,
@@ -117,7 +119,7 @@ and a new approach for containment estimation using Scaled MinHash sketches.

**Chapter 2** describes indexing methods for sketches,
focusing on a hierarchical approach optimized for storage access and low memory consumption (`MHBT`)
-and a fast inverted index optimized for fast retrieval but with larger memory consumption (`Revindex`).
+and an inverted index optimized for fast retrieval but with larger memory consumption (`LCA index`).
It also introduces `sourmash`,
a software implementing these indices and optimized Scaled MinHash sketches,
as well as extended functionality for iterative and exploratory biological data analysis.
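For orientation, here is what comparing two sequences with `sourmash`'s Python API can look like; this sketch assumes a sourmash 4.x installation, and the ksize/scaled values and random test sequences are arbitrary choices for the example, not recommended settings.

```python
# Hedged example of sourmash's Python API (assumes sourmash >= 4 is installed;
# ksize and scaled below are arbitrary values chosen for illustration).
import random

import sourmash

random.seed(1)
bases = "ACGT"
seq_a = "".join(random.choice(bases) for _ in range(10_000))
seq_b = seq_a[:6_000] + "".join(random.choice(bases) for _ in range(4_000))

mh_a = sourmash.MinHash(n=0, ksize=21, scaled=10)
mh_b = sourmash.MinHash(n=0, ksize=21, scaled=10)
mh_a.add_sequence(seq_a)
mh_b.add_sequence(seq_b)

print(mh_a.jaccard(mh_b))        # estimated Jaccard similarity of the k-mer sets
print(mh_b.contained_by(mh_a))   # estimated containment of B's k-mers in A
```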
@@ -128,7 +130,7 @@ Comparisons with current taxonomic profiling methods using community-developed b
assessments show that `gather` paired with taxonomic information outperforms other approaches,
using a fraction of the computational resources and allowing analysis in platforms accessible to regular users (like laptops).

-**Chapter 4** describes wort,
+**Chapter 4** describes `wort`,
a framework for distributed signature calculation,
including discussions about performance and cost trade-offs for sketching public genomic databases,
as well as distributed systems architectures allowing large scale processing of petabytes of data.