Commit 47190f7 (1 parent: 094692b)
Commit message: upd

4 files changed: +100 −59 lines

thesis/00-intro.Rmd (13 additions, 11 deletions)
@@ -12,7 +12,7 @@ the requirements for having indexes with sizes of the same magnitude of the orig
 For example,
 NCBI provides BLAST search as a service on their website,
 but it uses specially prepared databases with a subset of the data stored in GenBank or similar databases.
-While NCBI does offer a similar service for each dataset in the SRA (sequence read archive),
+While NCBI does offer a similar service for each dataset in the SRA (Sequence Read Archive),
 there is no service to search across every dataset at once because of its size,
 which is on the order of petabytes of data and growing exponentially.

@@ -28,7 +28,7 @@ k-mers can be hashed and stored in integer datatypes,
 allowing for fast comparison and many opportunities for compression.
 Solomon and Kingsford's solution for the problem,
 the Sequence Bloom Tree,
-use these properties to define and store the k-mer composition of a dataset in a Bloom Filter [@bloom_spacetime_1970],
+uses these properties to define and store the k-mer composition of a dataset in a Bloom Filter [@bloom_spacetime_1970],
 a probabilistic data structure that allows insertion and checking if a value might be present.
 Bloom Filters can be tuned to reach a predefined false positive bound,
 trading off memory for accuracy.
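The Bloom Filter behavior described in this hunk (insertion, possible-presence queries, and the memory/accuracy trade-off via size and hash count) can be sketched in a few lines. This is an illustrative toy, not the SBT implementation: the size, number of hash functions, and salted-MD5 hash family are arbitrary choices for the example.

```python
import hashlib

class BloomFilter:
    """Toy Bloom Filter: a fixed-size bit array set by several salted hashes."""

    def __init__(self, size=1000, num_hashes=4):
        # Larger size / more hashes lower the false positive rate
        # at the cost of memory, as described in the text.
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes bit positions from salted digests of the item.
        for salt in range(self.num_hashes):
            digest = hashlib.md5(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means only *possibly* present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for kmer in ("ACGT", "CGTA", "GTAC"):
    bf.add(kmer)
```

Inserted k-mers always answer True; an absent k-mer answers False unless its positions happen to collide with set bits, which is the false positive case the text refers to.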
@@ -50,17 +50,18 @@ The downside is the false positive increase,
 especially if both original filters are already reaching saturation.
 To account for that,
 Bloom Filters in a SBT need to be initialized with a size proportional to the cardinality of the combined datasets,
-which can be quite large for big collections.
+which can be quite large for large collections.
 Since Bloom Filters only generate false positives,
 and not false negatives,
-in the worst case there is degradation of the computational performance,
-(because more internal nodes need to be checked),
+in the worst case there is degradation of the computational performance because more internal nodes need to be checked,
 but the final results are unchanged.
 
 While Bloom Filters can be used to calculate similarity of dataset,
-there are data structures more efficient probabilistic data structures for this use case.
-A MinHash sketch [@broder_resemblance_1997] is a representation of a dataset that allows estimating the Jaccard similarity of the original dataset without requiring the original data to be available.
-The Jaccard similarity of two datasets is the size of the intersection of elements in both datasets divided by the size of the union of elements in both datasets.
+there are more efficient probabilistic data structures for this use case.
+A MinHash sketch [@broder_resemblance_1997] is a representation of a dataset allowing
+estimation of the Jaccard similarity between datasets without requiring the original data to be available.
+The Jaccard similarity of two datasets is the size of the intersection of elements in both datasets divided by the size of the union of elements in both datasets:
+$J(A, B)=\frac{\vert A \cap B \vert}{\vert A \cup B \vert}$.
 The MinHash sketch uses a subset of the original data as a proxy for the data -- in this case,
 hashing each element and taking the smallest values for each dataset.
 Broder defines two approaches for taking the smallest values:
@@ -74,7 +75,8 @@ The ModHash approach also allows calculating the containment of two datasets,
 how much of a dataset is present in another.
 It is defined as the size of the intersection divided by the size of the dataset,
 and so is asymmetrical
-(unlike the Jaccard similarity).
+(unlike the Jaccard similarity):
+$C(A, B)=\frac{\vert A \cap B \vert}{\vert A \vert}$.
 While the MinHash can also calculate containment,
 if the datasets are of distinct cardinalities the errors accumulate quickly.
 This is relevant for biological use cases,
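The definitions in these hunks (Jaccard similarity as intersection over union, containment as intersection over one dataset's size, and Broder's bottom-k MinHash of hashed elements) can be illustrated with exact set operations on k-mers. The sequences, `k`, and `num_hashes` values below are invented for the example.

```python
import hashlib

def kmers(seq, k=4):
    # Decompose a sequence into its set of k-mers.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def bottom_k_minhash(items, num_hashes=8):
    # Broder's bottom-k MinHash: hash every element, keep the smallest hashes.
    hashes = sorted(int(hashlib.md5(x.encode()).hexdigest(), 16) for x in items)
    return set(hashes[:num_hashes])

def jaccard(a, b):
    # J(A, B) = |A ∩ B| / |A ∪ B|  (symmetric)
    return len(a & b) / len(a | b)

def containment(a, b):
    # C(A, B) = |A ∩ B| / |A|  (asymmetric: how much of A is in B)
    return len(a & b) / len(a)

a = kmers("ACGTACGTAC")  # {"ACGT", "CGTA", "GTAC", "TACG"}
b = kmers("ACGTTT")      # {"ACGT", "CGTT", "GTTT"}
```

Here `jaccard(a, b)` is 1/6 (one shared k-mer, six distinct overall), while `containment(a, b)` is 1/4 and `containment(b, a)` is 1/3, showing the asymmetry the text describes.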
@@ -117,7 +119,7 @@ and a new approach for containment estimation using Scaled MinHash sketches.
 
 **Chapter 2** describes indexing methods for sketches,
 focusing on a hierarchical approach optimized for storage access and low memory consumption (`MHBT`)
-and a fast inverted index optimized for fast retrieval but with larger memory consumption (`Revindex`).
+and a fast inverted index optimized for fast retrieval but with larger memory consumption (`LCA index`).
 It also introduces `sourmash`,
 a software implementing these indices and optimized Scaled MinHash sketches,
 as well as extended functionality for iterative and exploratory biological data analysis.
@@ -128,7 +130,7 @@ Comparisons with current taxonomic profiling methods using community-developed b
 assessments show that `gather` paired with taxonomic information outperforms other approaches,
 using a fraction of the computational resources and allowing analysis in platforms accessible to regular users (like laptops).
 
-**Chapter 4** describes wort,
+**Chapter 4** describes `wort`,
 a framework for distributed signature calculation,
 including discussions about performance and cost trade-offs for sketching public genomic databases,
 as well as distributed systems architectures allowing large scale processing of petabytes of data.

thesis/01-scaled.Rmd (0 additions, 2 deletions)
@@ -8,8 +8,6 @@ The {#rmd-basics} text after the chapter declaration will allow us to link throu
 
 ## Introduction
 
-...
-
 <!-- TODO
 - Note, can be narrow given the whole thesis introduction.
 - paragraph 1: what is the technical problem of interest? lightweight compositional queries? motivate briefly with some biology, maybe.

thesis/02-index.Rmd (72 additions, 46 deletions)
@@ -11,10 +11,13 @@
 - Methods for indexing genomic datasets
 -->
 
-Searching for matches in large collection of datasets is challenging when hundreds of thousands of
-them are available,
+Searching for matches in a large collection of datasets is challenging when hundreds of thousands of them are available,
 especially if they are partitioned and the data is not all present at the same place,
 or too large to even be stored in a single system.
+
+Efficient methods for sequencing datasets use exact $k$-mer matching instead of relying on sequence alignment,
+but sensitivity is reduced since they can't deal with sequencing errors and biological variation as alignment-based methods can.
+
 <!-- cite some methods, including SBT and Mantis -->
 
 <!--
@@ -42,7 +45,7 @@ but needs extra structures for storing what datasets each color represents.
 Both strategies allow the same class of queries,
 but with different trade-offs and optimizations:
 $k$-mer aggregative methods favor threshold queries
-("what datasets contain more 60% of the query $k$-mers?")
+("what datasets contain more than 60% of the query $k$-mers?")
 while color aggregative methods tend to be more efficient for specific $k$-mer
 queries ("what datasets contain this query $k$-mer?").

@@ -66,7 +69,7 @@ with SBT.
 
 <!-- 'k-mer aggregative methods in (marchet 2019)' -->
 
-Bloofi [@crainiceanu_bloofi:_2015] is a hierarchical index structure that
+Bloofi [@crainiceanu_bloofi:_2015] is an example of a hierarchical index structure that
 extends the Bloom Filter basic query to collections of Bloom Filters.
 Instead of calculating the union of all Bloom Filters in the collection
 (which would allow answering if an element is present in any of them)
@@ -78,17 +81,18 @@ Bloofi can also be partitioned in a network,
 with network nodes containing a subtree of the original tree and only being
 accessed if the search requires it.
 
-The Sequence Bloom Tree [@solomon_fast_2016] adapts Bloofi for genomic contexts,
-rephrasing the problem as experiment discovery:
+For genomic contexts,
+a hierarchical index is a $k$-mer aggregative method,
+with datasets represented by the $k$-mer composition of the dataset and stored in a data structure that allows querying for $k$-mer presence.
+The Sequence Bloom Tree [@solomon_fast_2016] adapts Bloofi for genomics, rephrasing the search problem as experiment discovery:
 given a query sequence $Q$ and a threshold $\theta$,
 which experiments contain at least $\theta$ of the original query $Q$?
 Experiments are encoded in Bloom Filters containing the $k$-mer composition of transcriptomes,
 and queries are transcripts.
 
-Further developments focused on clustering similar datasets to prune search
+Further developments of the SBT approach focused on clustering similar datasets to prune search
 early [@sun_allsome_2017] and developing more efficient representations for the
-internal nodes [@solomon_improved_2017] [@harris_improved_2018] to use less
-storage space and memory.
+internal nodes [@solomon_improved_2017] [@harris_improved_2018] to use less storage space and memory.
 
 <!--
 example figure for SBT:
@@ -106,35 +110,35 @@ Another example is the index in the back of a book,
 containing a list of topics and in which page they are present.
 
 When indexing the $k$-mer decomposition of genomic datasets,
-the inverted index is a map of all hashes in the collection back to
+the inverted index is a color aggregative method,
+representable with a map of all hashed $k$-mers in the $k$-mer composition of the datasets in the collection back to
 the dataset from where they originated.
 Just as words can appear more than once in a text,
-hashes show up in more than one signature,
-so the inverted index maps a hash to a list of datasets.
-
-kraken [@wood_kraken:_2014] has a similar index,
-but uses a taxonomic ID (taxon) for each dataset.
-Datasets can share the same ID,
-if they belong to the same taxon.
-Moreover,
-if a hash is present in more than one dataset
-kraken also reduces the list of taxons to the lowest common ancestor (LCA),
-which leads to reduced memory usage.
+hashes can show up in more than one dataset,
+and so the inverted index maps a hash to a list of datasets.
+
+kraken [@wood_kraken:_2014] has a special case of this structure,
+using a taxonomic ID (taxon) for representing dataset identity.
+Datasets share the same ID if they belong to the same taxon,
+and if a hash is present in more than one dataset
+kraken reduces the list of taxons to the lowest common ancestor (LCA),
+which lowers memory requirements for storing the index.
 [@nasko_refseq_2018] explores how this LCA approach leads to decreased precision and sensitivity over time,
-since more datasets are added to reference databases and the chance of a k-mer being present
-in multiple datasets increases.
+since more datasets are frequently added to reference databases and the chance of a k-mer being present in multiple datasets increases.
 
 Efficient storage of the list of signatures IDs can also be achieved via representation of the list as colors,
-where a color can represent one dataset or multiple datasets (if a hash is present in many of them).
-Mantis [@pandey_mantis:_2018] uses this hash to color mapping
-(and an auxiliary color table) to achieve reduced memory usage.
+where a color can represent one or more datasets (if a hash is present in many of them).
+Mantis [@pandey_mantis:_2018] uses this hash-to-color mapping
+(and an auxiliary color table) to achieve reduced memory usage,
+as well as Counting Quotient Filters [@pandey_general-purpose_2017] to store the data,
+an alternative to Bloom Filters that also supports counting and resizing.
 
 ### Specialized indices for Scaled MinHash sketches
 
 sourmash [@titus_brown_sourmash:_2016] is a software for large-scale sequence data comparisons based on MinHash sketches.
 Initially implementing operations for computing,
 comparing and plotting distance matrices for MinHash sketches,
-in version 2 [@pierce_large-scale_2019] it introduced Scaled MinHash sketches
+version 2 [@pierce_large-scale_2019] introduced Scaled MinHash sketches
 and indices for this new sketch format.
 Indices support a common set of operations
 (insertion, search and returning all signatures are the main ones),
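The inverted index and the kraken-style LCA collapse described in this hunk can be sketched with plain dictionaries. The dataset names, hash values, and taxonomy lineages below are invented for illustration; kraken's real index maps k-mers to taxa over the full NCBI taxonomy.

```python
from collections import defaultdict

# Invented lineages: each dataset maps to its root-to-leaf taxonomy path.
LINEAGE = {
    "ecoli_k12": ("Bacteria", "Proteobacteria", "Escherichia", "E. coli"),
    "ecoli_o157": ("Bacteria", "Proteobacteria", "Escherichia", "E. coli"),
    "salmonella": ("Bacteria", "Proteobacteria", "Salmonella"),
}

def lca(datasets):
    # Lowest common ancestor: the deepest rank shared by all lineages,
    # i.e. the end of the longest common prefix of the paths.
    paths = [LINEAGE[d] for d in datasets]
    common = []
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        common.append(level[0])
    return common[-1]

# The inverted index: hash -> set of datasets containing that hash.
index = defaultdict(set)

def add(dataset, hashes):
    for h in hashes:
        index[h].add(dataset)

add("ecoli_k12", {1, 2, 3})
add("ecoli_o157", {2, 3, 4})
add("salmonella", {3, 5})
```

Collapsing `index[h]` with `lca` is the memory-saving step the text attributes to kraken: hash 3 occurs in all three datasets, so its dataset list reduces to the single taxon "Proteobacteria".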
@@ -183,13 +187,36 @@ but it simplifies implementation and provides better correctness guarantees.
 
 #### LCA index
 
-<!-- TODO
-- mash screen has a similar index, but it is constructed on-the-fly using the
-distinct hashes in a sketch collection as keys,
-and the values are mapped to a hash occurrence counter in a query metagenome.
+The LCA index in sourmash is an inverted index that stores a mapping from hashes
+in a collection of signatures to a list of IDs for signatures containing the hash.
+Despite the name,
+the list of signature IDs is not collapsed to the lowest common ancestor (as in kraken),
+with the LCA calculated as needed by downstream methods using the taxonomy information
+that is also stored separately in the LCA index.
+
+The mapping from hashes to signature IDs in the LCA index is an implicit representation of the original signatures used to build the index,
+and so returning the signatures is implemented by rebuilding the original signatures on-the-fly.
+Search in an LCA index matches the $k$-mers in the query to the list of signature IDs containing them,
+using a counter data structure to sort results by number of hashes per signature ID.
+The rebuilt signatures are then returned as matches based on the signature ID,
+with containment or similarity to the query calculated against the rebuilt signatures.
+
+mash screen [@ondov_mash_2019] has a similar index,
+but it is constructed on-the-fly using the distinct hashes in a sketch collection as keys,
+and values are counters initialized to zero.
+As the query is processed,
+matching hashes have their counts incremented,
+and after all hashes in the query are processed all the sketches in the collection are
+checked again to quantify the containment/similarity of each sketch in the query.
+The LCA index uses the opposite approach,
+opting to reconstruct the sketches on-the-fly.
+
+## Results
 
-- sourmash LCA index is the opposite: it stores the hashes, but allow
-reconstructing the original sketch collection.
+### Index construction and updating
+
+<!-- TODO
+- resource usage (time, cpu, mem)
 -->
 
 <!--
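The counter-based search added in this hunk can be sketched as follows. The index contents and signature IDs are invented, and this toy only ranks signatures by the number of query hashes they share; it does not rebuild sketches or compute containment/similarity as sourmash's LCA index does.

```python
from collections import Counter

# Invented inverted index: hash -> list of signature IDs containing it.
index = {
    10: ["sigA"],
    11: ["sigA", "sigB"],
    12: ["sigB", "sigC"],
    13: ["sigC"],
}

def search(query_hashes):
    # Count how many query hashes each signature contains.
    counts = Counter()
    for h in query_hashes:
        for sig_id in index.get(h, []):
            counts[sig_id] += 1
    # Best matches first: signatures sharing the most hashes with the query.
    return counts.most_common()

results = search({10, 11, 12})
```

For this query, sigA and sigB each share two hashes and sigC shares one, so they rank ahead of sigC; mash screen's on-the-fly counters follow the same counting idea in the opposite direction (index built from the collection, counts driven by the query stream).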
@@ -199,13 +226,6 @@ but it simplifies implementation and provides better correctness guarantees.
 sig.name(): 5078 MB
 -->
 
-## Results
-
-### Index construction and updating
-
-<!-- TODO
-- resource usage (time, cpu, mem)
--->
 
 ### Efficient similarity and containment queries

@@ -219,18 +239,18 @@ but it simplifies implementation and provides better correctness guarantees.
 ### Choosing an index
 
 The Linear index is appropriate for operations that must check every signature,
-since they don't have any indexing overhead.
-An example is building a distance matrix for comparing signatures all-against-all,
-but search operations greatly benefit from extra indexing structure.
-The MHBT index and $k$-mer aggregative methods in general are appropriate for threshold queries,
+since it doesn't have any indexing overhead.
+An example is building a distance matrix for comparing signatures all-against-all.
+Search operations greatly benefit from extra indexing structure.
+The MHBT index and $k$-mer aggregative methods in general are appropriate for searches with query thresholds,
 like searching for similarity or containment of a query in a collection of datasets.
 The LCA index and color aggregative methods are appropriate for querying which datasets contain a specific query $k$-mer.
 
 As implemented in sourmash,
 the MHBT index is more memory efficient because the data can stay in external memory and only the tree structure for the index
-need to be loaded in memory,
+needs to be loaded in main memory,
 and data for the datasets and internal nodes can be loaded and unloaded on demand.
-The LCA index must be loaded in memory before it can be used,
+The LCA index must be loaded in main memory before it can be used,
 but once it is loaded it is faster,
 especially for operations that need to summarize $k$-mer assignments or required repeated searches.

@@ -252,6 +272,9 @@ This allows trade-offs between storage efficiency,
 distribution,
 updating and query performance.
 
+Because both are able to return the original sketch collection,
+it is also possible to convert one index into the other.
+
 ### Limitations and future directions
 
 <!--
@@ -265,6 +288,9 @@ updating and query performance.
 
 - sourmash is currently single threaded, but that's an implementation detail.
 Parallel queries are possible (in a shared read-only index)
+
+- The LCA index can be implemented in external memory by using memory-mapped files,
+avoiding the need to load it all in memory.
 -->
 
 ## Conclusion

thesis/bib/thesis.bib (15 additions, 0 deletions)
@@ -3131,3 +3131,18 @@ @article{li_minimap2_2018
 date = {2018},
 note = {Publisher: Oxford University Press},
 }
+
+@online{noauthor_p1185-zhupdf_nodate,
+title = {p1185-zhu.pdf},
+url = {http://www.vldb.org/pvldb/vol9/p1185-zhu.pdf},
+urldate = {2020-07-20},
+}
+
+@inproceedings{pandey_general-purpose_2017,
+title = {A general-purpose counting filter: Making every bit count},
+shorttitle = {A general-purpose counting filter},
+pages = {775--787},
+booktitle = {Proceedings of the 2017 {ACM} international conference on Management of Data},
+author = {Pandey, Prashant and Bender, Michael A. and Johnson, Rob and Patro, Rob},
+date = {2017},
+}
