Implement functions for searching bytesets #15

Merged
merged 3 commits into from
Jul 21, 2019
4 changes: 2 additions & 2 deletions bench/Cargo.lock


4 changes: 3 additions & 1 deletion bench/src/bench.rs
@@ -269,4 +269,6 @@ criterion_group!(g8, trim);
 criterion_group!(g9, search::find_iter);
 criterion_group!(g10, search::rfind_iter);
 criterion_group!(g11, search::find_char);
-criterion_main!(g1, g2, g3, g4, g5, g6, g7, g8, g9, g10, g11);
+criterion_group!(g12, search::find_byteset);
+criterion_group!(g13, search::find_not_byteset);
+criterion_main!(g1, g2, g3, g4, g5, g6, g7, g8, g9, g10, g11, g12, g13);
131 changes: 131 additions & 0 deletions bench/src/search.rs
@@ -121,6 +121,137 @@ pub fn find_char(c: &mut Criterion) {
    });
}

pub fn find_byteset(c: &mut Criterion) {
    let corpus = SUBTITLE_EN_SMALL;
    define(c, "bstr/find_byteset/1", "en-small-ascii", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(None, corpus.find_byteset(b"\0"));
        });
    });
    define(c, "bstr/find_byteset/2", "en-small-ascii", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(None, corpus.find_byteset(b"\0\xff"));
        });
    });
    define(c, "bstr/find_byteset/3", "en-small-ascii", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(None, corpus.find_byteset(b"\0\xff\xee"));
        });
    });
    define(c, "bstr/find_byteset/4", "en-small-ascii", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(None, corpus.find_byteset(b"\0\xff\xee\xdd"));
        });
    });
    define(c, "bstr/find_byteset/10", "en-small-ascii", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(None, corpus.find_byteset(b"0123456789"));
        });
    });

    define(c, "bstr/rfind_byteset/1", "en-small-ascii", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(None, corpus.rfind_byteset(b"\0"));
        });
    });
    define(c, "bstr/rfind_byteset/2", "en-small-ascii", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(None, corpus.rfind_byteset(b"\0\xff"));
        });
    });
    define(c, "bstr/rfind_byteset/3", "en-small-ascii", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(None, corpus.rfind_byteset(b"\0\xff\xee"));
        });
    });
    define(c, "bstr/rfind_byteset/4", "en-small-ascii", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(None, corpus.rfind_byteset(b"\0\xff\xee\xdd"));
        });
    });
    define(c, "bstr/rfind_byteset/10", "en-small-ascii", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(None, corpus.rfind_byteset(b"0123456789"));
        });
    });
}

pub fn find_not_byteset(c: &mut Criterion) {
    let corpus = REPEATED_RARE_SMALL;
    define(c, "bstr/find_not_byteset/1", "repeated-rare-small", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(Some(1000), corpus.find_not_byteset(b"z"));
        });
    });
    define(c, "bstr/find_not_byteset/2", "repeated-rare-small", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(Some(1000), corpus.find_not_byteset(b"zy"));
        });
    });
    define(c, "bstr/find_not_byteset/3", "repeated-rare-small", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(Some(1000), corpus.find_not_byteset(b"zyx"));
        });
    });
    define(c, "bstr/find_not_byteset/4", "repeated-rare-small", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(Some(1000), corpus.find_not_byteset(b"zyxw"));
        });
    });
    define(c, "bstr/find_not_byteset/10", "repeated-rare-small", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(Some(1000), corpus.find_not_byteset(b"zyxwv12345"));
        });
    });

    define(c, "bstr/rfind_not_byteset/1", "repeated-rare-small", corpus, move |b| {
        // This file ends in \n, which breaks this benchmark, so drop the last byte. TODO: find a better dataset.
        let corpus = &corpus.as_bytes()[..(corpus.len()-1)];
        b.iter(|| {
            assert_eq!(None, corpus.rfind_not_byteset(b"z"));
        });
    });
    define(c, "bstr/rfind_not_byteset/2", "repeated-rare-small", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(None, corpus.rfind_not_byteset(b"z\n"));
        });
    });
    define(c, "bstr/rfind_not_byteset/3", "repeated-rare-small", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(None, corpus.rfind_not_byteset(b"zy\n"));
        });
    });
    define(c, "bstr/rfind_not_byteset/4", "repeated-rare-small", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(None, corpus.rfind_not_byteset(b"zyx\n"));
        });
    });
    define(c, "bstr/rfind_not_byteset/10", "repeated-rare-small", corpus, move |b| {
        let corpus = corpus.as_bytes();
        b.iter(|| {
            assert_eq!(None, corpus.rfind_not_byteset(b"zyxwv1234\n"));
        });
    });
}

fn define_find_iter(
    c: &mut Criterion,
    group_name: &str,
115 changes: 115 additions & 0 deletions src/byteset/mod.rs
@@ -0,0 +1,115 @@
use memchr::{memchr, memchr2, memchr3, memrchr, memrchr2, memrchr3};
mod scalar;

#[inline]
fn build_table(byteset: &[u8]) -> [u8; 256] {
    let mut table = [0u8; 256];
    for &b in byteset {
        table[b as usize] = 1;
    }
    table
}

#[inline]
pub(crate) fn find(haystack: &[u8], byteset: &[u8]) -> Option<usize> {
    match byteset.len() {
        0 => return None,
        1 => memchr(byteset[0], haystack),
        2 => memchr2(byteset[0], byteset[1], haystack),
        3 => memchr3(byteset[0], byteset[1], byteset[2], haystack),
        _ => {
            let table = build_table(byteset);
            scalar::forward_search_bytes(haystack, |b| table[b as usize] != 0)
        }
    }
}

#[inline]
pub(crate) fn rfind(haystack: &[u8], byteset: &[u8]) -> Option<usize> {
    match byteset.len() {
        0 => return None,
        1 => memrchr(byteset[0], haystack),
        2 => memrchr2(byteset[0], byteset[1], haystack),
        3 => memrchr3(byteset[0], byteset[1], byteset[2], haystack),
        _ => {
            let table = build_table(byteset);
            scalar::reverse_search_bytes(haystack, |b| table[b as usize] != 0)
        }
    }
}

#[inline]
pub(crate) fn find_not(haystack: &[u8], byteset: &[u8]) -> Option<usize> {
    if haystack.is_empty() {
        return None;
    }
    match byteset.len() {
        0 => return Some(0),
        1 => scalar::inv_memchr(byteset[0], haystack),
        2 => scalar::forward_search_bytes(haystack, |b| {
            b != byteset[0] && b != byteset[1]
        }),
        3 => scalar::forward_search_bytes(haystack, |b| {
            b != byteset[0] && b != byteset[1] && b != byteset[2]
        }),
        _ => {
            let table = build_table(byteset);
            scalar::forward_search_bytes(haystack, |b| table[b as usize] == 0)
        }
    }
}
#[inline]
pub(crate) fn rfind_not(haystack: &[u8], byteset: &[u8]) -> Option<usize> {
    if haystack.is_empty() {
        return None;
    }
    match byteset.len() {
        0 => return Some(haystack.len() - 1),
        1 => scalar::inv_memrchr(byteset[0], haystack),
        2 => scalar::reverse_search_bytes(haystack, |b| {
            b != byteset[0] && b != byteset[1]
        }),
        3 => scalar::reverse_search_bytes(haystack, |b| {
            b != byteset[0] && b != byteset[1] && b != byteset[2]
        }),
        _ => {
            let table = build_table(byteset);
            scalar::reverse_search_bytes(haystack, |b| table[b as usize] == 0)
        }
    }
}

#[cfg(test)]
mod tests {

    quickcheck! {
        fn qc_byteset_forward_matches_naive(
            haystack: Vec<u8>,
            needles: Vec<u8>
        ) -> bool {
            super::find(&haystack, &needles)
                == haystack.iter().position(|b| needles.contains(b))
        }
        fn qc_byteset_backwards_matches_naive(
            haystack: Vec<u8>,
            needles: Vec<u8>
        ) -> bool {
            super::rfind(&haystack, &needles)
                == haystack.iter().rposition(|b| needles.contains(b))
        }
        fn qc_byteset_forward_not_matches_naive(
            haystack: Vec<u8>,
            needles: Vec<u8>
        ) -> bool {
            super::find_not(&haystack, &needles)
                == haystack.iter().position(|b| !needles.contains(b))
        }
        fn qc_byteset_backwards_not_matches_naive(
            haystack: Vec<u8>,
            needles: Vec<u8>
        ) -> bool {
            super::rfind_not(&haystack, &needles)
                == haystack.iter().rposition(|b| !needles.contains(b))
        }
    }
Owner

I kind of feel like this might not be enough test coverage. Most of the code here is simple enough that I'd probably be okay with it, but the inv_memchr1 function is not as simple and probably deserves more testing. A particularly important part of testing code like inv_memchr1 is to make sure all offsets are tested, which is what the memchr crate does: https://github.com/BurntSushi/rust-memchr/blob/665fffcc9506ac1cd93339470d2ca4b018e648c6/src/tests/mod.rs#L353-L374

I realize this is perhaps more work than you wanted to take on. I would totally be okay with leaving the extra tests out of this PR, but I'd probably insist on dropping inv_memchr1 in addition to that. (You could just throw it up into another PR without tests, and I'd be happy to add tests to it later when I get a chance. It's up to you and your time/desire budget. :-))
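
For illustration, here is a minimal sketch of what testing "all offsets" could look like. It is not the memchr crate's actual test code and not the tests that ended up in this PR. `naive_inv_memchr` and `check_all_offsets` are hypothetical names introduced here; the function under test (for example the PR's inverse memchr routine) is assumed to have the shape `Fn(u8, &[u8]) -> Option<usize>` and would be passed in as `candidate`.

```rust
/// Naive reference: index of the first byte in `haystack` that is NOT `needle`.
/// Hypothetical helper, used only as the oracle for the comparison below.
fn naive_inv_memchr(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b != needle)
}

/// Compares `candidate` against the naive oracle for every haystack length up to
/// `max_len`, every position of a single differing byte, and every start offset
/// into the buffer. Returns `false` on the first disagreement.
fn check_all_offsets<F>(candidate: F, max_len: usize) -> bool
where
    F: Fn(u8, &[u8]) -> Option<usize>,
{
    for len in 0..=max_len {
        // `odd_pos == len` means the buffer contains no differing byte at all.
        for odd_pos in 0..=len {
            let mut buf = vec![b'z'; len];
            if odd_pos < len {
                buf[odd_pos] = b'a';
            }
            // Slicing from each start offset shifts where the data begins, so a
            // word-at-a-time implementation sees different alignments and tails.
            for start in 0..=len {
                let hay = &buf[start..];
                if candidate(b'z', hay) != naive_inv_memchr(b'z', hay) {
                    return false;
                }
            }
        }
    }
    true
}
```

Calling `check_all_offsets(my_inv_memchr, 64)` inside a `#[test]` would exercise short haystacks, unaligned starts, and a mismatch at every position, which are the cases a chunked search is most likely to get wrong. The bound of 64 is an arbitrary choice for this sketch.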

Contributor Author

Adding more tests doesn't bother me; I was just being lazy.

Contributor Author

I've added what I think are tests equivalent to the ones you have for the 1-byte memchr. It's possible they're still not ideal, so let me know.

}