ByteSlice functions to skip past or until a set of bytes. #13

thomcc · 2019-07-15T05:54:33Z

It's common to want to skip past/until a specific set of bytes. C++'s std::string::find_first_of/find_first_not_of are an example.

The current API has trim_start_with and trim_end_with which can replaces some uses of this, but require unicode (#12) and often be much slower than an implementation that leverages memchr when the set of bytes is small.

I had some helpers for this, and ended up not being disappointed that I couldn't really replace them (I could replace some uses of them with trim_start_with, but possibly slower, due to both extra UTF-8 decoding, as well as not being able to use memchr in the until case (and both could be accelerated in the same manner, of course).

My implementations are here (skip_until/skip_while) https://gist.github.com/thomcc/a39c9bf5c7c50b0db1e5f1d4f92429a7 in case that's interesting or it's unclear what I mean.

Additionally, while I wouldn't have used them, presumably versions starting from the right, and versions along the lines of fields would be helpful. (Actually, had a fields version of this existed, it would have replaced some of my uses of these functions, probably)

That said, this is getting to be a lot of functions 🙁 -- obviously this could be done with some pattern-esque API, you mentioned not being interested in a design like that, though (which I completely respect, and think makes the docs much clearer). I don't think this case is that niche, but it might be too niche given how many functions it would have.

Anyway, I don't think this is that niche, but I'd understand a desire not to increase the number of functions too far.

The text was updated successfully, but these errors were encountered:

BurntSushi · 2019-07-15T11:07:53Z

I think adding these is a good idea. I did consider doing this initially, but decided to just punt on it because it was extra work. But I do think they belong. In particular, it's not only memchr that can be used to speed this up, but PCMPESTRI can be used as well when the set of bytes gets bigger (but still <=16). That's exactly the kind of stuff that should be bundled up in a library like this.

Looking at your functions, I think they mostly look good, although I'd probably modify their API to return indices instead of slices, which is more consistent with the APIs in bstr.

We probably also want to add a _byte suffix to the names of these functions, as it's plausible a char variant could be added later. (But making that fast is a bit more complex, and its utility is somewhat questionable, honestly.)

Again, a PR for this would be welcome, but it's likely something I'll do eventually otherwise. The implementation you have looks fine for a start. The next steps after that (if you're interested) would be to add benchmarks for them and optimize them with PCMPESTRI.

thomcc · 2019-07-15T15:13:41Z

Again, a PR for this would be welcome

Sure, I'll try to do this after work.

although I'd probably modify their API to return indices instead of slices, which is more consistent with the APIs in bstr

Hm, okay, that is more general (I imagine the output would frequently get sliced right away, though).

In that case, maybe the functions should be named similarly to the find_ family of functions (which seem like the only other ones that return usize (well, Option) in ByteString.

That is, maybe they'd be named find_in_bytes, find_not_in_bytes (and rfind_in_bytes, rfind_not_in_bytes)? Let me know if you'd prefer something different.

I'll start with these 4 as they're the closes to what I already wrote, and are the simplest as well. (It's also unclear to me how to continue that naming scheme to others where it might apply, which might mean it's not a good name...)

PCMPESTRI can be used as well when the set of bytes gets bigger (but still <=16).

Ah right, I forgot there were some pretty crazy instructions in the later SSEs...

BurntSushi · 2019-07-15T15:24:15Z

Hmmm. Those names don't seem too bad to me. I also don't mind the C++ flavored names, find_first_of_byte and find_first_not_of_byte, although they are a bit more of a mouthful. I think maybe the problem I have with find_in_bytes and the like is that if one saw that on its own, it could almost be interpreted as "find the string in the byte haystack," i.e., equivalent to find.

Maybe including set in the name would be better? Not sure. For example, find_set_bytes and find_not_set_bytes.

Ah right, I forgot there were some pretty crazy instructions in the later SSEs...

Yeah, PCMPESTRI can be used for substring search too. It's generally not used that much actually, since its latency is so horrendous. But it might actually work for the case of sets of bytes. In any case, benchmarks will clarify for us. :-)

thomcc · 2019-07-15T15:45:12Z

Yeah, I was going to suggest find_any_byte instead of find_in_bytes, but couldn't figure out a way to negate it.

find_first_of_byte is very clear to me, but I think you're right that maybe referring to the "set of bytes" as a whole somehow would be best -- finding way to talk about them makes using this convention on other functions easier, and makes the API easier to follow and search.

Sadly find_set_bytes reads very strangely to me. It feels like it's the opposite of some hypothetical find_unset_bytes function. find_byteset (one word to indicate it's not referring to a byte which is set)... Or even find_in_set?

BurntSushi · 2019-07-15T16:36:53Z

I still kind of like find_set_bytes, since it's just a matter of realizing that set refers to the mathematical structure and not the verb.

I do however like find_byteset. I guess the negated version would be find_not_byteset?

thomcc · 2019-07-15T16:59:23Z

I do however like find_byteset. I guess the negated version would be find_not_byteset?

Yeah, and then those could just be a suffix you'd use on any function that gets added that takes a set of bytes in that way, similar to how _char/_str/etc are used. Easy to search and once you learn it, easy to guess what a function does.

But I'm also willing to go with find_set_bytes (not sure how to negate it) if you'd prefer. It's not that weird, just not naturally how I'd think about it.

BurntSushi · 2019-07-15T17:03:07Z

SGTM. (For byteset.)

thomcc mentioned this issue Jul 15, 2019

ByteSlice should offer more functions taking u8 #12

Open

thomcc mentioned this issue Jul 18, 2019

Implement functions for searching bytesets #15

Merged

BurntSushi closed this as completed in #15 Jul 21, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ByteSlice functions to skip past or until a set of bytes. #13

ByteSlice functions to skip past or until a set of bytes. #13

thomcc commented Jul 15, 2019

BurntSushi commented Jul 15, 2019

thomcc commented Jul 15, 2019

BurntSushi commented Jul 15, 2019

thomcc commented Jul 15, 2019

BurntSushi commented Jul 15, 2019

thomcc commented Jul 15, 2019 •

edited

Loading

BurntSushi commented Jul 15, 2019 •

edited

Loading

ByteSlice functions to skip past or until a set of bytes. #13

ByteSlice functions to skip past or until a set of bytes. #13

Comments

thomcc commented Jul 15, 2019

BurntSushi commented Jul 15, 2019

thomcc commented Jul 15, 2019

BurntSushi commented Jul 15, 2019

thomcc commented Jul 15, 2019

BurntSushi commented Jul 15, 2019

thomcc commented Jul 15, 2019 • edited Loading

BurntSushi commented Jul 15, 2019 • edited Loading

thomcc commented Jul 15, 2019 •

edited

Loading

BurntSushi commented Jul 15, 2019 •

edited

Loading