Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ByteSlice functions to skip past or until a set of bytes. #13

Closed
thomcc opened this issue Jul 15, 2019 · 7 comments · Fixed by #15
Closed

ByteSlice functions to skip past or until a set of bytes. #13

thomcc opened this issue Jul 15, 2019 · 7 comments · Fixed by #15

Comments

@thomcc
Copy link
Contributor

thomcc commented Jul 15, 2019

It's common to want to skip past/until a specific set of bytes. C++'s std::string::find_first_of/find_first_not_of are an example.

The current API has trim_start_with and trim_end_with which can replaces some uses of this, but require unicode (#12) and often be much slower than an implementation that leverages memchr when the set of bytes is small.

I had some helpers for this, and ended up not being disappointed that I couldn't really replace them (I could replace some uses of them with trim_start_with, but possibly slower, due to both extra UTF-8 decoding, as well as not being able to use memchr in the until case (and both could be accelerated in the same manner, of course).

My implementations are here (skip_until/skip_while) https://gist.github.com/thomcc/a39c9bf5c7c50b0db1e5f1d4f92429a7 in case that's interesting or it's unclear what I mean.

Additionally, while I wouldn't have used them, presumably versions starting from the right, and versions along the lines of fields would be helpful. (Actually, had a fields version of this existed, it would have replaced some of my uses of these functions, probably)

That said, this is getting to be a lot of functions 🙁 -- obviously this could be done with some pattern-esque API, you mentioned not being interested in a design like that, though (which I completely respect, and think makes the docs much clearer). I don't think this case is that niche, but it might be too niche given how many functions it would have.

Anyway, I don't think this is that niche, but I'd understand a desire not to increase the number of functions too far.

@BurntSushi
Copy link
Owner

I think adding these is a good idea. I did consider doing this initially, but decided to just punt on it because it was extra work. But I do think they belong. In particular, it's not only memchr that can be used to speed this up, but PCMPESTRI can be used as well when the set of bytes gets bigger (but still <=16). That's exactly the kind of stuff that should be bundled up in a library like this.

Looking at your functions, I think they mostly look good, although I'd probably modify their API to return indices instead of slices, which is more consistent with the APIs in bstr.

We probably also want to add a _byte suffix to the names of these functions, as it's plausible a char variant could be added later. (But making that fast is a bit more complex, and its utility is somewhat questionable, honestly.)

Again, a PR for this would be welcome, but it's likely something I'll do eventually otherwise. The implementation you have looks fine for a start. The next steps after that (if you're interested) would be to add benchmarks for them and optimize them with PCMPESTRI.

@thomcc
Copy link
Contributor Author

thomcc commented Jul 15, 2019

Again, a PR for this would be welcome

Sure, I'll try to do this after work.

although I'd probably modify their API to return indices instead of slices, which is more consistent with the APIs in bstr

Hm, okay, that is more general (I imagine the output would frequently get sliced right away, though).

In that case, maybe the functions should be named similarly to the find_ family of functions (which seem like the only other ones that return usize (well, Option) in ByteString.

That is, maybe they'd be named find_in_bytes, find_not_in_bytes (and rfind_in_bytes, rfind_not_in_bytes)? Let me know if you'd prefer something different.

I'll start with these 4 as they're the closes to what I already wrote, and are the simplest as well. (It's also unclear to me how to continue that naming scheme to others where it might apply, which might mean it's not a good name...)

PCMPESTRI can be used as well when the set of bytes gets bigger (but still <=16).

Ah right, I forgot there were some pretty crazy instructions in the later SSEs...

@BurntSushi
Copy link
Owner

Hmmm. Those names don't seem too bad to me. I also don't mind the C++ flavored names, find_first_of_byte and find_first_not_of_byte, although they are a bit more of a mouthful. I think maybe the problem I have with find_in_bytes and the like is that if one saw that on its own, it could almost be interpreted as "find the string in the byte haystack," i.e., equivalent to find.

Maybe including set in the name would be better? Not sure. For example, find_set_bytes and find_not_set_bytes.

Ah right, I forgot there were some pretty crazy instructions in the later SSEs...

Yeah, PCMPESTRI can be used for substring search too. It's generally not used that much actually, since its latency is so horrendous. But it might actually work for the case of sets of bytes. In any case, benchmarks will clarify for us. :-)

@thomcc
Copy link
Contributor Author

thomcc commented Jul 15, 2019

Yeah, I was going to suggest find_any_byte instead of find_in_bytes, but couldn't figure out a way to negate it.

find_first_of_byte is very clear to me, but I think you're right that maybe referring to the "set of bytes" as a whole somehow would be best -- finding way to talk about them makes using this convention on other functions easier, and makes the API easier to follow and search.

Sadly find_set_bytes reads very strangely to me. It feels like it's the opposite of some hypothetical find_unset_bytes function. find_byteset (one word to indicate it's not referring to a byte which is set)... Or even find_in_set?

@BurntSushi
Copy link
Owner

I still kind of like find_set_bytes, since it's just a matter of realizing that set refers to the mathematical structure and not the verb.

I do however like find_byteset. I guess the negated version would be find_not_byteset?

@thomcc
Copy link
Contributor Author

thomcc commented Jul 15, 2019

I do however like find_byteset. I guess the negated version would be find_not_byteset?

Yeah, and then those could just be a suffix you'd use on any function that gets added that takes a set of bytes in that way, similar to how _char/_str/etc are used. Easy to search and once you learn it, easy to guess what a function does.

But I'm also willing to go with find_set_bytes (not sure how to negate it) if you'd prefer. It's not that weird, just not naturally how I'd think about it.

@BurntSushi
Copy link
Owner

BurntSushi commented Jul 15, 2019

SGTM. (For byteset.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants