Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

homogeneous arrays #4

Open
michaelficarra opened this issue Dec 17, 2016 · 1 comment
Open

homogeneous arrays #4

michaelficarra opened this issue Dec 17, 2016 · 1 comment

Comments

@michaelficarra
Copy link
Member

I propose a new type for homogenous arrays which allows omitting the type tags of each of its elements. Most of the arrays I encode are homogenous and have more than one element, so this will save one byte per additional element.

@LewisJEllis
Copy link
Contributor

LewisJEllis commented Dec 20, 2016

By homogenous do you mean "exact same superpack type tag" or "same type family but possibly varying size classes"? The latter feels like it would have a very complex specification/implementation and I don't think it would help much since we'd still have to spend at least a byte per element specifying size, so I'll comment on the former.

If we look only for exact same superpack type tag, I think the cases where this would help are so narrow as to almost be contrived. The three most reasonable cases where I can see this helping significantly are:

  • an array of all ascii-only strings with some of length > 32 (so we encode all as cstring, save no bytes on strings of length < 32 and one byte per string of length > 32)
  • an array of all float32s/double64s (save 1 byte per element, but note that if some elements fit in float32 and some need double64, we probably won't benefit)
  • an array of uints all less than 65536 but with fewer elements below 64 than exceeding 16383 (so we encode all as uint16 and save a byte per n > 16383 but lose a byte per n < 64)

How common are any of these? The latter two are so specific that I'm a bit dubious as to the frequency of circumstances in which this might be worth the additional spec and implementation complexity.

At first glance I might be convinced by the string case, but note that it likely won't play well with the repeated string optimization. We only store strings in the lookup table if doing so will actually save space, so if we have an array containing a mix of one-off and repeated strings, we're out of luck. I think the only way we could still realize any benefit in this case would be to change the string lookup table to be of unbounded size, put every string (even one-offs) in the table, and change the strref type tag to be strref uint instead of strref one-byte-index. Then we could do homogenous array optimization over every element being a strref, but this would also effectively add 1-2 bytes to the total payload for each one-off string (the original motivation for limiting the string lookup table to only the 256 most-often-repeated strings), to say nothing of the compatibility concerns of major changes to the repeated string optimization.

I'd be interested to hear about the specific cases you have in mind where this would help significantly, or if you mean something other than what I've interpreted and commented on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants