Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvements #415

Merged
merged 1 commit into from
Feb 11, 2025
Merged

Conversation

ryanemerson
Copy link
Contributor

Closes #413
Closes #414

@ryanemerson ryanemerson requested a review from a team as a code owner February 5, 2025 10:09
@tristantarrant
Copy link
Member

For reference it would be nice to have some benchmarks between old and new implementations

Copy link
Member

@pruivo pruivo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you are going to like my suggestion.

I suggest to create a new org.infinispan.protostream.impl.TagWriterImpl.Encoder with the "LazyByteArrayOutputStream". I think all the methods would benefit from it and avoid doing instanceOf in the writeUTF8Field method.

For example, the method is invoking ensureCapacity 8 times while we can do better

      void writeFixed64(long value) throws IOException {
         out.write((byte) (value & 0xFF));
         out.write((byte) ((value >> 8) & 0xFF));
         out.write((byte) ((value >> 16) & 0xFF));
         out.write((byte) ((value >> 24) & 0xFF));
         out.write((byte) ((int) (value >> 32) & 0xFF));
         out.write((byte) ((int) (value >> 40) & 0xFF));
         out.write((byte) ((int) (value >> 48) & 0xFF));
         out.write((byte) ((int) (value >> 56) & 0xFF));
      }

Even the default impl of the Encoder class can benefit

      void writeFixed64Field(int fieldNumber, long value) throws IOException {
         writeVarint32(WireType.makeTag(fieldNumber, WireType.WIRETYPE_FIXED64));
         writeFixed64(value);
      }

^ for example, ensure 13 bytes capacity, write directly to the buffer, and update the position at the end.

*/
@Deprecated(forRemoval = true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seeing this in the impl package, I'm tempted to say "remove it".
Are we planning to backport this PR/optimizations for ProtoStream 5.x? We can deprecate it there and remove it here.

Copy link
Contributor Author

@ryanemerson ryanemerson Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we planning to backport this PR/optimizations for ProtoStream 5.x? We can deprecate it there and remove it here.

I'm not sure. My gut feeling would be that it's not worth backporting, given that the main driver is the migration to protostream and the fact we only want new features on main.

+1 to deprecating in 5.x.

@ryanemerson
Copy link
Contributor Author

For reference it would be nice to have some benchmarks between old and new implementations

You're right! Below are the figures I posted on Zulip the other day which tests UTF8 marshalling in isolation:

Benchmark                         (initialArraySize)  (initialPosition)  (stringLength)      (type)  (useMultiByte)   Mode  Cnt         Score         Error  Units
UtfBenchmark.testUtfWrite                         64                  0               8    proto-ex           false  thrpt    5  17293297.695 ±  775157.948  ops/s
UtfBenchmark.testUtfWrite                         64                  0               8  proto-lazy           false  thrpt    5  28974585.735 ±  144449.312  ops/s

proto-ex is the old code, proto-lazy new.

I will re-run the numbers once I have looked into Pedro's suggestions.

int getPosition();
void setPosition(int position);
void ensureCapacity(int size);
byte[] getRawBuffer();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason we aren't just doing write(int pos, byte val) instead? This would allow for other implementations that aren't byte[] backed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original purpose of the interface was to allow different implementations of byte[] backed streams, so exposing byte[] directly removes the need for lots of additional #write methods calls.

I guess your suggestion would allow the interface to be used by other indexed OutputStream implementations though, e.g. one that is ByteBuf based. Would this be beneficial for the client?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is what I am thinking. I think this should benefit both the client and possibly even the server with some more changes I am thinking about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interface and implementation updated to no longer expose a byte[]. Let me know if we're missing anything you think we'll need for your future ideas.

@ryanemerson
Copy link
Contributor Author

Latest benchmark results just writing strings:

Benchmark                         (initialArraySize)  (initialPosition)  (stringLength)      (type)  (useMultiByte)   Mode  Cnt         Score         Error  Units
UtfBenchmark.testUtfWrite                         32                  0               8  proto-lazy            true  thrpt    5  19378946.697 ± 1008986.982  ops/s
UtfBenchmark.testUtfWrite                         32                  0               8  proto-lazy           false  thrpt    5  35079258.012 ± 1634211.248  ops/s
UtfBenchmark.testUtfWrite                         32                  0               8    proto-ex            true  thrpt    5  11728437.002 ±   42893.047  ops/s
UtfBenchmark.testUtfWrite                         32                  0               8    proto-ex           false  thrpt    5  14566156.958 ±   65063.897  ops/s

}

default byte[] toByteArray() {
return Arrays.copyOf(getRawBuffer(), getPosition());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This leads me to also wonder, if we were to support byte[] backed streams, this method would not work if there is an offset in that byte[]. Currently, we don't have that but it could be useful down the line if we did that. Another reason I don't think we should expose the byte[] if we can.

int pos = 0;

public RawByteArrayOutputStreamImpl() {
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't create a new a byte[] with default size (MIN_SIZE)? We could skip the null checks in ensureCapacityand getByteBuffer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During my benchmarks it seem liked initializing the array was leading to worse performance. I believe this is because we're creating a lot of stream instances and in many cases the initial capacity required exceeded MIN_SIZE.

@@ -198,7 +199,7 @@ public boolean hasTag(int tag) {

@Override
public void writeExternal(ObjectOutput out) throws IOException {
ByteArrayOutputStreamEx baos = new ByteArrayOutputStreamEx();
RawByteArrayOutputStream baos = new RawByteArrayOutputStreamImpl();
TagWriter output = TagWriterImpl.newInstance(null, baos);
writeTo(output);
output.flush();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 206-210 can be replaced with

      out.writeInt(baos.getPosition());
      out.write(baos.getRawBuffer(), 0, baos.getPosition());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we're not exposing getRawBuffer this is no longer possible. Does it make more sense to wrap the provided ObjectOutput in an Adapter and pass this to TagWriterImpl#newInstance?

Copy link
Member

@pruivo pruivo Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getRawBuffer is still a public method.
At the moment, I don't see how the adapter would work as you don't know the size beforehand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the confusion. I replied before pushing my changes. @wburns wanted to remove the getRawBuffer method in favour of utilising methods write(index, ...) methods so that we can provide a ByteBuf implementation on the client side.

At the moment, I don't see how the adapter would work as you don't know the size beforehand.

You're right. I hadn't really thought of the length field.

I have updated the code so that we just loop through the stream bytes and write them individually, so that we don't need to create a ByteBufffer implementation. I'm not sure if this is much better than what we already had though, so happy to revert.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let me take another look 👍

@ryanemerson
Copy link
Contributor Author

ryanemerson commented Feb 11, 2025

I have updated the interface to not expose the raw byte[] as suggested by @wburns so that we can utilise the stream for ByteBuf based solutions as well. Unfortunately it seems like there's a slight perf hit when compared with the values I posted yesterday:


Benchmark                         (initialArraySize)  (initialPosition)  (stringLength)      (type)  (useMultiByte)   Mode  Cnt         Score        Error  Units
UtfBenchmark.testUtfWrite                         32                  0               8  proto-lazy            true  thrpt    5  18189985.935 ± 708413.260  ops/s
UtfBenchmark.testUtfWrite                         32                  0               8  proto-lazy           false  thrpt    5  34802151.363 ± 543860.030  ops/s
UtfBenchmark.testUtfWrite                         32                  0               8    proto-ex            true  thrpt    5  11461152.933 ± 569242.019  ops/s
UtfBenchmark.testUtfWrite                         32                  0               8    proto-ex           false  thrpt    5  14165941.018 ±  54503.091  ops/s

Copy link
Member

@pruivo pruivo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested an "improvement" for UnknownFieldSetImpl. The remaining comments are optional.

Copy link
Member

@pruivo pruivo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice work!

Copy link
Member

@wburns wburns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!


private static int varIntBytes(long value) {
int i = 1;
while ((value & 0xFFFFFFFFFFFFFF80L) != 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't say we have to change now, but I wonder if the 4 ifs is faster than this where it checks less than 128, 16_384, etc. I can look at it later after it is integrated. It is probably will make no real measurable difference taking into account margin of error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8 ifs, this has to support long too.

Copy link
Member

@wburns wburns Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, it is only 4 ifs with an else. But I still think it would be better than a loop the JIT can't unroll.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how so? the max var int size is 10 bytes. How can you reduce that to 4 ifs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I forgot. I also meant we need another method. We have 18 references to this method and only 2 actually pass a long.

Copy link
Member

@wburns wburns Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrote a quick JMH benchmark and for a single byte they have the same perf. Only once I increased it to something larger was there much of a difference and it still only a ns at worst (.22ns for the ifs and 1.13ns for the loop for 5 bytes). Note the ifs were the same perf no matter the size, guessing due to branch prediction and multiple branches evaluated concurrently in the CPU. Either way as I mentioned the difference is so minor it probably isn't worth it. Especially as the code here works for int and long and is much more concise.

Benchmark                                        Mode  Cnt  Score   Error  Units
VarIntSizeBenchmark.test268435456VarIntSizeIf    avgt   10  0.222 ± 0.004  ns/op
VarIntSizeBenchmark.test268435456VarIntSizeLoop  avgt   10  1.133 ± 0.002  ns/op

- [infinispan#413] Optimize TagWriterImpl#writeString for ASCII only Strings
- [infinispan#414] Replace ByteArrayOutputStreamEx with non-synchronized implementation
@wburns wburns merged commit 3e133fe into infinispan:main Feb 11, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants