Skip to content

Avoid allocations in IndexFileNames.parseGeneration method #14920

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

doom369
Copy link

@doom369 doom369 commented Jul 8, 2025

IndexFileNames.parseGeneration allocates multiple unnecessary strings (3-6 strings per call) + string[] objects:

image (17)

With often reindexing this allocations become notable during profiling. These allocations could be easily eliminated with a manual underscore search. After the fix only 1 string will be allocated in the worst case.

Copy link

github-actions bot commented Jul 8, 2025

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@github-actions github-actions bot added this to the 11.0.0 milestone Jul 8, 2025
@rmuir
Copy link
Member

rmuir commented Jul 8, 2025

With often reindexing this allocations become notable during profiling.

Your profiler is lying to you. Toss it and get a better one :)

int third = (second != -1) ? filename.indexOf('_', second + 1) : -1;

int parts = (second == -1 || second >= end) ? 2 : (third == -1 || third >= end) ? 3 : 4;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a massive increase in complexity of the code: the tradeoff isn't worth it.

It's a simple string method called once, doesn't even approach the cost of the system call to actually read the file.

Copy link
Author

@doom369 doom369 Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our scenario, index updates (and we have ~6 different indexes) can happen every 5 seconds. It's profiling info from a 5-minute production run with a async-profiler (one of the best at the moment, imho). So, if these allocations were captured, they're not rare. Agree about complexity. But this method now should be also 5-10x faster if that matters for you. If you want, I can provide you with a microbenchmark.

Copy link
Member

@rmuir rmuir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this doesn't look like the right tradeoff to me. I'd recommend using a better profiler.

I don't see IndexFileNames anywhere in the CPU nor heap profiling of our benchmarks:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants