Solr Index Improvements #11374
Open
qqmyers wants to merge 84 commits into IQSS:develop from GlobalDataverseCommunityConsortium:solr-index-improvements
Conversation
saw IndirectList failure in this section, simplifying
Causes edit locks to remain (in memory only!) after dataset changes
This reverts commit 2dc56e1.
Force-pushed from df32ade to 9c36fcf
Labels
FY25 Sprint 20 (2025-03-26 - 2025-04-09)
FY25 Sprint 21 (2025-04-09 - 2025-04-23)
Size: 10 (a percentage of a sprint; 7 hours)
What this PR does / why we need it: Indexing is slow. This PR speeds up the per-file indexing via multiple changes:
Which issue(s) this PR closes:
Special notes for your reviewer:
Testing at QDR with ~330 datasets containing up to 3K files each (~12K files total): indexing now takes <2 minutes, and <1 minute on a second run. (This includes some additional permissions checks, since QDR allows full-text indexing of restricted files, and was done on our smallest test machine, with a 1GB DV heap. Before these updates, indexing took 6+ hours.)
The one ~non-obvious change w.r.t. moving constants out of loops is replacing the per-file datafile.isHarvested call with a dataset.isHarvested constant. If you look in the code, the datafile call just delegates to owner.isHarvested(), so there's no behavioral change.
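The hoisting described above can be sketched as follows (class and method names here are illustrative stand-ins, not the actual Dataverse entities):

```java
import java.util.List;

public class HoistSketch {
    // Minimal stand-ins for the real entities (illustrative only).
    static class Dataset {
        boolean harvested;
        boolean isHarvested() { return harvested; }
    }

    static class DataFile {
        Dataset owner;
        DataFile(Dataset d) { owner = d; }
        // The real DataFile.isHarvested() just delegates to its owner,
        // so the value is identical for every file in the loop.
        boolean isHarvested() { return owner.isHarvested(); }
    }

    static int countIndexable(Dataset dataset, List<DataFile> files) {
        // Hoisted: evaluate the dataset-level constant once, outside the loop,
        // instead of calling f.isHarvested() (owner dereference) per file.
        boolean harvested = dataset.isHarvested();
        int n = 0;
        for (DataFile f : files) {
            if (!harvested) n++; // previously: if (!f.isHarvested()) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Dataset ds = new Dataset(); // not harvested by default
        List<DataFile> files = List.of(new DataFile(ds), new DataFile(ds));
        System.out.println(countIndexable(ds, files)); // prints 2
    }
}
```

Since every file in a dataset shares the same owner, the per-iteration call does redundant work; hoisting it is safe precisely because the delegation makes the value loop-invariant.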
In general, I tested after each change to see whether there was a performance improvement. In some cases the change was small (~10%); in others it was very large. I rejected and force-pushed to remove some commits for trials that didn't improve things or caused problems (a parallel stream over files in the IndexServiceBean seems to cause failures in DataTable processing that look like the IndirectList failures we've seen in a couple of other places).
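The rejected parallel-stream trial looked roughly like the following (illustrative only; the real loop lives in IndexServiceBean, and the suspected cause of the DataTable failures is that JPA's lazily-materialized IndirectList is not safe to iterate from multiple threads):

```java
import java.util.List;

public class ParallelTrial {
    // Stand-in for the per-file indexing work.
    static String indexFile(String label) {
        return "doc:" + label;
    }

    public static void main(String[] args) {
        List<String> files = List.of("a", "b", "c");

        // Sequential (kept): safe even when the source collection is a
        // lazily-materialized JPA IndirectList.
        long seq = files.stream().map(ParallelTrial::indexFile).count();

        // Parallel (reverted): fine for a plain List like this one, but
        // iterating an IndirectList concurrently can fail while the
        // collection is still being materialized.
        long par = files.parallelStream().map(ParallelTrial::indexFile).count();

        System.out.println(seq + " " + par); // prints "3 3"
    }
}
```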
What I did not do is go back and check whether later changes, like increasing the cache size, made earlier changes less important. If anything in the result is concerning, we could pull that change out and re-test performance to see whether everything is still useful.
W.r.t. the min-files-to-use-proxy setting: in the permission loop, we only need the file id, the displayName (which comes from the fileMetadata.label of the latest version, regardless of which version you're indexing - possibly a bug), and whether the file is released. For large numbers of files, I created a proxy object with just those three fields that can be retrieved via a query (when the dataset has more than min-files-to-use-proxy files), so the lists of fileMetadata and datafile objects for a given version don't have to be retrieved. That retrieval appears to happen when you call version.getFileMetadatas(); before that, the fileMetadata list appears to be an IndirectList (assuming you don't use findDeep). The setting is slightly misnamed in that the proxy object is also used for small datasets; it is just constructed from the fileMetadata objects directly. In testing, I thought I saw some slowdown using the query for very small datasets, but somewhere in the 200-1000 file range, performance improved by using it.
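The proxy approach might be sketched like this (the names and the JPQL string below are assumptions for illustration, not the exact Dataverse code): a small value object holding only the three fields the permission loop needs, built either directly from already-loaded fileMetadatas (small datasets) or via a constructor query (large datasets), so the full entity lists never have to be materialized.

```java
import java.util.List;

public class FileProxySketch {
    // Minimal value object with only the fields the permission loop uses.
    record FilePermProxy(long fileId, String displayName, boolean released) {}

    // Hypothetical JPQL constructor query for the large-dataset path; the
    // real query would use the fully qualified proxy class name and the
    // actual entity mappings.
    static final String PROXY_QUERY =
        "SELECT NEW FilePermProxy(df.id, fm.label, df.released) "
      + "FROM FileMetadata fm JOIN fm.dataFile df "
      + "WHERE fm.datasetVersion.id = :versionId";

    // Stand-in for an already-loaded FileMetadata (illustrative only).
    record FileMetadata(long fileId, String label, boolean fileReleased) {}

    // Small-dataset path: construct proxies from the in-memory metadata
    // rather than issuing the query.
    static List<FilePermProxy> fromMetadatas(List<FileMetadata> fms) {
        return fms.stream()
                  .map(fm -> new FilePermProxy(fm.fileId(), fm.label(), fm.fileReleased()))
                  .toList();
    }

    public static void main(String[] args) {
        var fms = List.of(new FileMetadata(1L, "a.txt", true),
                          new FileMetadata(2L, "b.txt", false));
        System.out.println(fromMetadatas(fms).size()); // prints 2
    }
}
```

The design point is that both paths produce the same proxy type, so the permission loop is indifferent to how the data was fetched; only the threshold decides which path runs.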
Suggestions on how to test this: regression test, performance test. As this changes indexing and permission indexing, careful testing to ensure that files can still be found by category tags, prov text, etc. would be worthwhile, as would verifying that files in draft versions can't be found unless the user has the relevant permissions.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?: included
Additional documentation: The only thing added is one new setting which is documented.