Skip to content

Solr Index Improvements #11374

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 84 commits into
base: develop
Choose a base branch
from

Conversation

qqmyers
Copy link
Member

@qqmyers qqmyers commented Mar 26, 2025

What this PR does / why we need it: Indexing is slow. This PR speeds up the per-file indexing via multiple changes:

  • moving dataset-level constants out of the per-file loop.
  • small changes to the differencing algorithm to check whether restriction has changed before digging into any tabular differences and avoids creating one of the differencing details when details aren't required,
  • increase the solr hard commit time (improved performance, possible slower restart time, no impact on how fast a dataset appears (soft index time))
  • use of a small DataFileProxy class for permission indexing, used when the a new jvm option is set for datasets with more than x files - reduces memory use as a large list of filemetadata/datafiles are not kept in memory
  • Use of NamedNativeQueries
  • Use of SqlResultSetMapping to avoid post processing results in Dataverse
  • Comparing any draft and last released filemetadatas for all files in a dataset via one query rather than per-file checks in Dataverse code
  • avoid double loops in comparing variable metadata
  • avoiding instantiating datasets before obtaining an indexing semaphore
  • Using streams instead of for loops
  • avoid calls to services to re-retrieve info from the db (that is already available in the dataset object tree)
  • NamedQuery to find assignees with a permission (via some role) on a given object (versus scanning roleassignments in code to find ones where the role has the right permission)
  • moved loops over versions out of the per-file methods
  • remove the deprecated unpublishedDataRelatedToMeModeEnabled and if statements that were always true
  • Increased the eclipselink cache sizes for filemetadata and datafiles to 5K and generally to 1K
  • avoid findDeep on datasets

Which issue(s) this PR closes:

  • Closes #

Special notes for your reviewer:

Testing at QDR with ~330 datasets containing up to 3K files (~12K files total): indexing now takes <2 minutes, <1 minute for a second run. (This includes some additional permissions checks since QDR allows full-text indexing of restricted files, and was done on our smallest test machine (1GB DV heap). Before the updates indexing took 6+ hours.)

The one ~non-obvious change w.r.t. moving constants out of loops is removing the datafile.isHarvested call with a dataset.isHarvested constant. If you look in the code, the datafile call just calls owner.isHarvested() so there's not change.

In general, I tested after each change to see if there was a performance improvement. In some cases the change was small - 10% and others were very large. I rejected and force pushed to remove some commits for trials that didn't improve things or caused problems (a parallel stream over files in the IndexServiceBean seems to cause failures in DataTable processing that look like the IndirectList failures we've seen in a couple other places).

What I did not do is go back to see if later changes, like increasing the cache size, made other changes less important. If there's anything concerning in the result, we could potentially try to pull that change out and test performance to see if everything is still useful.

W.r.t. the min-files-to-use-proxy: in the permission loop, we only need the file id, displayName (which comes from the fileMetadata.label for the latest version, regardless of which version you're indexing - possibly a bug), and whether the file is released. For large numbers of files, I created a proxy object with just those three fields, that can be retrieved via a query (when the dataset has more than min-files-to-use-proxy files) so that the list of filemetadata and datafile objects for a given version don't have to be retrieved (which appears to happen when you call version.getFileMetadatas() - before that it appears the filemetadata list is an IndirectList (assuming you don't use findDeep)). The setting is slightly misnamed in that the proxy object is also used for small datasets, it is just constructed from the fileMetadata directly. In testing, I thought I saw some slow-down using the query for very small datasets but somewhere in the 200-1000 file range, performance improved by using it.

Suggestions on how to test this: regression test, performance test. As this changes indexing and permission indexing, careful testing to assure that files can be found by category tags, prov text etc. would be worthwhile, as would verifying that files in draft versions can't be found unless the user has relevant permissions.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: included

Additional documentation: The only thing added is one new setting which is documented.

@qqmyers qqmyers marked this pull request as ready for review March 26, 2025 15:30
@qqmyers qqmyers moved this to Ready for Triage in IQSS Dataverse Project Mar 26, 2025
@qqmyers qqmyers added the Size: 3 A percentage of a sprint. 2.1 hours. label Mar 26, 2025
@qqmyers qqmyers added this to the 6.7 milestone Mar 26, 2025
@coveralls
Copy link

coveralls commented Mar 26, 2025

Coverage Status

coverage: 22.77% (+0.04%) from 22.729%
when pulling 9c36fcf on GlobalDataverseCommunityConsortium:solr-index-improvements
into c687d18 on IQSS:develop.

@qqmyers qqmyers force-pushed the solr-index-improvements branch from df32ade to 9c36fcf Compare April 5, 2025 14:42
@qqmyers qqmyers moved this from In Progress 💻 to Ready for Review ⏩ in IQSS Dataverse Project Apr 5, 2025
@qqmyers qqmyers removed their assignment Apr 7, 2025
@cmbz cmbz added the FY25 Sprint 21 FY25 Sprint 21 (2025-04-09 - 2025-04-23) label Apr 9, 2025
@landreev landreev self-requested a review April 10, 2025 15:04
@landreev landreev self-assigned this Apr 10, 2025
@landreev landreev moved this from Ready for Review ⏩ to In Review 🔎 in IQSS Dataverse Project Apr 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FY25 Sprint 20 FY25 Sprint 20 (2025-03-26 - 2025-04-09) FY25 Sprint 21 FY25 Sprint 21 (2025-04-09 - 2025-04-23) Size: 10 A percentage of a sprint. 7 hours.
Projects
Status: In Review 🔎
Development

Successfully merging this pull request may close these issues.

4 participants