Solr Index Improvements #11374

qqmyers · 2025-03-26T14:49:53Z

What this PR does / why we need it: Indexing is slow. This PR speeds up the per-file indexing via multiple changes:

- Moving dataset-level constants out of the per-file loop.
- Small changes to the differencing algorithm: it now checks whether restriction has changed before digging into any tabular differences, and avoids creating one of the differencing details when details aren't required.
- Increasing the Solr hard commit time (improved performance, possibly slower restart time, no impact on how fast a dataset appears, which depends on the soft index time).
- Use of a small DataFileProxy class for permission indexing, used when a new JVM option is set, for datasets with more than x files. This reduces memory use because a large list of filemetadata/datafiles is not kept in memory.
- Use of NamedNativeQueries.
- Use of SqlResultSetMapping to avoid post-processing results in Dataverse.
- Comparing the draft and last released filemetadatas for all files in a dataset via one query rather than per-file checks in Dataverse code.
- Avoiding double loops when comparing variable metadata.
- Avoiding instantiating datasets before obtaining an indexing semaphore.
- Using streams instead of for loops.
- Avoiding calls to services to re-retrieve info from the db that is already available in the dataset object tree.
- A NamedQuery to find assignees with a permission (via some role) on a given object, versus scanning roleassignments in code to find ones where the role has the right permission.
- Moving loops over versions out of the per-file methods.
- Removing the deprecated unpublishedDataRelatedToMeModeEnabled and if statements that were always true.
- Increasing the EclipseLink cache sizes for filemetadata and datafiles to 5K and generally to 1K.
- Avoiding findDeep on datasets.
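As an illustration of the "obtain the indexing semaphore before instantiating the dataset" item, here is a minimal sketch. It is not the code in this PR: `LoadedDatasetSketch`, `SemaphoreFirstIndexer`, the `loader` function, and the permit count are all hypothetical names chosen for the example. The point is only the ordering: a job waiting for a permit holds just an id, and the (potentially large) object is loaded after the permit is acquired.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongFunction;

// Hypothetical stand-in for a loaded dataset; not Dataverse's Dataset entity.
record LoadedDatasetSketch(long id, int fileCount) {}

final class SemaphoreFirstIndexer {
    // Caps how many indexing jobs run at once (the cap is an assumed value).
    private final Semaphore permits;
    // Hypothetical loader standing in for a database lookup by id.
    private final LongFunction<LoadedDatasetSketch> loader;
    private final AtomicInteger loads = new AtomicInteger();

    SemaphoreFirstIndexer(int maxConcurrent, LongFunction<LoadedDatasetSketch> loader) {
        this.permits = new Semaphore(maxConcurrent);
        this.loader = loader;
    }

    /** Waits for a permit, then loads the dataset and returns its file count. */
    int index(long datasetId) throws InterruptedException {
        // Only the id is held while waiting, so queued jobs stay cheap.
        permits.acquire();
        try {
            loads.incrementAndGet();
            LoadedDatasetSketch dataset = loader.apply(datasetId);
            return dataset.fileCount(); // placeholder for the real indexing work
        } finally {
            permits.release();
        }
    }

    int loadCount() { return loads.get(); }

    int availablePermits() { return permits.availablePermits(); }
}
```

With this ordering, the number of datasets held in memory at once is bounded by the permit count rather than by the number of queued indexing requests.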

Which issue(s) this PR closes:

Closes #

Special notes for your reviewer:

Testing at QDR with ~330 datasets containing up to 3K files (~12K files total): indexing now takes <2 minutes, <1 minute for a second run. This includes some additional permissions checks, since QDR allows full-text indexing of restricted files, and was done on our smallest test machine (1GB DV heap). Before the updates, indexing took 6+ hours.

The one ~non-obvious change w.r.t. moving constants out of loops is replacing the datafile.isHarvested call with a dataset.isHarvested constant. If you look in the code, the datafile call just calls owner.isHarvested(), so there's no change in behavior.
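A minimal sketch of why that hoisting is behavior-preserving, using simplified stand-in classes (`SketchDataset`, `SketchFile`, and `HoistingSketch` are hypothetical and are not the Dataverse entities or IndexServiceBean code). The only property assumed is the one stated above: the file-level call delegates to its owner.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for the Dataset and DataFile entities.
final class SketchDataset {
    private final boolean harvested;
    final List<SketchFile> files = new ArrayList<>();

    SketchDataset(boolean harvested) { this.harvested = harvested; }

    boolean isHarvested() { return harvested; }

    SketchFile addFile(String name) {
        SketchFile file = new SketchFile(this, name);
        files.add(file);
        return file;
    }
}

final class SketchFile {
    private final SketchDataset owner;
    final String name;

    SketchFile(SketchDataset owner, String name) {
        this.owner = owner;
        this.name = name;
    }

    // Delegates to the owner, as described for the datafile call.
    boolean isHarvested() { return owner.isHarvested(); }
}

final class HoistingSketch {
    // Before: the dataset-level value is re-read through every file.
    static List<String> perFileLookup(SketchDataset dataset) {
        List<String> docs = new ArrayList<>();
        for (SketchFile file : dataset.files) {
            docs.add(file.name + ":" + file.isHarvested());
        }
        return docs;
    }

    // After: the value is read once, outside the per-file loop.
    static List<String> hoisted(SketchDataset dataset) {
        final boolean harvested = dataset.isHarvested();
        List<String> docs = new ArrayList<>();
        for (SketchFile file : dataset.files) {
            docs.add(file.name + ":" + harvested);
        }
        return docs;
    }
}
```

Both methods produce the same list for any dataset, because every file in the loop shares the same owner.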

In general, I tested after each change to see if there was a performance improvement. In some cases the improvement was small (~10%); in others it was very large. I dropped, and force pushed to remove, some commits for trials that didn't improve things or caused problems (a parallel stream over files in the IndexServiceBean seems to cause failures in DataTable processing that look like the IndirectList failures we've seen in a couple of other places).

What I did not do is go back to see if later changes, like increasing the cache size, made other changes less important. If there's anything concerning in the result, we could potentially try to pull that change out and test performance to see if everything is still useful.

W.r.t. the min-files-to-use-proxy setting: in the permission loop, we only need the file id, the displayName (which comes from the fileMetadata.label for the latest version, regardless of which version you're indexing; possibly a bug), and whether the file is released. For large numbers of files, I created a proxy object with just those three fields that can be retrieved via a query (when the dataset has more than min-files-to-use-proxy files), so that the list of filemetadata and datafile objects for a given version doesn't have to be retrieved. That retrieval appears to happen when you call version.getFileMetadatas(); before that, the filemetadata list appears to be an IndirectList (assuming you don't use findDeep). The setting is slightly misnamed in that the proxy object is also used for small datasets; there it is just constructed from the fileMetadata directly. In testing, I thought I saw some slow-down using the query for very small datasets, but somewhere in the 200-1000 file range performance improved by using it.
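The selection logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the DataFileProxy class from this PR: `FileProxySketch`, `LoadedFileMetadataSketch`, `ProxySelectionSketch`, and the two loader functions are hypothetical, and the query is replaced by a plain function so the example is self-contained.

```java
import java.util.List;
import java.util.function.LongFunction;
import java.util.stream.Collectors;

// The three fields the permission loop needs, per the description above.
record FileProxySketch(long fileId, String displayName, boolean released) {}

// Hypothetical stand-in for a fully loaded filemetadata/datafile pair;
// "otherState" represents everything the permission loop does not need.
record LoadedFileMetadataSketch(long fileId, String label, boolean released, String otherState) {}

final class ProxySelectionSketch {
    /**
     * Returns proxies for a dataset's files. Above the threshold they come from
     * a lightweight query (hypothetical queryProxies); otherwise they are built
     * from the already-loaded file metadata (hypothetical loadFileMetadatas).
     */
    static List<FileProxySketch> proxiesFor(
            long datasetId,
            int fileCount,
            int minFilesToUseProxy,
            LongFunction<List<FileProxySketch>> queryProxies,
            LongFunction<List<LoadedFileMetadataSketch>> loadFileMetadatas) {
        if (fileCount > minFilesToUseProxy) {
            // Large dataset: skip materializing the full object list.
            return queryProxies.apply(datasetId);
        }
        // Small dataset: the proxy is still used, just built in memory.
        return loadFileMetadatas.apply(datasetId).stream()
                .map(fm -> new FileProxySketch(fm.fileId(), fm.label(), fm.released()))
                .collect(Collectors.toList());
    }
}
```

Either way the permission loop sees the same three-field type, which is why the setting only controls how the proxies are obtained, not whether they are used.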

Suggestions on how to test this: regression test, performance test. As this changes indexing and permission indexing, careful testing to ensure that files can be found by category tags, prov text, etc. would be worthwhile, as would verifying that files in draft versions can't be found unless the user has the relevant permissions.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: included

Additional documentation: The only thing added is one new setting which is documented.

coveralls · 2025-03-26T15:39:04Z

coverage: 22.77% (+0.04%) from 22.729%
when pulling 9c36fcf on GlobalDataverseCommunityConsortium:solr-index-improvements
into c687d18 on IQSS:develop.

qqmyers added 7 commits March 26, 2025 08:47

add debug index logging

1ab8b57

use loop constants, etc.

8cf78c4

minimize work when details false, check restrict earlier/simplier

1e43490

really fix test

3b746f7

simplify - fix restrict bug

f23a274

release note

8f89906

fix compile issue, additional tweaks

17cd5b5

qqmyers marked this pull request as ready for review March 26, 2025 15:30

qqmyers added this to IQSS Dataverse Project Mar 26, 2025

qqmyers moved this to Ready for Triage in IQSS Dataverse Project Mar 26, 2025

qqmyers added the "Size: 3" label (A percentage of a sprint. 2.1 hours.) Mar 26, 2025

qqmyers added this to the 6.7 milestone Mar 26, 2025

qqmyers added 17 commits March 28, 2025 12:27

try parallel file loop

fb36f3b

fix NPE and final issues

a8e5476

try finddeep

646bb83

avoid double loop

612e521

diff by query

0d6f7be

numeric params

e2d4e98

fix merge issues, change doFullText logic

85425e2

formatting

985227b

restore indexing of released files

3d2c408

delay getting dataset until semaphore is available

a649937

restore transaction, don't finddeep

1b2548a

simplify ToU logic

9deef72

avoid keeping files in List

9e5ea00

change dataset case too

b7924a3

avoid variableservice

6f6e32e

saw IndirectList failure in this section, simplifying

try EAGER

dfbf603

avoid isTabularData

7e508b6

qqmyers added 23 commits April 2, 2025 17:04

try weak on files/md

b92489f

add file proxy

0fcd064

stream, cleanup feature flag

93c4e69

make the jvm option optional

7c817b9

merge fix

2a6e9f3

DvObj missed changes

2f87415

cleanup

1bfe78d

Merge remote-tracking branch 'IQSS/develop' into solr-index-improvements

60ec76e

cleanup, remove restricted ft code from QDR

36b4efb

make named queries

c508ec6

try stream, remove sync blocks from parallel test

474c3b2

docs and setting updates

82043d1

sync query mapping and constructor

da1b631

named query, back to asc order

2db1625

query fix

db8791e

lengthen hard commit time

7cf09a6

remove unused query

750974f

revert hard commit change

2dc56e1

remove shared cache from persistence.xml

862197d

Causes edit locks to remain (in memory only!) after dataset changes

Revert "revert hard commit change"

ac32815

This reverts commit 2dc56e1.

update query to recurse to permissionroot

cff9848

fix mapping to long

e4e39d4

flip recursion

9c36fcf

qqmyers force-pushed the solr-index-improvements branch from df32ade to 9c36fcf Compare April 5, 2025 14:42

qqmyers moved this from In Progress 💻 to Ready for Review ⏩ in IQSS Dataverse Project Apr 5, 2025

qqmyers removed their assignment Apr 7, 2025

cmbz added the FY25 Sprint 21 (2025-04-09 - 2025-04-23) label Apr 9, 2025

landreev self-requested a review April 10, 2025 15:04

landreev self-assigned this Apr 10, 2025

landreev moved this from Ready for Review ⏩ to In Review 🔎 in IQSS Dataverse Project Apr 10, 2025
