fix: allow single-valued controlled vocabulary fields in Solr schema #11320

Draft
wants to merge 1 commit into
base: develop
from

Conversation

vera
Contributor

@vera vera commented Mar 10, 2025

What this PR does / why we need it:

A controlled vocabulary metadata field in our custom metadata blocks was incorrectly marked multiValued in the Solr schema. This caused errors when performing Solr queries with grouping on that field.

Example:

https://dataverse.harvard.edu/api/search?&type=dataset&q=*:*&fq={!collapse%20field=%27journalArticleType%27}

leads to

"Search Syntax Error: Error from server at http://dvn-cloud-solr.lib.harvard.edu:8983/solr/collection1: org.apache.solr.search.SyntaxError: Collapsing not supported on multivalued fields".
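For anyone who wants to reproduce the request, here is a minimal sketch that only builds the URL from the example above. The helper name `build_collapse_search_url` is hypothetical; the base URL, parameters, and field name are the ones shown in the example.

```python
from urllib.parse import quote, urlencode


def build_collapse_search_url(base_url: str, field: str) -> str:
    """Build a Search API URL whose filter query collapses results on `field`."""
    params = {
        "type": "dataset",
        "q": "*:*",
        # Solr's collapse query parser; Solr rejects it for multiValued fields.
        "fq": "{!collapse field='%s'}" % field,
    }
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return f"{base_url}/api/search?{urlencode(params, quote_via=quote)}"


url = build_collapse_search_url("https://dataverse.harvard.edu", "journalArticleType")
print(url)
```

Sending this request against an instance where the field is multiValued should produce the syntax error quoted above.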

Previously, controlled vocabulary fields were always set as multiValued="true" in the Solr schema, even when neither they nor their parent fields were declared to be multivalued in the TSV file. I'm not sure why. If there is a reason for this, I would be interested to know.

This affects the following fields in the standard metadata blocks, which will now be singlevalued within the Solr schema:

Which issue(s) this PR closes:

/

Special notes for your reviewer:

/

Suggestions on how to test this:

Use the updated schema.xml to index datasets and verify correct behavior.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

/

Is there a release notes update needed for this change?:

/

Additional documentation:

/

@qqmyers
Member

qqmyers commented Mar 10, 2025

This was changed in #8601 to support indexing values in multiple languages. I think that's something we'd want to keep. Is there some other way to accomplish what you want with collapsing?

@pdurbin pdurbin moved this to Ready for Triage in IQSS Dataverse Project Mar 10, 2025
@vera
Contributor Author

vera commented Mar 10, 2025

Ok I see! That makes sense.

Hmm, I have two ideas:

  1. Would it be possible to check whether multiple languages are configured, and if not, index the cvoc fields as singlevalued? Would work for us, since we don't use multiple languages.

  2. Or, would it be possible to index the cvoc value of one "main" language (not sure how that would be determined, maybe just the first configured language?) in a separate field for grouping (like the _ss fields for faceting, but singlevalued)?
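The two ideas can be sketched roughly as follows. This is only an illustration, not Dataverse's indexing code: the function names, the dictionary shape of the Solr document, and the `_single` field suffix are assumptions made for the example, as is treating the first configured language as the "main" one.

```python
from typing import Dict, List


def index_cvoc_option1(field: str, labels_by_lang: Dict[str, str],
                       languages: List[str]) -> Dict[str, object]:
    """Idea 1: emit a single value when only one language is configured."""
    if len(languages) == 1:
        return {field: labels_by_lang[languages[0]]}
    # Several languages: one value per language, so the field is multivalued.
    return {field: [labels_by_lang[lang] for lang in languages
                    if lang in labels_by_lang]}


def index_cvoc_option2(field: str, labels_by_lang: Dict[str, str],
                       languages: List[str]) -> Dict[str, object]:
    """Idea 2: keep the multivalued field and add a single-valued companion."""
    doc: Dict[str, object] = {
        field: [labels_by_lang[lang] for lang in languages
                if lang in labels_by_lang]
    }
    # Assumption: the first configured language is the "main" one.
    doc[field + "_single"] = labels_by_lang[languages[0]]
    return doc
```

With idea 2, a collapse or group query would target the hypothetical `_single` field while search and faceting keep using the existing one.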

@qqmyers
Member

qqmyers commented Mar 10, 2025

I haven't tested but I think either of those would work, though both have ~minor drawbacks (the first means you have to change the schema if you turn i18n on/off, the latter means more fields).

What's the use case for the collapse query you're doing? (Is it something multiple institutions will want to do and that doesn't have an alternative?)

@vera
Contributor Author

vera commented Mar 11, 2025

Yes, I see those drawbacks as well. The latter drawback (more fields) sounds preferable to me. It might work to go that way, and possibly additionally only produce those extra fields if a feature flag is set?

This issue arose because we are doing work on our UI related to the improved "related datasets" prototype. For each of our search results, we'd like to show related items and group them by their type. The type is captured in a single-valued CVOC field, which we can't group by unless it is indexed in a non-multiValued field.

This use case is currently somewhat specific to our instance, but it's possible that others would also want to group search results by a single-valued CVOC field.

@vera vera changed the title fix: correctly apply 'multiValued' to fields in Solr schema fix: allow single-valued controlled vocabulary fields in Solr schema Mar 11, 2025
@ofahimIQSS ofahimIQSS added the Size: 3 A percentage of a sprint. 2.1 hours. label Mar 11, 2025
@ofahimIQSS ofahimIQSS moved this from Ready for Triage to Ready for Review ⏩ in IQSS Dataverse Project Mar 11, 2025
@ofahimIQSS
Contributor

To be reviewed in Tech Hours

@ofahimIQSS ofahimIQSS moved this from Ready for Review ⏩ to Ready for Triage in IQSS Dataverse Project Mar 11, 2025
@ofahimIQSS ofahimIQSS marked this pull request as draft March 11, 2025 15:15
vera added a commit to nfdi4health/csh-deployment that referenced this pull request Mar 12, 2025
@ofahimIQSS ofahimIQSS moved this from Ready for Triage to On Hold ⌛ in IQSS Dataverse Project Mar 18, 2025
@qqmyers
Member

qqmyers commented Mar 19, 2025

@vera - we had a tech hour discussion of internationalization and possible ways we could improve i18n support that would also avoid making these fields multivalue. We didn't come up with anything small enough to think about in the short term though.

In the call we thought that option 1 above (adding a check to see whether multiple languages are in use) would be preferable. Now, though, I think that would still leave an issue with some external cvoc scripts: the ones that store both the identifier and the human-readable form in the Solr field. That's controlled by the retrieval filtering in the config, so Dataverse could check whether that is being done and use single/multi value as needed, but if you need to use collapse on a field where such an external script is in use, it wouldn't work. (To be clear - I think code could be written so that single/multivalue is always set correctly for what the metadatablock, i18n, and external scripts need. The problem would only arise if you needed to do collapse on a field that was multi and you didn't want to change the external vocab script to allow it to be single value.) If that still works for your use case(s), I think we're OK with a PR for it (I can help with figuring out the retrieval query logic to set such fields to single/multi based on the CVocConf).
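The rule described here could look something like the following. This is a hedged sketch, not the actual Dataverse logic: `CvocFieldInfo` and its attributes are hypothetical stand-ins for what the metadata block, the language configuration, and the CVocConf retrieval filtering would provide.

```python
from dataclasses import dataclass


@dataclass
class CvocFieldInfo:
    """Hypothetical summary of the inputs that affect a cvoc field's Solr type."""
    allows_multiple: bool              # field is multivalued in the metadata block
    parent_allows_multiple: bool       # parent (compound) field is multivalued
    language_count: int                # number of configured languages
    script_indexes_id_and_label: bool  # external script stores identifier and label


def needs_multivalued(info: CvocFieldInfo) -> bool:
    """Return True if any input forces the Solr field to be multiValued."""
    return (
        info.allows_multiple
        or info.parent_allows_multiple
        or info.language_count > 1
        or info.script_indexes_id_and_label
    )
```

Under this sketch, collapse would only be available on fields for which `needs_multivalued` returns `False`.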

We didn't think just adding a second field would be something we'd want to maintain in the main repository unless/until there are some additional use cases from other instances. I think if you want to go this route, we'd suggest maintaining this feature as a fork for now. (If your overall feature for linking datasets is something that gets into the main repo, we could look at adding the extra field(s) required as part of that PR.)

We also had some discussion of whether facets or grouping could be used to do what you want without requiring a single-value field, or whether post-processing the Solr result to group results would be usable (performant enough, given the hopefully small list of items for a given dataset and assuming your queries are all for one dataset at a time). I don't think we know enough about Solr or your use case to know whether these or other Solr features would be viable, though. (We were a bit confused by the 3d* fields you listed above - we were guessing that you weren't trying to use collapse on those and that they were just other examples of fields that are single cvv with i18n making them multiple Solr fields.)
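As a rough illustration of the post-processing alternative, the sketch below groups already-retrieved results on the client side by a field that Solr may return as a list. The document shape (plain dicts) and the function name `group_by_field` are assumptions for the example, and whether this performs well enough depends on how many items come back per dataset.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_field(docs: List[Dict[str, Any]], field: str) -> Dict[str, List[Dict[str, Any]]]:
    """Group result documents by `field`, accepting single or list values."""
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue  # documents without the field are left out
        values = value if isinstance(value, list) else [value]
        # A document with several values appears in each matching group.
        for v in values:
            groups[v].append(doc)
    return dict(groups)
```

If the extra values are only translations of the same term, one would probably pick a single value per document (for example the first) as the group key instead.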

I hope that gives you a way forward. Let us know if you want to close this PR or keep it open for you to make changes.

Labels
Size: 3 A percentage of a sprint. 2.1 hours.
Projects
Status: On Hold ⌛
3 participants