User input storage strategy for I18N-aware serialization #2080

@slint

We need to establish a clear strategy for preserving user input in fields that potentially have vocabulary relations when serializing records in localized contexts. Currently, there's ambiguity around how to handle cases where users provide custom input for fields that can be localized, for example:

  • Creator affiliations
    1. User selects CERN from the affiliation search dropdown
    2. We store in the creators.affiliations array an object entry with:
      • id: "01ggx4157" (CERN's ROR ID)
      • name: "European Organization for Nuclear Research", the CERN vocabulary entry's title at the time of selection. This is important because in case the affiliation vocabulary entry name changes later, we want to preserve the original input at the time of selection.
  • Custom license
    1. User adds a custom license
    2. We store in the rights array an object entry with:
      • title.en: "My Custom License" (User input for license title)
      • description.en: "My custom license description ..." (User input for license description)
  • Custom funding award
    1. User enters custom award information with a dropdown selection of "European Commission" as funder
    2. We store in the funding array an object entry with:
      • funder.id: "00k4n6c32" (European Commission's ROR ID)
      • award.number: "123456" (User input for award number)
      • award.title.en: "My custom award title" (User input for award title)

Note that the en suffix above is controlled via the BABEL_DEFAULT_LOCALE config, which is passed to the deposit form.
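
As a minimal sketch of how the locale suffix ends up in the stored data: the helper `to_i18n_dict` and the plain-dict `config` below are hypothetical stand-ins, not InvenioRDM APIs; only the `BABEL_DEFAULT_LOCALE` key name comes from the description above.

```python
# Hypothetical stand-in for the application config; only the key name
# BABEL_DEFAULT_LOCALE is taken from the issue text.
config = {"BABEL_DEFAULT_LOCALE": "en"}


def to_i18n_dict(text, config):
    """Wrap free-text user input under the default locale key (illustrative)."""
    return {config["BABEL_DEFAULT_LOCALE"]: text}


# The custom license entry from the example above, built from raw input.
custom_license = {
    "title": to_i18n_dict("My Custom License", config),
    "description": to_i18n_dict("My custom license description ...", config),
}
```

The point of the sketch is that the `en` key says nothing about the language the user actually typed in; it only reflects the instance's default locale.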

When serializing a record from a service (either in the UI or REST API), the above input results in the following output:

# NOTE: We include the user input described above for clarity
user_input = {
    "creators": [{
        "family_name": "Ioannidis",
        "given_name": "Alex",
        "affiliations": [{
            "id": "01ggx4157",
            "name": "European Organization for Nuclear Research",
        }]
    }],
    "rights": [{
        "title": {"en": "My Custom License"},
        "description": {"en": "My custom license description ..."}
    }],
    "funding": [{
        "funder": {"id": "00k4n6c32"},
        "award": {
            "number": "123456",
            "title": {"en": "My custom award title"}
        }
    }],
    ...
}

# Accept-Language: fr
serialized_record = {
    "creators": [{
        "affiliations": [{
            "id": "01ggx4157",
            # ✅ Preserved user input using "name"
            "name": "European Organization for Nuclear Research",
            # ✅ I18N-dict title from vocabulary expansion
            "title": {
                "en": "European Organization for Nuclear Research",
                "fr": "Organisation européenne pour la recherche nucléaire"
            }
        }]
    }],
    "rights": [{
        # ❌ Preserved user input, but as "title.en"...
        "title": {"en": "My Custom License"},
        # ...should have been:
        #   "name": "My Custom License",

        # ❌ Preserved user input, but as "description.en"...
        "description": {"en": "My custom license description ..."}
        # ❔...what should it be? If "name" is used for "title", what would the
        # equivalent for "description" be?
    }],
    "funding": [{
        "award": {
            "number": "123456",
            # ❌ Preserved user input, but as "title.en"...
            "title": {"en": "My custom award title"}
            # ...should have been:
            #   "name": "My custom award title",
        },
        "funder": {
            "id": "00k4n6c32",
            # ✅ "name" from vocabulary expansion
            "name": "European Commission",
            # ✅ I18N-dict title from vocabulary expansion
            "title": {
                "en": "European Commission",
                "fr": "Commission européenne"
            }
        }
    }],
    "ui": {
        "creators": [{
            "affiliations": [{
                # ✅ Preserved user input for easier UI rendering
                "name": "European Organization for Nuclear Research",
                # ✅ Helper field to make UI rendering easier
                "title_l10n": "Organisation européenne pour la recherche nucléaire"
            }]
        }],
        "rights": [{
            # ❌ Original user input used for localized title, which is
            # wrong, since the user didn't specify a language...
            "title_l10n": "My Custom License",
            # ...should have been:
            #   "name": "My Custom License",

            # ❌ Original user input used for localized description, which is
            # wrong, since the user didn't specify a language...
            "description_l10n": "My custom license description ..."
            # ❔...what should it be? If "name" is used for "title", what would the
            # equivalent for "description" be?
        }],
        "funding": [{
            "award": {
                "number": "123456",
                # ❌ Original user input used for localized title, which is
                # wrong, since the user didn't specify a language...
                "title_l10n": "My custom award title"
            },
            "funder": {
                "name": "European Commission",
                # ✅ Helper field to make UI rendering easier.
                "title_l10n": "Commission européenne"
            }
        }]
    }
}
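
To illustrate why custom input ends up in the `*_l10n` helper fields, here is a rough sketch of a locale lookup with fallback. The function `pick_l10n` and its fallback rule are assumptions made for illustration; the actual serializer may resolve locales differently.

```python
def pick_l10n(i18n, locale, default_locale="en"):
    """Return the value for `locale`, else for `default_locale`, else None.

    Illustrative only: assumes a simple two-step fallback.
    """
    if locale in i18n:
        return i18n[locale]
    return i18n.get(default_locale)


# Vocabulary-backed title: a real translation exists for "fr".
funder_title = {"en": "European Commission", "fr": "Commission européenne"}
# Custom user input: stored under the default locale, language unknown.
custom_license_title = {"en": "My Custom License"}

funder_l10n = pick_l10n(funder_title, "fr")           # "Commission européenne"
license_l10n = pick_l10n(custom_license_title, "fr")  # "My Custom License"
```

Under this kind of fallback, the custom license title is indistinguishable from a genuine English vocabulary title, which is the ambiguity flagged with ❌ above.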

Questions to resolve

  • Input preservation: Should we always preserve user input in the name field, even when vocabulary data is available? What if there's another user input field like description (e.g. for licenses)?
  • Language detection: How do we handle cases where user input language differs from the selected locale? Do we even want to allow user input to become part of the localization data?
  • UI consistency: How should the UI layer balance showing user input vs. localized vocabulary terms?
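
One possible direction for the first question, sketched under two assumptions: that a vocabulary-backed entry can be recognized by the presence of an `id`, and that custom input would be emitted under `name` as suggested in the comments of the serialized example. The function name `serialize_title` and its behaviour are illustrative, not an agreed design, and the sketch leaves the `description` question open.

```python
def serialize_title(entry, locale, default_locale="en"):
    """Build UI title fields for one entry (illustrative sketch)."""
    title = entry.get("title", {})
    if "id" in entry:
        # Vocabulary-backed entry: the I18N dict holds real translations.
        return {"title_l10n": title.get(locale, title.get(default_locale))}
    # Custom user input: the language is unknown, so expose it unlocalized.
    return {"name": title.get(default_locale)}


funder = {
    "id": "00k4n6c32",
    "title": {"en": "European Commission", "fr": "Commission européenne"},
}
custom_license = {"title": {"en": "My Custom License"}}

funder_ui = serialize_title(funder, "fr")
license_ui = serialize_title(custom_license, "fr")
```

Here `funder_ui` carries a `title_l10n`, while `license_ui` carries only `name`, so the UI never presents free text as if it were a translation.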

Next Steps

(cc @utnapischtim, @tmorrell, @mesemus, @SarahW91)

  • Identify all fields that might suffer from this issue in the current schema
  • Organize discussion with I18N-focused instances to gather input on current pain points
  • Establish guidelines in InvenioRDM docs for handling user input vs. vocabulary data in I18N contexts
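
As a starting point for the first step, a rough heuristic could walk a record and flag objects that carry an I18N `title` dict but no vocabulary `id`. The function `find_custom_i18n_entries` is hypothetical, and the heuristic is an assumption: it will miss fields that follow a different shape.

```python
def find_custom_i18n_entries(node, path=""):
    """Yield paths of dicts that have an I18N `title` dict but no `id`."""
    if isinstance(node, dict):
        if isinstance(node.get("title"), dict) and "id" not in node:
            yield path
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            yield from find_custom_i18n_entries(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_custom_i18n_entries(item, f"{path}[{index}]")


# Trimmed copy of the user input from the example above.
record = {
    "creators": [{
        "affiliations": [{"id": "01ggx4157",
                          "name": "European Organization for Nuclear Research"}],
    }],
    "rights": [{
        "title": {"en": "My Custom License"},
        "description": {"en": "My custom license description ..."},
    }],
    "funding": [{
        "funder": {"id": "00k4n6c32"},
        "award": {"number": "123456",
                  "title": {"en": "My custom award title"}},
    }],
}

affected = list(find_custom_i18n_entries(record))
# affected == ["rights[0]", "funding[0].award"]
```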
