TIMX 410 - add TIMDEX provenance to Opensearch mapping #360

Merged
merged 3 commits into from
Jan 24, 2025

ghukill
Contributor

@ghukill ghukill commented Jan 22, 2025

Purpose and background context

This PR updates the Opensearch mapping to support the new timdex_provenance field.

The mapping update itself was quite minimal, following the pattern of a nested structure of fields.

Testing was a bit of a rabbit hole that I eventually crawled back out of with minimal changes.

As noted in a previous PR, the testing suite for TIM relies heavily on VCR cassettes for recording interactions with Opensearch. Interestingly, there is relatively light per-field testing of the mapping (e.g. there are no dedicated tests for a field foo where we attempt to index a value bar), and perhaps that is okay.

Short of a dedicated test that checks whether the timdex_provenance mapping matches the actual JSON values we expect in a transformed record, the VCR cassette for the test test_create_index_success has been re-recorded with updated sample records that contain this new field. The successful creation of records suggests that the mapping is aligned with the new values in tests/fixtures/sample_records.json.
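
A dedicated per-field check of the kind described above could look like the following minimal sketch. The field types in ASSUMED_PROVENANCE_TYPES are inferred from the example values shown in this PR, not copied from the actual mapping, and check_provenance is a hypothetical helper that is not part of TIM.

```python
from datetime import date

# Assumption: types inferred from the example values in this PR, not taken
# from the real Opensearch mapping.
ASSUMED_PROVENANCE_TYPES = {
    "source": str,
    "run_date": str,
    "run_id": str,
    "run_record_offset": int,
}


def check_provenance(record: dict) -> list[str]:
    """Return a list of problems found in a record's timdex_provenance section."""
    provenance = record.get("timdex_provenance")
    if provenance is None:
        # Provenance is only present when Transmogrifier has written it.
        return []

    problems = []
    for field, expected_type in ASSUMED_PROVENANCE_TYPES.items():
        if field not in provenance:
            problems.append(f"missing field: {field}")
        elif type(provenance[field]) is not expected_type:
            found = type(provenance[field]).__name__
            problems.append(f"wrong type for {field}: {found}")

    # Assumption: run_date is an ISO 8601 calendar date such as "2023-08-09".
    run_date = provenance.get("run_date")
    if isinstance(run_date, str):
        try:
            date.fromisoformat(run_date)
        except ValueError:
            problems.append(f"invalid run_date: {run_date}")
    return problems
```

A check like this could be run over each entry in tests/fixtures/sample_records.json, though it only validates the sample data against assumed types and says nothing about what Opensearch itself would accept.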

Additionally, a minor update was made to Makefile for managing a local Opensearch instance per this commit.

How can a reviewer manually see the effects of these changes?

1- Run Opensearch locally

make local-opensearch-start

2- Set Dev1 credentials in terminal

3- Give Opensearch 20-30 seconds to start... then create an index and bulk update from a test dataset in S3

export INDEX_NAME=libguides-2025-01-22t12-00-00
export DATASET_LOCATION=s3://timdex-extract-dev-222053980223/gh-test/test-dataset-2025-01-22

pipenv run tim --verbose create -i $INDEX_NAME

pipenv run tim --verbose promote -i $INDEX_NAME -a libguides -a all-current

pipenv run tim bulk-update \
-i $INDEX_NAME \
--run-date="2023-08-09" \
--run-id="e0c35f24-6030-4282-9aca-82cb20000210" \
$DATASET_LOCATION
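
Rather than waiting a fixed 20-30 seconds, the startup wait in step 3 could be replaced by polling. The sketch below is one way to do that, under stated assumptions: the URL http://localhost:9200 is a guess at where the local instance listens (it is not given in this PR), and both function names are hypothetical. An instance that requires HTTPS or authentication would be reported as not ready by this simple check.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Assumption: address of the local Opensearch instance; adjust as needed.
ASSUMED_OPENSEARCH_URL = "http://localhost:9200"


def opensearch_is_up(url: str = ASSUMED_OPENSEARCH_URL) -> bool:
    """Return True if a plain HTTP GET to the URL answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all count as "not ready".
        return False


def wait_until_ready(
    check: Callable[[], bool] = opensearch_is_up,
    timeout: float = 60.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() until it returns True or the timeout in seconds elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

The check and sleep functions are injectable so the polling loop can be exercised without a running instance.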

4- Navigate to http://localhost:5601/app/dev_tools#/console and perform this query

GET libguides-2025-01-22t12-00-00/_search
{
  "_source": [
    "timdex_record_id",
    "timdex_provenance"
  ],
  "size": 20
}

Note the timdex_provenance sections in the records returned, for example:

      {
        "_index": "libguides-2025-01-22t12-00-00",
        "_id": "libguides:guides-175846",
        "_score": 1,
        "_source": {
          "timdex_provenance": {
            "run_id": "e0c35f24-6030-4282-9aca-82cb20000210",
            "source": "libguides",
            "run_record_offset": 0,
            "run_date": "2023-08-09"
          },
          "timdex_record_id": "libguides:guides-175846"
        }
      }
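
For anyone scripting this check instead of using the console, the query body and the response handling can be sketched in Python. This is illustrative only: build_provenance_query and provenance_by_record_id are hypothetical helper names, and the code works on an already-parsed search response (the standard hits.hits structure) rather than calling Opensearch itself.

```python
def build_provenance_query(size: int = 20) -> dict:
    """Build the same search body as the console query above."""
    return {
        "_source": ["timdex_record_id", "timdex_provenance"],
        "size": size,
    }


def provenance_by_record_id(response: dict) -> dict:
    """Map each returned record ID to its provenance section (or None)."""
    results = {}
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        # Fall back to the document _id if timdex_record_id was not returned.
        record_id = source.get("timdex_record_id", hit.get("_id"))
        results[record_id] = source.get("timdex_provenance")
    return results
```

Records indexed before this change, or transformed without provenance, would map to None.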

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES: Opensearch documents will now contain timdex_provenance fields if Transmogrifier has included them during transformation

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples have been verified
  • New dependencies are appropriate or there were no changes

@ghukill ghukill force-pushed the TIMX-410-provenance-mappings branch from 4fb6e47 to 8364fa3 Compare January 22, 2025 21:06
@coveralls

coveralls commented Jan 22, 2025

Pull Request Test Coverage Report for Build 12932961282

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 12932673476: 0.0%
Covered Lines: 435
Relevant Lines: 435

💛 - Coveralls

@ghukill ghukill marked this pull request as ready for review January 23, 2025 14:08
@ghukill
Contributor Author

ghukill commented Jan 23, 2025

@ehanson8, @jonavellecuerdo - if interested, I've confirmed this works in Dev1 as well.

Here is the document for accessing Opensearch dashboards in Dev1: https://mitlibraries.atlassian.net/wiki/spaces/D/pages/3665854480/How+to+access+OpenSearch+Dashboards+in+AWS.

Once it is open and you are at the query screen, this works:

GET /all-current/_search
{
  "query": {
    "exists": {
      "field": "timdex_provenance"
    }
  }
}

Here is an example provenance object:

"timdex_provenance": {
  "source": "libguides",
  "run_date": "2025-01-23",
  "run_id": "e758d6c4-6ee4-4862-a00f-b9da4d3758ad",
  "run_record_offset": 0
}

Note the run_id = "e758d6c4-6ee4-4862-a00f-b9da4d3758ad". That run_id correlates directly to this StepFunction invocation: https://us-east-1.console.aws.amazon.com/states/home?region=us-east-1#/v2/executions/details/arn:aws:states:us-east-1:222053980223:execution:timdex-ingest-v2-dev:e758d6c4-6ee4-4862-a00f-b9da4d3758ad.

This is beginning to fully close the loop here:

  1. StepFunction runs, and the execution ID is passed as a run_id to Transmog
  2. Transmog writes this run_id to dataset records and adds it to the provenance section
  3. TIM indexes records, now including a provenance section
  4. We encounter a record "in the wild" and immediately know a) which StepFunction run it came from, and b) how to find it quickly in the dataset
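
The last step, going from a run_id found on a record back to the StepFunction execution, can be sketched as below. The region, account ID, and state machine name are taken from the Dev1 link above and are assumptions for any other environment; the function names are hypothetical.

```python
# Values taken from the Dev1 execution link in this comment; other
# environments would need different values.
ASSUMED_REGION = "us-east-1"
ASSUMED_ACCOUNT_ID = "222053980223"
ASSUMED_STATE_MACHINE = "timdex-ingest-v2-dev"


def execution_arn(
    run_id: str,
    region: str = ASSUMED_REGION,
    account_id: str = ASSUMED_ACCOUNT_ID,
    state_machine: str = ASSUMED_STATE_MACHINE,
) -> str:
    """Build the StepFunction execution ARN that a run_id corresponds to."""
    return f"arn:aws:states:{region}:{account_id}:execution:{state_machine}:{run_id}"


def execution_console_url(run_id: str, region: str = ASSUMED_REGION) -> str:
    """Build the AWS console URL for the execution behind a run_id."""
    arn = execution_arn(run_id, region=region)
    return (
        f"https://{region}.console.aws.amazon.com/states/home"
        f"?region={region}#/v2/executions/details/{arn}"
    )
```

Passing the run_id from the example provenance object above reproduces the execution link given in this comment.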

Why these changes are being introduced:

With Transmogrifier beginning to write a "timdex_provenance" section
to TIMDEX records, an update is needed for TIM to include this
during index creation and writing.

How this addresses that need:

Updates Opensearch mapping to include new "timdex_provenance" field.

Additionally, sample records were updated to include provenance field
values, and cassettes were re-recorded for successful indexing of
records with those provenance values.

Side effects of this change:
* Support for records with provenance sections

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-406
* https://mitlibraries.atlassian.net/browse/TIMX-410

@ghukill ghukill force-pushed the TIMX-410-provenance-mappings branch from 227eecf to e5c3197 Compare January 23, 2025 15:52
Contributor

@ehanson8 ehanson8 left a comment

Looks good, and here's to rabbit holes only leading to minimal changes!

@ghukill ghukill merged commit de0e4d5 into main Jan 24, 2025
3 checks passed