Add bulk-update command and support for TIMDEX parquet dataset #359


Merged
merged 4 commits into main on Jan 23, 2025

Conversation

@jonavellecuerdo jonavellecuerdo commented Jan 10, 2025

Purpose and background context

This PR adds a new CLI command, bulk-update, to TIM. Given run_date and run_id arguments, it selects the matching subset of the v2 TIMDEX parquet dataset, then performs bulk indexing and bulk deleting for records with action="index" and action="delete", respectively.
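At a high level, the command routes each record by its action field to the appropriate bulk operation. The sketch below is a minimal stdlib-only illustration of that partitioning step, not TIM's actual implementation; the Record class and field names are stand-ins:

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Illustrative stand-in for a record read from the parquet dataset."""

    timdex_record_id: str
    action: str  # "index" or "delete"


def partition_by_action(records):
    """Split records into those to bulk index and those to bulk delete."""
    to_index, to_delete = [], []
    for record in records:
        if record.action == "index":
            to_index.append(record)
        elif record.action == "delete":
            to_delete.append(record)
    return to_index, to_delete


records = [
    Record("libguides:guides-1", "index"),
    Record("gisogm:edu.wisc:94dfabb705ae", "delete"),
    Record("libguides:guides-2", "index"),
]
to_index, to_delete = partition_by_action(records)
print(len(to_index), len(to_delete))  # 2 1
```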

How can a reviewer manually see the effects of these changes?

A. Review added unit tests.
Note: Similar to what was done for timdex-pipeline-lambdas, I moved the existing V1 unit tests for CLI commands into test_cli_v1.py. V1 tests remain unchanged. The two new unit tests for the bulk-update command are in test_cli.py. For V2 unit tests, I used patch decorators to replace function calls with mock objects.
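The patch-decorator approach described above can be illustrated with a stdlib-only sketch. The module, function names, and signatures here are hypothetical stand-ins for TIM internals, not the real code:

```python
from types import SimpleNamespace
from unittest.mock import MagicMock, patch

# Hypothetical stand-in for the module whose calls the V2 CLI tests replace.
opensearch = SimpleNamespace(
    bulk_index=lambda client, index, records: {"created": len(records)},
    bulk_delete=lambda client, index, records: {"deleted": len(records)},
)


def bulk_update(client, index, to_index, to_delete):
    """Hypothetical command body that delegates to the helpers above."""
    return {
        "index": opensearch.bulk_index(client, index, to_index),
        "delete": opensearch.bulk_delete(client, index, to_delete),
    }


# Replace the helpers with mocks so the command logic runs without OpenSearch.
with patch.object(opensearch, "bulk_index", MagicMock(return_value={"created": 2})) as mock_index, \
     patch.object(opensearch, "bulk_delete", MagicMock(return_value={"deleted": 1})) as mock_delete:
    summary = bulk_update(None, "gisogm-test", ["r1", "r2"], ["r3"])

print(summary)  # {'index': {'created': 2}, 'delete': {'deleted': 1}}
```

The real tests use patch decorators on the actual module paths; the mechanics of substituting a mock and asserting on the calls are the same.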

@ghukill and I took a look at the recorded vcr cassettes and realized that the unit tests will need to be updated at some point (either update the vcr cassettes or refactor unit tests), but this is outside the scope of this ticket. Additional testing was done via a local OpenSearch instance to verify functionality.

B. Screenshots and logs from testing with an [OpenSearch instance running locally with Docker](https://mitlibraries.atlassian.net/browse/TIMX-428) have been provided.

Prerequisite: A TIMDEX Dataset was created in Dev1 containing records from a sample of extract files for libguides and gisogm (thanks, @ghukill 😄). Here are some details regarding the files:

```shell
# Dev1 Dataset
location = s3://timdex-extract-dev-222053980223/dataset/

# "libguides" daily run with 33 records to index
run_date = 2024-12-01
run_id = 90d6f2e1-8233-4471-9d4c-59830effda23

# "gisogm" daily run with 103 records to index, 83 to delete
run_date = 2024-05-02
run_id = cc4339a5-00de-4717-a4b1-43be292f61f7
```
  1. Two indexes were created and promoted for libguides and gisogm.
  2. bulk-update command for gisogm:
```shell
pipenv run tim bulk-update \
-s gisogm \
--run-date="2024-05-02" \
--run-id="cc4339a5-00de-4717-a4b1-43be292f61f7" \
s3://timdex-extract-dev-222053980223/dataset/
```
  3. bulk-update log output [REDACTED] for gisogm:
    Note: Errors are thrown because the "records to delete" are not in the local OpenSearch instance. This is okay!
Loading .env environment variables...
2025-01-10 09:43:15,858 INFO tim.cli.main(): Logger 'root' configured with level=INFO
2025-01-10 09:43:15,858 INFO tim.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
2025-01-10 09:43:15,858 INFO tim.opensearch.configure_opensearch_client(): OpenSearch request configurations: {'OPENSEARCH_BULK_MAX_CHUNK_BYTES': 104857600, 'OPENSEARCH_BULK_MAX_RETRIES': 50, 'OPENSEARCH_REQUEST_TIMEOUT': 120}
2025-01-10 09:43:15,859 INFO tim.cli.main(): OpenSearch client configured for endpoint 'localhost'
2025-01-10 09:43:15,870 INFO tim.cli.bulk_update(): Bulk updating records from dataset 's3://timdex-extract-dev-222053980223/dataset/' into 'gisogm-2025-01-10t14-41-50'
2025-01-10 09:43:16,261 INFO timdex_dataset_api.dataset.load() line 132: Dataset successfully loaded: 's3://timdex-extract-dev-222053980223/dataset/', 0.36s
2025-01-10 09:43:16,712 INFO tim.opensearch.bulk_index(): All records ingested, refreshing index.
2025-01-10 09:43:17,009 ERROR tim.opensearch.bulk_delete(): Record to delete 'gisogm:edu.wisc:94dfabb705ae' was not found in index 'gisogm-2025-01-10t14-41-50'.
2025-01-10 09:43:17,009 ERROR tim.opensearch.bulk_delete(): Record to delete 'gisogm:edu.wisc:46f062d861b9' was not found in index 'gisogm-2025-01-10t14-41-50'.
2025-01-10 09:43:17,009 ERROR tim.opensearch.bulk_delete(): Record to delete 'gisogm:edu.wisc:a712ff78f94f' was not found in index 'gisogm-2025-01-10t14-41-50'.
2025-01-10 09:43:17,009 ERROR tim.opensearch.bulk_delete(): Record to delete 'gisogm:edu.wisc:1e64bf30243f' was not found in index 'gisogm-2025-01-10t14-41-50'.
2025-01-10 09:43:17,010 ERROR tim.opensearch.bulk_delete(): Record to delete 'gisogm:edu.wisc:4685a8bda151' was not found in index 'gisogm-2025-01-10t14-41-50'.
.
.
.
2025-01-10 09:43:17,012 ERROR tim.opensearch.bulk_delete(): Record to delete 'gisogm:edu.wisc:b315beb82615' was not found in index 'gisogm-2025-01-10t14-41-50'.
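The behavior shown in the log above — deletes for records absent from the index are reported as errors but do not abort the run — can be sketched like this (a simplified dict-backed stand-in, not TIM's actual bulk_delete):

```python
def bulk_delete(index, record_ids):
    """Delete record ids from a dict-backed stand-in index, tolerating misses."""
    deleted = errors = 0
    for record_id in record_ids:
        if record_id in index:
            del index[record_id]
            deleted += 1
        else:
            # Mirrors the ERROR log lines: report the miss and keep going.
            print(f"Record to delete '{record_id}' was not found in index.")
            errors += 1
    return {"deleted": deleted, "errors": errors, "total": deleted + errors}


index = {"gisogm:edu.wisc:94dfabb705ae": {"title": "example"}}
result = bulk_delete(
    index,
    ["gisogm:edu.wisc:94dfabb705ae", "gisogm:edu.wisc:46f062d861b9"],
)
print(result)  # {'deleted': 1, 'errors': 1, 'total': 2}
```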
  4. bulk-update result in OpenSearch
    Number of hits / record counts matches expected number of records to index: 103.
  5. bulk-update command for libguides:
```shell
pipenv run tim bulk-update \
-s libguides \
--run-date="2024-12-01" \
--run-id="90d6f2e1-8233-4471-9d4c-59830effda23" \
s3://timdex-extract-dev-222053980223/dataset/
```
  6. bulk-update log output for libguides:
Loading .env environment variables...
Courtesy Notice: Pipenv found itself running within a virtual environment, so it will automatically use that environment, instead of creating its own for any project. You can set PIPENV_IGNORE_VIRTUALENVS=1 to force pipenv to ignore that environment and create its own instead. You can set PIPENV_VERBOSITY=-1 to suppress this warning.
2025-01-10 09:56:49,691 INFO tim.cli.main(): Logger 'root' configured with level=INFO
2025-01-10 09:56:49,691 INFO tim.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
2025-01-10 09:56:49,691 INFO tim.opensearch.configure_opensearch_client(): OpenSearch request configurations: {'OPENSEARCH_BULK_MAX_CHUNK_BYTES': 104857600, 'OPENSEARCH_BULK_MAX_RETRIES': 50, 'OPENSEARCH_REQUEST_TIMEOUT': 120}
2025-01-10 09:56:49,692 INFO tim.cli.main(): OpenSearch client configured for endpoint 'localhost'
2025-01-10 09:56:49,701 INFO tim.cli.bulk_update(): Bulk updating records from dataset 's3://timdex-extract-dev-222053980223/dataset/' into 'libguides-2025-01-10t14-54-23'
2025-01-10 09:56:50,036 INFO timdex_dataset_api.dataset.load() line 132: Dataset successfully loaded: 's3://timdex-extract-dev-222053980223/dataset/', 0.32s
2025-01-10 09:56:50,263 INFO tim.opensearch.bulk_index(): All records ingested, refreshing index.
2025-01-10 09:56:50,286 INFO tim.opensearch.bulk_delete(): All records deleted, refreshing index.
2025-01-10 09:56:50,289 INFO tim.cli.bulk_update(): Bulk update complete: {"index": {"created": 34, "updated": 0, "errors": 0, "total": 34}, "delete": {"deleted": 0, "errors": 0, "total": 0}}
2025-01-10 09:56:50,290 INFO tim.cli.log_process_time(): Total time to complete process: 0:00:00.599161
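The "Bulk update complete" line in the log is a JSON rendering of the per-action result counts. Assembling one looks roughly like this (illustrative, not TIM's actual code):

```python
import json

# Per-action result counts like those produced by the bulk helpers.
index_results = {"created": 34, "updated": 0, "errors": 0, "total": 34}
delete_results = {"deleted": 0, "errors": 0, "total": 0}

summary = json.dumps({"index": index_results, "delete": delete_results})
print(f"Bulk update complete: {summary}")
```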
  7. bulk-update result in OpenSearch
    Number of hits / record counts matches expected number of records to index: 34.

Includes new or updated dependencies?

YES - Adds the timdex-dataset-api library as a dependency (to read records from the TIMDEX parquet dataset).

Changes expectations for external applications?

YES - The new TIM CLI command bulk-update is configured to work with v2 TIMDEX parquet dataset(s). TIM remains backwards compatible with v1.

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples have been verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* The timdex-index-manager (TIM) needs to support the v2 parquet dataset,
which now contains records for both indexing and deleting. The new CLI
command performs a "bulk update" given a subset of the dataset
(filtered by 'run_date' and 'run_id') and uses the timdex-dataset-api
library to read records from the TIMDEXDataset.

Introducing a new CLI command avoids the feature-flagging approach,
allowing the existing CLI commands and helper functions to remain
untouched for v1 purposes.

How this addresses that need:
* Implement 'bulk-update' CLI command
* Add unit tests for 'bulk-update'
* Move v1 CLI unit tests to 'test_cli_v1'
* Update README

Side effects of this change:
* TIM remains backwards compatible with v1 but now also supports v2 runs.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-428
@jonavellecuerdo jonavellecuerdo self-assigned this Jan 10, 2025
@jonavellecuerdo jonavellecuerdo marked this pull request as draft January 10, 2025 16:25
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review January 10, 2025 16:56
@ghukill ghukill left a comment

Looking real good! My thoughts/comments are somewhat minimal here given some good off-PR discussions.

I was able to run OpenSearch locally and perform the example bulk-update runs outlined in the PR.

I'm requesting changes related to test_cli.py and test_cli_v1.py.

I wonder if test_cli_v1.py should only include tests related to bulk-index and bulk-delete (these tests)?

It feels like what the other tests exercise, even given these "v2" (parquet) updates, remains unchanged, so they could stay in test_cli.py. The approach of the _v1 test files is that if and when we go fully "v2", we could remove that file altogether. I don't think that's the case as it stands here.

This application is slightly different, as we are just adding functionality: a new CLI command, bulk-update. Technically, bulk-index and bulk-delete would still function, but they expect "v1" inputs.

This is really nice work, and I think any fuzziness here can be attributed to the somewhat fuzzy nature of the feature flags. They are admittedly implemented a little differently in each component touched (Transmog, lambdas, etc.).

An alternative could be to:

  • add feature flagging branching logic to bulk-index and bulk-delete
  • have those CLI commands support both v1 and v2 inputs
  • add tests as needed

My recommendation would be that we don't go this route. TIM remains v1 and v2 compliant, because the v1 and v2 StepFunction calls will use different CLI commands (bulk-index + bulk-delete for v1, bulk-update for v2).

Taking all this together... I wonder if we can just have a single test_cli.py file, with the understanding that bulk-index and bulk-delete -- both CLI code and tests -- are "v1" for the time being. If we ever do want v2 compliant bulk-index and bulk-delete, then we'd need some additional work to update the code + tests.

* Consolidate CLI tests for both ETL v1 and v2
@coveralls

coveralls commented Jan 13, 2025

Pull Request Test Coverage Report for Build 12932609744

Details

  • 27 of 27 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 11519649883: 0.0%
Covered Lines: 435
Relevant Lines: 435

💛 - Coveralls

@ghukill ghukill left a comment

Looks great! Nice work.

@ghukill ghukill left a comment

Reverting "Approve" to "Request Changes".

After a test run in Dev1 (now possible), I noticed that timdex-dataset-api was not installed. I think we need to move it from a "dev" dependency to a regular dependency.

ghukill commented Jan 13, 2025

@jonavellecuerdo, thrilled to report that with this PR (and the requested dependency change), we have the first successful v2 end-to-end run in Dev!

https://us-east-1.console.aws.amazon.com/states/home?region=us-east-1#/v2/executions/details/arn:aws:states:us-east-1:222053980223:execution:timdex-ingest-v2-dev:095c98e9-dcef-4db4-a4c0-d7b0a4e2a6f7


* Include TDA in main list of dependencies
@ehanson8 ehanson8 left a comment

One minor docstring suggestion but otherwise looks great!


```shell
# If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
SENTRY_DSN=
```

Much cleaner, great change!

@jonavellecuerdo jonavellecuerdo merged commit 37c30fe into main Jan 23, 2025
3 checks passed
@jonavellecuerdo jonavellecuerdo deleted the TIMX-428-add-bulk-update-cli-cmd branch January 23, 2025 15:37