Add bulk-update command and support for TIMDEX parquet dataset #359


Merged
merged 4 commits into main on Jan 23, 2025

Conversation

@jonavellecuerdo jonavellecuerdo commented Jan 10, 2025

Purpose and background context

This PR adds a new CLI command, bulk-update, to TIM. Given run_date and run_id arguments, it selects the matching subset of the v2 TIMDEX parquet dataset, then performs bulk indexing and bulk deleting for records with action="index" and action="delete", respectively.
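At a high level, the command routes each record by its action field to the appropriate bulk operation. The sketch below is a minimal stdlib-only illustration of that partitioning step, not TIM's actual implementation; the Record class and field names are stand-ins:

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Illustrative stand-in for a record read from the parquet dataset."""

    timdex_record_id: str
    action: str  # "index" or "delete"


def partition_by_action(records):
    """Split records into those to bulk index and those to bulk delete."""
    to_index, to_delete = [], []
    for record in records:
        if record.action == "index":
            to_index.append(record)
        elif record.action == "delete":
            to_delete.append(record)
    return to_index, to_delete


records = [
    Record("libguides:guides-1", "index"),
    Record("gisogm:edu.wisc:94dfabb705ae", "delete"),
    Record("libguides:guides-2", "index"),
]
to_index, to_delete = partition_by_action(records)
print(len(to_index), len(to_delete))  # 2 1
```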

How can a reviewer manually see the effects of these changes?

A. Review added unit tests.
Note: Similar to what was done for timdex-pipeline-lambdas, I moved the existing V1 unit tests for CLI commands into test_cli_v1.py. V1 tests remain unchanged. The two new unit tests for the bulk-update command are in test_cli.py. For V2 unit tests, I used patch decorators to replace function calls with mock objects.
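The patch-decorator approach described above can be illustrated with a stdlib-only sketch. The module, function names, and signatures here are hypothetical stand-ins for TIM internals, not the real code:

```python
from types import SimpleNamespace
from unittest.mock import MagicMock, patch

# Hypothetical stand-in for the module whose calls the V2 CLI tests replace.
opensearch = SimpleNamespace(
    bulk_index=lambda client, index, records: {"created": len(records)},
    bulk_delete=lambda client, index, records: {"deleted": len(records)},
)


def bulk_update(client, index, to_index, to_delete):
    """Hypothetical command body that delegates to the helpers above."""
    return {
        "index": opensearch.bulk_index(client, index, to_index),
        "delete": opensearch.bulk_delete(client, index, to_delete),
    }


# Replace the helpers with mocks so the command logic runs without OpenSearch.
with patch.object(opensearch, "bulk_index", MagicMock(return_value={"created": 2})) as mock_index, \
     patch.object(opensearch, "bulk_delete", MagicMock(return_value={"deleted": 1})) as mock_delete:
    summary = bulk_update(None, "gisogm-test", ["r1", "r2"], ["r3"])

print(summary)  # {'index': {'created': 2}, 'delete': {'deleted': 1}}
```

The real tests use patch decorators on the actual module paths; the mechanics of substituting a mock and asserting on the calls are the same.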

@ghukill and I took a look at the recorded vcr cassettes and realized that the unit tests will need to be updated at some point (either update the vcr cassettes or refactor unit tests), but this is outside the scope of this ticket. Additional testing was done via a local OpenSearch instance to verify functionality.

B. Screenshots and logs from testing with an [OpenSearch instance running locally with Docker](https://mitlibraries.atlassian.net/browse/TIMX-428) have been provided.

Prerequisite: A TIMDEX Dataset was created in Dev1 containing records from a sample of extract files for libguides and gisogm (thanks, @ghukill 😄). Here are some details regarding the files:

```shell
# Dev1 Dataset
location = s3://timdex-extract-dev-222053980223/dataset/

# "libguides" daily run with 33 records to index
run_date = 2024-12-01
run_id = 90d6f2e1-8233-4471-9d4c-59830effda23

# "gisogm" daily run with 103 records to index, 83 to delete
run_date = 2024-05-02
run_id = cc4339a5-00de-4717-a4b1-43be292f61f7
```
  1. Two indexes were created and promoted for libguides and gisogm.
  2. bulk-update command for gisogm:
```shell
pipenv run tim bulk-update \
-s gisogm \
--run-date="2024-05-02" \
--run-id="cc4339a5-00de-4717-a4b1-43be292f61f7" \
s3://timdex-extract-dev-222053980223/dataset/
```
  3. bulk-update log output [REDACTED] for gisogm:
    Note: Errors are thrown because the "records to delete" are not in the local OpenSearch instance. This is okay!
Loading .env environment variables...
2025-01-10 09:43:15,858 INFO tim.cli.main(): Logger 'root' configured with level=INFO
2025-01-10 09:43:15,858 INFO tim.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
2025-01-10 09:43:15,858 INFO tim.opensearch.configure_opensearch_client(): OpenSearch request configurations: {'OPENSEARCH_BULK_MAX_CHUNK_BYTES': 104857600, 'OPENSEARCH_BULK_MAX_RETRIES': 50, 'OPENSEARCH_REQUEST_TIMEOUT': 120}
2025-01-10 09:43:15,859 INFO tim.cli.main(): OpenSearch client configured for endpoint 'localhost'
2025-01-10 09:43:15,870 INFO tim.cli.bulk_update(): Bulk updating records from dataset 's3://timdex-extract-dev-222053980223/dataset/' into 'gisogm-2025-01-10t14-41-50'
2025-01-10 09:43:16,261 INFO timdex_dataset_api.dataset.load() line 132: Dataset successfully loaded: 's3://timdex-extract-dev-222053980223/dataset/', 0.36s
2025-01-10 09:43:16,712 INFO tim.opensearch.bulk_index(): All records ingested, refreshing index.
2025-01-10 09:43:17,009 ERROR tim.opensearch.bulk_delete(): Record to delete 'gisogm:edu.wisc:94dfabb705ae' was not found in index 'gisogm-2025-01-10t14-41-50'.
2025-01-10 09:43:17,009 ERROR tim.opensearch.bulk_delete(): Record to delete 'gisogm:edu.wisc:46f062d861b9' was not found in index 'gisogm-2025-01-10t14-41-50'.
2025-01-10 09:43:17,009 ERROR tim.opensearch.bulk_delete(): Record to delete 'gisogm:edu.wisc:a712ff78f94f' was not found in index 'gisogm-2025-01-10t14-41-50'.
2025-01-10 09:43:17,009 ERROR tim.opensearch.bulk_delete(): Record to delete 'gisogm:edu.wisc:1e64bf30243f' was not found in index 'gisogm-2025-01-10t14-41-50'.
2025-01-10 09:43:17,010 ERROR tim.opensearch.bulk_delete(): Record to delete 'gisogm:edu.wisc:4685a8bda151' was not found in index 'gisogm-2025-01-10t14-41-50'.
.
.
.
2025-01-10 09:43:17,012 ERROR tim.opensearch.bulk_delete(): Record to delete 'gisogm:edu.wisc:b315beb82615' was not found in index 'gisogm-2025-01-10t14-41-50'.
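The behavior shown in the log above — deletes for records absent from the index are reported as errors but do not abort the run — can be sketched like this (a simplified dict-backed stand-in, not TIM's actual bulk_delete):

```python
def bulk_delete(index, record_ids):
    """Delete record ids from a dict-backed stand-in index, tolerating misses."""
    deleted = errors = 0
    for record_id in record_ids:
        if record_id in index:
            del index[record_id]
            deleted += 1
        else:
            # Mirrors the ERROR log lines: report the miss and keep going.
            print(f"Record to delete '{record_id}' was not found in index.")
            errors += 1
    return {"deleted": deleted, "errors": errors, "total": deleted + errors}


index = {"gisogm:edu.wisc:94dfabb705ae": {"title": "example"}}
result = bulk_delete(
    index,
    ["gisogm:edu.wisc:94dfabb705ae", "gisogm:edu.wisc:46f062d861b9"],
)
print(result)  # {'deleted': 1, 'errors': 1, 'total': 2}
```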
  4. bulk-update result in OpenSearch
    Number of hits / record counts matches expected number of records to index: 103.
  5. bulk-update command for libguides:
```shell
pipenv run tim bulk-update \
-s libguides \
--run-date="2024-12-01" \
--run-id="90d6f2e1-8233-4471-9d4c-59830effda23" \
s3://timdex-extract-dev-222053980223/dataset/
```
  6. bulk-update log output for libguides:
Loading .env environment variables...
Courtesy Notice: Pipenv found itself running within a virtual environment, so it will automatically use that environment, instead of creating its own for any project. You can set PIPENV_IGNORE_VIRTUALENVS=1 to force pipenv to ignore that environment and create its own instead. You can set PIPENV_VERBOSITY=-1 to suppress this warning.
2025-01-10 09:56:49,691 INFO tim.cli.main(): Logger 'root' configured with level=INFO
2025-01-10 09:56:49,691 INFO tim.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
2025-01-10 09:56:49,691 INFO tim.opensearch.configure_opensearch_client(): OpenSearch request configurations: {'OPENSEARCH_BULK_MAX_CHUNK_BYTES': 104857600, 'OPENSEARCH_BULK_MAX_RETRIES': 50, 'OPENSEARCH_REQUEST_TIMEOUT': 120}
2025-01-10 09:56:49,692 INFO tim.cli.main(): OpenSearch client configured for endpoint 'localhost'
2025-01-10 09:56:49,701 INFO tim.cli.bulk_update(): Bulk updating records from dataset 's3://timdex-extract-dev-222053980223/dataset/' into 'libguides-2025-01-10t14-54-23'
2025-01-10 09:56:50,036 INFO timdex_dataset_api.dataset.load() line 132: Dataset successfully loaded: 's3://timdex-extract-dev-222053980223/dataset/', 0.32s
2025-01-10 09:56:50,263 INFO tim.opensearch.bulk_index(): All records ingested, refreshing index.
2025-01-10 09:56:50,286 INFO tim.opensearch.bulk_delete(): All records deleted, refreshing index.
2025-01-10 09:56:50,289 INFO tim.cli.bulk_update(): Bulk update complete: {"index": {"created": 34, "updated": 0, "errors": 0, "total": 34}, "delete": {"deleted": 0, "errors": 0, "total": 0}}
2025-01-10 09:56:50,290 INFO tim.cli.log_process_time(): Total time to complete process: 0:00:00.599161
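The "Bulk update complete" line in the log is a JSON rendering of the per-action result counts. Assembling one looks roughly like this (illustrative, not TIM's actual code):

```python
import json

# Per-action result counts like those produced by the bulk helpers.
index_results = {"created": 34, "updated": 0, "errors": 0, "total": 34}
delete_results = {"deleted": 0, "errors": 0, "total": 0}

summary = json.dumps({"index": index_results, "delete": delete_results})
print(f"Bulk update complete: {summary}")
```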
  7. bulk-update result in OpenSearch
    Number of hits / record counts matches expected number of records to index: 34.

Includes new or updated dependencies?

YES - Adds the timdex-dataset-api library as a dependency (to read records from the TIMDEX parquet dataset).

Changes expectations for external applications?

YES - The new TIM CLI command bulk-update is configured to work with v2 TIMDEX parquet dataset(s). TIM remains backwards compatible with v1.

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples have been verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* The timdex-index-manager (TIM) needs to support the v2 parquet dataset,
which now contains records for both indexing and deleting. The new CLI
command performs a "bulk update" given a subset of the dataset
(filtered by 'run_date' and 'run_id') and uses the timdex-dataset-api
library to read records from the TIMDEXDataset.

Introducing a new CLI command avoids the feature-flagging approach,
allowing the existing CLI commands and helper functions to remain
untouched for v1 purposes.

How this addresses that need:
* Implement 'bulk-update' CLI command
* Add unit tests for 'bulk-update'
* Move v1 CLI unit tests to 'test_cli_v1'
* Update README

Side effects of this change:
* TIM remains backwards compatible with v1 but now also supports v2 runs.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-428
@jonavellecuerdo jonavellecuerdo self-assigned this Jan 10, 2025
@jonavellecuerdo jonavellecuerdo marked this pull request as draft January 10, 2025 16:25
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review January 10, 2025 16:56
@ghukill ghukill left a comment

Looking real good! My thoughts/comments are somewhat minimal here given some good off-PR discussions.

I was able to run OpenSearch locally and perform the example bulk-update runs outlined in the PR.

I'm requesting changes related to test_cli.py and test_cli_v1.py.

I wonder if test_cli_v1.py should only include tests related to bulk-index and bulk-delete (these tests)?

It feels like what the other tests exercise, even given these "v2" (parquet) updates, remains unchanged, so they could stay in test_cli.py. The approach of the _v1 test files is that if and when we go fully "v2", we could remove that file altogether. I don't think that's the case as it stands here.

This application is slightly different, as we are just adding functionality: a new CLI command, bulk-update. Technically, bulk-index and bulk-delete would still function, but they expect "v1" inputs.

This is really nice work, and I think any fuzziness here can be attributed to the somewhat fuzzy nature of the feature flags. They are admittedly implemented a little differently in each component touched (Transmog, lambdas, etc.).

An alternative could be to:

  • add feature flagging branching logic to bulk-index and bulk-delete
  • have those CLI commands support both v1 and v2 inputs
  • add tests as needed

My recommendation would be that we don't go this route. TIM remains v1 and v2 compliant, because the v1 and v2 StepFunction calls will use different CLI commands (bulk-index + bulk-delete for v1, bulk-update for v2).

Taking all this together... I wonder if we can just have a single test_cli.py file, with the understanding that bulk-index and bulk-delete -- both CLI code and tests -- are "v1" for the time being. If we ever do want v2 compliant bulk-index and bulk-delete, then we'd need some additional work to update the code + tests.

* Consolidate CLI tests for both ETL v1 and v2
@coveralls

coveralls commented Jan 13, 2025

Pull Request Test Coverage Report for Build 12932609744

Details

  • 27 of 27 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 11519649883: 0.0%
Covered Lines: 435
Relevant Lines: 435

💛 - Coveralls

@ghukill ghukill left a comment

Looks great! Nice work.

@ghukill ghukill left a comment

Reverting "Approve" to "Request Changes".

After a test run in Dev1 (now possible), I noticed that timdex-dataset-api was not installed. I think we need to move it from a "dev" dependency to a regular dependency.

ghukill commented Jan 13, 2025

@jonavellecuerdo, thrilled to report that with this PR (and the requested dependency change), we have the first successful v2 end-to-end run in Dev!

https://us-east-1.console.aws.amazon.com/states/home?region=us-east-1#/v2/executions/details/arn:aws:states:us-east-1:222053980223:execution:timdex-ingest-v2-dev:095c98e9-dcef-4db4-a4c0-d7b0a4e2a6f7


* Include TDA in main list of dependencies
@ehanson8 ehanson8 left a comment

One minor docstring suggestion but otherwise looks great!


```shell
# If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
SENTRY_DSN=
```

Much cleaner, great change!

@jonavellecuerdo jonavellecuerdo merged commit 37c30fe into main Jan 23, 2025
3 checks passed
@jonavellecuerdo jonavellecuerdo deleted the TIMX-428-add-bulk-update-cli-cmd branch January 23, 2025 15:37