-
Notifications
You must be signed in to change notification settings - Fork 0
Add bulk-update command and support for TIMDEX parquet dataset #359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Why these changes are being introduced: * The timdex-index-manager (TIM) needs to support the v2 parquet dataset, which now contains records for both indexing and deleting. The new CLI command performs a "bulk update" given a subset of the dataset (filtered by 'run_date' and 'run_id') and uses the timdex-dataset-api library to read records from the TIMDEXDataset. By introducing a new CLI command, it doesn't require the feature flagging approach, allowing the existing CLI commands and helper functions to remain untouched for v1 purposes. How this addresses that need: * Implement 'bulk-update' CLI command * Add unit tests for 'bulk-update' * Move v1 CLI unit tests to 'test_cli_v1' * Update README Side effects of this change: * TIM remains backwards v1 compatible but will now support v2 runs. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-428
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking real good! My thoughts/comments are somewhat minimal here given some good off-PR discussions.
I was able to run Opensearch locally and perform the examples bulk-update
runs outlined in the PR.
I'm requesting changes related to test_cli.py
and test_cli_v1.py
.
I wonder if test_cli_v1.py
should only include tests related to bulk-index
and bulk-delete
(these tests)?
It feels like what the other tests test, even given these "v2" (parquet) updates, remain unchanged and could therefore remain in test_cli.py
. The approach of the _v1
test files is that if and when we go fully "v2", we could remove that file altogether. I don't think that's the case as it stands here.
This application is slightly different, as we are just adding functionality; a new CLI command bulk-update
. Technically, bulk-index
and bulk-delete
would still function, but they are expecting "v1" inputs.
This is really nice work, and I think any fuzziness here can be attributed to the somewhat fuzzy nature of the feature flags. They are admiteddly implemented a little different in each component touched (Transmog, lambdas, etc.).
An alternative could be to:
- add feature flagging branching logic to
bulk-index
andbulk-delete
- have those CLI commands support both v1 and v2 inputs
- add tests as needed
My recommendation would be that we don't go this route. TIM remains v1 and v2 compliant, because the v1 and v2 StepFunction calls will use different CLI commands (bulk-index
+ bulk-delete
for v1, bulk-update
for v2).
Taking all this together... I wonder if we can just have a single test_cli.py
file, with the understanding that bulk-index
and bulk-delete
-- both CLI code and tests -- are "v1" for the time being. If we ever do want v2 compliant bulk-index
and bulk-delete
, then we'd need some additional work to update the code + tests.
* Consolidate CLI tests for both ETL v1 and v2
Pull Request Test Coverage Report for Build 12932609744Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Nice work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverting "Approve" to "Request Changes".
After a test run in Dev1 (now possible), noticed that timdex-dataset-api
was not installed. Think we need to move from a "dev" dependency into a normal one.
@jonavellecuerdo, thrilled to report with this PR (and the requested dependencies change), the first v2 successful end-to-end run in Dev! ![]() |
* Include TDA to main list of dependencies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor docstring suggestion but otherwise looks great!
|
||
# If set to a valid Sentry DSN, enables Sentry exception monitoring This is not needed for local development. | ||
SENTRY_DSN= | ||
```shell |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much cleaner, great change!
Purpose and background context
A new CLI command bulk-update is added to TIM that expects run_date and run_id to get a subset of the v2 TIMDEX parquet dataset, where it then performs bulk indexing and deleting based on records with action="index" and action="delete" respectively.
How can a reviewer manually see the effects of these changes?
A. Review added unit tests.
Note: Similar to what was done for
timdex-pipeline-lambdas
, I moved the existing V1 unit tests for CLI commands intotest_cli_v1.py
. V1 tests remain unchanged. The two new unit tests for thebulk-update
command are intest_cli.py
. For V2 unit tests, I used patch decorators to replace function calls with mock objects.@ghukill and I took a look at the recorded
vcr
cassettes and realized that the unit tests will need to be updated at some point (either update thevcr
cassettes or refactor unit tests), but this is outside the scope of this ticket. Additional testing was done via a local OpenSearch instance to verify functionality.B. Screenshots and logs from testing with an [OpenSearch instance running locally with Docker](https://mitlibraries.atlassian.net/browse/TIMX-428) have been provided.
Prerequisite: A TIMDEX Dataset was created in Dev1 containing records from a sample of extract files for
libguides
andgisogm
(thanks, @ghukill 😄). Here are some details regarding the files:libguides
andgisogm
.bulk-update
command forgisogm
:bulk-update
log output [REDACTED] forgisogm
:Note: Errors are thrown because the "records to delete" are not in the local OpenSearch instance. This is okay!
bulk-update
result in OpenSearchNumber of hits / record counts matches expected number of records to index: 103.
bulk-update
command forlibguides
:bulk-update
log output forlibguides
:bulk-update
result in OpenSearchNumber of hits / record counts matches expected number of records to index: 34.
Includes new or updated dependencies?
YES - Add
timdex-dataset-api
library as dependency (to read records from TIMDEX parquet dataset).Changes expectations for external applications?
YES - The new TIM CLI command
bulk-update
is configured to work with v2 TIMDEX parquet dataset(s). TIM remains backwards v1 compatible.What are the relevant tickets?
Developer
Code Reviewer(s)