Commit c0ce8ba

Merge pull request #365 from MITLibraries/TIMX-495-full-reindex-cli-command
TIMX 495 - add new reindex-source CLI command
2 parents cdd42f8 + 75412fc commit c0ce8ba

3 files changed, +182 -58 lines changed

README.md

Lines changed: 71 additions & 57 deletions
@@ -18,38 +18,48 @@ TIMDEX! Index Manager (TIM) is a Python CLI application for managing TIMDEX indi
 
 1. Run the following command:
 
-``` bash
-docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" \
--e "plugins.security.disabled=true" \
-opensearchproject/opensearch:2.11.1
-```
+``` bash
+docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" \
+-e "plugins.security.disabled=true" \
+opensearchproject/opensearch:2.11.1
+```
 
 2. To confirm the instance is up, run `pipenv run tim -u localhost ping` or visit http://localhost:9200/. This should produce a log that looks like the following:
-```
-2024-02-08 13:22:16,826 INFO tim.cli.main(): OpenSearch client configured for endpoint 'localhost'
+
+```text
+2024-02-08 13:22:16,826 INFO tim.cli.main(): OpenSearch client configured for endpoint 'localhost'
 
-Name: docker-cluster
-UUID: RVCmwQ_LQEuh1GrtwGnRMw
-OpenSearch version: 2.11.1
-Lucene version: 9.7.0
+Name: docker-cluster
+UUID: RVCmwQ_LQEuh1GrtwGnRMw
+OpenSearch version: 2.11.1
+Lucene version: 9.7.0
 
-2024-02-08 13:22:16,930 INFO tim.cli.log_process_time(): Total time to complete process: 0:00:00.105506
-```
+2024-02-08 13:22:16,930 INFO tim.cli.log_process_time(): Total time to complete process: 0:00:00.105506
+```
 
 ### Running Opensearch and OpenSearch Dashboards locally with Docker
 
 You can use the included Docker Compose file ([compose.yaml](compose.yaml)) to start an OpenSearch instance along with OpenSearch Dashboards, "[the user interface that lets you visualize your Opensearch data and run and scale your OpenSearch clusters](https://opensearch.org/docs/latest/dashboards/)". Two tools that are useful for exploring indices are [DevTools](https://opensearch.org/docs/latest/dashboards/dev-tools/index-dev/) and [Discover](https://opensearch.org/docs/latest/dashboards/discover/index-discover/).
 
 **Note:** To use Discover, you'll need to create an index pattern. When creating the index pattern, decline the option to set a date field. When set, it detects a date field in our indices but then crashes trying to use it. When prompted, enter an index or alias to pull patterns from, and it will automatically be configured to work well enough for initial data exploration.
 
+First, ensure the following environment variables are set:
+
+0. First, set some environment variables:
+
+```shell
+OPENSEARCH_INITIAL_ADMIN_PASSWORD=SuperSecret42!
+```
+
 1. Run the following command:
-```bash
-docker pull opensearchproject/opensearch:latest
-docker pull opensearchproject/opensearch-dashboards:latest
-docker compose up
-```
 
-2. To confirm the instance is up, run `pipenv run tim -u localhost ping` or visit http://localhost:9200/.
+```shell
+docker pull opensearchproject/opensearch:latest
+docker pull opensearchproject/opensearch-dashboards:latest
+docker compose up
+```
+
+2. To confirm the instance is up, run `pipenv run tim ping` or visit http://localhost:9200/.
 
 3. Access OpenSearch Dashboards through <http://localhost:5601>.
 
@@ -60,25 +70,28 @@ For a more detailed example with test data, please refer to the Confluence docum
 1. Follow the instructions in either [Running Opensearch locally with Docker](#running-opensearch-locally-with-docker) or [Running Opensearch and OpenSearch Dashboards locally with Docker](#running-opensearch-and-opensearch-dashboards-locally-with-docker).
 
 2. Open a new terminal, and create a new index. Copy the name of the created index printed to the terminal's output.
-```
-pipenv run tim create -s <source-name>
-```
+
+```shell
+pipenv run tim create -s <source-name>
+```
 
 3. Copy the index name and promote the index to the alias.
 
-```
-pipenv run tim promote -a <source-name> -i <index-name>
-```
+```shell
+pipenv run tim promote -a <source-name> -i <index-name>
+```
 
 4. Bulk index records from a specified directory (e.g., including S3).
-```
-pipenv run tim bulk-index -s <source-name> <filepath-to-records>
-```
+
+```shell
+pipenv run tim bulk-index -s <source-name> <filepath-to-records>
+```
 
 5. After verifying that the bulk-index was successful, clean up your local OpenSearch instance by deleting the index.
-```
-pipenv run tim delete -i <index-name>
-```
+
+```shell
+pipenv run tim delete -i <index-name>
+```
 
 ### Running OpenSearch on AWS
 
@@ -115,31 +128,32 @@ SENTRY_DSN=### If set to a valid Sentry DSN, enables Sentry exception monitoring
 All CLI commands can be run with `pipenv run`.
 
 ```
-Usage: tim [OPTIONS] COMMAND [ARGS]...
-
-TIM provides commands for interacting with OpenSearch indexes.
-For more details on a specific command, run tim COMMAND -h.
-
-╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
-│ --url -u TEXT The OpenSearch instance endpoint minus the http scheme, e.g. │
-'search-timdex-env-1234567890.us-east-1.es.amazonaws.com'. If not provided, will attempt to get from the │
-│ TIMDEX_OPENSEARCH_ENDPOINT environment variable. Defaults to 'localhost'. │
-│ --verbose -v Pass to log at debug level instead of info │
-│ --help -h Show this message and exit. │
-╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
-╭─ Get cluster-level information ────────────────────────────────────────────────────────────────────────────────────────────────╮
-│ ping Ping OpenSearch and display information about the cluster. │
-│ indexes Display summary information about all indexes in the cluster. │
-│ aliases List OpenSearch aliases and their associated indexes. │
-╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
-╭─ Index management commands ────────────────────────────────────────────────────────────────────────────────────────────────────╮
-│ create Create a new index in the cluster. │
-│ delete Delete an index. │
-│ promote Promote index as the primary alias and add it to any additional provided aliases. │
-│ demote Demote an index from all its associated aliases. │
-╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
-╭─ Bulk record processing commands ──────────────────────────────────────────────────────────────────────────────────────────────╮
-│ bulk-update Bulk update records for an index. │
-╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
+Usage: tim [OPTIONS] COMMAND [ARGS]...
+
+TIM provides commands for interacting with OpenSearch indexes.
+For more details on a specific command, run tim COMMAND -h.
+
+╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
+│ --url -u TEXT The OpenSearch instance endpoint minus the http scheme, e.g. │
+│ 'search-timdex-env-1234567890.us-east-1.es.amazonaws.com'. If not provided, will attempt to get │
+│ from the TIMDEX_OPENSEARCH_ENDPOINT environment variable. Defaults to 'localhost'. │
+│ --verbose -v Pass to log at debug level instead of info │
+│ --help -h Show this message and exit. │
+╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
+╭─ Get cluster-level information ───────────────────────────────────────────────────────────────────────────────────────╮
+│ ping Ping OpenSearch and display information about the cluster. │
+│ indexes Display summary information about all indexes in the cluster. │
+│ aliases List OpenSearch aliases and their associated indexes. │
+╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
+╭─ Index management commands ───────────────────────────────────────────────────────────────────────────────────────────╮
+│ create Create a new index in the cluster. │
+│ delete Delete an index. │
+│ promote Promote index as the primary alias and add it to any additional provided aliases. │
+│ demote Demote an index from all its associated aliases. │
+╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
+╭─ Bulk record processing commands ─────────────────────────────────────────────────────────────────────────────────────╮
+│ bulk-update Bulk update records for an index. │
+│ reindex-source Perform a full refresh for a source in Opensearch for all current records. │
+╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
 ```
 
tests/test_cli.py

Lines changed: 41 additions & 0 deletions
@@ -278,3 +278,44 @@ def test_bulk_update_with_source_raise_bulk_indexing_error(
         f'{{"index": {json.dumps(index_results_default)}, '
         f'"delete": {json.dumps(mock_bulk_delete())}}}' in caplog.text
     )
+
+
+@patch("tim.opensearch.create_index")
+@patch("tim.opensearch.promote_index")
+@patch("tim.opensearch.get_index_aliases")
+@patch("timdex_dataset_api.dataset.TIMDEXDataset.load")
+@patch("tim.opensearch.bulk_index")
+def test_reindex_source_success(
+    mock_bulk_index,
+    mock_timdex_dataset,
+    mock_get_index_aliases,
+    mock_promote_index,
+    mock_create_index,
+    caplog,
+    monkeypatch,
+    runner,
+):
+    monkeypatch.delenv("TIMDEX_OPENSEARCH_ENDPOINT", raising=False)
+    mock_get_index_aliases.return_value = ["alma", "all-current", "timdex"]
+    mock_bulk_index.return_value = {
+        "created": 1000,
+        "updated": 0,
+        "errors": 0,
+        "total": 1000,
+    }
+    mock_timdex_dataset.return_value = MagicMock()
+
+    result = runner.invoke(
+        main,
+        [
+            "reindex-source",
+            "--source",
+            "alma",
+            "s3://test-timdex-bucket/dataset",
+        ],
+    )
+    assert result.exit_code == EXIT_CODES["success"]
+    assert (
+        "Reindex source complete: "
+        f'{{"index": {json.dumps(mock_bulk_index())}' in caplog.text
+    )
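
The mock arguments in `test_reindex_source_success` are listed in the reverse order of the `@patch` decorators, because stacked `unittest.mock.patch` decorators are applied bottom-up: the decorator nearest the function supplies the first mock argument. A minimal sketch of that rule follows. The patched targets (`os.getpid`, `os.getcwd`) and the function name `call_patched` are illustrative assumptions and have nothing to do with TIM.

```python
import os
from unittest.mock import patch


@patch("os.getcwd")  # outermost decorator: supplies the LAST mock argument
@patch("os.getpid")  # innermost decorator: supplies the FIRST mock argument
def call_patched(mock_getpid, mock_getcwd):
    # Each mock replaces its target only while this function runs.
    mock_getpid.return_value = 1234
    mock_getcwd.return_value = "/tmp/example"
    return os.getpid(), os.getcwd()


result = call_patched()
print(result)
```

Swapping the two parameter names without swapping the decorators would silently configure the wrong mock, which is why the parameter list in the test above mirrors the decorator stack from bottom to top.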

tim/cli.py

Lines changed: 70 additions & 1 deletion
@@ -26,7 +26,7 @@
         },
         {
             "name": "Bulk record processing commands",
-            "commands": ["bulk-index", "bulk-delete", "bulk-update"],
+            "commands": ["bulk-update", "reindex-source"],
         },
     ]
 }
@@ -315,3 +315,72 @@ def bulk_update(
 
     summary_results = {"index": index_results, "delete": delete_results}
     logger.info(f"Bulk update complete: {json.dumps(summary_results)}")
+
+
+@main.command()
+@click.option(
+    "-s",
+    "--source",
+    type=click.Choice(VALID_SOURCES),
+    required=True,
+    help="TIMDEX Source to fully reindex in Opensearch.",
+)
+@click.option(
+    "-a",
+    "--alias",
+    multiple=True,
+    help="Alias to promote the index to in addition to the primary alias. May "
+    "be repeated to promote the index to multiple aliases at once.",
+)
+@click.argument("dataset_path", type=click.Path())
+@click.pass_context
+def reindex_source(
+    ctx: click.Context,
+    source: str,
+    alias: tuple[str],
+    dataset_path: str,
+) -> None:
+    """Perform a full refresh for a source in Opensearch for all current records.
+
+    This CLI command performs the following:
+    1. creates a new index for the source
+    2. promotes this index as the primary for the source alias, and added to any other
+       aliases passed (e.g. 'timdex')
+    3. uses the TDA library to yield only current records from the parquet dataset
+       for the source
+    4. bulk index these records to the new Opensearch index
+
+    The net effect is a full refresh for a source in Opensearch, ensuring only current,
+    non-deleted versions of records are used from the parquet dataset.
+    """
+    client = ctx.obj["CLIENT"]
+
+    # create new index
+    index = helpers.generate_index_name(source)
+    new_index = tim_os.create_index(ctx.obj["CLIENT"], str(index))
+    logger.info("Index '%s' created.", new_index)
+
+    # promote index
+    aliases = [source, *list(alias)]
+    tim_os.promote_index(client, index, extra_aliases=aliases)
+    logger.info(
+        "Index promoted. Current aliases for index '%s': %s",
+        index,
+        tim_os.get_index_aliases(client, index),
+    )
+
+    # perform bulk indexing of current records from source
+    index_results = {"created": 0, "updated": 0, "errors": 0, "total": 0}
+
+    td = TIMDEXDataset(location=dataset_path)
+    td.load(current_records=True, source=source)
+
+    # bulk index records
+    records_to_index = td.read_transformed_records_iter(action="index")
+    try:
+        index_results.update(tim_os.bulk_index(client, index, records_to_index))
+    except BulkIndexingError as exception:
+        logger.info(f"Bulk indexing failed: {exception}")
+
+    summary_results = {"index": index_results}
+    logger.info(f"Reindex source complete: {json.dumps(summary_results)}")
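
The control flow of `reindex_source` (create, promote, bulk index, summarize) can be sketched without OpenSearch or the TDA library. Everything below is a hypothetical stand-in written for illustration: `FakeSearchClient`, the local `bulk_index` and `BulkIndexingError`, `reindex_source_sketch`, and the example index name are assumptions, not TIM's actual helpers or naming scheme.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("reindex_sketch")


class BulkIndexingError(Exception):
    """Hypothetical error, named after the exception caught in the commit."""


class FakeSearchClient:
    """Hypothetical in-memory stand-in for a search cluster client."""

    def __init__(self):
        self.indexes = {}  # index name -> list of indexed records
        self.aliases = {}  # alias name -> index name it points at


def bulk_index(client, index, records):
    """Index records one by one and return counts shaped like the commit's summary."""
    results = {"created": 0, "updated": 0, "errors": 0, "total": 0}
    for record in records:
        if "id" not in record:
            raise BulkIndexingError(f"record without id: {record!r}")
        client.indexes[index].append(record)
        results["created"] += 1
        results["total"] += 1
    return results


def reindex_source_sketch(client, source, extra_aliases, records, index_name):
    # 1. create a new, empty index for the source
    client.indexes[index_name] = []
    # 2. point the source alias and any extra aliases at the new index
    for alias in [source, *extra_aliases]:
        client.aliases[alias] = index_name
    # 3-4. bulk index; the zeroed defaults survive if indexing raises
    index_results = {"created": 0, "updated": 0, "errors": 0, "total": 0}
    try:
        index_results.update(bulk_index(client, index_name, records))
    except BulkIndexingError as exception:
        logger.info("Bulk indexing failed: %s", exception)
    return {"index": index_results}


client = FakeSearchClient()
records = iter([{"id": "alma:1"}, {"id": "alma:2"}])
summary = reindex_source_sketch(
    client, "alma", ("timdex",), records, "alma-example-index"
)
print(json.dumps(summary))
```

As in the commit, the aliases are repointed before any records are indexed, so in this sketch anything reading through the alias sees a partially populated index until bulk indexing finishes. If indexing raises, the summary still reports the zeroed defaults rather than failing.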
