Updates to documentation #139

Merged · 3 commits · Feb 3, 2025
29 changes: 24 additions & 5 deletions README.md
@@ -22,15 +22,25 @@ Depending on your environment, and if you are using Docker directly or not, thes
#### Architecture
![Architecture Diagram](img/presto_architectural_diagram.png?raw=true "Architecture Diagram")

[Source](https://docs.google.com/drawings/d/1ZWiUwMfMDdJprloUZEYm9GxUUisxmOXotdlX_aRprp8/edit)

#### Execution Flowchart
![Architecture Flowchart](img/presto_flowchart.png?raw=true "Architecture Flowchart")

[Source](https://docs.google.com/drawings/d/1_UEb8k5wZUSJ3Jnz_G-hDVh5xpkBjGrN6Mn--RVhVMY/edit)

#### External Services Flowchart
![External Architectural Flowchart](img/presto_flowchart_external.png?raw=true "External Architecture Flowchart")

[Source](https://docs.google.com/drawings/d/1lsNS68YsGUMT1YBKQrpqZq1o3XPsZ1BWNIXlLv7-u78/edit)

#### Swimlane Chart
![Swimlane Diagram](img/presto_swimlane.png?raw=true "Swimlane Diagram")

[Source](https://docs.google.com/drawings/d/1EqgLmbXJgk-DJrfEJwpT4WIduv-L3EWDRFpvUJAOxWA/edit)

### Setup
- To run the project, you can use the provided `Dockerfile`, or start via `docker-compose build && docker-compose up`. This file sets up the environment by installing the required dependencies and running the `run.py` file when the container is started. To build and run the Docker image from the Dockerfile directly, run the following commands:
+ To run the project, you can use the provided `Dockerfile`, or start via `docker compose build && docker compose up`. This file sets up the environment by installing the required dependencies and running the `run.py` file when the container is started. To build and run the Docker image from the Dockerfile directly, run the following commands:

```
docker build -t text-vectorization .
```

@@ -41,6 +51,12 @@ Here, we require at least one environment variable - `model_name`. If left unspe
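
With the image built, starting a container might look like the following; the exact invocation (and the choice of `fasttext.Model`) is an assumption for illustration:

```
docker run -e model_name=fasttext.Model text-vectorization
```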

Supported `model_name` values are module names keyed from the `model` directory; they are currently as follows (a loading sketch follows the list):

* `classycat.Model` - ClassyCat Parent Model
* `classycat_classify.Model` - ClassyCat Text Classifier
* `classycat_schema_create.Model` - ClassyCat Schema Creation
* `classycat_schema_lookup.Model` - ClassyCat Schema Lookup
* `fasttext.Model` - fasttext Language Model
* `yake_keywords.Model` - YAKE Keyword Extractor
* `fptg.Model` - text model, uses `meedan/paraphrase-filipino-mpnet-base-v2`
* `indian_sbert.Model` - text model, uses `meedan/indian-sbert`
* `mean_tokens.Model` - text model, uses `xlm-r-bert-base-nli-stsb-mean-tokens`
@@ -49,17 +65,20 @@
* `audio.Model` - audio model
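
Since these values are dotted module paths, resolving one into a model instance only takes a dynamic import. Below is a minimal sketch of that pattern, assuming the modules live under `lib/model` (as the test paths in this PR suggest) and that each module defines a `Model` class:

```
import importlib
import os

def get_model(model_name: str):
    # Split e.g. "fasttext.Model" into a module name and a class name.
    module_name, class_name = model_name.rsplit(".", 1)
    # The lib.model package path is an assumption based on the test layout.
    module = importlib.import_module(f"lib.model.{module_name}")
    return getattr(module, class_name)()

model = get_model(os.environ["model_name"])
```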

### Makefile
- The Makefile contains four targets, `run`, `run_http`, `run_worker`, and `run_test`. `run` runs the `run.py` file when executed - if `RUN_MODE` is set to `http`, it will run the `run_http` command, else it will `run_worker`. Alternatively, you can call `run_http` or `run_worker` directly. Remember to have the environment variables described above defined. `run_test` runs the test suite which is expected to be passing currently - reach out if it fails on your hardware!
+ The Makefile contains five targets: `run`, `run_http`, `run_worker`, `run_processor`, and `run_test`. `run` runs the `run.py` file when executed; if `RUN_MODE` is set to `http`, it runs the `run_http` command, otherwise it runs `run_worker` as well as `run_processor`. Alternatively, you can call `run_http`, `run_processor`, or `run_worker` directly. Remember to have the environment variables described above defined. `run_test` runs the test suite, which is expected to pass currently - reach out if it fails on your hardware!
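
For example, assuming `RUN_MODE` is read from the environment as described:

```
RUN_MODE=http make run   # dispatches to run_http
make run_worker          # start a queue worker directly
make run_test            # run the test suite
```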

### run_worker.py
The `run_worker.py` file is the main routine that runs the fingerprinting process. It sets up the queue and model instances, receives messages from the queue, applies the model to the messages, responds to the queue with the vectorized text, and deletes the original messages in a loop within the `queue.process_messages` function. The `os.environ` statements retrieve environment variables to create the queue and model instances.
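
As a rough, self-contained sketch of that receive/respond/delete cycle (the classes below are stand-ins, not Presto's actual `queue` and `model` implementations):

```
from queue import Queue  # in-memory stand-in for ElasticMQ/SQS

class EchoModel:
    # Stand-in model; respond() handles a batch of messages at once.
    def respond(self, messages):
        return [{"vector": [0.0], "text": m} for m in messages]

def process_messages(input_queue, output_queue, model):
    # Pull whatever is waiting, run the model, and reply with the results.
    batch = []
    while not input_queue.empty():
        batch.append(input_queue.get())
    if batch:
        for message, response in zip(batch, model.respond(batch)):
            output_queue.put({"request": message, "response": response})

input_queue, output_queue = Queue(), Queue()
input_queue.put("some text to vectorize")
process_messages(input_queue, output_queue, EchoModel())
print(output_queue.get())  # {'request': ..., 'response': ...}
```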

### run.py
The `run.py` file is the main routine that runs the vectorization process. It sets up the queue and model instances, receives messages from the queue, applies the model to the messages, responds to the queue with the vectorized text, and deletes the original messages in a loop within the `queue.process_messages` function. The `os.environ` statements retrieve environment variables to create the queue and model instances.

### run_processor.py
The `run_processor.py` file is a simple script that listens to output queues and sends callbacks to original requestors via HTTP.
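
A schematic of that callback step using only the standard library; the `callback_url` field is an assumption, since this README only says callbacks go back to the original requestor over HTTP:

```
import json
import urllib.request

def send_callback(message):
    # Assumed convention: the original request carries the URL to call back.
    url = message.get("request", {}).get("callback_url")
    if not url:
        return
    body = json.dumps(message).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request)
```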

### main.py
The `main.py` file is the HTTP server. We use FastAPI and provide two endpoints, which are described later in this document. The goal of this HTTP server is to add items to a queue, and to take messages from a queue and use them to fire HTTP callbacks with the queue body to external services.
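
Shape-wise, the server is a small FastAPI app along these lines; the route path and payload fields below are assumptions for illustration, not the documented endpoints:

```
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Message(BaseModel):
    # Hypothetical payload shape.
    body: str
    callback_url: Optional[str] = None

@app.post("/process_item/{process_name}")  # route name is an assumption
def process_item(process_name: str, message: Message):
    # Enqueue the message onto the named model's input queue (omitted here).
    return {"queued": True, "process_name": process_name}
```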

### Queues

- Presto is able to `process_messages` via ElasticMQ or SQS. In practice, we use ElasticMQ for local development, and SQS for production environments. When interacting with a `queue`, we use the generic superclass `queue`. `queue.fingerprint` takes as an argument a `model` instance. The `fingerprint` routine collects a batch of `BATCH_SIZE` messages appropriate to the `BATCH_SIZE` for the `model` specified. Once pulled from the `input_queue`, those messages are processed via `model.respond`. The resulting fingerprint outputs from the model are then zipped with the original message pulled from the `input_queue`, and a message is placed onto the `output_queue` that consists of exactly: `{"request": message, "response": response}`.
+ Presto is able to `process_messages` via ElasticMQ or SQS. In practice, we use ElasticMQ for local development, and SQS for production environments. When interacting with a `queue`, we use the generic superclass `queue`. `queue.fingerprint` takes as an argument a `model` instance. The `fingerprint` routine collects a batch of `BATCH_SIZE` messages appropriate to the `BATCH_SIZE` for the `model` specified. Once pulled from the `input_queue`, those messages are processed via `model.respond`. The resulting fingerprint outputs from the model are then packed with the original message pulled from the `input_queue`, and a message is placed onto the `output_queue` that consists of exactly: `{"request": message, "response": response}`.
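
Concretely, the packing step described above reduces to a zip over the batch; a sketch with placeholder data (the `BATCH_SIZE` collection and `model.respond` call happen upstream):

```
messages = ["first text", "second text"]   # batch pulled from input_queue
responses = ["vector-a", "vector-b"]       # stand-in for model.respond(messages)
packed = [{"request": m, "response": r} for m, r in zip(messages, responses)]
# each element of `packed` is placed onto the output_queue
```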

### Models

Binary file modified img/presto_architectural_diagram.png
Binary file modified img/presto_flowchart.png
Binary file modified img/presto_flowchart_external.png
Binary file added img/presto_swimlane.png
2 changes: 1 addition & 1 deletion test/lib/model/test_image.py
@@ -16,7 +16,7 @@ def test_compute_pdq(self, mock_pdq_hasher):
mock_hasher_instance = mock_pdq_hasher.return_value
mock_hasher_instance.fromBufferedImage.return_value.getHash.return_value.dumpBitsFlat.return_value = '1001'
result = Model().compute_pdq(io.BytesIO(image_content))
-self.assertEqual(result, '0011100000111011010110100001001110001011110100100010101011010111010110101010000111001010111000001010111111110000000101110010000011111110111110100100011111010010110110101111101100111001000000010010100101010111110001001101101011000110001000001110010000111100')
+self.assertEqual(result, '0001110000111110101111100001110001110100001111100001011000111100101101100001000000010010101110110111110000010110010110001011111011101001100100101000101111000001000000111110100110100011110000011111111010010100010001001011111011110110100101001110011101000100')

@patch("urllib.request.urlopen")
def test_get_iobytes_for_image(self, mock_urlopen):