|
1 | 1 | # dedupeknn
|
2 |
| -Fast Scalable Dedupe - Fuzzy Matching with Opensearch and nmslib. |
| 2 | +dedupeknn is an innovative project designed to address the challenges of |
| 3 | +finding duplicated addresses and performing address matching efficiently. |
| 4 | +Leveraging advanced technologies such as FastText for generating |
| 5 | +vector representations and OpenSearch as a vector data source, |
| 6 | +dedupeknn offers powerful solutions for these tasks. |
| 7 | +By employing nearest neighbor algorithms from NMSLIB, dedupeknn |
| 8 | +achieves accurate and speedy address comparisons. |
| 9 | + |
| 10 | +dedupeknn utilizes the FastText library, renowned for its effectiveness |
| 11 | +in generating high-quality vector representations of text inputs. |
| 12 | +By transforming address strings into vector embeddings, dedupeknn |
| 13 | +captures the semantic meaning and contextual information essential |
| 14 | +for accurate address comparisons. |
| 15 | + |
| 16 | +The OpenSearch framework serves as the vector data source for dedupeknn. |
| 17 | +OpenSearch is an open-source search engine, started by AWS, that provides efficient |
| 18 | +storage and retrieval capabilities for large-scale |
| 19 | +vector datasets. With OpenSearch, dedupeknn can handle vast amounts of |
| 20 | +address data, ensuring scalability and performance. |
| 21 | + |
| 22 | +To find the nearest neighbors of a given address vector, |
| 23 | +dedupeknn employs nearest neighbor algorithms from NMSLIB. |
| 24 | +These algorithms efficiently search the vector data source to |
| 25 | +identify the most similar addresses, allowing for effective |
| 26 | +deduplication and address matching. |
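To make the nearest-neighbour step concrete, here is a minimal brute-force sketch in plain Python. It is not how NMSLIB works internally (NMSLIB's HNSW index is an approximate search that avoids comparing against every vector); it only illustrates the ranking that such an index approximates. The function names and the toy 3-dimensional vectors are illustrative assumptions, not part of dedupeknn.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def nearest_neighbours(query, candidates, k=1):
    """Return the k (doc_id, score) pairs most similar to `query`, best first.

    `candidates` maps a document id to its vector. Every candidate is scored,
    so this is exact but O(n); an HNSW index trades exactness for speed.
    """
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy vectors standing in for 300-dimensional address embeddings.
    toy_index = {
        "a": [1.0, 0.0, 0.0],
        "b": [0.9, 0.1, 0.0],
        "c": [0.0, 1.0, 0.0],
    }
    print(nearest_neighbours([1.0, 0.05, 0.0], toy_index, k=2))
```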
| 27 | + |
| 28 | +By combining the strengths of FastText, OpenSearch, and NMSLIB, |
| 29 | +dedupeknn delivers a robust and accurate solution for addressing |
| 30 | +the challenges of duplicated addresses and address matching. |
| 31 | +Its fast and efficient algorithms enable organizations to streamline |
| 32 | +their operations, enhance data quality, and improve customer experiences. |
| 33 | + |
| 34 | +## Running dedupeknn |
| 35 | +1. The project uses the `fastapi` library and runs as a microservice. It depends on |
| 36 | +a running OpenSearch cluster with the _opensearch-knn_ plugin installed. |
| 37 | +2. The configuration is loaded from the properties file `properties/opensearch-client.properties`. |
| 38 | +Set the values to match your installation. |
| 39 | +3. Create a new conda environment with `conda create -n dedupeknn python=3.10`, then activate it with `conda activate dedupeknn` |
| 40 | +4. Install the required dependencies: `pip install -r requirements.txt` |
| 41 | +5. Run the project: `python main.py` |
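As a rough illustration of step 2, the sketch below parses a properties file, assuming it uses plain `key=value` lines with `#` or `!` comments. The actual format and keys of `properties/opensearch-client.properties` are not shown in this README, so the keys in the sample (`opensearch.host` and so on) are hypothetical, and this is not necessarily how dedupeknn itself loads its configuration.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines that contain no '='
        props[key.strip()] = value.strip()
    return props


# Hypothetical keys, for illustration only; check the real file for its names.
SAMPLE = """
# OpenSearch connection (assumed keys)
opensearch.host=localhost
opensearch.port=9200
opensearch.index=addresses
"""

if __name__ == "__main__":
    print(parse_properties(SAMPLE))
```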
| 42 | + |
| 43 | +## Creating KNN index before ingesting data |
| 44 | +The example below shows how to create an OpenSearch index with KNN support. |
| 45 | +```json |
| 46 | +{ |
| 47 | +  "settings": { |
| 48 | +    "index": { |
| 49 | +      "knn": true, |
| 50 | +      "knn.algo_param.ef_search": 100 |
| 51 | +    } |
| 52 | +  }, |
| 53 | +  "mappings": { |
| 54 | +    "properties": { |
| 55 | +      "dedupe_vector_nmslib": { |
| 56 | +        "type": "knn_vector", |
| 57 | +        "dimension": 300, |
| 58 | +        "method": { |
| 59 | +          "name": "hnsw", |
| 60 | +          "space_type": "cosinesimil", |
| 61 | +          "engine": "nmslib", |
| 62 | +          "parameters": { |
| 63 | +            "ef_construction": 128, |
| 64 | +            "m": 24 |
| 65 | +          } |
| 66 | +        } |
| 67 | +      } |
| 68 | +    } |
| 69 | +  } |
| 70 | +} |
| 71 | +``` |
| 72 | +Note: |
| 73 | +1. We use _cosinesimil_ (cosine similarity) as the KNN space type. |
| 74 | +2. We use the KNN algorithm implementation from _nmslib_ (Non-Metric Space Library). |
| 75 | +3. The fasttext model that we use to create vector representations of the input data has 300 dimensions, |
| 76 | +so we set the _dimension_ field to 300. If you use another model with, say, 500 or 800 |
| 77 | +dimensions, change this field accordingly. |
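If you need the same index definition for models of different sizes, a small helper can generate the request body. The sketch below only builds the JSON shown above with the dimension (and the other tunables) as parameters; the function name `build_knn_index_body` is an assumption for illustration, and sending the body to OpenSearch is left to whichever client you use.

```python
import json


def build_knn_index_body(dimension=300, field="dedupe_vector_nmslib",
                         ef_search=100, ef_construction=128, m=24):
    """Return the index-creation body from the example, with tunable values."""
    return {
        "settings": {
            "index": {
                "knn": True,
                "knn.algo_param.ef_search": ef_search,
            }
        },
        "mappings": {
            "properties": {
                field: {
                    "type": "knn_vector",
                    # Must match the output size of the embedding model.
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "nmslib",
                        "parameters": {
                            "ef_construction": ef_construction,
                            "m": m,
                        },
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    # Example: a body for a hypothetical 500-dimensional model.
    print(json.dumps(build_knn_index_body(dimension=500), indent=2))
```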
| 78 | + |
| 79 | +## APIs exposed |
| 80 | + |
| 81 | +### Ingesting data |
| 82 | +```shell |
| 83 | +curl --location 'http://localhost:8080/api/v1/knn/doc/insert' \ |
| 84 | +--header 'Content-Type: application/json' \ |
| 85 | +--data '{ |
| 86 | + "text": "#6/A Shashank J, 3rd Floor, Chetan Nilaya, 20 C Cross Rd, Ejipura, Bengaluru - 560047" |
| 87 | +}' |
| 88 | +``` |
| 89 | + |
| 90 | +### Getting vector representation of a string |
| 91 | +```shell |
| 92 | +curl --location 'http://localhost:8080/api/v1/vector/representation' \ |
| 93 | +--header 'Content-Type: application/json' \ |
| 94 | +--data-raw '{ |
| 95 | + "text": "*@) sdfd *29&3 -2030" |
| 96 | +}' |
| 97 | +``` |
| 98 | + |
| 99 | +### Getting K-Nearest-Neighbours for the input string |
| 100 | +```shell |
| 101 | +curl --location 'http://localhost:8080/api/v1/similarity/knn/search' \ |
| 102 | +--header 'Content-Type: application/json' \ |
| 103 | +--data '{ |
| 104 | + "text": "Chetan Nilaya, House No 6, 3rd Floor, Ejipur, Bangalore 560047", |
| 105 | + "size": 30, |
| 106 | + "k": 1 |
| 107 | +}' |
| 108 | +``` |
| 109 | +Note: |
| 110 | +1. size - number of neighbours. |
| 111 | +2. k - level of neighbours. |
| 112 | + |
| 113 | +### Similarity Match |
| 114 | +```shell |
| 115 | +curl --location 'http://localhost:8080/api/v1/similarity/address/search' \ |
| 116 | +--header 'Content-Type: application/json' \ |
| 117 | +--data '{ |
| 118 | + "text": "#6/A Third Floor, ChetanNilaya, 20C Road Ejipura, bengaluru karnataka 560047", |
| 119 | + "size": 30, |
| 120 | + "k": 1, |
| 121 | + "threshold": 70 |
| 122 | +}' |
| 123 | +``` |