Commit 3dfe786

main: Updated README.md
1 parent 33e8093 commit 3dfe786

1 file changed

+122
-1
lines changed

README.md

Lines changed: 122 additions & 1 deletion
@@ -1,2 +1,123 @@
# dedupeknn
-Fast Scalable Dedupe - Fuzzy Matching with Opensearch and nmslib.

dedupeknn finds duplicated addresses and performs address matching.
It uses FastText to generate vector representations of address strings,
OpenSearch as the vector data store, and nearest neighbor algorithms
from NMSLIB to compare addresses quickly and accurately.

dedupeknn uses the FastText library to transform address strings into
vector embeddings. FastText builds word vectors from character n-grams,
so spelling variants such as "Ejipur" and "Ejipura" tend to map to
nearby vectors, which keeps address comparisons tolerant of typos and
formatting differences.

OpenSearch serves as the vector data store for dedupeknn.
OpenSearch is an open-source search engine, originally created by AWS,
that stores and retrieves large vector datasets through its k-NN plugin.
This lets dedupeknn scale to large volumes of address data.

To find the nearest neighbors of a given address vector,
dedupeknn uses nearest neighbor algorithms from NMSLIB.
These algorithms search the vector index for the most similar
addresses, which is the basis for deduplication and address matching.

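
To make the idea of "nearest neighbors under cosine similarity" concrete, here is a minimal pure-Python sketch that scores a query vector against every stored vector and keeps the best matches. It is an exhaustive search written for illustration only; it is not NMSLIB's algorithm, which uses an HNSW graph to approximate this result without scanning every vector. The function names and the toy 3-dimensional vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def nearest_neighbours(query, stored, k):
    """Return the k (doc_id, score) pairs most similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for 300-dimensional address embeddings.
stored_vectors = {
    "addr-1": [0.9, 0.1, 0.0],
    "addr-2": [0.0, 1.0, 0.0],
    "addr-3": [0.8, 0.2, 0.1],
}
top = nearest_neighbours([1.0, 0.0, 0.0], stored_vectors, k=2)
print(top)
```

With these toy vectors, `addr-1` and `addr-3` rank ahead of `addr-2`, whose vector is orthogonal to the query.
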
By combining FastText, OpenSearch, and NMSLIB, dedupeknn provides
fast and accurate duplicate detection and address matching, which
helps organizations improve data quality.

## Running dedupeknn
1. The project uses the `fastapi` library and runs as a microservice. It requires a running OpenSearch cluster with the _opensearch-knn_ plugin installed.
2. The configuration is loaded from the properties file `properties/opensearch-client.properties`. Set the values to match your installation.
3. Create a new conda environment and activate it: `conda create -n dedupeknn python=3.10`, then `conda activate dedupeknn`
4. Install the required dependencies: `pip install -r requirements.txt`
5. Run the project: `python main.py`

## Creating KNN index before ingesting data
The example below shows how to create an OpenSearch index with k-NN support.
```json
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "dedupe_vector_nmslib": {
        "type": "knn_vector",
        "dimension": 300,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }
    }
  }
}
```
Note:
1. We use _cosinesimil_ (cosine similarity) as the k-NN space type.
2. We use the KNN algorithm implementation from _nmslib_ (Non-Metric Space Library).
3. The FastText model that we use to create vector representations of the input data has 300 dimensions, so we set the _dimension_ field to 300. If you are using another model, for example one with 500 or 800 dimensions, change this field accordingly.

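
Since the _dimension_ value has to follow the embedding model, it can help to generate the index body instead of editing JSON by hand. The sketch below builds the same settings and mappings for a given dimension and prepares the `PUT /<index>` request that OpenSearch uses to create an index. The cluster URL `http://localhost:9200`, the index name `dedupe-addresses`, and both function names are assumptions made for illustration; they do not come from this project.

```python
import json
import urllib.request


def build_knn_index_body(dimension, field="dedupe_vector_nmslib"):
    """Build the index settings and mappings for a given vector dimension."""
    return {
        "settings": {"index": {"knn": True, "knn.algo_param.ef_search": 100}},
        "mappings": {
            "properties": {
                field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "nmslib",
                        "parameters": {"ef_construction": 128, "m": 24},
                    },
                }
            }
        },
    }


def build_create_index_request(base_url, index_name, dimension):
    """Prepare (but do not send) the PUT request that creates the index."""
    body = json.dumps(build_knn_index_body(dimension)).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical cluster URL and index name; adjust both to your setup.
request = build_create_index_request("http://localhost:9200", "dedupe-addresses", 300)
# urllib.request.urlopen(request)  # uncomment to send it to a running cluster
```

The request is only constructed here; sending it requires a reachable cluster and, depending on your installation, authentication that this sketch does not cover.
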
## APIs exposed

### Ingesting data
```shell
curl --location 'http://localhost:8080/api/v1/knn/doc/insert' \
--header 'Content-Type: application/json' \
--data '{
    "text": "#6/A Shashank J, 3rd Floor, Chetan Nilaya, 20 C Cross Rd, Ejipura, Bengaluru - 560047"
}'
```

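
For loading more than one address, the same request can be issued from Python. This is a minimal sketch that mirrors the curl call above, using only the standard library; it assumes the service accepts one `text` value per request, as in the example, and the helper names `build_insert_request` and `ingest` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # same host and port as the curl example


def build_insert_request(text, base_url=BASE_URL):
    """Prepare the POST request that inserts one address string."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/knn/doc/insert",
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def ingest(addresses, base_url=BASE_URL):
    """Send each address to the service and return the HTTP status codes."""
    statuses = []
    for text in addresses:
        with urllib.request.urlopen(build_insert_request(text, base_url)) as response:
            statuses.append(response.status)
    return statuses
```

Calling `ingest(["first address", "second address"])` requires the service to be running; `build_insert_request` on its own only constructs the request.
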
### Getting vector representation of a string
```shell
curl --location 'http://localhost:8080/api/v1/vector/representation' \
--header 'Content-Type: application/json' \
--data-raw '{
    "text": "*@) sdfd *29&3 -2030"
}'
```

### Getting K-Nearest-Neighbours for the input string
```shell
curl --location 'http://localhost:8080/api/v1/similarity/knn/search' \
--header 'Content-Type: application/json' \
--data '{
    "text": "Chetan Nilaya, House No 6, 3rd Floor, Ejipur, Bangalore 560047",
    "size": 30,
    "k": 1
}'
```
Note:
1. `size` - number of neighbours.
2. `k` - level of neighbours.

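
How the service turns `size` and `k` into an OpenSearch query is not documented here. As a hedged sketch, assuming it forwards both values into a standard OpenSearch k-NN query, the search body would look like the one built below: `size` caps the number of hits in the response, and `k` is the number of neighbours the k-NN query retrieves from the vector index. The function name `build_knn_query` is hypothetical; the field name is taken from the index mapping in this README.

```python
def build_knn_query(vector, size, k, field="dedupe_vector_nmslib"):
    """Build an OpenSearch k-NN search body for a query vector."""
    if size < 1 or k < 1:
        raise ValueError("size and k must be positive")
    return {
        "size": size,  # maximum number of hits in the response
        "query": {"knn": {field: {"vector": list(vector), "k": k}}},
    }


# A short toy vector; a real query would use the 300-dimensional embedding.
body = build_knn_query([0.1, 0.2, 0.3], size=30, k=1)
print(body)
```
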
### Similarity Match
```shell
curl --location 'http://localhost:8080/api/v1/similarity/address/search' \
--header 'Content-Type: application/json' \
--data '{
    "text": "#6/A Third Floor, ChetanNilaya, 20C Road Ejipura, bengaluru karnataka 560047",
    "size": 30,
    "k": 1,
    "threshold": 70
}'
```
