zuliani99/All-Pairs-Docs-Similarity

All-Pairs-Docs-Similarity

Given a set of documents and a minimum similarity threshold, find the number of document pairs whose similarity exceeds the threshold.
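To make the task concrete, here is a minimal brute-force sketch. It is not the method used in this repository: it assumes raw term-count vectors, whitespace tokenization, and cosine similarity, and the names `tf_vector`, `cosine`, and `count_similar_pairs` are hypothetical. It compares every pair, so it only suits small inputs.

```python
import math
from collections import Counter
from itertools import combinations


def tf_vector(text):
    """Map lower-cased whitespace tokens to raw term counts (an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def count_similar_pairs(docs, threshold):
    """Count unordered document pairs whose similarity exceeds threshold."""
    vectors = [tf_vector(d) for d in docs]
    return sum(1 for a, b in combinations(vectors, 2) if cosine(a, b) > threshold)


docs = [
    "dietary fiber lowers cholesterol",
    "fiber intake lowers cholesterol levels",
    "statins and muscle pain",
]
print(count_similar_pairs(docs, 0.5))
```

Only the first two documents share terms, so they form the single pair above a threshold of 0.5.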

Requirements

sudo apt install default-jre
pip install beir
pip install pandas
pip install scikit-learn
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
pip install ipywidgets

PySpark Local Installation

wget https://dlcdn.apache.org/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
sha512sum spark-3.4.0-bin-hadoop3.tgz
tar -xzf spark-3.4.0-bin-hadoop3.tgz
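`sha512sum` only prints the digest; it still has to be compared with the value Apache publishes for that release. The sketch below shows one way to do the comparison in Python. The helper names are hypothetical, and the published digest must be copied from the Spark download page, since it is not given here.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare a file's digest with a published one.

    Whitespace and letter case in the published string are ignored,
    because checksum files are sometimes grouped or upper-cased.
    """
    normalized = "".join(published_hex.split()).lower()
    return sha512_of_file(path) == normalized
```

For example, `matches_published("spark-3.4.0-bin-hadoop3.tgz", published)` returns `True` only when the download is intact, where `published` holds the digest copied from the release page.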

Configure Spark Environment

Follow this tutorial
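As a rough illustration of what such a configuration usually involves, the sketch below sets the environment variables that PySpark reads at start-up. It is an assumption about a typical local setup, not a summary of the tutorial: the function name `configure_spark_env` and the install path in the usage note are hypothetical.

```python
import os
import sys
from pathlib import Path


def configure_spark_env(spark_home, env=None):
    """Set SPARK_HOME, PYSPARK_PYTHON and PATH in env (default: os.environ)."""
    env = os.environ if env is None else env
    home = Path(spark_home).expanduser()
    env["SPARK_HOME"] = str(home)
    # Run Spark's Python processes with the current interpreter.
    env["PYSPARK_PYTHON"] = sys.executable
    # Put spark-submit and pyspark on the search path, without duplicates.
    bin_dir = str(home / "bin")
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if bin_dir not in parts:
        env["PATH"] = os.pathsep.join([bin_dir] + parts)
    return env
```

Calling `configure_spark_env("~/spark-3.4.0-bin-hadoop3")` before creating a Spark session would point PySpark at the extracted tarball, assuming it was unpacked in the home directory.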

Used Dataset

nfcorpus
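The dataset can be fetched through the `beir` package listed in the requirements. The sketch below follows the loading pattern from BEIR's own examples. The download URL is the one BEIR documents at the time of writing and may change, and `load_nfcorpus` and `corpus_to_texts` are hypothetical helper names, not functions from this repository.

```python
def load_nfcorpus(out_dir="datasets"):
    """Download nfcorpus with BEIR and return its corpus dictionary."""
    # Imported lazily so corpus_to_texts works without beir installed.
    from beir import util
    from beir.datasets.data_loader import GenericDataLoader

    url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip"
    data_path = util.download_and_unzip(url, out_dir)
    corpus, _queries, _qrels = GenericDataLoader(data_folder=data_path).load(split="test")
    return corpus


def corpus_to_texts(corpus):
    """Flatten a BEIR corpus ({doc_id: {"title", "text"}}) into parallel lists.

    Returns (ids, texts), with ids sorted so the order is reproducible.
    """
    ids = sorted(corpus)
    texts = [
        (corpus[i].get("title", "") + " " + corpus[i].get("text", "")).strip()
        for i in ids
    ]
    return ids, texts
```

Joining title and text into one string per document is a design choice for this sketch; the application may treat the two fields differently.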

Start Application

Change into the app folder and run:

python main.py

Results

Benchmark Results
