COMP479-Fall2022

Course project for Information Retrieval and Web Search (COMP 479) at Concordia University, assigned by Dr. Sabine Bergler.

Overview

The project has three stages: P1, P2, and P3.

Built with Python

The project is written in Python 3.8 or later. Python was chosen because it is well suited to natural language processing tasks, largely thanks to the NLTK package.

Project 1 (P1): Text Preprocessing and Proofreading

Key Tasks

Utilize NLTK for text preprocessing, which involves tasks like tokenization and stemming.
Proofread and ensure the quality of the processed text data.
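As a minimal sketch of the preprocessing step, the snippet below tokenizes a string and stems each token with NLTK. The choice of `wordpunct_tokenize` and `PorterStemmer` is an assumption made here because neither needs downloaded NLTK data; the project may use a different tokenizer or stemmer. The function name `preprocess` is hypothetical.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Assumption: Porter stemming; the README only says "stemming".
_stemmer = PorterStemmer()


def preprocess(text):
    """Tokenize text, drop non-alphabetic tokens, and stem what is left."""
    tokens = wordpunct_tokenize(text)
    # PorterStemmer.stem lowercases its input by default.
    return [_stemmer.stem(tok) for tok in tokens if tok.isalpha()]


if __name__ == "__main__":
    print(preprocess("Running indexing documents"))
```

Proofreading the output then amounts to spot-checking the returned token lists against the source text.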

Resources

Project 2 (P2): Indexing and Query Processing

Key Tasks

Implement a naive indexer for indexing documents.
Develop a mechanism for processing single-term queries.
Apply lossy dictionary compression techniques to create a compressed indexer.
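The three tasks above can be sketched as follows, assuming the naive indexer is the sort-based one: collect (term, docID) pairs, sort and deduplicate them, then group them into postings lists. The lossy compression shown (number removal and case folding) is only one possible choice, and all function names and the sample documents are hypothetical.

```python
from collections import defaultdict


def naive_index(docs):
    """Build an inverted index from {doc_id: [token, ...]}."""
    # Step 1: collect every (term, doc_id) pair.
    pairs = [(term, doc_id) for doc_id, tokens in docs.items() for term in tokens]
    # Step 2: remove duplicates and sort by term, then by doc_id.
    pairs = sorted(set(pairs))
    # Step 3: group the sorted pairs into postings lists.
    index = defaultdict(list)
    for term, doc_id in pairs:
        index[term].append(doc_id)
    return dict(index)


def single_term_query(index, term):
    """Return the postings list for one term, or an empty list."""
    return index.get(term, [])


def compress_tokens(tokens):
    """Lossy compression (assumed variant): drop numbers, then case-fold."""
    return [tok.lower() for tok in tokens if not tok.isdigit()]


if __name__ == "__main__":
    # Hypothetical sample documents, not taken from Reuters-21578.
    docs = {1: ["Oil", "prices", "rose", "5"], 2: ["oil", "exports", "fell"]}
    naive = naive_index(docs)
    compressed = naive_index({d: compress_tokens(t) for d, t in docs.items()})
    print(len(naive), len(compressed))
    print(single_term_query(compressed, "oil"))
```

Comparing `len(naive)` with `len(compressed)` shows how much each compression step shrinks the dictionary, at the cost of no longer distinguishing "Oil" from "oil".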

Resources

Project 3 (P3): Performance Analysis and Search Engine Implementation

Key Tasks

Measure and compare the execution time required to construct the naive indexer and the SPIMI (Single-Pass In-Memory Indexing) indexer.
Utilize the SPIMI indexer to implement two search engines:
- A Ranked BM25 search engine, which ranks search results based on relevance using the BM25 algorithm.
- An Unranked Boolean search engine, which performs basic Boolean (AND, OR, NOT) queries.
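The two search engines can be illustrated with a small in-memory index that stores term frequencies. This is a sketch under stated assumptions, not the project's implementation: the BM25 variant uses idf = log(N / df), the parameter values k1 = 1.2 and b = 0.75 are common defaults chosen here, and every function name and sample document is hypothetical.

```python
import math
from collections import Counter


def build_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = {}
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index.setdefault(term, {})[doc_id] = tf
    return index


def bm25_search(query, index, doc_lengths, k1=1.2, b=0.75):
    """Return (doc_id, score) pairs sorted by descending BM25 score."""
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs
    scores = Counter()
    for term in query:
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            # Length normalization: longer documents are penalized.
            norm = k1 * ((1 - b) + b * doc_lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def boolean_and(terms, index):
    """Documents containing every term."""
    sets = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []


def boolean_or(terms, index):
    """Documents containing at least one term."""
    return sorted(set().union(*(index.get(term, {}) for term in terms)))


def boolean_not(term, index, all_doc_ids):
    """Documents that do not contain the term."""
    return sorted(set(all_doc_ids) - set(index.get(term, {})))


if __name__ == "__main__":
    docs = {1: ["oil", "price", "oil"], 2: ["oil", "export"], 3: ["grain", "export"]}
    index = build_index(docs)
    lengths = {doc_id: len(tokens) for doc_id, tokens in docs.items()}
    print(bm25_search(["oil"], index, lengths))
    print(boolean_and(["oil", "export"], index))
```

The ranked engine returns scored pairs, while the Boolean functions return plain sorted document IDs, which reflects the ranked versus unranked distinction in the task list.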

Resources

Dataset Used

Reuters corpus "Reuters-21578"
(Visit Original Website)

Setup

This project uses pypy3 as the Python 3 executable.

PyPy3 replaces the standard Python 3 interpreter because of its better runtime performance. These projects iterate over a large number of large files, so the faster interpreter noticeably shortens processing time.

Install pypy3 on macOS

$ brew install pypy3

Install virtualenv

$ pypy3 -m pip install virtualenv

Create a PyPy virtualenv in the directory ~/pypy3-venv

$ pypy3 -m virtualenv ~/pypy3-venv

Start working in the virtual environment

$ cd ~/pypy3-venv/
$ . bin/activate
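Once the environment is activated, a quick way to confirm that scripts really run under PyPy is to ask the interpreter itself. The helper below is a hypothetical sketch that uses only the standard library; `interpreter_summary` is not part of the project.

```python
import platform
import sys


def interpreter_summary():
    """Describe the running interpreter and whether a virtualenv is active."""
    # sys.prefix differs from sys.base_prefix inside a virtual environment.
    in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    return {
        "implementation": platform.python_implementation(),
        "version": platform.python_version(),
        "executable": sys.executable,
        "in_virtualenv": in_venv,
    }


if __name__ == "__main__":
    for key, value in interpreter_summary().items():
        print(f"{key}: {value}")
```

Under an activated PyPy virtualenv, `implementation` should read `PyPy` rather than `CPython`.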
