Intro-Data-Science-Novelty Detection

Clone

cd existing_repo
git remote add origin https://gitlab.com/mab0205/introcd2-novelty-detection.git
git branch -M main
git push -uf origin main

📝 Abstract

Novelty detection in documents is a challenging and increasingly relevant task in Natural Language Processing (NLP). The goal is to distinguish documents that belong to an already known set from those that exhibit novel or emerging characteristics. This work explores several novelty detection algorithms and compares their results on the TAP-DLND 1.0 benchmark dataset. The results highlight the differences among the evaluated techniques and how they behave across different topics, and they point to the need to explore more advanced methods.

🔜 Objectives

⚠️ Evaluate algorithm performance: Assess how well the Local Outlier Factor (LOF), Isolation Forest, and Elliptic Envelope algorithms detect novelty in previously unseen documents, using TF-IDF-based vector representations.

⚠️ Identify thematic differences: Analyze the most significant thematic differences between articles classified as "novelty" and "non-novelty."

⚠️ Evaluate results by category: Develop a method to evaluate the results of the novelty detection algorithms both globally (across all categories) and individually (for each category). In the dataset, each article has a .txt file with its content and an accompanying .xml file containing metadata such as title, publication date, publisher, and other event-related information.
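The first objective can be sketched with scikit-learn as shown below. This is a minimal illustration, not the notebook's actual pipeline: the toy documents, the `TruncatedSVD` reduction to two dimensions (added here so that Elliptic Envelope gets a small dense input), and all hyperparameters are assumptions made for the example.

```python
from sklearn.covariance import EllipticEnvelope
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical "known" documents (placeholder text, not TAP-DLND data).
known_docs = [
    "the city council approved the new budget for public transport",
    "the council voted on the transport budget for the city",
    "public transport budget approved after a long council debate",
    "city officials debated the new budget for buses and trains",
    "the new transport plan was approved by city officials",
    "council members discussed the budget for city buses",
    "the city plans new trains after the budget was approved",
    "officials approved the public budget after the council vote",
    "the transport debate in the city council ended with a vote",
    "new buses for the city were approved in the council budget",
]
# Hypothetical incoming documents to be scored.
new_docs = [
    "the council approved the city budget for new buses",
    "a volcano erupted and forced thousands of people to evacuate",
]

# Unigram TF-IDF vectors, then a dense low-dimensional projection (assumption).
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
svd = TruncatedSVD(n_components=2, random_state=0)
X_known = svd.fit_transform(vectorizer.fit_transform(known_docs))
X_new = svd.transform(vectorizer.transform(new_docs))

detectors = {
    # novelty=True lets LOF score documents that were not seen during fit.
    "LOF": LocalOutlierFactor(n_neighbors=3, novelty=True, metric="euclidean"),
    "IsolationForest": IsolationForest(n_estimators=100, random_state=0),
    "EllipticEnvelope": EllipticEnvelope(random_state=0),
}

predictions = {}
for name, detector in detectors.items():
    detector.fit(X_known)
    # predict returns +1 for inliers (non-novel) and -1 for outliers (novel).
    predictions[name] = detector.predict(X_new)

for name, labels in predictions.items():
    print(name, labels.tolist())
```

All three estimators share the same `fit`/`predict` interface, so swapping the vectorizer settings or the detector parameters only requires changing the constructor arguments.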

Results

1. Elliptic Envelope

Showed balanced but insufficient performance. Most predictions were effectively random, which highlights the algorithm's difficulty in finding consistent patterns that distinguish new documents.

2. Isolation Forest

Was the least effective at identifying non-novel documents, struggling to detect notable differences between new and already known documents.

3. Local Outlier Factor (LOF)

Our results are broadly similar to those presented in the TAPN article. Both approaches rely on lexical analysis and the Euclidean distance metric for novelty detection. The main difference is the n-gram size: our model uses 1, while TAPN uses 3. With that configuration, TAPN achieved an F1-score of ~0.66 for the Non-Novelty class and ~0.73 for the Novelty class. These results are promising, but TAPN's macro-average F1-score still stayed below 0.7, compared to 0.6 in our study.
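To make the n-gram difference concrete, the following sketch compares two short sentences using word n-gram counts and Euclidean distance. It assumes word-level n-grams and raw counts rather than TF-IDF weights; the sentences and the helper names `word_ngrams` and `euclidean_distance` are hypothetical and only serve the illustration.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the list of word n-grams of size n in a lowercased text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def euclidean_distance(counts_a, counts_b):
    """Euclidean distance between two sparse count vectors (Counters)."""
    keys = set(counts_a) | set(counts_b)
    return math.sqrt(sum((counts_a[k] - counts_b[k]) ** 2 for k in keys))


doc_a = "the cat sat on the mat"
doc_b = "the cat slept on the mat"

# n = 1: only the changed word differs between the two vectors.
dist_unigram = euclidean_distance(
    Counter(word_ngrams(doc_a, 1)), Counter(word_ngrams(doc_b, 1))
)
# n = 3: every trigram that contains the changed word differs.
dist_trigram = euclidean_distance(
    Counter(word_ngrams(doc_a, 3)), Counter(word_ngrams(doc_b, 3))
)

print(round(dist_unigram, 3), round(dist_trigram, 3))
```

In this toy case a single word substitution moves the trigram vectors further apart than the unigram vectors, because the substituted word appears in three different trigrams.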

By employing more sophisticated techniques such as language models, TAPN achieved F1-scores above 0.7. This suggests significant room to improve on our LOF approach by exploring more advanced methods.
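The per-class and macro-averaged F1-scores discussed here, evaluated both globally and per category, can be computed along these lines. This is a sketch in plain Python, not the project's evaluation code: the category names, labels, and records are made-up placeholders, and `f1_for_class`, `macro_f1`, and `macro_f1_by_category` are hypothetical helpers.

```python
from collections import defaultdict

LABELS = ("novel", "non-novel")


def f1_for_class(y_true, y_pred, label):
    """F1-score of one class; returns 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)


def macro_f1_by_category(records):
    """Group (category, truth, prediction) records and score each category."""
    grouped = defaultdict(lambda: ([], []))
    for category, truth, pred in records:
        grouped[category][0].append(truth)
        grouped[category][1].append(pred)
    return {cat: macro_f1(t, p) for cat, (t, p) in grouped.items()}


# Placeholder records: (category, true label, predicted label).
records = [
    ("sports", "novel", "novel"),
    ("sports", "non-novel", "non-novel"),
    ("politics", "novel", "non-novel"),
    ("politics", "non-novel", "non-novel"),
]

global_score = macro_f1([r[1] for r in records], [r[2] for r in records])
per_category = macro_f1_by_category(records)

print(round(global_score, 3))
print({cat: round(score, 3) for cat, score in per_category.items()})
```

Reporting both numbers matters because a reasonable global score can hide categories where one class is never detected, as the placeholder "politics" group shows.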

Authors and acknowledgment

Author: Martín Ávila Buitrón
RA: 2274183
GitLab login: mab0205

