
mab0205/Novelty-Detection-Data-Science


Intro-Data-Science: Novelty Detection

Clone

```shell
cd existing_repo
git remote add origin https://gitlab.com/mab0205/introcd2-novelty-detection.git
git branch -M main
git push -uf origin main
```

📝 Abstract

Novelty detection in documents is a challenging and highly relevant task, particularly in the field of Natural Language Processing (NLP). The goal is to distinguish documents that belong to an already known set from those that exhibit novel or emerging characteristics. This work explores several algorithms for novelty detection and compares their results on the TAP-DLND 1.0 benchmark dataset. The results highlight the differences among the evaluated techniques and their implications across topics, emphasizing the need to explore more advanced methods.

🔜 Objectives

⚠️Evaluate algorithm performance: Assess the performance of the Local Outlier Factor (LOF), Isolation Forest, and Elliptic Envelope algorithms in detecting novelty within unknown datasets, using TF-IDF-based vector representations.

⚠️Identify thematic differences: Analyze the most significant thematic differences between articles classified as "novelty" and "non-novelty."

⚠️Evaluate results by category: Develop a method to evaluate the results of the novelty detection algorithms both globally (across all categories) and individually (per category). In the dataset, each article is a .txt file with the content plus an accompanying .xml file with metadata such as title, publication date, publisher, and other event-related information.
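The setup behind the first objective can be sketched with scikit-learn: fit TF-IDF vectors on the known documents only, then ask each of the three detectors whether new documents are inliers or outliers. This is a minimal illustration, not the exact configuration of this study — the documents, parameter values, and the SVD step (added because Elliptic Envelope needs fewer features than samples) are assumptions.

```python
# Sketch of the three-detector comparison, assuming scikit-learn.
# Documents and parameters are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope

known_docs = [
    "the council approved the new budget on monday",
    "the budget vote passed after a long council debate",
    "heavy rain flooded several streets in the city",
    "flooding closed roads across the region overnight",
    "the football team won the championship final",
    "fans celebrated the team championship victory",
]
candidate_docs = [
    "the council debated the budget once more",  # close to known topics
    "astronomers discovered a distant exoplanet",  # novel topic
]

# TF-IDF unigram vectors fitted on the known (source) documents only
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X_known_tfidf = vectorizer.fit_transform(known_docs)
X_new_tfidf = vectorizer.transform(candidate_docs)

# Elliptic Envelope estimates a covariance matrix, which needs more
# samples than features, so reduce dimensionality first (assumed step)
svd = TruncatedSVD(n_components=2, random_state=0)
X_known = svd.fit_transform(X_known_tfidf)
X_new = svd.transform(X_new_tfidf)

detectors = {
    "LOF": LocalOutlierFactor(n_neighbors=3, metric="euclidean", novelty=True),
    "IsolationForest": IsolationForest(random_state=0),
    "EllipticEnvelope": EllipticEnvelope(support_fraction=1.0, random_state=0),
}

# predict() returns +1 for inliers (non-novel) and -1 for outliers (novel)
predictions = {}
for name, model in detectors.items():
    model.fit(X_known)
    predictions[name] = model.predict(X_new)
    print(name, predictions[name])
```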


Results

1. Elliptic Envelope

  • Showed balanced but insufficient performance: most predictions were effectively random, highlighting the algorithm's difficulty in finding consistent patterns with which to distinguish new documents.

2. Isolation Forest

  • Proved the least effective at identifying non-novel documents, struggling to detect notable differences between new and already known documents.

3. Local Outlier Factor (LOF)

  • Comparing our results with those reported in the TAPN article, we observed considerable similarities: both approaches rely on lexical features and the Euclidean distance metric for novelty detection. One difference is that our model uses an n-gram size of 1, while TAPN adopts an n-gram size of 3. Despite this, TAPN's lexical approach achieved an F1-score of ~0.66 for the Non-Novelty class and ~0.73 for the Novelty class — a macro-average F1-score just below 0.7, compared to roughly 0.6 in our study.

By employing more sophisticated techniques such as language models, TAPN achieved even higher F1-scores, exceeding 0.7. This indicates significant room for improvement over our LOF results through more advanced approaches.
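The macro-average F1 comparison above, and the per-category evaluation from the objectives, can be sketched as follows. The labels, predictions, and category names are made-up placeholders; predictions are encoded as scikit-learn returns them (+1 for non-novel, -1 for novel).

```python
# Sketch of global vs per-category evaluation, assuming scikit-learn.
# y_true, y_pred, and categories are illustrative placeholders.
from sklearn.metrics import f1_score

y_true = [1, 1, -1, -1, 1, -1, 1, -1]   # ground truth labels
y_pred = [1, -1, -1, -1, 1, 1, 1, -1]   # detector output
categories = ["sports", "sports", "sports", "politics",
              "politics", "politics", "tech", "tech"]

# global macro-average F1 across the Novelty and Non-Novelty classes
macro_f1 = f1_score(y_true, y_pred, average="macro")
print("global macro F1:", round(macro_f1, 3))

# the same metric computed separately for each category
for cat in sorted(set(categories)):
    idx = [i for i, c in enumerate(categories) if c == cat]
    yt = [y_true[i] for i in idx]
    yp = [y_pred[i] for i in idx]
    cat_f1 = f1_score(yt, yp, average="macro", labels=[-1, 1], zero_division=0)
    print(cat, round(cat_f1, 3))
```

Passing `labels=[-1, 1]` keeps the per-category scores comparable even when a category contains only one of the two classes.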


Authors and acknowledgment

  • Author: Martín Ávila Buitrón
  • RA: 2274183
  • GitLab login: mab0205
