Novelty detection in documents is a challenging and highly relevant task in Natural Language Processing (NLP). The goal is to distinguish documents that belong to an already known set from those that exhibit novel or emerging characteristics. This work explores several novelty detection algorithms and compares their results on the TAP-DLND 1.0 benchmark dataset. The results highlight the differences among the evaluated techniques and their implications across different topics, and point to the need to explore more advanced methods.
Each document consists of a `.txt` file with the content and an accompanying `.xml` file containing metadata such as title, publication date, publisher, and other event-related information.
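
As a rough illustration of how such a document pair could be read, here is a minimal Python sketch. It assumes the `.xml` file shares the base name of the `.txt` file and that the metadata fields are leaf elements; neither detail is confirmed by this README, and `load_document` is a hypothetical helper name.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def load_document(txt_path):
    """Read a document's text and its metadata from the sibling .xml file.

    Assumption: the metadata file has the same base name as the .txt file.
    """
    txt_path = Path(txt_path)
    xml_path = txt_path.with_suffix(".xml")

    text = txt_path.read_text(encoding="utf-8")

    # Collect every leaf element as tag -> text, without assuming
    # specific tag names (the actual schema is not described here).
    root = ET.parse(xml_path).getroot()
    metadata = {
        elem.tag: (elem.text or "").strip()
        for elem in root.iter()
        if elem is not root and len(elem) == 0
    }
    return {"text": text, "metadata": metadata}
```

Because the tag names are collected generically, the sketch does not depend on the exact metadata fields present in a given file.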
1. Elliptic Envelope
- Demonstrated a balanced yet insufficient performance. Most predictions were random, highlighting the algorithm's difficulty in identifying consistent patterns to distinguish new documents.
2. Isolation Forest
- Proved to be the least effective for non-novelty identification, struggling to detect notable differences between new and already known documents.
3. Local Outlier Factor (LOF)
- Our results are broadly similar to those presented in the TAPN article. Both approaches employ lexical analysis and the Euclidean distance metric for novelty detection. The main difference is the n-gram size: our model uses 1, while TAPN uses 3. Despite this difference, TAPN achieved an F1-score of ~0.66 for the Non-Novelty class and ~0.73 for the Novelty class. While these results are promising, TAPN's macro-average F1-score remained below 0.7, compared to 0.6 in our study.
By employing more sophisticated techniques such as language models, TAPN achieved F1-scores above 0.7. This indicates significant room for improvement over our LOF approach through more advanced methods.
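
The LOF configuration described above (lexical features, n-gram size 1, Euclidean distance) could be sketched with scikit-learn roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the TF-IDF weighting, the `n_neighbors` value, the toy documents, and the helper names `fit_lof`, `predict_labels`, and `macro_f1` are all hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.neighbors import LocalOutlierFactor


def fit_lof(known_docs, n_neighbors=2):
    """Fit unigram features and a LOF novelty detector on known documents."""
    # ngram_range=(1, 1) matches the n-gram size of 1 mentioned in the text;
    # TF-IDF weighting itself is an assumption.
    vectorizer = TfidfVectorizer(ngram_range=(1, 1))
    X_known = vectorizer.fit_transform(known_docs).toarray()

    # novelty=True lets the fitted model score unseen documents.
    lof = LocalOutlierFactor(
        n_neighbors=n_neighbors, novelty=True, metric="euclidean"
    )
    lof.fit(X_known)
    return vectorizer, lof


def predict_labels(vectorizer, lof, docs):
    """Return +1 for documents judged non-novel and -1 for novel ones."""
    X = vectorizer.transform(docs).toarray()
    return lof.predict(X)


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of the per-class F1-scores."""
    return f1_score(y_true, y_pred, average="macro")


if __name__ == "__main__":
    # Toy data for illustration only; not taken from TAP-DLND 1.0.
    known = [
        "flood hits the coastal city",
        "coastal city flood damages homes",
        "rescue teams reach flood victims",
        "city officials assess flood damage",
        "flood waters recede in the city",
    ]
    candidates = [
        "flood damage in the coastal city",
        "parliament debates new tax reform",
    ]
    vectorizer, lof = fit_lof(known)
    print(predict_labels(vectorizer, lof, candidates))
```

The macro average weights the Novelty and Non-Novelty classes equally, which is why a weak score on one class pulls the overall figure down even when the other class does well.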
- Author: Martín Ávila Buitrón
- RA: 2274183
- GitLab login: mab0205