Unifying Deduplication API Modules #516

praateekmahajan · 2025-02-04T22:01:38Z

SemDedup currently outputs "ids to keep", while Fuzzy/Exact dedup outputs "ids to remove". All of our deduplication modules should expose the same API.
Should the deduplicator modules have identify, remove, and identify_and_remove methods?
Should __call__ behave as identify_and_remove, so that advanced users who need to configure which duplicate in each group to keep (for exact / fuzzy) can call identify and remove separately?
Rename / remove current (Fuzzy)Duplicates in favor of a (Fuzzy)Deduplicator that has both methods

Add a base class called BaseDeduplicator that declares the abstract methods (feel free to suggest)
By default, Exact and Fuzzy will keep one randomly chosen document from each "matched" group; users who have opinions on which duplicate to keep can split the work into separate identify and remove steps
Whether we output ids_to_keep or ids_to_remove will be decided once we learn more about the performance implications in a dask merge
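
To make the proposed shape concrete, here is a minimal sketch of what such a base class could look like. It is not the existing implementation: only the names BaseDeduplicator, identify, remove, and identify_and_remove come from the proposal above, while ExactDeduplicator, the id_field / text_field parameters, and the choice to return groups of duplicate IDs are assumptions made for illustration. It uses plain Python lists of dicts instead of dask DataFrames, picks ids_to_remove arbitrarily (the issue leaves that open), and keeps the first document of each group rather than a random one so the result is deterministic.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class BaseDeduplicator(ABC):
    """Hypothetical base class; method names follow the proposal above."""

    def __init__(self, id_field="id"):
        # id_field is an assumed parameter name, not taken from the issue.
        self.id_field = id_field

    @abstractmethod
    def identify(self, docs):
        """Return groups of duplicate IDs; each group is a list of IDs."""

    def remove(self, docs, ids_to_remove):
        """Return the documents whose ID is not in ids_to_remove."""
        ids_to_remove = set(ids_to_remove)
        return [d for d in docs if d[self.id_field] not in ids_to_remove]

    def identify_and_remove(self, docs):
        """Default policy: keep one document per duplicate group."""
        groups = self.identify(docs)
        # The proposal suggests keeping a random member; this sketch keeps
        # the first one seen so that the output is reproducible.
        ids_to_remove = [doc_id for group in groups for doc_id in group[1:]]
        return self.remove(docs, ids_to_remove)

    def __call__(self, docs):
        return self.identify_and_remove(docs)


class ExactDeduplicator(BaseDeduplicator):
    """Illustrative subclass: documents with identical text are duplicates."""

    def __init__(self, id_field="id", text_field="text"):
        super().__init__(id_field)
        self.text_field = text_field

    def identify(self, docs):
        ids_by_text = defaultdict(list)
        for d in docs:
            ids_by_text[d[self.text_field]].append(d[self.id_field])
        # Only groups with more than one member contain duplicates.
        return [ids for ids in ids_by_text.values() if len(ids) > 1]


docs = [
    {"id": 1, "text": "a"},
    {"id": 2, "text": "b"},
    {"id": 3, "text": "a"},
]
dedup = ExactDeduplicator()

# Default path: __call__ behaves as identify_and_remove.
default_result = dedup(docs)

# Advanced path: the caller decides which duplicate to keep
# (here, the document with the highest ID in each group).
groups = dedup.identify(docs)
custom_ids_to_remove = [i for g in groups for i in sorted(g)[:-1]]
custom_result = dedup.remove(docs, custom_ids_to_remove)
```

With this split, the default call keeps documents 1 and 2, while the custom policy keeps documents 2 and 3. Switching the contract to ids_to_keep would only change what remove accepts, which is why the decision can wait on the merge benchmarks.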


praateekmahajan added the enhancement New feature or request label Feb 4, 2025
