Unifying Deduplication API Modules #516

Open
praateekmahajan opened this issue Feb 4, 2025 · 0 comments
Labels
enhancement New feature or request

Comments

praateekmahajan (Collaborator) commented Feb 4, 2025

  1. SemDedup currently outputs "ids to keep", while Fuzzy/Exact dedup output "ids to remove". All deduplication modules should follow the same output convention.
  2. Should the deduplicator modules expose identify, remove, and identify_and_remove?
  3. __call__ could behave as identify_and_remove, while advanced users who need to configure which duplicate in a group to keep (for exact / fuzzy) can call identify and remove separately?
  4. Rename or remove the current (Fuzzy)Duplicates in favor of a (Fuzzy)Deduplicator that has both methods.
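To make the mismatch in point 1 concrete, here is a minimal pure-Python sketch of the two output conventions. The document layout (dicts with an `id` and `text` field) and the helper names `filter_by_keep` and `filter_by_remove` are assumptions for illustration only, not part of the existing modules.

```python
# Toy dataset; the "id"/"text" layout is an assumption for illustration.
docs = [
    {"id": 1, "text": "hello world"},
    {"id": 2, "text": "hello world"},
    {"id": 3, "text": "something else"},
    {"id": 4, "text": "hello world"},
]

# "ids to remove" convention (what Fuzzy/Exact dedup emit today).
ids_to_remove = {2, 4}

# "ids to keep" convention (what SemDedup emits today).
# Converting from one to the other needs the full set of ids.
all_ids = {d["id"] for d in docs}
ids_to_keep = all_ids - ids_to_remove


def filter_by_keep(docs, ids_to_keep):
    """Hypothetical helper: retain only documents whose id is listed."""
    return [d for d in docs if d["id"] in ids_to_keep]


def filter_by_remove(docs, ids_to_remove):
    """Hypothetical helper: drop documents whose id is listed."""
    return [d for d in docs if d["id"] not in ids_to_remove]


kept_a = filter_by_keep(docs, ids_to_keep)
kept_b = filter_by_remove(docs, ids_to_remove)
```

Both conventions yield the same dataset, but they are not interchangeable without access to the full id set, which is one reason a single convention across modules would simplify downstream code.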

Architectural Design

  1. A base class called BaseDeduplicator that declares the abstract methods (feel free to suggest).
  2. By default, Exact and Fuzzy keep one randomly chosen document from each "matched" group; users who have opinions on which duplicate to keep can split the call into identify and remove.
  3. Whether we output ids_to_keep or ids_to_remove is to be decided as we learn more about the performance implications in a dask merge.
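One possible shape for this design, as a rough sketch rather than a proposed implementation. It uses plain Python lists instead of dask dataframes so it runs standalone. The method signatures, the duplicate-group return format, the toy `ExactDeduplicator`, and the "keep the first id" default (used here instead of a random pick so the output is deterministic) are all assumptions.

```python
import hashlib
from abc import ABC, abstractmethod
from collections import defaultdict


class BaseDeduplicator(ABC):
    """Hypothetical base class; method names follow the proposal above."""

    @abstractmethod
    def identify(self, docs):
        """Return groups of duplicate ids, e.g. [[1, 2, 4]]."""

    def remove(self, docs, ids_to_remove):
        """Drop the documents whose id is in ids_to_remove."""
        drop = set(ids_to_remove)
        return [d for d in docs if d["id"] not in drop]

    def identify_and_remove(self, docs):
        groups = self.identify(docs)
        # Assumed default policy: keep the first id of each group.
        ids_to_remove = [i for group in groups for i in group[1:]]
        return self.remove(docs, ids_to_remove)

    def __call__(self, docs):
        # __call__ is shorthand for the one-step path.
        return self.identify_and_remove(docs)


class ExactDeduplicator(BaseDeduplicator):
    """Toy exact dedup: documents with identical text hash together."""

    def identify(self, docs):
        by_hash = defaultdict(list)
        for d in docs:
            digest = hashlib.sha256(d["text"].encode("utf-8")).hexdigest()
            by_hash[digest].append(d["id"])
        # Only groups with more than one member contain duplicates.
        return [ids for ids in by_hash.values() if len(ids) > 1]


docs = [
    {"id": 1, "text": "hello world"},
    {"id": 2, "text": "hello world"},
    {"id": 3, "text": "something else"},
    {"id": 4, "text": "hello world"},
]

dedup = ExactDeduplicator()

# Default path: one call, default keep policy.
default_result = dedup(docs)

# Advanced path: inspect the groups, then decide which duplicate survives.
groups = dedup.identify(docs)
ids_to_remove = [i for g in groups for i in g if i != max(g)]
custom_result = dedup.remove(docs, ids_to_remove)
```

In this sketch the default path keeps ids 1 and 3, while the custom policy (keep the largest id in each group) keeps ids 3 and 4. Having `identify` return groups rather than a final id list is what lets the caller choose a policy; whether `remove` ultimately takes ids to keep or ids to remove is the open question in point 3.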
praateekmahajan added the enhancement label Feb 4, 2025