SemDedup currently outputs "ids to keep", while Fuzzy/Exact dedup output "ids to remove". All deduplication modules should expose the same API.
Should the deduplicator modules expose identify, remove, and identify_and_remove?
Should __call__ behave as identify_and_remove, with advanced users who need to configure which duplicate in a group to keep (for exact / fuzzy) calling identify and remove separately?
Rename or remove the current (Fuzzy)Duplicates in favor of a (Fuzzy)Deduplicator that has both methods.
Architectural Design
A base class called BaseDeduplicator that declares the abstract methods (suggestions welcome).
By default, Exact and Fuzzy keep one randomly chosen document from each "matched" group; users who have opinions on which duplicate to keep can split the work into identify and remove.
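To make the proposed split concrete, here is a minimal in-memory sketch. It is not the existing implementation: the method signatures, the return type of identify (a list of duplicate-id groups), and the plain-dict document store are all assumptions made for illustration. The default policy keeps the first-seen id in each group rather than a random one, only so that the output is reproducible.

```python
import hashlib
from abc import ABC, abstractmethod
from collections import defaultdict


class BaseDeduplicator(ABC):
    """Hypothetical shared interface; method names follow the proposal."""

    @abstractmethod
    def identify(self, docs):
        """Return groups (lists) of document ids that duplicate each other."""

    @abstractmethod
    def remove(self, docs, ids_to_remove):
        """Return the documents whose ids are not in ids_to_remove."""

    def identify_and_remove(self, docs):
        # Assumed default policy: keep one member per group (first seen here).
        ids_to_remove = [i for group in self.identify(docs) for i in group[1:]]
        return self.remove(docs, ids_to_remove)

    def __call__(self, docs):
        return self.identify_and_remove(docs)


class ExactDeduplicator(BaseDeduplicator):
    """Toy exact-match deduplicator over a {doc_id: text} dict."""

    def identify(self, docs):
        groups = defaultdict(list)
        for doc_id, text in docs.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            groups[digest].append(doc_id)
        # Only groups with more than one member contain duplicates.
        return [ids for ids in groups.values() if len(ids) > 1]

    def remove(self, docs, ids_to_remove):
        drop = set(ids_to_remove)
        return {i: t for i, t in docs.items() if i not in drop}


docs = {1: "hello world", 2: "hello world", 3: "goodbye", 4: "hello world"}
dedup = ExactDeduplicator()

# Simple path: __call__ behaves as identify_and_remove.
default_result = dedup(docs)

# Advanced path: choose which duplicate survives (here, the highest id).
groups = dedup.identify(docs)
ids_to_remove = [i for g in groups for i in g if i != max(g)]
custom_result = dedup.remove(docs, ids_to_remove)
```

In this sketch the keep policy lives entirely between the two calls, so a user with an opinion on which duplicate to keep never has to subclass or configure the deduplicator.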
Whether we output ids_to_keep or ids_to_remove is to be decided as we learn more about the performance implications in a Dask merge.
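For reference while weighing the two options: given the full set of ids, the two outputs are complements of each other, so either one can be derived from the other. The sketch below uses plain Python sets and dicts, not Dask, and the helper names are hypothetical. Filtering by ids_to_keep corresponds to a semi-join, while filtering by ids_to_remove corresponds to an anti-join. It says nothing about which is cheaper in a distributed merge.

```python
def filter_by_keep(docs, ids_to_keep):
    """Semi-join style: retain only documents whose id is listed."""
    keep = set(ids_to_keep)
    return {i: t for i, t in docs.items() if i in keep}


def filter_by_remove(docs, ids_to_remove):
    """Anti-join style: retain only documents whose id is NOT listed."""
    drop = set(ids_to_remove)
    return {i: t for i, t in docs.items() if i not in drop}


docs = {1: "a", 2: "a", 3: "b", 4: "a"}
ids_to_remove = {2, 4}

# Either representation can be recovered from the other and the full id set.
ids_to_keep = set(docs) - ids_to_remove

via_keep = filter_by_keep(docs, ids_to_keep)
via_remove = filter_by_remove(docs, ids_to_remove)
```

When duplicates are a minority of the corpus, ids_to_remove is the smaller of the two lists. Whether that translates into a faster merge is the open question.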