I found a bug in the LSH part of the single-GPU tutorial: https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
According to #430, false_positive_check=True should be passed to LSH in the fuzzy dedup part:
# Run LSH()
lsh = LSH(
    cache_dir=lsh_output_dir,
    num_hashes=minhash_length,
    num_buckets=num_bands,
    buckets_per_shuffle=buckets_per_shuffle,
    false_positive_check=True,
    id_fields=["dataset_id", "doc_id"],
    minhash_field=minhash_field,
    logger=lsh_log_dir,
)
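For anyone unsure what num_hashes and num_buckets control in the call above, here is a minimal pure-Python sketch of the general LSH banding idea. It is not NeMo Curator's implementation (which runs on GPU dataframes), and it does not model false_positive_check. The function names band_signature and lsh_candidates are hypothetical, and the sketch assumes the signature length divides evenly by the number of bands.

```python
from collections import defaultdict


def band_signature(signature, num_buckets):
    """Split a MinHash signature into num_buckets equal-width bands."""
    if len(signature) % num_buckets != 0:
        # Assumption of this sketch only: bands must have equal width.
        raise ValueError("signature length must be divisible by num_buckets")
    width = len(signature) // num_buckets
    return [tuple(signature[i * width:(i + 1) * width]) for i in range(num_buckets)]


def lsh_candidates(signatures, num_buckets):
    """Group document ids whose signatures agree on at least one whole band."""
    buckets = defaultdict(list)
    for doc_id, signature in signatures.items():
        for band_idx, band in enumerate(band_signature(signature, num_buckets)):
            # Documents that share (band index, band values) land in one bucket.
            buckets[(band_idx, band)].append(doc_id)
    # Keep only buckets holding more than one document: the duplicate candidates.
    return {key: ids for key, ids in buckets.items() if len(ids) > 1}


# Toy signatures of length 4 (num_hashes=4), split into 2 bands (num_buckets=2).
signatures = {
    "a": [1, 2, 3, 4],
    "b": [1, 2, 9, 9],
    "c": [7, 7, 7, 7],
}
candidates = lsh_candidates(signatures, num_buckets=2)
print(candidates)
```

In this toy input, documents "a" and "b" agree on the first band only, so they end up as candidates in the same bucket, while "c" matches nothing.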
Thanks for opening this, @yangjingyi. Is this something you'd be interested in putting in a fix for? (No worries if not.)
You're right: after #326 was merged, the argument needs to be added here to ensure that the results can be consumed downstream.
I do plan on refactoring the tutorial a bit more after #386 is merged, but that's still ongoing.