Running paragraph level deduplication on c4 #150
Comments
uh, that is pretty confusing! could you post a sample of the data in your yaml file?
Were you able to resolve this? @andrewhojel
I am trying to run paragraph-level deduplication using the dolma library and wanted to test it on c4. I downloaded `allenai/c4` from huggingface, updated the schema to be `text` (string, doc content), `id` (long, unique id), `source` ("c4"), and saved it as `json.gz` files that are ~250MB/file. Any time I run `dolma -c c4-dedupe.yaml dedupe`, the output attribute is always an empty list. The `yaml` I am using is almost identical to the one provided at `configs/dolma-v1_5/para_dedupe/c4.yaml`.

The machine I am using has 360 vCPU and is running Debian 11, Python 3.10. I tried using `pip install dolma` and downloading the library directly from the repo (neither worked). I built a small example input as I saw in this discussion, which worked totally fine. Pretty confused about this result. I would really appreciate help / any thoughts on why this might be the case.