Running paragraph level deduplication on c4 #150

andrewhojel · 2024-04-20T00:51:08Z

I am trying to run paragraph-level deduplication with the dolma library and wanted to test it on c4. I downloaded allenai/c4 from Hugging Face, updated the schema to text (string, doc content), id (long, unique id), and source ("c4"), and saved it as json.gz files of ~250MB each. Every time I run dolma -c c4-dedupe.yaml dedupe, the output attribute is an empty list. Here is the yaml I am using (almost identical to the one provided at configs/dolma-v1_5/para_dedupe/c4.yaml):

documents:
  - /home/c4/v0/documents/*.gz

dedupe:
  name: dedupe_paragraphs
  paragraphs:
    attribute_name: bff_duplicate_paragraph_spans
  skip_empty: true

bloom_filter:
  file: /tmp/c4.bloom
  read_only: false
  estimated_doc_count: 30000000000
  desired_false_positive_rate: 1e-06

processes: 350

The machine I am using has 360 vCPUs and runs Debian 11 with Python 3.10. I tried both pip install dolma and installing the library directly from the repo (neither worked). I built a small example input like the one I saw in this discussion, and that worked totally fine. Pretty confused about this result.

I would really appreciate any help or thoughts on why this might be happening.
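For anyone reading the bloom_filter settings in the config, here is a rough sketch of what estimated_doc_count and desired_false_positive_rate imply for filter size. It uses the textbook Bloom filter formulas, not dolma's own sizing code, which may round or cap the size differently; the function names are made up for illustration.

```python
import math


def bloom_size_bits(n_items: int, fp_rate: float) -> int:
    """Textbook optimal bit count: m = -n * ln(p) / (ln 2)^2.

    Assumption: a standard Bloom filter; dolma's actual sizing may differ.
    """
    return math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))


def optimal_hash_count(n_items: int, m_bits: int) -> int:
    """Textbook optimal number of hash functions: k = (m / n) * ln 2."""
    return max(1, round(m_bits / n_items * math.log(2)))


# Values taken from the yaml in this issue.
n = 30_000_000_000
p = 1e-06

bits = bloom_size_bits(n, p)
gib = bits / 8 / 2**30
hashes = optimal_hash_count(n, bits)

print(f"{gib:.1f} GiB, {hashes} hash functions")
```

Under these textbook assumptions the filter comes out at roughly 100 GiB, so it may be worth checking that the machine has that much memory and that /tmp has room for the file.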


soldni · 2024-05-08T07:46:00Z

Uh, that is pretty confusing! Could you post a sample of the data that your yaml file points to?
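A minimal sketch for pulling such a sample, assuming the .gz files hold one JSON object per line (JSON lines) and that paragraphs are separated by newlines in the text field. Both are assumptions about the data, and sample_records and summarize are hypothetical helper names, not part of dolma.

```python
import gzip
import json
from itertools import islice


def sample_records(path, n=3):
    """Return the first n records from a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in islice(f, n)]


def summarize(record):
    """Report each field's Python type and how many newline-separated
    paragraphs the `text` field contains (None if `text` is not a string)."""
    text = record.get("text", "")
    return {
        "fields": {key: type(value).__name__ for key, value in record.items()},
        "paragraphs": len(text.split("\n")) if isinstance(text, str) else None,
    }


# Example usage (replace the placeholder with one of the input files):
# for rec in sample_records("<one of the files under /home/c4/v0/documents/>"):
#     print(summarize(rec))
```

Posting the summaries for a few records would show the actual field types and whether the text contains any newlines at all.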

riturajj-cerebras · 2024-05-22T06:49:12Z

Were you able to resolve this? @andrewhojel
