I'm reaching out to better understand some aspects of the Dolma dataset, especially regarding deduplication strategies and content categorization. We’re exploring how we might contribute to or build upon this impressive resource, so any insights would be greatly appreciated.
1. Temporal Deduplication
Does Dolma perform deduplication across different time points? For example:
If a webpage with the same URL (or near-identical content) appears in multiple years (e.g., 2013, 2014, 2015), is it deduplicated?
If so:
Which version is retained by default: the earliest, the latest, or the one judged to be of highest quality?
Are there specific tools or metrics used to evaluate content equivalence over time?
2. Proposed Deduplication Strategy
We’re considering implementing a deduplication strategy based on the following rules:
Default Rule: Retain only the first occurrence of a webpage.
Exception Rule: Keep a subsequent crawl if:
The content has undergone significant modification (e.g., expanded depth/breadth), or
The new version is of higher quality.
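To make the rules above concrete, here is a minimal sketch of what we have in mind. Everything in it is an assumption for illustration only: the `Crawl` record, the `quality` score (from whatever classifier one prefers), the `difflib`-based change test, and both thresholds. None of it reflects Dolma's actual tooling.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Crawl:
    url: str
    year: int
    text: str
    quality: float  # hypothetical score in [0, 1] from any quality classifier


def changed_significantly(old: str, new: str, threshold: float = 0.8) -> bool:
    # Placeholder test: character-level similarity below an arbitrary threshold.
    return SequenceMatcher(None, old, new).ratio() < threshold


def select_versions(crawls, quality_margin=0.1):
    """Keep the first crawl per URL, plus later crawls that meet an exception rule."""
    kept = {}  # url -> list of retained crawls, oldest first
    for crawl in sorted(crawls, key=lambda c: (c.url, c.year)):
        versions = kept.setdefault(crawl.url, [])
        if not versions:
            versions.append(crawl)  # default rule: first occurrence
            continue
        last = versions[-1]  # compare against the most recently retained version
        if (changed_significantly(last.text, crawl.text)
                or crawl.quality > last.quality + quality_margin):
            versions.append(crawl)  # exception rule: modified or higher quality
    return kept
```

Comparing each crawl against the most recently retained version (rather than only the first) is one possible design choice; it lets gradual edits accumulate until they cross the threshold.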
Could the community share thoughts on this approach? Specifically:
Does this align with best practices?
Are there known pitfalls or alternative strategies we should consider?
Has something like this already been implemented or is it under development in Dolma?
If not, are there recommended tools or toolchains for implementing such a strategy at scale?
How can one define "significant content difference" or "higher quality"? E.g.:
Semantic similarity thresholds?
Content length or structural changes?
Quality heuristics (e.g., readability, domain authority)?
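As one possible starting point for "significant content difference", the sketch below combines a lexical similarity measure (Jaccard similarity over word 3-grams) with a length-growth check. This is a lexical proxy, not true semantic similarity, and the function names and thresholds (`max_similarity`, `min_growth`) are arbitrary assumptions of ours rather than anything taken from Dolma.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles in the text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two texts' shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


def is_significant_change(old: str, new: str,
                          max_similarity: float = 0.7,
                          min_growth: float = 1.5) -> bool:
    """Flag a change if the texts overlap little or the new one is much longer."""
    grew = len(new.split()) >= min_growth * max(len(old.split()), 1)
    return jaccard(old, new) < max_similarity or grew
```

At web scale, exact Jaccard over all pairs is too expensive, so an approximation such as MinHash would probably be needed; the sketch only illustrates the decision rule.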
3. Document Categories
Lastly, could anyone confirm whether Dolma includes documents from the following categories, and how they are identified or tagged?
Academic papers, including:
Computer Science
Mathematics, Physics, Medical Sciences
Economics and Finance
Other Humanities and Social Sciences
News articles, particularly:
Financial and Economic news
Political and Societal news
Other news categories
Textbooks
Any information on how these types are represented or labeled in the dataset would be extremely helpful.
Thank you all very much for your time and contributions to this project. Looking forward to hearing your thoughts!