
LLaMa 2 Pre-training data #1

Open · kibitzing opened this issue Jun 16, 2024 · 7 comments

kibitzing commented Jun 16, 2024

What kind of data was used to train LLaMa 2?

kibitzing self-assigned this on Jun 16, 2024
kibitzing commented Jun 16, 2024

LLaMa 2

  1. Pre-training data
  • whole data
    • quantity: 2 trillion tokens of data
    • breakdown by data source
    • data filtering method
  • up-sampled the most factual sources in an effort to increase knowledge and dampen hallucinations (see the sampling sketch below)

  • includes a new mix of data from publicly available sources, which does not include data from Meta’s products or services

  • removed data from certain sites known to contain a high volume of personal information about private individuals
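
Meta does not publish the actual mixture weights, so as a rough illustration of what up-sampling factual sources could look like in practice, here is a minimal sketch with made-up source names and weights (they are assumptions, not LLaMa 2’s real mixture):

```python
import random

# Hypothetical per-source sampling weights; factual sources get weight > 1.0.
# The real LLaMa 2 data mixture and weights are not public.
SOURCE_WEIGHTS = {
    "wikipedia": 2.0,      # up-sampled: highly factual
    "books": 1.5,
    "common_crawl": 1.0,   # baseline
}

def sample_source(weights):
    """Pick a data source with probability proportional to its weight."""
    sources = list(weights)
    return random.choices(sources, weights=[weights[s] for s in sources], k=1)[0]

# Each training example is drawn from a weighted source, so a weight-2.0
# source contributes roughly twice the examples of a weight-1.0 source.
counts = {s: 0 for s in SOURCE_WEIGHTS}
for _ in range(10_000):
    counts[sample_source(SOURCE_WEIGHTS)] += 1
print(counts)  # roughly {'wikipedia': 4440, 'books': 3330, 'common_crawl': 2220}
```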

kibitzing commented Jun 16, 2024

It is important to understand what is in the pretraining data both to increase transparency and to shed light on root causes of potential downstream issues, such as potential biases.

We followed Meta’s standard privacy and legal review processes for each dataset used in training.

  • We did not use any Meta user data in training
  • We excluded data from certain sites known to contain a high volume of personal information about private individuals (see the filtering sketch after this list).
  • No additional filtering was conducted on the datasets, to allow Llama 2 to be more widely usable across tasks
    • e.g., it can be better used for hate speech classification
    • this avoids the potential for accidental demographic erasure sometimes caused by over-scrubbing
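
Meta has not named the excluded sites, but mechanically this kind of exclusion is typically a domain blocklist applied before training. A minimal sketch, assuming documents carry their source URL (the blocked domains below are placeholders):

```python
from urllib.parse import urlparse

# Placeholder blocklist: Meta has not published which sites were excluded,
# only that they contain a high volume of personal information about
# private individuals.
BLOCKED_DOMAINS = {"people-search.example.com", "social.example.net"}

def keep_document(doc):
    """Keep a document unless its source URL is on a blocked domain."""
    host = urlparse(doc["url"]).hostname or ""
    # Match the blocked domain itself or any of its subdomains.
    return not any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)

docs = [
    {"url": "https://en.wikipedia.org/wiki/Llama", "text": "..."},
    {"url": "https://people-search.example.com/jane-doe", "text": "..."},
]
print([d["url"] for d in docs if keep_document(d)])
# ['https://en.wikipedia.org/wiki/Llama']
```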

kibitzing commented Jun 16, 2024

Demographic Representation

Pronouns

[Screenshot omitted: table of pronoun group document frequencies]
  • She: "she", "her", "hers", "herself"
  • He: "he", "him", "his", "himself"
  • Unknown: "they", "them", "their", "theirs", "theirself", "themself", "themselves"
  • 1st-person: "I", "me", "my", "mine", "myself", "we", "us", "our", "ours", "ourselves"
  • 2nd-person: "you", "your", "yours", "yourself", "yourselves"
  • 3rd-person: "she", "her", "hers", "herself", "he", "him", "his", "himself", "they", "them", "their", "theirs", "theirself", "themself", "themselves", "it", "its", "itself"

Identities

  • We compute frequencies for each descriptor term in the pretraining corpus. We group descriptors into 5 axes (Religion, Gender and Sex, Nationality, Race and Ethnicity, and Sexual Orientation); a counting sketch follows at the end of this comment.
[Screenshot omitted: top descriptor terms per demographic axis]
  • we remove a few terms (from the table) such as “straight,” “white,” and “black,” because these terms have frequent uses beyond demographic mentions (e.g., as basic color terms)
  • while She pronouns are mentioned in fewer documents, the term “female” is present in a larger percentage of documents. This could imply that although there is less frequent context around She pronouns, comments about “females” are more prevalent
  • For Nationality, Race and Ethnicity, and Religion, we observe a Western skew:
    • the term “American” is mentioned in 69.4% of the nationality references, and the term “European” is more prevalent than other race and ethnicity terms
    • “Christian” is the most represented religion, followed by “Catholic” and “Jewish”
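
A minimal sketch of how such per-group document frequencies might be computed, using the pronoun groups defined above (the regex tokenization here is an assumption; the paper does not specify its exact matching rules). Descriptor terms for the five axes could be counted the same way:

```python
import re

# Pronoun groups as defined above (subset shown); descriptor term lists for
# the five demographic axes would be counted the same way.
PRONOUN_GROUPS = {
    "She": {"she", "her", "hers", "herself"},
    "He": {"he", "him", "his", "himself"},
    "Unknown": {"they", "them", "their", "theirs", "theirself", "themself", "themselves"},
}

def document_frequency(docs, groups):
    """Percentage of documents containing at least one term from each group."""
    hits = {name: 0 for name in groups}
    for doc in docs:
        # Assumed tokenization: lowercase, split on non-letter characters.
        tokens = set(re.findall(r"[a-z']+", doc.lower()))
        for name, terms in groups.items():
            if tokens & terms:
                hits[name] += 1
    return {name: 100.0 * n / len(docs) for name, n in hits.items()}

docs = ["She said they would arrive soon.", "He fixed it himself."]
print(document_frequency(docs, PRONOUN_GROUPS))
# {'She': 50.0, 'He': 50.0, 'Unknown': 50.0}
```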

kibitzing changed the title from llama familiy to LLaMa 2 on Jun 16, 2024
kibitzing commented Jun 16, 2024

Data Toxicity

  • We score each line of a document separately and average them to assign a document score.
    -> The toxicity can be diluted when averaged (see the numeric sketch below)
  [Screenshot omitted: Figure 13, distribution of toxicity scores]
  • Figure 13 shows the distribution of scores in a 10% random sample of the full corpus.
  • About 0.2% of documents evaluated are assigned a likelihood score of 0.5 or higher, meaning there is a small amount of toxicity in our pretraining data.
    -> Maybe, but maybe not: a score of 0.5 might mean that only half of the document is toxic speech, and documents that are, say, 49% toxic are not counted at all.
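
A tiny numeric illustration of the dilution concern (the scores below are made up, not from the paper):

```python
# Made-up per-line toxicity likelihoods for one document: two clearly toxic
# lines among eight clean ones (a real pipeline would get these from a
# toxicity classifier; the numbers here are illustrative only).
line_scores = [0.95, 0.90, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Document score = mean of its line scores, as described above.
doc_score = sum(line_scores) / len(line_scores)

print(doc_score)         # 0.185
print(doc_score >= 0.5)  # False: the document falls below a 0.5 cutoff
                         # even though 20% of its lines are highly toxic.
```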

kibitzing commented Jun 16, 2024

Language identification

  • languages shown are subsetted to those found in more than 0.005% of the documents
  • uses fastText with a threshold of 0.5 for the language detection (see the usage sketch below)
  [Screenshot omitted: table of language distribution in the pretraining corpus]

https://fasttext.cc/docs/en/language-identification.html
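
A minimal usage sketch of the linked fastText language-identification model with the 0.5 threshold; it assumes the `fasttext` Python package is installed and the pre-trained `lid.176.bin` model file from the page above has been downloaded:

```python
import fasttext

# lid.176.bin is the pre-trained language-identification model from the
# fastText page linked above.
model = fasttext.load_model("lid.176.bin")

def detect_language(text, threshold=0.5):
    """Return the predicted language code, or None if the top
    prediction falls below the threshold."""
    # fastText's predict() rejects newlines, so flatten the text first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    lang = labels[0].replace("__label__", "")
    return lang if probs[0] >= threshold else None

print(detect_language("The quick brown fox jumps over the lazy dog."))  # 'en'
```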


kibitzing commented Jun 16, 2024

Pre-training data summary:

  1. They did not use Meta’s user data.
  2. They filtered out web pages from certain sites known to contain a high volume of personal information about private individuals.
     • SNS like Twitter and LinkedIn?
  3. Beyond that, they did not filter out any other pre-training data.
  4. Instead, they provided multiple analyses of the data.

kibitzing changed the title from LLaMa 2 to LLaMa 2 Pre-training data on Jun 17, 2024