
LLaMa 2 Pre-training data #1

Open · kibitzing opened this issue Jun 16, 2024 · 7 comments

kibitzing commented Jun 16, 2024

What kind of data was used to train LLaMa 2?

kibitzing self-assigned this on Jun 16, 2024
kibitzing commented Jun 16, 2024

LLaMa 2

  1. Pre-training data
  • whole data
    • quantity: 2 trillion tokens of data
    • breakdown by data source
    • data filtering method
  • up-sampled the most factual sources in an effort to increase knowledge and dampen hallucinations (see the sampling sketch below)

  • includes a new mix of data from publicly available sources, which does not include data from Meta’s products or services

  • removed data from certain sites known to contain a high volume of personal information about private individuals
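
Meta does not publish the actual mixture weights, so as a rough illustration of what up-sampling factual sources could look like in practice, here is a minimal sketch with made-up source names and weights (they are assumptions, not LLaMa 2’s real mixture):

```python
import random

# Hypothetical per-source sampling weights; factual sources get weight > 1.0.
# The real LLaMa 2 data mixture and weights are not public.
SOURCE_WEIGHTS = {
    "wikipedia": 2.0,      # up-sampled: highly factual
    "books": 1.5,
    "common_crawl": 1.0,   # baseline
}

def sample_source(weights):
    """Pick a data source with probability proportional to its weight."""
    sources = list(weights)
    return random.choices(sources, weights=[weights[s] for s in sources], k=1)[0]

# Each training example is drawn from a weighted source, so a weight-2.0
# source contributes roughly twice the examples of a weight-1.0 source.
counts = {s: 0 for s in SOURCE_WEIGHTS}
for _ in range(10_000):
    counts[sample_source(SOURCE_WEIGHTS)] += 1
print(counts)  # roughly {'wikipedia': 4440, 'books': 3330, 'common_crawl': 2220}
```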

kibitzing commented Jun 16, 2024

It is important to understand what is in the pretraining data both to increase transparency and to shed light on root causes of potential downstream issues, such as potential biases.

We followed Meta’s standard privacy and legal review processes for each dataset used in training.

  • We did not use any Meta user data in training
  • We excluded data from certain sites known to contain a high volume of personal information about private individuals (see the filtering sketch after this list).
  • No additional filtering was conducted on the datasets, to allow Llama 2 to be more widely usable across tasks
    • e.g., it can be better used for hate speech classification
    • this avoids the potential for accidental demographic erasure sometimes caused by over-scrubbing
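
Meta has not named the excluded sites, but mechanically this kind of exclusion is typically a domain blocklist applied before training. A minimal sketch, assuming documents carry their source URL (the blocked domains below are placeholders):

```python
from urllib.parse import urlparse

# Placeholder blocklist: Meta has not published which sites were excluded,
# only that they contain a high volume of personal information about
# private individuals.
BLOCKED_DOMAINS = {"people-search.example.com", "social.example.net"}

def keep_document(doc):
    """Keep a document unless its source URL is on a blocked domain."""
    host = urlparse(doc["url"]).hostname or ""
    # Match the blocked domain itself or any of its subdomains.
    return not any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)

docs = [
    {"url": "https://en.wikipedia.org/wiki/Llama", "text": "..."},
    {"url": "https://people-search.example.com/jane-doe", "text": "..."},
]
print([d["url"] for d in docs if keep_document(d)])
# ['https://en.wikipedia.org/wiki/Llama']
```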

kibitzing commented Jun 16, 2024

Demographic Representation

Pronouns

[Screenshot omitted: table of pronoun group document frequencies]
  • She: "she", "her", "hers", "herself"
  • He: "he", "him", "his", "himself"
  • Unknown: "they", "them", "their", "theirs", "theirself", "themself", "themselves"
  • 1st-person: "I", "me", "my", "mine", "myself", "we", "us", "our", "ours", "ourselves"
  • 2nd-person: "you", "your", "yours", "yourself", "yourselves"
  • 3rd-person: "she", "her", "hers", "herself", "he", "him", "his", "himself", "they", "them", "their", "theirs", "theirself", "themself", "themselves", "it", "its", "itself"

Identities

  • We compute frequencies for each descriptor term in the pretraining corpus. We group descriptors into 5 axes (Religion, Gender and Sex, Nationality, Race and Ethnicity, and Sexual Orientation); a counting sketch follows at the end of this comment.
[Screenshot omitted: top descriptor terms per demographic axis]
  • we remove a few terms (from the table) such as “straight,” “white,” and “black,” because these terms have frequent uses beyond demographic mentions (e.g., as basic color terms)
  • while She pronouns are mentioned in fewer documents, the term “female” is present in a larger percentage of documents. This could imply that although there is less frequent context around She pronouns, comments about “females” are more prevalent
  • For Nationality, Race and Ethnicity, and Religion, we observe a Western skew:
    • the term “American” is mentioned in 69.4% of the nationality references, and the term “European” is more prevalent than other race and ethnicity terms
    • “Christian” is the most represented religion, followed by “Catholic” and “Jewish”
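
A minimal sketch of how such per-group document frequencies might be computed, using the pronoun groups defined above (the regex tokenization here is an assumption; the paper does not specify its exact matching rules). Descriptor terms for the five axes could be counted the same way:

```python
import re

# Pronoun groups as defined above (subset shown); descriptor term lists for
# the five demographic axes would be counted the same way.
PRONOUN_GROUPS = {
    "She": {"she", "her", "hers", "herself"},
    "He": {"he", "him", "his", "himself"},
    "Unknown": {"they", "them", "their", "theirs", "theirself", "themself", "themselves"},
}

def document_frequency(docs, groups):
    """Percentage of documents containing at least one term from each group."""
    hits = {name: 0 for name in groups}
    for doc in docs:
        # Assumed tokenization: lowercase, split on non-letter characters.
        tokens = set(re.findall(r"[a-z']+", doc.lower()))
        for name, terms in groups.items():
            if tokens & terms:
                hits[name] += 1
    return {name: 100.0 * n / len(docs) for name, n in hits.items()}

docs = ["She said they would arrive soon.", "He fixed it himself."]
print(document_frequency(docs, PRONOUN_GROUPS))
# {'She': 50.0, 'He': 50.0, 'Unknown': 50.0}
```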

kibitzing changed the title from llama familiy to LLaMa 2 on Jun 16, 2024
kibitzing commented Jun 16, 2024

Data Toxicity

  • We score each line of a document separately and average them to assign a document score.
    -> The toxicity can be diluted when averaged (see the numeric sketch below)
  [Screenshot omitted: Figure 13, distribution of toxicity scores]
  • Figure 13 shows the distribution of scores in a 10% random sample of the full corpus.
  • About 0.2% of documents evaluated are assigned a likelihood score of 0.5 or higher, meaning there is a small amount of toxicity in our pretraining data.
    -> Maybe, but maybe not: a score of 0.5 might mean that only half of the document is toxic speech, and documents that are, say, 49% toxic are not counted at all.
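
A tiny numeric illustration of the dilution concern (the scores below are made up, not from the paper):

```python
# Made-up per-line toxicity likelihoods for one document: two clearly toxic
# lines among eight clean ones (a real pipeline would get these from a
# toxicity classifier; the numbers here are illustrative only).
line_scores = [0.95, 0.90, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Document score = mean of its line scores, as described above.
doc_score = sum(line_scores) / len(line_scores)

print(doc_score)         # 0.185
print(doc_score >= 0.5)  # False: the document falls below a 0.5 cutoff
                         # even though 20% of its lines are highly toxic.
```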

kibitzing commented Jun 16, 2024

Language identification

  • languages shown are subsetted to those found in more than 0.005% of the documents
  • uses fastText with a threshold of 0.5 for the language detection (see the usage sketch below)
  [Screenshot omitted: table of language distribution in the pretraining corpus]

https://fasttext.cc/docs/en/language-identification.html
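
A minimal usage sketch of the linked fastText language-identification model with the 0.5 threshold; it assumes the `fasttext` Python package is installed and the pre-trained `lid.176.bin` model file from the page above has been downloaded:

```python
import fasttext

# lid.176.bin is the pre-trained language-identification model from the
# fastText page linked above.
model = fasttext.load_model("lid.176.bin")

def detect_language(text, threshold=0.5):
    """Return the predicted language code, or None if the top
    prediction falls below the threshold."""
    # fastText's predict() rejects newlines, so flatten the text first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    lang = labels[0].replace("__label__", "")
    return lang if probs[0] >= threshold else None

print(detect_language("The quick brown fox jumps over the lazy dog."))  # 'en'
```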


kibitzing commented Jun 16, 2024

Pre-training data summary:

  1. They did not use Meta’s user data.
  2. They filtered out web pages from certain sites known to contain a high volume of personal information about private individuals.
     • SNS like Twitter and LinkedIn?
  3. Beyond that, they did not filter out any other pre-training data.
  4. Instead, they provided multiple analyses of the data.

kibitzing changed the title from LLaMa 2 to LLaMa 2 Pre-training data on Jun 17, 2024