Google Research Datasets


172 repositories

artydiqa
Public
ArTyDi-QA is a dataset for Question Answering (QA) and Question Generation (QG) in Modern Standard Arabic (MSA), adapted from TyDiQA. It features extractive QA where models find answer spans or identify unanswerable questions, and a QG task involving formulating questions from context and answer pairs.
0•0•0•0•Updated Dec 18, 2025
ssa-ai-terminologies
Public
This dataset provides a glossary of AI terms in Swahili, Zulu, Xhosa, Afrikaans, English (as the common core), and other languages widely spoken in Africa. The glossary is a JSON file covering “Basic” and “Advanced” levels and is intended to improve AI literacy.
HTML
•
Creative Commons Attribution Share Alike 4.0 International
•1•3•0•0•Updated Dec 17, 2025
MGSM-Rev2
Public
To improve the MGSM benchmark, we corrected two erroneous English questions and rephrased others to remove ambiguity. We then used Gemini to retranslate all questions and to verify that every question in the benchmark is now answerable.
0•0•0•0•Updated Nov 10, 2025
wit-retrieval
Public
Other
•0•5•1•0•Updated Oct 13, 2025
Amplify_SSA
Public
An annotated dataset of 9,003 adversarial queries in seven Sub-Saharan African languages.
Jupyter Notebook
•3•3•0•0•Updated Sep 17, 2025
cultural_familiarity_annotations
Public
The dataset consists of AI-generated stories and accompanying human ratings of their cultural fluency and relevance.
Apache License 2.0
•0•1•0•0•Updated Aug 6, 2025
tydiqa-wana
Public
Jupyter Notebook
•
Apache License 2.0
•0•0•0•0•Updated Jul 30, 2025
conceptual-12m
Public
Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.
vision-and-language pre-training multimodal-dataset
Other
•18•411•6•0•Updated Jul 14, 2025
sanpo_dataset
Public
Python
•
Apache License 2.0
•1•48•5•3•Updated Jun 27, 2025
common-crawl-domain-names
Public
Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").
MIT License
•1•20•1•0•Updated Jun 16, 2025
rag_conflicts
Public
CONFLICTS is a QA dataset annotated with knowledge conflict types. Each instance comprises a query, a set of retrieved relevant passages, a corresponding conflict type label, and, for specific types, the ground-truth correct answer.
Apache License 2.0
•1•13•1•0•Updated Jun 11, 2025
egotempo
Public
Jupyter Notebook
•
Creative Commons Attribution 4.0 International
•0•26•3•0•Updated Apr 26, 2025
web-images
Public
Images gathered from the Internet in 2023 and some metadata
HTML
•
Other
•1•3•0•0•Updated Mar 19, 2025
screen_qa
Public
The ScreenQA dataset was introduced in the "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots" paper. It contains ~86K question-answer pairs collected by human annotators for ~35K screenshots from Rico. It is intended for training and evaluating models capable of screen content understanding via question answering.
Python
•
Creative Commons Attribution 4.0 International
•9•134•4•0•Updated Feb 7, 2025
adversarial-nibbler
Public
This dataset contains results from all rounds of Adversarial Nibbler. This data includes adversarial prompts fed into public generative text2image models and validations for unsafe images. There will be two sets of data: all prompts submitted and all prompts attempted (sent to t2i models but not submitted as unsafe).
Creative Commons Attribution 4.0 International
•4•25•0•0•Updated Feb 3, 2025
cube
Public
CUBE is a benchmark for evaluating the cultural competence of text-to-image (T2I) models.
Creative Commons Attribution 4.0 International
•1•8•3•0•Updated Jan 20, 2025
global_streamflow_model_paper
Public
Jupyter Notebook
•
Apache License 2.0
•17•66•3•0•Updated Jan 17, 2025
hiertext
Public
The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word-, line-, and paragraph-level annotations.
Jupyter Notebook
•
Creative Commons Attribution Share Alike 4.0 International
•28•301•0•1•Updated Dec 2, 2024
scin
Public
The SCIN dataset contains 10,000+ images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-reported demographic and symptom information and dermatologist labels. The dataset also contains estimated Fitzpatrick skin type and Monk Skin Tone.
Jupyter Notebook
•
Other
•16•149•2•0•Updated Nov 23, 2024
MISeD
Public
MISeD (Meeting Information Seeking Dialogs dataset) is an information-seeking dialog dataset focused on meeting transcripts. It includes 432 dialogs over transcripts from the QMSum dataset. MISeD is described in detail in the paper: Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts.
3•13•0•0•Updated Nov 20, 2024
uicrit
Public
UICrit is a dataset containing human-generated natural language design critiques, corresponding bounding boxes for each critique, and design quality ratings for 1,000 mobile UIs from RICO. This dataset was collected for our UIST '24 paper: https://arxiv.org/abs/2407.08850.
0•26•1•0•Updated Nov 19, 2024
WordGraph
Public
The WordGraph dataset contains multilingual lexicon entries linked to Wikipedia entities, focusing on human-denoting names and demonym adjectives. Each lexicon entry contains inflected word forms and morphological information for all locales.
Creative Commons Zero v1.0 Universal
•1•1•0•0•Updated Nov 7, 2024
Education-Dialogue-Dataset
Public archive
Dataset of conversations generated by prompting Gemini Ultra. These are conversations between a teacher and a student, where the teacher is prompted with a specific topic to teach the student, and the student is prompted with their learning preferences. https://arxiv.org/abs/2405.14655
5•31•1•0•Updated Oct 29, 2024
GeniL
Public
The GeniL dataset is an effort to detect various types of generalization in language. This multilingual dataset covers sentences in EN, FR, ES, PT, AR, HI, BN, MS, and ID and is annotated by native speakers of each language. Each sentence is collected from a public language corpus and contains at least one identity group name and an attribute.
Creative Commons Attribution 4.0 International
•0•3•0•0•Updated Oct 18, 2024
tap-typing-with-touch-sensing-images
Public archive
The Tap Typing with Touch Sensing Images (TSI) dataset contains data of user taps on a mobile touchscreen keyboard, including elliptical features and capacitive sensing images of the taps. The dataset aligns each tap with a key the user intended to type during data collection so it can be used for keyboard decoder training and/or evaluation.
Creative Commons Attribution 4.0 International
•1•2•0•0•Updated Oct 15, 2024
mittens
Public
Datasets for measuring misgendering in translation
Other
•0•5•0•0•Updated Oct 4, 2024
wit
Public archive
The WIT (Wikipedia-based Image Text) dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.
multilingual nlp machine-learning wikipedia multimodal cc-by-sa-3
Other
•45•1.1k•1•0•Updated Sep 27, 2024
C4_200M-synthetic-dataset-for-grammatical-error-correction
Public
This dataset contains synthetic training data for grammatical error correction. The corpus is generated by corrupting clean sentences from C4 using a tagged corruption model. The approach and the dataset are described in more detail by Stahlberg and Kumar (2021) (https://www.aclweb.org/anthology/2021.bea-1.4/).
Python
•
Creative Commons Attribution 4.0 International
•23•162•0•0•Updated Sep 24, 2024
SeeGULL-Multilingual
Public archive
SeeGULL Multilingual is a multilingual and multicultural dataset of stereotypes. It consists of stereotypes in 20 languages with human annotations across 23 languages, including annotations on their degree of offensiveness.
Creative Commons Attribution 4.0 International
•1•8•0•0•Updated Sep 19, 2024
ToTTo
Public
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. We hope it can serve as a useful research benchmark for high-precision conditional text generation.
37•460•6•0•Updated Sep 11, 2024