    Repositories list

    • artydiqa

      Public
      ArTyDi-QA is a dataset for Question Answering (QA) and Question Generation (QG) in Modern Standard Arabic (MSA), adapted from TyDiQA. It features extractive QA where models find answer spans or identify unanswerable questions, and a QG task involving formulating questions from context and answer pairs.
      0000 Updated Dec 18, 2025
    • This dataset provides a glossary of AI terms in Swahili, Zulu, Xhosa, Afrikaans, English (as the common core), and other languages widely spoken in Africa. It's a JSON file, covering “Basic” and “Advanced” levels, to improve AI literacy.
      HTML
      1300 Updated Dec 17, 2025
    • MGSM-Rev2

      Public
      To improve the MGSM benchmark, we corrected two erroneous English questions and rephrased others to remove ambiguity. We then used Gemini to retranslate all questions and to verify that every question in the benchmark is now answerable.
      0000 Updated Nov 10, 2025
    • 0510 Updated Oct 13, 2025
    • An annotated dataset of 9,003 adversarial queries in seven Sub-Saharan African languages.
      Jupyter Notebook
      3300 Updated Sep 17, 2025
    • The dataset consists of AI-generated stories and accompanying human ratings of their cultural fluency and relevance.
      0100 Updated Aug 6, 2025
    • Jupyter Notebook
      0000 Updated Jul 30, 2025
    • Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.
      1841160 Updated Jul 14, 2025
    • Python
      14853 Updated Jun 27, 2025
    • Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").
      12010 Updated Jun 16, 2025
    • CONFLICTS is a QA dataset annotated with knowledge conflict types. Each instance comprises a query, a set of retrieved relevant passages, a corresponding conflict type label, and, for specific types, the ground-truth correct answer.
      11310 Updated Jun 11, 2025
    • egotempo

      Public
      Jupyter Notebook
      02630 Updated Apr 26, 2025
    • Images gathered from the Internet in 2023 and some metadata
      HTML
      1300 Updated Mar 19, 2025
    • screen_qa

      Public
      The ScreenQA dataset was introduced in the paper "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots". It contains ~86K question-answer pairs collected by human annotators for ~35K screenshots from Rico. It is intended for training and evaluating models capable of screen content understanding via question answering.
      Python
      913440 Updated Feb 7, 2025
    • This dataset contains results from all rounds of Adversarial Nibbler. This data includes adversarial prompts fed into public generative text2image models and validations for unsafe images. There will be two sets of data: all prompts submitted and all prompts attempted (sent to t2i models but not submitted as unsafe).
      42500 Updated Feb 3, 2025
    • cube

      Public
      CUBE is a benchmark for evaluating the cultural competence of text-to-image (T2I) models.
      1830 Updated Jan 20, 2025
    • Jupyter Notebook
      176630 Updated Jan 17, 2025
    • hiertext

      Public
      The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word-, line-, and paragraph-level annotations.
      Jupyter Notebook
      2830101 Updated Dec 2, 2024
    • scin

      Public
      The SCIN dataset contains 10,000+ images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-reported demographic and symptom information and dermatologist labels. The dataset also contains estimated Fitzpatrick skin type and Monk Skin Tone.
      Jupyter Notebook
      1614920 Updated Nov 23, 2024
    • MISeD

      Public
      MISeD (Meeting Information Seeking Dialogs dataset) is an information-seeking dialog dataset focused on meeting transcripts. It includes 432 dialogs over transcripts from the QMSum dataset. MISeD is described in detail in the paper: Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts.
      31300 Updated Nov 20, 2024
    • uicrit

      Public
      UICrit is a dataset containing human-generated natural language design critiques, corresponding bounding boxes for each critique, and design quality ratings for 1,000 mobile UIs from RICO. This dataset was collected for our UIST '24 paper: https://arxiv.org/abs/2407.08850.
      02610 Updated Nov 19, 2024
    • WordGraph

      Public
      The WordGraph dataset contains multilingual lexicon entries linked to Wikipedia entities, focusing on human-denoting names and demonym adjectives. Each lexicon entry contains inflected word forms and morphological information for all locales.
      1100 Updated Nov 7, 2024
    • Dataset of conversations generated by prompting Gemini Ultra. These are conversations between a teacher and a student, where the teacher is prompted with a specific topic to teach the student, and the student is prompted with their learning preferences. https://arxiv.org/abs/2405.14655
      53110 Updated Oct 29, 2024
    • GeniL

      Public
      The GeniL dataset is an effort to detect various types of generalization in language. This multilingual dataset covers sentences in EN, FR, ES, PT, AR, HI, BN, MS, and ID and is annotated by native speakers of each language. Each sentence is collected from public language corpora and contains at least one identity group name and an attribute.
      0300 Updated Oct 18, 2024
    • The Tap Typing with Touch Sensing Images (TSI) dataset contains data on user taps on a mobile touchscreen keyboard, including elliptical features and capacitive sensing images of the taps. The dataset aligns each tap with the key the user intended to type during data collection, so it can be used for keyboard decoder training and/or evaluation.
      1200 Updated Oct 15, 2024
    • mittens

      Public
      Datasets for measuring misgendering in translation
      0500 Updated Oct 4, 2024
    • wit

      Public archive
      The WIT (Wikipedia-based Image Text) dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.
      451.1k10 Updated Sep 27, 2024
    • This dataset contains synthetic training data for grammatical error correction. The corpus is generated by corrupting clean sentences from C4 using a tagged corruption model. The approach and the dataset are described in more detail by Stahlberg and Kumar (2021) (https://www.aclweb.org/anthology/2021.bea-1.4/).
      Python
      2316200 Updated Sep 24, 2024
    • SeeGULL-Multilingual

      Public archive
      SeeGULL Multilingual is a multilingual and multicultural dataset of stereotypes. It consists of stereotypes in 20 languages with human annotations across 23 languages, including annotations on their degree of offensiveness.
      1800 Updated Sep 19, 2024
    • ToTTo

      Public
      ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. We hope it can serve as a useful research benchmark for high-precision conditional text generation.
      3746060 Updated Sep 11, 2024