Extend TOFU Dataset #2

Open
philswatton opened this issue Mar 28, 2024 · 3 comments

Comments

@philswatton
Collaborator

  • We would like to extend the TOFU dataset in the following directions:
    • Amount of data
    • Different levels of granularity
    • Different levels of interconnection
  • Implement functions that generate additional TOFU-style data deterministically from a seed (or should this be a separate dataset?)
  • As with the TOFU dataset in the pipeline (#1), we want to be able to manage this data so that we can create retain/forget sets later on
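
A minimal sketch of what seed-driven generation and a reproducible forget/retain split could look like. Everything here is an assumption for illustration: the function names `generate_authors` and `split_forget_retain`, the record fields, and the value pools are hypothetical and are not taken from TOFU or from this repository.

```python
import random

# Hypothetical value pools (placeholders, not taken from TOFU).
GENRES = ["fantasy", "crime", "memoir", "science fiction"]
COUNTRIES = ["Chile", "Kenya", "Norway", "Japan"]


def generate_authors(n, seed):
    """Return n synthetic author records fully determined by `seed`."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return [
        {
            "author_id": i,
            "genre": rng.choice(GENRES),
            "country": rng.choice(COUNTRIES),
            "birth_year": rng.randint(1930, 1995),
        }
        for i in range(n)
    ]


def split_forget_retain(records, forget_fraction, seed):
    """Split records into (forget, retain) lists, reproducibly for a given seed."""
    rng = random.Random(seed)
    ids = [r["author_id"] for r in records]
    n_forget = round(len(ids) * forget_fraction)
    forget_ids = set(rng.sample(ids, n_forget))
    forget = [r for r in records if r["author_id"] in forget_ids]
    retain = [r for r in records if r["author_id"] not in forget_ids]
    return forget, retain
```

Splitting at the author level rather than the question level keeps every fact about a forgotten author out of the retain set; whether that is the right granularity is one of the open questions above.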
@jack89roberts
Contributor

Summary of data conversation with Matt:

  • Data to cover: people, details of individuals, relationships between people, and relationships between people and organisations/non-person entities, for both 1) fake authors and 2) real authors. To check: are synthetic and real authors similar enough?

  • Questions & answers only vs. full documents? Questions & answers only is probably an OK simplification, but if we're generating our own data we can also make profiles etc., and experiments on real authors would probe this.

@jack89roberts
Contributor

jack89roberts commented May 22, 2024

Generate a new dataset:

  1. Decide what relationships we want to have (e.g. genres, publishers, co-authors, ...)
  2. Create & sample relationships (e.g. via a graph/rule-based mechanism/prompt engineering)
  3. Decide what data we want to have (e.g. questions/answers, author profiles, book summaries, ...)
  4. Generate data (including paraphrased & perturbed)
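
As a rough illustration of steps 2 and 4, relationships could be sampled as a list of (subject, relation, object) edges and then rendered into question/answer pairs from templates. This is a sketch under stated assumptions, not a proposed design: the relation names, the `sample_relationships` and `edges_to_qa` helpers, and the templates are all hypothetical, and paraphrasing/perturbation is not covered.

```python
import random


def sample_relationships(n_authors, n_publishers, coauthor_prob, seed):
    """Sample a small relationship graph as (subject, relation, object) edges."""
    rng = random.Random(seed)
    authors = [f"author_{i}" for i in range(n_authors)]
    publishers = [f"publisher_{j}" for j in range(n_publishers)]
    edges = []
    # Rule: every author has exactly one publisher.
    for a in authors:
        edges.append((a, "published_by", rng.choice(publishers)))
    # Rule: each unordered author pair co-authors with probability coauthor_prob.
    for i, a in enumerate(authors):
        for b in authors[i + 1:]:
            if rng.random() < coauthor_prob:
                edges.append((a, "co_author_of", b))
    return edges


# Hypothetical (question, answer) templates keyed by relation name.
TEMPLATES = {
    "published_by": ("Who publishes {s}?", "{s} is published by {o}."),
    "co_author_of": ("Who has {s} co-authored with?", "{s} has co-authored with {o}."),
}


def edges_to_qa(edges):
    """Render one question/answer pair per edge using TEMPLATES."""
    return [
        {
            "question": TEMPLATES[rel][0].format(s=s, o=o),
            "answer": TEMPLATES[rel][1].format(s=s, o=o),
        }
        for s, rel, o in edges
    ]
```

Keeping the edge list as the source of truth means other data types from step 3 (author profiles, book summaries) could be generated from the same graph, and the `coauthor_prob` parameter gives one simple knob for the level of interconnection.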

@jack89roberts jack89roberts added this to the Milestone 2 milestone May 22, 2024
@jack89roberts
Contributor

Defer for now: Real dataset (authors/facts etc.)


3 participants