Automatic upload traces to hugging-face #53

Open
recursix opened this issue Oct 7, 2024 · 10 comments
@recursix
Collaborator

recursix commented Oct 7, 2024

Build tools that simplify adding agent traces to an ever-growing Hugging Face dataset.

  • create two datasets on Hugging Face:

    • one that serves as an index, so traces can be retrieved easily by attributes, similar to the dataframe we get when we run load_result_df
    • one that contains the actual zipped traces, each retrievable from a pointer in the index
  • write code to upload a study trace by trace, with an easy way to group the traces by study in the index.

  • legality:

    • only allow traces from whitelisted domains (e.g. our benchmarks or a subset of them)
    • attribute a specific license to each trace, based on which LLM and which benchmark produced it.
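
A minimal sketch of the "zipped traces plus grouping by study" idea, using only the standard library. The function names `zip_trace_dir` and `group_by_study` and the row keys `study_name` / `exp_id` are assumptions for illustration, not existing project code; the actual upload to Hugging Face is left out.

```python
import io
import zipfile
from pathlib import Path


def zip_trace_dir(trace_dir: Path) -> bytes:
    """Zip every file under trace_dir into an in-memory archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(trace_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the trace directory itself.
                archive.write(path, path.relative_to(trace_dir).as_posix())
    return buffer.getvalue()


def group_by_study(rows: list[dict]) -> dict[str, list[str]]:
    """Map each study name to the exp_ids of its traces (hypothetical keys)."""
    groups: dict[str, list[str]] = {}
    for row in rows:
        groups.setdefault(row["study_name"], []).append(row["exp_id"])
    return groups
```

Zipping in memory keeps each trace a single blob that can be handed to whatever uploader is chosen later, one trace at a time.
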
@recursix recursix self-assigned this Oct 7, 2024
@recursix
Collaborator Author

We can use exp_args.exp_id (a UUID) as the unique reference for each trace.
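
For illustration, here is one way the exp_id could become the pointer stored in the index. This is a sketch only: it assumes the exp_id is available as a string, and both the `trace_path_for` helper and the `traces/<prefix>/<exp_id>.zip` layout are hypothetical.

```python
import uuid


def trace_path_for(exp_id: str) -> str:
    """Return a repo-relative path for a trace, keyed by its exp_id.

    Raises ValueError if exp_id is not a valid UUID string.
    """
    canonical = str(uuid.UUID(exp_id))  # normalizes case and formatting
    # Hypothetical layout: shard by the first two hex characters so that
    # no single folder ends up holding every trace.
    return f"traces/{canonical[:2]}/{canonical}.zip"
```

Validating through `uuid.UUID` rejects malformed ids before anything is written, and normalizing means the same trace always maps to the same path.
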

@RohitP2005

So @recursix, can I work on this, if you don't mind?

@recursix
Collaborator Author

recursix commented Jan 7, 2025

That would be awesome as we've been running out of time to work on this.

I have something specific in mind, and there are other stakeholders who might have opinions on the design. You probably also have an idea of how you want to design it. So we should probably start with a more detailed set of specs / API. Would you want to start with what you have in mind?

@recursix
Collaborator Author

@RohitP2005, still interested?

@RohitP2005

Yeah, I just need some more time. Is that OK with you?

@recursix
Collaborator Author

Yes, that's fine. Would you like to meet next week?

@RohitP2005

Yeah, sounds good, @recursix.

@RohitP2005

From my side, I’m thinking of structuring the design as follows:

Dataset Structure

  • Index Dataset: Stores metadata (exp_id, study name, LLM, benchmark, license).
  • Traces Dataset: Stores zipped trace files, referenced by exp_id.
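
As a rough sketch of what one index row could look like, assuming exactly the fields listed above. The `IndexRecord` class, the `trace_file` naming, and `to_row` are hypothetical names for illustration, not an agreed schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IndexRecord:
    """One row of the index dataset (hypothetical schema)."""

    exp_id: str
    study_name: str
    llm: str
    benchmark: str
    license: str

    @property
    def trace_file(self) -> str:
        # Assumed naming: the zipped trace is stored under its exp_id.
        return f"{self.exp_id}.zip"


def to_row(record: IndexRecord) -> dict:
    """Flatten a record into a plain dict, including the trace pointer."""
    row = asdict(record)
    row["trace_file"] = record.trace_file
    return row
```

Deriving the pointer from exp_id, rather than storing it independently, means the index and the traces dataset cannot drift apart on naming.
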

API Functionality

  • Trace Upload API: Uploads traces with metadata, ensuring only whitelisted domains/benchmarks are added.
  • Index Query API: Queries index dataset to retrieve trace pointers based on attributes.
  • License Management: Automatically assigns and validates licenses based on benchmark and LLM.
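
The index query could be as simple as an exact-match filter over the index rows. The sketch below assumes the rows are plain dicts already loaded in memory; `query_index` is a hypothetical name, and a real version might operate on a dataframe instead.

```python
def query_index(rows: list[dict], **criteria) -> list[str]:
    """Return the exp_ids of rows whose attributes match every criterion."""
    return [
        row["exp_id"]
        for row in rows
        # A row matches only if each requested attribute is present and equal.
        if all(row.get(key) == value for key, value in criteria.items())
    ]
```

For example, `query_index(rows, benchmark="some-benchmark", llm="some-llm")` would return the pointers for that combination, and calling it with no criteria returns every exp_id.
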

Legal Compliance

  • Integrates checks for domain whitelisting and license attribution to ensure data integrity and compliance.
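
One possible shape for these checks, as a sketch: refuse anything outside an allow-list, and look up the license from the (benchmark, LLM) pair. Every name and value below is a placeholder; the real allow-list and license mapping would have to come from the maintainers.

```python
# Placeholder values only; the real lists are a project and legal decision.
ALLOWED_BENCHMARKS = frozenset({"example_benchmark_a", "example_benchmark_b"})
LICENSES = {
    ("example_benchmark_a", "example_llm"): "example-license-1",
    ("example_benchmark_b", "example_llm"): "example-license-2",
}


class ComplianceError(ValueError):
    """Raised when a trace must not be uploaded."""


def resolve_license(benchmark: str, llm: str) -> str:
    """Return the license for a trace, or raise if it may not be shared."""
    if benchmark not in ALLOWED_BENCHMARKS:
        raise ComplianceError(f"benchmark {benchmark!r} is not whitelisted")
    try:
        return LICENSES[(benchmark, llm)]
    except KeyError:
        raise ComplianceError(
            f"no license defined for ({benchmark!r}, {llm!r})"
        ) from None
```

Failing closed, that is, raising when no license is defined instead of falling back to a default, keeps unknown combinations out of the public dataset until someone decides how to handle them.
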

Looking forward to refining these specs and aligning with everyone’s input! Let me know if there's anything you'd like to add!

@recursix
Collaborator Author

Sorry for the late reply. That sounds good overall. I would still like to discuss this with you, e.g. over Zoom or wherever is convenient to chat. Can you contact me by email? [email protected]

@RohitP2005

Yeah, sure @recursix, I will contact you by email.
