Proposal: Change document to source #661

Open
anik120 opened this issue Apr 5, 2024 · 3 comments
Labels
stale stale-bot has marked you as stale

Comments

@anik120
Contributor

anik120 commented Apr 5, 2024

Capturing a discussion with @shivchander:

I was writing up a test case for the lmdk cli to test the knowledge workflow, and I laid out my qna.yaml as follows:

test_knowledge_valid = b"""created_by: test-bot
seed_examples:
- question: What is Operator Framework? 
  answer: 'The Operator Framework is a set of Kubernetes components and developer tools, 
  that aid in Operator development and central management on a multi-tenant cluster.'
- question: What is an Operator? 
  answer: 'The goal of an Operator is to put operational knowledge into software. 
  Previously this knowledge only resided in the minds of administrators, 
  various combinations of shell scripts or automation software like Ansible. 
  It was outside of your Kubernetes cluster and hard to integrate. 
  With Operators, CoreOS changed that. Operators implement and automate 
  common Day-1 (installation, configuration, etc.) and Day-2 (re-configuration, 
  update, backup, failover, restore, etc.) activities in a piece of software running 
  inside your Kubernetes cluster, by integrating natively with Kubernetes concepts and APIs. 
  We call this a Kubernetes-native application. 
  With Operators you can stop treating an application as a collection of primitives like Pods, 
  Deployments, Services or ConfigMaps, but instead as a single object that only exposes the knobs 
  that make sense for the application.'
- question: What is Operator Lifecycle Manager? 
  answer: 'OLM is a component of the Operator Framework, 
  an open source toolkit to manage Kubernetes native applications, 
  called Operators, in an effective, automated, and scalable way. 
  OLM extends Kubernetes to provide a declarative way to install, 
  manage, and upgrade Operators and their dependencies in a cluster.'
task_description: to teach a large language model about the Operator Framework
document:
  repo: https://github.com/anik120/knowledge-doc-test
  commit: bf78d868f544e55d8e1d99f68d9105fc3b8751bd
  patterns:
  - operator-framework*.md
"""

Essentially, the seed_example question/answers I have there are from the overarching project websites https://operatorframework.io/, https://olm.operatorframework.io/ and https://sdk.operatorframework.io/, and the documents I have in https://github.com/anik120/knowledge-doc-test are README.mds from the components' GitHub repositories. In other words, the seed_example question/answers do not actually come from the documents hosted in document.repo.

The way I laid things out, the seed_examples are "product pitch/summary description" and document.repo contains all the docs I want the model to learn about.

Shiv tells me that's the wrong way of thinking about it: the key document should really be called source, and seed_examples are examples of questions/answers that the model should be able to answer once it's been trained on the docs hosted in document.repo.
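To make Shiv's point concrete, here is a minimal sketch of a sanity check that flags seed examples whose answers do not appear to come from the source documents. The helper names, the stopword list, the word-overlap heuristic, and the 0.6 threshold are all assumptions made for illustration; nothing like this is known to exist in the cli or the taxonomy tooling.

```python
import re

# Hypothetical stopword list; purely illustrative.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for", "that", "with"}


def content_words(text):
    """Lowercase the text and return its set of non-stopword word tokens."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def ungrounded_examples(seed_examples, source_text, min_overlap=0.6):
    """Return the questions whose answers share too few content words with the source.

    min_overlap is an assumed threshold, not a value taken from any real tool.
    """
    source_words = content_words(source_text)
    flagged = []
    for example in seed_examples:
        answer_words = content_words(example["answer"])
        if not answer_words:
            continue  # nothing to compare
        overlap = len(answer_words & source_words) / len(answer_words)
        if overlap < min_overlap:
            flagged.append(example["question"])
    return flagged


# Toy data standing in for the docs matched by document.patterns.
source_text = (
    "OLM extends Kubernetes to provide a declarative way to install, "
    "manage, and upgrade Operators."
)
seed_examples = [
    {"question": "What does OLM do?",
     "answer": "OLM provides a declarative way to install and upgrade Operators."},
    {"question": "What is an Operator?",
     "answer": "Software that encodes operational knowledge previously held by administrators."},
]

print(ungrounded_examples(seed_examples, source_text))
```

With this toy data, only the second question is flagged, because its answer shares no content words with the source text. A real check would need something stronger than word overlap, but even this crude version would have caught the layout described above.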

Eureka moment: Even after learning* how the taxonomy interacts with the model, I was still thinking about the structure of my qna.yaml the wrong way. It's likely that other users will also misunderstand the taxonomy/model interaction and lay out their qna.yaml files the wrong way, leading to PR submissions that are unlikely to improve model quality.
*only a little while ago, i.e. fresh info still being processed by the brain

Proposed fix: Change document to source
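If the rename were adopted, one way to avoid breaking existing qna.yaml files would be to accept both keys for a while. The sketch below assumes the file has already been parsed into a dict; the function name resolve_source and the fallback behaviour are hypothetical and not part of any existing schema or tool.

```python
def resolve_source(qna):
    """Return the source block, preferring the proposed 'source' key over 'document'."""
    if "source" in qna:
        return qna["source"]
    if "document" in qna:
        # Legacy key, kept only for backward compatibility in this sketch.
        return qna["document"]
    raise KeyError("qna.yaml needs a 'source' (or legacy 'document') block")


legacy = {
    "document": {
        "repo": "https://github.com/anik120/knowledge-doc-test",
        "patterns": ["operator-framework*.md"],
    }
}
renamed = {"source": legacy["document"]}

# Both layouts resolve to the same block.
assert resolve_source(legacy) == resolve_source(renamed)
```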

cc: @xukai92 @abhi1092 @aldopareja

@anik120
Contributor Author

anik120 commented Apr 5, 2024

Captured the same comment here too: instructlab/instructlab#776 (comment)

PS: this issue is just a proposal, in the hope that a discussion will ensue about the priority of this work. Totally reasonable to just ignore it if others don't see it as a high-priority issue, or if the change will take a lot of effort to get through before opening and the team does not have cycles to implement it 😀

anik120 changed the title from "Change document to source" to "Proposal: Change document to source" on Apr 5, 2024
@bjhargrave
Contributor

This issue should probably be in the https://github.com/instruct-lab/schema/ repo.

@bjhargrave
Contributor

@anik120 I think it is beyond when this change could be made. Perhaps we can close this issue?

github-actions bot added the "stale" label (stale-bot has marked you as stale) on May 9, 2024