Evaluations 1.5 docs #1064
Conversation
Hello @neoxelox, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This pull request focuses on updating the documentation for the Evaluations feature in the Latitude platform. The changes include renaming 'Manual evaluations' to 'Human-in-the-Loop' evaluations, clarifying the purpose and usage of different evaluation types (LLM-as-judge, Programmatic rules, and Human-in-the-loop), and providing more detailed explanations of how to run evaluations and interpret the results. The updates aim to improve user understanding and adoption of the evaluation tools available in Latitude.
Highlights
- Evaluation Types: The documentation now clearly distinguishes between LLM-as-judge, Programmatic rules, and Human-in-the-loop evaluations, outlining their respective strengths and use cases.
- Human-in-the-Loop Evaluations: The 'Manual evaluations' section has been renamed to 'Human-in-the-Loop' and updated to reflect the use of human feedback in the evaluation process.
- Running Evaluations: The documentation provides detailed steps on how to run evaluations in the playground, live mode, and batch mode, with considerations for different evaluation types.
- Prompt Suggestions: A new section on Prompt Suggestions has been added, explaining how Latitude automatically analyzes evaluation results to generate recommendations for improving prompts.
Changelog
- docs/guides/datasets/overview.mdx
- Updated the description of datasets to include their use as expected outputs (labels) for evaluations.
- docs/guides/evaluations/evaluation-templates.mdx
- Replaced 'LLM outputs' with 'LLM responses' for clarity.
- docs/guides/evaluations/llm_as_judge_evaluations.mdx
- Renamed the title to 'LLM as Judge' and updated the description to focus on using LLMs to evaluate prompt quality.
- Revised the content to explain the use cases and trade-offs of LLM as judge evaluations, and how they compare to other evaluation types.
- docs/guides/evaluations/manual_evaluations.mdx
- Renamed the title to 'Human-in-the-Loop' and updated the description to emphasize human feedback in evaluating prompts.
- Revised the content to explain the use cases and trade-offs of human-in-the-loop evaluations, and how to submit evaluation results through the dashboard, API, or SDK.
- docs/guides/evaluations/overview.mdx
- Expanded the description of evaluation types to include Programmatic rules and Human-in-the-loop, and added a section on negative evaluations.
- Updated the content to explain how to run evaluations in live and batch mode, and introduced the concept of prompt suggestions.
- docs/guides/evaluations/programmatic_rules.mdx
- Added a new page describing Programmatic Rules evaluations, including use cases, trade-offs, available metrics, and how to create them (see the illustrative sketch just after this changelog).
- docs/guides/evaluations/prompt-suggestions.mdx
- Added a new page explaining Prompt Suggestions, how they are generated, and how to use them to improve prompts.
- docs/guides/evaluations/running-evaluations.mdx
- Revised the content to explain how to run evaluations in the playground, live mode, and batch mode, with considerations for different evaluation types.
- docs/guides/getting-started/concepts.mdx
- Updated the description of evaluations to include Programmatic Rules and Human-in-the-loop, and linked to the updated Evaluations guide.
- docs/guides/logs/upload-logs.mdx
- Updated instructions for evaluating uploaded logs to reflect the new evaluation types and configurations.
- docs/mint.json
- Updated the navigation structure to reflect the new evaluation types and pages, and renamed 'LLM as judge evaluations' to 'LLM as Judge'.
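For reviewers who want a concrete picture of what the new programmatic_rules.mdx page covers, the checks it describes (exact match, format validation, length limits) are simple deterministic functions over a response. The TypeScript sketch below is purely illustrative; the names and shapes are hypothetical and not Latitude's implementation:

```typescript
// Illustrative sketch only: these names and shapes are hypothetical,
// not Latitude's actual rule implementation.
type RuleResult = { passed: boolean; score: number };

// Exact match needs an expected output (label), so it suits batch runs
// against a dataset rather than live evaluation.
function exactMatch(response: string, expected: string): RuleResult {
  const passed = response.trim() === expected.trim();
  return { passed, score: passed ? 1 : 0 };
}

// Format and length checks need no expected output, so they can run
// live on incoming logs.
function matchesFormat(response: string, pattern: RegExp): RuleResult {
  const passed = pattern.test(response);
  return { passed, score: passed ? 1 : 0 };
}

function withinLength(response: string, maxChars: number): RuleResult {
  const passed = response.length <= maxChars;
  return { passed, score: passed ? 1 : 0 };
}

// Example: check that a response is a bare ISO date and reasonably short.
console.log(matchesFormat("2024-05-01", /^\d{4}-\d{2}-\d{2}$/)); // { passed: true, score: 1 }
console.log(withinLength("2024-05-01", 20));                     // { passed: true, score: 1 }
```

The distinction matters for live mode: as discussed in the review below, rules that need an expected output (like exact match) generally run in batch against a dataset, while rules that only inspect the response itself can evaluate live logs.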
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
A judge of code,
LLM's wisdom, a guide,
Prompts find their worth.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
The pull request updates the documentation for evaluations, including adding new guides for programmatic rules and prompt suggestions. The changes aim to provide a more comprehensive overview of the evaluation features and how to use them effectively. Overall, the changes are well-structured and improve the clarity of the documentation.
Summary of Findings
- Inconsistent terminology: The documentation uses both 'LLM outputs' and 'LLM responses'. Consistent terminology would improve clarity.
- Missing context for 'Negative evaluations': The section on 'Negative evaluations' could benefit from a brief example to illustrate its use case.
- Inaccurate description of live evaluation support: The documentation states that evaluations requiring expected output do not support live evaluation, but this is not entirely accurate as some programmatic rules with expected output do support live evaluation.
Merge Readiness
The pull request is almost ready for merging. Addressing the identified inconsistencies and inaccuracies would further enhance the quality of the documentation. I am unable to directly approve the pull request and recommend that others review and approve this code before merging. I also recommend that it not be merged until the high-severity issues are addressed.
<Note>
Evaluations that require an expected output, or human verification, do not
support live evaluation.
</Note>
This note states that evaluations requiring an expected output do not support live evaluation. However, that is not uniformly true: some programmatic rules with an expected output (e.g., Exact Match) do not support live evaluation, while others (e.g., Regular Expression) do. Clarify this statement to reflect the nuances of live-evaluation support across evaluation types.
Evaluations that require human verification do not support live evaluation. Some evaluation types that require an expected output, such as Exact Match, also do not support live evaluation.
### Negative evaluations

Usually, a higher score is always better in an evaluation. However, if you want to measure negative traits such as hallucinations or perplexity, you may want to use negative evaluations. Negative evaluations are evaluations that measure a negative criteria, where a lower score is better.
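The summary of findings suggests a brief example would help here. To make the idea concrete: a negative evaluation scores something undesirable, so the run with the lowest score wins. A minimal TypeScript sketch, with hypothetical names and data, not Latitude code:

```typescript
// Hypothetical illustration of a negative evaluation: the metric counts
// hallucinated claims per response, so a LOWER score is better.
type NegativeEvalResult = { promptVersion: string; hallucinations: number };

const results: NegativeEvalResult[] = [
  { promptVersion: "v1", hallucinations: 3 },
  { promptVersion: "v2", hallucinations: 1 },
];

// Unlike a regular evaluation, the "best" run is the one with the minimum value.
const best = results.reduce((a, b) => (b.hallucinations < a.hallucinations ? b : a));
console.log(`Best prompt version: ${best.promptVersion}`); // "v2"
```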
- **LLM evaluations**: You can use large language models to evaluate the output of other models. This is useful when you have a large number of logs and need to evaluate them quickly.
- **Human evaluations (HITL) [Coming soon]**: You—or your team—can manually review the logs and score them based on your criteria.
- **LLM-as-judge**: Large language models are used to evaluate the output of other models. This is useful when the evaluated criteria is subjective and complex.
- **Programmatic Rules**: Simple, algorithmic rules, that evaluate your prompt based on some metric. Perfect for ground truth testing and objective criterias, such as enforcing specific lengths or validating formats.
The phrase "Human evaluations (HITL) [Coming soon]" is outdated. It should be updated to reflect the current status of Human-in-the-loop evaluations.
- **Human-in-the-loop**: You (or your team) manually review the logs and evaluate them based on your criteria. This is ideal when you need human verification.
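For context, the updated manual_evaluations.mdx explains that human evaluation results can be submitted through the dashboard, API, or SDK. As a rough illustration only, the sketch below shows what an HTTP submission could look like; the endpoint, payload fields, and API-key handling are hypothetical and not Latitude's documented API:

```typescript
// Hypothetical sketch: the endpoint, payload fields, and API key below are
// NOT Latitude's documented API; consult the real docs for the actual shape.
const API_KEY = "<your-api-key>"; // placeholder

async function submitHumanEvaluation(
  logId: string,
  score: number,
  reason: string
): Promise<void> {
  const res = await fetch("https://example.com/api/evaluations/results", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({ logId, score, reason }),
  });
  if (!res.ok) {
    throw new Error(`Failed to submit evaluation result: ${res.status}`);
  }
}

// Example: a reviewer marks a log as passing with a short justification.
submitHumanEvaluation("log_123", 1, "Accurate and well formatted.").catch(console.error);
```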
Related to: #922 and #912