Evaluations 1.5 docs #1064

Merged
merged 1 commit into main from feature/912-evaluations-v1.5-docs on Apr 1, 2025

Conversation

@neoxelox (Contributor) commented on Apr 1, 2025

Related to: #922 and #912

@gemini-code-assist (bot) left a comment

Hello @neoxelox, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request focuses on updating the documentation for the Evaluations feature in the Latitude platform. The changes include renaming 'Manual evaluations' to 'Human-in-the-Loop' evaluations, clarifying the purpose and usage of different evaluation types (LLM-as-judge, Programmatic rules, and Human-in-the-loop), and providing more detailed explanations of how to run evaluations and interpret the results. The updates aim to improve user understanding and adoption of the evaluation tools available in Latitude.

Highlights

  • Evaluation Types: The documentation now clearly distinguishes between LLM-as-judge, Programmatic rules, and Human-in-the-loop evaluations, outlining their respective strengths and use cases.
  • Human-in-the-Loop Evaluations: The 'Manual evaluations' section has been renamed to 'Human-in-the-Loop' and updated to reflect the use of human feedback in the evaluation process.
  • Running Evaluations: The documentation provides detailed steps on how to run evaluations in the playground, live mode, and batch mode, with considerations for different evaluation types.
  • Prompt Suggestions: A new section on Prompt Suggestions has been added, explaining how Latitude automatically analyzes evaluation results to generate recommendations for improving prompts.

Changelog

  • docs/guides/datasets/overview.mdx
    • Updated the description of datasets to include their use as expected outputs (labels) for evaluations.
  • docs/guides/evaluations/evaluation-templates.mdx
    • Replaced 'LLM outputs' with 'LLM responses' for clarity.
  • docs/guides/evaluations/llm_as_judge_evaluations.mdx
    • Renamed the title to 'LLM as Judge' and updated the description to focus on using LLMs to evaluate prompt quality.
    • Revised the content to explain the use cases and trade-offs of LLM as judge evaluations, and how they compare to other evaluation types.
  • docs/guides/evaluations/manual_evaluations.mdx
    • Renamed the title to 'Human-in-the-Loop' and updated the description to emphasize human feedback in evaluating prompts.
    • Revised the content to explain the use cases and trade-offs of human-in-the-loop evaluations, and how to submit evaluation results through the dashboard, API, or SDK.
  • docs/guides/evaluations/overview.mdx
    • Expanded the description of evaluation types to include Programmatic rules and Human-in-the-loop, and added a section on negative evaluations.
    • Updated the content to explain how to run evaluations in live and batch mode, and introduced the concept of prompt suggestions.
  • docs/guides/evaluations/programmatic_rules.mdx
    • Added a new page describing Programmatic Rules evaluations, including use cases, trade-offs, available metrics, and how to create them (a brief illustrative sketch of such rules follows this changelog).
  • docs/guides/evaluations/prompt-suggestions.mdx
    • Added a new page explaining Prompt Suggestions, how they are generated, and how to use them to improve prompts.
  • docs/guides/evaluations/running-evaluations.mdx
    • Revised the content to explain how to run evaluations in the playground, live mode, and batch mode, with considerations for different evaluation types.
  • docs/guides/getting-started/concepts.mdx
    • Updated the description of evaluations to include Programmatic Rules and Human-in-the-loop, and linked to the updated Evaluations guide.
  • docs/guides/logs/upload-logs.mdx
    • Updated instructions for evaluating uploaded logs to reflect the new evaluation types and configurations.
  • docs/mint.json
    • Updated the navigation structure to reflect the new evaluation types and pages, and renamed 'LLM as judge evaluations' to 'LLM as Judge'.
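
As an aside to the programmatic_rules.mdx item above, the TypeScript sketch below illustrates the general shape of rule-style metrics such as exact match, format validation, and length limits. It is a minimal sketch; the function names, signatures, and thresholds are assumptions for illustration, not code from the Latitude platform or the new documentation page.

// Illustrative rule-style metrics: each check returns 1 (pass) or 0 (fail).
// Generic examples only; not Latitude's actual rule implementations.

// Exact match: needs an expected output (a dataset label) to compare against.
function exactMatch(actual: string, expected: string): number {
  return actual.trim() === expected.trim() ? 1 : 0;
}

// Format validation: checks the response alone against a regular expression.
function matchesFormat(actual: string, pattern: RegExp): number {
  return pattern.test(actual) ? 1 : 0;
}

// Length limit: enforces a maximum number of characters in the response.
function withinLength(actual: string, maxChars: number): number {
  return actual.length <= maxChars ? 1 : 0;
}

// Hypothetical usage on a single log.
const response = '{"ok": true}';
console.log(exactMatch(response, '{"ok": true}'));     // 1
console.log(matchesFormat(response, /^\{[\s\S]*\}$/)); // 1
console.log(withinLength(response, 280));              // 1

Comparing against an expected output, as exactMatch does, is what ties rules like these to datasets used as labels (see the datasets/overview.mdx item above).
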
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review: /gemini review performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help: /gemini help displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the root of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


A judge of code,
LLM's wisdom, a guide,
Prompts find their worth.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

The pull request updates the documentation for evaluations, including adding new guides for programmatic rules and prompt suggestions. The changes aim to provide a more comprehensive overview of the evaluation features and how to use them effectively. Overall, the changes are well-structured and improve the clarity of the documentation.

Summary of Findings

  • Inconsistent terminology: The documentation uses both 'LLM outputs' and 'LLM responses'. Consistent terminology would improve clarity.
  • Missing context for 'Negative evaluations': The section on 'Negative evaluations' could benefit from a brief example to illustrate its use case.
  • Inaccurate description of live evaluation support: The documentation states that evaluations requiring expected output do not support live evaluation, but this is not entirely accurate as some programmatic rules with expected output do support live evaluation.

Merge Readiness

The pull request is almost ready for merging. Addressing the identified inconsistencies and inaccuracies would further enhance the quality of the documentation. I am unable to directly approve the pull request; I recommend that others review and approve this code before merging, and that it not be merged until the high-severity issues are addressed.

Comment on lines +24 to +27
<Note>
Evaluations that require an expected output, or human verification, do not
support live evaluation.
</Note>

Severity: high

This note states that evaluations requiring an expected output do not support live evaluation. However, some programmatic rules with expected output (e.g., Exact Match) do not support live evaluation, while others (e.g., Regular Expression) do. Clarify this statement to reflect the nuances of live evaluation support for different evaluation types.

Suggested wording:
Evaluations that require human verification do not support live evaluation. Some evaluation types that require an expected output, such as Exact Match, also do not support live evaluation.
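
To make the reviewer's distinction concrete, here is a minimal TypeScript sketch of one plausible gating rule. It assumes (and this is an assumption, not Latitude's documented behavior) that live evaluation is possible only when a check needs neither a dataset label nor a human reviewer at evaluation time.

// Assumed gating logic, for illustration only: a check can run on live traffic
// when it depends neither on an expected output nor on human verification.
interface EvaluationKind {
  name: string;
  needsExpectedOutput: boolean;
  needsHumanVerification: boolean;
}

function supportsLiveEvaluation(kind: EvaluationKind): boolean {
  return !kind.needsExpectedOutput && !kind.needsHumanVerification;
}

const regexRule: EvaluationKind = {
  name: "Regular Expression",
  needsExpectedOutput: false,
  needsHumanVerification: false,
};
const exactMatchRule: EvaluationKind = {
  name: "Exact Match",
  needsExpectedOutput: true,
  needsHumanVerification: false,
};

console.log(supportsLiveEvaluation(regexRule));      // true
console.log(supportsLiveEvaluation(exactMatchRule)); // false
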

Comment on lines +14 to +17
### Negative evaluations

Usually, a higher score is better in an evaluation. However, if you want to measure negative traits such as hallucinations or perplexity, you may want to use negative evaluations. Negative evaluations measure negative criteria, where a lower score is better.

Severity: medium

The section on 'Negative evaluations' could benefit from a brief example to illustrate its use case. For instance, mention a scenario where minimizing hallucinations is desired.
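
As a brief aside: one common convention (an assumption here, not something specified in this PR) is to normalize negative metrics so that "higher is better" still holds everywhere downstream. A minimal TypeScript sketch:

// Invert negative metrics (e.g. a hallucination rate in the 0-1 range) so that
// all evaluations can be compared on a "higher is better" scale.
// The 0-1 normalization is an assumption made for this example.
function normalizedScore(rawScore: number, lowerIsBetter: boolean): number {
  return lowerIsBetter ? 1 - rawScore : rawScore;
}

// A response with a 0.2 hallucination rate ranks above one with a 0.7 rate.
console.log(normalizedScore(0.2, true)); // 0.8
console.log(normalizedScore(0.7, true)); // ~0.3 (floating point)
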

- **LLM evaluations**: You can use large language models to evaluate the output of other models. This is useful when you have a large number of logs and need to evaluate them quickly.
- **Human evaluations (HITL) [Coming soon]**: You—or your team—can manually review the logs and score them based on your criteria.
- **LLM-as-judge**: Large language models are used to evaluate the output of other models. This is useful when the evaluated criteria are subjective and complex.
- **Programmatic Rules**: Simple, algorithmic rules that evaluate your prompt based on a metric. Perfect for ground truth testing and objective criteria, such as enforcing specific lengths or validating formats.

Severity: medium

The phrase "Human evaluations (HITL) [Coming soon]" is outdated. It should be updated to reflect the current status of Human-in-the-loop evaluations.

Suggested wording:
- **Human-in-the-loop**: You (or your team) manually review the logs and evaluate them based on your criteria. This is ideal when you need human verification.
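
As an aside on the dashboard/API/SDK submission paths mentioned in the changelog above, the sketch below shows roughly what a human annotation payload could look like. The interface fields, endpoint URL, and auth header are hypothetical placeholders, not Latitude's actual API.

// Hypothetical human-in-the-loop annotation payload and submission helper.
// Field names, endpoint, and header are placeholders, not Latitude's real API.
interface HumanAnnotation {
  logId: string;     // the log (prompt run) being reviewed
  score: number;     // e.g. a 1-5 rating assigned by the reviewer
  reason?: string;   // optional free-form feedback
  reviewer: string;  // who performed the review
}

async function submitAnnotation(annotation: HumanAnnotation, apiKey: string): Promise<void> {
  const res = await fetch("https://example.com/api/evaluations/annotations", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(annotation),
  });
  if (!res.ok) {
    throw new Error(`Annotation submission failed with status ${res.status}`);
  }
}

In practice the dashboard covers most human review; an API or SDK path like this placeholder is mainly useful for piping in feedback collected elsewhere, for example end-user ratings captured in your own product.
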

@neoxelox merged commit dd5089b into main on Apr 1, 2025
4 checks passed
@neoxelox deleted the feature/912-evaluations-v1.5-docs branch on April 1, 2025 at 09:29
@github-actions (bot) locked and limited the conversation to collaborators on Apr 1, 2025