Evaluations 1.5 docs #1064
Conversation
Hello @neoxelox, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This pull request focuses on updating the documentation for the Evaluations feature in the Latitude platform. The changes include renaming 'Manual evaluations' to 'Human-in-the-Loop' evaluations, clarifying the purpose and usage of different evaluation types (LLM-as-judge, Programmatic rules, and Human-in-the-loop), and providing more detailed explanations of how to run evaluations and interpret the results. The updates aim to improve user understanding and adoption of the evaluation tools available in Latitude.
Highlights
- Evaluation Types: The documentation now clearly distinguishes between LLM-as-judge, Programmatic rules, and Human-in-the-loop evaluations, outlining their respective strengths and use cases.
- Human-in-the-Loop Evaluations: The 'Manual evaluations' section has been renamed to 'Human-in-the-Loop' and updated to reflect the use of human feedback in the evaluation process.
- Running Evaluations: The documentation provides detailed steps on how to run evaluations in the playground, live mode, and batch mode, with considerations for different evaluation types.
- Prompt Suggestions: A new section on Prompt Suggestions has been added, explaining how Latitude automatically analyzes evaluation results to generate recommendations for improving prompts.
Changelog
- docs/guides/datasets/overview.mdx
- Updated the description of datasets to include their use as expected outputs (labels) for evaluations.
- docs/guides/evaluations/evaluation-templates.mdx
- Replaced 'LLM outputs' with 'LLM responses' for clarity.
- docs/guides/evaluations/llm_as_judge_evaluations.mdx
- Renamed the title to 'LLM as Judge' and updated the description to focus on using LLMs to evaluate prompt quality.
- Revised the content to explain the use cases and trade-offs of LLM as judge evaluations, and how they compare to other evaluation types.
- docs/guides/evaluations/manual_evaluations.mdx
- Renamed the title to 'Human-in-the-Loop' and updated the description to emphasize human feedback in evaluating prompts.
- Revised the content to explain the use cases and trade-offs of human-in-the-loop evaluations, and how to submit evaluation results through the dashboard, API, or SDK.
- docs/guides/evaluations/overview.mdx
- Expanded the description of evaluation types to include Programmatic rules and Human-in-the-loop, and added a section on negative evaluations.
- Updated the content to explain how to run evaluations in live and batch mode, and introduced the concept of prompt suggestions.
- docs/guides/evaluations/programmatic_rules.mdx
- Added a new page describing Programmatic Rules evaluations, including use cases, trade-offs, available metrics, and how to create them (see the illustrative sketch just after this changelog).
- docs/guides/evaluations/prompt-suggestions.mdx
- Added a new page explaining Prompt Suggestions, how they are generated, and how to use them to improve prompts.
- docs/guides/evaluations/running-evaluations.mdx
- Revised the content to explain how to run evaluations in the playground, live mode, and batch mode, with considerations for different evaluation types.
- docs/guides/getting-started/concepts.mdx
- Updated the description of evaluations to include Programmatic Rules and Human-in-the-loop, and linked to the updated Evaluations guide.
- docs/guides/logs/upload-logs.mdx
- Updated instructions for evaluating uploaded logs to reflect the new evaluation types and configurations.
- docs/mint.json
- Updated the navigation structure to reflect the new evaluation types and pages, and renamed 'LLM as judge evaluations' to 'LLM as Judge'.
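For reviewers who want a concrete picture of what the new programmatic_rules.mdx page covers, the checks it describes (exact match, format validation, length limits) are simple deterministic functions over a response. The TypeScript sketch below is purely illustrative; the names and shapes are hypothetical and not Latitude's implementation:

```typescript
// Illustrative sketch only: these names and shapes are hypothetical,
// not Latitude's actual rule implementation.
type RuleResult = { passed: boolean; score: number };

// Exact match needs an expected output (label), so it suits batch runs
// against a dataset rather than live evaluation.
function exactMatch(response: string, expected: string): RuleResult {
  const passed = response.trim() === expected.trim();
  return { passed, score: passed ? 1 : 0 };
}

// Format and length checks need no expected output, so they can run
// live on incoming logs.
function matchesFormat(response: string, pattern: RegExp): RuleResult {
  const passed = pattern.test(response);
  return { passed, score: passed ? 1 : 0 };
}

function withinLength(response: string, maxChars: number): RuleResult {
  const passed = response.length <= maxChars;
  return { passed, score: passed ? 1 : 0 };
}

// Example: check that a response is a bare ISO date and reasonably short.
console.log(matchesFormat("2024-05-01", /^\d{4}-\d{2}-\d{2}$/)); // { passed: true, score: 1 }
console.log(withinLength("2024-05-01", 20));                     // { passed: true, score: 1 }
```

The distinction matters for live mode: as discussed in the review below, rules that need an expected output (like exact match) generally run in batch against a dataset, while rules that only inspect the response itself can evaluate live logs.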
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
A judge of code,
LLM's wisdom, a guide,
Prompts find their worth.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
The pull request updates the documentation for evaluations, including adding new guides for programmatic rules and prompt suggestions. The changes aim to provide a more comprehensive overview of the evaluation features and how to use them effectively. Overall, the changes are well-structured and improve the clarity of the documentation.
Summary of Findings
- Inconsistent terminology: The documentation uses both 'LLM outputs' and 'LLM responses'. Consistent terminology would improve clarity.
- Missing context for 'Negative evaluations': The section on 'Negative evaluations' could benefit from a brief example to illustrate its use case.
- Inaccurate description of live evaluation support: The documentation states that evaluations requiring expected output do not support live evaluation, but this is not entirely accurate as some programmatic rules with expected output do support live evaluation.
Merge Readiness
The pull request is almost ready for merging. Addressing the identified inconsistencies and inaccuracies would further enhance the quality of the documentation. I am unable to directly approve the pull request and recommend that others review and approve this code before merging. I also recommend that it not be merged until the high-severity issues are addressed.
<Note>
Evaluations that require an expected output, or human verification, do not
support live evaluation.
</Note>
This note states that evaluations requiring an expected output do not support live evaluation. However, that is not uniformly true: some programmatic rules with an expected output (e.g., Exact Match) do not support live evaluation, while others (e.g., Regular Expression) do. Clarify this statement to reflect the nuances of live-evaluation support across evaluation types.
Evaluations that require human verification do not support live evaluation. Some evaluation types that require an expected output, such as Exact Match, also do not support live evaluation.
### Negative evaluations

Usually, a higher score is always better in an evaluation. However, if you want to measure negative traits such as hallucinations or perplexity, you may want to use negative evaluations. Negative evaluations are evaluations that measure a negative criteria, where a lower score is better.
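The summary of findings suggests a brief example would help here. To make the idea concrete: a negative evaluation scores something undesirable, so the run with the lowest score wins. A minimal TypeScript sketch, with hypothetical names and data, not Latitude code:

```typescript
// Hypothetical illustration of a negative evaluation: the metric counts
// hallucinated claims per response, so a LOWER score is better.
type NegativeEvalResult = { promptVersion: string; hallucinations: number };

const results: NegativeEvalResult[] = [
  { promptVersion: "v1", hallucinations: 3 },
  { promptVersion: "v2", hallucinations: 1 },
];

// Unlike a regular evaluation, the "best" run is the one with the minimum value.
const best = results.reduce((a, b) => (b.hallucinations < a.hallucinations ? b : a));
console.log(`Best prompt version: ${best.promptVersion}`); // "v2"
```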
- **LLM evaluations**: You can use large language models to evaluate the output of other models. This is useful when you have a large number of logs and need to evaluate them quickly.
- **Human evaluations (HITL) [Coming soon]**: You—or your team—can manually review the logs and score them based on your criteria.
- **LLM-as-judge**: Large language models are used to evaluate the output of other models. This is useful when the evaluated criteria is subjective and complex.
- **Programmatic Rules**: Simple, algorithmic rules, that evaluate your prompt based on some metric. Perfect for ground truth testing and objective criterias, such as enforcing specific lengths or validating formats.
The phrase "Human evaluations (HITL) [Coming soon]" is outdated. It should be updated to reflect the current status of Human-in-the-loop evaluations.
- **Human-in-the-loop**: You (or your team) manually review the logs and evaluate them based on your criteria. This is ideal when you need human verification.
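For context, the updated manual_evaluations.mdx explains that human evaluation results can be submitted through the dashboard, API, or SDK. As a rough illustration only, the sketch below shows what an HTTP submission could look like; the endpoint, payload fields, and API-key handling are hypothetical and not Latitude's documented API:

```typescript
// Hypothetical sketch: the endpoint, payload fields, and API key below are
// NOT Latitude's documented API; consult the real docs for the actual shape.
const API_KEY = "<your-api-key>"; // placeholder

async function submitHumanEvaluation(
  logId: string,
  score: number,
  reason: string
): Promise<void> {
  const res = await fetch("https://example.com/api/evaluations/results", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({ logId, score, reason }),
  });
  if (!res.ok) {
    throw new Error(`Failed to submit evaluation result: ${res.status}`);
  }
}

// Example: a reviewer marks a log as passing with a short justification.
submitHumanEvaluation("log_123", 1, "Accurate and well formatted.").catch(console.error);
```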
Related to: #922 and #912