This workshop teaches systematic approaches to evaluating Generative AI workloads for production use. You'll learn to build evaluation frameworks that go beyond basic metrics to ensure reliable model behavior while optimizing cost and performance.
Click here for a slide deck which covers the basics of evaluations and includes an overview of this workshop.
We strongly recommend working through the Foundational Evaluations modules in order; they cover the core of generative AI evaluation, which is critical for every workload. After that, feel free to pick from the workload- and framework-specific modules in any order, based on what is most relevant to you.
As an alternative to working through the Jupyter notebooks directly, you can use the Interactive Learning module as a skill file in your AI coding assistant (Kiro, Claude, or similar). The assistant reads the workshop's notebooks and skill docs, then guides you through hands-on challenges — asking you to write code, explain concepts, and debug configurations rather than passively reading. Tell the agent which module you want to work on and it will present exercises one at a time, check your understanding, provide hints when you're stuck, and adapt to your pace.
Foundational Evaluations - Do all of these in order!
- 01 Operational Metrics: evaluate how your workload is running in terms of cost and performance (a minimal cost/latency sketch follows this list).
- 02 Quality Metrics: evaluate and tune the quality of your results.
- 03 Understanding Failures: discover failure patterns by reading agent traces.
- 04 Agentic Metrics: evaluate your agents and use agents for evaluation.
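As a preview of the Operational Metrics module, here is a minimal sketch of capturing latency and a token-based cost estimate for a single Bedrock invocation. The model ID and per-1K-token prices are placeholder assumptions, not real pricing; the notebook builds this out properly.

```python
import time

import boto3

# Placeholder prices in USD per 1K tokens -- check current Bedrock pricing for your model.
PRICE_PER_1K_INPUT = 0.003
PRICE_PER_1K_OUTPUT = 0.015

client = boto3.client("bedrock-runtime")

start = time.perf_counter()
response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model ID
    messages=[{"role": "user", "content": [{"text": "Summarize why evaluations matter."}]}],
)
latency_s = time.perf_counter() - start

# The Converse API reports token usage in the response body.
usage = response["usage"]
cost = (usage["inputTokens"] / 1000) * PRICE_PER_1K_INPUT \
     + (usage["outputTokens"] / 1000) * PRICE_PER_1K_OUTPUT

print(f"latency: {latency_s:.2f}s  "
      f"input tokens: {usage['inputTokens']}  output tokens: {usage['outputTokens']}  "
      f"estimated cost: ${cost:.5f}")
```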
Workload Specific Evaluations
- Intelligent Document Processing: evaluate structured data extraction accuracy with field-level precision/recall.
- Guardrails: configure and test content filters, grounding checks, alignment, and operational controls.
- Basic RAG: evaluate retrieval quality and end-to-end answer generation with precision@k, NDCG, and faithfulness scoring (a small precision@k/NDCG sketch follows this list).
- MultiModal RAG: evaluate retrieval across text, vision, and audio modalities using ImageBind embeddings.
- Speech to Speech: end-to-end evaluation of Nova Sonic interactions using CloudWatch telemetry and LLM-as-Judge.
- Automated Reasoning Evaluations: verify LLM outputs against formal policy rules using SMT solver-based guardrails.
- Tool Calling: evaluate agent tool-calling behavior without real tool execution using five progressively sophisticated approaches.
- Chatbot: evaluate multi-turn conversational AI with simulated users, custom binary evaluators, and synthetic data generation.
- Red Teaming: systematically probe AI systems with adversarial inputs using Promptfoo across LLM apps, RAG, agents, and guardrails.
- Multiagent Shared Context Evaluation: measure memory coordination quality in multi-agent systems across hub-spoke and peer-to-peer patterns.
- Coding Assistant: coming soon.
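For a flavor of the retrieval metrics used in the Basic RAG module, here is a small, self-contained sketch of precision@k and NDCG over a toy ranked result list with binary relevance; the notebooks compute these against real retrieval output.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def ndcg_at_k(retrieved, relevant, k):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: ranked retrieval results vs. the known-relevant set.
retrieved = ["doc3", "doc7", "doc1", "doc9", "doc2"]
relevant = {"doc1", "doc3", "doc5"}
print(precision_at_k(retrieved, relevant, k=5))  # 0.4
print(ndcg_at_k(retrieved, relevant, k=5))       # ~0.70
```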
Framework Specific Evaluations
- Promptfoo: configure YAML-based evaluations, write assertion test cases, and compare models across providers.
- Strands: evaluate agents using the Strands Evals SDK with output quality, trajectory, and custom evaluators.
- AgentCore: evaluate agents deployed on Amazon Bedrock AgentCore Runtime with LLM-as-Judge and CloudWatch log analysis.
- AgentCore Runtime Evals: run native AgentCore Evaluations API with built-in evaluators for helpfulness and tool selection accuracy.
- DSPy: optimize prompts automatically with BootstrapFewShot and measure improvement with custom metrics.
- MLflow: track and compare evaluation experiments using MLflow with Amazon Bedrock (a minimal logging sketch follows this list).
- DeepEval: coming soon.
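As a preview of the MLflow module, here is a minimal sketch of logging evaluation scores as an MLflow run so prompt or model variants can be compared side by side; the experiment name, run name, parameters, and scores are illustrative.

```python
import mlflow

mlflow.set_experiment("bedrock-rag-evals")  # illustrative experiment name

# Pretend these scores came from an evaluation pass over a test set.
scores = {"faithfulness": 0.87, "answer_relevance": 0.91, "precision_at_5": 0.40}

with mlflow.start_run(run_name="claude-3-5-sonnet_prompt-v2"):
    mlflow.log_param("model_id", "anthropic.claude-3-5-sonnet-20240620-v1:0")
    mlflow.log_param("prompt_version", "v2")
    for name, value in scores.items():
        mlflow.log_metric(name, value)
```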
Interactive Learning Mode: guided challenges, exercises, and real-time feedback with an interactive tutor.
Prerequisites
- AWS account with Amazon Bedrock enabled
- Basic Python and ML familiarity
- No prior AI evaluation experience required
Getting Started
- Clone the repository
- Configure AWS credentials (a quick Bedrock access check is sketched below)
- Work through the Foundational Evaluations modules in order
- Pick from the Workload Specific and Framework Specific modules based on what's relevant to you
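Once your credentials are configured, a quick sanity check like the following (a sketch, assuming your default region has Bedrock model access enabled) confirms that the notebooks will be able to reach Bedrock:

```python
import boto3

# Lists the foundation models visible to your account and region; a non-empty result
# confirms that credentials and Bedrock access are set up correctly.
bedrock = boto3.client("bedrock")
models = bedrock.list_foundation_models()["modelSummaries"]
print(f"{len(models)} Bedrock models available, e.g. {models[0]['modelId']}")
```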
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.
