evaluation
Here are 1,079 public repositories matching this topic...
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, Llama 2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Updated May 15, 2024 - Python
A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems
Updated May 15, 2024
Pip compatible CodeBLEU metric implementation available for linux/macos/win
Updated May 15, 2024 - Python
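CodeBLEU builds on BLEU's modified n-gram precision, extending it with AST- and data-flow-based matches. As a rough illustration of the n-gram component only (a generic sketch, not the package's actual API), modified n-gram precision over token lists can be computed like this:

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Modified n-gram precision: each candidate n-gram is credited
    at most as many times as it appears in the reference."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand:
        return 0.0
    matches = sum(min(count, ref[gram]) for gram, count in Counter(cand).items())
    return matches / len(cand)

# Identical token streams score 1.0.
print(ngram_precision("x = a + b".split(), "x = a + b".split()))  # → 1.0
```

The real metric additionally weights keyword matches and compares syntax trees and data-flow graphs, which is what makes it better suited to code than plain BLEU.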
Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.
Updated May 15, 2024 - TypeScript
Data release for the ImageInWords (IIW) paper.
Updated May 15, 2024 - JavaScript
Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs.
Updated May 15, 2024 - Jupyter Notebook
Documentation for LangSmith
Updated May 15, 2024 - MDX
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluation of LLMs, aiming to explore the technical boundaries of generative AI.
Updated May 15, 2024
🪢 Open source LLM engineering platform: observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, LangChain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Updated May 15, 2024 - TypeScript
Toolkit for evaluating and monitoring AI models in clinical settings
Updated May 14, 2024 - Python
LangEvals aggregates various language model evaluators into a single platform, providing a standard interface for a multitude of scores and LLM guardrails, so you can protect and benchmark your LLMs and pipelines.
Updated May 14, 2024 - Python
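The "standard interface" idea behind aggregators like this can be pictured with a small sketch: every scorer exposes the same call signature, so a pipeline can run any mix of them. The class and method names below are illustrative assumptions, not LangEvals' real API:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class EvalResult:
    score: float   # normalized to 0..1
    passed: bool
    details: str = ""

class Evaluator(Protocol):
    """Hypothetical common interface: any scorer that implements
    evaluate(output, expected) can plug into the same pipeline."""
    def evaluate(self, output: str, expected: str) -> EvalResult: ...

class ExactMatch:
    """Toy evaluator: pass iff the trimmed strings are identical."""
    def evaluate(self, output: str, expected: str) -> EvalResult:
        ok = output.strip() == expected.strip()
        return EvalResult(score=1.0 if ok else 0.0, passed=ok)

result = ExactMatch().evaluate(" 42 ", "42")
print(result.passed)  # → True
```

A guardrail is then just an evaluator whose `passed` flag gates whether a response is released.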
The production toolkit for LLMs. Observability, prompt management and evaluations.
Updated May 14, 2024 - TypeScript
Official implementation of the paper "Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs" (https://arxiv.org/abs/2402.11199).
Updated May 14, 2024 - Python
Graphical tool for creating verification plots of weather forecasts
Updated May 14, 2024 - Python
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Updated May 14, 2024 - Jupyter Notebook
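Automatic evaluators of this kind typically have a judge model compare two responses and then report a win rate against a reference model. A minimal sketch of that aggregation step (generic arithmetic, not this repository's code), with ties counted as half a win:

```python
def win_rate(judgments):
    """Fraction of pairwise comparisons won against the reference model.
    `judgments` is an iterable of "win" / "loss" / "tie" labels
    emitted by a judge model; ties count as half a win."""
    judgments = list(judgments)
    if not judgments:
        return 0.0
    wins = sum(j == "win" for j in judgments)
    ties = sum(j == "tie" for j in judgments)
    return (wins + 0.5 * ties) / len(judgments)

print(win_rate(["win", "loss", "tie", "win"]))  # → 0.625
```

The expensive part in practice is producing the judgments cheaply and reproducibly, which is exactly what such evaluators optimize.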
LangSmith Client SDK Implementations
Updated May 15, 2024 - Python
An open-source visual programming environment for battle-testing prompts to LLMs.
Updated May 15, 2024 - TypeScript
A version of eval for R that returns more information about what happened
Updated May 14, 2024 - R
A Comprehensive Framework for Building End-to-End Recommendation Systems with State-of-the-Art Models
Updated May 14, 2024 - Python