evaluation
Here are 1,079 public repositories matching this topic...
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, Llama 2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Updated May 15, 2024 - Python
A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems
Updated May 15, 2024
Pip compatible CodeBLEU metric implementation available for linux/macos/win
Updated May 15, 2024 - Python
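CodeBLEU builds on BLEU's modified n-gram precision, extending it with AST- and data-flow-based matches. As a rough illustration of the n-gram component only (a generic sketch, not the package's actual API), modified n-gram precision over token lists can be computed like this:

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Modified n-gram precision: each candidate n-gram is credited
    at most as many times as it appears in the reference."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand:
        return 0.0
    matches = sum(min(count, ref[gram]) for gram, count in Counter(cand).items())
    return matches / len(cand)

# Identical token streams score 1.0.
print(ngram_precision("x = a + b".split(), "x = a + b".split()))  # → 1.0
```

The real metric additionally weights keyword matches and compares syntax trees and data-flow graphs, which is what makes it better suited to code than plain BLEU.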
Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.
Updated May 15, 2024 - TypeScript
Data release for the ImageInWords (IIW) paper.
Updated May 15, 2024 - JavaScript
Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs.
Updated May 15, 2024 - Jupyter Notebook
Documentation for LangSmith
Updated May 15, 2024 - MDX
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluation of LLMs, aiming to explore the technical boundaries of generative AI.
Updated May 15, 2024
🪢 Open source LLM engineering platform: observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, LangChain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Updated May 15, 2024 - TypeScript
Toolkit for evaluating and monitoring AI models in clinical settings
Updated May 14, 2024 - Python
LangEvals aggregates various language model evaluators into a single platform, providing a standard interface for a multitude of scores and LLM guardrails, so you can protect and benchmark your LLMs and pipelines.
Updated May 14, 2024 - Python
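The "standard interface" idea behind aggregators like this can be pictured with a small sketch: every scorer exposes the same call signature, so a pipeline can run any mix of them. The class and method names below are illustrative assumptions, not LangEvals' real API:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class EvalResult:
    score: float   # normalized to 0..1
    passed: bool
    details: str = ""

class Evaluator(Protocol):
    """Hypothetical common interface: any scorer that implements
    evaluate(output, expected) can plug into the same pipeline."""
    def evaluate(self, output: str, expected: str) -> EvalResult: ...

class ExactMatch:
    """Toy evaluator: pass iff the trimmed strings are identical."""
    def evaluate(self, output: str, expected: str) -> EvalResult:
        ok = output.strip() == expected.strip()
        return EvalResult(score=1.0 if ok else 0.0, passed=ok)

result = ExactMatch().evaluate(" 42 ", "42")
print(result.passed)  # → True
```

A guardrail is then just an evaluator whose `passed` flag gates whether a response is released.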
The production toolkit for LLMs. Observability, prompt management and evaluations.
Updated May 14, 2024 - TypeScript
Official implementation of the paper "Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs" (https://arxiv.org/abs/2402.11199).
Updated May 14, 2024 - Python
Graphical tool for creating verification plots of weather forecasts
Updated May 14, 2024 - Python
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Updated May 14, 2024 - Jupyter Notebook
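Automatic evaluators of this kind typically have a judge model compare two responses and then report a win rate against a reference model. A minimal sketch of that aggregation step (generic arithmetic, not this repository's code), with ties counted as half a win:

```python
def win_rate(judgments):
    """Fraction of pairwise comparisons won against the reference model.
    `judgments` is an iterable of "win" / "loss" / "tie" labels
    emitted by a judge model; ties count as half a win."""
    judgments = list(judgments)
    if not judgments:
        return 0.0
    wins = sum(j == "win" for j in judgments)
    ties = sum(j == "tie" for j in judgments)
    return (wins + 0.5 * ties) / len(judgments)

print(win_rate(["win", "loss", "tie", "win"]))  # → 0.625
```

The expensive part in practice is producing the judgments cheaply and reproducibly, which is exactly what such evaluators optimize.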
LangSmith Client SDK Implementations
Updated May 15, 2024 - Python
An open-source visual programming environment for battle-testing prompts to LLMs.
Updated May 15, 2024 - TypeScript
A version of eval for R that returns more information about what happened
Updated May 14, 2024 - R
A Comprehensive Framework for Building End-to-End Recommendation Systems with State-of-the-Art Models
Updated May 14, 2024 - Python