The Vector RAG Evaluation framework is designed to be an intuitive and flexible tool for benchmarking the performance of RAG systems. The framework exposes an Evaluator that is configured using three components: Systems, Tasks, and Metrics.
- Systems encapsulate a RAG system. Systems must adhere to a common interface but can be implemented by users with arbitrary complexity. Several simple baseline systems are implemented within the framework.
- Tasks represent RAG datasets (inspired by the lm-evaluation-harness implementation). A Task is composed of a set of Documents and a set of Task Instances for evaluation.
- Metrics measure various aspects of the RAG systems, including accuracy, relevance, groundedness, and hallucination detection. Metrics can be user-defined or imported from existing frameworks such as RAGAS, TruLens, Rageval and DeepEval.
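As a rough illustration of how these pieces fit together, the sketch below wires a toy System into a simple evaluation loop. The names used here (`NaiveSystem`, `RagOutput`, `evaluate`, and the metric callables) are hypothetical, chosen only for illustration; they are not the framework's actual API.

```python
# Illustrative sketch only -- names below are assumptions, not veval's actual API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class RagOutput:
    """Hypothetical container for what a System returns for one query."""
    answer: str
    contexts: list[str]


class NaiveSystem:
    """Toy System: keyword-overlap retrieval plus a trivial 'generator'."""

    def __init__(self, documents: list[str]):
        self.documents = documents

    def run(self, query: str) -> RagOutput:
        # Retrieve the document sharing the most tokens with the query.
        q_tokens = set(query.lower().split())
        ranked = sorted(
            self.documents,
            key=lambda d: len(q_tokens & set(d.lower().split())),
            reverse=True,
        )
        context = ranked[0] if ranked else ""
        # A real System would call an LLM here; we simply echo the retrieved context.
        return RagOutput(answer=context, contexts=[context])


def evaluate(system, instances: list[dict], metrics: dict[str, Callable]) -> dict[str, float]:
    """Run the system on every task instance and average each metric's score."""
    scores: dict[str, list[float]] = {name: [] for name in metrics}
    for instance in instances:
        output = system.run(instance["query"])
        for name, metric in metrics.items():
            scores[name].append(metric(instance, output))
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}


# Example usage: one document set, one task instance, one exact-match metric.
system = NaiveSystem(documents=["Paris is the capital of France.", "Berlin is in Germany."])
instances = [{"query": "What is the capital of France?",
              "gt_answer": "Paris is the capital of France."}]
print(evaluate(system, instances,
               {"correctness_answer": lambda inst, out: float(inst["gt_answer"] == out.answer)}))
```

The only point of the sketch is that a System exposes a single entry point the Evaluator can call per query, and that Metrics consume the System's output together with the Task instance.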
Evaluating RAG systems is difficult: many variables and hyper-parameters can be manipulated, from the performance of the underlying models to the design of the system itself. Most RAG systems (although not all) are currently developed for Q/A applications. The following elements of a RAG system can be useful for Q/A evaluation:
- Q - Query/Question
- C - Retrieved Context
- A - Generated Answer
- C* - Ground Truth Context
- A* - Ground Truth Answer
Not all of these elements will necessarily be available. Some evaluation can be performed without the ground-truth context (C*) or the ground-truth answer (A*); evaluation without ground truth is particularly relevant when monitoring a system deployed in production. Ultimately, this is a somewhat simplistic view of system elements. A complex system may have many pieces of intermediate state that should be evaluated; for example, a system with a re-ranker should evaluate the context both before and after re-ranking to rigorously measure the impact of the re-ranking model.
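To make the bookkeeping concrete, the dataclass below sketches one way to group these elements per evaluation instance, with ground truth and intermediate state as optional fields. The field names are illustrative assumptions, not the framework's actual schema.

```python
# Illustrative only: field names are assumptions, not veval's actual schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalInstance:
    query: str                                   # Q
    retrieved_contexts: list[str]                # C
    generated_answer: str                        # A
    gt_contexts: Optional[list[str]] = None      # C* (may be unavailable, e.g. in production)
    gt_answer: Optional[str] = None              # A* (may be unavailable)
    # Intermediate state for more complex systems, e.g. the contexts before
    # re-ranking, so the impact of the re-ranker can be measured separately.
    pre_rerank_contexts: Optional[list[str]] = None
```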
- Relevance between Query and Generated Answer (relevance_query_answer): Evaluate the relevance of the generated answer (A) to the original query (Q).
- Groundedness of Answers (groundedness_context_answer): Assess how well the answer (A) is supported by the retrieved contexts (C).
- Relevance between Query and Retrieved Context (relevance_query_context): Evaluate the relevance of the retrieved context (C) to the original query (Q).
- Correctness of the Generated Answer (correctness_answer): Compare the generated answer (A) with the ground-truth answer (A*); many evaluation techniques are based on this comparison.
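As a sketch of what a user-defined metric can look like, the function below implements correctness_answer as a simple token-level F1 between A and A*. This is only a lexical baseline for illustration; in practice, LLM-judge or embedding-based metrics from frameworks such as RAGAS or DeepEval are typically used.

```python
# A minimal user-defined correctness_answer metric: token-level F1 between A and A*.
from collections import Counter


def correctness_answer(generated_answer: str, gt_answer: str) -> float:
    pred = generated_answer.lower().split()
    gold = gt_answer.lower().split()
    if not pred or not gold:
        return 0.0
    # Count tokens shared between prediction and ground truth (with multiplicity).
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


print(correctness_answer("yes, the treatment was effective",
                         "the treatment was effective"))  # -> 0.888...
```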
Create a new env and install the required packages:
```sh
python3 -m venv env
source env/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt
```
Run evaluation for a TASK (e.g. pubmedqa) using a SYSTEM (e.g. basic_rag):
```sh
python3 veval/run.py --task <TASK> --sys <SYSTEM>
```
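For example, to evaluate the basic_rag baseline system on the pubmedqa task:

```sh
python3 veval/run.py --task pubmedqa --sys basic_rag
```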