
Commit 29e4b18

Merge pull request #2 from VectorInstitute/develop

Develop

2 parents 6573704 + 1693479

File tree

1 file changed: +32 −1 lines changed

README.md

Lines changed: 32 additions & 1 deletion
@@ -8,7 +8,38 @@
[![codecov](https://codecov.io/gh/VectorInstitute/aieng-template/branch/main/graph/badge.svg)](https://codecov.io/gh/VectorInstitute/aieng-template)
[![license](https://img.shields.io/github/license/VectorInstitute/aieng-template.svg)](https://github.com/VectorInstitute/aieng-template/blob/main/LICENSE)

A repository for evaluating RAG systems.

## Description

### Overview

The Vector RAG Evaluation framework is designed to be an intuitive and flexible tool for benchmarking the performance of RAG systems. The framework exposes an Evaluator that is configured using three components: Systems, Tasks, and Metrics.

- **Systems** encapsulate a RAG system. Systems must adhere to a common interface but can be implemented by users with arbitrary complexity. Several simple baseline systems are implemented within the framework.
- **Tasks** represent RAG datasets (inspired by the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) implementation). A Task is composed of a set of Documents and a set of Task Instances for evaluation.
- **Metrics** measure various aspects of the RAG systems, including accuracy, relevance, groundedness, and hallucination detection. Metrics can be user-defined or imported from existing frameworks such as [RAGAS](https://docs.ragas.io/en/stable/), [TruLens](https://www.trulens.org/), [Rageval](https://github.com/gomate-community/rageval), and [DeepEval](https://docs.confident-ai.com/).
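
To make the configuration concrete, here is a minimal sketch of how an Evaluator-style loop could tie the three components together. The `System` protocol and `evaluate` function below are illustrative assumptions, not the framework's actual API.

```python
# A minimal sketch of the Evaluator loop. All names and signatures here
# are illustrative assumptions, not the framework's actual API.
from typing import Callable, Protocol


class System(Protocol):
    """Common interface a RAG system is assumed to implement."""

    def generate(self, query: str) -> tuple[str, list[str]]:
        """Return (generated_answer, retrieved_contexts) for a query."""
        ...


# A metric scores one instance: (query, contexts, answer) -> score.
Metric = Callable[[str, list[str], str], float]


def evaluate(
    system: System,
    instances: list[dict],
    metrics: dict[str, Metric],
) -> dict[str, float]:
    """Run every metric over every task instance and report mean scores."""
    totals = {name: 0.0 for name in metrics}
    for instance in instances:
        answer, contexts = system.generate(instance["query"])
        for name, metric in metrics.items():
            totals[name] += metric(instance["query"], contexts, answer)
    return {name: total / len(instances) for name, total in totals.items()}
```

In this sketch a Task would supply `instances` (its Task Instances and associated Documents), and a Metric is any callable with that signature, whether user-defined or wrapped from RAGAS, TruLens, Rageval, or DeepEval.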

### Evaluation

Evaluating RAG systems is difficult: many variables and hyper-parameters can be manipulated, from underlying model performance to overall system design. Most current RAG systems (although not all) are developed for Q/A applications. The following elements of a RAG system are useful for Q/A evaluation:

- **Q** - Query/Question
- **C** - Retrieved Context
- **A** - Generated Answer
- **C\*** - Ground Truth Context
- **A\*** - Ground Truth Answer

Not all of these elements will necessarily be available. Some evaluation can be performed without the ground truth context (**C\***) or the ground truth answer (**A\***); evaluation without ground truth is particularly relevant when monitoring a system deployed in production. Ultimately, this is a somewhat simplistic view of system elements: a complex system may have many pieces of intermediate state that should also be evaluated. For example, a re-ranking system should evaluate the context both before and after re-ranking to rigorously measure the impact of the re-ranking model.
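
Since availability varies per dataset, one natural representation is a per-instance record in which the ground truth fields are optional; a sketch with hypothetical field names:

```python
# One evaluation instance carrying the Q/C/A elements plus optional
# ground truth. Field names are illustrative, not the framework's schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAInstance:
    query: str                                          # Q
    retrieved_contexts: list[str]                       # C
    generated_answer: str                               # A
    ground_truth_contexts: Optional[list[str]] = None   # C* (may be absent)
    ground_truth_answer: Optional[str] = None           # A* (may be absent)
```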

#### Evaluation Without Ground Truth

- Relevance between Query and Generated Answer (*relevance_query_answer*): Evaluate the relevance of the generated answer (**A**) to the original query (**Q**).
- Groundedness of Answers (*groundedness_context_answer*): Assess how well the answer (**A**) is supported by the retrieved contexts (**C**).
- Relevance between Query and Retrieved Context (*relevance_query_context*): Evaluate the relevance of the retrieved context (**C**) to the original query (**Q**).
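
As a concrete (if crude) illustration, all three reference-free checks above can be approximated with plain token overlap. Production-grade implementations in frameworks such as RAGAS or TruLens typically use embeddings or LLM judges instead; this is only a toy stand-in:

```python
# Toy, reference-free stand-ins for the three metrics above, using simple
# Jaccard token overlap. Real metrics use embeddings or LLM judges.
def _overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def relevance_query_answer(query: str, contexts: list[str], answer: str) -> float:
    # How relevant is the generated answer (A) to the query (Q)?
    return _overlap(query, answer)


def groundedness_context_answer(query: str, contexts: list[str], answer: str) -> float:
    # How well is the answer (A) supported by some retrieved context (C)?
    return max((_overlap(ctx, answer) for ctx in contexts), default=0.0)


def relevance_query_context(query: str, contexts: list[str], answer: str) -> float:
    # How relevant is the best retrieved context (C) to the query (Q)?
    return max((_overlap(query, ctx) for ctx in contexts), default=0.0)
```

Each function deliberately follows the `(query, contexts, answer) -> score` metric signature assumed in the Evaluator sketch above, so they could be dropped into that loop.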

#### Evaluation With Ground Truth

- Compare Generated and Ground Truth Answers (*answer_correctness*): Many evaluation techniques compare the generated answer (**A**) with the ground truth answer (**A\***).
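
One widely used realization of *answer_correctness* is token-level F1 between **A** and **A\***, as in SQuAD-style QA evaluation; a minimal sketch, not the framework's own implementation:

```python
# Token-level F1 between the generated answer (A) and the ground truth
# answer (A*), as in SQuAD-style QA evaluation. A sketch only.
from collections import Counter


def answer_correctness(generated: str, ground_truth: str) -> float:
    gen, gt = generated.lower().split(), ground_truth.lower().split()
    # Count tokens shared between the two answers (with multiplicity).
    common = sum((Counter(gen) & Counter(gt)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(gen), common / len(gt)
    return 2 * precision * recall / (precision + recall)
```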

## 🧑🏿‍💻 Developing
