[Codex] Add handling for Conversational RAG to Validator API #84
Conversation
@@ -108,3 +112,40 @@ def test_update_scores_based_on_thresholds() -> None:
    for metric, expected in expected_is_bad.items():
        assert scores[metric]["is_bad"] is expected
    assert all(scores[k]["score"] == raw_scores[k]["score"] for k in raw_scores)


def test_prompt_tlm_with_message_history() -> None:
Add a test to confirm that no query rewriting happens when this is the first user message.
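A rough sketch of such a test; the import path, the Validator constructor arguments, and the patch target are assumptions about how this PR wires things up, not confirmed details:

```python
from unittest.mock import patch

from cleanlab_codex.validator import Validator  # assumed import path


def test_no_query_rewrite_for_first_user_message() -> None:
    validator = Validator(codex_access_key="test-key")  # assumed constructor args
    query = "What is your refund policy?"
    messages = [{"role": "user", "content": query}]
    # Intercept the rewrite prompt so we can tell whether the TLM was consulted at all.
    with patch("cleanlab_codex.validator.prompt_tlm_for_rewrite_query") as mock_rewrite:
        result = validator._maybe_rewrite_query(query=query, messages=messages)
    mock_rewrite.assert_not_called()
    # With no prior conversation, the query should come back unchanged.
    assert result == query
```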
Also add a test to confirm that the primary TrustworthyRAG.score(prompt, response) call happens with prompt reflecting the full chat history, not with prompt reflecting the rewritten query.
Confirm you are using this TLM utils method (cleanlab/cleanlab-tlm@a479e32) to turn the chat history into a prompt string.
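A sketch of the kind of assertion that could cover this; the import paths, the Validator constructor arguments, the keyword-style score call, and form_prompt_string being the helper from the linked commit are all assumptions, and a real test would also stub the rewrite TLM:

```python
from unittest.mock import patch

from cleanlab_codex.validator import Validator  # assumed import path
from cleanlab_tlm import TrustworthyRAG
from cleanlab_tlm.utils.chat import form_prompt_string  # assumed home of the linked helper


def test_score_prompt_reflects_full_chat_history() -> None:
    messages = [
        {"role": "user", "content": "Tell me about your refund policy."},
        {"role": "assistant", "content": "Refunds are available within 30 days."},
        {"role": "user", "content": "What about after that?"},
    ]
    validator = Validator(codex_access_key="test-key")  # assumed constructor args
    # Stub scoring; the real payload may need more fields for downstream threshold checks.
    fake_scores = {"trustworthiness": {"score": 0.9, "is_bad": False}}
    with patch.object(TrustworthyRAG, "score", return_value=fake_scores) as mock_score:
        validator.validate(
            query="What about after that?",
            context="Refund policy: ...",
            response="After 30 days, refunds are handled case by case.",
            messages=messages,
        )
    called_prompt = mock_score.call_args.kwargs["prompt"]  # assumes a keyword call
    # Scoring should see the whole conversation, not the rewritten standalone query.
    assert called_prompt == form_prompt_string(messages)
```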
Query: {query}

--
Message History: \n{messages}
You're already using triple-quotes.
Suggested change:
- Message History: \n{messages}
+ Message History:
+ {messages}
Message History: \n{messages}

--
Remember, return the Query as-is except in cases where the Query is missing key words or has content that should be additionally clarified."""
Suggested change:
- Remember, return the Query as-is except in cases where the Query is missing key words or has content that should be additionally clarified."""
+ Remember, return the Query as-is, except in cases where the Query is missing key words or has content that should be additionally clarified."""
So when the Query is missing key words or has content that should be additionally clarified, what should it do then?
"""Get the default configuration for the TLM.""" | ||
|
||
return { | ||
"quality_preset": "medium", |
Any reason for using this default preset here? How were these config options chosen?
AFAICT, the quality preset and verbosity flag are the same as the defaults, but we use gpt-4.1-mini by default instead of gpt-4.1-nano?
I'd expect us to pick defaults that favor lower latency, right?
The default TLM trustworthiness score in Validator must remain identical to the default TLM at all times, unless there is a spec explicitly written to change it.
This whole config should not be hardcoded here, I think. Instead, it can use the helper methods from the cleanlab_tlm library:
https://github.com/cleanlab/cleanlab-tlm/blob/main/src/cleanlab_tlm/utils/config.py
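For illustration, a minimal sketch of what that could look like; the helper names below are assumptions about what cleanlab_tlm/utils/config.py exposes, so check the module for the real API:

```python
# Sketch only: get_default_model / get_default_quality_preset are assumed names.
from cleanlab_tlm.utils.config import get_default_model, get_default_quality_preset


def get_default_tlm_config() -> dict:
    """Build the TLM config from the library defaults instead of hardcoding them."""
    return {
        "quality_preset": get_default_quality_preset(),
        "options": {"model": get_default_model()},
    }
```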
@@ -108,3 +132,38 @@ def is_bad(metric: str) -> bool:
    if is_bad("trustworthiness"):
        return "hallucination"
    return "other_issues"


def validate_messages(messages: Optional[list[dict[str, Any]]] = None) -> None:
I think the name validate_messages should be more carefully chosen, given that the validator module already reserves the method name validate in Validator for looking at the trustworthiness & Eval scores.
I'd bet we wouldn't change the Validator.validate API, but we could find a different name for validate_messages since it behaves quite differently.
Consider having validate_messages take messages as a required (positional) argument:

Suggested change:
- def validate_messages(messages: Optional[list[dict[str, Any]]] = None) -> None:
+ def validate_messages(messages: list[dict[str, Any]]) -> None:

Everywhere it's being called, it takes in a messages argument. The caller already sets a default value for that argument, so I'd advise against setting default values in two function signatures.
messages_str = "" | ||
for message in messages: | ||
messages_str += f"{message['role']}: {message['content']}\n" |
Nit. Just use a single join over a list comprehension to build the string, instead of leaving a trailing newline after the last entry:

Suggested change:
- messages_str = ""
- for message in messages:
-     messages_str += f"{message['role']}: {message['content']}\n"
+ messages_str = "\n".join([f"{m['role']}: {m['content']}" for m in messages])
@@ -296,6 +318,25 @@ def _remediate(self, *, query: str, metadata: dict[str, Any] | None = None) -> s
        codex_answer, _ = self._project.query(question=query, metadata=metadata)
        return codex_answer

    def _maybe_rewrite_query(self, *, query: str, messages: list[dict[str, Any]]) -> str:
This _maybe... prefix implies that we might get something other than a string back from the method. Should the check for self._tlm be done by the caller instead?
def prompt_tlm_for_rewrite_query(query: str, messages: list[dict[str, Any]], tlm: TLM) -> TLMResponse:
    """Given the query and message history, prompt the TLM for a response that could possibly be self contained.
Did we decide this should be TLM in the end? I thought we were thinking this could just be a regular OpenAI call.
If sticking with TLM here, you must ensure:
- this is a different instance of TLM than the one being used for trustworthiness scoring of the response
- this instance of TLM is minimal latency (gpt-4.1-nano model, quality_preset='base')
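If TLM stays, a minimal sketch of a dedicated low-latency instance for the rewrite step; this assumes the cleanlab_tlm TLM constructor's quality_preset parameter and an options dict with a model entry, and is not the PR's actual wiring:

```python
from cleanlab_tlm import TLM

from cleanlab_codex.validator import prompt_tlm_for_rewrite_query  # assumed import path

# Separate, latency-optimized TLM used only for query rewriting, so the
# TLM doing trustworthiness scoring keeps its own default configuration.
rewrite_tlm = TLM(
    quality_preset="base",
    options={"model": "gpt-4.1-nano"},
)

messages = [
    {"role": "user", "content": "Do you ship internationally?"},
    {"role": "assistant", "content": "Yes, to most countries."},
    {"role": "user", "content": "How long does it take?"},
]
rewritten = prompt_tlm_for_rewrite_query("How long does it take?", messages, tlm=rewrite_tlm)
```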