[Codex] Add handling for Conversational RAG to Validator API #84
Conversation
@@ -108,3 +112,40 @@ def test_update_scores_based_on_thresholds() -> None:
    for metric, expected in expected_is_bad.items():
        assert scores[metric]["is_bad"] is expected
    assert all(scores[k]["score"] == raw_scores[k]["score"] for k in raw_scores)


def test_prompt_tlm_with_message_history() -> None:
Add a test to confirm that no query rewriting happens when this is the first user message.
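A rough sketch of such a test; the import path, the Validator constructor arguments, and the patch target are assumptions about how this PR wires things up, not confirmed details:

```python
from unittest.mock import patch

from cleanlab_codex.validator import Validator  # assumed import path


def test_no_query_rewrite_for_first_user_message() -> None:
    validator = Validator(codex_access_key="test-key")  # assumed constructor args
    query = "What is your refund policy?"
    messages = [{"role": "user", "content": query}]
    # Intercept the rewrite prompt so we can tell whether the TLM was consulted at all.
    with patch("cleanlab_codex.validator.prompt_tlm_for_rewrite_query") as mock_rewrite:
        result = validator._maybe_rewrite_query(query=query, messages=messages)
    mock_rewrite.assert_not_called()
    # With no prior conversation, the query should come back unchanged.
    assert result == query
```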
Also add a test to confirm that the primary TrustworthyRAG.score(prompt, response) call happens with prompt reflecting the full chat history, not with prompt reflecting the rewritten query.
Confirm you are using this TLM utils method (cleanlab/cleanlab-tlm@a479e32) to turn the chat history into a prompt string.
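A sketch of the kind of assertion that could cover this; the import paths, the Validator constructor arguments, the keyword-style score call, and form_prompt_string being the helper from the linked commit are all assumptions, and a real test would also stub the rewrite TLM:

```python
from unittest.mock import patch

from cleanlab_codex.validator import Validator  # assumed import path
from cleanlab_tlm import TrustworthyRAG
from cleanlab_tlm.utils.chat import form_prompt_string  # assumed home of the linked helper


def test_score_prompt_reflects_full_chat_history() -> None:
    messages = [
        {"role": "user", "content": "Tell me about your refund policy."},
        {"role": "assistant", "content": "Refunds are available within 30 days."},
        {"role": "user", "content": "What about after that?"},
    ]
    validator = Validator(codex_access_key="test-key")  # assumed constructor args
    # Stub scoring; the real payload may need more fields for downstream threshold checks.
    fake_scores = {"trustworthiness": {"score": 0.9, "is_bad": False}}
    with patch.object(TrustworthyRAG, "score", return_value=fake_scores) as mock_score:
        validator.validate(
            query="What about after that?",
            context="Refund policy: ...",
            response="After 30 days, refunds are handled case by case.",
            messages=messages,
        )
    called_prompt = mock_score.call_args.kwargs["prompt"]  # assumes a keyword call
    # Scoring should see the whole conversation, not the rewritten standalone query.
    assert called_prompt == form_prompt_string(messages)
```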
Query: {query}

--
Message History: \n{messages}
You're already using triple-quotes.
Suggested change:
- Message History: \n{messages}
+ Message History:
+ {messages}
Message History: \n{messages}

--
Remember, return the Query as-is except in cases where the Query is missing key words or has content that should be additionally clarified."""
Suggested change:
- Remember, return the Query as-is except in cases where the Query is missing key words or has content that should be additionally clarified."""
+ Remember, return the Query as-is, except in cases where the Query is missing key words or has content that should be additionally clarified."""
So when the Query is missing key words or has content that should be additionally clarified, what should it do then?
"""Get the default configuration for the TLM.""" | ||
|
||
return { | ||
"quality_preset": "medium", |
Any reason for using this default preset here? How were these config options chosen?
AFAICT, the quality preset and verbosity flag are the same as the defaults, but we use gpt-4.1-mini by default instead of gpt-4.1-nano?
I'd expect us to pick defaults that favor lower latency, right?
The default TLM trustworthiness score in Validator must remain identical to the default TLM at all times, unless there is a spec explicitly written to change it.
This whole config should not be hardcoded here, I think. Instead, it can use the helper methods from the cleanlab_tlm library:
https://github.com/cleanlab/cleanlab-tlm/blob/main/src/cleanlab_tlm/utils/config.py
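For illustration, a minimal sketch of what that could look like; the helper names below are assumptions about what cleanlab_tlm/utils/config.py exposes, so check the module for the real API:

```python
# Sketch only: get_default_model / get_default_quality_preset are assumed names.
from cleanlab_tlm.utils.config import get_default_model, get_default_quality_preset


def get_default_tlm_config() -> dict:
    """Build the TLM config from the library defaults instead of hardcoding them."""
    return {
        "quality_preset": get_default_quality_preset(),
        "options": {"model": get_default_model()},
    }
```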
@@ -108,3 +132,38 @@ def is_bad(metric: str) -> bool:
    if is_bad("trustworthiness"):
        return "hallucination"
    return "other_issues"


def validate_messages(messages: Optional[list[dict[str, Any]]] = None) -> None:
I think the name validate_messages should be more carefully chosen, given that the validator module already reserves the method name validate in Validator for looking at the trustworthiness & Eval scores.
I'd bet we wouldn't change the Validator.validate API, but we could find a different name for validate_messages since it behaves quite differently.
Consider having validate_messages take messages as a required (positional) argument:

Suggested change:
- def validate_messages(messages: Optional[list[dict[str, Any]]] = None) -> None:
+ def validate_messages(messages: list[dict[str, Any]]) -> None:

Everywhere it's being called, it takes in a messages argument. The caller already sets a default value for that argument, so I'd advise against setting default values in two function signatures.
messages_str = "" | ||
for message in messages: | ||
messages_str += f"{message['role']}: {message['content']}\n" |
Nit. Just use a single join over a list comprehension to build the string, instead of leaving a trailing newline after the last entry:

Suggested change:
- messages_str = ""
- for message in messages:
-     messages_str += f"{message['role']}: {message['content']}\n"
+ messages_str = "\n".join([f"{m['role']}: {m['content']}" for m in messages])
@@ -296,6 +318,25 @@ def _remediate(self, *, query: str, metadata: dict[str, Any] | None = None) -> s
        codex_answer, _ = self._project.query(question=query, metadata=metadata)
        return codex_answer

    def _maybe_rewrite_query(self, *, query: str, messages: list[dict[str, Any]]) -> str:
This _maybe... prefix implies that we might get something other than a string back from the method. Should the check for self._tlm be done by the caller instead?
def prompt_tlm_for_rewrite_query(query: str, messages: list[dict[str, Any]], tlm: TLM) -> TLMResponse:
    """Given the query and message history, prompt the TLM for a response that could possibly be self contained.
Did we decide this should be TLM in the end? I thought we were thinking this could just be a regular OpenAI call.
If sticking with TLM here, you must ensure:
- this is a different instance of TLM than the one being used for trustworthiness scoring of the response
- this instance of TLM is minimal latency (gpt-4.1-nano model, quality_preset='base')
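If TLM stays, a minimal sketch of a dedicated low-latency instance for the rewrite step; this assumes the cleanlab_tlm TLM constructor's quality_preset parameter and an options dict with a model entry, and is not the PR's actual wiring:

```python
from cleanlab_tlm import TLM

from cleanlab_codex.validator import prompt_tlm_for_rewrite_query  # assumed import path

# Separate, latency-optimized TLM used only for query rewriting, so the
# TLM doing trustworthiness scoring keeps its own default configuration.
rewrite_tlm = TLM(
    quality_preset="base",
    options={"model": "gpt-4.1-nano"},
)

messages = [
    {"role": "user", "content": "Do you ship internationally?"},
    {"role": "assistant", "content": "Yes, to most countries."},
    {"role": "user", "content": "How long does it take?"},
]
rewritten = prompt_tlm_for_rewrite_query("How long does it take?", messages, tlm=rewrite_tlm)
```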