queryMessages field added & query generation optimization #653

Open · wants to merge 6 commits into main

Conversation

@Vegoo89 commented Sep 20, 2023

Closes #641

Purpose

  • Added option to use queryMessages to generate optimized search query
  • Option is enabled by default and can be switched off in the Settings; when switched off, the standard history is used as before

Does this introduce a breaking change?

[ ] Yes
[x] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

How to Test

  • Test the code: build the frontend and run the unit tests
  • Unit tests have been updated and can be run as usual

What to Check

Verify that the following are valid

  • The queryMessages field should be present in the request and response JSON body (see the sketch after this list)
  • New elements should be added to this field as the conversation flows
  • On chat clear, this field's value should be reset
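
For illustration only, here is a minimal sketch of the kind of check described in the first item; it assumes a locally running backend that exposes a /chat endpoint, and the URL, port, and history field shape are assumptions rather than values taken from this PR:

```python
# Hypothetical manual check, not part of this PR's test suite.
import requests

resp = requests.post(
    "http://localhost:50505/chat",  # assumed local backend address
    json={
        "history": [{"user": "What is included in my plan?"}],
        "queryMessages": [],  # empty at the start of a conversation
    },
)
body = resp.json()
assert "queryMessages" in body          # field is present in the response body
assert len(body["queryMessages"]) > 0   # new elements are appended as the conversation flows
```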

Other Information

@Vegoo89 (Author) commented Sep 20, 2023

@microsoft-github-policy-service agree

@Vegoo89 (Author) commented Sep 27, 2023

@pamelafox
Can I ask for a review of this functionality? Thanks!


```python
# STEP 1: Generate an optimized keyword search query based on the chat history and the last question
messages = self.get_messages_from_history(
    self.query_prompt_template,
    self.chatgpt_model,
    history,
    query_history_input,
```
Collaborator

Our prompt says "Below is a history of the conversation so far, and a new question asked by the user that needs to be answered by searching in a knowledge base about employee healthcare plans and the employee handbook."
I'm surprised you got good results by passing in the query history since it would seem to be in disagreement with the prompt. You didn't need to alter the prompt at all?

Author

Hmm, yeah, our prompt is different, customizable for every use case. But is it in disagreement with this prompt? I don't think so, since the structure contains history, but only in the form of user query messages and bot-generated queries. This is basically a list of few-shot examples (a rough sketch below).
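
To make that concrete, here is an illustrative sketch of what such a few-shot query history could look like; the example questions and generated queries are made up from the demo's healthcare-plan sample domain, and the variable names are not taken from the PR:

```python
# Hypothetical queryMessages content: prior user questions paired with the
# search queries that were generated for them, acting as few-shot examples.
query_messages = [
    {"role": "user", "content": "What is included in my Northwind Health Plus plan?"},
    {"role": "assistant", "content": "Northwind Health Plus plan coverage"},
    {"role": "user", "content": "Does it cover eye exams?"},
    {"role": "assistant", "content": "Northwind Health Plus eye exam coverage"},
]

# The new question is appended after these pairs, so the model imitates the
# earlier question -> query mappings when generating the next search query.
new_question = "How do I file a claim?"
messages = query_messages + [{"role": "user", "content": new_question}]
```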

@pamelafox (Collaborator)

Could you give examples of query generations that work better after this change? I'm looking into adding evaluation metrics to this repository so we can measure changes like this, but it's difficult to evaluate without good test data.

@Vegoo89 (Author) commented Sep 29, 2023

> Could you give examples of query generations that work better after this change? I'm looking into adding evaluation metrics to this repository so we can measure changes like this, but it's difficult to evaluate without good test data.

Most test queries in our benchmarks are more stable after this change was implemented, but we have somewhat different use cases, where different teams can alter queries for their needs.

We perform single-query and conversation tests with OpenAI rating / similarity evaluation.

I don't know which examples you would like, but I can't paste mine here due to internal company policies (these are internal data sets).

@pamelafox (Collaborator)

> > Could you give examples of query generations that work better after this change? I'm looking into adding evaluation metrics to this repository so we can measure changes like this, but it's difficult to evaluate without good test data.
>
> Most test queries in our benchmarks are more stable after this change was implemented, but we have somewhat different use cases, where different teams can alter queries for their needs.
>
> We perform single-query and conversation tests with OpenAI rating / similarity evaluation.
>
> I don't know which examples you would like, but I can't paste mine here due to internal company policies (these are internal data sets).

Okay, thanks for the additional information! I think your code change looks good, but I want to evaluate it using a new evaluation pipeline I'm working on in another branch. I'll add multi-turn evaluation to it soon which will enable me to test out this change. Sorry for the delay, but this is a great opportunity to try that out.

Also, if you can share anything about how you run evaluations, I'd love to hear more, as we're trying to figure out good developer flows for evaluation locally and in CI/CD.

@Vegoo89 (Author) commented Oct 4, 2023

> Okay, thanks for the additional information! I think your code change looks good, but I want to evaluate it using a new evaluation pipeline I'm working on in another branch. I'll add multi-turn evaluation to it soon, which will enable me to test out this change. Sorry for the delay, but this is a great opportunity to try that out.
>
> Also, if you can share anything about how you run evaluations, I'd love to hear more, as we're trying to figure out good developer flows for evaluation locally and in CI/CD.

Sure, I will wait until you try to run it; then you can ping me and I can rebase/merge the latest changes into my branch so there are no conflicts.

About evaluation, there are many possibilities, but GPT is pretty good at such tasks, so you can perform standard evaluation by e.g. calculating embeddings of the ground truth and the bot answer and then comparing them with a basic similarity metric (cosine, Euclidean), and you can also leverage a GPT model and ask it to compare the ground truth vs. the bot answer on a scale of your choosing (just add some few-shot examples so it knows what to do); a rough sketch is below. I'm pretty sure you already have similar ideas in mind, so you can use a few scores and blend them to get a 'final' one, or just use a single metric.
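
To illustrate the embedding-similarity part, here is a minimal sketch; it assumes the embedding vectors for the ground truth and the bot answer have already been obtained from whatever embeddings model you use, and none of these names come from the repository:

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Basic similarity metric between two embedding vectors."""
    a_vec, b_vec = np.asarray(a), np.asarray(b)
    return float(np.dot(a_vec, b_vec) / (np.linalg.norm(a_vec) * np.linalg.norm(b_vec)))

# ground_truth_vec and bot_answer_vec would come from an embeddings model
# (obtaining them is out of scope for this sketch).
# score = cosine_similarity(ground_truth_vec, bot_answer_vec)
#
# A GPT-graded score on a fixed scale (with a few-shot rubric in the prompt)
# can then be blended with this similarity score into a single final metric.
```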

For us, stability of the solution is the most important thing. We all know that when you screw up even a single part of the conversation, it is still kept in the history and may break things later on, so in our tests we focus mostly on that.

Also, I want to mention that our tests are nowhere near perfect or complete. We are still evolving them and adjusting them to our needs, so I am also waiting to see your approach :)

github-actions bot commented Dec 4, 2023

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed.

@github-actions bot added the Stale label on Dec 4, 2023
Labels: Stale
Projects: None yet
Development: Successfully merging this pull request may close these issues: Better generation of optimized search query
2 participants