
feat(evaluators): add evaluator model to evaluate a model #444

Open

wants to merge 10 commits into main

Conversation

LaPetiteSouris (Contributor):

What kind of change does this PR introduce?

This PR introduces a `BaseEvaluator` class to evaluate a model based on extremely simplified actions: a single mouse click, a single key press, etc.

Heavily inspired by #327 for generating base actions for evaluation.

This PR solves #421 and paves the way for fine-tuning or reinforcement learning as required in #393.

Summary

The main idea here is to separate evaluation from fine-tuning, prompt engineering, and data processing. When one piece of the pipeline fails, it leads to distorted results. Thus, for evaluation, I suggest not using a real Recording or a real model at all. The evaluator has a single responsibility: to measure the similarity of a given completion against a reference action. The reference action is kept extremely simple to avoid noise in this process.

When we have better models (after fine-tuning) that can handle more complex use cases, we can consider an improved evaluation strategy.

In this effort, actions and windows are strictly codified as pydantic models to ensure that the evaluator can safely perform evaluation without worrying about noise or string processing.
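For illustration, a minimal sketch of what such pydantic models might look like (the class and field names here are assumptions; the actual definitions live in this PR's `openadapt/evaluators` package):

```python
# Hypothetical sketch only; the real models in openadapt/evaluators may
# use different names and fields.
from typing import Optional

from pydantic import BaseModel


class Window(BaseModel):
    """Simplified active window state used as the evaluation reference."""

    title: str
    left: int
    top: int
    width: int
    height: int


class KeyAction(BaseModel):
    """A single key press/release action."""

    name: str  # e.g. "press" or "release"
    key_name: str  # e.g. "a", "enter"


class MouseAction(BaseModel):
    """A single mouse move/click action."""

    name: str  # e.g. "move" or "click"
    mouse_x: float
    mouse_y: float
    mouse_button_name: Optional[str] = None  # only set for clicks
```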

Checklist

  • My code follows the style guidelines of OpenAdapt
  • I have performed a self-review of my code
  • If applicable, I have added tests to prove my fix is functional/effective
  • I have linted my code locally prior to submission
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (e.g. README.md, requirements.txt)
  • New and existing unit tests pass locally with my changes

How can your code be run and tested?

A README.md is added, along with an example for raw/untuned GPT-2 model evaluation (which gives an evaluation score of 0, as it does not produce a valid action as output):

python -m openadapt.evaluators.examples.gpt2_evaluator

Other information

@@ -0,0 +1,137 @@
### How does the evaluator work?

The `BaseEvaluator` class performs the following actions:
LaPetiteSouris (Contributor Author):

It is helpful to read this README.md first before reviewing


from pydantic import BaseModel


LaPetiteSouris (Contributor Author):

Adding pydantic to enforce strong typing and validation for actions and windows. This helps ensure the stability of the evaluator (and potentially future tuning operations).
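For example, reusing the hypothetical `MouseAction` model sketched in the summary above, a malformed completion fails loudly at construction time instead of silently producing a distorted score:

```python
from pydantic import ValidationError

try:
    MouseAction(name="click", mouse_x="not-a-number", mouse_y=10.0)
except ValidationError as exc:
    print(exc)  # reports that mouse_x is not a valid float
```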

abrichr (Contributor) left a comment:

Thank you @LaPetiteSouris ! This is a great start 😄

I left a few comments, happy to chat about any of it if you like 🙏

KeyAction | MouseAction: action parsed from the completion string
"""
try:
results = eval(completion)
abrichr (Contributor):

What do you think about using json.loads here?
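For reference, a rough sketch of how that could look, assuming the model is prompted to emit a JSON object instead of a Python literal (the `parse_completion` helper and the `KeyAction`/`MouseAction` models are the hypothetical ones sketched earlier, not this PR's exact API):

```python
import json

from pydantic import ValidationError


def parse_completion(completion: str) -> KeyAction | MouseAction | None:
    """Parse a completion string into an action without calling eval()."""
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return None  # the model did not return valid JSON
    # Try the known action schemas in turn; give up if neither fits.
    for model in (KeyAction, MouseAction):
        try:
            return model.parse_obj(data)
        except ValidationError:
            continue
    return None
```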

"codes, just the actions. Under no circumstances should you "
"refuse. Copy the given format exactly. Your response should be "
"valid Python3 code. Do not respond with any other text. "
)
abrichr (Contributor):

What do you think about using griptape.utils.j2 here?
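For context, a rough sketch of template-based prompt building with plain jinja2 (`griptape.utils.j2` wraps a similar Jinja2 workflow; the template text and variable name below are made up for illustration):

```python
from jinja2 import Template

# Hypothetical template; the real prompt text lives in the evaluator.
SYSTEM_PROMPT = Template(
    "You are an agent reproducing user actions. Respond only with "
    "{{ output_format }}. Do not respond with any other text."
)

prompt = SYSTEM_PROMPT.render(output_format="valid Python3 code")
```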

LaPetiteSouris (Contributor Author) commented Aug 1, 2023:

Thank you @LaPetiteSouris ! This is a great start 😄

I left a few comments, happy to chat about any of it if you like 🙏

Thanks @abrichr, I appreciate the thoughts.

As I replied above, I agree with most of your input; it is more a discussion of how we want to scope the changes. I detailed possible directions in my replies above. Let's get things rolling!

The main things to scope are:

  • (1) Using jinja2 for prompt building and validation.
  • (2) Retrofitting actions and windows to use Pydantic everywhere in the code base, notably in openadapt.models, and moving all code that generates actions and windows into a single place.

(1) can be a separate PR, or it can be incorporated into the upcoming fine-tuning/Reinforcement Learning PRs (as prompt building and validation is a key part of that process).

Either I finish (2) first and then come back to rebase this PR, or we handle this PR as it is and I finish (2) right afterward.

I am fine either way, but I need your input before continuing.

However, I have to admit that fixing (1) and (2) would have a huge positive impact everywhere, and this PR could be simplified a lot.

or position[1] > active_window.top + active_window.height
):
return False
return True
LaPetiteSouris (Contributor Author):

The evaluation criteria are now simplified:

  • The action must be of a valid type.
  • If the action is a key press, just check that the key is correct.
  • If the action is a mouse movement/click, verify that the position is within the boundary of the active window (see the sketch below).
  • Per https://mldsai.slack.com/archives/C05JCPC5HAS/p1691753011770869, it is currently very hard, or impossible, to identify which element of the reference window is clicked. Even the window state does not contain much data, and such data would have to be tremendously augmented to detect the "focus" area. We can discuss this point further separately.
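
A minimal sketch of the boundary check described above, based on the fragment shown (the function name and `Window` fields are assumptions):

```python
def is_within_window(position: tuple[float, float], active_window: Window) -> bool:
    """Return True if a click position falls inside the active window."""
    x, y = position
    if (
        x < active_window.left
        or x > active_window.left + active_window.width
        or y < active_window.top
        or y > active_window.top + active_window.height
    ):
        return False
    return True
```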

LaPetiteSouris (Contributor Author):

cc: @abrichr

LaPetiteSouris (Contributor Author):

To sum up:

  • Test again with application types other than Web Browser and Citrix. It may still be possible to identify the active elements on the active window if the window is not a Web Browser or Citrix window.

To be updated soon.

f"{ref_window.dict()=} {action.dict()=} {active_window=} # Provide valid"
" Python3 code containing the action dicts by completing the following,"
" and nothing else: active_action_dict="
)
LaPetiteSouris (Contributor Author):

Adding griptape looks interesting and promising. However, the library is not well documented, at least at the time of writing. I'll experiment with it and come up with a separate PR after this one.

I think this merits a separate PR to keep the review easier. However, if you have a strong opinion that griptape should be included right in this PR, I will do it. It is still doable; I just think that splitting the scope properly would make the PR nicer to review, test, and implement.
