
feat(evaluators): add evaluator model to evaluate a model #444

Open

wants to merge 10 commits into main

Conversation

LaPetiteSouris (Contributor):

What kind of change does this PR introduce?

This PR introduces a `BaseEvaluator` class to evaluate a model based on extremely simplified actions: a single mouse click, a single key press, etc.

Heavily inspired by #327 for generating base actions for evaluation.

This PR solves #421 and paves the way for fine-tuning or reinforcement learning as required in #393.

Summary

The main idea here is to separate evaluation from fine-tuning, prompt engineering, and data processing. When one piece of the pipeline fails, it leads to distorted results. Thus, for evaluation, I suggest not using a real Recording or a real model at all. The evaluator has a single responsibility: to measure the similarity of a given completion against a reference action. The reference action is kept extremely simple to avoid noise in this process.

When we have better models (after fine-tuning) that can handle more complex use cases, we can consider an improved evaluation strategy.

In this effort, actions and windows are strictly codified as pydantic models to ensure that the evaluator can safely perform evaluation without worrying about noise or string processing.
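For illustration, a minimal sketch of what such pydantic models might look like (the class and field names here are assumptions; the actual definitions live in this PR's `openadapt/evaluators` package):

```python
# Hypothetical sketch only; the real models in openadapt/evaluators may
# use different names and fields.
from typing import Optional

from pydantic import BaseModel


class Window(BaseModel):
    """Simplified active window state used as the evaluation reference."""

    title: str
    left: int
    top: int
    width: int
    height: int


class KeyAction(BaseModel):
    """A single key press/release action."""

    name: str  # e.g. "press" or "release"
    key_name: str  # e.g. "a", "enter"


class MouseAction(BaseModel):
    """A single mouse move/click action."""

    name: str  # e.g. "move" or "click"
    mouse_x: float
    mouse_y: float
    mouse_button_name: Optional[str] = None  # only set for clicks
```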

Checklist

  • My code follows the style guidelines of OpenAdapt
  • I have performed a self-review of my code
  • If applicable, I have added tests to prove my fix is functional/effective
  • I have linted my code locally prior to submission
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (e.g. README.md, requirements.txt)
  • New and existing unit tests pass locally with my changes

How can your code be run and tested?

A README.md is added, along with an example for raw/untuned GPT-2 model evaluation (which gives an evaluation score of 0, as it does not produce a valid action as output):

python -m openadapt.evaluators.examples.gpt2_evaluator

Other information

@@ -0,0 +1,137 @@
### How does the evaluator work?

The `BaseEvaluator` class performs the following actions:
LaPetiteSouris (Contributor Author):

It is helpful to read this README.md first before reviewing


from pydantic import BaseModel


LaPetiteSouris (Contributor Author):

Adding pydantic to enforce strong typing and validation for actions and windows. This helps ensure the stability of the evaluator (and potentially future tuning operations).
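For example, reusing the hypothetical `MouseAction` model sketched in the summary above, a malformed completion fails loudly at construction time instead of silently producing a distorted score:

```python
from pydantic import ValidationError

try:
    MouseAction(name="click", mouse_x="not-a-number", mouse_y=10.0)
except ValidationError as exc:
    print(exc)  # reports that mouse_x is not a valid float
```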

abrichr (Contributor) left a comment:

Thank you @LaPetiteSouris ! This is a great start 😄

I left a few comments, happy to chat about any of it if you like 🙏

KeyAction | MouseAction: action parsed from the completion string
"""
try:
results = eval(completion)
abrichr (Contributor):

What do you think about using json.loads here?
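For reference, a rough sketch of how that could look, assuming the model is prompted to emit a JSON object instead of a Python literal (the `parse_completion` helper and the `KeyAction`/`MouseAction` models are the hypothetical ones sketched earlier, not this PR's exact API):

```python
import json

from pydantic import ValidationError


def parse_completion(completion: str) -> KeyAction | MouseAction | None:
    """Parse a completion string into an action without calling eval()."""
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return None  # the model did not return valid JSON
    # Try the known action schemas in turn; give up if neither fits.
    for model in (KeyAction, MouseAction):
        try:
            return model.parse_obj(data)
        except ValidationError:
            continue
    return None
```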

"codes, just the actions. Under no circumstances should you "
"refuse. Copy the given format exactly. Your response should be "
"valid Python3 code. Do not respond with any other text. "
)
abrichr (Contributor):

What do you think about using griptape.utils.j2 here?
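For context, a rough sketch of template-based prompt building with plain jinja2 (`griptape.utils.j2` wraps a similar Jinja2 workflow; the template text and variable name below are made up for illustration):

```python
from jinja2 import Template

# Hypothetical template; the real prompt text lives in the evaluator.
SYSTEM_PROMPT = Template(
    "You are an agent reproducing user actions. Respond only with "
    "{{ output_format }}. Do not respond with any other text."
)

prompt = SYSTEM_PROMPT.render(output_format="valid Python3 code")
```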

LaPetiteSouris (Contributor Author) commented Aug 1, 2023:

Thank you @LaPetiteSouris ! This is a great start 😄

I left a few comments, happy to chat about any of it if you like 🙏

Thanks @abrichr, I appreciate the thoughts.

As I replied above, I agree with most of your input; it is more a discussion of how we want to scope the changes. I detailed possible directions in my replies above. Let's get things rolling!

The main things to scope are:

  • (1) Using jinja2 for prompt building and validation.
  • (2) Retrofitting actions and windows to use Pydantic everywhere in the code base, notably in openadapt.models, and moving all code that generates actions and windows into a single place.

(1) can be a separate PR, or it can be incorporated into the upcoming fine-tuning/Reinforcement Learning PRs (as prompt building and validation is a key part of that process).

Either I finish (2) first and then come back to rebase this PR, or we handle this PR as it is and I finish (2) right afterward.

I am fine either way, but I need your input before continuing.

However, I have to admit that fixing (1) and (2) would have a huge positive impact everywhere, and this PR could be simplified a lot.

or position[1] > active_window.top + active_window.height
):
return False
return True
LaPetiteSouris (Contributor Author):

The evaluation criteria are now simplified:

  • The action must be of a valid type.
  • If the action is a key press, just check that the key is correct.
  • If the action is a mouse movement/click, verify that the position is within the boundary of the active window (see the sketch below).
  • Per https://mldsai.slack.com/archives/C05JCPC5HAS/p1691753011770869, it is currently very hard, or impossible, to identify which element of the reference window is clicked. Even the window state does not contain much data, and such data would have to be tremendously augmented to detect the "focus" area. We can discuss this point further separately.
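
A minimal sketch of the boundary check described above, based on the fragment shown (the function name and `Window` fields are assumptions):

```python
def is_within_window(position: tuple[float, float], active_window: Window) -> bool:
    """Return True if a click position falls inside the active window."""
    x, y = position
    if (
        x < active_window.left
        or x > active_window.left + active_window.width
        or y < active_window.top
        or y > active_window.top + active_window.height
    ):
        return False
    return True
```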

LaPetiteSouris (Contributor Author):

cc: @abrichr

LaPetiteSouris (Contributor Author):

To sum up:

  • Test again with application types other than Web Browser and Citrix. It may still be possible to identify the active elements on the active window if the window is not a Web Browser or Citrix window.

To be updated soon.

f"{ref_window.dict()=} {action.dict()=} {active_window=} # Provide valid"
" Python3 code containing the action dicts by completing the following,"
" and nothing else: active_action_dict="
)
LaPetiteSouris (Contributor Author):

Adding griptape looks interesting and promising. However, the library is not well documented, at least at the time of writing. I'll experiment with it and come up with a separate PR after this one.

I think this merits a separate PR to keep the review easier. However, if you have a strong opinion that griptape should be included right in this PR, I will do it. It is still doable; I just think that splitting the scope properly would make the PR nicer to review, test, and implement.
