-
Notifications
You must be signed in to change notification settings - Fork 426
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
iFrames support #362
iFrames support #362
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for looking into this! I was excited to even review it 🤩
I have some questions and concerns around some major points. Let me know if you want to hop on a call and discuss things. We can definitely resolve it async as well. Both works for me.
@@ -14,6 +14,7 @@ Reply in JSON format with the following keys: | |||
"confidence_float": float, // The confidence of the action. Pick a number between 0.0 and 1.0. 0.0 means no confidence, 1.0 means full confidence | |||
"action_type": str, // It's a string enum: "CLICK", "INPUT_TEXT", "UPLOAD_FILE", "SELECT_OPTION", "WAIT", "SOLVE_CAPTCHA", "COMPLETE", "TERMINATE". "CLICK" is an element you'd like to click. "INPUT_TEXT" is an element you'd like to input text into. "UPLOAD_FILE" is an element you'd like to upload a file into. "SELECT_OPTION" is an element you'd like to select an option from. "WAIT" action should be used if there are no actions to take and there is some indication on screen that waiting could yield more actions. "WAIT" should not be used if there are actions to take. "SOLVE_CAPTCHA" should be used if there's a captcha to solve on the screen. "COMPLETE" is used when the user goal has been achieved AND if there's any data extraction goal, you should be able to get data from the page. Never return a COMPLETE action unless the user goal is achieved. "TERMINATE" is used to terminate the whole task with a failure when it doesn't seem like the user goal can be achieved. Do not use "TERMINATE" if waiting could lead the user towards the goal. Only return "TERMINATE" if you are on a page where the user goal cannot be achieved. All other actions are ignored when "TERMINATE" is returned. | |||
"id": str, // The id of the element to take action on. The id has to be one from the elements list | |||
"frame": str, // It's either "main" or the id of the iframe element from the elements list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to keep the LLM unaware of frames. Can we create a mapping unique_id -> frame
and decide which frame we're going to use based on the unique_id
LLM returns?
Do you see any potential problems with the proposed approach ☝🏼 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at all. It should also reduce any potential errors when LLM would not return the correct frame.
skyvern/webeye/actions/actions.py
Outdated
@@ -36,6 +36,7 @@ class Action(BaseModel): | |||
|
|||
|
|||
class WebAction(Action, abc.ABC): | |||
frame: str | int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, I think we only need to update the action handling logic. Specifically how playwright interacts with the page.
We can build and store the element_id_to_frame
mapping in the ScrapedPage
object we build as a result of scrape_web_unsafe
.
Then, we can pass that info to resolve_locator
and trace back to the original frame there without changing the way LLM, Action objects work.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! I have just pushed the update with id -> frame mapping. Please let me know, if everything is in order.
} else if (element.tagName.toLowerCase() === 'iframe') { | ||
let iframeElementObject = buildElementObject(element, true); | ||
|
||
elements.push(iframeElementObject); | ||
resultArray.push(iframeElementObject); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we skip the iframes here?
function buildElementObject(element, interactable) {
Interactable here means that LLM can take an action such as CLICK/INPUT_TEXT/SELECT_OPTION etc on this specific element. Is that possible for this root iframe element here? Or are we making it possible via building an element tree for each of the iframes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention of including iframes in the elements list was to set unique id on the iframe for mapping later on and build element tree for each iframe. Not sure how to implement that without building the element object.
…access in O(1) instead
close to #405 |
Problem
Crawling iframes is currently not supported, but to properly navigate some websites it is crucial to interact with iframes.
Solution
domUtils.js
into each iframe and building element tree by combining elements from the main page and iframes.Note: the proposed implementation works with iframes one level deep, meaning iframes in iframes are not supported.
The PR is based on the #361
Closes #8