Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iFrames support #362

Closed
wants to merge 6 commits into from
Closed

Conversation

alexzarbn
Copy link
Contributor

Problem

Crawling iframes is currently not supported, but to properly navigate some websites it is crucial to interact with iframes.

Solution

  • Crawl iframes by injecting domUtils.js into each iframe and building element tree by combining elements from the main page and iframes.
  • When processing an action, resolve locator instances based on the frame the action has to take place (main page or a specific iframe).

Note: the proposed implementation works with iframes one level deep, meaning iframes in iframes are not supported.

The PR is based on the #361

Closes #8

Copy link
Contributor

@ykeremy ykeremy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for looking into this! I was excited to even review it 🤩

I have some questions and concerns around some major points. Let me know if you want to hop on a call and discuss things. We can definitely resolve it async as well. Both works for me.

@@ -14,6 +14,7 @@ Reply in JSON format with the following keys:
"confidence_float": float, // The confidence of the action. Pick a number between 0.0 and 1.0. 0.0 means no confidence, 1.0 means full confidence
"action_type": str, // It's a string enum: "CLICK", "INPUT_TEXT", "UPLOAD_FILE", "SELECT_OPTION", "WAIT", "SOLVE_CAPTCHA", "COMPLETE", "TERMINATE". "CLICK" is an element you'd like to click. "INPUT_TEXT" is an element you'd like to input text into. "UPLOAD_FILE" is an element you'd like to upload a file into. "SELECT_OPTION" is an element you'd like to select an option from. "WAIT" action should be used if there are no actions to take and there is some indication on screen that waiting could yield more actions. "WAIT" should not be used if there are actions to take. "SOLVE_CAPTCHA" should be used if there's a captcha to solve on the screen. "COMPLETE" is used when the user goal has been achieved AND if there's any data extraction goal, you should be able to get data from the page. Never return a COMPLETE action unless the user goal is achieved. "TERMINATE" is used to terminate the whole task with a failure when it doesn't seem like the user goal can be achieved. Do not use "TERMINATE" if waiting could lead the user towards the goal. Only return "TERMINATE" if you are on a page where the user goal cannot be achieved. All other actions are ignored when "TERMINATE" is returned.
"id": str, // The id of the element to take action on. The id has to be one from the elements list
"frame": str, // It's either "main" or the id of the iframe element from the elements list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to keep the LLM unaware of frames. Can we create a mapping unique_id -> frame and decide which frame we're going to use based on the unique_id LLM returns?

Do you see any potential problems with the proposed approach ☝🏼 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at all. It should also reduce any potential errors when LLM would not return the correct frame.

@@ -36,6 +36,7 @@ class Action(BaseModel):


class WebAction(Action, abc.ABC):
frame: str | int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, I think we only need to update the action handling logic. Specifically how playwright interacts with the page.

We can build and store the element_id_to_frame mapping in the ScrapedPage object we build as a result of scrape_web_unsafe.

Then, we can pass that info to resolve_locator and trace back to the original frame there without changing the way LLM, Action objects work.

WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I have just pushed the update with id -> frame mapping. Please let me know, if everything is in order.

Comment on lines +753 to +757
} else if (element.tagName.toLowerCase() === 'iframe') {
let iframeElementObject = buildElementObject(element, true);

elements.push(iframeElementObject);
resultArray.push(iframeElementObject);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we skip the iframes here?

function buildElementObject(element, interactable) {

Interactable here means that LLM can take an action such as CLICK/INPUT_TEXT/SELECT_OPTION etc on this specific element. Is that possible for this root iframe element here? Or are we making it possible via building an element tree for each of the iframes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention of including iframes in the elements list was to set unique id on the iframe for mapping later on and build element tree for each iframe. Not sure how to implement that without building the element object.

@alexzarbn alexzarbn requested a review from ykeremy May 28, 2024 09:27
@LawyZheng
Copy link
Collaborator

close to #405

@LawyZheng LawyZheng closed this Jun 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support interacting elements within iframes
3 participants