Tarsier

🙈 Vision utilities for web interaction agents 🙈

If you've tried using GPT-4(V) to automate web interactions, you've probably run into questions like:

How do you map LLM responses back into web elements?
How can you mark up a page so an LLM better understands its action space?
How do you feed a "screenshot" to a text-only LLM?

At Reworkd, we found ourselves reusing the same utility libraries to solve these problems across multiple projects. Because of this, we're now open-sourcing them as a simple utility library for multimodal web agents: Tarsier! The video below demonstrates Tarsier in use: a page snapshot is fed into a LangChain agent, which then takes actions on the page.

tarsier.mp4

How does it work?

Tarsier works by visually "tagging" interactable elements on a page with a bracketed id such as [1]. This gives GPT-4(V) a mapping between elements and ids that it can use to take actions. We define interactable elements as buttons, links, or input fields that are visible on the page.
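The tagging idea can be sketched in plain Python. This is only an illustration of the concept, not Tarsier's implementation: the `Element` record, `INTERACTABLE_KINDS`, and `tag_elements` are hypothetical names, and starting ids at 1 is an assumption taken from the [1] example above.

```python
from dataclasses import dataclass


# Hypothetical element record for illustration; Tarsier works on the live page.
@dataclass
class Element:
    kind: str      # e.g. "button", "link", "input", "paragraph"
    xpath: str
    visible: bool
    text: str = ""


# Assumption: these labels stand in for "buttons, links, or input fields".
INTERACTABLE_KINDS = {"button", "link", "input"}


def tag_elements(elements):
    """Give each visible interactable element a sequential id."""
    tag_to_xpath = {}
    labels = []
    for el in elements:
        if el.kind in INTERACTABLE_KINDS and el.visible:
            tag_id = len(tag_to_xpath) + 1  # ids start at 1 in this sketch
            tag_to_xpath[tag_id] = el.xpath
            labels.append(f"[{tag_id}] {el.text}")
    return labels, tag_to_xpath


elements = [
    Element("link", "//a[1]", True, "Hacker News"),
    Element("paragraph", "//p[1]", True, "Some story text"),  # not interactable
    Element("button", "//button[1]", False, "Hidden"),        # not visible
    Element("input", "//input[1]", True, "search"),
]
labels, tag_to_xpath = tag_elements(elements)
print(labels)
print(tag_to_xpath)
```

Only the visible link and input receive ids here; the paragraph and the hidden button are skipped, so every id the model sees corresponds to something it can act on.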

Tarsier can also provide a textual representation of the page, which enables deeper interaction even for LLMs that are not multimodal. This matters given the performance issues of existing vision-language models. To support this, Tarsier provides OCR utils that convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
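To make "whitespace-structured string" concrete, here is a minimal sketch that places OCR words on a character grid according to their pixel positions. It is an assumption about the general approach, not Tarsier's actual algorithm: `layout_text`, the `(text, x, y)` word format, and the `char_width` and `line_height` values are all hypothetical.

```python
def layout_text(words, char_width=10, line_height=20):
    """Place (text, x, y) words on a text grid using their pixel positions."""
    rows = {}
    for text, x, y in words:
        # Convert pixel coordinates to a (row, column) cell on the grid.
        rows.setdefault(y // line_height, []).append((x // char_width, text))

    lines = []
    last_row = max(rows) if rows else -1
    for row in range(last_row + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and len(line) >= col:
                line += " "            # words would collide: keep one space
            else:
                line = line.ljust(col)  # pad with spaces up to the column
            line += text
        lines.append(line)
    return "\n".join(lines)


# Hypothetical OCR output: word text plus the top-left pixel of its box.
words = [
    ("Hacker", 0, 0),
    ("News", 70, 0),
    ("login", 300, 0),
    ("[1]", 0, 40),
    ("Story", 40, 40),
]
page_text = layout_text(words)
print(page_text)
```

Horizontal gaps become runs of spaces and vertical gaps become blank lines, so a text-only model still gets a rough sense of where things sit on the page.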

Installation

pip install tarsier

Usage

Visit our cookbook for agent examples using Tarsier:

An autonomous LangChain web agent 🦜⛓️
An autonomous LlamaIndex web agent 🦙

Otherwise, basic Tarsier usage might look like the following:

import asyncio

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService

async def main():
    # Placeholder: supply your Google Cloud credentials here
    google_cloud_credentials = {}

    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        page_text, tag_to_xpath = await tarsier.page_to_text(page)

        print(tag_to_xpath)  # Mapping of tags to XPaths
        print(page_text)  # Text representation of the page


if __name__ == '__main__':
    asyncio.run(main())
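Once you have `tag_to_xpath`, an LLM reply that mentions a tag can be mapped back to an element. The sketch below assumes the model answers with a bracketed id (for example "click [2]") and that the mapping is keyed by integer ids; both are assumptions, and `resolve_tag` is a hypothetical helper, not part of Tarsier.

```python
import re


def resolve_tag(reply: str, tag_to_xpath: dict) -> str:
    """Return the XPath for the first [id] tag mentioned in an LLM reply."""
    match = re.search(r"\[(\d+)\]", reply)
    if match is None:
        raise ValueError("no [id] tag found in reply")
    tag_id = int(match.group(1))
    if tag_id not in tag_to_xpath:
        raise KeyError(f"unknown tag id: {tag_id}")
    return tag_to_xpath[tag_id]


# Example mapping shaped like the one printed above (integer keys assumed).
tag_to_xpath = {1: "//a[1]", 2: "//input[1]"}
xpath = resolve_tag("I will click [2] to focus the search box.", tag_to_xpath)
print(xpath)

# With a Playwright page in scope, the element could then be acted on, e.g.:
#     await page.locator(f"xpath={xpath}").click()
```

Validating the id before acting keeps a hallucinated tag from turning into a browser error later on.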

Supported OCR Services

Google Cloud Vision
Amazon Textract (Coming Soon)
Microsoft Azure Computer Vision (Coming Soon)

Roadmap

Add documentation and examples
Clean up interfaces and add unit tests
Launch
Improve OCR text performance
Add options to customize tagging styling
Add support for other browsers drivers as necessary
Add support for other OCR services as necessary

Citations

@misc{reworkd2023tarsier,
  title        = {Tarsier},
  author       = {Rohan Pandey and Adam Watkins and Asim Shrestha and Srijan Subedi},
  year         = {2023},
  howpublished = {GitHub},
  url          = {https://github.com/reworkd/tarsier}
}
