Web Search: Playwright, spatial parsing, markdown #1094

Merged
merged 16 commits into huggingface:main on May 13, 2024


Saghen
Collaborator

@Saghen Saghen commented May 1, 2024

Context

The existing web search implementation naively scrapes and chunks content for passing to the LLM. A brief explanation of how the relevant components work:

  • Scraping: Fetch the page's HTML, statically parse it with selectors
    • document.querySelectorAll('p, table, pre, ul, ol')
  • Chunking: Concatenate the text of all elements with spaces, chunk the resulting text so that the length of each chunk is less than the maximum embedding length
  • Embedding: Get sentence similarity for each chunk, pass top 8 chunks to the LLM
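
To make the limitation concrete, the chunking step described above can be sketched roughly as follows. This is a minimal illustration, not the actual chat-ui code: `naiveChunk` and its `maxLength` parameter are hypothetical names, and fixed-width slicing is assumed as the splitting strategy.

```typescript
// Hypothetical helper: the name and parameters are illustrative only,
// not taken from the chat-ui codebase.
function naiveChunk(elementTexts: string[], maxLength: number): string[] {
  if (maxLength <= 0) throw new Error("maxLength must be positive");

  // Concatenate the text of every scraped element with spaces.
  // All structure (headings, lists, tables) is lost at this point.
  const text = elementTexts
    .map((t) => t.trim())
    .filter((t) => t.length > 0)
    .join(" ");

  // Cut the text into fixed-size windows of at most maxLength characters.
  // Chunks can start or end mid-word or mid-sentence.
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxLength) {
    chunks.push(text.slice(i, i + maxLength));
  }
  return chunks;
}

const exampleChunks = naiveChunk(["Hello world.", "Second paragraph."], 10);
console.log(exampleChunks);
```

Because the cut points ignore element and sentence boundaries, a chunk can mix the tail of one paragraph with the head of the next, which is the behaviour this PR sets out to replace.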

Solution

  • Scraping: Load the page into Playwright and perform spatial parsing
    • The spatial parser clusters elements by their position on the page to find the primary content. For example, you might end up with one cluster each for the header, footer, primary content, and sidebar. Heuristics such as text density then pick the cluster most likely to contain the primary content
    • Metadata scraping: title, description, site name, author, updated at, created at
  • Conversion to markdown/chunking:
    • Convert the resulting list of HTML elements into a tree like h1 [h2 [p p blockquote] h2 [h3 [...] ] ]
    • Convert the HTML elements into their markdown equivalents
  • Chunking: Treat each markdown element as a chunk. Split elements where element.text.length > embeddingMaxLength based on sentence boundaries
  • Embedding:
    • Get sentence similarity for each markdown element
    • Get top chunks and their parent heading (based on the tree from conversion to markdown) until embedding distance increases beyond a threshold, or a character limit is hit
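
The tree conversion step can be sketched as below. This is an assumption-laden illustration rather than the PR's implementation: `FlatElement`, `TreeNode`, `buildHeadingTree`, and `render` are hypothetical names, and the sketch assumes that each heading adopts every following element until a heading of the same or a higher level appears.

```typescript
// Hypothetical types: a flat, document-ordered list of scraped elements.
interface FlatElement {
  tag: string;
  text: string;
}

interface TreeNode {
  tag: string;
  text: string;
  children: TreeNode[];
}

// Returns 1-6 for h1-h6, or null for any non-heading tag.
function headingLevel(tag: string): number | null {
  const match = /^h([1-6])$/i.exec(tag);
  return match ? Number(match[1]) : null;
}

function buildHeadingTree(elements: FlatElement[]): TreeNode[] {
  const roots: TreeNode[] = [];
  // Stack of currently open headings, outermost first.
  const stack: { level: number; node: TreeNode }[] = [];

  for (const el of elements) {
    const node: TreeNode = { tag: el.tag.toLowerCase(), text: el.text, children: [] };
    const level = headingLevel(el.tag);

    if (level !== null) {
      // A new heading closes every open heading at the same or a deeper level.
      while (stack.length > 0 && stack[stack.length - 1].level >= level) {
        stack.pop();
      }
    }

    // Attach to the innermost open heading, or to the root if none is open.
    const parent = stack.length > 0 ? stack[stack.length - 1].node : null;
    if (parent) parent.children.push(node);
    else roots.push(node);

    if (level !== null) stack.push({ level, node });
  }
  return roots;
}

// Renders the tree in the bracket notation used in the description.
function render(nodes: TreeNode[]): string {
  return nodes
    .map((n) => (n.children.length > 0 ? `${n.tag} [${render(n.children)}]` : n.tag))
    .join(" ");
}

const tags = ["h1", "h2", "p", "p", "blockquote", "h2", "h3", "p"];
const tree = buildHeadingTree(tags.map((tag) => ({ tag, text: "" })));
console.log(render(tree)); // h1 [h2 [p p blockquote] h2 [h3 [p]]]
```

Keeping the parent link in the tree is what lets the embedding step return a matching chunk together with the heading it sits under.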

The amount of context passed to the LLM is dynamic: anywhere from 3,000 to 8,000 characters are included, depending on embedding distance. This may make web searches take longer when using local CPU embedding.
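
One way to read the dynamic selection described above is sketched here. The exact rule is not spelled out in this description, so the sketch assumes that chunks are taken in order of increasing embedding distance, that a minimum character budget is always filled, and that selection then stops at the first chunk beyond a distance threshold or at a hard character cap. `ScoredChunk`, `SelectionOptions`, and `selectChunks` are hypothetical names, the distance values are made up, and parent headings are left out for brevity.

```typescript
// Hypothetical types: distance is the embedding distance to the query
// (smaller means more relevant).
interface ScoredChunk {
  text: string;
  distance: number;
}

interface SelectionOptions {
  minChars: number; // always try to include at least this much text
  maxChars: number; // never include more than this much text
  maxDistance: number; // stop once chunks are less relevant than this
}

function selectChunks(chunks: ScoredChunk[], opts: SelectionOptions): ScoredChunk[] {
  // Most relevant chunks first; copy so the caller's array is not mutated.
  const sorted = [...chunks].sort((a, b) => a.distance - b.distance);

  const selected: ScoredChunk[] = [];
  let total = 0;
  for (const chunk of sorted) {
    // Hard cap: stop before exceeding the character limit.
    if (total + chunk.text.length > opts.maxChars) break;
    // Past the minimum budget, stop at the first chunk beyond the threshold.
    if (total >= opts.minChars && chunk.distance > opts.maxDistance) break;
    selected.push(chunk);
    total += chunk.text.length;
  }
  return selected;
}

const picked = selectChunks(
  [
    { text: "a".repeat(2000), distance: 0.1 },
    { text: "b".repeat(2000), distance: 0.6 },
    { text: "c".repeat(2000), distance: 0.3 },
    { text: "d".repeat(2000), distance: 0.2 },
  ],
  { minChars: 3000, maxChars: 8000, maxDistance: 0.5 }
);
console.log(picked.map((c) => c.distance));
```

Under these assumptions a page with few relevant chunks contributes roughly the minimum budget, while a page that matches the query closely can contribute up to the cap.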

Spatial parsing implementation written by @Aaditya-Sahay

@mishig25
Collaborator

mishig25 commented May 2, 2024

for the ones who are testing, make sure to run

npm ci
npx playwright install

to get the necessary deps before running

@Aaditya-Sahay
Contributor

@Saghen Since we are using playwright as a library, we should add @playwright/browser-chromium to our list of dependencies so it automatically installs for people. See here

@Saghen Saghen marked this pull request as ready for review May 3, 2024 23:09
@Saghen
Collaborator Author

Saghen commented May 3, 2024

For nix users, pin the playwright version to the latest version in nixpkgs (currently 1.40.0) via npm i playwright@1.40.0. Then launch a nix shell with the following config:

{ pkgs ? import <nixpkgs> { } }:
pkgs.mkShell {
  nativeBuildInputs = with pkgs; [ playwright-driver.browsers ];

  shellHook = ''
    export PLAYWRIGHT_BROWSERS_PATH=${pkgs.playwright-driver.browsers}
    export PLAYWRIGHT_SKIP_VALIDATE_HOST_REQUIREMENTS=true
  '';
}

@gary149
Collaborator

gary149 commented May 10, 2024

Let's go with JS disabled by default :)

@gary149 gary149 requested a review from mishig25 May 10, 2024 15:32
@mishig25 mishig25 requested a review from nsarrazin May 10, 2024 16:02
Collaborator

@mishig25 mishig25 left a comment

lgtm! Great work 🔥

@mishig25 mishig25 merged commit 9ec5d84 into huggingface:main May 13, 2024
5 checks passed
4 participants