Web Search: Playwright, spatial parsing, markdown #1094
For those who are testing, make sure to run
to get the necessary deps before running.
For nix users, pin the playwright version to the latest version in nixpkgs (currently 1.40.0) via:
{ pkgs ? import <nixpkgs> { } }:
pkgs.mkShell {
  nativeBuildInputs = with pkgs; [ playwright-driver.browsers ];
  shellHook = ''
    export PLAYWRIGHT_BROWSERS_PATH=${pkgs.playwright-driver.browsers}
    export PLAYWRIGHT_SKIP_VALIDATE_HOST_REQUIREMENTS=true
  '';
}
Let's go with JS disabled by default :)
Co-authored-by: Aaditya Sahay <[email protected]>
lgtm! Great work 🔥
Context
The existing web search implementation naively scrapes and chunks content for passing to the LLM. A brief explanation of how the relevant components work:
document.querySelectorAll('p, table, pre, ul, ol')
Solution
h1 [h2 [p p blockquote] h2 [h3 [...] ] ]
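The nesting shown above can be illustrated with a small sketch. This is not the PR's implementation; it assumes the page has already been flattened into a document-ordered list of `{ tag, text }` items, and the names `nestByHeadings`, `headingLevel`, and `render` are hypothetical.

```javascript
// Hypothetical helper: map "h1".."h6" to 1..6, anything else to null.
function headingLevel(tag) {
  const match = /^h([1-6])$/.exec(tag);
  return match ? Number(match[1]) : null;
}

// Nest a flat, document-ordered list of { tag, text } items under their headings.
function nestByHeadings(elements) {
  const root = { tag: "root", text: "", children: [] };
  // Stack of currently open sections; the root acts as level 0 and is never popped.
  const stack = [{ level: 0, node: root }];
  for (const { tag, text } of elements) {
    const node = { tag, text, children: [] };
    const level = headingLevel(tag);
    if (level === null) {
      // Non-heading content belongs to the innermost open heading.
      stack[stack.length - 1].node.children.push(node);
      continue;
    }
    // A new heading closes every open section at the same or a deeper level.
    while (stack[stack.length - 1].level >= level) stack.pop();
    stack[stack.length - 1].node.children.push(node);
    stack.push({ level, node });
  }
  return root;
}

// Render a node in the bracket notation used in the description.
function render(node) {
  if (node.children.length === 0) return node.tag;
  return `${node.tag} [${node.children.map(render).join(" ")}]`;
}

const flat = ["h1", "h2", "p", "p", "blockquote", "h2", "h3", "p"].map((tag) => ({ tag, text: "" }));
console.log(render(nestByHeadings(flat).children[0]));
// prints: h1 [h2 [p p blockquote] h2 [h3 [p]]]
```

Keeping a stack of open headings means each element is visited once, and content always attaches to the nearest preceding heading.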
Elements where element.text.length > embeddingMaxLength are split based on sentence boundaries.
Dynamically includes anywhere from 3000 to 8000 chars based on embedding distance. May result in slower searches when using local CPU embedding.
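A rough sketch of the two ideas above, sentence-boundary splitting and a character budget driven by embedding distance, might look like the following. The function names, the naive sentence regex, and especially the `maxDistance` cutoff are assumptions made for illustration; the description only states the 3000 to 8000 character range and that selection depends on embedding distance.

```javascript
// Hypothetical: split text that exceeds maxLength into chunks that end on sentence boundaries.
// The regex is naive (it also splits after abbreviations such as "e.g."), and a single
// sentence longer than maxLength is kept whole rather than cut mid-sentence.
function splitBySentences(text, maxLength) {
  if (text.length <= maxLength) return [text];
  const sentences = text.match(/[^.!?]+[.!?]+\s*|[^.!?]+$/g) ?? [text];
  const chunks = [];
  let current = "";
  for (const sentence of sentences) {
    if (current && current.length + sentence.length > maxLength) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

// Hypothetical: choose chunks by ascending embedding distance (smaller = more relevant).
// Chunks are always accepted until minChars is reached; after that, only chunks within
// maxDistance (an assumed cutoff) are accepted, and the total never exceeds maxChars.
function selectContext(scored, { minChars = 3000, maxChars = 8000, maxDistance = 0.5 } = {}) {
  const byRelevance = [...scored].sort((a, b) => a.distance - b.distance);
  const picked = [];
  let total = 0;
  for (const chunk of byRelevance) {
    if (total + chunk.text.length > maxChars) break;
    if (total >= minChars && chunk.distance > maxDistance) break;
    picked.push(chunk);
    total += chunk.text.length;
  }
  return picked;
}

console.log(splitBySentences("One. Two. Three.", 10));
// prints: [ 'One. Two.', 'Three.' ]
```

With this shape, a page with many close matches fills the context toward the upper limit, while a page with few relevant chunks stays near the lower one.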
Spatial parsing implementation written by @Aaditya-Sahay