Add websearch controls for assistants #812

Merged
merged 50 commits into from Mar 14, 2024
Changes from 9 commits
6f7b200
remove query modifiers from generateQuery
nsarrazin Feb 10, 2024
6ba10db
Add backend for assistant RAG
nsarrazin Feb 10, 2024
a200884
Add front-end for updating RAG assistant
nsarrazin Feb 10, 2024
59066f2
enable web parser to return plaintext directly for matching headers
nsarrazin Feb 10, 2024
42b3d29
Update websearch flow for handling assistant rag preferences
nsarrazin Feb 10, 2024
df0db81
Add our old blocklist to .env.template
nsarrazin Feb 10, 2024
384b52c
Enable websearch to run on messages depending on assistant requirements
nsarrazin Feb 10, 2024
6778a6e
Merge branch 'main' into feature/rag_on_assistant
nsarrazin Feb 10, 2024
ba39acb
Merge branch 'main' into feature/rag_on_assistant
mishig25 Feb 15, 2024
059795d
reorganized imports
nsarrazin Feb 15, 2024
fff5e21
Rename vars
nsarrazin Feb 19, 2024
465b817
use projection
nsarrazin Feb 26, 2024
873133c
Add environment variable for assistant rag
nsarrazin Feb 26, 2024
268b8f4
Merge branch 'main' into feature/rag_on_assistant
nsarrazin Feb 26, 2024
620b58e
fix assistant rag on runwebsearch
nsarrazin Feb 26, 2024
5eb3c03
Merge branch 'main' into feature/rag_on_assistant
nsarrazin Feb 26, 2024
af8b3bf
fix styling if rag is disabled
nsarrazin Feb 26, 2024
2fbb28f
make sure we always omit credentials when fetching web pages
nsarrazin Feb 26, 2024
920a72e
Add new checks for SSRF, with a new env var `ENABLE_LOCAL_FETCH`
nsarrazin Feb 26, 2024
fc29655
Use DNS to check if the links are local or not
nsarrazin Mar 4, 2024
37fb70b
Merge branch 'main' into feature/rag_on_assistant
nsarrazin Mar 4, 2024
70d7c16
Merge branch 'main' into feature/rag_on_assistant
nsarrazin Mar 6, 2024
52e010c
Merge branch 'main' into feature/rag_on_assistant
nsarrazin Mar 6, 2024
b73a99f
Add a websearch indicator
nsarrazin Mar 7, 2024
c4eb1d4
Merge branch 'main' into feature/rag_on_assistant
nsarrazin Mar 11, 2024
9be64c6
Add more tags to parser
nsarrazin Mar 12, 2024
58dc4ec
Add indicators
nsarrazin Mar 12, 2024
eb23dc7
Display RAG options in settings view
nsarrazin Mar 12, 2024
0398ec6
ui
gary149 Mar 13, 2024
c6ffb01
fix rag detection
nsarrazin Mar 13, 2024
877d3cb
bit more spacing
nsarrazin Mar 13, 2024
5164deb
fix button position in assistant form
nsarrazin Mar 13, 2024
f34bed4
wording (mainly)
gary149 Mar 13, 2024
dfe3481
reduce number of tags
nsarrazin Mar 13, 2024
6a7c3b0
Bump max URLs from 3 to 10
nsarrazin Mar 14, 2024
6f822dc
add ul and ol to parseWeb
gary149 Mar 14, 2024
b8ba360
change splitting string
nsarrazin Mar 14, 2024
e6cd56d
link style
gary149 Mar 14, 2024
aab0c39
wording
gary149 Mar 14, 2024
039e3c4
add feedback link
gary149 Mar 14, 2024
822c3f3
Update src/routes/settings/(nav)/assistants/[assistantId]/+page.svelte
nsarrazin Mar 14, 2024
2c44a6f
Update src/routes/settings/(nav)/assistants/[assistantId]/+page.svelte
nsarrazin Mar 14, 2024
2314b24
Update src/routes/assistants/+page.svelte
nsarrazin Mar 14, 2024
940edd0
Update src/routes/settings/(nav)/assistants/[assistantId]/+page.svelte
nsarrazin Mar 14, 2024
a693c07
Update src/lib/components/chat/ChatWindow.svelte
nsarrazin Mar 14, 2024
a68ae9f
Update src/routes/settings/(nav)/assistants/[assistantId]/+page.svelte
nsarrazin Mar 14, 2024
ebaafaa
Update src/lib/components/AssistantSettings.svelte
nsarrazin Mar 14, 2024
bf2cb52
lint
nsarrazin Mar 14, 2024
1aac902
throw error if not a string
nsarrazin Mar 14, 2024
0760cbf
simplify rag check
nsarrazin Mar 14, 2024
2 changes: 2 additions & 0 deletions .env.template
@@ -248,3 +248,5 @@ EXPOSE_API=true
ALTERNATIVE_REDIRECT_URLS=`[
huggingchat://login/callback
]`

WEBSEARCH_BLOCKLIST=`["youtube.com", "twitter.com"]`
112 changes: 110 additions & 2 deletions src/lib/components/AssistantSettings.svelte
@@ -71,6 +71,14 @@
let deleteExistingAvatar = false;

let loading = false;

let ragMode: false | "links" | "domains" | "all" = assistant?.rag?.allowAll
? "all"
: (assistant?.rag?.links?.length ?? 0) > 0
? "links"
: (assistant?.rag?.allowList?.length ?? 0) > 0
? "domains"
: false;
</script>

<form
@@ -103,6 +111,24 @@
}
}

formData.delete("ragMode");

if (ragMode === false) {
formData.set("ragAllowAll", "false");
formData.set("ragLinkList", "");
formData.set("ragDomainList", "");
} else if (ragMode === "all") {
formData.set("ragAllowAll", "true");
formData.set("ragLinkList", "");
formData.set("ragDomainList", "");
} else if (ragMode === "links") {
formData.set("ragAllowAll", "false");
formData.set("ragDomainList", "");
} else if (ragMode === "domains") {
formData.set("ragAllowAll", "false");
formData.set("ragLinkList", "");
}

return async ({ result }) => {
loading = false;
await applyAction(result);
@@ -255,7 +281,7 @@
</label>
</div>

<label class="flex flex-col">
<div class="flex flex-col">
<span class="mb-1 text-sm font-semibold"> Instructions (system prompt) </span>
<textarea
name="preprompt"
@@ -264,7 +290,89 @@
value={assistant?.preprompt ?? ""}
/>
<p class="text-xs text-red-500">{getError("preprompt", form)}</p>
</label>
<div class="flex min-h-44 flex-col flex-nowrap">
<span class="my-2 text-smd font-semibold"> RAG Settings</span>

<label>
<input
checked={!ragMode}
on:change={() => (ragMode = false)}
type="radio"
name="ragMode"
value={false}
/>
<span class="my-2 text-sm" class:font-semibold={!ragMode}> Disabled </span>
{#if !ragMode}
<span class="block text-xs text-gray-500">
Assistant won't look for information on the web.
</span>
{/if}
</label>

<label>
<input
checked={ragMode === "all"}
on:change={() => (ragMode = "all")}
type="radio"
name="ragMode"
value={"all"}
/>
<span class="my-2 text-sm" class:font-semibold={ragMode === "all"}> Enabled </span>
{#if ragMode === "all"}
<span class="block text-xs text-gray-500">
Assistant can access any content on the web.
</span>
{/if}
</label>
<label>
<input
checked={ragMode === "links"}
on:change={() => (ragMode = "links")}
type="radio"
name="ragMode"
value={"links"}
/>
<span class="my-2 text-sm" class:font-semibold={ragMode === "links"}> Links </span>
</label>
{#if ragMode === "links"}
<span class="mb-2 text-xs text-gray-500">
Specify max 3 direct URLs the assistant will access. HTML & plaintext only. Separate the
list elements with a semicolon.
</span>
<input
name="ragLinkList"
class="w-full rounded-lg border-2 border-gray-200 bg-gray-100 p-2"
placeholder="https://raw.githubusercontent.com/huggingface/chat-ui/main/README.md"
value={assistant?.rag?.links?.join(";") ?? ""}
/>
<p class="text-xs text-red-500">{getError("ragLinkList", form)}</p>
{/if}

<label>
<input
checked={ragMode === "domains"}
on:change={() => (ragMode = "domains")}
type="radio"
name="ragMode"
value={"domains"}
/>
<span class="my-2 text-sm" class:font-semibold={ragMode === "domains"}> Domains </span>
</label>
{#if ragMode === "domains"}
<span class="mb-2 text-xs text-gray-500">
Specify allowed domains for web search. Separate the list elements with a semicolon.
</span>

<input
name="ragDomainList"
class="w-full rounded-lg border-2 border-gray-200 bg-gray-100 p-2"
placeholder="wikipedia.org;bbc.com"
value={assistant?.rag?.allowList?.join(";") ?? ""}
/>
<p class="text-xs text-red-500">{getError("ragDomainList", form)}</p>
{/if}
</div>
</div>
</div>

<div class="mt-5 flex justify-end gap-2">
15 changes: 1 addition & 14 deletions src/lib/server/websearch/generateQuery.ts
@@ -1,19 +1,6 @@
import type { Message } from "$lib/types/Message";
import { format } from "date-fns";
import { generateFromDefaultEndpoint } from "../generateFromDefaultEndpoint";
import { WEBSEARCH_ALLOWLIST, WEBSEARCH_BLOCKLIST } from "$env/static/private";
import { z } from "zod";
import JSON5 from "json5";

const listSchema = z.array(z.string()).default([]);

const allowList = listSchema.parse(JSON5.parse(WEBSEARCH_ALLOWLIST));
const blockList = listSchema.parse(JSON5.parse(WEBSEARCH_BLOCKLIST));

const queryModifier = [
...allowList.map((item) => "site:" + item),
...blockList.map((item) => "-site:" + item),
].join(" ");

export async function generateQuery(messages: Message[]) {
const currentDate = format(new Date(), "MMMM d, yyyy");
@@ -79,5 +66,5 @@ Current Question: Where is it being hosted?`,
preprompt: `You are tasked with generating web search queries. Give me an appropriate query to answer my question for google search. Answer with only the query. Today is ${currentDate}`,
});

return (queryModifier + " " + webQuery).trim();
return webQuery.trim();
}
49 changes: 28 additions & 21 deletions src/lib/server/websearch/parseWeb.ts
@@ -3,30 +3,37 @@ import { JSDOM, VirtualConsole } from "jsdom";
export async function parseWeb(url: string) {
const abortController = new AbortController();
setTimeout(() => abortController.abort(), 10000);
const htmlString = await fetch(url, { signal: abortController.signal })
.then((response) => response.text())
.catch();
const r = await fetch(url, { signal: abortController.signal }).catch();

const virtualConsole = new VirtualConsole();
virtualConsole.on("error", () => {
// No-op to skip console errors.
});
if (r.headers.get("content-type")?.includes("text/html")) {
const virtualConsole = new VirtualConsole();
virtualConsole.on("error", () => {
// No-op to skip console errors.
});

// put the html string into a DOM
const dom = new JSDOM(htmlString ?? "", {
virtualConsole,
});
// put the html string into a DOM
const dom = new JSDOM((await r.text()) ?? "", {
virtualConsole,
});

const { document } = dom.window;
const textElTags = "p";
const paragraphs = document.querySelectorAll(textElTags);
if (!paragraphs.length) {
throw new Error(`webpage doesn't have any "${textElTags}" element`);
}
const paragraphTexts = Array.from(paragraphs).map((p) => p.textContent);
const { document } = dom.window;
const textElTags = "p";
const paragraphs = document.querySelectorAll(textElTags);
if (!paragraphs.length) {
throw new Error(`webpage doesn't have any "${textElTags}" element`);
}
const paragraphTexts = Array.from(paragraphs).map((p) => p.textContent);

// combine text contents from paragraphs and then remove newlines and multiple spaces
const text = paragraphTexts.join(" ").replace(/ {2}|\r\n|\n|\r/gm, "");
// combine text contents from paragraphs and then remove newlines and multiple spaces
const text = paragraphTexts.join(" ").replace(/ {2}|\r\n|\n|\r/gm, "");

return text;
return text;
} else if (
r.headers.get("content-type")?.includes("text/plain") ||
r.headers.get("content-type")?.includes("text/markdown")
) {
return r.text();
} else {
throw new Error("Unsupported content type");
}
}
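The content-type branching that this hunk adds to `parseWeb` can be looked at in isolation. The following is a minimal sketch, not the PR's code: `classifyContentType` and `PageKind` are hypothetical names introduced here for illustration, and the function only mirrors the header checks without fetching anything.

```typescript
// Hypothetical type: the three outcomes of the header check in parseWeb.
type PageKind = "html" | "plaintext" | "unsupported";

// Hypothetical helper mirroring the content-type checks in parseWeb.
// It takes the raw header value (or null when the header is absent).
function classifyContentType(contentType: string | null): PageKind {
  if (contentType?.includes("text/html")) {
    // parseWeb would build a JSDOM document and extract paragraph text.
    return "html";
  }
  if (contentType?.includes("text/plain") || contentType?.includes("text/markdown")) {
    // parseWeb would return the response body as-is.
    return "plaintext";
  }
  // parseWeb would throw "Unsupported content type".
  return "unsupported";
}

console.log(classifyContentType("text/html; charset=utf-8"));
console.log(classifyContentType("text/markdown"));
console.log(classifyContentType("application/pdf"));
```

Because the check uses `includes`, parameters such as `; charset=utf-8` after the media type do not affect the result.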
71 changes: 51 additions & 20 deletions src/lib/server/websearch/runWebSearch.ts
@@ -13,12 +13,21 @@ import { defaultEmbeddingModel, embeddingModels } from "$lib/server/embeddingMod
const MAX_N_PAGES_SCRAPE = 10 as const;
const MAX_N_PAGES_EMBED = 5 as const;

const DOMAIN_BLOCKLIST = ["youtube.com", "twitter.com"];
import { WEBSEARCH_ALLOWLIST, WEBSEARCH_BLOCKLIST } from "$env/static/private";
import { z } from "zod";
import JSON5 from "json5";
import type { Assistant } from "$lib/types/Assistant";

const listSchema = z.array(z.string()).default([]);

const allowList = listSchema.parse(JSON5.parse(WEBSEARCH_ALLOWLIST));
const blockList = listSchema.parse(JSON5.parse(WEBSEARCH_BLOCKLIST));

export async function runWebSearch(
conv: Conversation,
prompt: string,
updatePad: (upd: MessageUpdate) => void
updatePad: (upd: MessageUpdate) => void,
ragSettings?: Assistant["rag"]
) {
const messages = (() => {
return [...conv.messages, { content: prompt, from: "user", id: crypto.randomUUID() }];
@@ -39,26 +48,48 @@ export async function runWebSearch(
}

try {
webSearch.searchQuery = await generateQuery(messages);
const searchProvider = getWebSearchProvider();
appendUpdate(`Searching ${searchProvider}`, [webSearch.searchQuery]);
const results = await searchWeb(webSearch.searchQuery);
webSearch.results =
(results.organic_results &&
results.organic_results.map((el: { title?: string; link: string; text?: string }) => {
try {
const { title, link, text } = el;
const { hostname } = new URL(link);
return { title, link, hostname, text };
} catch (e) {
// Ignore Errors
return null;
}
})) ??
[];
// if the assistant specified direct links, skip the websearch
if (ragSettings && ragSettings?.links.length > 0) {
appendUpdate("Using links specified in assistant directly. Skipping websearch");
webSearch.results = ragSettings.links.map((link) => {
return { link, hostname: new URL(link).hostname, title: "", text: "" };
});
} else {
webSearch.searchQuery = await generateQuery(messages);
const searchProvider = getWebSearchProvider();
appendUpdate(`Searching ${searchProvider}`, [webSearch.searchQuery]);

if (ragSettings && ragSettings?.allowList.length > 0) {
appendUpdate("Filtering results to only domains specified in assistant");
webSearch.searchQuery +=
" " + ragSettings.allowList.map((item) => "site:" + item).join(" ");
}

// handle the global lists
webSearch.searchQuery +=
" " + // keep the modifiers separated from the query text built so far
allowList.map((item) => "site:" + item).join(" ") +
" " +
blockList.map((item) => "-site:" + item).join(" ");

const results = await searchWeb(webSearch.searchQuery);
webSearch.results =
(results.organic_results &&
results.organic_results.map((el: { title?: string; link: string; text?: string }) => {
try {
const { title, link, text } = el;
const { hostname } = new URL(link);
return { title, link, hostname, text };
} catch (e) {
// Ignore Errors
return null;
}
})) ??
[];
}

webSearch.results = webSearch.results.filter((value) => value !== null);
webSearch.results = webSearch.results
.filter(({ link }) => !DOMAIN_BLOCKLIST.some((el) => link.includes(el))) // filter out blocklist links
.filter(({ link }) => !blockList.some((el) => link.includes(el))) // filter out blocklist links
.slice(0, MAX_N_PAGES_SCRAPE); // limit to first 10 links only

// fetch the model
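To illustrate how the allow and block lists shape a search in `runWebSearch`, here is a minimal sketch under stated assumptions. `withSiteModifiers` and `filterBlocked` are hypothetical helpers, not functions from the PR; they only reproduce the `site:` / `-site:` string building and the `link.includes(...)` filter seen in this hunk.

```typescript
// Hypothetical helper: append "site:" and "-site:" modifiers to a base query,
// joining every part with a single space.
function withSiteModifiers(query: string, allow: string[], block: string[]): string {
  const modifiers = [
    ...allow.map((item) => "site:" + item),
    ...block.map((item) => "-site:" + item),
  ];
  return [query, ...modifiers].join(" ").trim();
}

// Hypothetical helper: drop results whose link contains a blocked domain,
// using the same substring check as the diff.
function filterBlocked<T extends { link: string }>(results: T[], block: string[]): T[] {
  return results.filter(({ link }) => !block.some((el) => link.includes(el)));
}

// Example values only; the real lists come from WEBSEARCH_ALLOWLIST / WEBSEARCH_BLOCKLIST.
const exampleBlockList = ["youtube.com", "twitter.com"];
const exampleQuery = withSiteModifiers("svelte stores", ["wikipedia.org"], exampleBlockList);
console.log(exampleQuery);

const exampleResults = filterBlocked(
  [
    { link: "https://www.youtube.com/watch?v=1" },
    { link: "https://en.wikipedia.org/wiki/Svelte" },
  ],
  exampleBlockList
);
console.log(exampleResults);
```

Note that the substring check also matches a blocked domain that appears anywhere in the URL, such as in a path or query string, which is a property of the approach in the diff rather than of this sketch.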
5 changes: 5 additions & 0 deletions src/lib/types/Assistant.ts
@@ -14,4 +14,9 @@ export interface Assistant extends Timestamps {
preprompt: string;
userCount?: number;
featured?: boolean;
rag?: {
allowAll: boolean;
allowList: string[];
links: string[];
};
}
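The new `rag` field maps onto the four radio options in `AssistantSettings.svelte`. As a rough sketch of that mapping, assuming the same precedence as the component (`allowAll`, then `links`, then `allowList`): `RagSettings`, `RagMode`, and `ragModeOf` are hypothetical names used only for this illustration.

```typescript
// Hypothetical standalone copy of the shape of Assistant["rag"].
interface RagSettings {
  allowAll: boolean;
  allowList: string[];
  links: string[];
}

type RagMode = false | "links" | "domains" | "all";

// Hypothetical helper: derive the UI mode from stored settings.
// allowAll wins over links, and links win over the domain allow list.
function ragModeOf(rag?: RagSettings): RagMode {
  if (rag?.allowAll) return "all";
  if ((rag?.links.length ?? 0) > 0) return "links";
  if ((rag?.allowList.length ?? 0) > 0) return "domains";
  return false;
}

console.log(ragModeOf(undefined));
console.log(ragModeOf({ allowAll: false, allowList: ["wikipedia.org"], links: [] }));
```

Parenthesizing `(… ?? 0) > 0` matters for readability: `??` binds more loosely than `>`, so without the parentheses the expression is parsed as `length ?? (0 > 0)`.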
10 changes: 10 additions & 0 deletions src/lib/utils/parseStringToList.ts
@@ -0,0 +1,10 @@
export function parseStringToList(links: unknown): string[] {
if (typeof links !== "string") {
return [];
}

return links
.split(";")
.map((link) => link.trim())
.filter((link) => link.length > 0);
}
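A short usage sketch for `parseStringToList`, with the function body copied from the hunk above (without `export`) so the snippet runs on its own. The input strings are made-up examples of what the `ragDomainList` or `ragLinkList` form fields might contain.

```typescript
// Copied from the diff above so that the snippet is self-contained.
function parseStringToList(links: unknown): string[] {
  if (typeof links !== "string") {
    return [];
  }

  return links
    .split(";")
    .map((link) => link.trim())
    .filter((link) => link.length > 0);
}

// Hypothetical form values: stray whitespace and empty segments are dropped.
const exampleDomains = parseStringToList(" wikipedia.org ; bbc.com;; ");
// Non-string input (for example a missing form field) yields an empty list.
const exampleMissing = parseStringToList(undefined);

console.log(exampleDomains); // [ "wikipedia.org", "bbc.com" ]
console.log(exampleMissing); // []
```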
24 changes: 21 additions & 3 deletions src/routes/conversation/[id]/+server.ts
@@ -250,13 +250,31 @@ export async function POST({ request, locals, params, getClientAddress }) {

let webSearchResults: WebSearch | undefined;

if (webSearch && !isContinue && !conv.assistantId) {
webSearchResults = await runWebSearch(conv, messages[messages.length - 1].content, update);
// check if assistant has a rag
const assistantRAG = await collections.assistants
.findOne({ _id: conv.assistantId })
.then((a) => {
return a?.rag;
});

const assistantHasRAG =
assistantRAG &&
(assistantRAG.links.length > 0 ||
assistantRAG.allowList.length > 0 ||
assistantRAG.allowAll);

// if websearch is enabled and the assistant is not specified or it is and has a rag
if (!isContinue && ((webSearch && !conv.assistantId) || assistantHasRAG)) {
webSearchResults = await runWebSearch(
conv,
messages[messages.length - 1].content,
update,
assistantRAG
);
messages[messages.length - 1].webSearch = webSearchResults;
} else if (isContinue) {
webSearchResults = messages[messages.length - 1].webSearch;
}

conv.messages = messages;

const previousContent = isContinue
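The condition that decides whether `runWebSearch` is called in this handler can be expressed as a pure function, which makes the cases easier to enumerate. This is a sketch only: `assistantHasRag`, `shouldRunWebSearch`, and the `RagSettings` interface are hypothetical names, and the logic is a restatement of the condition in the hunk above rather than code from the PR.

```typescript
// Hypothetical standalone copy of the shape of Assistant["rag"].
interface RagSettings {
  allowAll: boolean;
  allowList: string[];
  links: string[];
}

// Hypothetical helper: true when the assistant has any RAG option configured.
function assistantHasRag(rag?: RagSettings): boolean {
  return !!rag && (rag.links.length > 0 || rag.allowList.length > 0 || rag.allowAll);
}

// Hypothetical helper restating the condition in the POST handler:
// never search when continuing a message; otherwise search when the user
// toggled web search on a conversation without an assistant, or when the
// assistant itself has RAG settings.
function shouldRunWebSearch(opts: {
  isContinue: boolean;
  webSearch: boolean;
  assistantId?: string;
  rag?: RagSettings;
}): boolean {
  const { isContinue, webSearch, assistantId, rag } = opts;
  return !isContinue && ((webSearch && !assistantId) || assistantHasRag(rag));
}

console.log(shouldRunWebSearch({ isContinue: false, webSearch: true }));
console.log(
  shouldRunWebSearch({
    isContinue: false,
    webSearch: false,
    assistantId: "example-assistant-id",
    rag: { allowAll: true, allowList: [], links: [] },
  })
);
```

One consequence of this condition is that the user's web search toggle is ignored for conversations tied to an assistant: only the assistant's own RAG settings decide.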