Support for indexing webpages #423

sabaimran · 2023-08-06T21:24:40Z

sabaimran
Aug 6, 2023
Maintainer

Hi all! There's been some discussion in the Discord about supporting online webpages as data sources. I wanted to get a better understanding of what people are interested in, and what should be supported here.

I was thinking over use cases for why URL indexing might be important. One thought is that it's a good resource for retrieval-augmented generation, given that LLMs have a fixed-in-time view of the world once they are done training. To mitigate this problem, we can add a search layer that looks up data on the web before responding, if a URL is provided.

I can also imagine having a URL resource provided as the reference data and wanting to chat with it.

What are some use cases you all are thinking of? This will help define the specifications for this integration.

Ellen7ions · 2023-08-07T01:13:01Z

Ellen7ions
Aug 7, 2023

Hi @sabaimran, I noticed that you plan to add support for indexing plain text files #420. Do you want to treat HTML as a plain text file as well? If so, PR #415 may not be necessary, which would be a loss of extensibility 😥.

1 reply

debanjum Aug 10, 2023
Maintainer

Hey @Ellen7ions, your PR is fine. I don't think it conflicts with Saba's. I've answered your concern in more detail on your PR. Let me know if you still have concerns about that.

But as I said in that PR comment, before you continue work on your PR, we should discuss whether we need support for indexing webpages in the main khoj repository and what shape such a feature should take.

Pipboyguy · 2023-08-10T09:05:25Z

Pipboyguy
Aug 10, 2023

I believe it will be incredibly useful for a variety of reasons.

Recursively indexing an entire website to use as a corpus is something clients have asked me for multiple times when they want semantic search with a GenAI layer on top. I believe this functionality will put Khoj on the radar of more interested parties.
Said users, and most people, don't have the skill set to crawl and index their desired website.
Web-retrieval-augmented GenAI tools like Perplexity AI are gaining popularity for a reason.
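
To make the "recursively indexing an entire website" idea concrete, here is a minimal sketch of a same-site, breadth-first crawl using only the Python standard library. It is not how Khoj does or would do this. The names `crawl_site`, `LinkParser`, and the injected `fetch` callable are hypothetical, chosen for illustration, and a real crawler would also need to respect robots.txt, rate limits, and authentication.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl_site(start_url, fetch, max_pages=50):
    """Breadth-first crawl limited to the start URL's host.

    `fetch` is a hypothetical callable mapping a URL to its HTML,
    or None when the page cannot be retrieved.
    Returns {url: html} for every page visited.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            link, _ = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client; the returned pages could then be handed to whatever text extraction and indexing step is chosen.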

1 reply

TomLucidor Oct 21, 2023

Combining webpage scraping (plus archiving with IA or ArchiveToday) with Logseq or Obsidian would have a compounding effect as well. Imagine if the wiki knew its sources.

FetchFast · 2023-10-20T00:35:05Z

FetchFast
Oct 20, 2023

At the moment, we can manually export browser bookmarks from Raindrop etc. to an HTML file and import that, but I don't think Khoj really handles that like it should...
unless it can be told that it's an index and to try to read the descriptions of the bookmarks?
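
As a rough illustration of treating a bookmark export as an index, the sketch below pulls the URL, title, and description out of a Netscape-style bookmarks HTML file using Python's standard `html.parser`. It assumes the export uses that format (`<DT><A HREF=...>` entries optionally followed by `<DD>` descriptions); `BookmarkParser` and `parse_bookmarks` are hypothetical names, not part of Khoj.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Extracts bookmarks from a Netscape-style bookmark export."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._field = None        # field that incoming text belongs to
        self._after_link = False  # True right after a bookmark's </a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.bookmarks.append({"url": href, "title": "", "description": ""})
            self._field = "title"
        elif tag == "dd" and self._after_link:
            # A <DD> directly after a link describes that link.
            self._field = "description"
        else:
            self._field = None
        self._after_link = False

    def handle_endtag(self, tag):
        self._after_link = tag == "a" and self._field == "title"
        self._field = None

    def handle_data(self, data):
        if self._field:
            self.bookmarks[-1][self._field] += data


def parse_bookmarks(html):
    """Returns a list of {"url", "title", "description"} dicts."""
    parser = BookmarkParser()
    parser.feed(html)
    parser.close()
    return [{k: v.strip() for k, v in b.items()} for b in parser.bookmarks]
```

Each resulting entry could be indexed as its own document, or its URL fetched separately, rather than indexing the export file as one blob of HTML.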

In general, I don't think Khoj is really working like an operating system with memory where it can be told to remember things via the semantic chat between restarts like MemGPT?

Bookmarks are a useful indicator of our interests, so it might be better to rank those higher than search results.

1 reply

debanjum Oct 21, 2023
Maintainer

At the moment, we can manually export browser bookmarks from Raindrop etc. to an HTML file and import that, but I don't think Khoj really handles that like it should...
unless it can be told that it's an index and to try to read the descriptions of the bookmarks?

Can you clarify your flow for trying to make Khoj index bookmarked pages? How do you try to import the downloaded HTML files into Khoj?

I don't think Khoj is really working like an operating system with memory where it can be told to remember things via the semantic chat between restarts like MemGPT?

Yeah, Khoj doesn't currently look up chat messages older than a few conversation turns. But we plan to add that soon after the re-architecture work we're doing is complete (in ~2 weeks). The new push-based mechanism to make Khoj index stuff should make that easy.

P.S: The operating system with memory phrase is neat 👌🏽

TomLucidor · 2023-10-21T04:55:10Z

TomLucidor
Oct 21, 2023

Some potential issues with webpage indexing:

The way articles are extracted varies from library to library: https://github.com/scrapinghub/article-extraction-benchmark
Webpage indexing is hard when authentication and ad-blocking exist (e.g. archive.today)

0 replies

FetchFast · 2023-10-21T06:59:56Z

FetchFast
Oct 21, 2023

I've seen those issues with other programs such as Omnivore, the read-it-later alternative. I wonder, wouldn't it be better to work from what the user sees when they're browsing the page in their browser, recording what is being viewed rather than fetching from the open web, which is dying? Or is recording what is seen something browsers actively fight against, so that a two-part design becomes necessary, much as we've seen with browser video downloaders breaking?

1 reply

TomLucidor Oct 23, 2023

Mixing this with internet archivers would also be good. Zotero has both "memento" (web archive) and the default archiver (local cache), which supplement each other.
