Replies: 5 comments 4 replies
-
Hi @sabaimran, I noticed that you plan to add support for indexing plain text files in #420. Do you want to treat HTML as a plain text file as well? If so, PR #415 may not be necessary, which would be a loss of extensibility 😥.
-
I believe it will be incredibly useful for a variety of reasons.
-
At the moment, we can manually export browser bookmarks from Raindrop etc. to an HTML file and import that, but I don't think Khoj really handles that the way it should. More generally, I don't think Khoj works like an operating system with memory, where it can be told via the semantic chat to remember things between restarts, the way MemGPT does. Bookmarks are a useful indicator of our interests, so it might be better to rank them higher than search results.
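For anyone who wants to experiment with the export-and-import route described above, here is a minimal sketch of pulling titles and URLs out of an exported bookmarks file. It assumes the export uses the common Netscape-style bookmark HTML (`<DT><A HREF=...>` entries), which I believe most browsers and bookmark managers produce. The names `Bookmark` and `parse_bookmarks` are made up for illustration and are not part of Khoj.

```python
from html.parser import HTMLParser
from typing import List, NamedTuple, Optional


class Bookmark(NamedTuple):
    """Hypothetical container for one exported bookmark."""
    title: str
    url: str


class _BookmarkParser(HTMLParser):
    """Collect every <a href="..."> element together with its link text."""

    def __init__(self) -> None:
        super().__init__()
        self.bookmarks: List[Bookmark] = []
        self._href: Optional[str] = None
        self._title_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> matches.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._title_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._title_parts).strip()
            # Fall back to the URL when the bookmark has no title text.
            self.bookmarks.append(Bookmark(title or self._href, self._href))
            self._href = None


def parse_bookmarks(html: str) -> List[Bookmark]:
    """Return the bookmarks found in an exported bookmarks HTML string."""
    parser = _BookmarkParser()
    parser.feed(html)
    parser.close()
    return parser.bookmarks


# Tiny made-up export used only to show the expected input shape.
SAMPLE_EXPORT = """<!DOCTYPE NETSCAPE-Bookmark-file-1>
<DL><p>
  <DT><A HREF="https://example.com/" ADD_DATE="1697880000">Example home</A>
  <DT><A HREF="https://example.com/article">An article</A>
</DL><p>
"""

if __name__ == "__main__":
    for bookmark in parse_bookmarks(SAMPLE_EXPORT):
        print(f"{bookmark.title}: {bookmark.url}")
```

The resulting list could then be handed to whatever indexes the pages themselves; how bookmarks should be weighted against other results is left open here.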
-
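Some potential issues with webpage indexing:
- The way articles are extracted varies from library to library: https://github.com/scrapinghub/article-extraction-benchmark
- Webpage indexing is hard when authentication and ad-blocking exist (e.g. archive.today)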
-
I've seen those issues with other programs such as Omnivore, the read-it-later alternative.
I wonder whether it would be better to index what the user actually sees when they browse the page in their browser, recording what is being viewed rather than fetching from the open web, which is dying.
Or is recording what is seen something browsers actively fight against, so that a two-part design becomes necessary, much as we've seen with browser video downloaders breaking?
On Sat, 21 Oct 2023, 12:55 TomLucidor wrote:
Some potential issues with webpage indexing:
- The way articles are extracted varies from library to library: https://github.com/scrapinghub/article-extraction-benchmark
- Webpage indexing is hard when authentication and ad-blocking exist (e.g. archive.today)
-
Hi all! There's been some discussion in the Discord about supporting online webpages as data sources. I wanted to get a better understanding of what people are interested in and what should be supported here.
I was thinking over use cases for why URL indexing might be important. One thought is that it's a good resource for data-augmented retrieval, given that LLMs have a fixed-in-time view of the world once they are done training. To mitigate this problem, we can add a search layer that looks up data on the web before responding, if a URL is provided.
I can also imagine having a URL resource provided as the reference data and wanting to chat with it.
What are some use cases you all are thinking of? This will help define the specifications for this integration.
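To make the "look up the URL before responding" idea concrete, here is a rough sketch using only the Python standard library. It is not how Khoj does or will do this. The function names (`find_urls`, `html_to_text`, `fetch_html`, `build_prompt`), the prompt layout, and the `max_chars` cutoff are all assumptions for illustration. Real article extraction is harder than stripping tags, as noted earlier in the thread.

```python
import re
from html.parser import HTMLParser
from typing import Callable, List
from urllib.request import urlopen

# Simple pattern for http(s) links in a chat message (an approximation).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Naive tag stripping; a real extractor would also drop nav, ads, etc."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def find_urls(message: str) -> List[str]:
    """Return the URLs mentioned in a message, minus trailing punctuation."""
    return [url.rstrip(".,;:!?)") for url in URL_PATTERN.findall(message)]


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page; pages behind authentication will not work this way."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def build_prompt(
    message: str,
    fetch: Callable[[str], str] = fetch_html,
    max_chars: int = 2000,
) -> str:
    """Prepend the text of any linked pages to the user's message."""
    sections = []
    for url in find_urls(message):
        try:
            text = html_to_text(fetch(url))
        except Exception as error:  # network errors, bad URLs, etc.
            text = f"[could not fetch page: {error}]"
        sections.append(f"Source: {url}\n{text[:max_chars]}")
    if not sections:
        return message
    return "\n\n".join(sections) + f"\n\nQuestion: {message}"
```

The `fetch` parameter is injectable so the same flow could be backed by a cache, a browser extension that captures what the user actually saw, or a proper extraction library instead of a plain download.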