This example shows how to crawl a website, process the HTML files, and query them using RAG.
Here is the processing pipeline:

```text
website --(crawler)--> HTML files --(html2pq)--> markdown content --(llama-index)--> save to vector DB (Milvus) --(query)--> LLM
```
Create and activate a conda environment:

```bash
conda create -n dpk-html-processing-py311 python=3.11
conda activate dpk-html-processing-py311
```
If you are on Linux, also install the compilers:

```bash
conda install -y gcc_linux-64
conda install -y gxx_linux-64
```
Install the required modules:

```bash
pip install -r requirements.txt
```
Inspect the configuration in `my_config.py`. Here you can set:

- the site to crawl
- how many files to download and the crawl depth
- the embedding model
- the LLM to use
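For orientation, here is a hypothetical sketch of what such a config might look like; the key names and values below are illustrative assumptions, not the actual contents of `my_config.py`:

```python
# Hypothetical shape of my_config.py -- key names and values are
# illustrative assumptions; check the real file for the actual settings.
MY_CONFIG = {
    "crawl_url": "https://example.com",           # site to crawl
    "crawl_max_downloads": 20,                    # how many files to download
    "crawl_max_depth": 2,                         # crawl depth
    "embedding_model": "BAAI/bge-small-en-v1.5",  # embedding model
    "llm_model": "meta/meta-llama-3-8b-instruct", # LLM to use
}
```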
This step crawls the site and downloads the HTML files into the `input` directory.
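The example uses the project's own crawler, but a minimal, generic sketch of the idea (a breadth-first fetch with a depth limit and a download cap, built here on `requests` and `beautifulsoup4` as assumed, illustrative choices) looks like this:

```python
# Generic crawler sketch (illustration only -- the notebook uses the
# project's own crawler). Assumes `requests` and `beautifulsoup4` are
# installed; the site URL and limits below are placeholders.
import os
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

SITE = "https://example.com"  # placeholder: site to crawl
MAX_FILES = 20                # placeholder: download limit
MAX_DEPTH = 2                 # placeholder: crawl depth
OUT_DIR = "input"             # downloaded HTML files land here

os.makedirs(OUT_DIR, exist_ok=True)
seen, queue, saved = {SITE}, deque([(SITE, 0)]), 0

while queue and saved < MAX_FILES:
    url, depth = queue.popleft()
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        continue
    if "text/html" not in resp.headers.get("content-type", ""):
        continue
    # Save the page as a local HTML file
    with open(os.path.join(OUT_DIR, f"page_{saved:04d}.html"), "w",
              encoding="utf-8") as f:
        f.write(resp.text)
    saved += 1
    if depth >= MAX_DEPTH:
        continue
    # Follow same-site links only
    for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
        nxt = urljoin(url, a["href"])
        if urlparse(nxt).netloc == urlparse(SITE).netloc and nxt not in seen:
            seen.add(nxt)
            queue.append((nxt, depth + 1))
```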
Next we process the downloaded HTML files and extract the text as markdown; the output is saved in the `output/2-markdown` directory. This step is implemented in `2_extract_text_from_html.ipynb`.
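As a rough illustration of the HTML-to-markdown step, here is a minimal sketch using the `markdownify` package; this is an assumed choice for illustration and may differ from the converter the notebook actually uses:

```python
# Minimal sketch: convert saved HTML files to markdown files.
# Uses the `markdownify` package (an assumed choice for illustration).
import os
from markdownify import markdownify as md

os.makedirs("output/2-markdown", exist_ok=True)
for name in os.listdir("input"):
    if not name.endswith(".html"):
        continue
    with open(os.path.join("input", name), encoding="utf-8") as f:
        html = f.read()
    markdown = md(html, heading_style="ATX")  # HTML -> markdown text
    out_name = name.replace(".html", ".md")
    with open(os.path.join("output/2-markdown", out_name), "w",
              encoding="utf-8") as f:
        f.write(markdown)
```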
We will save the extracted text (markdown) into a vector database (Milvus).
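A minimal sketch of this step with llama-index and its Milvus integration might look as follows; the local database path and embedding model are placeholder assumptions, and the embedding dimension must match whatever model `my_config.py` selects:

```python
# Minimal sketch: embed the markdown files and index them into Milvus.
# Assumes llama-index with the Milvus vector store and HuggingFace
# embeddings installed; paths and model names are placeholders.
from llama_index.core import (Settings, SimpleDirectoryReader,
                              StorageContext, VectorStoreIndex)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.milvus import MilvusVectorStore

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Milvus Lite stores everything in a local file;
# dim must match the embedding model (384 for bge-small-en-v1.5)
vector_store = MilvusVectorStore(uri="./rag_website.db", dim=384, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("output/2-markdown").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```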
For this step, we will use the Replicate API service, which requires a Replicate API token. Follow these steps:

- Get a free account at [replicate.com](https://replicate.com)
- Use this invite to add some credit 💰 to your Replicate account!
- Create an API token on the Replicate dashboard
Once you have an API token, add it to the project like this:

- Copy the file `env.sample.txt` to `.env` (note the dot at the beginning of the filename)
- Add your token to `REPLICATE_API_TOKEN` in the `.env` file
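The notebooks can then pick the token up at runtime; a minimal sketch, assuming the `python-dotenv` package:

```python
# Minimal sketch: load REPLICATE_API_TOKEN from the .env file
# (assumes the python-dotenv package is installed).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
token = os.getenv("REPLICATE_API_TOKEN")
assert token, "REPLICATE_API_TOKEN is not set -- check your .env file"
```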
Finally, we query the documents using the LLM.
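A minimal query sketch, assuming the llama-index Replicate integration (`llama-index-llms-replicate`) and the `index` built in the previous step; the model name is a placeholder for whatever `my_config.py` specifies:

```python
# Minimal sketch: answer questions over the Milvus-backed index via
# an LLM served on Replicate. Model name is a placeholder assumption.
from llama_index.core import Settings
from llama_index.llms.replicate import Replicate

Settings.llm = Replicate(model="meta/meta-llama-3-8b-instruct")

query_engine = index.as_query_engine()  # `index` from the indexing step
response = query_engine.query("What is this website about?")
print(response)
```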