Build embeddings #506

Open
wants to merge 28 commits into
base: main

mishig25
Contributor

@mishig25 mishig25 commented Jun 10, 2024

Add a new command to doc-builder that creates embeddings from the docs of Hugging Face libraries.

Usage of the new command:

doc-builder embeddings [lib name] [docs path]
# example: doc-builder embeddings diffusers ~/diffusers/docs/source/en

How it works

Step 1: Create chunks from the docs files
interface Chunk {
    text: string, // "# Effective and efficient diffusion\nGetting the `DiffusionPipeline` to generate images in a certain style or include what you want can be tricky ...
    source: string, // source url ex: hf.co/docs/transformers/model_doc/bert#transformers.BertConfig
    package_name: string, // ex: transformers, diffusers
}

doc-builder handles all the autodoc derivatives, crawling the correct Python objects and rendering their docstrings as Markdown. For example, the diffusers chunks are in diffuers-chunks.json.
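
As a rough illustration of Step 1 (not the PR's actual implementation), the sketch below splits one Markdown page into a chunk per heading section and fills the three `Chunk` fields. The helper name `chunk_markdown`, the anchor scheme, and the example URL are assumptions made for this example. It also naively treats any line starting with `#` as a heading, so comment lines inside code samples would be split too.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str          # Markdown text of the section
    source: str        # URL (with anchor) the section came from
    package_name: str  # e.g. "transformers", "diffusers"


def chunk_markdown(markdown: str, page_url: str, package_name: str) -> list[Chunk]:
    """Hypothetical helper: one chunk per Markdown heading section."""
    chunks = []
    # Split right before every line that starts with 1-6 '#' followed by a space.
    for section in re.split(r"(?m)^(?=#{1,6} )", markdown):
        text = section.strip()
        if not text:
            continue
        # Derive a URL anchor from the first line of the section (assumed scheme).
        heading = text.splitlines()[0].lstrip("#").strip()
        anchor = re.sub(r"[^a-z0-9]+", "-", heading.lower()).strip("-")
        chunks.append(Chunk(text=text, source=f"{page_url}#{anchor}", package_name=package_name))
    return chunks


page = "# Title\nIntro text\n## Usage\nRun it\n"
chunks = chunk_markdown(page, "hf.co/docs/diffusers/example_page", "diffusers")
for chunk in chunks:
    print(chunk.source)
```

The real command additionally has to resolve autodoc output into Markdown before chunking, which this sketch does not attempt.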

Step 2: Embed those chunks using HF Inference Endpoints. In this case, we're using Snowflake/snowflake-arctic-embed-m.
// same as Chunk & { embedding: number[] }
interface Embedding {
    text: string, // "# Effective and efficient diffusion\nGetting the `DiffusionPipeline` to generate images in a certain style or include what you want can be tricky ...
    source: string, // source url ex: hf.co/docs/transformers/model_doc/bert#transformers.BertConfig
    package_name: string, // ex: transformers, diffusers
    embedding: number[] // ex: [ 0.6372, -0.2380,  0.5643, -0.2773, -1.1663,  0.6670,  2.9874, -0.2245, -1.4203,  ...]
}
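
A minimal sketch of how chunks could be turned into `Embedding` records in batches. The function name `embed_chunks`, the `embed_batch` callable, and the batch size are hypothetical; in the PR the vectors come from the Inference Endpoint, which is stubbed out here so the example runs offline.

```python
from typing import Callable, Sequence


def embed_chunks(
    chunks: Sequence[dict],
    embed_batch: Callable[[list], list],
    batch_size: int = 32,
) -> list:
    """Hypothetical helper: attach an `embedding` to every chunk dict."""
    embeddings = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        # `embed_batch` stands in for the call to the embedding endpoint.
        vectors = embed_batch([chunk["text"] for chunk in batch])
        if len(vectors) != len(batch):
            raise ValueError("embedding backend returned a different number of vectors")
        for chunk, vector in zip(batch, vectors):
            # Same fields as Chunk, plus the embedding vector.
            embeddings.append({**chunk, "embedding": list(vector)})
    return embeddings


def fake_embed_batch(texts):
    # Stand-in embedder for the example: a one-dimensional "vector" per text.
    return [[float(len(text))] for text in texts]


docs = [
    {"text": "abc", "source": "hf.co/docs/a", "package_name": "diffusers"},
    {"text": "hello", "source": "hf.co/docs/b", "package_name": "diffusers"},
    {"text": "x", "source": "hf.co/docs/c", "package_name": "diffusers"},
]
result = embed_chunks(docs, fake_embed_batch, batch_size=2)
print(len(result))
```

Passing the embedder in as a callable keeps the batching logic testable without a running endpoint.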

To save cost, I've turned on the HF Inference Endpoints setting that puts the endpoint to sleep after 15 minutes without usage. Therefore, I have this warm-up check:

# warm up API call
output_warmup = query({"inputs": "Hello World!"})
if isinstance(output_warmup, dict) and "error" in output_warmup:
    if output_warmup["error"] == "503 Service Unavailable":
        print("Waking up Embedding Inference Endpoints. Retrying in 5 minutes")
        time.sleep(300)  # 5 minutes
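
The check above sleeps once. One possible variation is to poll until the endpoint stops answering with a 503, with a cap on the number of attempts. The sketch below assumes the same `query` contract as the snippet (a dict with an `"error"` key on failure); `wait_until_awake`, its parameters, and the injected `sleep` are hypothetical names chosen for this example.

```python
import time


def wait_until_awake(query, max_attempts=5, delay_seconds=300, sleep=time.sleep):
    """Hypothetical helper: retry the warm-up call until the endpoint is awake."""
    for attempt in range(1, max_attempts + 1):
        output = query({"inputs": "Hello World!"})
        asleep = isinstance(output, dict) and output.get("error") == "503 Service Unavailable"
        if not asleep:
            # Either a real result or a different error: hand it back to the caller.
            return output
        print(f"Waking up Embedding Inference Endpoints (attempt {attempt}/{max_attempts})")
        if attempt < max_attempts:
            sleep(delay_seconds)
    raise TimeoutError("endpoint did not wake up in time")


# Offline demonstration with a fake endpoint that is asleep for two calls.
responses = [
    {"error": "503 Service Unavailable"},
    {"error": "503 Service Unavailable"},
    [[0.1, 0.2, 0.3]],
]
naps = []
output_warmup = wait_until_awake(lambda payload: responses.pop(0), sleep=naps.append)
print(output_warmup)
```

Injecting `sleep` keeps the demonstration instant; in real use the default `time.sleep` applies.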

Step 3: Upload the embeddings to a vector database. We will be using Meilisearch Vector Store.
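
Step 3 is still a TODO in this PR, so the following is only a guess at what the upload payload might look like. It assumes Meilisearch's convention of storing user-provided vectors under a `_vectors` field keyed by embedder name. The helper name, the embedder name `"default"`, and the hash-based document id are all assumptions for this sketch.

```python
import hashlib


def to_meilisearch_documents(embeddings, embedder_name="default"):
    """Hypothetical helper: shape Embedding records as Meilisearch documents."""
    documents = []
    for item in embeddings:
        # Deterministic id made only of hex characters, derived from source and text.
        key = f'{item["source"]}\n{item["text"]}'.encode("utf-8")
        documents.append(
            {
                "id": hashlib.sha256(key).hexdigest(),
                "text": item["text"],
                "source": item["source"],
                "package_name": item["package_name"],
                # Assumed layout for user-provided vectors.
                "_vectors": {embedder_name: item["embedding"]},
            }
        )
    return documents


records = [
    {"text": "# A", "source": "hf.co/docs/a", "package_name": "diffusers", "embedding": [0.1, 0.2]},
    {"text": "# B", "source": "hf.co/docs/b", "package_name": "diffusers", "embedding": [0.3, 0.4]},
]
documents = to_meilisearch_documents(records)
print(documents[0]["id"][:8], documents[0]["_vectors"])
```

The resulting list could then be sent with the Meilisearch Python client's `add_documents` call on an index configured with a matching user-provided embedder; that configuration is not shown here and would need checking against the Meilisearch version in use.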

Note: almost all changes in this PR are additions (i.e. it does not break any existing doc-builder functionality). I will merge the PR once I have tested it with Meilisearch.

@mishig25 mishig25 force-pushed the build_embeddings branch 10 times, most recently from 21084f4 to d281834 Compare June 12, 2024 08:00
@mishig25 mishig25 changed the title [wip] Build embeddings Build embeddings Jun 12, 2024
@mishig25 mishig25 marked this pull request as ready for review June 12, 2024 08:15
Comment on lines 353 to 355
if output_warmup["error"] == "503 Service Unavailable":
    print("Waking up Embedding Inference Endpoints. Retrying in 5 minutes")
    time.sleep(300)  # 5 minutes
Member

@julien-c julien-c Jun 12, 2024

would be more satisfying to receive an event when the endpoint is available (HTTP SSE events for instance), but i guess that's painful to do in Python

Contributor Author

@mishig25 mishig25 Jun 12, 2024

does HF endpoints provide events? cc: @philschmid

Contributor

There is a huggingface_hub.InferenceEndpoint client in Python if you want, but it does the same as you do (waits for the model to be available).

Contributor

If instead of an API url you have the name of an inference endpoint, you can do:

from huggingface_hub import get_inference_endpoint

endpoint = get_inference_endpoint(name="...")
client = endpoint.wait().client
client.text_embedding(...)

Contributor

@mishig25 Little update on the inference endpoint. Before the .wait() call, you must call .resume(). If the endpoint has scaled down to zero or has been paused, it will be restarted. So the snippet should look like this:

from huggingface_hub import get_inference_endpoint

endpoint = get_inference_endpoint(name="...")
client = endpoint.resume().wait().client
client.feature_extraction(...)

print(len(embeddings))

# Step 3: push embeddings to vector database
# TODO
Member

ping again when you have this part i'd be interested in seeing it

Contributor

@Wauplin Wauplin left a comment

Very cool project! Looking forward to seeing it live 🎉

I've made a rough review of the current implementation. Feel free to ignore comments if you feel differently (especially if it's a POC for now)

@mishig25 mishig25 force-pushed the build_embeddings branch 3 times, most recently from ccf14ca to 48672d1 Compare June 18, 2024 15:28
@mishig25 mishig25 force-pushed the build_embeddings branch 3 times, most recently from 554aa51 to 2d417a9 Compare June 21, 2024 14:10
@mishig25 mishig25 force-pushed the build_embeddings branch 3 times, most recently from 6a7b825 to ae19f1c Compare June 26, 2024 10:57
3 participants