[docs] Pinecone tutorial #26775
Conversation
From a code standpoint this is great. I'll let @neverett respond more to the prose itself.
I did find myself lacking understanding of core vector DB concepts, so I think we could do better at explaining that along the way.
I'd also like for us to try to use the `CodeExample` component for this tutorial; I can help refactor the component to make that easier.
Overall I really like how this turned out, awesome work!
We will be working with review data from Goodreads. These reviews exist as a collection of JSON files categorized by genre. We will focus on just the files for graphic novels to limit the size of the files we process. Within this domain, the files we will be working with are `goodreads_books_comics_graphic.json.gz` and `goodreads_reviews_comics_graphic.json.gz`. Since the data is normalized across these two files, we will want to combine the information before feeding it into our vector database.

One way to handle preprocessing of the data is with [DuckDB](https://duckdb.org/). DuckDB is an in-process database, similar to SQLite, optimized for analytical workloads. We will start by creating two Dagster assets to load in the data. Each will load one of the files and create a DuckDB table (`graphic_novels` and `reviews`):
Should we use this tutorial as an example of embedding the code with `CodeExample`?
Right now `docs_beta_snippets` is hard-coded, but we could improve the component to take in a prop for the snippet location, and default to `docs_beta_snippets` if not provided.
import(`!!raw-loader!/../../examples/docs_beta_snippets/docs_beta_snippets/${filePath}`)
https://github.com/dagster-io/dagster/blob/master/docs/docs-beta/src/components/CodeExample.tsx#L44
Now that the data has been prepared, we are ready to work with our vector database.

### Vector Database

We can begin by creating the index within our vector database. A vector database is a specialized database designed to store, manage, and retrieve high-dimensional vector embeddings, enabling efficient similarity search and machine learning tasks. There are many different vector databases available. For this demo we will use Pinecone, a cloud-based vector database with a [free tier](https://app.pinecone.io/) that can help us get started.
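To make the "similarity search" idea concrete, here is a small self-contained sketch using toy three-dimensional vectors (real embedding models produce hundreds or thousands of dimensions); the names and numbers are illustrative only:

```python
import math


def cosine_similarity(a, b):
    # Ratio of the dot product to the product of vector magnitudes:
    # 1.0 means same direction (very similar), 0.0 means orthogonal (unrelated).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings": each document is represented as a point in vector space
query = [0.9, 0.1, 0.0]
docs = {
    "superhero comic": [0.8, 0.2, 0.1],
    "cooking blog": [0.1, 0.0, 0.9],
}

# Similarity search = find the stored vector closest to the query vector
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best)  # → superhero comic
```

A vector database does exactly this lookup, but over millions of vectors, using approximate-nearest-neighbor indexes instead of a linear scan.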
At this point in the tutorial I had the question, "Why do we need both the DuckDB and Pinecone databases?"
I understand that they serve different purposes, but we could clarify their usage more upfront.
```python
# assets.py
@dg.asset(
    kinds={"Pinecone"},
    group_name="processing",
)
def pinecone_index(
    pinecone_resource: PineconeResource,
):
    spec = ServerlessSpec(cloud="aws", region="us-east-1")
    ...
```

On `group_name="processing"`: Would "embeddings" be a more accurate group name for the vector DB-related assets?

On `ServerlessSpec`: At this point I was confused by `ServerlessSpec`. A short blurb explaining that this is a Pinecone concept for compute (I think...?) might be helpful.
For our methods, we will want the ability to create an index, retrieve an index so we can upsert records, and embed inputs. You can see all the details in the repo, but this is what the `create_index` method looks like:
This might be a silly suggestion, but can we explain what an index is in the context of vector databases?
```python
# assets.py
@dg.asset(
    kinds={"Pinecone"},
    group_name="processing",
)
def pinecone_embeddings(
    ...
```

On `kinds={"Pinecone"}`: Suggested change: `kinds={"Pinecone"},` → `kinds={"pinecone"},`

I think it would be good to return `MaterializeResult` from these assets to include meaningful metadata.
# Going Forward

This was a relatively simple example, but you can see the amount of coordination needed to power real-world AI applications. As you add more sources of unstructured data that update on different cadences and power multiple downstream tasks, you will want to think through the operational details of building around AI.
We probably don't need to call out that it's simple.
Should we standardize the project naming convention? Modal and Bluesky are named `examples/project_` whereas this is `examples/tutorial_`.
Going to repurpose this as a different use case. Closing in favor of #26795 |
Summary & Motivation
Tutorial and code for uploading embeddings to a vector database (Pinecone). Given the other guides in the works, I stripped out anything AI-related that wasn't strictly Pinecone.
How I Tested These Changes
Changelog