[docs] Pinecone tutorial #26775

Closed

dehume
Contributor

@dehume dehume commented Dec 31, 2024

Summary & Motivation

Tutorial and code for uploading embeddings to a vector database (Pinecone). Given the other guides in the works, I stripped out anything AI-related that wasn't strictly Pinecone.

How I Tested These Changes

Changelog

Insert changelog entry or delete this section.

Contributor

@cmpadden cmpadden left a comment

From a code standpoint this is great. I'll let @neverett respond more to the prose itself.

I did find myself lacking understanding of core vector DB concepts, so I think we could do better at explaining that along the way.

I'd also like us to try using the CodeExample component for this tutorial. I can help refactor the component to make that easier.

Overall I really like how this turned out, awesome work!

We will be working with review data from Goodreads. These reviews exist as a collection of JSON files categorized by different genres. We will focus on just the files for graphic novels to limit the size of the files we will process. Within this domain, the files we will be working with are `goodreads_books_comics_graphic.json.gz` and `goodreads_reviews_comics_graphic.json.gz`. Since the data is normalized across these two files, we will want to combine information before feeding it into our vector database.
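
To make "combine information" concrete, here is a minimal sketch of the join. It uses Python's built-in `sqlite3` as a stand-in so that it runs without extra dependencies; the tutorial itself uses DuckDB, as described next. The table names `graphic_novels` and `reviews` match the tutorial, but the column names (`book_id`, `title`, `review_id`, `review_text`) and the sample rows are assumptions made up for illustration.

```python
import sqlite3

# Made-up sample rows; the real Goodreads files contain many more fields.
books = [("b1", "Example Book A"), ("b2", "Example Book B")]
reviews = [
    ("r1", "b1", "placeholder review text one"),
    ("r2", "b2", "placeholder review text two"),
    ("r3", "b1", "placeholder review text three"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE graphic_novels (book_id TEXT PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE reviews (review_id TEXT, book_id TEXT, review_text TEXT)")
conn.executemany("INSERT INTO graphic_novels VALUES (?, ?)", books)
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?)", reviews)

# Attach each review to the title of the book it refers to, so the text
# that is later embedded carries context from both files.
enriched = conn.execute(
    """
    SELECT r.review_id, g.title, r.review_text
    FROM reviews AS r
    JOIN graphic_novels AS g ON g.book_id = r.book_id
    ORDER BY r.review_id
    """
).fetchall()

for review_id, title, review_text in enriched:
    print(f"{review_id} | {title} | {review_text}")
```

Each resulting row pairs a review with its book title, which is the shape of record one would typically embed and upsert into the vector database.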

One way to handle preprocessing of the data is with [DuckDB](https://duckdb.org/). DuckDB is an in-process database, similar to SQLite, optimized for analytical workloads. We will start by creating two Dagster assets to load in the data. Each will load one of the files and create a DuckDB table (`graphic_novels` and `reviews`):
```python
Contributor

Should we use this tutorial as an example of embedding the code with CodeExample?

Right now docs_beta_snippets is hard coded, but we could improve the component to take in a prop of the snippet location, and default to docs_beta_snippets if not provided.

    import(`!!raw-loader!/../../examples/docs_beta_snippets/docs_beta_snippets/${filePath}`)

https://github.com/dagster-io/dagster/blob/master/docs/docs-beta/src/components/CodeExample.tsx#L44

Now that the data has been prepared, we are ready to work with our vector database.

### Vector Database
We can begin by creating the index within our vector database. A vector database is a specialized database designed to store, manage, and retrieve high-dimensional vector embeddings, enabling efficient similarity search and machine learning tasks. There are many vector databases available. For this demo we will use Pinecone, a cloud-based vector database that offers a [free tier](https://app.pinecone.io/) to help us get started.
Contributor

At this point in the tutorial I had the question, "Why do we need both DuckDB and Pinecone databases?"

I understand that they serve different purposes, but we could maybe clarify more upfront on their usage.

```python
# assets.py
@dg.asset(
    kinds={"Pinecone"},
Contributor

We typically use lowercase for kinds.

Suggested change
- kinds={"Pinecone"},
+ kinds={"pinecone"},

# assets.py
@dg.asset(
    kinds={"Pinecone"},
    group_name="processing",
Contributor

Would "embeddings" be a more accurate group name for the vector db related assets?

def pinecone_index(
    pinecone_resource: PineconeResource,
):
    spec = ServerlessSpec(cloud="aws", region="us-east-1")
Contributor

At this point I was confused by ServerlessSpec. A short blurb explaining that this is a Pinecone concept for compute (I think...?) might be helpful.

...
```

For our methods, we will want the ability to create an index, retrieve an index so we can upsert records, and embed inputs. You can see all the details in the repo, but this is what the `create_index` method looks like:
Contributor

This might be a silly suggestion, but can we explain what an index is in the context of vector databases?
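
As a rough illustration of the concept being asked about (a sketch, not Pinecone's actual implementation): an index can be pictured as a named collection of fixed-dimension vectors keyed by ID that supports upserting records and querying for the nearest neighbors of a given vector. The `ToyIndex` class and its method names below are hypothetical, and it uses a brute-force scan with cosine similarity, whereas production vector databases generally rely on approximate nearest-neighbor structures.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


class ToyIndex:
    """Hypothetical in-memory stand-in for a vector index."""

    def __init__(self, dimension):
        self.dimension = dimension
        self.vectors = {}  # record ID -> vector

    def upsert(self, record_id, vector):
        # Insert a new record, or overwrite the vector stored under an existing ID.
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} dimensions, got {len(vector)}")
        self.vectors[record_id] = list(vector)

    def query(self, vector, top_k=1):
        # Score every stored vector against the query and keep the best top_k.
        scored = [
            (record_id, cosine_similarity(vector, stored))
            for record_id, stored in self.vectors.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


index = ToyIndex(dimension=3)
index.upsert("review-1", [1.0, 0.0, 0.0])
index.upsert("review-2", [0.0, 1.0, 0.0])
index.upsert("review-1", [0.9, 0.1, 0.0])  # same ID: replaces the earlier vector

matches = index.query([1.0, 0.0, 0.0], top_k=1)
print(matches[0][0])
```

The fixed `dimension` is why an index has to be created before any records are upserted: every embedding stored in it must have the same length as the output of the embedding model.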

```python
# assets.py
@dg.asset(
    kinds={"Pinecone"},
Contributor

Suggested change
- kinds={"Pinecone"},
+ kinds={"pinecone"},

    kinds={"Pinecone"},
    group_name="processing",
)
def pinecone_embeddings(
Contributor

I think it would be good to return MaterializeResult from these assets to include meaningful metadata.

```

# Going Forward
This was a relatively simple example, but it shows how much coordination is needed to power real-world AI applications. As you add more sources of unstructured data that update on different cadences and power multiple downstream tasks, you will want to think through the operational details of building around AI.
Contributor

We probably don't need to call out that it's simple.

@@ -0,0 +1,311 @@
---
Contributor

Should we standardize project naming convention?

Modal and Bluesky are named examples/project_ whereas this is examples/tutorial_.

@dehume
Contributor Author

dehume commented Jan 3, 2025

Going to repurpose this as a different use case. Closing in favor of #26795

@dehume dehume closed this Jan 3, 2025