[docs] Pinecone tutorial #26775
Conversation
From a code standpoint this is great. I'll let @neverett respond more to the prose itself.
I did find myself lacking understanding of core vector DB concepts, so I think we could do better at explaining that along the way.
I'd also like for us to try to use the `CodeExample` component for this tutorial; I can help refactor the component to make that easier.
Overall I really like how this turned out, awesome work!
We will be working with review data from Goodreads. These reviews exist as a collection of JSON files categorized by genre. We will focus on just the files for graphic novels to limit the size of the files we process. Within this domain, the files we will be working with are `goodreads_books_comics_graphic.json.gz` and `goodreads_reviews_comics_graphic.json.gz`. Since the data is normalized across these two files, we will want to combine the information before feeding it into our vector database.

One way to handle preprocessing of the data is with [DuckDB](https://duckdb.org/). DuckDB is an in-process database, similar to SQLite, optimized for analytical workloads. We will start by creating two Dagster assets to load in the data. Each will load one of the files and create a DuckDB table (`graphic_novels` and `reviews`):
Should we use this tutorial as an example of embedding the code with `CodeExample`?
Right now `docs_beta_snippets` is hard-coded, but we could improve the component to take in a prop for the snippet location, and default to `docs_beta_snippets` if not provided.
import(`!!raw-loader!/../../examples/docs_beta_snippets/docs_beta_snippets/${filePath}`)
https://github.com/dagster-io/dagster/blob/master/docs/docs-beta/src/components/CodeExample.tsx#L44
Now that the data has been prepared, we are ready to work with our vector database.

### Vector Database

We can begin by creating the index within our vector database. A vector database is a specialized database designed to store, manage, and retrieve high-dimensional vector embeddings, enabling efficient similarity search and machine learning tasks. There are many different vector databases available. For this demo we will use Pinecone, a cloud-based vector database with a [free tier](https://app.pinecone.io/) that can help us get started.
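To make the "similarity search" idea concrete, here is a small self-contained sketch using toy three-dimensional vectors (real embedding models produce hundreds or thousands of dimensions); the names and numbers are illustrative only:

```python
import math


def cosine_similarity(a, b):
    # Ratio of the dot product to the product of vector magnitudes:
    # 1.0 means same direction (very similar), 0.0 means orthogonal (unrelated).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings": each document is represented as a point in vector space
query = [0.9, 0.1, 0.0]
docs = {
    "superhero comic": [0.8, 0.2, 0.1],
    "cooking blog": [0.1, 0.0, 0.9],
}

# Similarity search = find the stored vector closest to the query vector
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best)  # → superhero comic
```

A vector database does exactly this lookup, but over millions of vectors, using approximate-nearest-neighbor indexes instead of a linear scan.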
At this point in the tutorial I had the question, "Why do we need both the DuckDB and Pinecone databases?"
I understand that they serve different purposes, but we could clarify their usage more upfront.
```python
# assets.py
@dg.asset(
    kinds={"Pinecone"},
    group_name="processing",
)
def pinecone_index(
    pinecone_resource: PineconeResource,
):
    spec = ServerlessSpec(cloud="aws", region="us-east-1")
    ...
```

On `group_name="processing"`: Would "embeddings" be a more accurate group name for the vector DB-related assets?

On `ServerlessSpec`: At this point I was confused by `ServerlessSpec`. A short blurb explaining that this is a Pinecone concept for compute (I think...?) might be helpful.
For our methods, we will want the ability to create an index, retrieve an index so we can upsert records, and embed inputs. You can see all the details in the repo, but this is what the `create_index` method looks like:
This might be a silly suggestion, but can we explain what an index is in the context of vector databases?
```python
# assets.py
@dg.asset(
    kinds={"Pinecone"},
    group_name="processing",
)
def pinecone_embeddings(
    ...
```

On `kinds={"Pinecone"}`: Suggested change: `kinds={"Pinecone"},` → `kinds={"pinecone"},`

I think it would be good to return `MaterializeResult` from these assets to include meaningful metadata.
# Going Forward

This was a relatively simple example, but you can see the amount of coordination needed to power real-world AI applications. As you add more sources of unstructured data that update on different cadences and power multiple downstream tasks, you will want to think through the operational details of building around AI.
We probably don't need to call out that it's simple.
Should we standardize the project naming convention? Modal and Bluesky are named `examples/project_` whereas this is `examples/tutorial_`.
Going to repurpose this as a different use case. Closing in favor of #26795 |
Summary & Motivation
Tutorial and code for uploading embeddings to a vector database (Pinecone). Given the other guides in the works, I stripped out anything AI-related that wasn't strictly Pinecone.
How I Tested These Changes
Changelog