Merge pull request #24 from pepkit/add_pepembed

Add in `pepembed` README
pepkit · Feb 7, 2024 · 807a32b · 807a32b
2 parents 269ddba + a5beb40
commit 807a32b
Showing 10 changed files with 1,716 additions and 3 deletions.
diff --git a/.DS_Store b/.DS_Store
diff --git a/.gitignore b/.gitignore
@@ -1 +1,2 @@
 site
+venv/
diff --git a/docs/.DS_Store b/docs/.DS_Store
diff --git a/docs/pephub/.DS_Store b/docs/pephub/.DS_Store
diff --git a/docs/pephub/development.md b/docs/pephub/development.md
@@ -4,7 +4,7 @@
 
 _The following assumes you have already setup a database. If you have not, please see [here](#1-database-setup)._
 
-There are two components to PEPhub: a FastAPI backend, and a React frontend. As such, when developing, you will need to run both the backend and frontend development servers.
+There are two components to PEPhub: a FastAPI backend, and a React frontend. As such, when developing, you will need to run both the backend and frontend development servers. Full API documentation can be found at https://pephub-api.databio.org/api/v1/docs.
 
 ## Backend development
 

diff --git a/docs/pephub/img/architecture.png b/docs/pephub/img/architecture.png
diff --git a/docs/pephub/img/cartoon_sample_modifiers.svg b/docs/pephub/img/cartoon_sample_modifiers.svg
diff --git a/docs/pephub/img/pepembed-arch.svg b/docs/pephub/img/pepembed-arch.svg
diff --git a/docs/pephub/pepembed/README.md b/docs/pephub/pepembed/README.md
@@ -0,0 +1,104 @@
+# pepembed
+
+## Overview
+
+PEPembed is a Python package for computing text embeddings of sample metadata stored in [pephub](https://github.com/pepkit/pephub) for search-and-retrieval tasks. It provides both a CLI and a Python API. It handles the long-running job of downloading projects from pephub, mining any relevant metadata from them, computing a rich text embedding on that data, and finally upserting it into a vector database. We use [qdrant](https://qdrant.tech/) as our vector database for its performance, simplicity, and payload capabilities.
+
+Understand everything? Jump to [running `pepembed`](#install-and-run). Or view the quick start below.
+
+## Quick Start
+
+```console
+pip install .
+```
+
+```console
+pepembed \
+  --postgres-host $POSTGRES_HOST \
+  --postgres-user $POSTGRES_USER \
+  --postgres-password $POSTGRES_PASSWORD \
+  --postgres-db $POSTGRES_DB
+```
+
+## Architecture
+
+<p align="center">
+  <img src="../img/pepembed-arch.svg" alt="pepembed architecture" width="800px" />
+</p>
+
+`pepembed` works in three steps: 1) download PEPs from pephub, 2) extract metadata from these PEPs and embed it using a [sentence transformer](https://www.sbert.net/), and 3) insert the resulting embeddings into a [qdrant](https://qdrant.tech/) instance.
+
+**1. Download PEPs:**  
+`pepembed` downloads all PEPs from pephub. This is the most time-consuming step. Currently there is no way to parametrize the download, though that should be added in the future, along with support for generating embeddings directly from files on disk.
+
+**2. Extract Metadata from PEPs and compute embeddings:**  
+Once the PEPs are downloaded, we then extract any relevant metadata from them. This is done by looking for **keywords** in the [**project-level** attributes](https://pep.databio.org/en/latest/specification/#project-attribute-sample_modifiers). For each PEP, a pseudo-description is built by looking for these keywords and building a string. Some example keyword attributes might be: `cell_type`, `protocol`, `procedure`, `institution`, etc. You can specify your own keywords to `pepembed` if you wish.
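
The keyword-mining step described above can be sketched in plain Python. This is a minimal illustration, not `pepembed`'s actual implementation: the function name `build_pseudo_description`, the `key: value` output format, and the example project dictionary are all assumptions made for the example.

```python
from typing import Iterable, Mapping


def build_pseudo_description(
    project_attrs: Mapping[str, object],
    keywords: Iterable[str],
) -> str:
    """Join the values of any keyword attributes present into one string.

    Hypothetical helper: pepembed's real mining logic and output format may differ.
    """
    parts = []
    for key in keywords:
        value = project_attrs.get(key)
        # Skip keywords that are absent or empty in this project.
        if value is None or value == "":
            continue
        parts.append(f"{key}: {value}")
    return "; ".join(parts)


# Made-up project-level attributes, for illustration only.
example_project = {
    "cell_type": "K562",
    "protocol": "ATAC-seq",
    "pep_version": "2.1.0",  # not a keyword, so it is ignored
}
example_keywords = ["cell_type", "protocol", "procedure", "institution"]

description = build_pseudo_description(example_project, example_keywords)
print(description)
```

Attributes that are not in the keyword list are ignored, and keywords missing from a project are skipped, so a project with no matching attributes yields an empty string.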
+
+<p align="center">
+  <img
+    alt="Sample modifiers in a configuration file" 
+    src="../img/cartoon_sample_modifiers.svg"
+    width="400px"
+  />
+</p>
+
+Once the pseudo-descriptions are mined, we use a `sentence-transformer` to generate low-dimensional representations of these descriptions. By default, we use a [state-of-the-art transformer](https://arxiv.org/abs/1908.10084) trained for the semantic textual similarity task (*Reimers & Gurevych, 2019*). The embeddings are linked back to the original PEP registry path, along with other information like the mined pseudo-description and the row id in the database.
+
+**3. Insert Embeddings:**  
+Finally, we insert the embeddings into a [qdrant](https://qdrant.tech/) instance. qdrant is a **vector database** designed to store embeddings as first-class data types and to support native graph-based indexing of them. This allows for near-instant search and retrieval of an embedding's nearest neighbors given a new embedding (say, an encoded search query from a web application). qdrant supports attaching a [**payload**](https://qdrant.tech/documentation/payload/) to each embedding, where we store basic information about that PEP, such as its registry path, row id, and description.
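
To make the retrieval idea concrete, here is a brute-force sketch of nearest-neighbor search over embeddings that carry a payload. It deliberately does not use the qdrant client: qdrant answers such queries with an index rather than a full scan, and the three-dimensional toy vectors, the payload keys (`registry`, `rowid`, `description`), and the registry paths below are all made up for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    points: List[Dict], query: Sequence[float], limit: int = 1
) -> List[Tuple[float, Dict]]:
    """Score every stored point against the query and return the best matches."""
    scored = [(cosine_similarity(p["vector"], query), p["payload"]) for p in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


# Toy "collection": each point pairs an embedding with a payload.
points = [
    {
        "vector": [0.9, 0.1, 0.0],
        "payload": {
            "registry": "demo/atac:default",
            "rowid": 1,
            "description": "cell_type: K562; protocol: ATAC-seq",
        },
    },
    {
        "vector": [0.0, 0.2, 0.9],
        "payload": {
            "registry": "demo/rnaseq:default",
            "rowid": 2,
            "description": "protocol: RNA-seq",
        },
    },
]

# Stand-in for an encoded search query.
query_vector = [1.0, 0.0, 0.1]
top_hits = search(points, query_vector, limit=1)
print(top_hits[0][1]["registry"])
```

Because the payload travels with each vector, a hit can be mapped straight back to its PEP without a second lookup.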
+
+## Install and Run
+
+While simple to install and run, `pepembed` requires a fair amount of configuration to function. There are three key pieces: 1) the pephub instance, 2) the qdrant instance, and 3) the keywords. Ensure the following are in place before running the CLI:
+
+### Setup
+
+**1. PEPhub instance:**  
+Make sure you have access to a running pephub instance populated with PEPs. Once that is in place, you can use the following environment variables to tell `pepembed` where to get data. Alternatively, you can pass these as command-line args:  
+* `POSTGRES_HOST`
+* `POSTGRES_DB`
+* `POSTGRES_USER`
+* `POSTGRES_PASSWORD`
+
+**2. Qdrant instance:**  
+In addition to a pephub instance, you will need a running instance of qdrant. Setup is quite simple, and instructions can be found [here](https://qdrant.tech/documentation/quick_start/). The TL;DR is:  
+
+```console
+docker pull qdrant/qdrant
+docker run -p 6333:6333 \
+    -v $(pwd)/qdrant_storage:/qdrant/storage \
+    qdrant/qdrant
+```
+
+This will give you a qdrant instance served at http://localhost:6333. You can pass this information to `pepembed` as environment variables. Alternatively, you may pass these as command-line args:  
+* `QDRANT_HOST`
+* `QDRANT_PORT`
+* `QDRANT_API_KEY`
+* `QDRANT_COLLECTION_NAME`
+
+*Unless you are running this for production, you most likely do not need to specify any of these.*
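
As a rough sketch of how the environment variables listed above could be read with fallbacks, consider the following. It is not `pepembed`'s own configuration code: the `QdrantSettings` class and `load_qdrant_settings` function are hypothetical, the `localhost` and `6333` defaults simply mirror the Docker command shown earlier, and the default collection name `pephub` is a placeholder assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class QdrantSettings:
    host: str
    port: int
    api_key: Optional[str]
    collection_name: str


def load_qdrant_settings(env: Mapping[str, str] = os.environ) -> QdrantSettings:
    """Read qdrant connection settings, falling back to local-development defaults.

    Hypothetical helper; the defaults are assumptions, not pepembed's documented ones.
    """
    return QdrantSettings(
        host=env.get("QDRANT_HOST", "localhost"),
        port=int(env.get("QDRANT_PORT", "6333")),
        api_key=env.get("QDRANT_API_KEY"),  # None when no key is set
        collection_name=env.get("QDRANT_COLLECTION_NAME", "pephub"),
    )


# With nothing set, the local Docker instance is assumed.
local_settings = load_qdrant_settings({})
print(local_settings.host, local_settings.port)
```

Passing the environment in as a mapping keeps the function easy to exercise without touching the real process environment.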
+
+**3. Keywords:**  
+Finally, we need a keywords file. This is technically optional, since `pepembed` comes with [default keywords](pepembed/const.py), but you may supply your own as a plain text file. This can only be supplied as a command-line arg:
+* `--keywords-file`
+
+There are many other options as well (like specifying the transformer model to use), but the defaults work great for a first try. Use `pepembed --help` to see all options. If you like to keep your secrets in a `.env` file, you can export them to the environment with `export $(cat .env | xargs)`.
+
+### Install
+
+Clone this repository and install with `pip`:
+
+```console
+pip install .
+```
+
+### Run
+
+```console
+pepembed \
+  --keywords-file keywords.txt \
+  --postgres-host $POSTGRES_HOST \
+  --postgres-user $POSTGRES_USER \
+  --postgres-password $POSTGRES_PASSWORD \
+  --postgres-db $POSTGRES_DB
+```
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -184,6 +184,8 @@ nav:
       - Deployment: pephub/deployment.md
       - Development: pephub/development.md
       - Server settings: pephub/server-settings.md
+      - PEPembed:
+        - PEPembed: pephub/pepembed/README.md
       - pepdbagent:
         - pepdbagent: pephub/pepdbagent/README.md
         - Database tutorial: pephub/pepdbagent/db_tutorial.md
@@ -192,8 +194,6 @@ nav:
       - Schema registry: https://schema.databio.org
       - How to cite: citations.md
       - Changelog: pephub/changelog.md
-
-
   - Peppy:
     - Peppy: peppy/README.md
     - Getting started: