Merge pull request #117 from cyberflying/main

This PR adds PGVector as a vector store option and extends deployment to both Azure Global and Azure China for PGVector and Redis Stack.
ruoccofabrizio · Dec 18, 2023 · 96ce23a
2 parents 533dd6e + f4ab11a
Showing 13 changed files with 2,417 additions and 23 deletions.
diff --git a/.env.template b/.env.template
@@ -7,6 +7,7 @@ OPENAI_API_BASE=https://YOUR_AZURE_OPENAI_RESOURCE.openai.azure.com/
 OPENAI_API_KEY=YOUR_AZURE_OPENAI_API_KEY
 OPENAI_TEMPERATURE=0.7
 OPENAI_MAX_TOKENS=-1
+AZURE_CLOUD=AzureCloud # AzureCloud or AzureChinaCloud
 VECTOR_STORE_TYPE=AzureSearch
 AZURE_SEARCH_SERVICE_NAME=YOUR_AZURE_SEARCH_SERVICE_NAME
 AZURE_SEARCH_ADMIN_KEY=YOUR_AZURE_SEARCH_ADMIN_KEY

diff --git a/.gitignore b/.gitignore
@@ -2,4 +2,8 @@
 code/embeddings_text.csv
 code/utilities/__pycache__
 .env
-__pycache__
+__pycache__
+.vscode
+WebApp.Dockerfile
+BatchProcess.Dockerfile
+.gitignore
diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -0,0 +1,8 @@
+{
+    "azureFunctions.deploySubpath": "code",
+    "azureFunctions.scmDoBuildDuringDeployment": true,
+    "azureFunctions.pythonVenv": ".venv",
+    "azureFunctions.projectLanguage": "Python",
+    "azureFunctions.projectRuntime": "~4",
+    "debug.internalConsoleOptions": "neverOpen"
+}
diff --git a/README.md b/README.md
@@ -34,7 +34,8 @@ If you want to use a Chat based deployment (gpt-35-turbo or gpt-4-32k or gpt-4),
 You have multiple options to run the code:
 -   [Deploy on Azure (WebApp + Batch Processing) with Azure Cognitive Search](#deploy-on-azure-webapp--batch-processing-with-azure-cognitive-search)
 -   [Deploy on Azure (WebApp + Azure Cache for Redis + Batch Processing)](#deploy-on-azure-webapp--azure-cache-for-redis-enterprise--batch-processing)
--   [Deploy on Azure (WebApp + Redis Stack + Batch Processing)](#deploy-on-azure-webapp--redis-stack--batch-processing)
+-   [Deploy on Azure/Azure China (WebApp + Redis Stack + Batch Processing)](#deploy-on-azureazure-china-webapp--redis-stack--batch-processing)
+-   [Deploy on Azure/Azure China (WebApp + Azure PostgreSQL + Batch Processing)](#deploy-on-azureazure-china-webapp--azure-postgresql--batch-processing)
 -   [Run everything locally in Docker (WebApp + Redis Stack + Batch Processing)](#run-everything-locally-in-docker-webapp--redis-stack--batch-processing)
 -   [Run everything locally in Python with Conda (WebApp only)](#run-everything-locally-in-python-with-conda-webapp-only)
 -   [Run everything locally in Python with venv](#run-everything-locally-in-python-with-venv)
@@ -71,8 +72,9 @@ Please be aware that you still need:
 
 You will add the endpoint and access key information for these resources when deploying the template. 
 
-## Deploy on Azure (WebApp + Redis Stack + Batch Processing)
-[![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fruoccofabrizio%2Fazure-open-ai-embeddings-qna%2Fmain%2Finfrastructure%2Fdeployment.json)
+## Deploy on Azure/Azure China (WebApp + Redis Stack + Batch Processing)
+[![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fruoccofabrizio%2Fazure-open-ai-embeddings-qna%2Fmain%2Finfrastructure%2Fdeployment.json) 
+[![Deploy to Azure](https://aka.ms/deploytoazurechinabutton)](https://portal.azure.cn/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fcyberflying%2Fazure-open-ai-embeddings-qna%2Fmain%2Finfrastructure%2Fdeployment_azcn.json)
 
 Click on the Deploy to Azure button and configure your settings in the Azure Portal as described in the [Environment variables](#environment-variables) section.
 
@@ -83,6 +85,15 @@ Please be aware that you need:
 -   an existing Form Recognizer Resource (OPTIONAL - if you want to extract text out of documents)
 -   an existing Translator Resource (OPTIONAL - if you want to translate documents)
 
+## Deploy on Azure/Azure China (WebApp + Azure PostgreSQL + Batch Processing)
+[![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fcyberflying%2Fazure-open-ai-embeddings-qna%2Fmain%2Finfrastructure%2Fdeployment_pg.json)
+[![Deploy to Azure](https://aka.ms/deploytoazurechinabutton)](https://portal.azure.cn/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fcyberflying%2Fazure-open-ai-embeddings-qna%2Fmain%2Finfrastructure%2Fdeployment_pg_azcn.json)
+
+Click on the Deploy to Azure button and configure your settings in the Azure Portal as described in the [Environment variables](#environment-variables) section.
+
+![Architecture](docs/architecture_pg.png)
+
+
 ## Run everything locally in Docker (WebApp + Redis Stack + Batch Processing)
 
 First, clone the repo:
@@ -268,11 +279,17 @@ Here is the explanation of the parameters:
 |OPENAI_EMBEDDINGS_ENGINE_QUERY | text-embedding-ada-002  | Embedding engine for query deployed in your Azure OpenAI resource|
 |OPENAI_API_BASE | https://YOUR_AZURE_OPENAI_RESOURCE.openai.azure.com/ | Your Azure OpenAI Resource name. Get it in the [Azure Portal](https://portal.azure.com)|
 |OPENAI_API_KEY| YOUR_AZURE_OPENAI_KEY | Your Azure OpenAI API Key. Get it in the [Azure Portal](https://portal.azure.com)|
-|OPENAI_TEMPERATURE|0.7| Azure OpenAI Temperature |
+|OPENAI_TEMPERATURE|0.1| Azure OpenAI Temperature |
 |OPENAI_MAX_TOKENS|-1| Azure OpenAI Max Tokens |
-|VECTOR_STORE_TYPE| AzureSearch | Vector Store Type. Use AzureSearch for Azure Cognitive Search, leave it blank for Redis or Azure Cache for Redis Enterprise|
+|AZURE_CLOUD|AzureCloud| Azure Cloud to use. AzureCloud for Azure Global, AzureChinaCloud for Azure China |
+|VECTOR_STORE_TYPE| PGVector | Vector Store Type. Use AzureSearch for Azure Cognitive Search, PGVector for Azure PostgreSQL, leave it blank for Redis or Azure Cache for Redis Enterprise|
 |AZURE_SEARCH_SERVICE_NAME| YOUR_AZURE_SEARCH_SERVICE_URL | Your Azure Cognitive Search service name. Get it in the [Azure Portal](https://portal.azure.com)|
 |AZURE_SEARCH_ADMIN_KEY| AZURE_SEARCH_ADMIN_KEY | Your Azure Cognitive Search Admin key. Get it in the [Azure Portal](https://portal.azure.com)|
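+|PGVECTOR_HOST|Your_PG_NAME.postgres.database.azure.com or Your_PG_NAME.postgres.database.chinacloudapi.cn| Host name of your Azure PostgreSQL server (Azure Global or Azure China)|
+|PGVECTOR_PORT|5432| Port for Azure PostgreSQL|
+|PGVECTOR_DATABASE|YOUR_PG_DATABASE| Name of your Azure PostgreSQL database|
+|PGVECTOR_USER|YOUR_PG_USER| User name for your Azure PostgreSQL server|
+|PGVECTOR_PASSWORD|YOUR_PG_PASSWORD| Password for your Azure PostgreSQL user|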
 |REDIS_ADDRESS| api | URL for Redis Stack: "api" for docker compose|
 |REDIS_PORT | 6379 | Port for Redis |
 |REDIS_PASSWORD| redis-stack-password | OPTIONAL - Password for your Redis Stack|

diff --git a/code/OpenAI_Queries.py b/code/OpenAI_Queries.py
@@ -47,19 +47,10 @@ def check_deployment():
             Then restart your application.  
             """)
         st.error(traceback.format_exc())
-    #\ 4. Check if the Redis is working with previous version of data
+    #\ 4. Check if the VectorStore is working with previous version of data
     try:
         llm_helper = LLMHelper()
-        if llm_helper.vector_store_type != "AzureSearch":
-            if llm_helper.vector_store.check_existing_index("embeddings-index"):
-                st.warning("""Seems like you're using a Redis with an old data structure.  
-                If you want to use the new data structure, you can start using the app and go to "Add Document" -> "Add documents in Batch" and click on "Convert all files and add embeddings" to reprocess your documents.  
-                To remove this working, please delete the index "embeddings-index" from your Redis.  
-                If you prefer to use the old data structure, please change your Web App container image to point to the docker image: fruocco/oai-embeddings:2023-03-27_25. 
-                """)
-            else:
-                st.success("Redis is working!")
-        else:
+        if llm_helper.vector_store_type == "AzureSearch":
             try:
                 llm_helper.vector_store.index_exists()
                 st.success("Azure Cognitive Search is working!")
@@ -69,6 +60,26 @@ def check_deployment():
                     Then restart your application.  
                     """)
                 st.error(traceback.format_exc())
+        elif llm_helper.vector_store_type == "PGVector":
+            try:
+                llm_helper.vector_store.__post_init__()
+                st.success("PGVector is working!")
+            except Exception as e:
+                st.error("""PGVector is not working.  
+                    Please check your Azure PostgreSQL server, database, user name and password in the App Settings.
+                    Make sure the network settings (firewall rules) allow your app to access the Azure PostgreSQL service.
+                    Then restart your application.  
+                    """)
+                st.error(traceback.format_exc())
+        else:
+            if llm_helper.vector_store.check_existing_index("embeddings-index"):
+                st.warning("""Seems like you're using a Redis with an old data structure.  
+                If you want to use the new data structure, you can start using the app and go to "Add Document" -> "Add documents in Batch" and click on "Convert all files and add embeddings" to reprocess your documents.  
+                To remove this warning, please delete the index "embeddings-index" from your Redis.  
+                If you prefer to use the old data structure, please change your Web App container image to point to the docker image: fruocco/oai-embeddings:2023-03-27_25. 
+                """)
+            else:
+                st.success("Redis is working!")
     except Exception as e:
         st.error(f"""Redis is not working. 
             Please check your Redis connection string in the App Settings.  

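Review note: the outer `except` shown in the trailing context still reports "Redis is not working" whichever backend is configured. A minimal sketch of a backend-aware check follows, assuming each backend exposes a zero-argument callable that raises on failure. `run_store_check` and the `checks` mapping are hypothetical names for illustration and are not part of this PR.

```python
from typing import Callable, Dict, Optional, Tuple


def run_store_check(store_type: Optional[str],
                    checks: Dict[str, Callable[[], object]]) -> Tuple[bool, str]:
    """Run the health check for the configured backend.

    Any value other than "AzureSearch" or "PGVector" is treated as Redis,
    which mirrors the if/elif/else order in check_deployment().
    """
    label = store_type if store_type in ("AzureSearch", "PGVector") else "Redis"
    try:
        checks[label]()  # hypothetical callable; raises if the backend is unreachable
        return True, f"{label} is working!"
    except Exception as exc:
        return False, f"{label} is not working: {exc}"
```

In the Streamlit page, the returned message could be passed to `st.success` or `st.error` so the error text always names the backend that actually failed.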
diff --git a/code/requirements.txt b/code/requirements.txt
@@ -19,5 +19,7 @@ beautifulsoup4==4.12.0
 streamlit-chat==0.0.2.2
 fake-useragent==1.1.3
 chardet==5.1.0
+pgvector==0.2.4
+psycopg2-binary==2.9.9
 --extra-index-url https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/
 azure-search-documents==11.4.0a20230509004
diff --git a/code/utilities/azureblobstorage.py b/code/utilities/azureblobstorage.py
@@ -8,9 +8,11 @@ def __init__(self, account_name: str = None, account_key: str = None, container_
 
         load_dotenv()
 
+        self.azure_cloud : str = os.getenv('AZURE_CLOUD', 'AzureCloud')
+        self.blob_endpoint_suffix : str = 'core.chinacloudapi.cn' if self.azure_cloud == 'AzureChinaCloud' else 'core.windows.net'
         self.account_name : str = account_name if account_name else os.getenv('BLOB_ACCOUNT_NAME')
         self.account_key : str = account_key if account_key else os.getenv('BLOB_ACCOUNT_KEY')
-        self.connect_str : str = f"DefaultEndpointsProtocol=https;AccountName={self.account_name};AccountKey={self.account_key};EndpointSuffix=core.windows.net"
+        self.connect_str : str = f"DefaultEndpointsProtocol=https;AccountName={self.account_name};AccountKey={self.account_key};EndpointSuffix={self.blob_endpoint_suffix}"
         self.container_name : str = container_name if container_name else os.getenv('BLOB_CONTAINER_NAME')
         self.blob_service_client : BlobServiceClient = BlobServiceClient.from_connection_string(self.connect_str)
 
@@ -40,12 +42,12 @@ def get_all_files(self):
                     "filename" : blob.name,
                     "converted": blob.metadata.get('converted', 'false') == 'true' if blob.metadata else False,
                     "embeddings_added": blob.metadata.get('embeddings_added', 'false') == 'true' if blob.metadata else False,
-                    "fullpath": f"https://{self.account_name}.blob.core.windows.net/{self.container_name}/{blob.name}?{sas}",
+                    "fullpath": f"https://{self.account_name}.blob.{self.blob_endpoint_suffix}/{self.container_name}/{blob.name}?{sas}",
                     "converted_filename": blob.metadata.get('converted_filename', '') if blob.metadata else '',
                     "converted_path": ""
                     })
             else:
-                converted_files[blob.name] = f"https://{self.account_name}.blob.core.windows.net/{self.container_name}/{blob.name}?{sas}"
+                converted_files[blob.name] = f"https://{self.account_name}.blob.{self.blob_endpoint_suffix}/{self.container_name}/{blob.name}?{sas}"
 
         for file in files:
             converted_filename = file.pop('converted_filename', '')
@@ -70,4 +72,4 @@ def get_container_sas(self):
 
     def get_blob_sas(self, file_name):
         # Generate a SAS URL to the blob and return it
-        return f"https://{self.account_name}.blob.core.windows.net/{self.container_name}/{file_name}" + "?" + generate_blob_sas(account_name= self.account_name, container_name=self.container_name, blob_name= file_name, account_key= self.account_key, permission='r', expiry=datetime.utcnow() + timedelta(hours=1))
+        return f"https://{self.account_name}.blob.{self.blob_endpoint_suffix}/{self.container_name}/{file_name}" + "?" + generate_blob_sas(account_name= self.account_name, container_name=self.container_name, blob_name= file_name, account_key= self.account_key, permission='r', expiry=datetime.utcnow() + timedelta(hours=1))
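Review note: the same `https://{account}.blob.{suffix}/...` pattern is now repeated in three places in `azureblobstorage.py`. A minimal sketch of how the cloud-to-suffix choice could be centralised is below. The two suffix values come from the diff above; the helper names `blob_endpoint_suffix` and `blob_url` are hypothetical.

```python
import os
from typing import Optional

# Suffix values as used in the diff; unknown clouds fall back to Azure Global,
# matching the else branch of the conditional expression in __init__.
BLOB_ENDPOINT_SUFFIXES = {
    "AzureCloud": "core.windows.net",
    "AzureChinaCloud": "core.chinacloudapi.cn",
}


def blob_endpoint_suffix(azure_cloud: Optional[str] = None) -> str:
    """Return the storage endpoint suffix for the given (or configured) cloud."""
    cloud = azure_cloud or os.getenv("AZURE_CLOUD", "AzureCloud")
    return BLOB_ENDPOINT_SUFFIXES.get(cloud, "core.windows.net")


def blob_url(account_name: str, container_name: str, blob_name: str,
             azure_cloud: Optional[str] = None) -> str:
    """Build the blob URL without the SAS query string."""
    suffix = blob_endpoint_suffix(azure_cloud)
    return f"https://{account_name}.blob.{suffix}/{container_name}/{blob_name}"
```

With a helper like this, `get_all_files` and `get_blob_sas` would only need to append `"?" + sas` to the returned URL.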
diff --git a/code/utilities/helper.py b/code/utilities/helper.py
@@ -28,6 +28,7 @@
 from utilities.customprompt import PROMPT
 from utilities.redis import RedisExtended
 from utilities.azuresearch import AzureSearch
+from utilities.pgvector import PGVectorExtended
 
 import pandas as pd
 import urllib
@@ -69,10 +70,24 @@ def __init__(self,
         self.vector_store_type = os.getenv("VECTOR_STORE_TYPE")
 
         # Azure Search settings
-        if  self.vector_store_type == "AzureSearch":
+        if self.vector_store_type == "AzureSearch":
             self.vector_store_address: str = os.getenv('AZURE_SEARCH_SERVICE_NAME')
             self.vector_store_password: str = os.getenv('AZURE_SEARCH_ADMIN_KEY')
 
+        # PGVector settings
+        elif self.vector_store_type == "PGVector":
+            self.vector_store_driver: str = os.getenv('PGVECTOR_DRIVER', "psycopg2")
+            self.vector_store_address: str = os.getenv('PGVECTOR_HOST', "localhost")
+            self.vector_store_port: int = int(os.getenv('PGVECTOR_PORT', 5432))
+            self.vector_store_database: str = os.getenv("PGVECTOR_DATABASE", "postgres")
+            self.vector_store_username: str = os.getenv("PGVECTOR_USER", "postgres")
+            self.vector_store_password: str = os.getenv("PGVECTOR_PASSWORD", "postgres")
+
+            if self.vector_store_password:
+                self.vector_store_full_address = f"postgresql+{self.vector_store_driver}://{self.vector_store_username}:{self.vector_store_password}@{self.vector_store_address}:{self.vector_store_port}/{self.vector_store_database}"
+            else:
+                self.vector_store_full_address = f"postgresql+{self.vector_store_driver}://{self.vector_store_username}@{self.vector_store_address}:{self.vector_store_port}/{self.vector_store_database}"
+
         else:
             # Vector store settings
             self.vector_store_address: str = os.getenv('REDIS_ADDRESS', "localhost")
@@ -94,8 +109,11 @@ def __init__(self,
             self.llm: ChatOpenAI = ChatOpenAI(model_name=self.deployment_name, engine=self.deployment_name, temperature=self.temperature, max_tokens=self.max_tokens if self.max_tokens != -1 else None) if llm is None else llm
         else:
             self.llm: AzureOpenAI = AzureOpenAI(deployment_name=self.deployment_name, temperature=self.temperature, max_tokens=self.max_tokens) if llm is None else llm
+
         if self.vector_store_type == "AzureSearch":
             self.vector_store: VectorStore = AzureSearch(azure_cognitive_search_name=self.vector_store_address, azure_cognitive_search_key=self.vector_store_password, index_name=self.index_name, embedding_function=self.embeddings.embed_query) if vector_store is None else vector_store
+        elif self.vector_store_type == "PGVector":
+            self.vector_store: PGVectorExtended = PGVectorExtended(connection_string=self.vector_store_full_address, embedding_function=self.embeddings, collection_name="qnacollection", pre_delete_collection=False) if vector_store is None else vector_store
         else:
             self.vector_store: RedisExtended = RedisExtended(redis_url=self.vector_store_full_address, index_name=self.index_name, embedding_function=self.embeddings.embed_query) if vector_store is None else vector_store   
         self.k : int = 3 if k is None else k
@@ -138,8 +156,10 @@ def add_embeddings_lc(self, source_url):
                 hash_key = f"doc:{self.index_name}:{hash_key}"
                 keys.append(hash_key)
                 doc.metadata = {"source": f"[{source_url}]({source_url}_SAS_TOKEN_PLACEHOLDER_)" , "chunk": i, "key": hash_key, "filename": filename}
-            if self.vector_store_type == 'AzureSearch':
+            if self.vector_store_type == "AzureSearch":
                 self.vector_store.add_documents(documents=docs, keys=keys)
+            elif self.vector_store_type == "PGVector":
+                self.vector_store.add_documents(documents=docs, keys=keys, ids=keys)
             else:
                 self.vector_store.add_documents(documents=docs, redis_url=self.vector_store_full_address,  index_name=self.index_name, keys=keys)
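Review note: the PGVector branch in `helper.py` interpolates `PGVECTOR_USER` and `PGVECTOR_PASSWORD` into the connection URL as-is, so a password containing characters such as `@`, `/` or `:` would likely produce a URL that does not parse as intended. A minimal sketch that percent-encodes the credentials first is below, assuming the same URL layout as the diff. `build_pg_url` is a hypothetical helper, not part of this PR.

```python
from urllib.parse import quote


def build_pg_url(user: str, password: str, host: str, port: int = 5432,
                 database: str = "postgres", driver: str = "psycopg2") -> str:
    """Build a postgresql+<driver> URL with percent-encoded credentials."""
    # safe="" encodes every reserved character, including "/" and "@".
    credentials = quote(user, safe="")
    if password:
        credentials += ":" + quote(password, safe="")
    return f"postgresql+{driver}://{credentials}@{host}:{port}/{database}"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_pg_url("pguser", "p@ss/word", "localhost", 5432, "qna"))
```

The empty-password case keeps the shape of the `else` branch in the diff, where the `:password` part is omitted.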