Merge pull request #117 from cyberflying/main
This PR adds PGVector as a vector store option and extends deployment support to both Azure Global and Azure China for PGVector and Redis Stack.
ruoccofabrizio authored Dec 18, 2023
2 parents 533dd6e + f4ab11a commit 96ce23a
Showing 13 changed files with 2,417 additions and 23 deletions.
1 change: 1 addition & 0 deletions .env.template
@@ -7,6 +7,7 @@ OPENAI_API_BASE=https://YOUR_AZURE_OPENAI_RESOURCE.openai.azure.com/
OPENAI_API_KEY=YOUR_AZURE_OPENAI_API_KEY
OPENAI_TEMPERATURE=0.7
OPENAI_MAX_TOKENS=-1
AZURE_CLOUD=AzureCloud # AzureCloud or AzureChinaCloud
VECTOR_STORE_TYPE=AzureSearch
AZURE_SEARCH_SERVICE_NAME=YOUR_AZURE_SEARCH_SERVICE_NAME
AZURE_SEARCH_ADMIN_KEY=YOUR_AZURE_SEARCH_ADMIN_KEY
6 changes: 5 additions & 1 deletion .gitignore
@@ -2,4 +2,8 @@
code/embeddings_text.csv
code/utilities/__pycache__
.env
__pycache__
__pycache__
.vscode
WebApp.Dockerfile
BatchProcess.Dockerfile
.gitignore
8 changes: 8 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,8 @@
{
"azureFunctions.deploySubpath": "code",
"azureFunctions.scmDoBuildDuringDeployment": true,
"azureFunctions.pythonVenv": ".venv",
"azureFunctions.projectLanguage": "Python",
"azureFunctions.projectRuntime": "~4",
"debug.internalConsoleOptions": "neverOpen"
}
27 changes: 22 additions & 5 deletions README.md
@@ -34,7 +34,8 @@ If you want to use a Chat based deployment (gpt-35-turbo or gpt-4-32k or gpt-4),
You have multiple options to run the code:
- [Deploy on Azure (WebApp + Batch Processing) with Azure Cognitive Search](#deploy-on-azure-webapp--batch-processing-with-azure-cognitive-search)
- [Deploy on Azure (WebApp + Azure Cache for Redis + Batch Processing)](#deploy-on-azure-webapp--azure-cache-for-redis-enterprise--batch-processing)
- [Deploy on Azure (WebApp + Redis Stack + Batch Processing)](#deploy-on-azure-webapp--redis-stack--batch-processing)
- [Deploy on Azure/Azure China (WebApp + Redis Stack + Batch Processing)](#deploy-on-azureazure-china-webapp--redis-stack--batch-processing)
- [Deploy on Azure/Azure China (WebApp + Azure PostgreSQL + Batch Processing)](#deploy-on-azureazure-china-webapp--azure-postgresql--batch-processing)
- [Run everything locally in Docker (WebApp + Redis Stack + Batch Processing)](#run-everything-locally-in-docker-webapp--redis-stack--batch-processing)
- [Run everything locally in Python with Conda (WebApp only)](#run-everything-locally-in-python-with-conda-webapp-only)
- [Run everything locally in Python with venv](#run-everything-locally-in-python-with-venv)
@@ -71,8 +72,9 @@ Please be aware that you still need:

You will add the endpoint and access key information for these resources when deploying the template.

## Deploy on Azure (WebApp + Redis Stack + Batch Processing)
[![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fruoccofabrizio%2Fazure-open-ai-embeddings-qna%2Fmain%2Finfrastructure%2Fdeployment.json)
## Deploy on Azure/Azure China (WebApp + Redis Stack + Batch Processing)
[![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fruoccofabrizio%2Fazure-open-ai-embeddings-qna%2Fmain%2Finfrastructure%2Fdeployment.json)
[![Deploy to Azure](https://aka.ms/deploytoazurechinabutton)](https://portal.azure.cn/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fcyberflying%2Fazure-open-ai-embeddings-qna%2Fmain%2Finfrastructure%2Fdeployment_azcn.json)

Click on the Deploy to Azure button and configure your settings in the Azure Portal as described in the [Environment variables](#environment-variables) section.

@@ -83,6 +85,15 @@ Please be aware that you need:
- an existing Form Recognizer Resource (OPTIONAL - if you want to extract text out of documents)
- an existing Translator Resource (OPTIONAL - if you want to translate documents)

## Deploy on Azure/Azure China (WebApp + Azure PostgreSQL + Batch Processing)
[![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fcyberflying%2Fazure-open-ai-embeddings-qna%2Fmain%2Finfrastructure%2Fdeployment_pg.json)
[![Deploy to Azure](https://aka.ms/deploytoazurechinabutton)](https://portal.azure.cn/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fcyberflying%2Fazure-open-ai-embeddings-qna%2Fmain%2Finfrastructure%2Fdeployment_pg_azcn.json)

Click on the Deploy to Azure button and configure your settings in the Azure Portal as described in the [Environment variables](#environment-variables) section.

![Architecture](docs/architecture_pg.png)


## Run everything locally in Docker (WebApp + Redis Stack + Batch Processing)

First, clone the repo:
@@ -268,11 +279,17 @@ Here is the explanation of the parameters:
|OPENAI_EMBEDDINGS_ENGINE_QUERY | text-embedding-ada-002 | Embedding engine for query deployed in your Azure OpenAI resource|
|OPENAI_API_BASE | https://YOUR_AZURE_OPENAI_RESOURCE.openai.azure.com/ | Your Azure OpenAI Resource name. Get it in the [Azure Portal](https://portal.azure.com)|
|OPENAI_API_KEY| YOUR_AZURE_OPENAI_KEY | Your Azure OpenAI API Key. Get it in the [Azure Portal](https://portal.azure.com)|
|OPENAI_TEMPERATURE|0.7| Azure OpenAI Temperature |
|OPENAI_TEMPERATURE|0.1| Azure OpenAI Temperature |
|OPENAI_MAX_TOKENS|-1| Azure OpenAI Max Tokens |
|VECTOR_STORE_TYPE| AzureSearch | Vector Store Type. Use AzureSearch for Azure Cognitive Search, leave it blank for Redis or Azure Cache for Redis Enterprise|
|AZURE_CLOUD|AzureCloud| Azure Cloud to use. AzureCloud for Azure Global, AzureChinaCloud for Azure China |
|VECTOR_STORE_TYPE| PGVector | Vector Store Type. Use AzureSearch for Azure Cognitive Search, PGVector for Azure PostgreSQL, leave it blank for Redis or Azure Cache for Redis Enterprise|
|AZURE_SEARCH_SERVICE_NAME| YOUR_AZURE_SEARCH_SERVICE_URL | Your Azure Cognitive Search service name. Get it in the [Azure Portal](https://portal.azure.com)|
|AZURE_SEARCH_ADMIN_KEY| AZURE_SEARCH_ADMIN_KEY | Your Azure Cognitive Search Admin key. Get it in the [Azure Portal](https://portal.azure.com)|
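|PGVECTOR_HOST| Your_PG_NAME.postgres.database.azure.com or Your_PG_NAME.postgres.database.chinacloudapi.cn | Host name of your Azure PostgreSQL server (Azure Global or Azure China)|
|PGVECTOR_PORT| 5432 | Port of your Azure PostgreSQL server|
|PGVECTOR_DATABASE| YOUR_PG_DATABASE | Name of your Azure PostgreSQL database|
|PGVECTOR_USER| YOUR_PG_USER | User name for your Azure PostgreSQL server|
|PGVECTOR_PASSWORD| YOUR_PG_PASSWORD | Password for your Azure PostgreSQL user|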
|REDIS_ADDRESS| api | URL for Redis Stack: "api" for docker compose|
|REDIS_PORT | 6379 | Port for Redis |
|REDIS_PASSWORD| redis-stack-password | OPTIONAL - Password for your Redis Stack|
33 changes: 22 additions & 11 deletions code/OpenAI_Queries.py
@@ -47,19 +47,10 @@ def check_deployment():
Then restart your application.
""")
st.error(traceback.format_exc())
# 4. Check if the Redis is working with previous version of data
# 4. Check if the VectorStore is working with previous version of data
try:
llm_helper = LLMHelper()
if llm_helper.vector_store_type != "AzureSearch":
if llm_helper.vector_store.check_existing_index("embeddings-index"):
st.warning("""Seems like you're using a Redis with an old data structure.
If you want to use the new data structure, you can start using the app and go to "Add Document" -> "Add documents in Batch" and click on "Convert all files and add embeddings" to reprocess your documents.
To remove this working, please delete the index "embeddings-index" from your Redis.
If you prefer to use the old data structure, please change your Web App container image to point to the docker image: fruocco/oai-embeddings:2023-03-27_25.
""")
else:
st.success("Redis is working!")
else:
if llm_helper.vector_store_type == "AzureSearch":
try:
llm_helper.vector_store.index_exists()
st.success("Azure Cognitive Search is working!")
@@ -69,6 +60,26 @@ def check_deployment():
Then restart your application.
""")
st.error(traceback.format_exc())
elif llm_helper.vector_store_type == "PGVector":
try:
llm_helper.vector_store.__post_init__()
st.success("PGVector is working!")
except Exception as e:
st.error("""PGVector is not working.
Please check your Azure PostgreSQL server, database, user name and password in the App Settings.
Make sure the network settings(firewall rule) allow your app to access the Azure PostgreSQL service.
Then restart your application.
""")
st.error(traceback.format_exc())
else:
if llm_helper.vector_store.check_existing_index("embeddings-index"):
st.warning("""Seems like you're using a Redis with an old data structure.
If you want to use the new data structure, you can start using the app and go to "Add Document" -> "Add documents in Batch" and click on "Convert all files and add embeddings" to reprocess your documents.
To remove this working, please delete the index "embeddings-index" from your Redis.
If you prefer to use the old data structure, please change your Web App container image to point to the docker image: fruocco/oai-embeddings:2023-03-27_25.
""")
else:
st.success("Redis is working!")
except Exception as e:
st.error(f"""Redis is not working.
Please check your Redis connection string in the App Settings.
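The health check in this hunk branches three ways on `VECTOR_STORE_TYPE`: `AzureSearch`, `PGVector`, and everything else (treated as Redis). A minimal sketch of that selection rule, assuming a hypothetical helper named `resolve_vector_store_kind` that is not part of the commit:

```python
import os


def resolve_vector_store_kind(value=None):
    """Return 'AzureSearch', 'PGVector', or 'Redis' for a VECTOR_STORE_TYPE value.

    Hypothetical helper for illustration; the commit inlines this logic
    as an if/elif/else chain.
    """
    # Fall back to the environment only when no explicit value is passed
    raw = os.getenv("VECTOR_STORE_TYPE") if value is None else value
    if raw == "AzureSearch":
        return "AzureSearch"
    if raw == "PGVector":
        return "PGVector"
    # Blank, unset, or unrecognized values select Redis, as in the diff's else branch
    return "Redis"
```

Note that any misspelled value (for example `pgvector` in lower case) silently selects Redis under this rule, which is worth keeping in mind when a deployment reports "Redis is not working" unexpectedly.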
2 changes: 2 additions & 0 deletions code/requirements.txt
@@ -19,5 +19,7 @@ beautifulsoup4==4.12.0
streamlit-chat==0.0.2.2
fake-useragent==1.1.3
chardet==5.1.0
pgvector==0.2.4
psycopg2-binary==2.9.9
--extra-index-url https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/
azure-search-documents==11.4.0a20230509004
10 changes: 6 additions & 4 deletions code/utilities/azureblobstorage.py
@@ -8,9 +8,11 @@ def __init__(self, account_name: str = None, account_key: str = None, container_

load_dotenv()

self.azure_cloud : str = os.getenv('AZURE_CLOUD', 'AzureCloud')
self.blob_endpoint_suffix : str = 'core.chinacloudapi.cn' if self.azure_cloud == 'AzureChinaCloud' else 'core.windows.net'
self.account_name : str = account_name if account_name else os.getenv('BLOB_ACCOUNT_NAME')
self.account_key : str = account_key if account_key else os.getenv('BLOB_ACCOUNT_KEY')
self.connect_str : str = f"DefaultEndpointsProtocol=https;AccountName={self.account_name};AccountKey={self.account_key};EndpointSuffix=core.windows.net"
self.connect_str : str = f"DefaultEndpointsProtocol=https;AccountName={self.account_name};AccountKey={self.account_key};EndpointSuffix={self.blob_endpoint_suffix}"
self.container_name : str = container_name if container_name else os.getenv('BLOB_CONTAINER_NAME')
self.blob_service_client : BlobServiceClient = BlobServiceClient.from_connection_string(self.connect_str)

@@ -40,12 +42,12 @@ def get_all_files(self):
"filename" : blob.name,
"converted": blob.metadata.get('converted', 'false') == 'true' if blob.metadata else False,
"embeddings_added": blob.metadata.get('embeddings_added', 'false') == 'true' if blob.metadata else False,
"fullpath": f"https://{self.account_name}.blob.core.windows.net/{self.container_name}/{blob.name}?{sas}",
"fullpath": f"https://{self.account_name}.blob.{self.blob_endpoint_suffix}/{self.container_name}/{blob.name}?{sas}",
"converted_filename": blob.metadata.get('converted_filename', '') if blob.metadata else '',
"converted_path": ""
})
else:
converted_files[blob.name] = f"https://{self.account_name}.blob.core.windows.net/{self.container_name}/{blob.name}?{sas}"
converted_files[blob.name] = f"https://{self.account_name}.blob.{self.blob_endpoint_suffix}/{self.container_name}/{blob.name}?{sas}"

for file in files:
converted_filename = file.pop('converted_filename', '')
@@ -70,4 +72,4 @@ def get_container_sas(self):

def get_blob_sas(self, file_name):
# Generate a SAS URL to the blob and return it
return f"https://{self.account_name}.blob.core.windows.net/{self.container_name}/{file_name}" + "?" + generate_blob_sas(account_name= self.account_name, container_name=self.container_name, blob_name= file_name, account_key= self.account_key, permission='r', expiry=datetime.utcnow() + timedelta(hours=1))
return f"https://{self.account_name}.blob.{self.blob_endpoint_suffix}/{self.container_name}/{file_name}" + "?" + generate_blob_sas(account_name= self.account_name, container_name=self.container_name, blob_name= file_name, account_key= self.account_key, permission='r', expiry=datetime.utcnow() + timedelta(hours=1))
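The change above repeats the same suffix lookup in four URL templates. As a sketch only, the rule can be isolated in one place; the names `blob_endpoint_suffix` and `blob_url` below are hypothetical and do not appear in the commit:

```python
import os

# Endpoint suffixes taken from the diff; the mapping itself is an illustrative assumption
_SUFFIXES = {
    "AzureCloud": "core.windows.net",
    "AzureChinaCloud": "core.chinacloudapi.cn",
}


def blob_endpoint_suffix(azure_cloud=None):
    """Return the storage endpoint suffix for the given (or configured) cloud."""
    cloud = azure_cloud or os.getenv("AZURE_CLOUD", "AzureCloud")
    # Any value other than AzureChinaCloud resolves to the global suffix,
    # matching the conditional expression in the diff
    return _SUFFIXES.get(cloud, "core.windows.net")


def blob_url(account_name, container_name, blob_name, azure_cloud=None):
    """Build the blob URL (without a SAS token) for the selected cloud."""
    suffix = blob_endpoint_suffix(azure_cloud)
    return f"https://{account_name}.blob.{suffix}/{container_name}/{blob_name}"
```

For example, `blob_url("acct", "docs", "a.pdf", "AzureChinaCloud")` yields a URL under `blob.core.chinacloudapi.cn`, while the default stays on `blob.core.windows.net`.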
24 changes: 22 additions & 2 deletions code/utilities/helper.py
@@ -28,6 +28,7 @@
from utilities.customprompt import PROMPT
from utilities.redis import RedisExtended
from utilities.azuresearch import AzureSearch
from utilities.pgvector import PGVectorExtended

import pandas as pd
import urllib
@@ -69,10 +70,24 @@ def __init__(self,
self.vector_store_type = os.getenv("VECTOR_STORE_TYPE")

# Azure Search settings
if self.vector_store_type == "AzureSearch":
if self.vector_store_type == "AzureSearch":
self.vector_store_address: str = os.getenv('AZURE_SEARCH_SERVICE_NAME')
self.vector_store_password: str = os.getenv('AZURE_SEARCH_ADMIN_KEY')

# PGVector settings
elif self.vector_store_type == "PGVector":
self.vector_store_driver: str = os.getenv('PGVECTOR_DRIVER', "psycopg2")
self.vector_store_address: str = os.getenv('PGVECTOR_HOST', "localhost")
self.vector_store_port: int = int(os.getenv('PGVECTOR_PORT', 5432))
self.vector_store_database: str = os.getenv("PGVECTOR_DATABASE", "postgres")
self.vector_store_username: str = os.getenv("PGVECTOR_USER", "postgres")
self.vector_store_password: str = os.getenv("PGVECTOR_PASSWORD", "postgres")

if self.vector_store_password:
self.vector_store_full_address = f"postgresql+{self.vector_store_driver}://{self.vector_store_username}:{self.vector_store_password}@{self.vector_store_address}:{self.vector_store_port}/{self.vector_store_database}"
else:
self.vector_store_full_address = f"postgresql+{self.vector_store_driver}://{self.vector_store_username}@{self.vector_store_address}:{self.vector_store_port}/{self.vector_store_database}"

else:
# Vector store settings
self.vector_store_address: str = os.getenv('REDIS_ADDRESS', "localhost")
@@ -94,8 +109,11 @@ def __init__(self,
self.llm: ChatOpenAI = ChatOpenAI(model_name=self.deployment_name, engine=self.deployment_name, temperature=self.temperature, max_tokens=self.max_tokens if self.max_tokens != -1 else None) if llm is None else llm
else:
self.llm: AzureOpenAI = AzureOpenAI(deployment_name=self.deployment_name, temperature=self.temperature, max_tokens=self.max_tokens) if llm is None else llm

if self.vector_store_type == "AzureSearch":
self.vector_store: VectorStore = AzureSearch(azure_cognitive_search_name=self.vector_store_address, azure_cognitive_search_key=self.vector_store_password, index_name=self.index_name, embedding_function=self.embeddings.embed_query) if vector_store is None else vector_store
elif self.vector_store_type == "PGVector":
self.vector_store: PGVectorExtended = PGVectorExtended(connection_string=self.vector_store_full_address, embedding_function=self.embeddings, collection_name="qnacollection", pre_delete_collection=False) if vector_store is None else vector_store
else:
self.vector_store: RedisExtended = RedisExtended(redis_url=self.vector_store_full_address, index_name=self.index_name, embedding_function=self.embeddings.embed_query) if vector_store is None else vector_store
self.k : int = 3 if k is None else k
@@ -138,8 +156,10 @@ def add_embeddings_lc(self, source_url):
hash_key = f"doc:{self.index_name}:{hash_key}"
keys.append(hash_key)
doc.metadata = {"source": f"[{source_url}]({source_url}_SAS_TOKEN_PLACEHOLDER_)" , "chunk": i, "key": hash_key, "filename": filename}
if self.vector_store_type == 'AzureSearch':
if self.vector_store_type == "AzureSearch":
self.vector_store.add_documents(documents=docs, keys=keys)
elif self.vector_store_type == "PGVector":
self.vector_store.add_documents(documents=docs, keys=keys, ids=keys)
else:
self.vector_store.add_documents(documents=docs, redis_url=self.vector_store_full_address, index_name=self.index_name, keys=keys)
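The PGVector settings in this file interpolate the user name and password into the connection URL as-is, so a password containing characters such as `@` or `/` would likely produce a URL that does not parse as intended. A minimal sketch of the same URL layout with percent-encoding applied, assuming a hypothetical helper `build_pg_url` that is not part of the commit:

```python
from urllib.parse import quote


def build_pg_url(user, password, host, port=5432, database="postgres", driver="psycopg2"):
    """Build a postgresql+<driver> URL in the layout used by the diff.

    Hypothetical helper: the commit builds this string inline with f-strings
    and does not escape the credentials.
    """
    # safe="" makes quote() escape every reserved character, including "/" and "@"
    credentials = quote(user, safe="")
    if password:
        credentials += ":" + quote(password, safe="")
    return f"postgresql+{driver}://{credentials}@{host}:{port}/{database}"
```

With this sketch, `build_pg_url("pguser", "p@ss/word", "localhost")` keeps the `@` that separates credentials from the host unambiguous, because the password is emitted as `p%40ss%2Fword`.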
