|
| 1 | +# Safe Retriever for LangChain |
| 2 | + |
| 3 | +***Semantic Enforcement RAG using PebbloRetrievalQA*** |
| 4 | + |
| 5 | +`PebbloRetrievalQA` is a Retrieval chain with Identity & Semantic Enforcement for question-answering against a vector database. |
| 6 | + |
| 7 | +This document covers how to retrieve documents with Semantic Enforcement. |
| 8 | + |
| 9 | +**Steps:** |
| 10 | + |
| 11 | +- **Loading Documents with Semantic metadata:** The process starts by loading documents whose metadata carries semantic topics and entities. |
| 12 | +- **Using supported Vector database:** The `PebbloRetrievalQA` chain requires a vector database that supports rich metadata filtering. Pick one |
| 13 | + from the supported vector database list shown below in this document. |
| 14 | +- **Initializing PebbloRetrievalQA Chain:** After loading the documents, the `PebbloRetrievalQA` chain is initialized. This chain uses the retriever ( |
| 15 | + created from the vector database) and an LLM. |
| 16 | +- **The 'ask' Function:** The 'ask' function is used to pose questions to the system. This function accepts a question and a `semantic_context` as |
| 17 | + input and returns the answer using the `PebbloRetrievalQA` chain. The semantic context contains the topics and entities that should be denied within |
| 18 | + the context used to generate a response. |
| 19 | +- **Posing Questions:** Finally, questions are posed to the system. The system retrieves answers based on the semantic metadata in the documents |
| 20 | + and the `semantic_context` provided in the 'ask' function. |
| 21 | + |
| 22 | +## Setup |
| 23 | + |
| 24 | +### Dependencies |
| 25 | + |
| 26 | +The walkthrough requires LangChain, langchain-community, langchain-openai, and a Qdrant client. |
| 27 | + |
| 28 | +```bash |
| 29 | +pip install --upgrade --quiet langchain langchain-community langchain-openai qdrant_client |
| 30 | +``` |
| 31 | + |
| 32 | +### Identity-aware Data Ingestion |
| 33 | + |
| 34 | +This walkthrough uses Qdrant as the vector database. Any other supported vector |
| 35 | +database can be used instead. |
| 36 | + |
| 37 | +**PebbloRetrievalQA chain supports the following vector databases:** |
| 38 | + |
| 39 | +1. Qdrant |
| 40 | +1. Pinecone |
| 41 | + |
| 42 | +**Load vector database with semantic information in metadata:** |
| 43 | + |
| 44 | +In this phase, the semantic topics and entities of the original document are captured and stored in the `pebblo_semantic_topics` |
| 45 | +and `pebblo_semantic_entities` fields respectively within the metadata of |
| 46 | +each chunk in the VectorDB entry. |
| 47 | + |
| 48 | +_It's important to note that to use the PebbloRetrievalQA chain, semantic metadata must always be placed in the `pebblo_semantic_topics` |
| 49 | +and `pebblo_semantic_entities` fields._ |
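As a rough illustration of the requirement above, the following sketch checks that a chunk's metadata carries both fields as lists of strings before it is indexed. The helper name `has_semantic_metadata` is hypothetical and is not part of LangChain or Pebblo; it only assumes the two field names given in this document and the list-of-strings shape used in the sample below.

```python
from typing import Any, Dict

# Field names taken from this document; the helper itself is hypothetical.
REQUIRED_FIELDS = ("pebblo_semantic_topics", "pebblo_semantic_entities")


def has_semantic_metadata(metadata: Dict[str, Any]) -> bool:
    """Return True if both semantic fields are present as lists of strings."""
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if not isinstance(value, list):
            return False
        if not all(isinstance(item, str) for item in value):
            return False
    return True
```

A check like this could run over each chunk before it is passed to the vector store, so that chunks with missing semantic fields are caught at ingestion time rather than at query time.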
| 50 | + |
| 51 | +```python |
| 52 | +from langchain_community.vectorstores.qdrant import Qdrant |
| 53 | +from langchain_core.documents import Document |
| 54 | +from langchain_openai.embeddings import OpenAIEmbeddings |
| 55 | +from langchain_openai.llms import OpenAI |
| 56 | + |
| 57 | +llm = OpenAI() |
| 58 | +embeddings = OpenAIEmbeddings() |
| 59 | +collection_name = "pebblo-semantic-rag" |
| 60 | + |
| 61 | +page_content = """ |
| 62 | +**ACME Corp Financial Report** |
| 63 | +
|
| 64 | +**Overview:** |
| 65 | +ACME Corp, a leading player in the merger and acquisition industry, presents its financial report for the fiscal year ending December 31, 2020. |
| 66 | +Despite a challenging economic landscape, ACME Corp demonstrated robust performance and strategic growth. |
| 67 | +
|
| 68 | +**Financial Highlights:** |
| 69 | +Revenue soared to $50 million, marking a 15% increase from the previous year, driven by successful deal closures and expansion into new markets. |
| 70 | +Net profit reached $12 million, showcasing a healthy margin of 24%. |
| 71 | +
|
| 72 | +**Key Metrics:** |
| 73 | +Total assets surged to $80 million, reflecting a 20% growth, highlighting ACME Corp's strong financial position and asset base. |
| 74 | +Additionally, the company maintained a conservative debt-to-equity ratio of 0.5, ensuring sustainable financial stability. |
| 75 | +
|
| 76 | +**Future Outlook:** |
| 77 | +ACME Corp remains optimistic about the future, with plans to capitalize on emerging opportunities in the global M&A landscape. |
| 78 | +The company is committed to delivering value to shareholders while maintaining ethical business practices. |
| 79 | +
|
| 80 | +**Bank Account Details:** |
| 81 | +For inquiries or transactions, please refer to ACME Corp's US bank account: |
| 82 | +Account Number: 123456789012 |
| 83 | +Bank Name: Fictitious Bank of America |
| 84 | +""" |
| 85 | + |
| 86 | +documents = [ |
| 87 | + Document( |
| 88 | + **{ |
| 89 | + "page_content": page_content, |
| 90 | + "metadata": { |
| 91 | + "pebblo_semantic_topics": ["financial-report"], |
| 92 | + "pebblo_semantic_entities": ["us-bank-account-number"], |
| 93 | + "page": 0, |
| 94 | + "source": "https://drive.google.com/file/d/xxxxxxxxxxxxx/view", |
| 95 | + "title": "ACME Corp Financial Report.pdf", |
| 96 | + }, |
| 97 | + } |
| 98 | + ) |
| 99 | +] |
| 100 | + |
| 101 | +print("Loading vectordb...") |
| 102 | + |
| 103 | +vectordb = Qdrant.from_documents( |
| 104 | + documents, |
| 105 | + embeddings, |
| 106 | + location=":memory:", |
| 107 | + collection_name=collection_name, |
| 108 | +) |
| 109 | + |
| 110 | +print("Vectordb loaded.") |
| 111 | +``` |
| 112 | + |
| 113 | +## Retrieval with Semantic Enforcement |
| 114 | + |
| 115 | +The PebbloRetrievalQA chain uses SafeRetrieval to ensure that the snippets used in context are retrieved only from documents that comply with the |
| 116 | +provided semantic context. |
| 117 | +To achieve this, the Gen-AI application must provide a semantic context for this retrieval chain. |
| 118 | +This `semantic_context` should include the topics and entities that should be denied for the user accessing the Gen-AI app. |
| 119 | + |
| 120 | +Below is sample code for `PebbloRetrievalQA` with `topics_to_deny` and `entities_to_deny`. Both are passed to the chain input inside `semantic_context`. |
| 121 | + |
| 122 | +```python |
| 123 | +from typing import Optional, List |
| 124 | +from langchain_community.chains import PebbloRetrievalQA |
| 125 | +from langchain_community.chains.pebblo_retrieval.models import ( |
| 126 | + ChainInput, |
| 127 | + SemanticContext, |
| 128 | +) |
| 129 | + |
| 130 | +# Initialize PebbloRetrievalQA chain |
| 131 | +qa_chain = PebbloRetrievalQA.from_chain_type( |
| 132 | + llm=llm, |
| 133 | + app_name="pebblo-semantic-retriever-rag", |
| 134 | + owner="Joe Smith", |
| 135 | + description="Semantic filtering using PebbloSafeLoader, and PebbloRetrievalQA", |
| 136 | + chain_type="stuff", |
| 137 | + retriever=vectordb.as_retriever(), |
| 138 | + verbose=True, |
| 139 | +) |
| 140 | + |
| 141 | + |
| 142 | +def ask( |
| 143 | + question: str, |
| 144 | + topics_to_deny: Optional[List[str]] = None, |
| 145 | + entities_to_deny: Optional[List[str]] = None, |
| 146 | +): |
| 147 | + """ |
| 148 | + Ask a question to the PebbloRetrievalQA chain |
| 149 | + """ |
| 150 | + semantic_context = dict() |
| 151 | + if topics_to_deny: |
| 152 | + semantic_context["pebblo_semantic_topics"] = {"deny": topics_to_deny} |
| 153 | + if entities_to_deny: |
| 154 | + semantic_context["pebblo_semantic_entities"] = {"deny": entities_to_deny} |
| 155 | + |
| 156 | + semantic_context_obj = ( |
| 157 | + SemanticContext(**semantic_context) if semantic_context else None |
| 158 | + ) |
| 159 | + chain_input_obj = ChainInput(query=question, semantic_context=semantic_context_obj) |
| 160 | + return qa_chain.invoke(chain_input_obj.dict()) |
| 161 | +``` |
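To make the shape of the semantic context easier to see, the sketch below reproduces only the dictionary-building step of `ask` as a standalone function. The name `build_semantic_context` is hypothetical; the keys and the `deny` field mirror the code above, and nothing here calls LangChain or Pebblo.

```python
from typing import Dict, List, Optional


def build_semantic_context(
    topics_to_deny: Optional[List[str]] = None,
    entities_to_deny: Optional[List[str]] = None,
) -> Dict[str, Dict[str, List[str]]]:
    """Mirror the dictionary that `ask` assembles before creating SemanticContext."""
    context: Dict[str, Dict[str, List[str]]] = {}
    # Empty or missing lists add no key, matching the truthiness checks in `ask`.
    if topics_to_deny:
        context["pebblo_semantic_topics"] = {"deny": list(topics_to_deny)}
    if entities_to_deny:
        context["pebblo_semantic_entities"] = {"deny": list(entities_to_deny)}
    return context


print(build_semantic_context(topics_to_deny=["financial-report"]))
# {'pebblo_semantic_topics': {'deny': ['financial-report']}}
```

When both lists are empty the dictionary stays empty, which is why `ask` passes `semantic_context=None` to the chain in that case.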
| 162 | + |
| 163 | +## Ask questions |
| 164 | + |
| 165 | +### Without semantic enforcement |
| 166 | + |
| 167 | +Since no semantic enforcement is applied, the system should return the answer. |
| 168 | + |
| 169 | +```python |
| 170 | +topic_to_deny = [] |
| 171 | +entities_to_deny = [] |
| 172 | +question = "Please share the financial performance of ACME Corp for 2020" |
| 173 | +resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny) |
| 174 | +print( |
| 175 | + f"Topics to deny: {topic_to_deny}\nEntities to deny: {entities_to_deny}\n" |
| 176 | + f"Question: {question}\nAnswer: {resp['result']}\n" |
| 177 | +) |
| 178 | +``` |
| 179 | + |
| 180 | +Output: |
| 181 | + |
| 182 | +```bash |
| 183 | +Topics to deny: [] |
| 184 | +Entities to deny: [] |
| 185 | +Question: Please share the financial performance of ACME Corp for 2020 |
| 186 | +Answer: |
| 187 | +ACME Corp had a strong financial performance in 2020, with a 15% increase in revenue to $50 million and a net profit of $12 million, |
| 188 | +indicating a healthy margin of 24%. The company also saw a 20% growth in total assets, reaching $80 million. |
| 189 | +ACME Corp maintained a conservative debt-to-equity ratio of 0.5, ensuring financial stability. |
| 190 | +The company has plans to capitalize on emerging opportunities in the global M&A landscape and is committed to delivering value |
| 191 | +to shareholders while maintaining ethical business practices. |
| 192 | +``` |
| 193 | +
|
| 194 | +### Deny financial-report topic |
| 195 | +
|
| 196 | +The data was ingested with the topics `["financial-report"]`. |
| 197 | +Therefore, an app that denies the "financial-report" topic should not receive an answer. |
| 198 | +
|
| 199 | +```python |
| 200 | +topic_to_deny = ["financial-report"] |
| 201 | +entities_to_deny = [] |
| 202 | +question = "Please share the financial performance of ACME Corp for 2020" |
| 203 | +resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny) |
| 204 | +print( |
| 205 | + f"Topics to deny: {topic_to_deny}\nEntities to deny: {entities_to_deny}\n" |
| 206 | + f"Question: {question}\nAnswer: {resp['result']}\n" |
| 207 | +) |
| 208 | +``` |
| 209 | +
|
| 210 | +Output: |
| 211 | +
|
| 212 | +```bash |
| 213 | +Topics to deny: ['financial-report'] |
| 214 | +Entities to deny: [] |
| 215 | +Question: Please share the financial performance of ACME Corp for 2020 |
| 216 | +Answer: Unfortunately, I do not have access to that information. |
| 217 | +``` |
| 218 | +
|
| 219 | +### Deny us-bank-account-number entity |
| 220 | +
|
| 221 | +The data was ingested with the entities `["us-bank-account-number"]`. Since this entity is denied, the system should not return the answer. |
| 222 | +
|
| 223 | +```python |
| 224 | +topic_to_deny = [] |
| 225 | +entities_to_deny = ["us-bank-account-number"] |
| 226 | +question = "Please share the financial performance of ACME Corp for 2020" |
| 227 | +resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny) |
| 228 | +print( |
| 229 | + f"Topics to deny: {topic_to_deny}\nEntities to deny: {entities_to_deny}\n" |
| 230 | + f"Question: {question}\nAnswer: {resp['result']}\n" |
| 231 | +) |
| 232 | +``` |
| 233 | +
|
| 234 | +Output: |
| 235 | +
|
| 236 | +```bash |
| 237 | +Topics to deny: [] |
| 238 | +Entities to deny: ['us-bank-account-number'] |
| 239 | +Question: Please share the financial performance of ACME Corp for 2020 |
| 240 | +Answer: Unfortunately, I do not have access to that information. |
| 241 | +``` |
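The three outcomes above can be approximated in plain Python. This is only a sketch of the intended effect, assuming a chunk is excluded when any of its topics or entities appears in the corresponding deny list; the actual chain presumably relies on the vector database's metadata filtering, and `is_denied` is a hypothetical helper, not a Pebblo or LangChain function.

```python
from typing import Any, Dict, List


def is_denied(
    metadata: Dict[str, Any],
    topics_to_deny: List[str],
    entities_to_deny: List[str],
) -> bool:
    """Return True if the chunk carries any denied topic or entity."""
    topics = set(metadata.get("pebblo_semantic_topics", []))
    entities = set(metadata.get("pebblo_semantic_entities", []))
    return bool(topics & set(topics_to_deny)) or bool(entities & set(entities_to_deny))


# Metadata of the single chunk ingested in this walkthrough.
chunk_metadata = {
    "pebblo_semantic_topics": ["financial-report"],
    "pebblo_semantic_entities": ["us-bank-account-number"],
}

print(is_denied(chunk_metadata, [], []))  # False
print(is_denied(chunk_metadata, ["financial-report"], []))  # True
print(is_denied(chunk_metadata, [], ["us-bank-account-number"]))  # True
```

Because the walkthrough ingests a single chunk, denying either its topic or its entity leaves the chain with no context, which is consistent with the "I do not have access" answers shown above.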