Better Ollama support #208

Merged (20 commits, Aug 27, 2024)
Changes from 17 commits
63 changes: 48 additions & 15 deletions README.md
@@ -168,20 +168,18 @@ result = agent.chat_with_workflow(conv)

### Tools
There are a variety of tools for the model or the user to use. Some are executed locally
while others are hosted for you. You can also ask an LMM directly to build a tool for
you. For example:
while others are hosted for you. You can easily access them yourself; for example, if
you want to run `owl_v2` and visualize the output, you can run:

```python
>>> import vision_agent as va
>>> lmm = va.lmm.OpenAILMM()
>>> detector = lmm.generate_detector("Can you build a jar detector for me?")
>>> detector(va.tools.load_image("jar.jpg"))
[{"labels": ["jar",],
"scores": [0.99],
"bboxes": [
[0.58, 0.2, 0.72, 0.45],
]
}]
import vision_agent.tools as T
import matplotlib.pyplot as plt

image = T.load_image("dogs.jpg")
dets = T.owl_v2("dogs", image)
viz = T.overlay_bounding_boxes(image, dets)
plt.imshow(viz)
plt.show()
```

You can also add custom tools to the agent:
@@ -214,6 +212,41 @@ function. Make sure the documentation is in the same format above with description,
`Parameters:`, `Returns:`, and `Example\n-------`. You can find an example use case
[here](examples/custom_tools/) as this is what the agent uses to pick and use the tool.
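
For reference, a tool docstring following that format might look like the sketch
below. `blur_image` is a hypothetical example written for illustration only; it is not
part of the library's tool set.

```python
import numpy as np


def blur_image(image: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """'blur_image' applies a simple box blur to an RGB image.

    Parameters:
        image (np.ndarray): The input image as an RGB numpy array.
        kernel_size (int): Side length of the averaging window. Defaults to 5.

    Returns:
        np.ndarray: The blurred image, same shape and dtype as the input.

    Example
    -------
        >>> blurred = blur_image(image, kernel_size=3)
    """
    pad = kernel_size // 2
    # Pad the borders so the output keeps the input's height and width.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(image, dtype=np.float32)
    # Average over a kernel_size x kernel_size neighborhood.
    for dy in range(kernel_size):
        for dx in range(kernel_size):
            out += padded[dy : dy + image.shape[0], dx : dx + image.shape[1]]
    return (out / kernel_size**2).astype(image.dtype)
```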

## Additional LLMs
### Ollama
We also provide a `VisionAgentCoder` that uses Ollama. To get started you must download
a few models:

```bash
ollama pull llama3.1
ollama pull mxbai-embed-large
```

`llama3.1` is used by the `OllamaLMM` in `OllamaVisionAgentCoder`. Normally we would
use an actual LMM such as `llava`, but `llava` cannot handle the long context lengths
required by the agent. Since `llama3.1` cannot handle images, you may see some
performance degradation. `mxbai-embed-large` is the embedding model used to look up
tools. You can use it just like you would use `VisionAgentCoder`:

```python
>>> import vision_agent as va
>>> agent = va.agent.OllamaVisionAgentCoder()
>>> agent("Count the apples in the image", media="apples.jpg")
```
> WARNING: VisionAgent doesn't work well unless the underlying LMM is sufficiently powerful. Do not expect good results or even working code with smaller models like Llama 3.1 8B.
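
If you want to point individual roles at different Ollama models, the constructor
accepts per-role overrides. A minimal sketch, assuming you have pulled the alternative
model tag used here (`llama3.1:70b` is just an example; substitute any tag you have):

```python
import vision_agent as va
from vision_agent.lmm import OllamaLMM

# Override only the coder; the planner, tester, debugger and the embedding-based
# tool recommender keep their llama3.1 / mxbai-embed-large defaults.
coder = OllamaLMM(model_name="llama3.1:70b", temperature=0.0)
agent = va.agent.OllamaVisionAgentCoder(coder=coder)
code = agent("Count the apples in the image", media="apples.jpg")
```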

### Azure OpenAI
We also provide an `AzureVisionAgentCoder` that uses Azure OpenAI models. To get
started, follow the Azure Setup section below. You can use it just like you would use
`VisionAgentCoder`:

```python
>>> import vision_agent as va
>>> agent = va.agent.AzureVisionAgentCoder()
>>> agent("Count the apples in the image", media="apples.jpg")
```


### Azure Setup
If you want to use Azure OpenAI models, you need to have two OpenAI model deployments:

@@ -252,6 +285,6 @@ agent = va.agent.AzureVisionAgentCoder()
2. Follow the instructions to purchase and manage your API credits.
3. Ensure your API key is correctly configured in your project settings.

Failure to have sufficient API credits may result in limited or no functionality for the features that rely on the OpenAI API.

For more details on managing your API usage and credits, please refer to the OpenAI API documentation.
Failure to have sufficient API credits may result in limited or no functionality for
the features that rely on the OpenAI API. For more details on managing your API usage
and credits, please refer to the OpenAI API documentation.
67 changes: 52 additions & 15 deletions docs/index.md
@@ -1,4 +1,9 @@
# 🔍🤖 Vision Agent
[![](https://dcbadge.vercel.app/api/server/wPdN8RCYew?compact=true&style=flat)](https://discord.gg/wPdN8RCYew)
![ci_status](https://github.com/landing-ai/vision-agent/actions/workflows/ci_cd.yml/badge.svg)
[![PyPI version](https://badge.fury.io/py/vision-agent.svg)](https://badge.fury.io/py/vision-agent)
![version](https://img.shields.io/pypi/pyversions/vision-agent)

Vision Agent is a library that helps you utilize agent frameworks to generate code to
solve your vision task. Many current vision problems can easily take hours or days to
@@ -160,20 +165,18 @@ result = agent.chat_with_workflow(conv)

### Tools
There are a variety of tools for the model or the user to use. Some are executed locally
while others are hosted for you. You can also ask an LMM directly to build a tool for
you. For example:
while others are hosted for you. You can easily access them yourself; for example, if
you want to run `owl_v2` and visualize the output, you can run:

```python
>>> import vision_agent as va
>>> lmm = va.lmm.OpenAILMM()
>>> detector = lmm.generate_detector("Can you build a jar detector for me?")
>>> detector(va.tools.load_image("jar.jpg"))
[{"labels": ["jar",],
"scores": [0.99],
"bboxes": [
[0.58, 0.2, 0.72, 0.45],
]
}]
import vision_agent.tools as T
import matplotlib.pyplot as plt

image = T.load_image("dogs.jpg")
dets = T.owl_v2("dogs", image)
viz = T.overlay_bounding_boxes(image, dets)
plt.imshow(viz)
plt.show()
```

You can also add custom tools to the agent:
@@ -206,6 +209,40 @@ function. Make sure the documentation is in the same format above with description,
`Parameters:`, `Returns:`, and `Example\n-------`. You can find an example use case
[here](examples/custom_tools/) as this is what the agent uses to pick and use the tool.

## Additional LLMs
### Ollama
We also provide a `VisionAgentCoder` that uses Ollama. To get started you must download
a few models:

```bash
ollama pull llama3.1
ollama pull mxbai-embed-large
```

`llama3.1` is used by the `OllamaLMM` in `OllamaVisionAgentCoder`. Normally we would
use an actual LMM such as `llava`, but `llava` cannot handle the long context lengths
required by the agent. Since `llama3.1` cannot handle images, you may see some
performance degradation. `mxbai-embed-large` is the embedding model used to look up
tools. You can use it just like you would use `VisionAgentCoder`:

```python
>>> import vision_agent as va
>>> agent = va.agent.OllamaVisionAgentCoder()
>>> agent("Count the apples in the image", media="apples.jpg")
```

### Azure OpenAI
We also provide an `AzureVisionAgentCoder` that uses Azure OpenAI models. To get
started, follow the Azure Setup section below. You can use it just like you would use
`VisionAgentCoder`:

```python
>>> import vision_agent as va
>>> agent = va.agent.AzureVisionAgentCoder()
>>> agent("Count the apples in the image", media="apples.jpg")
```
> WARNING: VisionAgent doesn't work well unless the underlying LMM is sufficiently powerful. Do not expect good results or even working code with smaller models like Llama 3.1 8B.

### Azure Setup
If you want to use Azure OpenAI models, you need to have two OpenAI model deployments:

@@ -244,6 +281,6 @@ agent = va.agent.AzureVisionAgentCoder()
2. Follow the instructions to purchase and manage your API credits.
3. Ensure your API key is correctly configured in your project settings.

Failure to have sufficient API credits may result in limited or no functionality for the features that rely on the OpenAI API.

For more details on managing your API usage and credits, please refer to the OpenAI API documentation.
Failure to have sufficient API credits may result in limited or no functionality for
the features that rely on the OpenAI API. For more details on managing your API usage
and credits, please refer to the OpenAI API documentation.
20 changes: 0 additions & 20 deletions docs/lmms.md

This file was deleted.

6 changes: 5 additions & 1 deletion vision_agent/agent/__init__.py
@@ -1,3 +1,7 @@
from .agent import Agent
from .vision_agent import VisionAgent
from .vision_agent_coder import AzureVisionAgentCoder, VisionAgentCoder
from .vision_agent_coder import (
    AzureVisionAgentCoder,
    OllamaVisionAgentCoder,
    VisionAgentCoder,
)
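
With this export in place, a short sketch of how the new class can be imported; nothing
beyond the names added above is assumed:

```python
# The new coder is importable directly or through the package namespace.
from vision_agent.agent import OllamaVisionAgentCoder

import vision_agent as va

assert va.agent.OllamaVisionAgentCoder is OllamaVisionAgentCoder
```
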
27 changes: 25 additions & 2 deletions vision_agent/agent/agent_utils.py
@@ -1,9 +1,24 @@
import json
import logging
import re
import sys
from typing import Any, Dict
from typing import Any, Dict, Optional

logging.basicConfig(stream=sys.stdout)
_LOGGER = logging.getLogger(__name__)


def _extract_sub_json(json_str: str) -> Optional[Dict[str, Any]]:
    json_pattern = r"\{.*\}"
    match = re.search(json_pattern, json_str, re.DOTALL)
    if match:
        json_str = match.group()
        try:
            json_dict = json.loads(json_str)
            return json_dict  # type: ignore
        except json.JSONDecodeError:
            return None
    return None


def extract_json(json_str: str) -> Dict[str, Any]:
@@ -18,8 +33,16 @@ def extract_json(json_str: str) -> Dict[str, Any]:
        json_str = json_str[json_str.find("```") + len("```") :]
        # get the last ``` not one from an intermediate string
        json_str = json_str[: json_str.find("}```")]
    try:
        json_dict = json.loads(json_str)
    except json.JSONDecodeError as e:
        json_dict = _extract_sub_json(json_str)
        if json_dict is not None:
            return json_dict  # type: ignore
        error_msg = f"Could not extract JSON from the given str: {json_str}"
        _LOGGER.exception(error_msg)
        raise ValueError(error_msg) from e

    json_dict = json.loads(json_str)
    return json_dict  # type: ignore
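
To make the new fallback concrete, here is a small illustration with hypothetical model
outputs (not part of the diff); only `_extract_sub_json`, whose full body appears
above, is exercised:

```python
# A fenced JSON reply is already handled by the existing stripping logic, so the
# fallback matters for free-form text with an embedded object.
messy = 'Sure, here is the plan: {"best_plan": "plan1"} Let me know!'
broken = "no json here at all"

print(_extract_sub_json(messy))   # {'best_plan': 'plan1'}
print(_extract_sub_json(broken))  # None, so extract_json then raises ValueError
```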


77 changes: 70 additions & 7 deletions vision_agent/agent/vision_agent_coder.py
@@ -28,11 +28,11 @@
    TEST_PLANS,
    USER_REQ,
)
from vision_agent.lmm import LMM, AzureOpenAILMM, Message, OpenAILMM
from vision_agent.lmm import LMM, AzureOpenAILMM, Message, OllamaLMM, OpenAILMM
from vision_agent.utils import CodeInterpreterFactory, Execution
from vision_agent.utils.execute import CodeInterpreter
from vision_agent.utils.image_utils import b64_to_pil
from vision_agent.utils.sim import AzureSim, Sim
from vision_agent.utils.sim import AzureSim, OllamaSim, Sim
from vision_agent.utils.video import play_video

logging.basicConfig(stream=sys.stdout)
@@ -263,7 +263,11 @@ def pick_plan(
            pass
        count += 1

    if best_plan is None:
    if (
        best_plan is None
        or "best_plan" not in best_plan
        or ("best_plan" in best_plan and best_plan["best_plan"] not in plans)
    ):
        best_plan = {"best_plan": list(plans.keys())[0]}

    if verbosity >= 1:
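
As an illustration of the guard added above, run outside the actual method with
hypothetical values:

```python
# Any malformed selection collapses to the first available plan.
plans = {"plan1": "count with owl_v2", "plan2": "count with florence2"}

for best_plan in (None, {"thoughts": "..."}, {"best_plan": "plan9"}):
    if (
        best_plan is None
        or "best_plan" not in best_plan
        or ("best_plan" in best_plan and best_plan["best_plan"] not in plans)
    ):
        best_plan = {"best_plan": list(plans.keys())[0]}
    print(best_plan)  # always {'best_plan': 'plan1'} for these inputs
```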
@@ -585,8 +589,8 @@ class VisionAgentCoder(Agent):

    Example
    -------
        >>> from vision_agent.agent import VisionAgentCoder
        >>> agent = VisionAgentCoder()
        >>> import vision_agent as va
        >>> agent = va.agent.VisionAgentCoder()
        >>> code = agent("What percentage of the area of the jar is filled with coffee beans?", media="jar.jpg")
    """

@@ -820,6 +824,7 @@ def chat_with_workflow(
                verbosity=self.verbosity,
                media=media_list,
            )
            success = cast(bool, results["success"])
            code = cast(str, results["code"])
            test = cast(str, results["test"])
@@ -849,6 +854,64 @@ def log_progress(self, data: Dict[str, Any]) -> None:
            self.report_progress_callback(data)


class OllamaVisionAgentCoder(VisionAgentCoder):
    """VisionAgentCoder that uses Ollama models for planning, coding, testing.

    Pre-requisites:
    1. Run ollama pull llama3.1 for the LLM
    2. Run ollama pull mxbai-embed-large for the embedding similarity model

    Technically you should use a VLM such as llava, but llava is not able to handle the
    context length and crashes.

    Example
    -------
        >>> import vision_agent as va
        >>> agent = va.agent.OllamaVisionAgentCoder()
        >>> code = agent("What percentage of the area of the jar is filled with coffee beans?", media="jar.jpg")
    """

    def __init__(
        self,
        planner: Optional[LMM] = None,
        coder: Optional[LMM] = None,
        tester: Optional[LMM] = None,
        debugger: Optional[LMM] = None,
        tool_recommender: Optional[Sim] = None,
        verbosity: int = 0,
        report_progress_callback: Optional[Callable[[Dict[str, Any]], None]] = None,
    ) -> None:
        super().__init__(
            planner=(
                OllamaLMM(model_name="llama3.1", temperature=0.0, json_mode=True)
                if planner is None
                else planner
            ),
            coder=(
                OllamaLMM(model_name="llama3.1", temperature=0.0)
                if coder is None
                else coder
            ),
            tester=(
                OllamaLMM(model_name="llama3.1", temperature=0.0)
                if tester is None
                else tester
            ),
            debugger=(
                OllamaLMM(model_name="llama3.1", temperature=0.0, json_mode=True)
                if debugger is None
                else debugger
            ),
            tool_recommender=(
                OllamaSim(T.TOOLS_DF, sim_key="desc")
                if tool_recommender is None
                else tool_recommender
            ),
            verbosity=verbosity,
            report_progress_callback=report_progress_callback,
        )


class AzureVisionAgentCoder(VisionAgentCoder):
"""VisionAgentCoder that uses Azure OpenAI APIs for planning, coding, testing.

Expand All @@ -858,8 +921,8 @@ class AzureVisionAgentCoder(VisionAgentCoder):

Example
-------
>>> from vision_agent import AzureVisionAgentCoder
>>> agent = AzureVisionAgentCoder()
>>> import vision_agent as va
>>> agent = va.agent.AzureVisionAgentCoder()
>>> code = agent("What percentage of the area of the jar is filled with coffee beans?", media="jar.jpg")
"""
