Better Ollama support #208

Merged (20 commits, Aug 27, 2024)
Changes from 17 commits
63 changes: 48 additions & 15 deletions README.md
@@ -168,20 +168,18 @@ result = agent.chat_with_workflow(conv)

### Tools
There are a variety of tools for the model or the user to use. Some are executed locally
while others are hosted for you. You can also ask an LMM directly to build a tool for
you. For example:
while others are hosted for you. You can easily access them yourself; for example, if
you want to run `owl_v2` and visualize the output, you can run:

```python
>>> import vision_agent as va
>>> lmm = va.lmm.OpenAILMM()
>>> detector = lmm.generate_detector("Can you build a jar detector for me?")
>>> detector(va.tools.load_image("jar.jpg"))
[{"labels": ["jar",],
"scores": [0.99],
"bboxes": [
[0.58, 0.2, 0.72, 0.45],
]
}]
import vision_agent.tools as T
import matplotlib.pyplot as plt

image = T.load_image("dogs.jpg")
dets = T.owl_v2("dogs", image)
viz = T.overlay_bounding_boxes(image, dets)
plt.imshow(viz)
plt.show()
```

You can also add custom tools to the agent:
@@ -214,6 +212,41 @@ function. Make sure the documentation is in the same format above with description,
`Parameters:`, `Returns:`, and `Example\n-------`. You can find an example use case
[here](examples/custom_tools/) as this is what the agent uses to pick and use the tool.
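
For reference, a tool docstring following that format might look like the sketch
below. `blur_image` is a hypothetical example written for illustration only; it is not
part of the library's tool set.

```python
import numpy as np


def blur_image(image: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """'blur_image' applies a simple box blur to an RGB image.

    Parameters:
        image (np.ndarray): The input image as an RGB numpy array.
        kernel_size (int): Side length of the averaging window. Defaults to 5.

    Returns:
        np.ndarray: The blurred image, same shape and dtype as the input.

    Example
    -------
        >>> blurred = blur_image(image, kernel_size=3)
    """
    pad = kernel_size // 2
    # Pad the borders so the output keeps the input's height and width.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(image, dtype=np.float32)
    # Average over a kernel_size x kernel_size neighborhood.
    for dy in range(kernel_size):
        for dx in range(kernel_size):
            out += padded[dy : dy + image.shape[0], dx : dx + image.shape[1]]
    return (out / kernel_size**2).astype(image.dtype)
```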

## Additional LLMs
### Ollama
We also provide a `VisionAgentCoder` that uses Ollama. To get started you must download
a few models:

```bash
ollama pull llama3.1
ollama pull mxbai-embed-large
```

`llama3.1` is used by the `OllamaLMM` in `OllamaVisionAgentCoder`. Normally we would
use an actual LMM such as `llava`, but `llava` cannot handle the long context lengths
required by the agent. Since `llama3.1` cannot handle images, you may see some
performance degradation. `mxbai-embed-large` is the embedding model used to look up
tools. You can use it just like you would use `VisionAgentCoder`:

```python
>>> import vision_agent as va
>>> agent = va.agent.OllamaVisionAgentCoder()
>>> agent("Count the apples in the image", media="apples.jpg")
```
> WARNING: VisionAgent doesn't work well unless the underlying LMM is sufficiently powerful. Do not expect good results or even working code with smaller models like Llama 3.1 8B.
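
If you want to point individual roles at different Ollama models, the constructor
accepts per-role overrides. A minimal sketch, assuming you have pulled the alternative
model tag used here (`llama3.1:70b` is just an example; substitute any tag you have):

```python
import vision_agent as va
from vision_agent.lmm import OllamaLMM

# Override only the coder; the planner, tester, debugger and the embedding-based
# tool recommender keep their llama3.1 / mxbai-embed-large defaults.
coder = OllamaLMM(model_name="llama3.1:70b", temperature=0.0)
agent = va.agent.OllamaVisionAgentCoder(coder=coder)
code = agent("Count the apples in the image", media="apples.jpg")
```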

### Azure OpenAI
We also provide an `AzureVisionAgentCoder` that uses Azure OpenAI models. To get
started, follow the Azure Setup section below. You can use it just like you would use
`VisionAgentCoder`:

```python
>>> import vision_agent as va
>>> agent = va.agent.AzureVisionAgentCoder()
>>> agent("Count the apples in the image", media="apples.jpg")
```


### Azure Setup
If you want to use Azure OpenAI models, you need to have two OpenAI model deployments:

@@ -252,6 +285,6 @@ agent = va.agent.AzureVisionAgentCoder()
2. Follow the instructions to purchase and manage your API credits.
3. Ensure your API key is correctly configured in your project settings.

Failure to have sufficient API credits may result in limited or no functionality for the features that rely on the OpenAI API.

For more details on managing your API usage and credits, please refer to the OpenAI API documentation.
Failure to have sufficient API credits may result in limited or no functionality for
the features that rely on the OpenAI API. For more details on managing your API usage
and credits, please refer to the OpenAI API documentation.
67 changes: 52 additions & 15 deletions docs/index.md
@@ -1,4 +1,9 @@
# 🔍🤖 Vision Agent
[![](https://dcbadge.vercel.app/api/server/wPdN8RCYew?compact=true&style=flat)](https://discord.gg/wPdN8RCYew)
![ci_status](https://github.com/landing-ai/vision-agent/actions/workflows/ci_cd.yml/badge.svg)
[![PyPI version](https://badge.fury.io/py/vision-agent.svg)](https://badge.fury.io/py/vision-agent)
![version](https://img.shields.io/pypi/pyversions/vision-agent)

Vision Agent is a library that helps you utilize agent frameworks to generate code to
solve your vision task. Many current vision problems can easily take hours or days to
@@ -160,20 +165,18 @@ result = agent.chat_with_workflow(conv)

### Tools
There are a variety of tools for the model or the user to use. Some are executed locally
while others are hosted for you. You can also ask an LMM directly to build a tool for
you. For example:
while others are hosted for you. You can easily access them yourself; for example, if
you want to run `owl_v2` and visualize the output, you can run:

```python
>>> import vision_agent as va
>>> lmm = va.lmm.OpenAILMM()
>>> detector = lmm.generate_detector("Can you build a jar detector for me?")
>>> detector(va.tools.load_image("jar.jpg"))
[{"labels": ["jar",],
"scores": [0.99],
"bboxes": [
[0.58, 0.2, 0.72, 0.45],
]
}]
import vision_agent.tools as T
import matplotlib.pyplot as plt

image = T.load_image("dogs.jpg")
dets = T.owl_v2("dogs", image)
viz = T.overlay_bounding_boxes(image, dets)
plt.imshow(viz)
plt.show()
```

You can also add custom tools to the agent:
@@ -206,6 +209,40 @@ function. Make sure the documentation is in the same format above with description,
`Parameters:`, `Returns:`, and `Example\n-------`. You can find an example use case
[here](examples/custom_tools/) as this is what the agent uses to pick and use the tool.

## Additional LLMs
### Ollama
We also provide a `VisionAgentCoder` that uses Ollama. To get started you must download
a few models:

```bash
ollama pull llama3.1
ollama pull mxbai-embed-large
```

`llama3.1` is used by the `OllamaLMM` in `OllamaVisionAgentCoder`. Normally we would
use an actual LMM such as `llava`, but `llava` cannot handle the long context lengths
required by the agent. Since `llama3.1` cannot handle images, you may see some
performance degradation. `mxbai-embed-large` is the embedding model used to look up
tools. You can use it just like you would use `VisionAgentCoder`:

```python
>>> import vision_agent as va
>>> agent = va.agent.OllamaVisionAgentCoder()
>>> agent("Count the apples in the image", media="apples.jpg")
```

### Azure OpenAI
We also provide an `AzureVisionAgentCoder` that uses Azure OpenAI models. To get
started, follow the Azure Setup section below. You can use it just like you would use
`VisionAgentCoder`:

```python
>>> import vision_agent as va
>>> agent = va.agent.AzureVisionAgentCoder()
>>> agent("Count the apples in the image", media="apples.jpg")
```
> WARNING: VisionAgent doesn't work well unless the underlying LMM is sufficiently powerful. Do not expect good results or even working code with smaller models like Llama 3.1 8B.

### Azure Setup
If you want to use Azure OpenAI models, you need to have two OpenAI model deployments:

@@ -244,6 +281,6 @@ agent = va.agent.AzureVisionAgentCoder()
2. Follow the instructions to purchase and manage your API credits.
3. Ensure your API key is correctly configured in your project settings.

Failure to have sufficient API credits may result in limited or no functionality for the features that rely on the OpenAI API.

For more details on managing your API usage and credits, please refer to the OpenAI API documentation.
Failure to have sufficient API credits may result in limited or no functionality for
the features that rely on the OpenAI API. For more details on managing your API usage
and credits, please refer to the OpenAI API documentation.
20 changes: 0 additions & 20 deletions docs/lmms.md

This file was deleted.

6 changes: 5 additions & 1 deletion vision_agent/agent/__init__.py
@@ -1,3 +1,7 @@
from .agent import Agent
from .vision_agent import VisionAgent
from .vision_agent_coder import AzureVisionAgentCoder, VisionAgentCoder
from .vision_agent_coder import (
    AzureVisionAgentCoder,
    OllamaVisionAgentCoder,
    VisionAgentCoder,
)
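
With this export in place, a short sketch of how the new class can be imported; nothing
beyond the names added above is assumed:

```python
# The new coder is importable directly or through the package namespace.
from vision_agent.agent import OllamaVisionAgentCoder

import vision_agent as va

assert va.agent.OllamaVisionAgentCoder is OllamaVisionAgentCoder
```
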
27 changes: 25 additions & 2 deletions vision_agent/agent/agent_utils.py
@@ -1,9 +1,24 @@
import json
import logging
import re
import sys
from typing import Any, Dict
from typing import Any, Dict, Optional

logging.basicConfig(stream=sys.stdout)
_LOGGER = logging.getLogger(__name__)


def _extract_sub_json(json_str: str) -> Optional[Dict[str, Any]]:
    json_pattern = r"\{.*\}"
    match = re.search(json_pattern, json_str, re.DOTALL)
    if match:
        json_str = match.group()
        try:
            json_dict = json.loads(json_str)
            return json_dict  # type: ignore
        except json.JSONDecodeError:
            return None
    return None


def extract_json(json_str: str) -> Dict[str, Any]:
@@ -18,8 +33,16 @@ def extract_json(json_str: str) -> Dict[str, Any]:
        json_str = json_str[json_str.find("```") + len("```") :]
        # get the last ``` not one from an intermediate string
        json_str = json_str[: json_str.find("}```")]
    try:
        json_dict = json.loads(json_str)
    except json.JSONDecodeError as e:
        json_dict = _extract_sub_json(json_str)
        if json_dict is not None:
            return json_dict  # type: ignore
        error_msg = f"Could not extract JSON from the given str: {json_str}"
        _LOGGER.exception(error_msg)
        raise ValueError(error_msg) from e

    json_dict = json.loads(json_str)
    return json_dict  # type: ignore
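
To make the new fallback concrete, here is a small illustration with hypothetical model
outputs (not part of the diff); only `_extract_sub_json`, whose full body appears
above, is exercised:

```python
# A fenced JSON reply is already handled by the existing stripping logic, so the
# fallback matters for free-form text with an embedded object.
messy = 'Sure, here is the plan: {"best_plan": "plan1"} Let me know!'
broken = "no json here at all"

print(_extract_sub_json(messy))   # {'best_plan': 'plan1'}
print(_extract_sub_json(broken))  # None, so extract_json then raises ValueError
```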


77 changes: 70 additions & 7 deletions vision_agent/agent/vision_agent_coder.py
@@ -28,11 +28,11 @@
    TEST_PLANS,
    USER_REQ,
)
from vision_agent.lmm import LMM, AzureOpenAILMM, Message, OpenAILMM
from vision_agent.lmm import LMM, AzureOpenAILMM, Message, OllamaLMM, OpenAILMM
from vision_agent.utils import CodeInterpreterFactory, Execution
from vision_agent.utils.execute import CodeInterpreter
from vision_agent.utils.image_utils import b64_to_pil
from vision_agent.utils.sim import AzureSim, Sim
from vision_agent.utils.sim import AzureSim, OllamaSim, Sim
from vision_agent.utils.video import play_video

logging.basicConfig(stream=sys.stdout)
@@ -263,7 +263,11 @@ def pick_plan(
            pass
        count += 1

    if best_plan is None:
    if (
        best_plan is None
        or "best_plan" not in best_plan
        or ("best_plan" in best_plan and best_plan["best_plan"] not in plans)
    ):
        best_plan = {"best_plan": list(plans.keys())[0]}

    if verbosity >= 1:
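
As an illustration of the guard added above, run outside the actual method with
hypothetical values:

```python
# Any malformed selection collapses to the first available plan.
plans = {"plan1": "count with owl_v2", "plan2": "count with florence2"}

for best_plan in (None, {"thoughts": "..."}, {"best_plan": "plan9"}):
    if (
        best_plan is None
        or "best_plan" not in best_plan
        or ("best_plan" in best_plan and best_plan["best_plan"] not in plans)
    ):
        best_plan = {"best_plan": list(plans.keys())[0]}
    print(best_plan)  # always {'best_plan': 'plan1'} for these inputs
```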
@@ -585,8 +589,8 @@ class VisionAgentCoder(Agent):

    Example
    -------
        >>> from vision_agent.agent import VisionAgentCoder
        >>> agent = VisionAgentCoder()
        >>> import vision_agent as va
        >>> agent = va.agent.VisionAgentCoder()
        >>> code = agent("What percentage of the area of the jar is filled with coffee beans?", media="jar.jpg")
    """

@@ -820,6 +824,7 @@ def chat_with_workflow(
                verbosity=self.verbosity,
                media=media_list,
            )
            success = cast(bool, results["success"])
            code = cast(str, results["code"])
            test = cast(str, results["test"])
@@ -849,6 +854,64 @@ def log_progress(self, data: Dict[str, Any]) -> None:
            self.report_progress_callback(data)


class OllamaVisionAgentCoder(VisionAgentCoder):
    """VisionAgentCoder that uses Ollama models for planning, coding, testing.

    Pre-requisites:
    1. Run ollama pull llama3.1 for the LLM
    2. Run ollama pull mxbai-embed-large for the embedding similarity model

    Technically you should use a VLM such as llava, but llava is not able to handle the
    context length and crashes.

    Example
    -------
        >>> import vision_agent as va
        >>> agent = va.agent.OllamaVisionAgentCoder()
        >>> code = agent("What percentage of the area of the jar is filled with coffee beans?", media="jar.jpg")
    """

    def __init__(
        self,
        planner: Optional[LMM] = None,
        coder: Optional[LMM] = None,
        tester: Optional[LMM] = None,
        debugger: Optional[LMM] = None,
        tool_recommender: Optional[Sim] = None,
        verbosity: int = 0,
        report_progress_callback: Optional[Callable[[Dict[str, Any]], None]] = None,
    ) -> None:
        super().__init__(
            planner=(
                OllamaLMM(model_name="llama3.1", temperature=0.0, json_mode=True)
                if planner is None
                else planner
            ),
            coder=(
                OllamaLMM(model_name="llama3.1", temperature=0.0)
                if coder is None
                else coder
            ),
            tester=(
                OllamaLMM(model_name="llama3.1", temperature=0.0)
                if tester is None
                else tester
            ),
            debugger=(
                OllamaLMM(model_name="llama3.1", temperature=0.0, json_mode=True)
                if debugger is None
                else debugger
            ),
            tool_recommender=(
                OllamaSim(T.TOOLS_DF, sim_key="desc")
                if tool_recommender is None
                else tool_recommender
            ),
            verbosity=verbosity,
            report_progress_callback=report_progress_callback,
        )


class AzureVisionAgentCoder(VisionAgentCoder):
"""VisionAgentCoder that uses Azure OpenAI APIs for planning, coding, testing.

Expand All @@ -858,8 +921,8 @@ class AzureVisionAgentCoder(VisionAgentCoder):

Example
-------
>>> from vision_agent import AzureVisionAgentCoder
>>> agent = AzureVisionAgentCoder()
>>> import vision_agent as va
>>> agent = va.agent.AzureVisionAgentCoder()
>>> code = agent("What percentage of the area of the jar is filled with coffee beans?", media="jar.jpg")
"""
