Update Docs (#98)
* renamed v2 to data interpreter

* renamed vision agent to easytool v2

* renamed vision agent v3 to vision agent

* renamed image to media

* moved tools_v2 to tools

* return type of __call__ is str

* updated docs

* remove unused import

* black

* added documentation for chat with workflow

* formatting fix
dillonalaird authored May 29, 2024
1 parent c7f0e7b commit 079de09
Showing 25 changed files with 3,615 additions and 3,426 deletions.
159 changes: 54 additions & 105 deletions README.md
@@ -9,13 +9,12 @@
![version](https://img.shields.io/pypi/pyversions/vision-agent)
</div>

Vision Agent is a library that helps you utilize agent frameworks for your vision tasks.
Many current vision problems can easily take hours or days to solve: you need to find the
right model, figure out how to use it, possibly write programming logic around it to
accomplish the task you want, or, even more expensive, train your own model. Vision Agent
aims to provide an in-seconds experience by allowing users to describe their problem in
text and letting agent frameworks solve the task for them. Check out our Discord
for updates and roadmaps!
Vision Agent is a library that helps you utilize agent frameworks to generate code to
solve your vision task. Many current vision problems can easily take hours or days to
solve: you need to find the right model, figure out how to use it, and program it to
accomplish the task you want. Vision Agent aims to provide an in-seconds experience by
allowing users to describe their problem in text and have the agent framework generate
code to solve the task for them. Check out our Discord for updates and roadmaps!

## Documentation

@@ -37,70 +36,71 @@ using Azure OpenAI please see the Azure setup section):
export OPENAI_API_KEY="your-api-key"
```
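
If you prefer to configure the key from Python, you can also set it on `os.environ`; this is just an illustrative equivalent of the shell export above:

```python
import os

# Equivalent to the shell export; set this before constructing the agent so
# the underlying OpenAI client can pick it up.
os.environ["OPENAI_API_KEY"] = "your-api-key"
```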

### Vision Agents
You can interact with the agents as you would with any LLM or LMM model:
### Vision Agent
You can interact with the agent as you would with any LLM or LMM model:

```python
>>> from vision_agent.agent import VisionAgent
>>> agent = VisionAgent()
>>> agent("What percentage of the area of this jar is filled with coffee beans?", image="jar.jpg")
"The percentage of area of the jar filled with coffee beans is 25%."
>>> code = agent("What percentage of the area of the jar is filled with coffee beans?", media="jar.jpg")
```

To better understand how the model came up with its answer, you can also run it in
debug mode by passing in the verbose argument:

Which produces the following code:
```python
>>> agent = VisionAgent(verbose=True)
from vision_agent.tools import load_image, grounding_sam

def calculate_filled_percentage(image_path: str) -> float:
    # Step 1: Load the image
    image = load_image(image_path)

    # Step 2: Segment the jar
    jar_segments = grounding_sam(prompt="jar", image=image)

    # Step 3: Segment the coffee beans
    coffee_beans_segments = grounding_sam(prompt="coffee beans", image=image)

    # Step 4: Calculate the area of the segmented jar
    jar_area = 0
    for segment in jar_segments:
        jar_area += segment['mask'].sum()

    # Step 5: Calculate the area of the segmented coffee beans
    coffee_beans_area = 0
    for segment in coffee_beans_segments:
        coffee_beans_area += segment['mask'].sum()

    # Step 6: Compute the percentage of the jar area that is filled with coffee beans
    if jar_area == 0:
        return 0.0  # To avoid division by zero
    filled_percentage = (coffee_beans_area / jar_area) * 100

    # Step 7: Return the computed percentage
    return filled_percentage
```
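
Since `code` is returned as a plain string, one way to try the generated program out is to execute it and call the function it defines. This is only a rough sketch; the function name `calculate_filled_percentage` comes from the example above and will differ for other prompts:

```python
# Illustrative only: run the generated source and call the produced function.
namespace = {}
exec(code, namespace)
print(namespace["calculate_filled_percentage"]("jar.jpg"))
```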

You can also have it return the workflow it used to complete the task along with all
the individual steps and tools to get the answer:
To better understand how the model came up with its answer, you can run it in debug
mode by passing in the verbose argument:

```python
>>> resp, workflow = agent.chat_with_workflow([{"role": "user", "content": "What percentage of the area of this jar is filled with coffee beans?"}], image="jar.jpg")
>>> print(workflow)
[{"task": "Segment the jar using 'grounding_sam_'.",
"tool": "grounding_sam_",
"parameters": {"prompt": "jar", "image": "jar.jpg"},
"call_results": [[
{
"labels": ["jar"],
"scores": [0.99],
"bboxes": [
[0.58, 0.2, 0.72, 0.45],
],
"masks": "mask.png"
}
]],
"answer": "The jar is located at [0.58, 0.2, 0.72, 0.45].",
},
{"visualize_output": "final_output.png"}]
>>> agent = VisionAgent(verbose=2)
```

You can also provide reference data for the model to utilize. For example, if you want
to utilize VisualPromptCounting:
You can also have it return more information by calling `chat_with_workflow`:

```python
agent(
    "How many apples are in this image?",
    image="apples.jpg",
    reference_data={"bbox": [0.1, 0.11, 0.24, 0.25]},
)
>>> results = agent.chat_with_workflow([{"role": "user", "content": "What percentage of the area of the jar is filled with coffee beans?"}], media="jar.jpg")
>>> print(results)
{
    "code": "from vision_agent.tools import ...",
    "test": "calculate_filled_percentage('jar.jpg')",
    "test_result": "...",
    "plan": [{"code": "...", "test": "...", "plan": "..."}, ...],
    "working_memory": ...,
}
```
Where `[0.1, 0.11, 0.24, 0.25]` are the normalized bounding box coordinates of an apple.
Similarly, for DINOv you can provide a reference image and mask:

```python
agent(
    "Can you detect all of the objects similar to the mask I've provided?",
    image="image.jpg",
    reference_data={"mask": "reference_mask.png", "image": "reference_image.png"},
)
```
Here, `reference_mask.png` and `reference_image.png` in `reference_data` could be any
image and its corresponding mask of the object you want to detect in `image.jpg`.
You can find a demo app to generate masks for DINOv [here](examples/mask_app/).
With this you can examine more detailed information such as the testing code, test
results, plan, and working memory it used to complete the task.
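
For example, a quick way to inspect the pieces of the returned dictionary (using the keys shown above):

```python
# Illustrative: poke at the individual fields returned by chat_with_workflow.
print(results["code"])         # the generated program
print(results["test"])         # the test the agent wrote for it
print(results["test_result"])  # the output of running that test

for step in results["plan"]:   # each planning step carries its own code, test and plan
    print(step["plan"])
```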

### Tools
There are a variety of tools for the model or the user to use. Some are executed locally
@@ -120,57 +120,6 @@ you. For example:
}]
```
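
The tools can also be imported and called directly as functions from `vision_agent.tools`. A minimal sketch, reusing `load_image` as in the generated code above; `grounding_dino` is assumed to accept the same `prompt`/`image` keywords as `grounding_sam`, and the print is purely illustrative:

```python
from vision_agent.tools import load_image, grounding_dino

image = load_image("jar.jpg")                           # load the image as an array
detections = grounding_dino(prompt="jar", image=image)  # detect jars in the image
print(detections)                                       # list of detections
```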

#### Custom Tools
You can also add your own custom tools for your vision agent to use:

```python
from vision_agent.tools import Tool, register_tool


@register_tool
class NumItems(Tool):
    name = "num_items_"
    description = "Returns the number of items in a list."
    usage = {
        "required_parameters": [{"name": "prompt", "type": "list"}],
        "examples": [
            {
                "scenario": "How many items are in this list? ['a', 'b', 'c']",
                "parameters": {"prompt": "['a', 'b', 'c']"},
            }
        ],
    }

    def __call__(self, prompt: list[str]) -> int:
        return len(prompt)
```
This will register it with the list of tools Vision Agent has access to. It will be able
to pick the tool based on its description and use it following the usage examples provided.
You can find an example that creates a custom tool for template matching [here](examples/custom_tools/).

#### Tool List
| Tool | Description |
| --- | --- |
| CLIP | CLIP is a tool that can classify or tag any image given a set of input classes or tags. |
| ImageCaption | ImageCaption is a tool that can generate a caption for an image. |
| GroundingDINO | GroundingDINO is a tool that can detect arbitrary objects with inputs such as category names or referring expressions. |
| GroundingSAM | GroundingSAM is a tool that can detect and segment arbitrary objects with inputs such as category names or referring expressions. |
| DINOv | DINOv is a tool that can detect arbitrary objects using a referring mask. |
| Crop | Crop crops an image given a bounding box and returns a file name of the cropped image. |
| BboxArea | BboxArea returns the area of the bounding box in pixels normalized to 2 decimal places. |
| SegArea | SegArea returns the area of the segmentation mask in pixels normalized to 2 decimal places. |
| BboxIoU | BboxIoU returns the intersection over union of two bounding boxes normalized to 2 decimal places. |
| SegIoU | SegIoU returns the intersection over union of two segmentation masks normalized to 2 decimal places. |
| BoxDistance | BoxDistance returns the minimum distance between two bounding boxes normalized to 2 decimal places. |
| MaskDistance | MaskDistance returns the minimum distance between two segmentation masks in pixel units. |
| BboxContains | BboxContains returns the intersection of two boxes over the target box area. It is good for checking if one box is contained within another box. |
| ExtractFrames | ExtractFrames extracts frames with motion from a video. |
| ZeroShotCounting | ZeroShotCounting returns the total number of objects belonging to a single class in a given image. |
| VisualPromptCounting | VisualPromptCounting returns the total number of objects belonging to a single class given an image and visual prompt. |
| VisualQuestionAnswering | VisualQuestionAnswering is a tool that can explain the contents of an image and answer questions about the image. |
| ImageQuestionAnswering | ImageQuestionAnswering is similar to VisualQuestionAnswering but does not rely on OpenAI and instead uses a dedicated model for the task. |
| OCR | OCR returns the text detected in an image along with the location. |


It also has a basic set of calculate tools such as add, subtract, multiply and divide.

### Azure Setup
If you want to use Azure OpenAI models, you can set the environment variable:

183 changes: 183 additions & 0 deletions docs/easy_tool_v2.md
@@ -0,0 +1,183 @@
# 🔍🤖 Easy Tool V2

Easy Tool V2 is a library that helps you utilize agent frameworks for your vision tasks.
Many current vision problems can easily take hours or days to solve: you need to find the
right model, figure out how to use it, possibly write programming logic around it to
accomplish the task you want, or, even more expensive, train your own model. Easy Tool V2
aims to provide an in-seconds experience by allowing users to describe their problem in
text and letting agent frameworks solve the task for them. Check out our Discord
for updates and roadmaps!

## Documentation

- [Easy Tool V2 Library Docs](https://landing-ai.github.io/vision-agent/)


## Getting Started
### Installation
To get started, you can install the library using pip:

```bash
pip install vision-agent
```

Ensure you have an OpenAI API key and set it as an environment variable (if you are
using Azure OpenAI please see the Azure setup section):

```bash
export OPENAI_API_KEY="your-api-key"
```

### Easy Tool V2
You can interact with the agents as you would with any LLM or LMM model:

```python
>>> from vision_agent.agent import EasyToolV2
>>> agent = EasyToolV2()
>>> agent("What percentage of the area of this jar is filled with coffee beans?", image="jar.jpg")
"The percentage of area of the jar filled with coffee beans is 25%."
```

To better understand how the model came up with its answer, you can also run it in
debug mode by passing in the verbose argument:

```python
>>> agent = EasyToolV2(verbose=True)
```

You can also have it return the workflow it used to complete the task along with all
the individual steps and tools to get the answer:

```python
>>> resp, workflow = agent.chat_with_workflow([{"role": "user", "content": "What percentage of the area of this jar is filled with coffee beans?"}], image="jar.jpg")
>>> print(workflow)
[{"task": "Segment the jar using 'grounding_sam_'.",
"tool": "grounding_sam_",
"parameters": {"prompt": "jar", "image": "jar.jpg"},
"call_results": [[
{
"labels": ["jar"],
"scores": [0.99],
"bboxes": [
[0.58, 0.2, 0.72, 0.45],
],
"masks": "mask.png"
}
]],
"answer": "The jar is located at [0.58, 0.2, 0.72, 0.45].",
},
{"visualize_output": "final_output.png"}]
```
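
The last entry of the workflow points at a rendered visualization (`final_output.png`), so you can open it with any image library; for example, a small sketch using Pillow (not part of Easy Tool V2 itself):

```python
from PIL import Image

# Open the annotated output image produced by the workflow.
Image.open("final_output.png").show()
```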

You can also provide reference data for the model to utilize. For example, if you want
to utilize VisualPromptCounting:

```python
agent(
    "How many apples are in this image?",
    image="apples.jpg",
    reference_data={"bbox": [0.1, 0.11, 0.24, 0.25]},
)
```
Where `[0.1, 0.11, 0.24, 0.25]` are the normalized bounding box coordinates of an apple.
Similarly, for DINOv you can provide a reference image and mask:

```python
agent(
    "Can you detect all of the objects similar to the mask I've provided?",
    image="image.jpg",
    reference_data={"mask": "reference_mask.png", "image": "reference_image.png"},
)
```
Here, `reference_mask.png` and `reference_image.png` in `reference_data` could be any
image and its corresponding mask of the object you want to detect in `image.jpg`.
You can find a demo app to generate masks for DINOv [here](examples/mask_app/).
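
The bounding box passed through `reference_data` is expected in normalized coordinates, as in the apple example above. If you start from pixel coordinates, a small helper like the following (purely illustrative, assuming an `[x1, y1, x2, y2]` ordering) converts them:

```python
from PIL import Image

def normalize_bbox(bbox_pixels, image_path):
    """Convert an [x1, y1, x2, y2] pixel box into normalized 0-1 coordinates."""
    width, height = Image.open(image_path).size
    x1, y1, x2, y2 = bbox_pixels
    return [x1 / width, y1 / height, x2 / width, y2 / height]

# For a 1000x1000 "apples.jpg", this reproduces the box used above:
# normalize_bbox([100, 110, 240, 250], "apples.jpg") -> [0.1, 0.11, 0.24, 0.25]
```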

### Tools
There are a variety of tools for the model or the user to use. Some are executed locally
while others are hosted for you. You can also ask an LLM directly to build a tool for
you. For example:

```python
>>> import vision_agent as va
>>> llm = va.llm.OpenAILLM()
>>> detector = llm.generate_detector("Can you build a jar detector for me?")
>>> detector("jar.jpg")
[{"labels": ["jar",],
"scores": [0.99],
"bboxes": [
[0.58, 0.2, 0.72, 0.45],
]
}]
```

#### Custom Tools
You can also add your own custom tools for your vision agent to use:

```python
from vision_agent.tools import Tool, register_tool


@register_tool
class NumItems(Tool):
    name = "num_items_"
    description = "Returns the number of items in a list."
    usage = {
        "required_parameters": [{"name": "prompt", "type": "list"}],
        "examples": [
            {
                "scenario": "How many items are in this list? ['a', 'b', 'c']",
                "parameters": {"prompt": "['a', 'b', 'c']"},
            }
        ],
    }

    def __call__(self, prompt: list[str]) -> int:
        return len(prompt)
```
This will register it with the list of tools Easy Tool V2 has access to. It will be able
to pick the tool based on its description and use it following the usage examples provided.
You can find an example that creates a custom tool for template matching [here](examples/custom_tools/).
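
Because the custom tool is an ordinary Python class, you can also sanity-check it directly before handing it to the agent (assuming the `Tool` base class needs no constructor arguments):

```python
tool = NumItems()             # assumes Tool() takes no required arguments
print(tool(["a", "b", "c"]))  # prints 3
```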

#### Tool List
| Tool | Description |
| --- | --- |
| CLIP | CLIP is a tool that can classify or tag any image given a set of input classes or tags. |
| ImageCaption | ImageCaption is a tool that can generate a caption for an image. |
| GroundingDINO | GroundingDINO is a tool that can detect arbitrary objects with inputs such as category names or referring expressions. |
| GroundingSAM | GroundingSAM is a tool that can detect and segment arbitrary objects with inputs such as category names or referring expressions. |
| DINOv | DINOv is a tool that can detect arbitrary objects using a referring mask. |
| Crop | Crop crops an image given a bounding box and returns a file name of the cropped image. |
| BboxArea | BboxArea returns the area of the bounding box in pixels normalized to 2 decimal places. |
| SegArea | SegArea returns the area of the segmentation mask in pixels normalized to 2 decimal places. |
| BboxIoU | BboxIoU returns the intersection over union of two bounding boxes normalized to 2 decimal places. |
| SegIoU | SegIoU returns the intersection over union of two segmentation masks normalized to 2 decimal places. |
| BoxDistance | BoxDistance returns the minimum distance between two bounding boxes normalized to 2 decimal places. |
| MaskDistance | MaskDistance returns the minimum distance between two segmentation masks in pixel units. |
| BboxContains | BboxContains returns the intersection of two boxes over the target box area. It is good for checking if one box is contained within another box. |
| ExtractFrames | ExtractFrames extracts frames with motion from a video. |
| ZeroShotCounting | ZeroShotCounting returns the total number of objects belonging to a single class in a given image. |
| VisualPromptCounting | VisualPromptCounting returns the total number of objects belonging to a single class given an image and visual prompt. |
| VisualQuestionAnswering | VisualQuestionAnswering is a tool that can explain the contents of an image and answer questions about the image. |
| ImageQuestionAnswering | ImageQuestionAnswering is similar to VisualQuestionAnswering but does not rely on OpenAI and instead uses a dedicated model for the task. |
| OCR | OCR returns the text detected in an image along with the location. |


It also has a basic set of calculate tools such as add, subtract, multiply and divide.
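
Each of the tools above can also be invoked on its own. A minimal sketch, assuming the class-based tools are importable from `vision_agent.tools.easytool_tools` and accept the same `prompt`/`image` parameters used in the usage examples:

```python
from vision_agent.tools.easytool_tools import GroundingDINO

detector = GroundingDINO()
# Assumed call signature; the output is expected to look like the detector
# results shown earlier (labels, scores and normalized bounding boxes).
print(detector(prompt="jar", image="jar.jpg"))
```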

### Azure Setup
If you want to use Azure OpenAI models, you can set the environment variable:

```bash
export AZURE_OPENAI_API_KEY="your-api-key"
export AZURE_OPENAI_ENDPOINT="your-endpoint"
```

You can then run Easy Tool V2 using the Azure OpenAI models:

```python
>>> import vision_agent as va
>>> agent = va.agent.EasyToolV2(
...     task_model=va.llm.AzureOpenAILLM(),
...     answer_model=va.lmm.AzureOpenAILMM(),
...     reflection_model=va.lmm.AzureOpenAILMM(),
... )
```

2 changes: 1 addition & 1 deletion tests/fixtures.py
@@ -2,7 +2,7 @@

import pytest

from vision_agent.tools import CLIP, GroundingDINO, GroundingSAM
from vision_agent.tools.easytool_tools import CLIP, GroundingDINO, GroundingSAM


@pytest.fixture
2 changes: 1 addition & 1 deletion tests/test_tools.py
@@ -1,6 +1,6 @@
import skimage as ski

from vision_agent.tools.tools_v2 import (
from vision_agent.tools import (
clip,
grounding_dino,
grounding_sam,