Update Conversation v2 (#342)
* fix notebook example

* remove old tools

* fix edge case for checking sim

* formatting fix

* fixed docs for video tracking tools

* updated embs for fixed docs

* take in user modified code

* update planner tool prompt

* change default fps to 5

* fix type error

* fix doc in example

* do not change default fps

* change default fps to 5

* don't crash if parser fails

* clean up docs

* flake8

* fix check load

* update vision agent conversation

* fix o1 for lmm class

* fixed names

* updated docs

* add image size args for lmm

* fixed resizing

* added configs

* add config for agents

* update readme

* fix tool docs

* fix tool docs

* fix bug with strip calls

* update configs

* run multi judge

* remove write code, only test now

* fix prompts

* update index

* mypy fixes

* add config module

* fix type error
dillonalaird authored Jan 17, 2025
1 parent b20a254 commit f83c4c7

Showing 24 changed files with 1,482 additions and 1,142 deletions.
512 changes: 53 additions & 459 deletions README.md

Large diffs are not rendered by default.

512 changes: 53 additions & 459 deletions docs/index.md

Large diffs are not rendered by default.

469 changes: 469 additions & 0 deletions docs/index_old.md

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -72,9 +72,10 @@ markdown_extensions:
   - pymdownx.details
 
 nav:
-  - Quick start: index.md
+  - VisionAgent: index.md
   - APIs:
       - vision_agent.agent: api/agent.md
       - vision_agent.tools: api/tools.md
       - vision_agent.lmm: api/lmm.md
       - vision_agent.utils: api/utils.md
+  - VisionAgent (Old): index_old.md
18 changes: 10 additions & 8 deletions vision_agent/.sim_tools/df.csv
@@ -244,7 +244,8 @@ desc,doc,name
     1.0.
 
     Parameters:
-        prompt (str): The prompt to ground to the image.
+        prompt (str): The prompt to ground to the image. Use exclusive categories that
+            do not overlap such as 'person, car' and NOT 'person, athlete'.
         image (np.ndarray): The image to ground the prompt to.
         fine_tune_id (Optional[str]): If you have a fine-tuned model, you can pass the
             fine-tuned model ID here to use it.
@@ -281,7 +282,8 @@ desc,doc,name
     is useful for tracking and counting without duplicating counts.
 
     Parameters:
-        prompt (str): The prompt to ground to the video.
+        prompt (str): The prompt to ground to the image. Use exclusive categories that
+            do not overlap such as 'person, car' and NOT 'person, athlete'.
         frames (List[np.ndarray]): The list of frames to ground the prompt to.
         chunk_length (Optional[int]): The number of frames to re-run florence2 to find
             new objects.
@@ -317,14 +319,14 @@ desc,doc,name
     ]
     ",florence2_sam2_video_tracking
 "'florence2_object_detection' is a tool that can detect multiple objects given a text prompt which can be object names or caption. You can optionally separate the object names in the text with commas. It returns a list of bounding boxes with normalized coordinates, label names and associated confidence scores of 1.0.","florence2_object_detection(prompt: str, image: numpy.ndarray, fine_tune_id: Optional[str] = None) -> List[Dict[str, Any]]:
-    'florence2_object_detection' is a tool that can detect multiple
-    objects given a text prompt which can be object names or caption. You
-    can optionally separate the object names in the text with commas. It returns a list
-    of bounding boxes with normalized coordinates, label names and associated
-    confidence scores of 1.0.
+    'florence2_object_detection' is a tool that can detect multiple objects given a
+    text prompt which can be object names or caption. You can optionally separate the
+    object names in the text with commas. It returns a list of bounding boxes with
+    normalized coordinates, label names and associated confidence scores of 1.0.
 
     Parameters:
-        prompt (str): The prompt to ground to the image.
+        prompt (str): The prompt to ground to the image. Use exclusive categories that
+            do not overlap such as 'person, car' and NOT 'person, athlete'.
         image (np.ndarray): The image to used to detect objects
         fine_tune_id (Optional[str]): If you have a fine-tuned model, you can pass the
             fine-tuned model ID here to use it.
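The "exclusive categories" guidance added to these tool docs is easy to illustrate. A hedged sketch (hypothetical image path; assumes the vision_agent package and the tool names shown above):

```python
from vision_agent.tools import florence2_object_detection, load_image

# 'person, car' are non-overlapping categories. 'person, athlete' would be a
# poor prompt: every athlete is also a person, so counts would double up.
image = load_image("street.jpg")  # hypothetical path
detections = florence2_object_detection("person, car", image)
for det in detections:
    print(det["label"], det["score"], det["bbox"])  # bbox coords are normalized
```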
19 changes: 10 additions & 9 deletions vision_agent/agent/agent_utils.py
@@ -157,10 +157,11 @@ def format_conversation(chat: List[AgentMessage]) -> str:
     chat = copy.deepcopy(chat)
     prompt = ""
     for chat_i in chat:
-        if chat_i.role == "user":
-            prompt += f"USER: {chat_i.content}\n\n"
-        elif chat_i.role == "observation" or chat_i.role == "coder":
-            prompt += f"OBSERVATION: {chat_i.content}\n\n"
+        if chat_i.role == "user" or chat_i.role == "coder":
+            if "<final_code>" in chat_i.role:
+                prompt += f"OBSERVATION: {chat_i.content}\n\n"
+            elif chat_i.role == "user":
+                prompt += f"USER: {chat_i.content}\n\n"
         elif chat_i.role == "conversation":
             prompt += f"AGENT: {chat_i.content}\n\n"
     return prompt
@@ -332,26 +333,26 @@ class StripFunctionCallsTransformer(cst.CSTTransformer):
     def __init__(self, exclusions: List[str]):
         # Store exclusions to skip removing certain function calls
         self.exclusions = exclusions
-        self.in_function_or_class = False
+        self.in_function_or_class: List[bool] = []
 
     def visit_FunctionDef(self, node: cst.FunctionDef) -> Optional[bool]:
-        self.in_function_or_class = True
+        self.in_function_or_class.append(True)
         return True
 
     def leave_FunctionDef(
         self, original_node: cst.FunctionDef, updated_node: cst.FunctionDef
     ) -> cst.BaseStatement:
-        self.in_function_or_class = False
+        self.in_function_or_class.pop()
         return updated_node
 
     def visit_ClassDef(self, node: cst.ClassDef) -> Optional[bool]:
-        self.in_function_or_class = True
+        self.in_function_or_class.append(True)
         return True
 
     def leave_ClassDef(
         self, node: cst.ClassDef, updated_node: cst.ClassDef
     ) -> cst.BaseStatement:
-        self.in_function_or_class = False
+        self.in_function_or_class.pop()
         return updated_node
 
     def leave_Expr(
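The StripFunctionCallsTransformer change ("fix bug with strip calls") swaps a boolean flag for a stack: with a flag, leaving a nested inner function resets the state to "outside" even though the visitor is still inside the outer function. A minimal standalone sketch of the same stack pattern (illustrative, not the repository's actual class):

```python
import libcst as cst


class DepthTracker(cst.CSTTransformer):
    """Tracks function nesting with a stack instead of a single flag."""

    def __init__(self) -> None:
        self.stack: list = []

    def visit_FunctionDef(self, node: cst.FunctionDef) -> bool:
        self.stack.append(True)  # entering a (possibly nested) function
        return True

    def leave_FunctionDef(
        self, original_node: cst.FunctionDef, updated_node: cst.FunctionDef
    ) -> cst.BaseStatement:
        self.stack.pop()  # empty again only after the outermost def closes
        return updated_node


source = "def outer():\n    def inner():\n        pass\n    x = 1\n"
cst.parse_module(source).visit(DepthTracker())
# A boolean flag would flip to False after leaving inner(), wrongly treating
# the trailing `x = 1` as top-level; the stack still holds one entry there.
```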
7 changes: 3 additions & 4 deletions vision_agent/agent/vision_agent.py
@@ -291,10 +291,9 @@ def __init__(
             verbosity (int): The verbosity level of the agent.
             callback_message (Optional[Callable[[Dict[str, Any]], None]]): Callback
                 function to send intermediate update messages.
-            code_interpreter (Optional[Union[str, CodeInterpreter]]): For string values
-                it can be one of: None, "local" or "e2b". If None, it will read from
-                the environment variable "CODE_SANDBOX_RUNTIME". If a CodeInterpreter
-                object is provided it will use that.
+            code_sandbox_runtime (Optional[str]): For string values it can be one of:
+                None, "local" or "e2b". If None, it will read from the environment
+                variable "CODE_SANDBOX_RUNTIME".
         """
 
         self.agent = AnthropicLMM(temperature=0.0) if agent is None else agent
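The docstring fix above renames the documented parameter to code_sandbox_runtime, apparently to match the constructor's actual signature. A hedged usage sketch (assumes VisionAgent is exported from vision_agent.agent):

```python
from vision_agent.agent import VisionAgent

# Run generated code in the local interpreter instead of an e2b sandbox;
# passing None instead would defer to the CODE_SANDBOX_RUNTIME env variable.
agent = VisionAgent(verbosity=1, code_sandbox_runtime="local")
```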
12 changes: 6 additions & 6 deletions vision_agent/agent/vision_agent_coder_prompts.py
@@ -44,35 +44,35 @@
 ## Subtasks
 
-This plan uses the owl_v2_image tool to detect both people and helmets in a single pass, which should be efficient and accurate. We can then compare the detections to determine if each person is wearing a helmet.
--Use owl_v2_image with prompt 'person, helmet' to detect both people and helmets in the image
+This plan uses the owlv2_object_detection tool to detect both people and helmets in a single pass, which should be efficient and accurate. We can then compare the detections to determine if each person is wearing a helmet.
+-Use owlv2_object_detection with prompt 'person, helmet' to detect both people and helmets in the image
 -Process the detections to match helmets with people based on bounding box proximity
 -Count people with and without helmets based on the matching results
 -Return a dictionary with the counts
 
 **Tool Tests and Outputs**:
-After examining the image, I can see 4 workers in total, with 3 wearing yellow safety helmets and 1 not wearing a helmet. Plan 1 using owl_v2_image seems to be the most accurate in detecting both people and helmets. However, it needs some modifications to improve accuracy. We should increase the confidence threshold to 0.15 to filter out the lowest confidence box, and implement logic to associate helmets with people based on their bounding box positions. Plan 2 and Plan 3 seem less reliable given the tool outputs, as they either failed to distinguish between people with and without helmets or misclassified all workers as not wearing helmets.
+After examining the image, I can see 4 workers in total, with 3 wearing yellow safety helmets and 1 not wearing a helmet. Plan 1 using owlv2_object_detection seems to be the most accurate in detecting both people and helmets. However, it needs some modifications to improve accuracy. We should increase the confidence threshold to 0.15 to filter out the lowest confidence box, and implement logic to associate helmets with people based on their bounding box positions. Plan 2 and Plan 3 seem less reliable given the tool outputs, as they either failed to distinguish between people with and without helmets or misclassified all workers as not wearing helmets.
 
 **Tool Output Thoughts**:
 ```python
 ...
 ```
 ----- stdout -----
-Plan 1 - owl_v2_image:
+Plan 1 - owlv2_object_detection:
 [{{'label': 'helmet', 'score': 0.15, 'bbox': [0.85, 0.41, 0.87, 0.45]}}, {{'label': 'helmet', 'score': 0.3, 'bbox': [0.8, 0.43, 0.81, 0.46]}}, {{'label': 'helmet', 'score': 0.31, 'bbox': [0.85, 0.45, 0.86, 0.46]}}, {{'label': 'person', 'score': 0.31, 'bbox': [0.84, 0.45, 0.88, 0.58]}}, {{'label': 'person', 'score': 0.31, 'bbox': [0.78, 0.43, 0.82, 0.57]}}, {{'label': 'helmet', 'score': 0.33, 'bbox': [0.3, 0.65, 0.32, 0.67]}}, {{'label': 'person', 'score': 0.29, 'bbox': [0.28, 0.65, 0.36, 0.84]}}, {{'label': 'helmet', 'score': 0.29, 'bbox': [0.13, 0.82, 0.15, 0.85]}}, {{'label': 'person', 'score': 0.3, 'bbox': [0.1, 0.82, 0.24, 1.0]}}]
 ...
 
 **Input Code Snippet**:
 ```python
-from vision_agent.tools import load_image, owl_v2_image
+from vision_agent.tools import load_image, owlv2_object_detection
 
 def check_helmets(image_path):
     image = load_image(image_path)
     # Detect people and helmets, filter out the lowest confidence helmet score of 0.15
-    detections = owl_v2_image("person, helmet", image, box_threshold=0.15)
+    detections = owlv2_object_detection("person, helmet", image, box_threshold=0.15)
     height, width = image.shape[:2]
 
     # Separate people and helmets
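The prompt's instruction to "associate helmets with people based on their bounding box positions" leaves the matching heuristic open. One hedged way to implement it (illustrative only, not the repository's code):

```python
from typing import List


def helmet_on_person(helmet_box: List[float], person_box: List[float]) -> bool:
    """Heuristic: the helmet's center must fall in the upper third of the
    person's box. All boxes are normalized [x1, y1, x2, y2]."""
    hx = (helmet_box[0] + helmet_box[2]) / 2
    hy = (helmet_box[1] + helmet_box[3]) / 2
    px1, py1, px2, py2 = person_box
    return px1 <= hx <= px2 and py1 <= hy <= py1 + (py2 - py1) / 3


# Two detections from the sample stdout above: this helmet sits on this person.
print(helmet_on_person([0.3, 0.65, 0.32, 0.67], [0.28, 0.65, 0.36, 0.84]))  # True
```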
67 changes: 41 additions & 26 deletions vision_agent/agent/vision_agent_coder_v2.py
@@ -26,7 +26,8 @@
 )
 from vision_agent.agent.vision_agent_coder_prompts_v2 import CODE, FIX_BUG, TEST
 from vision_agent.agent.vision_agent_planner_v2 import VisionAgentPlannerV2
-from vision_agent.lmm import LMM, AnthropicLMM
+from vision_agent.configs import Config
+from vision_agent.lmm import LMM
 from vision_agent.lmm.types import Message
 from vision_agent.tools.meta_tools import get_diff
 from vision_agent.utils.execute import (
@@ -36,6 +37,7 @@
 )
 from vision_agent.utils.sim import Sim, get_tool_recommender
 
+CONFIG = Config()
 _CONSOLE = Console()

@@ -185,23 +187,17 @@ def debug_code(
     return code, test, debug_info
 
 
-def write_and_test_code(
-    coder: LMM,
+def test_code(
     tester: LMM,
     debugger: LMM,
     chat: List[AgentMessage],
     plan: str,
+    code: str,
     tool_docs: str,
     code_interpreter: CodeInterpreter,
     media_list: List[Union[str, Path]],
     verbose: bool,
 ) -> CodeContext:
-    code = write_code(
-        coder=coder,
-        chat=chat,
-        tool_docs=tool_docs,
-        plan=plan,
-    )
     try:
         code = strip_function_calls(code)
     except Exception:
@@ -257,6 +253,36 @@ def write_and_test_code(
     )
 
 
+def write_and_test_code(
+    coder: LMM,
+    tester: LMM,
+    debugger: LMM,
+    chat: List[AgentMessage],
+    plan: str,
+    tool_docs: str,
+    code_interpreter: CodeInterpreter,
+    media_list: List[Union[str, Path]],
+    verbose: bool,
+) -> CodeContext:
+    code = write_code(
+        coder=coder,
+        chat=chat,
+        tool_docs=tool_docs,
+        plan=plan,
+    )
+    return test_code(
+        tester,
+        debugger,
+        chat,
+        plan,
+        code,
+        tool_docs,
+        code_interpreter,
+        media_list,
+        verbose,
+    )
+
+
 class VisionAgentCoderV2(AgentCoder):
     """VisionAgentCoderV2 is an agent that will write vision code for you."""
@@ -300,21 +326,9 @@ def __init__(
             )
         )
 
-        self.coder = (
-            coder
-            if coder is not None
-            else AnthropicLMM(model_name="claude-3-5-sonnet-20241022", temperature=0.0)
-        )
-        self.tester = (
-            tester
-            if tester is not None
-            else AnthropicLMM(model_name="claude-3-5-sonnet-20241022", temperature=0.0)
-        )
-        self.debugger = (
-            debugger
-            if debugger is not None
-            else AnthropicLMM(model_name="claude-3-5-sonnet-20241022", temperature=0.0)
-        )
+        self.coder = coder if coder is not None else CONFIG.create_coder()
+        self.tester = tester if tester is not None else CONFIG.create_tester()
+        self.debugger = debugger if debugger is not None else CONFIG.create_debugger()
         if tool_recommender is not None:
             if isinstance(tool_recommender, str):
                 self.tool_recommender = Sim.load(tool_recommender)
@@ -440,12 +454,13 @@ def generate_code_from_plan(
         ) as code_interpreter:
             int_chat, _, media_list = add_media_to_chat(chat, code_interpreter)
             tool_docs = retrieve_tools(plan_context.instructions, self.tool_recommender)
-            code_context = write_and_test_code(
-                coder=self.coder,
+
+            code_context = test_code(
                 tester=self.tester,
                 debugger=self.debugger,
                 chat=int_chat,
                 plan=format_plan_v2(plan_context),
+                code=plan_context.code,
                 tool_docs=tool_docs,
                 code_interpreter=code_interpreter,
                 media_list=media_list,
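In the new flow, generate_code_from_plan no longer writes code: the plan context already carries it (plan_context.code), and this agent only tests and debugs it. A hedged sketch of overriding the config-driven defaults (assumes AnthropicLMM is still exported from vision_agent.lmm):

```python
from vision_agent.agent.vision_agent_coder_v2 import VisionAgentCoderV2
from vision_agent.lmm import AnthropicLMM

# Explicit models still take precedence; otherwise CONFIG.create_coder(),
# CONFIG.create_tester() and CONFIG.create_debugger() supply the defaults.
tester = AnthropicLMM(model_name="claude-3-5-sonnet-20241022", temperature=0.0)
agent = VisionAgentCoderV2(tester=tester)
```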
12 changes: 6 additions & 6 deletions vision_agent/agent/vision_agent_planner_prompts.py
@@ -55,27 +55,27 @@
 --- EXAMPLE1 ---
 plan1:
 - Load the image from the provided file path 'image.jpg'.
-- Use the 'owl_v2_image' tool with the prompt 'person' to detect and count the number of people in the image.
+- Use the 'owlv2_object_detection' tool with the prompt 'person' to detect and count the number of people in the image.
 plan2:
 - Load the image from the provided file path 'image.jpg'.
-- Use the 'florence2_sam2_image' tool with the prompt 'person' to detect and count the number of people in the image.
+- Use the 'florence2_sam2_instance_segmentation' tool with the prompt 'person' to detect and count the number of people in the image.
 - Count the number of detected objects labeled as 'person'.
 plan3:
 - Load the image from the provided file path 'image.jpg'.
 - Use the 'countgd_object_detection' tool to count the dominant foreground object, which in this case is people.
 
 ```python
-from vision_agent.tools import load_image, owl_v2_image, florence2_sam2_image, countgd_object_detection
+from vision_agent.tools import load_image, owlv2_object_detection, florence2_sam2_instance_segmentation, countgd_object_detection
 image = load_image("image.jpg")
-owl_v2_out = owl_v2_image("person", image)
+owl_v2_out = owlv2_object_detection("person", image)
-f2s2_out = florence2_sam2_image("person", image)
+f2s2_out = florence2_sam2_instance_segmentation("person", image)
 # strip out the masks from the output becuase they don't provide useful information when printed
 f2s2_out = [{{k: v for k, v in o.items() if k != "mask"}} for o in f2s2_out]
 cgd_out = countgd_object_detection("person", image)
-final_out = {{"owl_v2_image": owl_v2_out, "florence2_sam2_image": f2s2, "countgd_object_detection": cgd_out}}
+final_out = {{"owlv2_object_detection": owl_v2_out, "florence2_sam2_instance_segmentation": f2s2, "countgd_object_detection": cgd_out}}
 print(final_out)
 ```
 --- END EXAMPLE1 ---
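Because this commit renames tools across the prompts (owl_v2_image becomes owlv2_object_detection, florence2_sam2_image becomes florence2_sam2_instance_segmentation), downstream code pinned to the old names needs updating. A hypothetical compatibility shim, not part of the commit:

```python
# Alias the new tool names to the old identifiers during migration.
from vision_agent.tools import (
    florence2_sam2_instance_segmentation as florence2_sam2_image,
    owlv2_object_detection as owl_v2_image,
)
```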