Update Conversation v2 (#342)
* fix notebook example

* remove old tools

* fix edge case for checking sim

* formatting fix

* fixed docs for video tracking tools

* updated embs for fixed docs

* take in user modified code

* update planner tool prompt

* change default fps to 5

* fix type error

* fix doc in example

* do not change default fps

* change default fps to 5

* don't crash if parser fails

* clean up docs

* flake8

* fix check load

* update vision agent conversation

* fix o1 for lmm class

* fixed names

* updated docs

* add image size args for lmm

* fixed resizing

* added configs

* add config for agents

* update readme

* fix tool docs

* fix tool docs

* fix bug with strip calls

* update configs

* run multi judge

* remove write code, only test now

* fix prompts

* update index

* mypy fixes

* add config module

* fix type error
dillonalaird authored Jan 17, 2025
1 parent b20a254 commit f83c4c7

Showing 24 changed files with 1,482 additions and 1,142 deletions.
512 changes: 53 additions & 459 deletions README.md

Large diffs are not rendered by default.

512 changes: 53 additions & 459 deletions docs/index.md

Large diffs are not rendered by default.

469 changes: 469 additions & 0 deletions docs/index_old.md

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -72,9 +72,10 @@ markdown_extensions:
   - pymdownx.details
 
 nav:
-  - Quick start: index.md
+  - VisionAgent: index.md
   - APIs:
       - vision_agent.agent: api/agent.md
       - vision_agent.tools: api/tools.md
       - vision_agent.lmm: api/lmm.md
       - vision_agent.utils: api/utils.md
+  - VisionAgent (Old): index_old.md
18 changes: 10 additions & 8 deletions vision_agent/.sim_tools/df.csv
@@ -244,7 +244,8 @@ desc,doc,name
     1.0.
 
     Parameters:
-        prompt (str): The prompt to ground to the image.
+        prompt (str): The prompt to ground to the image. Use exclusive categories that
+            do not overlap such as 'person, car' and NOT 'person, athlete'.
         image (np.ndarray): The image to ground the prompt to.
         fine_tune_id (Optional[str]): If you have a fine-tuned model, you can pass the
             fine-tuned model ID here to use it.
@@ -281,7 +282,8 @@ desc,doc,name
     is useful for tracking and counting without duplicating counts.
 
     Parameters:
-        prompt (str): The prompt to ground to the video.
+        prompt (str): The prompt to ground to the image. Use exclusive categories that
+            do not overlap such as 'person, car' and NOT 'person, athlete'.
         frames (List[np.ndarray]): The list of frames to ground the prompt to.
         chunk_length (Optional[int]): The number of frames to re-run florence2 to find
             new objects.
@@ -317,14 +319,14 @@ desc,doc,name
     ]
     ",florence2_sam2_video_tracking
 "'florence2_object_detection' is a tool that can detect multiple objects given a text prompt which can be object names or caption. You can optionally separate the object names in the text with commas. It returns a list of bounding boxes with normalized coordinates, label names and associated confidence scores of 1.0.","florence2_object_detection(prompt: str, image: numpy.ndarray, fine_tune_id: Optional[str] = None) -> List[Dict[str, Any]]:
-    'florence2_object_detection' is a tool that can detect multiple
-    objects given a text prompt which can be object names or caption. You
-    can optionally separate the object names in the text with commas. It returns a list
-    of bounding boxes with normalized coordinates, label names and associated
-    confidence scores of 1.0.
+    'florence2_object_detection' is a tool that can detect multiple objects given a
+    text prompt which can be object names or caption. You can optionally separate the
+    object names in the text with commas. It returns a list of bounding boxes with
+    normalized coordinates, label names and associated confidence scores of 1.0.
 
     Parameters:
-        prompt (str): The prompt to ground to the image.
+        prompt (str): The prompt to ground to the image. Use exclusive categories that
+            do not overlap such as 'person, car' and NOT 'person, athlete'.
         image (np.ndarray): The image to used to detect objects
         fine_tune_id (Optional[str]): If you have a fine-tuned model, you can pass the
             fine-tuned model ID here to use it.
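The "exclusive categories" guidance added to these tool docs is easy to illustrate. A hedged sketch (hypothetical image path; assumes the vision_agent package and the tool names shown above):

```python
from vision_agent.tools import florence2_object_detection, load_image

# 'person, car' are non-overlapping categories. 'person, athlete' would be a
# poor prompt: every athlete is also a person, so counts would double up.
image = load_image("street.jpg")  # hypothetical path
detections = florence2_object_detection("person, car", image)
for det in detections:
    print(det["label"], det["score"], det["bbox"])  # bbox coords are normalized
```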
19 changes: 10 additions & 9 deletions vision_agent/agent/agent_utils.py
@@ -157,10 +157,11 @@ def format_conversation(chat: List[AgentMessage]) -> str:
     chat = copy.deepcopy(chat)
     prompt = ""
     for chat_i in chat:
-        if chat_i.role == "user":
-            prompt += f"USER: {chat_i.content}\n\n"
-        elif chat_i.role == "observation" or chat_i.role == "coder":
-            prompt += f"OBSERVATION: {chat_i.content}\n\n"
+        if chat_i.role == "user" or chat_i.role == "coder":
+            if "<final_code>" in chat_i.role:
+                prompt += f"OBSERVATION: {chat_i.content}\n\n"
+            elif chat_i.role == "user":
+                prompt += f"USER: {chat_i.content}\n\n"
         elif chat_i.role == "conversation":
             prompt += f"AGENT: {chat_i.content}\n\n"
     return prompt
@@ -332,26 +333,26 @@ class StripFunctionCallsTransformer(cst.CSTTransformer):
     def __init__(self, exclusions: List[str]):
         # Store exclusions to skip removing certain function calls
         self.exclusions = exclusions
-        self.in_function_or_class = False
+        self.in_function_or_class: List[bool] = []
 
     def visit_FunctionDef(self, node: cst.FunctionDef) -> Optional[bool]:
-        self.in_function_or_class = True
+        self.in_function_or_class.append(True)
         return True
 
     def leave_FunctionDef(
         self, original_node: cst.FunctionDef, updated_node: cst.FunctionDef
     ) -> cst.BaseStatement:
-        self.in_function_or_class = False
+        self.in_function_or_class.pop()
         return updated_node
 
     def visit_ClassDef(self, node: cst.ClassDef) -> Optional[bool]:
-        self.in_function_or_class = True
+        self.in_function_or_class.append(True)
         return True
 
     def leave_ClassDef(
         self, node: cst.ClassDef, updated_node: cst.ClassDef
     ) -> cst.BaseStatement:
-        self.in_function_or_class = False
+        self.in_function_or_class.pop()
         return updated_node
 
     def leave_Expr(
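The StripFunctionCallsTransformer change ("fix bug with strip calls") swaps a boolean flag for a stack: with a flag, leaving a nested inner function resets the state to "outside" even though the visitor is still inside the outer function. A minimal standalone sketch of the same stack pattern (illustrative, not the repository's actual class):

```python
import libcst as cst


class DepthTracker(cst.CSTTransformer):
    """Tracks function nesting with a stack instead of a single flag."""

    def __init__(self) -> None:
        self.stack: list = []

    def visit_FunctionDef(self, node: cst.FunctionDef) -> bool:
        self.stack.append(True)  # entering a (possibly nested) function
        return True

    def leave_FunctionDef(
        self, original_node: cst.FunctionDef, updated_node: cst.FunctionDef
    ) -> cst.BaseStatement:
        self.stack.pop()  # empty again only after the outermost def closes
        return updated_node


source = "def outer():\n    def inner():\n        pass\n    x = 1\n"
cst.parse_module(source).visit(DepthTracker())
# A boolean flag would flip to False after leaving inner(), wrongly treating
# the trailing `x = 1` as top-level; the stack still holds one entry there.
```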
7 changes: 3 additions & 4 deletions vision_agent/agent/vision_agent.py
@@ -291,10 +291,9 @@ def __init__(
             verbosity (int): The verbosity level of the agent.
             callback_message (Optional[Callable[[Dict[str, Any]], None]]): Callback
                 function to send intermediate update messages.
-            code_interpreter (Optional[Union[str, CodeInterpreter]]): For string values
-                it can be one of: None, "local" or "e2b". If None, it will read from
-                the environment variable "CODE_SANDBOX_RUNTIME". If a CodeInterpreter
-                object is provided it will use that.
+            code_sandbox_runtime (Optional[str]): For string values it can be one of:
+                None, "local" or "e2b". If None, it will read from the environment
+                variable "CODE_SANDBOX_RUNTIME".
         """
 
         self.agent = AnthropicLMM(temperature=0.0) if agent is None else agent
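The docstring fix above renames the documented parameter to code_sandbox_runtime, apparently to match the constructor's actual signature. A hedged usage sketch (assumes VisionAgent is exported from vision_agent.agent):

```python
from vision_agent.agent import VisionAgent

# Run generated code in the local interpreter instead of an e2b sandbox;
# passing None instead would defer to the CODE_SANDBOX_RUNTIME env variable.
agent = VisionAgent(verbosity=1, code_sandbox_runtime="local")
```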
12 changes: 6 additions & 6 deletions vision_agent/agent/vision_agent_coder_prompts.py
@@ -44,35 +44,35 @@
 ## Subtasks
 
-This plan uses the owl_v2_image tool to detect both people and helmets in a single pass, which should be efficient and accurate. We can then compare the detections to determine if each person is wearing a helmet.
--Use owl_v2_image with prompt 'person, helmet' to detect both people and helmets in the image
+This plan uses the owlv2_object_detection tool to detect both people and helmets in a single pass, which should be efficient and accurate. We can then compare the detections to determine if each person is wearing a helmet.
+-Use owlv2_object_detection with prompt 'person, helmet' to detect both people and helmets in the image
 -Process the detections to match helmets with people based on bounding box proximity
 -Count people with and without helmets based on the matching results
 -Return a dictionary with the counts
 
 **Tool Tests and Outputs**:
-After examining the image, I can see 4 workers in total, with 3 wearing yellow safety helmets and 1 not wearing a helmet. Plan 1 using owl_v2_image seems to be the most accurate in detecting both people and helmets. However, it needs some modifications to improve accuracy. We should increase the confidence threshold to 0.15 to filter out the lowest confidence box, and implement logic to associate helmets with people based on their bounding box positions. Plan 2 and Plan 3 seem less reliable given the tool outputs, as they either failed to distinguish between people with and without helmets or misclassified all workers as not wearing helmets.
+After examining the image, I can see 4 workers in total, with 3 wearing yellow safety helmets and 1 not wearing a helmet. Plan 1 using owlv2_object_detection seems to be the most accurate in detecting both people and helmets. However, it needs some modifications to improve accuracy. We should increase the confidence threshold to 0.15 to filter out the lowest confidence box, and implement logic to associate helmets with people based on their bounding box positions. Plan 2 and Plan 3 seem less reliable given the tool outputs, as they either failed to distinguish between people with and without helmets or misclassified all workers as not wearing helmets.
 
 **Tool Output Thoughts**:
 ```python
 ...
 ```
 ----- stdout -----
-Plan 1 - owl_v2_image:
+Plan 1 - owlv2_object_detection:
 [{{'label': 'helmet', 'score': 0.15, 'bbox': [0.85, 0.41, 0.87, 0.45]}}, {{'label': 'helmet', 'score': 0.3, 'bbox': [0.8, 0.43, 0.81, 0.46]}}, {{'label': 'helmet', 'score': 0.31, 'bbox': [0.85, 0.45, 0.86, 0.46]}}, {{'label': 'person', 'score': 0.31, 'bbox': [0.84, 0.45, 0.88, 0.58]}}, {{'label': 'person', 'score': 0.31, 'bbox': [0.78, 0.43, 0.82, 0.57]}}, {{'label': 'helmet', 'score': 0.33, 'bbox': [0.3, 0.65, 0.32, 0.67]}}, {{'label': 'person', 'score': 0.29, 'bbox': [0.28, 0.65, 0.36, 0.84]}}, {{'label': 'helmet', 'score': 0.29, 'bbox': [0.13, 0.82, 0.15, 0.85]}}, {{'label': 'person', 'score': 0.3, 'bbox': [0.1, 0.82, 0.24, 1.0]}}]
 ...
 
 **Input Code Snippet**:
 ```python
-from vision_agent.tools import load_image, owl_v2_image
+from vision_agent.tools import load_image, owlv2_object_detection
 
 def check_helmets(image_path):
     image = load_image(image_path)
     # Detect people and helmets, filter out the lowest confidence helmet score of 0.15
-    detections = owl_v2_image("person, helmet", image, box_threshold=0.15)
+    detections = owlv2_object_detection("person, helmet", image, box_threshold=0.15)
     height, width = image.shape[:2]
 
     # Separate people and helmets
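The prompt's instruction to "associate helmets with people based on their bounding box positions" leaves the matching heuristic open. One hedged way to implement it (illustrative only, not the repository's code):

```python
from typing import List


def helmet_on_person(helmet_box: List[float], person_box: List[float]) -> bool:
    """Heuristic: the helmet's center must fall in the upper third of the
    person's box. All boxes are normalized [x1, y1, x2, y2]."""
    hx = (helmet_box[0] + helmet_box[2]) / 2
    hy = (helmet_box[1] + helmet_box[3]) / 2
    px1, py1, px2, py2 = person_box
    return px1 <= hx <= px2 and py1 <= hy <= py1 + (py2 - py1) / 3


# Two detections from the sample stdout above: this helmet sits on this person.
print(helmet_on_person([0.3, 0.65, 0.32, 0.67], [0.28, 0.65, 0.36, 0.84]))  # True
```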
67 changes: 41 additions & 26 deletions vision_agent/agent/vision_agent_coder_v2.py
@@ -26,7 +26,8 @@
 )
 from vision_agent.agent.vision_agent_coder_prompts_v2 import CODE, FIX_BUG, TEST
 from vision_agent.agent.vision_agent_planner_v2 import VisionAgentPlannerV2
-from vision_agent.lmm import LMM, AnthropicLMM
+from vision_agent.configs import Config
+from vision_agent.lmm import LMM
 from vision_agent.lmm.types import Message
 from vision_agent.tools.meta_tools import get_diff
 from vision_agent.utils.execute import (
@@ -36,6 +37,7 @@
 )
 from vision_agent.utils.sim import Sim, get_tool_recommender
 
+CONFIG = Config()
 _CONSOLE = Console()

@@ -185,23 +187,17 @@ def debug_code(
     return code, test, debug_info
 
 
-def write_and_test_code(
-    coder: LMM,
+def test_code(
     tester: LMM,
     debugger: LMM,
     chat: List[AgentMessage],
     plan: str,
+    code: str,
     tool_docs: str,
     code_interpreter: CodeInterpreter,
     media_list: List[Union[str, Path]],
     verbose: bool,
 ) -> CodeContext:
-    code = write_code(
-        coder=coder,
-        chat=chat,
-        tool_docs=tool_docs,
-        plan=plan,
-    )
     try:
         code = strip_function_calls(code)
     except Exception:
@@ -257,6 +253,36 @@ def write_and_test_code(
     )
 
 
+def write_and_test_code(
+    coder: LMM,
+    tester: LMM,
+    debugger: LMM,
+    chat: List[AgentMessage],
+    plan: str,
+    tool_docs: str,
+    code_interpreter: CodeInterpreter,
+    media_list: List[Union[str, Path]],
+    verbose: bool,
+) -> CodeContext:
+    code = write_code(
+        coder=coder,
+        chat=chat,
+        tool_docs=tool_docs,
+        plan=plan,
+    )
+    return test_code(
+        tester,
+        debugger,
+        chat,
+        plan,
+        code,
+        tool_docs,
+        code_interpreter,
+        media_list,
+        verbose,
+    )
+
+
 class VisionAgentCoderV2(AgentCoder):
     """VisionAgentCoderV2 is an agent that will write vision code for you."""
@@ -300,21 +326,9 @@ def __init__(
             )
         )
 
-        self.coder = (
-            coder
-            if coder is not None
-            else AnthropicLMM(model_name="claude-3-5-sonnet-20241022", temperature=0.0)
-        )
-        self.tester = (
-            tester
-            if tester is not None
-            else AnthropicLMM(model_name="claude-3-5-sonnet-20241022", temperature=0.0)
-        )
-        self.debugger = (
-            debugger
-            if debugger is not None
-            else AnthropicLMM(model_name="claude-3-5-sonnet-20241022", temperature=0.0)
-        )
+        self.coder = coder if coder is not None else CONFIG.create_coder()
+        self.tester = tester if tester is not None else CONFIG.create_tester()
+        self.debugger = debugger if debugger is not None else CONFIG.create_debugger()
         if tool_recommender is not None:
             if isinstance(tool_recommender, str):
                 self.tool_recommender = Sim.load(tool_recommender)
@@ -440,12 +454,13 @@ def generate_code_from_plan(
         ) as code_interpreter:
             int_chat, _, media_list = add_media_to_chat(chat, code_interpreter)
             tool_docs = retrieve_tools(plan_context.instructions, self.tool_recommender)
-            code_context = write_and_test_code(
-                coder=self.coder,
+
+            code_context = test_code(
                 tester=self.tester,
                 debugger=self.debugger,
                 chat=int_chat,
                 plan=format_plan_v2(plan_context),
+                code=plan_context.code,
                 tool_docs=tool_docs,
                 code_interpreter=code_interpreter,
                 media_list=media_list,
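In the new flow, generate_code_from_plan no longer writes code: the plan context already carries it (plan_context.code), and this agent only tests and debugs it. A hedged sketch of overriding the config-driven defaults (assumes AnthropicLMM is still exported from vision_agent.lmm):

```python
from vision_agent.agent.vision_agent_coder_v2 import VisionAgentCoderV2
from vision_agent.lmm import AnthropicLMM

# Explicit models still take precedence; otherwise CONFIG.create_coder(),
# CONFIG.create_tester() and CONFIG.create_debugger() supply the defaults.
tester = AnthropicLMM(model_name="claude-3-5-sonnet-20241022", temperature=0.0)
agent = VisionAgentCoderV2(tester=tester)
```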
12 changes: 6 additions & 6 deletions vision_agent/agent/vision_agent_planner_prompts.py
@@ -55,27 +55,27 @@
 --- EXAMPLE1 ---
 plan1:
 - Load the image from the provided file path 'image.jpg'.
-- Use the 'owl_v2_image' tool with the prompt 'person' to detect and count the number of people in the image.
+- Use the 'owlv2_object_detection' tool with the prompt 'person' to detect and count the number of people in the image.
 plan2:
 - Load the image from the provided file path 'image.jpg'.
-- Use the 'florence2_sam2_image' tool with the prompt 'person' to detect and count the number of people in the image.
+- Use the 'florence2_sam2_instance_segmentation' tool with the prompt 'person' to detect and count the number of people in the image.
 - Count the number of detected objects labeled as 'person'.
 plan3:
 - Load the image from the provided file path 'image.jpg'.
 - Use the 'countgd_object_detection' tool to count the dominant foreground object, which in this case is people.
 
 ```python
-from vision_agent.tools import load_image, owl_v2_image, florence2_sam2_image, countgd_object_detection
+from vision_agent.tools import load_image, owlv2_object_detection, florence2_sam2_instance_segmentation, countgd_object_detection
 image = load_image("image.jpg")
-owl_v2_out = owl_v2_image("person", image)
+owl_v2_out = owlv2_object_detection("person", image)
-f2s2_out = florence2_sam2_image("person", image)
+f2s2_out = florence2_sam2_instance_segmentation("person", image)
 # strip out the masks from the output becuase they don't provide useful information when printed
 f2s2_out = [{{k: v for k, v in o.items() if k != "mask"}} for o in f2s2_out]
 cgd_out = countgd_object_detection("person", image)
-final_out = {{"owl_v2_image": owl_v2_out, "florence2_sam2_image": f2s2, "countgd_object_detection": cgd_out}}
+final_out = {{"owlv2_object_detection": owl_v2_out, "florence2_sam2_instance_segmentation": f2s2, "countgd_object_detection": cgd_out}}
 print(final_out)
 ```
 --- END EXAMPLE1 ---
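Because this commit renames tools across the prompts (owl_v2_image becomes owlv2_object_detection, florence2_sam2_image becomes florence2_sam2_instance_segmentation), downstream code pinned to the old names needs updating. A hypothetical compatibility shim, not part of the commit:

```python
# Alias the new tool names to the old identifiers during migration.
from vision_agent.tools import (
    florence2_sam2_instance_segmentation as florence2_sam2_image,
    owlv2_object_detection as owl_v2_image,
)
```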