749 create and integrate function websurfer team for scraping and creating page summary for particular ad group (#762)

* wip

* wip

* wip

* Prompt updates and new end2end benchmark results

---------

Co-authored-by: Davor Runje <davor@airt.ai>
Co-authored-by: Kumaran Rajendhiran <kumaran@airt.ai>
3 people authored Jun 11, 2024
1 parent 97056c8 commit ce3e42c
Showing 9 changed files with 184 additions and 93 deletions.

This file was deleted.

61 changes: 0 additions & 61 deletions benchmarking/end2end-benchmark-task-list-2024-06-04T20:57:23.csv

This file was deleted.

12 changes: 12 additions & 0 deletions benchmarking/end2end-benchmark-task-list-aggregated.csv
@@ -0,0 +1,12 @@
,url,success_percentage,success_with_retry_percentage,failed_percentage,avg_time
0,https://camelbackflowershop.com/,100.0,0.0,0.0,787.82
1,https://faststream.airt.ai,100.0,0.0,0.0,546.88
2,https://getbybus.com/hr/,100.0,0.0,0.0,600.33
3,https://websitedemos.net/organic-shop-02/,100.0,0.0,0.0,632.92
4,https://www.disneystore.eu,100.0,0.0,0.0,600.73
5,https://www.hamleys.com/,100.0,0.0,0.0,646.25
6,https://www.ikea.com/gb/en/,100.0,0.0,0.0,1038.94
7,https://www.konzum.hr,100.0,0.0,0.0,746.19
8,https://zagreb.cinestarcinemas.hr/,100.0,0.0,0.0,967.01
9,www.bbc.com/news,100.0,0.0,0.0,777.56
10,Total,100.0,0.0,0.0,734.46
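The `Total` row in the aggregated CSV above is the column-wise mean of the ten per-URL rows (e.g. the average times sum to 7344.63, giving 734.46). A minimal sketch of that aggregation; the `total_row` helper is illustrative only, not part of the repository, and only two sample rows are shown:

```python
# Sketch: reproduce the "Total" row of the aggregated benchmark CSV as the
# column-wise mean of the per-URL rows. Helper and data shape are hypothetical.
from statistics import mean

rows = [
    # (url, success_pct, retry_pct, failed_pct, avg_time) -- two sample rows
    ("https://faststream.airt.ai", 100.0, 0.0, 0.0, 546.88),
    ("https://www.konzum.hr", 100.0, 0.0, 0.0, 746.19),
]

def total_row(rows):
    return (
        "Total",
        mean(r[1] for r in rows),
        mean(r[2] for r in rows),
        mean(r[3] for r in rows),
        round(mean(r[4] for r in rows), 2),
    )

print(total_row(rows))
```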
101 changes: 101 additions & 0 deletions benchmarking/end2end-benchmark-task-list.csv

Large diffs are not rendered by default.

13 changes: 7 additions & 6 deletions captn/captn_agents/backend/teams/_brief_creation_team.py
@@ -153,19 +153,20 @@ def _guidelines(self) -> str:
If the client has provided a link to the web page and you do not try to gather information from the web page, you will be penalized!
If you are unable to retrieve ANY information, use the 'reply_to_client' command to ask the client for the information which you need.
Otherwise, focus on creating the brief based on the information which you were able to gather from the web page (ignore the links which you were unable to retrieve information from and don't mention them in the brief!).
Do NOT use the 'get_info_from_the_web_page' for retrieving the information of the subpages which you have found in the provided link.
Your job is to create a brief based on the information which you have gathered from the URL which the client has provided.
If you try to gather information from the subpages, a lot of time will be wasted and you will be penalized!
i.e. use the 'get_info_from_the_web_page' command ONLY once for the URL which the client has provided!
6. When you have gathered all the information, create a detailed brief.
6. Initially, use the 'get_info_from_the_web_page' command with the 'max_links_to_click' parameter set to 10.
This will allow you to gather the most information about the client's business.
Once you have gathered the information, if the client wants to focus on a specific subpage(s), use the 'get_info_from_the_web_page' command again with the 'max_links_to_click' parameter set to 4.
This will allow you to gather deeper information about the specific subpage(s) which the client is interested in - this step is VERY important!
7. When you have gathered all the information, create a detailed brief.
Do NOT repeat the content which you have received from the 'get_info_from_the_web_page' command (the content will be injected automatically later on)!
i.e. do NOT mention keywords, headlines and descriptions in the brief which you are constructing!
Do NOT mention to the client that you are creating a brief. This is your internal task and the client does not need to know that.
Do NOT ask the client which information he wants to include in the brief.
i.e. word 'brief' should NOT be mentioned to the client at all!
7. Finally, after you retrieve the information from the web page and create the brief, use the 'delagate_task' command to send the brief to the chosen team.
8. Finally, after you retrieve the information from the web page and create the brief, use the 'delagate_task' command to send the brief to the chosen team.
Guidelines SUMMARY:
- Write a detailed step-by-step plan
2 changes: 1 addition & 1 deletion captn/captn_agents/backend/teams/_shared_prompts.py
@@ -7,7 +7,7 @@
Use it so the client can easily choose between multiple options and make a quick reply by clicking on the suggestion.
e.g.:"""

GET_INFO_FROM_THE_WEB_COMMAND = """'get_info_from_the_web_page': Retrieve wanted information from the web page, params: (url: string)
GET_INFO_FROM_THE_WEB_COMMAND = """'get_info_from_the_web_page': Retrieve wanted information from the web page, params: (url: string, max_links_to_click: int (default 10))
It should be used only for the client's web page(s), final_url(s) etc.
This command should be used for retrieving the information from the client's web page.
If this command fails to retrieve the information, only then you should ask the client for the additional information about his business/web page etc."""
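The updated command description advertises `max_links_to_click: int (default 10)` to the LLM; on the Python side this corresponds to an `Annotated` parameter with the same default. A hedged, self-contained sketch of that correspondence (the stub body is hypothetical, standing in for the real websurfer logic):

```python
from typing import Annotated

# Shortened stand-in for the real parameter description string.
MAX_LINKS_TO_CLICK_DESCRIPTION = "The maximum number of links to click on the page."

def get_info_from_the_web_page(
    url: Annotated[str, "The url of the web page which needs to be summarized"],
    max_links_to_click: Annotated[int, MAX_LINKS_TO_CLICK_DESCRIPTION] = 10,
) -> str:
    # Stub: the real function drives a websurfer team; here we just echo.
    return f"visit {url} clicking at most {max_links_to_click} links"

print(get_info_from_the_web_page("https://example.com"))
```

The `Annotated` metadata is what the toolbox exposes to the model as the parameter's documentation, while the Python default keeps the call backward compatible.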
22 changes: 19 additions & 3 deletions captn/captn_agents/backend/tools/_brief_creation_team_tools.py
@@ -9,6 +9,8 @@
from ..toolboxes import Toolbox
from ._functions import (
LAST_MESSAGE_BEGINNING,
MAX_LINKS_TO_CLICK_DESCRIPTION,
MIN_RELEVANT_PAGES_DESCRIPTION,
BaseContext,
get_get_info_from_the_web_page,
get_info_from_the_web_page_description,
@@ -39,7 +41,7 @@ class DelegateTask(BaseModel):


class WebPageInfo:
def get_info_from_the_web_page_f(self) -> Callable[[str], str]:
def get_info_from_the_web_page_f(self) -> Callable[[str, int, int], str]:
return get_get_info_from_the_web_page()


@@ -166,17 +168,31 @@ def delagate_task(
@toolbox.add_function(get_info_from_the_web_page_description)
def get_info_from_the_web_page(
url: Annotated[str, "The url of the web page which needs to be summarized"],
max_links_to_click: Annotated[
int,
MAX_LINKS_TO_CLICK_DESCRIPTION,
],
min_relevant_pages: Annotated[int, MIN_RELEVANT_PAGES_DESCRIPTION],
context: Context,
) -> str:
result = web_page_info_f(url)
result = web_page_info_f( # type: ignore[call-arg]
url=url,
max_links_to_click=max_links_to_click,
min_relevant_pages=min_relevant_pages,
)

if LAST_MESSAGE_BEGINNING in result:
context.get_info_from_web_page_result += result + "\n\n"
context.get_info_from_web_page_result += (
result.replace(LAST_MESSAGE_BEGINNING, "") + "\n\n"
)

result += """\n\nPlease use the rely_to_client to present what you have found on the web page to the client.
Use smart suggestions with type 'manyOf' to ask the client in which pages they are interested in.
Each relevant page should be one smart suggestion.
Additionally, add to smart suggestions 'Proceed with the task without further web page scraping' to allow the client to proceed without further web page scraping.
If the client chooses this option do NOT use the 'get_info_from_the_web_page' command again.
If the client does not choose this option, you can use the 'get_info_from_the_web_page' if you think you need more information for the selected pages.
"""
return result

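The hunk above also changes how the summary is accumulated on the context: the fixed `LAST_MESSAGE_BEGINNING` marker is now stripped before appending. A minimal sketch of that behavior; the `Context` class here is a simplified stand-in for the real context object:

```python
# Sketch of the changed accumulation logic: the fixed marker line is now
# stripped from the summary before it is stored on the context.
LAST_MESSAGE_BEGINNING = "Here is a summary of the information you requested:"

class Context:
    def __init__(self) -> None:
        self.get_info_from_web_page_result = ""

def accumulate(context: Context, result: str) -> None:
    # Mirrors the diff: only summaries carrying the marker are stored,
    # and the marker itself is removed first.
    if LAST_MESSAGE_BEGINNING in result:
        context.get_info_from_web_page_result += (
            result.replace(LAST_MESSAGE_BEGINNING, "") + "\n\n"
        )

ctx = Context()
accumulate(ctx, LAST_MESSAGE_BEGINNING + "\nURL: https://example.com")
print(ctx.get_info_from_web_page_result)
```

Stripping the marker keeps the accumulated context free of boilerplate when summaries from multiple calls are concatenated.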
59 changes: 43 additions & 16 deletions captn/captn_agents/backend/tools/_functions.py
@@ -316,6 +316,7 @@ def _ask_client_for_permission_mock(
)

client_system_message = """We are creating a new Google Ads campaign (ad groups, ads etc).
We currently do NOT care about geo and audience targeting, we are focusing on the structure of the campaign.
We are in the middle of the process and we need your permission.
If the proposed changes make sense, Please answer 'Yes' and nothing else.
@@ -592,7 +593,9 @@ def validate_url(cls, v: str) -> str:
return v


def _create_web_surfer_navigator_system_message(task_guidelines: str) -> str:
def _create_web_surfer_navigator_system_message(
task_guidelines: str, max_links_to_click: int
) -> str:
return f"""You are in charge of navigating the web_surfer agent to scrape the web.
web_surfer is able to CLICK on links, SCROLL down, and scrape the content of the web page. e.g. you can tell him: "Click the 'Getting Started' result".
Each time you receive a reply from web_surfer, you need to tell him what to do next. e.g. "Click the TV link" or "Scroll down".
@@ -644,7 +647,8 @@ def _create_web_surfer_navigator_system_message(task_guidelines: str) -> str:
We are interested ONLY in the products/services which the page is offering.
- NEVER include in the summary links which return 40x error!
- Do NOT repeat completed parts of the plan you have created. Each message should contain only the next steps!
- When clicking on a link, add a comment "Click no. X -> 'Page you want to click' (I can click MAX 10 links, but I will click only the most relevant ones, once I am done, I need to generate JSON-encoded string)" to the message.
- When clicking on a link, add a comment "Click no. X -> 'Page you want to click' (I can click MAX {max_links_to_click} links, but I will click only the most relevant ones, once I am done, I need to generate JSON-encoded string)" to the message.
- Each time some page is visited, increase the click number by 1, otherwise you will be penalized!
OFTEN MISTAKES:
- Do NOT create more than 15 headlines and 4 descriptions for each link!
@@ -660,9 +664,9 @@ def _create_web_surfer_navigator_system_message(task_guidelines: str) -> str:
LAST_MESSAGE_BEGINNING = "Here is a summary of the information you requested:"


def _format_last_message(summary: Summary) -> str:
def _format_last_message(url: str, summary: Summary) -> str:
summary_response = f"""{LAST_MESSAGE_BEGINNING}
URL: {url}
{summary.summary}
Relevant Pages:
@@ -705,17 +709,26 @@ def get_webpage_status_code(url: str) -> Optional[int]:
return None


_task = """We are tasked with creating a new Google Ads campaign for the website.
def _get_task_message(max_links_to_click: int) -> str:
_task = f"""We are tasked with creating a new Google Ads campaign for the website.
The focus is on the provided url and its subpages, we do NOT care about the rest of the website i.e. parent pages.
e.g. If the url is 'https://www.example.com/products/air-conditioners', we are interested ONLY in the 'air-conditioners' and its subpages.
In order to create the campaign, we need to understand the website and its products/services.
Our task is to provide a summary of the website, including the products/services offered and any unique selling points.
This is the first step in creating the Google Ads campaign so please gather as much information as possible.
Visit the most likely pages to be advertised, such as the homepage, product pages, and any other relevant pages.
Visit the most likely pages to be advertised, such as product pages, and any other relevant pages.
Please provide a detailed summary of the website as JSON-encoded string as instructed in the guidelines.
AFTER visiting the home page, create a step-by-step plan BEFORE visiting the other pages.
You can click on MAXIMUM 10 links. Do NOT try to click all the links on the page, but only the ones which are most relevant for the task (MAX 10)!
You can click on MAXIMUM {max_links_to_click} links. Do NOT try to click all the links on the page, but only the ones which are most relevant for the task (MAX {max_links_to_click})!
Make sure you use keyword insertion in the headlines and provide unique headlines and descriptions for each link.
Do NOT visit the same page multiple times, but only once!
If your co-speaker repeats the same message, inform him that you have already answered to that message and ask him to proceed with the task.
e.g. "I have already answered to that message, please proceed with the task or you will be penalized!"
"""
return _task


_task_guidelines = "Please provide a summary of the website, including the products/services offered and any unique selling points."

@@ -735,27 +748,39 @@ def _constuct_retry_message(
"""


MAX_LINKS_TO_CLICK_DESCRIPTION = """The maximum number of links to click on the page.
When you want to do the initial research about the client's business, set max_links_to_click=10
If you already have insight into the client's business and only need the info for a particular subpage, set max_links_to_click=4"""

MIN_RELEVANT_PAGES_DESCRIPTION = """The minimum number of relevant pages which the summary must include.
When you want to do the initial research about the client's business, set min_relevant_pages=3
If you already have insight into the client's business and only need the info for a particular subpage, set min_relevant_pages=1"""


def get_get_info_from_the_web_page(
outer_retries: int = 3,
inner_retries: int = 10,
summarizer_llm_config: Optional[Dict[str, Any]] = None,
websurfer_llm_config: Optional[Dict[str, Any]] = None,
websurfer_navigator_llm_config: Optional[Dict[str, Any]] = None,
timestamp: Optional[str] = None,
min_relevant_pages: int = 3,
max_retires_before_give_up_message: int = 7,
) -> Callable[[str], str]:
) -> Callable[[str, int, int], str]:
fx = summarizer_llm_config, websurfer_llm_config, websurfer_navigator_llm_config

give_up_message = f"""ONLY if you are 100% sure that you can NOT retrieve any information for at least {min_relevant_pages} relevant pages,
write 'I GIVE UP' and the reason why you gave up.
But before giving up, please try to navigate to another page and continue with the task. Give up ONLY if you are sure that you can NOT retrieve any information!"""

@lru_cache(maxsize=20)
def get_info_from_the_web_page(
url: Annotated[str, "The url of the web page which needs to be summarized"],
max_links_to_click: Annotated[
int,
MAX_LINKS_TO_CLICK_DESCRIPTION,
] = 10,
min_relevant_pages: Annotated[int, MIN_RELEVANT_PAGES_DESCRIPTION] = 3,
) -> str:
give_up_message = f"""ONLY if you are 100% sure that you can NOT retrieve any information for at least {min_relevant_pages} relevant pages,
write 'I GIVE UP' and the reason why you gave up.
But before giving up, please try to navigate to another page and continue with the task. Give up ONLY if you are sure that you can NOT retrieve any information!"""
summarizer_llm_config, websurfer_llm_config, websurfer_navigator_llm_config = fx

if summarizer_llm_config is None:
Expand All @@ -772,7 +797,8 @@ def get_info_from_the_web_page(
)
web_surfer_navigator_system_message = (
_create_web_surfer_navigator_system_message(
task_guidelines=_task_guidelines
task_guidelines=_task_guidelines,
max_links_to_click=max_links_to_click,
)
)
# validate url, error will be raised if url is invalid
@@ -810,6 +836,7 @@ def get_info_from_the_web_page(
initial_message = (
f"Time now is {timestamp_copy}." if timestamp_copy else ""
)
_task = _get_task_message(max_links_to_click=max_links_to_click)
initial_message += f"""
URL: {url}
TASK: {_task}
@@ -877,7 +904,7 @@ def get_info_from_the_web_page(
recipient=web_surfer_navigator,
)
continue
last_message = _format_last_message(summary)
last_message = _format_last_message(url=url, summary=summary)
return last_message
except ValidationError as e:
retry_message = _constuct_retry_message(
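Note that `get_info_from_the_web_page` is wrapped in `@lru_cache(maxsize=20)` and now takes two extra defaulted parameters. With `lru_cache`, a call that passes a default explicitly is cached under a different key than one that omits it, so the underlying work can run twice for logically identical calls. A small sketch demonstrating the keying; the `fetch` stub is illustrative, not the repository function:

```python
from functools import lru_cache

calls = []

@lru_cache(maxsize=20)
def fetch(url: str, max_links_to_click: int = 10, min_relevant_pages: int = 3) -> str:
    # Record every *actual* execution (cache misses only).
    calls.append((url, max_links_to_click, min_relevant_pages))
    return f"summary of {url}"

fetch("https://example.com")          # miss: executes
fetch("https://example.com")          # hit: same key, no execution
fetch("https://example.com", 10)      # miss: explicit default makes a new key
print(len(calls))  # 2 executions despite three logically identical calls
```

This caveat is worth keeping in mind when an LLM-driven caller sometimes spells out the defaults and sometimes omits them.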
@@ -41,7 +41,7 @@ def create_weekly_analysis_team_toolbox(
)
toolbox.add_function(execute_query_description)(execute_query)
toolbox.add_function(get_info_from_the_web_page_description)(
get_get_info_from_the_web_page(min_relevant_pages=1)
get_get_info_from_the_web_page()
)
toolbox.add_function(send_email_description)(send_email)

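The weekly-analysis toolbox no longer pins `min_relevant_pages=1` at factory time; the knob moved onto the returned callable as a per-call default. A hedged sketch of this closure-factory shape (a simplified stand-in, not the real implementation):

```python
# Sketch: factory returns a callable whose tuning knobs are per-call
# defaults rather than factory-time configuration.
def get_get_info_from_the_web_page(outer_retries: int = 3):
    def get_info_from_the_web_page(
        url: str, max_links_to_click: int = 10, min_relevant_pages: int = 3
    ) -> str:
        return f"{url}: links<={max_links_to_click}, pages>={min_relevant_pages}"
    return get_info_from_the_web_page

f = get_get_info_from_the_web_page()
print(f("https://example.com", min_relevant_pages=1))
```

With this shape the same factory-built function can serve both the initial broad crawl and a narrow single-subpage lookup, which is presumably why the `min_relevant_pages=1` argument disappeared from the call site above.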
