Commit f8744e4

fix(prompt): resolve the llm-planning format error (#341)
1 parent e8a3ea4 commit f8744e4

12 files changed: +305 -74 lines changed

.github/workflows/ai-unit-test.yml

Lines changed: 5 additions & 0 deletions
@@ -44,6 +44,11 @@ jobs:
       - name: Install dependencies
         run: pnpm install --frozen-lockfile
 
+      - name: Install puppeteer dependencies
+        run: |
+          cd packages/web-integration
+          npx puppeteer browsers install chrome
+
       - name: Build project
         run: pnpm run build

packages/midscene/package.json

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@
     "@midscene/shared": "workspace:*",
     "@langchain/core": "0.3.26",
     "socks-proxy-agent": "8.0.4",
-    "openai": "4.57.1"
+    "openai": "4.81.0"
   },
   "devDependencies": {
     "@modern-js/module-tools": "2.60.6",

packages/midscene/src/ai-model/prompt/llm-planning.ts

Lines changed: 1 addition & 0 deletions
@@ -291,6 +291,7 @@ export const planSchema: ResponseFormatJSONSchema = {
           type: ['object', 'null'],
           description:
             'Parameter of the action, can be null ONLY when the type field is Tap or Hover',
+          additionalProperties: true,
         },
         locate: {
           type: ['object', 'null'],
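
Review note: the hunk above adds `additionalProperties: true` to the `param` node of `planSchema`. The commit does not say why the explicit flag was needed, so the following is only a minimal sketch of what the keyword means in JSON Schema terms. `ObjectSchema`, `disallowedKeys`, and both sample schemas are hypothetical names invented for illustration; they are not part of the repository or of the OpenAI SDK.

```typescript
// Hypothetical, simplified stand-in for a JSON Schema object node.
type ObjectSchema = {
  type: string[];
  properties?: Record<string, unknown>;
  additionalProperties?: boolean;
};

// Returns the keys of `value` that the schema neither declares nor allows.
function disallowedKeys(
  schema: ObjectSchema,
  value: Record<string, unknown> | null,
): string[] {
  // A null param carries no keys, so nothing can be disallowed.
  if (value === null) {
    return [];
  }
  // In JSON Schema, only an explicit `false` closes the object;
  // `true` (or an omitted keyword) permits undeclared keys.
  if (schema.additionalProperties !== false) {
    return [];
  }
  const declared = new Set(Object.keys(schema.properties ?? {}));
  return Object.keys(value).filter((key) => !declared.has(key));
}

// Mirrors the shape of the `param` node after this commit (assumed, simplified).
const openParamSchema: ObjectSchema = {
  type: ['object', 'null'],
  additionalProperties: true,
};

// A closed variant, shown only for contrast.
const closedParamSchema: ObjectSchema = {
  type: ['object', 'null'],
  properties: {},
  additionalProperties: false,
};

console.log(disallowedKeys(openParamSchema, { timeMs: 1000 }));
console.log(disallowedKeys(closedParamSchema, { timeMs: 1000 }));
```

Because each action type carries a different `param` shape (`{ value }`, `{ timeMs }`, `{ direction, scrollType, distance }`), an open object is one plausible way to cover all of them with a single schema node.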
Lines changed: 234 additions & 0 deletions
@@ -0,0 +1,234 @@
+// Vitest Snapshot v1, https://vitest.dev/guide/snapshot.html
+
+exports[`automation - computer > should be able to generate prompt 1`] = `
+"
+## Role
+
+You are a versatile professional in software UI automation. Your outstanding contributions will impact the user experience of billions of users.
+
+## Objective
+
+- Decompose the instruction user asked into a series of actions
+- Locate the target element if possible
+- If the instruction cannot be accomplished, give a further plan.
+
+## Workflow
+
+1. Receive the user's element description, screenshot, and instruction.
+2. Decompose the user's task into a sequence of actions, and place it in the \`actions\` field. There are different types of actions (Tap / Hover / Input / KeyboardPress / Scroll / FalsyConditionStatement / Sleep). The "About the action" section below will give you more details.
+3. Precisely locate the target element if it's already shown in the screenshot, put the location info in the \`locate\` field of the action.
+4. If some target elements is not shown in the screenshot, consider the user's instruction is not feasible on this page. Follow the next steps.
+5. Consider whether the user's instruction will be accomplished after all the actions
+  - If yes, set \`taskWillBeAccomplished\` to true
+  - If no, don't plan more actions by closing the array. Get ready to reevaluate the task. Some talent people like you will handle this. Give him a clear description of what have been done and what to do next. Put your new plan in the \`furtherPlan\` field. The "How to compose the \`taskWillBeAccomplished\` and \`furtherPlan\` fields" section will give you more details.
+
+## Constraints
+
+- All the actions you composed MUST be based on the page context information you get.
+- Trust the "What have been done" field about the task (if any), don't repeat actions in it.
+- Respond only with valid JSON. Do not write an introduction or summary or markdown prefix like \`\`\`json\`.
+- If you cannot plan any action at all (i.e. empty actions array), set reason in the \`error\` field.
+
+## About the \`actions\` field
+
+### The common \`locate\` param
+
+The \`locate\` param is commonly used in the \`param\` field of the action, means to locate the target element to perform the action, it follows the following scheme:
+
+type LocateParam = {
+  "id": string, // the id of the element found. It should either be the id marked with a rectangle in the screenshot or the id described in the description.
+  "prompt"?: string // the description of the element to find. It can only be omitted when locate is null.
+} | null // If it's not on the page, the LocateParam should be null
+
+### Supported actions
+
+Each action has a \`type\` and corresponding \`param\`. To be detailed:
+- type: 'Tap', tap the located element
+  * { locate: {"id": "c81c4e9a33", "prompt": "the search bar"}, param: null }
+- type: 'Hover', move mouse over to the located element
+  * { locate: LocateParam, param: null }
+- type: 'Input', replace the value in the input field
+  * { locate: LocateParam, param: { value: string } }
+  * \`value\` is the final required input value based on the existing input. No matter what modifications are required, just provide the final value to replace the existing input value.
+- type: 'KeyboardPress', press a key
+  * { param: { value: string } }
+- type: 'Scroll', scroll up or down.
+  * {
+      locate: LocateParam | null,
+      param: {
+        direction: 'down'(default) | 'up' | 'right' | 'left',
+        scrollType: 'once' (default) | 'untilBottom' | 'untilTop' | 'untilRight' | 'untilLeft',
+        distance: null | number
+      }
+    }
+  * To scroll some specific element, put the element at the center of the region in the \`locate\` field. If it's a page scroll, put \`null\` in the \`locate\` field.
+  * \`param\` is required in this action. If some fields are not specified, use direction \`down\`, \`once\` scroll type, and \`null\` distance.
+- type: 'FalsyConditionStatement'
+  * { param: null }
+  * use this action when the instruction is an "if" statement and the condition is falsy.
+- type: 'Sleep'
+  * { param: { timeMs: number } }
+
+## How to compose the \`taskWillBeAccomplished\` and \`furtherPlan\` fields ?
+
+\`taskWillBeAccomplished\` is a boolean field, means whether the task will be accomplished after all the actions.
+
+\`furtherPlan\` is used when the task cannot be accomplished. It follows the scheme { whatHaveDone: string, whatToDoNext: string }:
+- \`whatHaveDone\`: a string, describe what have been done after the previous actions.
+- \`whatToDoNext\`: a string, describe what should be done next after the previous actions has finished. It should be a concise and clear description of the actions to be performed. Make sure you don't lose any necessary steps user asked.
+
+
+
+## Output JSON Format:
+
+The JSON format is as follows:
+
+{
+  "actions": [
+    {
+      "thought": "Reasons for generating this task, and why this task is feasible on this page.", // Use the same language as the user's instruction.
+      "type": "Tap",
+      "param": null,
+      "locate": {"id": "c81c4e9a33", "prompt": "the search bar"} | null,
+    },
+    // ... more actions
+  ],
+  "taskWillBeAccomplished": boolean,
+  "furtherPlan": { "whatHaveDone": string, "whatToDoNext": string } | null, // Use the same language as the user's instruction.
+  "error"?: string // Use the same language as the user's instruction.
+}
+Here is an example of how to decompose a task:
+
+When a user says 'Click the language switch button, wait 1s, click "English"', the user will give you the description like this:
+
+====================
+
+The size of the page: 1280 x 720
+Some of the elements are marked with a rectangle in the screenshot, some are not.
+
+JSON description of all the elements in screenshot:
+id=c81c4e9a33: {
+  "markerId": 2, // The number indicated by the rectangle label in the screenshot
+  "attributes": // Attributes of the element
+    {"data-id":"@submit s0","class":".gh-search","aria-label":"搜索","nodeType":"IMG", "src": "image_url"},
+  "rect": { "left": 16, "top": 378, "width": 89, "height": 16 } // Position of the element in the page
+}
+
+id=5a29bf6419bd: {
+  "content": "获取优惠券",
+  "attributes": { "nodeType": "TEXT" },
+  "rect": { "left": 32, "top": 332, "width": 70, "height": 18 }
+}
+
+...many more
+====================
+
+By viewing the page screenshot and description, you should consider this and output the JSON:
+
+* The main steps should be: tap the switch button, sleep, and tap the 'English' option
+* The language switch button is shown in the screenshot, but it's not marked with a rectangle. So we have to use the page description to find the element. By carefully checking the context information (coordinates, attributes, content, etc.), you can find the element.
+* The "English" option button is not shown in the screenshot now, it means it may only show after the previous actions are finished. So the last action will have a \`null\` value in the \`locate\` field.
+* The task cannot be accomplished (because we cannot see the "English" option now), so a \`furtherPlan\` field is needed.
+
+{
+  "actions":[
+    {
+      "type": "Tap",
+      "thought": "Click the language switch button to open the language options.",
+      "param": null,
+      "locate": {"id": "c81c4e9a33", "prompt": "the search bar"},
+    },
+    {
+      "type": "Sleep",
+      "thought": "Wait for 1 second to ensure the language options are displayed.",
+      "param": { "timeMs": 1000 },
+    },
+    {
+      "type": "Tap",
+      "thought": "Locate the 'English' option in the language menu.",
+      "param": null,
+      "locate": null
+    },
+  ],
+  "error": null,
+  "taskWillBeAccomplished": false,
+  "furtherPlan": {
+    "whatToDoNext": "find the 'English' option and click on it",
+    "whatHaveDone": "Click the language switch button and wait 1s"
+  }
+}
+
+Here is another example of how to tolerate error situations only when the instruction is an "if" statement:
+
+If the user says "If there is a popup, close it", you should consider this and output the JSON:
+
+* By viewing the page screenshot and description, you cannot find the popup, so the condition is falsy.
+* The instruction itself is an "if" statement, it means the user can tolerate this situation, so you should leave a \`FalsyConditionStatement\` action.
+
+{
+  "actions": [{
+      "type": "FalsyConditionStatement",
+      "thought": "There is no popup on the page",
+      "param": null
+    }
+  ],
+  "taskWillBeAccomplished": true,
+  "furtherPlan": null
+}
+
+For contrast, if the user says "Close the popup" in this situation, you should consider this and output the JSON:
+
+{
+  "actions": [],
+  "error": "The instruction and page context are irrelevant, there is no popup on the page",
+  "taskWillBeAccomplished": true,
+  "furtherPlan": null
+}
+
+Here is an example of when task is accomplished, don't plan more actions:
+
+When the user ask to "Wait 4s", you should consider this:
+
+{
+  "actions": [
+    {
+      "type": "Sleep",
+      "thought": "Wait for 4 seconds",
+      "param": { "timeMs": 4000 },
+    },
+  ],
+  "taskWillBeAccomplished": true,
+  "furtherPlan": null // All steps have been included in the actions, so no further plan is needed
+}
+
+Here is an example of what NOT to do:
+
+Wrong output:
+
+{
+  "actions":[
+    {
+      "type": "Tap",
+      "thought": "Click the language switch button to open the language options.",
+      "param": null,
+      "locate": {
+        {"id": "c81c4e9a33", "prompt": "the search bar"}, // WRONG:prompt is missing
+      }
+    },
+    {
+      "type": "Tap",
+      "thought": "Click the English option",
+      "param": null,
+      "locate": null, // This means the 'English' option is not shown in the screenshot, the task cannot be accomplished
+    }
+  ],
+  "taskWillBeAccomplished": false,
+  // WRONG: should not be null
+  "furtherPlan": null,
+}
+
+Reason:
+* The \`prompt\` is missing in the first 'Locate' action
+* Since the option button is not shown in the screenshot, the task cannot be accomplished, so a \`furtherPlan\` field is needed.
+"
+`;
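
Review note: the snapshot above describes the planner's output contract in prose and pseudo-JSON. As a rough aid for reading it, the sketch below transcribes that contract into TypeScript types and checks three rules the prompt states: a non-null `locate` needs a `prompt`, an empty `actions` array needs an `error` reason, and `furtherPlan` must be set when `taskWillBeAccomplished` is false. All type and function names here (`PlanningAction`, `PlanningResponse`, `findPlanProblems`) are hypothetical and are not taken from the repository.

```typescript
// Types transcribed from the prompt text; names are illustrative only.
type LocateParam = { id: string; prompt?: string } | null;

type ActionType =
  | 'Tap'
  | 'Hover'
  | 'Input'
  | 'KeyboardPress'
  | 'Scroll'
  | 'FalsyConditionStatement'
  | 'Sleep';

interface PlanningAction {
  type: ActionType;
  thought: string;
  param: Record<string, unknown> | null;
  locate?: LocateParam;
}

interface PlanningResponse {
  actions: PlanningAction[];
  taskWillBeAccomplished: boolean;
  furtherPlan: { whatHaveDone: string; whatToDoNext: string } | null;
  error?: string | null;
}

// Hypothetical checker for the rules spelled out in the prompt.
function findPlanProblems(plan: PlanningResponse): string[] {
  const problems: string[] = [];
  plan.actions.forEach((action, index) => {
    // "prompt" may only be omitted when locate itself is null.
    if (action.locate && !action.locate.prompt) {
      problems.push(`actions[${index}].locate has no prompt`);
    }
  });
  // An empty plan must explain itself in the error field.
  if (plan.actions.length === 0 && !plan.error) {
    problems.push('empty actions array without an error reason');
  }
  // An unfinished task must hand over a further plan.
  if (!plan.taskWillBeAccomplished && plan.furtherPlan === null) {
    problems.push('furtherPlan is null although the task is not accomplished');
  }
  return problems;
}

// Modeled on the "what NOT to do" example in the prompt.
const wrongPlan: PlanningResponse = {
  actions: [
    { type: 'Tap', thought: 'Open the language options.', param: null, locate: { id: 'c81c4e9a33' } },
    { type: 'Tap', thought: 'Click the English option', param: null, locate: null },
  ],
  taskWillBeAccomplished: false,
  furtherPlan: null,
};

console.log(findPlanProblems(wrongPlan));
```

The checker deliberately skips the "param can be null only for Tap or Hover" rule from the schema description, because the prompt's own `FalsyConditionStatement` example also uses `param: null`.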

packages/midscene/tests/ai/evaluate/assertion.test.ts

Lines changed: 1 addition & 1 deletion
@@ -72,7 +72,7 @@ describe('ai inspect element', () => {
       console.log('assertion passed, thought:', result?.content?.thought);
     },
     {
-      timeout: 60 * 1000,
+      timeout: 3 * 60 * 1000,
     },
   );
 });

packages/midscene/tests/ai/evaluate/plan/__snapshots__/planning-input.test.ts.snap

Lines changed: 4 additions & 4 deletions
@@ -5,7 +5,7 @@ exports[`automation - planning input > input value 1`] = `
   {
     "locate": {
       "id": "fbc2d002",
-      "prompt": "the input field with placeholder 'What needs to be done?'",
+      "prompt": "the input field labeled 'What needs to be done?'",
     },
     "param": {
       "value": "learning english",
@@ -21,7 +21,7 @@ exports[`automation - planning input > input value 2`] = `
   {
     "locate": {
       "id": "fbc2d002",
-      "prompt": "the input field labeled 'What needs to be done?'",
+      "prompt": "the input field with placeholder 'What needs to be done?'",
     },
     "param": {
       "value": "learning english",
@@ -45,7 +45,7 @@ exports[`automation - planning input > input value Add, delete, correct and chec
   {
     "locate": {
       "id": "fbc2d002",
-      "prompt": "the task input box with content 'Learn English'",
+      "prompt": "the task input box with the content 'Learn English'",
     },
     "param": {
       "value": "Learn English tomorrow",
@@ -61,7 +61,7 @@ exports[`automation - planning input > input value Add, delete, correct and chec
   {
     "locate": {
       "id": "fbc2d002",
-      "prompt": "the task input box containing 'Learn English'",
+      "prompt": "the input box containing 'Learn English'",
     },
     "param": {
       "value": "Learn Skiing",

packages/midscene/tests/ai/extract/__snapshots__/extract.test.ts.snap

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ exports[`extract > online order 1`] = `
     },
   ],
   "errors": [],
-  "language": "zh",
+  "language": "en",
 }
 `;

packages/midscene/tests/ai/prompt.test.ts

Lines changed: 1 addition & 2 deletions
@@ -4,8 +4,7 @@ import { describe, expect, it, test } from 'vitest';
 describe('automation - computer', () => {
   it('should be able to generate prompt', async () => {
     const prompt = await systemPromptToTaskPlanning();
-    console.log(prompt);
     expect(prompt).toBeDefined();
+    expect(prompt).toMatchSnapshot();
   });
 });
-test('inspect with quick answer', async () => {});
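
Review note: this hunk swaps a `console.log` for `toMatchSnapshot()`, so the generated planning prompt is now compared against a stored copy instead of only being checked for existence. The sketch below illustrates the general idea of snapshot comparison with an in-memory store. It is a conceptual stand-in, not Vitest's implementation; `snapshots`, `matchSnapshot`, and `buildPrompt` are hypothetical names.

```typescript
// Hypothetical in-memory store; a real test runner persists snapshots on disk.
const snapshots = new Map<string, string>();

// Records the value on first sight, then compares later values against it.
function matchSnapshot(name: string, actual: string): boolean {
  const stored = snapshots.get(name);
  if (stored === undefined) {
    snapshots.set(name, actual);
    return true;
  }
  return stored === actual;
}

// Stand-in for a prompt generator such as systemPromptToTaskPlanning().
function buildPrompt(role: string): string {
  return `## Role\n\n${role}`;
}

const testName = 'should be able to generate prompt 1';

// First run records the prompt, second run matches it.
console.log(matchSnapshot(testName, buildPrompt('UI automation professional')));
console.log(matchSnapshot(testName, buildPrompt('UI automation professional')));
// Any wording change in the prompt now shows up as a mismatch.
console.log(matchSnapshot(testName, buildPrompt('UI automation expert')));
```

With the old `toBeDefined()` assertion alone, the third call's change would have gone unnoticed.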

packages/web-integration/package.json

Lines changed: 1 addition & 1 deletion
@@ -130,7 +130,7 @@
     "cors": "2.8.5",
     "express": "4.21.1",
     "inquirer": "10.1.5",
-    "openai": "4.57.1",
+    "openai": "4.81.0",
     "socket.io": "4.8.1",
     "socket.io-client": "4.8.1"
   },

packages/web-integration/tests/ai/web/playwright/ai-auto-todo.spec.ts

Lines changed: 3 additions & 1 deletion
@@ -34,7 +34,9 @@ test('ai todo', async ({ ai, aiQuery }) => {
   await ai('Click the checkbox next to the second task');
   await ai('Click the "completed" Status button below the task list');
 
-  const taskList = await aiQuery<string[]>('string[], tasks in the list');
+  const taskList = await aiQuery<string[]>(
+    'string[], Extract all task names from the list',
+  );
   expect(taskList.length).toBe(1);
   expect(taskList[0]).toBe('Learning AI the day after tomorrow');

packages/web-integration/tests/ai/web/puppeteer/showcase.test.ts

Lines changed: 21 additions & 11 deletions
@@ -91,17 +91,27 @@ describe(
       expect(names.length).toBeGreaterThan(5);
     });
 
-    it('search engine', async () => {
-      const { originPage, reset } = await launchPage('https://www.baidu.com/');
-      resetFn = reset;
-      const mid = new PuppeteerAgent(originPage);
-      await mid.aiAction('type "AI 101" in search box');
-      await mid.aiAction(
-        'type "Hello world" in search box, hit Enter, wait 2s, click the second result, wait 4s',
-      );
+    it(
+      'search engine',
+      async () => {
+        const { originPage, reset } = await launchPage(
+          'https://www.baidu.com/',
+        );
+        resetFn = reset;
+        const mid = new PuppeteerAgent(originPage);
+        await mid.aiAction('type "AI 101" in search box');
+        await mid.aiAction(
+          'type "Hello world" in search box, hit Enter, wait 2s, click the second result, wait 4s',
+        );
 
-      await mid.aiWaitFor('there are some search results about "Hello world"');
-    });
+        await mid.aiWaitFor(
+          'there are some search results about "Hello world"',
+        );
+      },
+      {
+        timeout: 3 * 60 * 1000,
+      },
+    );
 
     it('scroll', async () => {
       const htmlPath = path.join(__dirname, 'scroll.html');
@@ -152,6 +162,6 @@ describe(
     });
   },
   {
-    timeout: 60 * 1000,
+    timeout: 4 * 60 * 1000,
   },
 );
