
Commit 4a9b86e

0425update
1 parent 2becb93 commit 4a9b86e

3 files changed: +34 −9 lines changed

README.md

+27 −3
@@ -2,14 +2,21 @@
![framework](pics/Framework.png)
Uncertainty of Thought (UoT) is a novel algorithm to augment large language models with the ability to actively seek information by asking effective questions.

We tested UoT on two categories of models: open-source (Llama-2-70B) and closed-source commercial (all other models). The results showed that UoT achieves an average performance improvement of **57.8%** in the rate of successful task completion across multiple LLMs compared with direct prompting, and it also improves efficiency (i.e., the number of questions needed to complete the task).

![result](pics/result.jpg)

To increase the number of items, we focus on the open-set scenario, where the possibility space is unknown rather than predefined. Because this setting imposes no constraints, the space can be treated as effectively infinite, allowing the number of items to grow.

![result os](pics/results_os.png)
- In medical diagnosis and troubleshooting, the initial patient or customer symptom descriptions help form an initial set of possibilities. In the game of 20 Questions, however, early information is limited, and committing to a possibility set too soon may lead the questioning in the wrong direction. Thus, for the first three rounds, the DPOS method is used to collect more information. Afterward, the UoT approach **updates the possibilities each round** to improve the questioning strategy (see the example run after this list).
- For each dataset, we configure the size of the possibility set for each update round, setting it to 10, 10, 10, 5, 5, and 5, respectively.
- Compared to DPOS, the UoT method significantly improves performance, with gains of **54.9%** for GPT-3.5 and **21.1%** for GPT-4.
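
As a concrete illustration, an open-set run could be launched roughly as follows (a minimal sketch: the flags are documented in the argument list further below, and the specific values are illustrative rather than the exact experimental configuration):

```bash
# Illustrative open-set run; flag values are examples, not the settings used in the experiments.
python run.py --guesser_model gpt-4 --task md --dataset DX \
    --open_set_size 10 --size_to_renew 3 --n_pre_ask 3
```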

## Update

- \[20/04/2024\]: Supplement the code and experiment results for the open-set scenario.
- \[19/03/2024\]: Supplement the experiment results of `Mistral-Large`, `Gemini-1.0-pro`, and `Claude-3-Opus` models.
- \[19/03/2024\]: Add the implementation for Gemini
- \[15/03/2024\]: Add the implementation for Gemma
@@ -68,28 +75,45 @@ For further details, follow the [official guidance](https://ai.google.dev/gemma/
Run experiments via `run.py`, which implements the UoT algorithm, as well as the naive prompting method. Arguments are as follows:
- `--guesser_model`: The name of the model used to plan and ask questions.
- `--temperature`: Temperature used when calling the guesser model.
- `--examiner_model`: The name of the model used to provide environment feedback. Currently fixed to `gpt-4`.
- `--task` and `--dataset`: Select the corresponding task name and dataset according to the table below.
| Description       | `--task` | `--dataset`           |
|-------------------|----------|-----------------------|
| 20 Questions Game | `20q`    | `bigbench` / `common` |
| Medical Diagnosis | `md`     | `DX` / `MedDG`        |
| Troubleshooting   | `tb`     | `FiaDial`             |
- `--task_start_index` and `--task_end_index`: Conduct the experiment on the [start, end) targets in the selected dataset. (Default: entire dataset)
- `--open_set_size`: Size of the possibility set after each update in the open-set setting. (Default: -1, which disables the open-set setting)
- `--size_to_renew`: When the size of the possibility set falls below this value, renew it to the size given by `--open_set_size`. (Only considered when `--open_set_size` > 0)
- `--n_pre_ask`: Number of DPOS rounds at the beginning. (Only considered when `--open_set_size` > 0 and self-report is disabled)
- `--naive_run`: If True, run with the naive prompting method; otherwise run UoT.
- `--inform`: If True, the guesser is given the answer set. (Only considered when `--naive_run` is True)
- `--reward_lambda`: Parameter $\lambda$ in the uncertainty-based reward.
- `--n_extend_layers`: Parameter $J$ -- Number of simulation steps.
- `--n_potential_actions`: Parameter $N$ -- Number of candidate actions generated.
- `--n_pruned_nodes`: Maximum number of nodes retained at each step.
  - If no pruning is applied, set it to 0;
  - To prune and keep an exact number of nodes, set it > 0 (e.g. `10`: at most 10 nodes, $M$ or $U$, remain at each step);
  - To prune and keep a proportion of nodes, set it < 0 (e.g. `-0.5`: 50% of the nodes remain at each step).
- `--expected_action_tokens`: Token threshold for the guesser's selected action; `gpt-3.5-turbo` is only called to simplify the action when it exceeds this many tokens.
- `--expected_target_tokens`: Max tokens for each target name; used to predict and set `max_tokens` when calling the guesser model.
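A complete invocation might then look like the following (a minimal sketch; every value here is illustrative, not a prescribed setting):

```bash
# Illustrative closed-set UoT run on the 20 Questions task; all values are examples only.
python run.py --guesser_model gpt-4 --examiner_model gpt-4 \
    --task 20q --dataset bigbench \
    --task_start_index 0 --task_end_index 50 \
    --reward_lambda 0.4 --n_extend_layers 3 \
    --n_potential_actions 5 --n_pruned_nodes 10
```

Adding `--naive_run` (with an explicit value if the argument parser expects one) switches to the direct-prompting baseline, while `--open_set_size`, `--size_to_renew`, and `--n_pre_ask` enable the open-set setting described above.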
## Implementation Note

pics/results_os.png

450 KB

src/uot/eval.py

+7 −6
@@ -3,19 +3,20 @@
 def evaluate_performance(file, task):
     cnt = success = 0
-    efficiency = success_efficiency = 0
+    length = success_length = 0
     with (open(file, 'r') as f):
         data = json.load(f)
         for i in data:
             cnt += 1
             if i['state'] == 1:
                 success += 1
-                success_efficiency += i['turn']
-                efficiency += i['turn']
+                success_length += i['turn']
+                length += i['turn']
             else:
-                efficiency += task.max_turn
+                length += task.max_turn
 
     print('Dialogue Count:', cnt)
     print('Success Rate:', success / cnt)
-    print('Efficiency in Successful Cases:', success_efficiency / success)
-    print('Efficiency:', efficiency / cnt)
+    print('Mean Conversation Length in Successful Cases:', success_length / success)
+    print('Mean Conversation Length:', length / cnt)
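For context, here is a minimal sketch of how the renamed metrics could be printed from a results file. The import path `uot.eval`, the results path, and the `max_turn` value are assumptions for illustration; the real script may construct a dedicated task object instead.

```python
from types import SimpleNamespace

from uot.eval import evaluate_performance  # assumed import path for src/uot/eval.py

# Hypothetical stand-in for the real task object; evaluate_performance only reads max_turn.
task = SimpleNamespace(max_turn=20)

# Expects a JSON list of dialogue records, each with 'state' (1 = success) and 'turn' fields.
# Prints Dialogue Count, Success Rate, and the two Mean Conversation Length metrics.
evaluate_performance('results/20q_bigbench.json', task)
```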
