- Writeup on the initial experiment
- Writeup on follow-up experiments
- Writeup on applying the methods to the FVAPPS data set
- install Lean 4
- install Poetry
- clone this repository
- cd into the repo, then clone LeanTool
- install LeanTool by following its instructions, which includes installing Pantograph
- modify `pyproject.toml` in this directory (WakingUp) to have the right path name for Pantograph's wheel file
- run `poetry install`
- run `lake update`
- `cd pbt`, then run `lake update` again (unfortunately, the scripts in the `pbt` directory depend on a different version of Mathlib, so we end up installing two versions of Mathlib in two different directories)
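For reference, the whole installation sequence looks roughly like the sketch below. The clone URLs and the Pantograph wheel path are placeholders, and the LeanTool step is only summarized in a comment since its own instructions govern that part.

```bash
# Rough sketch of the steps above; <...> placeholders must be replaced
# with the actual URLs/paths for your setup.
git clone <WakingUp-repo-url> WakingUp
cd WakingUp
git clone <LeanTool-repo-url> LeanTool
# Install LeanTool per its instructions (this includes installing Pantograph),
# then edit pyproject.toml so it points at Pantograph's wheel file.
poetry install
lake update
cd pbt
lake update   # pulls in a second copy of Mathlib used by the pbt scripts
cd ..
```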
The repo includes two data files:

- `code_contests_sample_passed.jsonl`: 10 autoformalized problem instances, created from the `code_contests` data set via the FormalizeWithTest pipeline.
- `easy_with_tests.jsonl`: CodeProofBenchmark problems, with test cases automatically generated using the script `pbt/make_tests.py`.
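If you just want a quick look at these files, a JSON-lines peek is enough; this is only a convenience sketch, not part of the pipeline:

```bash
# Count the problem instances and pretty-print the first record of each file.
wc -l code_contests_sample_passed.jsonl easy_with_tests.jsonl
head -n 1 code_contests_sample_passed.jsonl | python -m json.tool
head -n 1 easy_with_tests.jsonl | python -m json.tool
```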
- in the repo directory, run `poetry shell`
- To test an AI model in code-only mode:
  `python code_only.py code_contests_sample_passed.jsonl <output_file> <model_name>`
  where `<model_name>` can be `gpt` (for GPT-4o), `sonnet` (for Sonnet 3.5), or `deepseek` (for DeepSeek v3). You can also try any other model supported by LiteLLM.
- Manually inspect the outputs. In our initial experiment, we focused on a particular problem where DeepSeek passed 3 out of 4 test cases. Extract the relevant lines (e.g., using `grep`) for further processing, as sketched below.
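As a concrete (hypothetical) illustration, a DeepSeek run followed by extraction could look like the following; the output file name and the grep pattern are made up, since the exact fields in the output JSONL depend on what `code_only.py` writes.

```bash
# Hypothetical example: run DeepSeek v3 in code-only mode, then pull out
# the records for the one problem we want to study further.
python code_only.py code_contests_sample_passed.jsonl deepseek_out.jsonl deepseek
# The pattern below is illustrative -- grep for whatever string identifies
# your problem of interest in the actual output.
grep '<problem_identifier>' deepseek_out.jsonl > deepseek_one_problem.jsonl
```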
- To apply PBT (property-based testing):
  `cd pbt`
  `python pbt.py ../<input_file> ../<output_file>`
- To prompt the model to correct the code given PBT results:
  `cd ..`
  `python pbt_recog.py <input_file> <output_file> <model_name>`
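Continuing with the hypothetical file names from the sketch above, the PBT and recovery steps chain together like this:

```bash
# Hypothetical continuation: property-based testing on the extracted records,
# then prompt the same model to repair its code given the PBT feedback.
cd pbt
python pbt.py ../deepseek_one_problem.jsonl ../deepseek_one_problem_pbt.jsonl
cd ..
python pbt_recog.py deepseek_one_problem_pbt.jsonl deepseek_recovered.jsonl deepseek
```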
Examples of hallucination detection and recovery are collected in the directory `episodes/`.