Generation and LLM evaluation of zebra puzzles in multiple languages and themes.
Run uv run src/scripts/build_dataset.py
to generate puzzles.
Run uv run src/scripts/evaluate.py
to evaluate puzzles.
Run uv run src/scripts/plot_performance.py
to plot and compare puzzle evaluation performance.
Run uv run src/scripts/fix_files.py
to combine datasets. Use the script to edit many filenames at once and/or move files to another folder.
Run uv run src/scripts/format_dataets.py
to format and push a dataset to Hugging Face.
Use the configuration in config/config.yaml
to specify:
- language and theme of puzzles
- model for evaluation (e.g. gpt-4o-mini, gpt-4o, o3-mini, o3)
- whether to generate new LLM responses
- data folders
- number of puzzles to generate
- puzzle dimensions
- clue type weights
- number of red herrings to include
The chosen main data folder contains puzzles, their solutions, LLM responses, chosen clue types and the indices of the red herring clues in each puzzle. LLM scores are saved in the 'scores' subfolder. Plots and cross-model comparisons are saved in the 'plots' subfolder.
Puzzles can be evaluated using fewer red herrings than they were generated with. This allows for measuring the impact of red herrings. If the number of red herrings is reduced, the new version of the puzzle is saved in a 'reduced_puzzles' folder, and the clue types are saved in a 'reduced_clue_types' folder.
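As a rough illustration of the reduction described above, the sketch below drops surplus red herrings from a list of clues and recomputes the indices of the ones that remain. The function name reduce_red_herrings, the 0-based indexing, and the choice to keep the first n_keep red herrings are assumptions made for this example; the project's own implementation may differ.

```python
def reduce_red_herrings(clues, red_herring_indices, n_keep):
    """Keep only the first n_keep red herrings (hypothetical helper).

    Returns the reduced clue list and the positions of the remaining
    red herrings within that reduced list (0-based, an assumption).
    """
    if not 0 <= n_keep <= len(red_herring_indices):
        raise ValueError("n_keep must be between 0 and the number of red herrings")

    keep = set(red_herring_indices[:n_keep])
    drop = set(red_herring_indices) - keep

    reduced_clues = []
    reduced_indices = []
    for i, clue in enumerate(clues):
        if i in drop:
            continue  # surplus red herring: leave it out
        if i in keep:
            # Record where this red herring ends up in the reduced list.
            reduced_indices.append(len(reduced_clues))
        reduced_clues.append(clue)
    return reduced_clues, reduced_indices


# Clues from the example puzzle in this README.
clues = [
    "The Faroese thinks the second best fruit is mango.",
    "The coffee drinker plays handball.",
    "The Brit lives in house no. 2.",
    "The person who paints does not live in house no. 1.",
    "The person with a sister plays video games.",
    "The person who paints lives next to the person who often sails.",
    "The person who owns a cactus does not live in house no. 1.",
    "The person who paints wears glasses.",
]
# Inferred for this example: clues that mention attributes outside the
# three listed categories are the red herrings.
red_herring_indices = [0, 4, 5, 6, 7]

reduced_clues, reduced_indices = reduce_red_herrings(clues, red_herring_indices, n_keep=2)
print(len(reduced_clues), reduced_indices)
```

Calling the helper with n_keep=0 leaves only the three clues that actually constrain the solution.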
The following is an example of a 2x3 puzzle with 5 red herrings. The theme is houses and the language is English.
A row of houses have numbers 1 to 2 from left to right.
In each house lives a person with unique attributes in each of the following categories:
Nationalities: Faroe Islands and United Kingdom.
Drinks: cocoa and coffee.
Hobbies: handball and painting.
We also know the following:
1. The Faroese thinks the second best fruit is mango.
2. The coffee drinker plays handball.
3. The Brit lives in house no. 2.
4. The person who paints does not live in house no. 1.
5. The person with a sister plays video games.
6. The person who paints lives next to the person who often sails.
7. The person who owns a cactus does not live in house no. 1.
8. The person who paints wears glasses.
Who has which attributes and lives in which house?
Please submit your answer as a JSON dictionary in the format below. Each row must begin with object_X where X is the house number. Each column represents a category, and they should be in the same order as in the list of categories above.
{
"object_1": [
"nationalities_1",
"drinks_1",
"hobbies_1"
],
"object_2": [
"nationalities_2",
"drinks_2",
"hobbies_2"
]
}
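For a puzzle this small, the answer can be checked by brute force. The sketch below is not the project's solver: it hard-codes the three clues that refer to the listed categories (clues 2, 3 and 4), tries every assignment, and prints the unique solution in the requested JSON format. The variable and function names are illustrative assumptions.

```python
import json
from itertools import permutations, product

# Categories in the same order as in the puzzle text.
categories = {
    "nationalities": ["Faroe Islands", "United Kingdom"],
    "drinks": ["cocoa", "coffee"],
    "hobbies": ["handball", "painting"],
}


def satisfies_clues(houses):
    """houses[i] is a (nationality, drink, hobby) tuple for house i + 1."""
    # Clue 2: the coffee drinker plays handball.
    coffee_plays_handball = all(
        hobby == "handball" for _, drink, hobby in houses if drink == "coffee"
    )
    # Clue 3: the Brit lives in house no. 2.
    brit_in_house_2 = houses[1][0] == "United Kingdom"
    # Clue 4: the person who paints does not live in house no. 1.
    painter_not_in_house_1 = houses[0][2] != "painting"
    return coffee_plays_handball and brit_in_house_2 and painter_not_in_house_1


solutions = []
# Each category is permuted independently across the houses.
for columns in product(*(permutations(values) for values in categories.values())):
    houses = list(zip(*columns))  # one tuple of attributes per house
    if satisfies_clues(houses):
        solutions.append(houses)

# A well-formed zebra puzzle has exactly one solution.
assert len(solutions) == 1

answer = {
    f"object_{number}": list(house)
    for number, house in enumerate(solutions[0], start=1)
}
print(json.dumps(answer, indent=4))
```

The red herring clues are ignored here because they mention attributes outside the three categories and therefore cannot rule out any assignment.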
Typical runtimes for generating a puzzle of size n_objects x n_attributes are (using all clue types):
- 3x7: 0.7 s
- 4x4: 0.6 s
- 4x5: 13 s
- 4x6: 3 min
- 5x3: 3.8 s
- 5x6: >10 min
- 6x3: 4 min
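One rough way to see why runtimes grow so quickly is to count the complete assignments a puzzle of a given size admits. With the house positions fixed, each attribute category can be arranged in n_objects! ways, which gives (n_objects!)^n_attributes candidates. The sketch below only performs this arithmetic; it is a proxy for difficulty and makes no claim about how the generator in this project searches or prunes. The function name is hypothetical.

```python
from math import factorial


def n_candidate_solutions(n_objects, n_attributes):
    """Number of ways to assign n_attributes categories to n_objects houses."""
    return factorial(n_objects) ** n_attributes


# Puzzle sizes listed in the runtime table above.
sizes = [(3, 7), (4, 4), (4, 5), (4, 6), (5, 3), (5, 6), (6, 3)]
for n_objects, n_attributes in sizes:
    count = n_candidate_solutions(n_objects, n_attributes)
    print(f"{n_objects}x{n_attributes}: {count:,} candidate solutions")
```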
Typical times for evaluation of a puzzle without red herrings:
gpt-4o-mini:
- 3x3: 1.5 s
- 4x4: 2 s
- 4x5: 2 s
o3-mini:
- 2x2: 6 s
- 3x3: 25 s (35 s with 5 red herrings)
- 4x4: 2 min
- 4x5: 8 min
GitHub Copilot has been used for this project.
Developer:
- Sofie Helene Bruun (sofie.bruun@alexandra.dk)
To set up the project:
- Run make install, which sets up a virtual environment and all Python dependencies therein.
- Run source .venv/bin/activate to activate the virtual environment.
- (Optional) Run make install-pre-commit, which installs pre-commit hooks for linting, formatting and type checking.
To install new PyPI packages, run:
uv add <package-name>
To remove them again, run:
uv remove <package-name>
To show all installed packages, run:
uv pip list
The project includes the following convenience commands:
- make install: Install the project and its dependencies in a virtual environment.
- make install-pre-commit: Install pre-commit hooks for linting, formatting and type checking.
- make lint: Lint the code using ruff.
- make format: Format the code using ruff.
- make type-check: Type check the code using mypy.
- make test: Run tests using pytest and update the coverage badge in the readme.
- make docker: Build a Docker image and run the Docker container.
- make docs: View documentation locally in a browser.
- make publish-docs: Publish documentation to GitHub Pages.
- make tree: Show the project structure as a tree.
In the src directory there are two subdirectories, zebra_puzzles and scripts. This is a brief explanation of the differences between the two.
All Python files in the zebra_puzzles directory are modules internal to the project package. Examples here could be a general data loading script, a definition of a model, or a training function. Think of modules as all the building blocks of a project.
When a module is importing functions/classes from other modules we use the relative import notation - here's an example:
from .other_module import some_function
Python files in the scripts folder are scripts: short code snippets that are external to the project package and are meant to actually run the code. As such, only scripts will be called from the terminal. An analogy here is that the internal numpy code consists of modules, whereas the Python code you write, where you import some numpy functions and actually run them, is a script.
When importing module functions/classes in a script, import them as you would from any other package:
from zebra_puzzles import some_function
Note that this is also how we import functions/classes in tests, since each test Python file is also a Python script, rather than a module.
A Dockerfile is included in the repository, which by default runs src/scripts/main.py. You can build the Docker image and run the Docker container by running make docker.
Run make docs to create the documentation in the docs folder, which is based on the docstrings in your code. You can publish this documentation to GitHub Pages by running make publish-docs. To add more manual documentation pages, simply add more Markdown files to the docs directory; these will automatically be included in the documentation.
Run make test to test your code, which also updates the "coverage badge" in the README, showing how much of your code base is currently being tested.
GitHub CI pipelines are included in the repo, running all the tests in the tests directory, as well as building online documentation if GitHub Pages has been enabled for the repository (this can be enabled on GitHub in the repository settings).
Codespaces is a GitHub feature that allows you to develop on a project completely in the cloud, without having to do any local setup at all. This repo includes a configuration file for running Codespaces on GitHub. When hosted on alexandrainst/zebra_puzzles, simply press the <> Code button and add a codespace to get started, which will open a VS Code window directly in your browser.