Zebra Puzzles

Generation and LLM evaluation of zebra puzzles in multiple languages and themes.

Usage

Run uv run src/scripts/build_dataset.py to generate puzzles.

Run uv run src/scripts/evaluate.py to evaluate puzzles.

Run uv run src/scripts/plot_performance.py to plot and compare puzzle evaluation performance.

Run uv run src/scripts/fix_files.py to combine datasets. Use the script to edit many filenames at once and/or move files to another folder.

Run uv run src/scripts/format_dataets.py to format and push a dataset to Hugging Face.

Use the configuration in config/config.yaml to specify:

language and theme of puzzles
model for evaluation (e.g. gpt-4o-mini, gpt-4o, o3-mini, o3)
whether to generate new LLM responses
data folders
number of puzzles to generate
puzzle dimensions
clue type weights
number of red herrings to include

The chosen main data folder contains puzzles, their solutions, LLM responses, chosen clue types and the indices of the red herring clues in each puzzle. LLM scores are saved in the 'scores' subfolder. Plots and cross-model comparisons are saved in the 'plots' subfolder.

Puzzles can be evaluated using fewer red herrings than they were generated with. This allows for measuring the impact of red herrings. If the number of red herrings is reduced, the new version of the puzzle is saved in a 'reduced_puzzles' folder, and the clue types are saved in a 'reduced_clue_types' folder.
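The reduction step described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name reduce_red_herrings, the 0-based indexing and the choice to keep the red herrings that come first in the index list are all assumptions.

```python
def reduce_red_herrings(clues, red_herring_indices, n_keep):
    """Return the clues with all but n_keep red herrings removed.

    Hypothetical helper: assumes 0-based indices and keeps the red herrings
    that come first in red_herring_indices.
    """
    if n_keep > len(red_herring_indices):
        raise ValueError("Cannot keep more red herrings than the puzzle has.")
    # Every red herring after the first n_keep is dropped.
    dropped = set(red_herring_indices[n_keep:])
    return [clue for i, clue in enumerate(clues) if i not in dropped]


clues = ["real clue A", "red herring X", "real clue B", "red herring Y"]
reduced = reduce_red_herrings(clues, red_herring_indices=[1, 3], n_keep=1)
print(reduced)
```

Evaluating the same puzzle with several values of n_keep would then isolate the effect of the red herrings, since the real clues stay unchanged.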

Example

The following is an example of a 2x3 puzzle with 5 red herrings. The theme is houses and the language is English.

A row of houses have numbers 1 to 2 from left to right.

In each house lives a person with unique attributes in each of the following categories:

Nationalities: Faroe Islands and United Kingdom.
Drinks: cocoa and coffee.
Hobbies: handball and painting.

We also know the following:

1. The Faroese thinks the second best fruit is mango.
2. The coffee drinker plays handball.
3. The Brit lives in house no. 2.
4. The person who paints does not live in house no. 1.
5. The person with a sister plays video games.
6. The person who paints lives next to the person who often sails.
7. The person who owns a cactus does not live in house no. 1.
8. The person who paints wears glasses.

Who has which attributes and lives in which house?

Please submit your answer as a JSON dictionary in the format below. Each row must begin with object_X where X is the house number. Each column represents a category, and they should be in the same order as in the list of categories above.

{
    "object_1": [
        "nationalities_1",
        "drinks_1",
        "hobbies_1"
    ],
    "object_2": [
        "nationalities_2",
        "drinks_2",
        "hobbies_2"
    ]
}
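As a sanity check on the example above, the puzzle is small enough to solve by brute force. The sketch below is an illustration only and is not how the project generates or solves puzzles. It assumes that clues 2, 3 and 4 are the real clues and that the remaining five, which mention attributes outside the listed categories, are the red herrings. The function names are hypothetical.

```python
from itertools import permutations, product

categories = {
    "nationalities": ["Faroe Islands", "United Kingdom"],
    "drinks": ["cocoa", "coffee"],
    "hobbies": ["handball", "painting"],
}


def satisfies_clues(houses):
    """houses[i] is (nationality, drink, hobby) for house no. i + 1."""
    for number, (nationality, drink, hobby) in enumerate(houses, start=1):
        # Clue 2: the coffee drinker plays handball.
        if drink == "coffee" and hobby != "handball":
            return False
        # Clue 3: the Brit lives in house no. 2.
        if nationality == "United Kingdom" and number != 2:
            return False
        # Clue 4: the person who paints does not live in house no. 1.
        if hobby == "painting" and number == 1:
            return False
    return True


def solve():
    """Try every assignment of attributes to houses and keep the valid ones."""
    solutions = []
    orderings = (permutations(values) for values in categories.values())
    for combo in product(*orderings):
        houses = list(zip(*combo))  # one tuple of attributes per house
        if satisfies_clues(houses):
            solutions.append(
                {f"object_{i}": list(h) for i, h in enumerate(houses, start=1)}
            )
    return solutions


solutions = solve()
print(solutions)
```

Under these assumptions exactly one assignment survives: the Faroese coffee drinker who plays handball lives in house no. 1, and the Brit who drinks cocoa and paints lives in house no. 2. The dictionary is in the same object_X format as the requested answer.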

Typical runtimes

Typical runtimes for generating a puzzle of size n_objects x n_attributes are (using all clue types):

3x7: 0.7 s
4x4: 0.6 s
4x5: 13 s
4x6: 3 min
5x3: 3.8 s
5x6: >10 min
6x3: 4 min
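One rough way to see why generation time grows so quickly with puzzle size is to count the possible solution grids: each attribute category can be distributed over the objects in n_objects! ways, giving (n_objects!)^n_attributes grids in total. The sketch below only computes this count. It is not taken from the project, the function name is hypothetical, and the generator's actual runtime also depends on how clues are selected, which this does not model.

```python
from math import factorial


def n_solution_grids(n_objects, n_attributes):
    """Number of ways to assign every attribute category to the objects."""
    return factorial(n_objects) ** n_attributes


sizes = [(3, 7), (4, 4), (4, 5), (4, 6), (5, 3), (5, 6), (6, 3)]
for n_objects, n_attributes in sizes:
    count = n_solution_grids(n_objects, n_attributes)
    print(f"{n_objects}x{n_attributes}: {count:,} grids")
```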

Typical times for evaluation of a puzzle without red herrings:

gpt-4o-mini:

3x3: 1.5 s
4x4: 2 s
4x5: 2 s

o3-mini:

2x2: 6 s
3x3: 25 s (35 s with 5 red herrings)
4x4: 2 min
4x5: 8 min

GitHub Copilot has been used for this project.

Developer:

Sofie Helene Bruun (sofie.bruun@alexandra.dk)

Setup

Installation

Run make install, which sets up a virtual environment and all Python dependencies therein.
Run source .venv/bin/activate to activate the virtual environment.
(Optional) Run make install-pre-commit, which installs pre-commit hooks for linting, formatting and type checking.

Adding and Removing Packages

To install new PyPI packages, run:

uv add <package-name>

To remove them again, run:

uv remove <package-name>

To show all installed packages, run:

uv pip list

All Built-in Commands

The project includes the following convenience commands:

make install: Install the project and its dependencies in a virtual environment.
make install-pre-commit: Install pre-commit hooks for linting, formatting and type checking.
make lint: Lint the code using ruff.
make format: Format the code using ruff.
make type-check: Type check the code using mypy.
make test: Run tests using pytest and update the coverage badge in the readme.
make docker: Build a Docker image and run the Docker container.
make docs: View documentation locally in a browser.
make publish-docs: Publish documentation to GitHub Pages.
make tree: Show the project structure as a tree.

A Word on Modules and Scripts

In the src directory there are two subdirectories, zebra_puzzles and scripts. This is a brief explanation of the differences between the two.

Modules

All Python files in the zebra_puzzles directory are modules internal to the project package. Examples here could be a general data loading script, a definition of a model, or a training function. Think of modules as all the building blocks of a project.

When a module is importing functions/classes from other modules we use the relative import notation - here's an example:

from .other_module import some_function

Scripts

Python files in the scripts folder are scripts: short code snippets that are external to the project package and are meant to actually run the code. As such, only scripts will be called from the terminal. An analogy is that the internal numpy code consists of modules, whereas the Python code you write, where you import some numpy functions and actually run them, is a script.

When importing module functions/classes when you're in a script, you do it like you would normally import from any other package:

from zebra_puzzles import some_function

Note that this is also how we import functions/classes in tests, since each test Python file is also a Python script, rather than a module.

Features

Docker Setup

A Dockerfile is included in the repository, which by default runs src/scripts/main.py. You can build the Docker image and run the Docker container by running make docker.

Automatic Documentation

Run make docs to create the documentation in the docs folder, which is based on the docstrings in your code. You can publish this documentation to GitHub Pages by running make publish-docs. To add more manual documentation pages, simply add more Markdown files to the docs directory; these will automatically be included in the documentation.

Automatic Test Coverage Calculation

Run make test to test your code, which also updates the "coverage badge" in the README, showing how much of your code base is currently being tested.

Continuous Integration

GitHub CI pipelines are included in the repo. They run all the tests in the tests directory and build the online documentation if GitHub Pages has been enabled for the repository (this can be enabled on GitHub in the repository settings).

Code Spaces

Code Spaces is a feature on GitHub that allows you to develop on a project completely in the cloud, without having to do any local setup at all. This repo includes a configuration file for running code spaces on GitHub. When the project is hosted on alexandrainst/zebra_puzzles, simply press the <> Code button and add a code space to get started, which will open a VSCode window directly in your browser.
