This repository is a playground for exploring different approaches to deductive reasoning with LLMs. The project focuses on the creation, improvement, and reuse of logical formalisations as part of autoformalisation.
The methodology, data, and experimental results are described in our [paper](https://arxiv.org/abs/2502.04352) (accepted at ECAI 2025). If you build on this work, please cite:
@misc{hoppe_investigating_2025,
  title = {Investigating the Robustness of Deductive Reasoning with Large Language Models},
  author = {Hoppe, Fabian and Ilievski, Filip and Kalo, Jan-Christoph},
  url = {http://arxiv.org/abs/2502.04352},
  doi = {10.48550/arXiv.2502.04352},
}
At the moment the documentation of this repository is limited, but it may be extended in the future. The repository provides everything you need to reproduce our experiments and, hopefully, to build upon our work.
Feel free to contact me in case you identify bugs or run into other issues while using this project.
The following sections are a work in progress and provide some additional documentation for the repository.
The framework relies on Hydra for configuration management and MLFlow for experiment tracking; parts of the autoformalisation pipeline are implemented using LangGraph.
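To illustrate how these pieces typically fit together, here is a minimal sketch of a Hydra entry point that logs a run to MLFlow. The function, config name, experiment name, and metric are illustrative assumptions and do not correspond to the repository's actual code; the LangGraph part of the pipeline is omitted for brevity.

```python
# Minimal sketch (not the repository's actual entry point) of combining
# Hydra configuration with MLFlow experiment tracking.
import hydra
import mlflow
from omegaconf import DictConfig, OmegaConf


def run_evaluation(cfg: DictConfig) -> float:
    """Placeholder for the actual autoformalisation / reasoning pipeline."""
    return 0.0


@hydra.main(version_base=None, config_path="configs", config_name="eval")
def main(cfg: DictConfig) -> None:
    mlflow.set_experiment("deductive-reasoning-eval")  # hypothetical experiment name
    with mlflow.start_run():
        # Log the resolved Hydra config so every run stays reproducible.
        mlflow.log_params(OmegaConf.to_container(cfg, resolve=True))
        mlflow.log_metric("accuracy", run_evaluation(cfg))


if __name__ == "__main__":
    main()
```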
- src/langdeductive: Python package grouping all deductive reasoning models.
- configs: YAML configuration files for the framework.
- tests: Stub for unit tests.
- data: Folder for datasets (I use softlinks).
- outputs: Default folder for results.
Experiments:
- evaluate.py: Runs the evaluation on a set of datasets.
- summarize.py: Summarizes previous evaluation results from the MLflow tracking server (see the sketch below for the general idea).
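As a rough illustration of what such a summarisation step can look like, the snippet below queries previous runs from MLflow. The experiment name is a hypothetical placeholder, and this is not the repository's actual summarize.py.

```python
# Sketch: query previous evaluation runs from the MLflow tracking server.
# The experiment name is a hypothetical placeholder.
import mlflow

runs = mlflow.search_runs(experiment_names=["deductive-reasoning-eval"])

if runs.empty:
    print("No runs logged yet.")
else:
    # search_runs returns a pandas DataFrame with one row per run;
    # logged metrics appear as 'metrics.<name>' columns.
    metric_columns = [c for c in runs.columns if c.startswith("metrics.")]
    print(runs[["run_id", "status", *metric_columns]].head())
```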
The project contains a pyproject.toml file listing all dependencies (setting up the project should be possible with uv sync).
Running the experiments requires several environment variables that should be specified in a .env file:
PROJECT_ROOT="<absolute path to project dir>"
OPENAI_API_KEY=""
HUGGINGFACE_KEY=""
MLFLOW_TRACKING_USERNAME=""
MLFLOW_TRACKING_PASSWORD=""
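One common way to pick these variables up in Python is python-dotenv; whether the framework actually loads the .env file this way is an assumption, so treat the snippet below as a sketch only.

```python
# Sketch: load the .env file and verify the required variables are present.
# Assumes python-dotenv is installed; the framework may load them differently.
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current working directory

required = [
    "PROJECT_ROOT",
    "OPENAI_API_KEY",
    "HUGGINGFACE_KEY",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```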
Hydra composes the experiment configuration from the YAML files in configs, and MLFlow is used to track and store the results of each run. A default evaluation can be started with:
python -m evaluate --config-name eval