The project page for "LOGIC-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning"

NSombekke/Logic-LLM

 
 


An extension of Logic-LM that integrates Llama 3 and Linear Temporal Logic (LTL)

You can find the code in src (it has not been cleaned up yet; we will do this for the final submission).

Below is the README of the original Logic-LM repository.

Logic-LM

Data and Codes for "LOGIC-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning" (Findings of EMNLP 2023).

Authors: Liangming Pan, Alon Albalak, Xinyi Wang, William Yang Wang.

NLP Group, University of California, Santa Barbara

Introduction

Large Language Models (LLMs) have shown human-like reasoning abilities but still struggle with complex logical problems. This paper introduces a novel framework, Logic-LM, which integrates LLMs with symbolic solvers to improve logical problem-solving. Our method first utilizes LLMs to translate a natural language problem into a symbolic formulation. Afterward, a deterministic symbolic solver performs inference on the formulated problem. We also introduce a self-refinement module, which utilizes the symbolic solver's error messages to revise symbolic formalizations. We demonstrate Logic-LM's effectiveness on five logical reasoning datasets: ProofWriter, PrOntoQA, FOLIO, LogicalDeduction, and AR-LSAT. On average, Logic-LM achieves a significant performance boost of 39.2% over using LLM alone with standard prompting and 18.4% over LLM with chain-of-thought prompting. Our findings suggest that Logic-LM, by combining LLMs with symbolic logic, offers a promising avenue for faithful logical reasoning.
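The two core stages described above can be sketched in a few lines of Python. Here `llm` and `solver` are hypothetical callables standing in for the model API and the symbolic solver; they are not functions from this repository:

```python
# Minimal sketch of the Logic-LM pipeline (hypothetical helper names).
def logic_lm(problem_text, llm, solver):
    # Stage 1: the LLM translates the natural language problem
    # into a symbolic formulation (a "logic program").
    program = llm(f"Translate to a logic program:\n{problem_text}")
    # Stage 2: a deterministic symbolic solver performs the inference.
    return solver(program)
```

The self-refinement module described in the paper wraps this pipeline in a retry loop that feeds solver error messages back to the LLM.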

Figure: The general framework of Logic-LM.

Installation

First, install all the required packages:

pip install -r requirements.txt

Datasets

The datasets we used are preprocessed and stored in the ./data folder. We evaluate on the following datasets:

  • ProntoQA: Deductive reasoning dataset. We use the 5-hop subset of the fictional characters version, consisting of 500 testing examples.
  • ProofWriter: Deductive reasoning dataset. We use the depth-5 subset of the OWA version. To reduce overall experimentation costs, we randomly sample 600 examples from the test set and ensure a balanced label distribution.
  • FOLIO: First-Order Logic reasoning dataset. We use the entire FOLIO test set for evaluation, consisting of 204 examples.
  • LogicalDeduction: Constraint Satisfaction Problems (CSPs). We use the full test set consisting of 300 examples.
  • AR-LSAT: Analytical Reasoning (AR) problems, containing all analytical logic reasoning questions from the Law School Admission Test from 1991 to 2016. We use the test set which has 230 multiple-choice questions.
  • Drone Planning: Linear Temporal Logic (LTL) problems; this dataset contains drone-planning reasoning questions. We use a self-made dev set containing 50 multiple-choice questions.
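The balanced label sampling mentioned for ProofWriter can be sketched as follows. This is a generic per-label sampler for illustration, not the repository's actual preprocessing code, and the `label_key` field name is an assumption:

```python
import random
from collections import defaultdict

def balanced_sample(examples, label_key, n_total, seed=0):
    """Randomly sample n_total examples with an equal count per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)
    per_label = n_total // len(by_label)   # equal share for each label
    sample = []
    for group in by_label.values():
        sample.extend(rng.sample(group, per_label))
    rng.shuffle(sample)
    return sample
```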

Baselines

To replicate the Standard-LM (Direct) and the Chain-of-Thought (CoT) baselines, run the following command:

python ./src/models/model_baseline.py \
    --api_key "Your API Key (OpenAI or HuggingFace)" \
    --model_name "Model Name" \
    --dataset_name "Dataset Name [ProntoQA | ProofWriter | FOLIO | LogicalDeduction | AR-LSAT]" \
    --split dev \
    --mode "Baseline [Direct | CoT]" \
    --max_new_tokens "16 for Direct; 1024 for CoT"

The results will be saved in ./src/outputs/baselines/[dataset_name]/. To evaluate the results, run the following command:

python ./src/models/evaluate.py \
    --dataset_name "Dataset Name [ProntoQA | ProofWriter | FOLIO | LogicalDeduction | AR-LSAT]" \
    --model_name "Model Name [text-davinci-003 | gpt-4]" \
    --split dev \
    --mode "Baseline [Direct | CoT]"

Logic Program Generation

To generate logic programs for logical reasoning problems in each dataset, run the following command at the root directory:

python ./src/models/logic_program.py \
    --api_key "Your API Key (OpenAI or HuggingFace)" \
    --dataset_name "Dataset Name [ProntoQA | ProofWriter | FOLIO | LogicalDeduction | AR-LSAT]" \
    --split dev \
    --model_name "Model Name" \
    --framework "[openai | huggingface]" \
    --max_new_tokens 1024

The generated logic programs will be saved in ./src/outputs/logic_programs/[dataset_name]/. You can also reuse the logic programs we generated in ./src/outputs/logic_programs/.

Logic Inference with Symbolic Solver

After generating logic programs, we can perform inference with symbolic solvers. At the root directory, run the following commands:

DATASET="Dataset Name [ProntoQA | ProofWriter | FOLIO | LogicalDeduction | AR-LSAT]"
SPLIT="Dataset Split [dev | test]"
MODEL="The model that generated the logic programs"
BACKUP="Backup strategy: random guess (random) or CoT-Logic collaboration mode (LLM)"

python ./src/models/logic_inference.py \
    --model_name ${MODEL} \
    --dataset_name ${DATASET} \
    --split ${SPLIT} \
    --backup_strategy ${BACKUP} \
    --backup_LLM_result_path ./src/outputs/baselines/[dataset_name]/CoT_${DATASET}_${SPLIT}_${MODEL}.json

The logic reasoning results will be saved in outputs/logic_inferences.

Backup Strategies:

  • random: If the generated logic program cannot be executed by the symbolic solver, we use a random guess as the prediction.
  • LLM: If the generated logic program cannot be executed by the symbolic solver, we fall back to the CoT prediction. To run this mode, you need the corresponding baseline LLM results stored in ./src/outputs/baselines/[dataset_name]/. To make inference more efficient, the model simply loads the baseline LLM results and uses them as the prediction when the symbolic solver fails.
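The fallback logic behind the two strategies can be sketched as follows. This is an illustrative helper, not the repository's actual code; the `backup_results` mapping stands in for the stored baseline JSON:

```python
import random

def predict_with_backup(solver_output, options, backup_strategy,
                        backup_results=None, example_id=None, seed=0):
    """Return the solver's answer, or fall back when execution failed.

    solver_output  -- the solver's answer, or None if the program failed
    backup_results -- mapping from example id to a CoT prediction (LLM mode)
    """
    if solver_output is not None:
        return solver_output
    if backup_strategy == "random":
        # "random" mode: guess uniformly among the answer options.
        return random.Random(seed).choice(options)
    # "LLM" mode: reuse the stored chain-of-thought prediction.
    return backup_results[example_id]
```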

Evaluation

To evaluate the logic reasoning results, run the following command:

python ./src/models/evaluation.py \
    --dataset_name "Dataset Name [ProntoQA | ProofWriter | FOLIO | LogicalDeduction]" \
    --model_name "The model that generated the logic programs" \
    --split dev \
    --backup "The basic mode (random) or CoT-Logic collaboration mode (LLM)"
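Under the hood, this evaluation amounts to comparing each prediction against the gold answer. A minimal accuracy computation for illustration (the list-based interface is an assumption, not the script's actual schema):

```python
def accuracy(predictions, gold):
    """Fraction of examples where the prediction matches the gold answer."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```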

Self-Refinement

After generating the logic programs without self-refinement, run the following commands for self-refinement:

DATASET="Dataset Name [ProntoQA | ProofWriter | FOLIO | LogicalDeduction | AR-LSAT]"
SPLIT="Dataset Split [dev | test]"
MODEL="The model that generated the logic programs"
BACKUP="Backup strategy: random guess (random) or CoT-Logic collaboration mode (LLM)"

python ./src/models/self_refinement.py \
    --model_name ${MODEL} \
    --dataset_name ${DATASET} \
    --split ${SPLIT} \
    --backup_strategy ${BACKUP} \
    --backup_LLM_result_path ./src/outputs/baselines/${DATASET}/CoT_${DATASET}_${SPLIT}_${MODEL}.json \
    --api_key "Your OpenAI API Key" \
    --maximum_rounds 3

The self-refinement results will be saved in ./src/outputs/logic_inference/${DATASET}/.
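The refinement loop can be sketched as follows; `llm` and `solver` are hypothetical callables, but the control flow follows the description above: run the solver, and on failure feed its error message back to the LLM to revise the formalization, for up to `maximum_rounds` rounds:

```python
def self_refine(program, llm, solver, maximum_rounds=3):
    """Repair a logic program using the solver's error messages."""
    for _ in range(maximum_rounds):
        ok, output = solver(program)      # output: an answer or an error message
        if ok:
            return program, output
        # Feed the error message back so the LLM can revise the formalization.
        program = llm(f"Program:\n{program}\nSolver error:\n{output}\nFix it.")
    return program, None                  # still failing: use a backup strategy
```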

Reference

Please cite the paper in the following format if you use this code or data in your research.

@inproceedings{PanLogicLM23,
  author       = {Liangming Pan and
                  Alon Albalak and
                  Xinyi Wang and
                  William Yang Wang},
  title        = {{Logic-LM:} Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning},
  booktitle    = {Findings of the 2023 Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP)},
  address      = {Singapore},
  year         = {2023},
  month        = {Dec},
  url          = {https://arxiv.org/abs/2305.12295}
}

Credit

The code for the SMT solver is modified from SatLM.
