Synthesizing Thorough Test Cases for LLM Code Generation Benchmarks using Property-Based Testing

mrigankpawagi/PropertyEval

PropertyEval

A repository of property-based tests for thorough benchmarking of LLM code generation.

Usage

The /tests directory contains directories labeled from 0 to 163, one per problem in the HumanEval dataset. Each directory contains a strategy.py file with the Hypothesis strategy for the corresponding problem, and an __init__.py file so that the tests can be imported as modules. Each strategy module exposes its strategies through the strategy attribute. They can be used as follows.

from hypothesis import given, strategies

# st is the strategy imported for one problem (see the import snippet that follows)
@given(strategies.tuples(*st))
def test_property(args):
    # call the function under test as f(*args), for example:
    # assert f(*args) == ground_truth(*args)
    ...

Here, st is the imported strategy. One way to import it is with the importlib module.

import importlib

# humaneval_id is the problem number (0 to 163); the package name matches the /tests directory
st_module = importlib.import_module(f"tests.{humaneval_id}.strategy")
st = st_module.strategy
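The examples later in this README define strategy either as a single strategy (strategy = txt) or as a tuple of strategies (strategy = grid, k). The following is a minimal sketch of one way to handle both shapes before passing them to @given. The helper as_args_strategy, the stand-in st, and the two toy functions are hypothetical and are not part of the repository.

```python
import heapq

from hypothesis import given, settings, strategies as hs
from hypothesis.strategies import SearchStrategy


def as_args_strategy(st):
    """Return one strategy that yields a tuple of positional arguments.

    Hypothetical helper: accepts a single strategy or a tuple of strategies.
    """
    if isinstance(st, SearchStrategy):
        st = (st,)
    return hs.tuples(*st)


# Stand-in for st_module.strategy (assumption: two arguments, a list and an int).
st = hs.lists(hs.integers(), max_size=20), hs.integers(min_value=0, max_value=5)


def ground_truth(xs, k):
    # Toy canonical solution: the k smallest elements in ascending order.
    return sorted(xs)[:k]


def candidate(xs, k):
    # Toy solution under test.
    return heapq.nsmallest(k, xs)


@settings(max_examples=200, deadline=None)
@given(as_args_strategy(st))
def test_equivalence(args):
    assert candidate(*args) == ground_truth(*args)


# A @given-decorated test can be called directly, without a test runner.
test_equivalence()
```

With a real problem, st would come from the importlib snippet above and the two functions would be the generated code and the canonical solution.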

Contribution

Thoroughness

We show that Property-Based Testing (PBT) can improve the thoroughness of programming benchmarks by leveraging the canonical solutions that these benchmarks already contain. For the HumanEval dataset, adequate property-based tests cannot be generated automatically with rule-based tools, so we construct them carefully by hand. We show that PBT lets us synthesize test cases as thorough as those generated with type-aware mutations in Liu et al.'s EvalPlus [1]. Moreover, our approach can be easily adapted to other contexts.

Dataset

We share our full set of property-based tests as a complementary resource to existing manual and synthesized test suites.

Examples

A non-trivial strategy.

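# HumanEval 129: minPath
from hypothesis.strategies import composite, integers, lists, permutations
# MAX_SEQUENCE_LEN and MAX_INT are standardized limits (see limits/limits.py)

@composite
def create_grid(draw, n_st=integers(min_value=2, max_value=MAX_SEQUENCE_LEN)):
    n = draw(n_st)
    # draw an n x n grid, then overwrite it so that every value in [1, n*n] appears exactly once
    grid = draw(lists(lists(integers(), min_size=n, max_size=n), min_size=n, max_size=n))
    perm = draw(permutations(range(1, n**2 + 1)))
    # fill grid with perm
    for i in range(n):
        for j in range(n):
            grid[i][j] = perm[i*n + j]
    return grid

grid = create_grid()
k = integers(min_value=1, max_value=MAX_INT)
strategy = grid, k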
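The grid strategy above is meant to produce an n x n grid in which every integer from 1 to n*n appears exactly once. As a sketch, the same shape can be built directly from the permutation, and the invariant can be checked with a property-based test. The value of MAX_SEQUENCE_LEN below is a placeholder chosen for this example, not the repository's limit, and this variant is an illustration rather than the repository's code.

```python
from hypothesis import given, settings
from hypothesis.strategies import composite, integers, permutations

MAX_SEQUENCE_LEN = 10  # placeholder value (assumption), not taken from limits/limits.py


@composite
def create_grid(draw, n_st=integers(min_value=2, max_value=MAX_SEQUENCE_LEN)):
    n = draw(n_st)
    # A random ordering of 1..n*n, cut into n rows of length n.
    perm = draw(permutations(range(1, n * n + 1)))
    return [list(perm[i * n:(i + 1) * n]) for i in range(n)]


@settings(max_examples=50, deadline=None)
@given(create_grid())
def test_grid_is_valid(grid):
    n = len(grid)
    # The grid is square with side at least 2.
    assert n >= 2 and all(len(row) == n for row in grid)
    # Every value in [1, n*n] appears exactly once.
    assert sorted(v for row in grid for v in row) == list(range(1, n * n + 1))


test_grid_is_valid()
```

Checking the generator itself in this way helps catch strategies that silently produce inputs outside a problem's stated preconditions.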

Examples of additional constraints on the input space. Here, we have restricted the alphabet and introduced bounds on the lengths of strings and lists.

import re
from hypothesis.strategies import lists, text

# HumanEval 134: check_if_last_char_is_a_letter
txt = text(alphabet='abcde0123 ')
strategy = txt

# HumanEval 143: words_in_sentence
# parentheses are required for the chained calls to span several lines
sentence = (
    text(alphabet="a ", min_size=1, max_size=100)
    .map(lambda s: re.sub(r"\s+", " ", s))  # collapse runs of spaces
    .filter(lambda s: not (s.startswith(" ") or s.endswith(" ")))
)
strategy = sentence

# HumanEval 158: find_max
words = lists(text(alphabet='abc', max_size=MAX_SEQUENCE_LEN), min_size=1, max_size=MAX_SEQUENCE_LEN)
strategy = words
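To illustrate what these constraints guarantee, the sketch below re-declares the words_in_sentence strategy and asserts the shape of the strings it generates. The test function name is hypothetical. The filter_too_much health check is suppressed as a precaution, because the filter discards every string that starts or ends with a space.

```python
import re

from hypothesis import HealthCheck, given, settings
from hypothesis.strategies import text

sentence = (
    text(alphabet="a ", min_size=1, max_size=100)
    .map(lambda s: re.sub(r"\s+", " ", s))  # collapse runs of spaces
    .filter(lambda s: not (s.startswith(" ") or s.endswith(" ")))
)


@settings(
    max_examples=100,
    deadline=None,
    suppress_health_check=[HealthCheck.filter_too_much],
)
@given(sentence)
def test_sentence_shape(s):
    assert 1 <= len(s) <= 100      # non-empty and bounded
    assert set(s) <= {"a", " "}    # restricted alphabet
    assert "  " not in s           # words separated by single spaces
    assert s == s.strip()          # no leading or trailing space


test_sentence_shape()
```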

Automation

For the MBPP dataset, we demonstrate that these tests can be generated largely automatically using GPT-3.5 by providing few-shot prompts based on some of our manually constructed tests. This demonstrates that our approach can be easily scaled to other datasets.

Warning

This is a work in progress, but some preliminary results are available here.

Evaluation

The /humaneval_groundtruth directory contains canonical solutions to HumanEval problems, adapted from the ground truth solutions provided with EvalPlus v0.1.0. The results of the equivalence tests on code samples for the 84 (model, size, temperature) combinations provided with EvalPlus v0.1.0 are available in evaldata.csv. The script for executing this benchmark is a modified fork [2] of the EvalPlus script.
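Footnote 2 mentions that the fork estimates pass@k for the property-based tests. For reference, here is a minimal sketch of the unbiased pass@k estimator introduced with HumanEval, 1 - C(n - c, k) / C(n, k), where n is the number of samples and c the number that pass. It is an assumption that the fork uses this exact formula, and the function name is chosen for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, c of which pass all tests."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k samples include a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=5, k=1))  # 0.5
```

A sample that passes the Base tests but fails a property-based test lowers c, and therefore pass@k, for that problem.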

The limits/limits.py file contains several standardized limits for the strategies. The limits/fuzzer.py script is for running fuzz-tests on all HumanEval ground truth with the strategies in order to validate these limits.
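As a rough sketch of this kind of validation, the function below runs a ground-truth solution on generated inputs and reports the slowest call, so that limits which make the ground truth crash or run too long can be spotted. This is not the contents of limits/fuzzer.py. The names fuzz and ground_truth and the value of MAX_SEQUENCE_LEN are assumptions made for this example.

```python
import time

from hypothesis import given, settings, strategies as hs

MAX_SEQUENCE_LEN = 100  # placeholder limit (assumption)


def fuzz(ground_truth, arg_strategies, max_examples=200):
    """Run ground_truth on generated inputs; return the slowest call in seconds."""
    slowest = 0.0

    @settings(max_examples=max_examples, deadline=None)
    @given(hs.tuples(*arg_strategies))
    def run(args):
        nonlocal slowest
        start = time.perf_counter()
        ground_truth(*args)  # any exception raised here fails the fuzz run
        slowest = max(slowest, time.perf_counter() - start)

    run()
    return slowest


def ground_truth(xs):
    # Toy stand-in for a canonical solution.
    return sorted(xs)


slowest = fuzz(ground_truth, (hs.lists(hs.integers(), max_size=MAX_SEQUENCE_LEN),))
print(f"slowest ground-truth call: {slowest * 1000:.3f} ms")
```

If the slowest call approaches the evaluation deadline, the corresponding limit is probably too generous for that problem.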

Footnotes

  1. Jiawei Liu et al. “Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation”. In: arXiv preprint arXiv:2305.01210 (2023)

  2. We forked EvalPlus and modified its evaluation script to run code samples against PropertyEval's property-based tests in addition to the Base and Base + Extra test cases. We also extended the existing pass@k estimation pipeline to cover PropertyEval's property-based tests. The fork is available as EvalPlusPro. Some points to note are as follows.

    1. The property-based tests are executed with 1000 examples, with @settings(max_examples=1000).
    2. Instead of the time limits enforced by EvalPlus, we use the default deadline of 200ms that comes with Hypothesis.
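The two settings above correspond to a test declared roughly as follows. This is a sketch with toy functions standing in for a code sample and its canonical solution. Only max_examples is set, so the deadline stays at whatever Hypothesis uses by default.

```python
from hypothesis import given, settings, strategies as hs


def ground_truth(xs):
    # Toy canonical solution.
    return sorted(xs)


def candidate(xs):
    # Toy code sample under evaluation.
    out = list(xs)
    out.sort()
    return out


@settings(max_examples=1000)  # deadline is not set, so the Hypothesis default applies
@given(hs.lists(hs.integers(), max_size=20))
def test_candidate_matches_ground_truth(xs):
    assert candidate(xs) == ground_truth(xs)


test_candidate_matches_ground_truth()
```

A code sample whose single call exceeds the deadline is reported by Hypothesis as a failure, which takes the place of a separate per-test time limit.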
