# gguf-humaneval-benchmark

A strict, auditable HumanEval benchmark runner for GGUF models served via llama.cpp, using its OpenAI-compatible HTTP API.
This project focuses on correct execution semantics and reproducibility:
- Prompts are preserved verbatim (no stripping or truncation).
- Only fenced Python code is accepted.
- Each task is executed using strict HumanEval semantics.
- Full outputs and failure reasons are saved for auditing.
## Installation

```bash
pip install gguf-humaneval-benchmark
gguf-humaneval-benchmark --help
```

You should see the CLI help for the HumanEval benchmark runner.

---
### ✅ Strict Execution Semantics

For every task, execution follows exactly this sequence (a minimal sketch is shown below):

1. Execute the original prompt (function signature + docstring)
2. Execute the model-generated code
3. Execute the test harness
4. Call `check(entry_point)`
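A minimal sketch of these semantics, assuming a single shared namespace (the helper name `run_task` is illustrative, not the project's actual code):

```python
def run_task(prompt: str, generated_code: str, test_code: str, entry_point: str) -> None:
    """Illustrative only: strict HumanEval-style execution in one namespace."""
    namespace: dict = {}
    exec(prompt, namespace)          # 1. original signature + docstring
    exec(generated_code, namespace)  # 2. model-generated code
    exec(test_code, namespace)       # 3. test harness (defines check)
    namespace["check"](namespace[entry_point])  # 4. raises on any failing assertion
```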
---

### ✅ Verbatim Prompts

- HumanEval prompts are used verbatim
- No stripping, rewriting, or truncation
- Only a minimal instruction header is prepended
- Raw prompts are stored in the output JSON
---

### ✅ Strict Code Extraction

Only code inside a single fenced Python block is accepted:

```python
# code here
```

If no such block exists → **automatic failure (`no_code`)**.
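A rough sketch of this extraction rule (the regex and function name are illustrative; the actual implementation may differ in details such as how multiple blocks are handled):

```python
import re

# Matches the body of a fenced ```python block (illustrative pattern only).
FENCE_RE = re.compile(r"```python\s*\n(.*?)```", re.DOTALL)

def extract_code(response: str) -> tuple[str | None, str | None]:
    """Return (code, error_type); illustrative, not the exact implementation."""
    blocks = FENCE_RE.findall(response)
    if not blocks:
        return None, "no_code"   # no fenced Python block -> automatic failure
    return blocks[0], None       # the runner expects a single fenced block
```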
---
### ✅ Full Failure Attribution
Each failed task records (illustrated below):
- `error_type`
- `error_detail`
- `full_response`
- `generated_code`
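As a purely hypothetical illustration, a failed-task entry might look roughly like this (the `task_id` key and all values are made up; only the four fields listed above are guaranteed by this description):

```python
# Hypothetical example of a failed-task record, for illustration only.
failed_task = {
    "task_id": "HumanEval/42",
    "error_type": "no_code",
    "error_detail": "no fenced ```python block found in the response",
    "full_response": "...raw model output...",
    "generated_code": None,
}
```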
---
### ✅ llama.cpp Native Support
- Automatic server start/stop **or** reuse an existing server
- Uses the OpenAI-compatible `/v1/completions` API (see the sketch below)
- Streaming-safe and timeout-safe
- **GGUF-only by design**
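For reference, a completion request against the llama.cpp server might look roughly like this minimal `requests` sketch (the endpoint and response shape follow the OpenAI-compatible API; the parameter values are illustrative, not the runner's defaults):

```python
import requests

# Illustrative request against a llama.cpp server's OpenAI-compatible endpoint.
resp = requests.post(
    "http://127.0.0.1:8080/v1/completions",
    json={
        "prompt": "def add(a, b):\n    \"\"\"Return a + b.\"\"\"\n",
        "max_tokens": 512,   # illustrative value
        "temperature": 0.0,  # illustrative value
    },
    timeout=120,             # guard against hung generations
)
resp.raise_for_status()
completion_text = resp.json()["choices"][0]["text"]
```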
---
## Repository Structure
```
.
├── benchmark.py
├── HumanEval.jsonl
├── LICENSE
├── README.md
└── eval_utils/
    ├── __init__.py
    ├── bench_config.json
    └── code_bench.py
```
---
## Dependencies
### Required
- Python 3.10+
- llama.cpp (with server support)
- GGUF model
### Python packages
```bash
pip install requests datasets
```

---

## Building llama.cpp

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir -p build && cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
```

---

## Usage

### Automatic server start

```bash
python benchmark.py --model model.gguf --server-path /path/to/llama.cpp/build/bin
```

### Reuse an existing server

```bash
python benchmark.py --server-url http://127.0.0.1:8080 --no-server
```

### Custom HumanEval dataset

```bash
python benchmark.py --humaneval-jsonl HumanEval.jsonl
```

---

## Output

A JSON file is generated containing (example below):
- Full configuration
- Per-task results
- Raw model outputs
- Error attribution
- Timing metrics
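As a rough illustration of consuming the results, something like the following could work (the file name and key names here are assumptions about the schema, not guarantees):

```python
import json

# Hypothetical walk over the results file; adjust names to the actual schema.
with open("benchmark_results.json") as f:
    results = json.load(f)

for task in results.get("results", []):
    if not task.get("passed"):
        print(task.get("task_id"), task.get("error_type"), task.get("error_detail"))
```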
---

## Citation

If you use this code in research or benchmarking, please cite:

https://github.com/nerdskingcom/gguf-humaneval-benchmark, IPMN/IMNECHO / https://Nerdsking.com

---

## Support

If this project helped you, consider supporting my work: