[Paper] • [GitHub Repo] • [Leaderboard] • [Critic Model] • [Writing Model]
- Switched to Claude-Sonnet-4 for evaluation.
- Added scripts for response generation and score aggregation; see the Quick Start section for details.
- Leaderboard update: introduced requirement-dimension scores and updated the latest LLM evaluation results.
- Leaderboard launch: explore the evaluation results on the Hugging Face Leaderboard and the ModelScope Leaderboard. Updated with the latest LLM evaluations (Claude-3-7-Sonnet, o3, Grok-3, etc.).
- Parameters for response generation:
  - top_p: 0.8
  - top_k: 20
  - temperature: 0.7
  - max_length: 16000 (or the model's maximum if it is less than 16000)
- Parameters for scoring:
  - top_p: 0.95
  - top_k: (empty)
  - temperature: 1.0
  - max_length: 2048
- Leaderboard scores are scaled from the 10-point scale to a 100-point scale (multiplied by 10) for easier viewing.
- Updated the benchmark queries and criteria for improved assessment, including 1,000 queries and requirement-dimension subsets.
- Updated the evaluation prompt for better scoring and switched to Claude-3-7-Sonnet for evaluation.
- We released the first version of WritingBench, including 1,239 writing queries and style/format/length dimension subsets.
WritingBench is a comprehensive benchmark for evaluating LLMs' writing capabilities across 1,000 real-world queries, spanning:
- 6 primary domains
- 100 fine-grained subdomains
- 1,500+ avg. tokens per query
WritingBench integrates materials from diverse sources. Each query is paired with 5 instance-specific criteria and is scored either by LLM evaluators or by a fine-tuned critic model.
WritingBench is built through a hybrid pipeline combining Model-Augmented Query Generation and Human-in-the-Loop Refinement, ensuring both diversity and real-world applicability. The construction process involves two key phases:
Leverage LLMs to generate queries from a two-tiered domain pool grounded in real-world writing scenarios, consisting of 6 primary domains and 100 secondary subdomains, covering:
- Academic & Engineering
- Finance & Business
- Politics & Law
- Literature & Art
- Education
- Advertising & Marketing
Enhance the diversity and practical applicability of queries by applying randomly selected strategies from the Query Refinement Guidance Pool, covering:
- Style Adjustments (e.g., kid-friendly tone)
- Format Specifications (e.g., IEEE template)
- Length Constraints (e.g., 500-word summary)
- Personalization (e.g., educator's perspective)
- Content Specificity (e.g., 2023 Q3 metrics)
- Expression Optimization (query rewriting)
30 trained annotators collect necessary open-source materials (e.g., public financial statements or legal templates), guided by material requirements generated by LLMs.
5 experts conduct a meticulous two-stage filtering process:
- Query adaptation: ambiguous or unrealistic queries are revised to better align with the provided materials and practical scenarios.
- Material pruning: redundant or irrelevant content is eliminated from the collected materials.
Given a query, each model response is evaluated against the query's 5 instance-specific criteria.
For each criterion, the evaluator (an LLM judge or the fine-tuned critic model) assigns a score on a 10-point scale.
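To make the scoring scale concrete, here is a minimal sketch of how per-query and leaderboard scores relate, assuming the 5 criterion scores are simply averaged (the actual aggregation is performed by the provided scripts):

```python
from statistics import mean

def query_score(criterion_scores: list[float]) -> float:
    """Average the 5 criterion scores (each on a 10-point scale) for one query."""
    assert len(criterion_scores) == 5, "each query is paired with 5 criteria"
    return mean(criterion_scores)

# Hypothetical scores assigned by the evaluator for one response.
scores = [8, 7, 9, 6, 8]
raw = query_score(scores)    # 7.6 on the 10-point scale
leaderboard = raw * 10       # 76.0 on the 100-point leaderboard scale
print(raw, leaderboard)
```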
```bash
git clone https://github.com/X-PLUG/WritingBench.git
```
```
.
├── generate_response.py        # Generation script
├── evaluate_benchmark.py       # Evaluation script
├── calculate_scores.py         # Scoring script
├── prompt.py                   # Prompt templates
├── evaluator/
│   ├── __init__.py
│   ├── critic.py               # Critic model evaluation interface
│   └── llm.py                  # LLM evaluation interface
└── benchmark_query/
    ├── benchmark_all.jsonl     # Full dataset (1,000 queries)
    └── requirement/
        ├── style/
        │   ├── style_subset.jsonl      # requirement-involved subset for style
        │   └── style_subset_C.jsonl    # category-specific subset for style
        ├── format/
        │   ├── format_subset.jsonl     # requirement-involved subset for format
        │   └── format_subset_C.jsonl   # category-specific subset for format
        └── length/
            ├── length_subset.jsonl     # requirement-involved subset for length
            └── length_subset_C.jsonl   # category-specific subset for length
```
Generate responses for the evaluation queries (you need to complete the API call in the `writer` function in `generate_response.py`; a minimal sketch is given after the command below), or obtain responses through any other method. Whichever approach you take, write the results to a file that follows exactly the same JSONL structure as the example `response_model.jsonl`.
```bash
# --query_file: choose a file under benchmark_query/
# --output_file: path to save the response file
python generate_response.py \
  --query_file query_set.jsonl \
  --output_file ./responses/response_model.jsonl
```
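For reference, here is a minimal sketch of what the `writer` function might look like, assuming your model is served behind an OpenAI-compatible endpoint. The client setup, endpoint, and model name below are placeholders; the sampling values follow the response-generation settings listed earlier in this README.

```python
from openai import OpenAI  # assumption: an OpenAI-compatible API client

client = OpenAI(api_key="your_api_key_here", base_url="your_api_endpoint")

def writer(query: str) -> str:
    """Generate a response for one benchmark query (to be plugged into generate_response.py)."""
    completion = client.chat.completions.create(
        model="your_model_name",   # placeholder model name
        messages=[{"role": "user", "content": query}],
        temperature=0.7,           # response-generation settings listed earlier
        top_p=0.8,
        max_tokens=16000,          # or the model's maximum if it is lower
    )
    return completion.choices[0].message.content
```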
First, add your API credentials:

- For LLM-as-a-Judge, see `evaluator/llm.py`. We recommend using Claude-3-7-Sonnet for evaluation.

```python
self.api_key = "your_api_key_here"
self.url = "Your API endpoint"
self.model = "Choose your model name"
```
- For the critic model, see `evaluator/critic.py`.

```python
self.model = LLM(
    model="",  # Your local path. Please download the critic model from https://huggingface.co/AQuarterMile/WritingBench-Critic-Model-Qwen-7B.
    tensor_parallel_size=1,  # Your tensor parallel size; defaults to 1 (no parallelism).
)
```
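Once the critic model is loaded, scoring is an ordinary vLLM generation call. The snippet below is a simplified sketch with placeholder prompts, using the scoring parameters listed earlier in this README; the actual prompt construction and score parsing live in `evaluator/critic.py` and `prompt.py`.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/WritingBench-Critic-Model-Qwen-7B", tensor_parallel_size=1)
sampling = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=2048)

# One scoring prompt per (response, criterion) pair; placeholders shown here.
prompts = ["<scoring prompt for criterion 1>", "<scoring prompt for criterion 2>"]
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text)  # the critic's judgment, from which a 10-point score is parsed
```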
Then choose the appropriate evaluation sets from `benchmark_query/`.
```bash
# --evaluator: critic or claude
# --query_criteria_file: choose a file under benchmark_query/
python evaluate_benchmark.py \
  --evaluator critic \
  --query_criteria_file query_set.jsonl \
  --input_file ./responses/response_model.jsonl \
  --output_file ./score_dir/score_model.jsonl
```
An example line of `response_model.jsonl`, used to store the responses generated by the evaluated LLMs (where `i` is the index of the corresponding query):

`{"index": i, "response": "xxx"}`
Store all scoring results from the previous step in a folder, then aggregate the scores.
```bash
# --score_dir: directory with the JSONL score files produced in the previous stage
# --benchmark_file: full benchmark file provided in the repository
# --output_excel: path of the aggregated-result Excel file to be generated
# --requirement_dir: requirement folder included in the repository
python calculate_scores.py \
  --score_dir ./score_dir \
  --benchmark_file ./benchmark_query/benchmark_all.jsonl \
  --output_excel ./scores.xlsx \
  --requirement_dir ./requirement
```
WritingBench aims to be a reliable, comprehensive, and sustainable community resource for charting the frontiers of generative writing. If you are interested in leaderboard construction or any further discussion, please reach out via GitHub or email.
@misc{wu2025writingbench,
title={WritingBench: A Comprehensive Benchmark for Generative Writing},
author={Yuning Wu and Jiahao Mei and Ming Yan and Chenliang Li and Shaopeng Lai and Yuran Ren and Zijia Wang and Ji Zhang and Mengyue Wu and Qin Jin and Fei Huang},
year={2025},
url={https://arxiv.org/abs/2503.05244},
}