WenMind Benchmark

  • 2024/09/26: The WenMind Benchmark paper has been accepted to NeurIPS 2024.

WenMind is a comprehensive benchmark dedicated to evaluating Large Language Models (LLMs) in Chinese Classical Literature and Language Arts (CCLLA). It covers three sub-domains (Ancient Prose, Ancient Poetry, and Ancient Literary Culture) and comprises 4,875 question-answer pairs spanning 42 fine-grained tasks (see Figure 1), 3 question formats (Fill-in-the-Blank, Multiple-Choice, and Question-and-Answer), and 2 evaluation scenarios (domain-oriented and capability-oriented).

Figure 1: Overview of the WenMind Benchmark, which covers 3 sub-domains and 42 fine-grained tasks.


You can obtain the complete WenMind evaluation dataset from the WenMind Benchmark folder on GitHub.

Data Format

    {
        "id": 2464,
        "domain": "ancient literary culture",
        "capability": "knowledge",
        "question_format": "QA",
        "coarse_grained_task_zh": "成语",
        "coarse_grained_task_en": "idiom",
        "fine_grained_task_zh": "成语解释",
        "fine_grained_task_en": "idiom explanation",
        "question": "解释下面成语的意思:\n暮去朝来",
        "answer": "黄昏过去,清晨又到来。形容时光流逝。"
    }

The following is an explanation of the various fields in the data samples:

  • id: The unique identifier for the data sample, used to distinguish different samples.

  • domain: The domain to which the data sample belongs, including ancient prose, ancient poetry and ancient literary culture.

  • capability: The type of capability of the data sample, including knowledge, understanding and generation.

  • question_format: The format of the question, indicating the type of question in the sample, including FB, MCQ and QA.

  • coarse_grained_task_zh: The Chinese name of the coarse-grained task classification. Describes the coarse-grained task category of the sample, with a total of 26 categories.

  • coarse_grained_task_en: The English name of the coarse-grained task classification. Corresponds to coarse_grained_task_zh, describing the coarse-grained task category of the sample, with a total of 26 categories.

  • fine_grained_task_zh: The Chinese name of the fine-grained task classification. Describes the fine-grained task category of the sample, with a total of 42 categories.

  • fine_grained_task_en: The English name of the fine-grained task classification. Corresponds to fine_grained_task_zh, describing the fine-grained task category of the sample, with a total of 42 categories.

  • question: The text of the question to be answered in the sample.

  • answer: The answer to the corresponding question. Provides a detailed response to the question.
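The field descriptions above can be turned into a quick sanity check for a single sample. The following is a minimal sketch, not part of the repository: the function name validate_sample is hypothetical, and the lowercase spelling of the allowed values is an assumption based on the one sample shown above.

```python
import json

# Allowed values, copied from the field descriptions above.
# Assumption: every sample uses the same lowercase spelling as the example.
DOMAINS = {"ancient prose", "ancient poetry", "ancient literary culture"}
CAPABILITIES = {"knowledge", "understanding", "generation"}
QUESTION_FORMATS = {"FB", "MCQ", "QA"}
REQUIRED_FIELDS = (
    "id", "domain", "capability", "question_format",
    "coarse_grained_task_zh", "coarse_grained_task_en",
    "fine_grained_task_zh", "fine_grained_task_en",
    "question", "answer",
)


def validate_sample(sample):
    """Return a list of problems found in one sample (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in sample]
    checks = (("domain", DOMAINS),
              ("capability", CAPABILITIES),
              ("question_format", QUESTION_FORMATS))
    for field, allowed in checks:
        value = sample.get(field)
        if value is not None and value not in allowed:
            problems.append(f"unexpected {field}: {value!r}")
    return problems


# The sample from the Data Format section (raw string keeps the JSON "\n" escape).
SAMPLE_JSON = r"""
{
    "id": 2464,
    "domain": "ancient literary culture",
    "capability": "knowledge",
    "question_format": "QA",
    "coarse_grained_task_zh": "成语",
    "coarse_grained_task_en": "idiom",
    "fine_grained_task_zh": "成语解释",
    "fine_grained_task_en": "idiom explanation",
    "question": "解释下面成语的意思:\n暮去朝来",
    "answer": "黄昏过去,清晨又到来。形容时光流逝。"
}
"""

sample = json.loads(SAMPLE_JSON)
print(validate_sample(sample))  # prints []
```

Running the same function over every record of the dataset file would flag samples with missing fields or unexpected category values before evaluation starts.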

Task List

T1-1: Inverted Sentence Structure (倒装句语序)

  • Task Description: Correct word order for inverted sentences.
  • Capability: Understanding
  • Scale: 18

T1-2: Elliptical Sentence (省略句)

  • Task Description: Supply the information omitted from the elliptical sentence.
  • Capability: Understanding
  • Scale: 32

T1-3: Inverted Sentence Types (倒装句类型)

  • Task Description: Identify the inversion type of inverted sentences.
  • Capability: Understanding
  • Scale: 7

T1-4: Sentence Structure Identification (判断句式)

  • Task Description: Identify the sentence's syntactic type.
  • Capability: Understanding
  • Scale: 43

T2: Classical Chinese to Modern Chinese (文白翻译)

  • Task Description: Translate classical Chinese into modern Chinese.
  • Capability: Understanding
  • Scale: 200

T3: Modern Chinese to Classical Chinese (白文翻译)

  • Task Description: Translate modern Chinese into classical Chinese.
  • Capability: Understanding
  • Scale: 200

T4: Named Entity Recognition (命名实体识别)

  • Task Description: Extract named entities from Classical Chinese sentences.
  • Capability: Understanding
  • Scale: 200

T5: Punctuation (句读)

  • Task Description: Add punctuation to Classical Chinese sentences.
  • Capability: Understanding
  • Scale: 200

T6: Topic Classification (主题分类)

  • Task Description: Select theme categories based on Classical Chinese sentences.
  • Capability: Understanding
  • Scale: 200

T7: Word Explanation (字词解释)

  • Task Description: Explain the words and phrases in Classical Chinese sentences.
  • Capability: Understanding
  • Scale: 100

T8: Reading Comprehension (阅读理解)

  • Task Description: Read Classical Chinese texts and answer related questions.
  • Capability: Understanding
  • Scale: 100

T9: Function Words (虚词)

  • Task Description: Explain the usage of function words in Classical Chinese sentences.
  • Capability: Understanding
  • Scale: 100

T10: Homophones (通假字)

  • Task Description: Identify whether a character is a homophone.
  • Capability: Understanding
  • Scale: 200

T11: Polysemy (单字多义)

  • Task Description: Distinguish between different meanings of the same character.
  • Capability: Understanding
  • Scale: 200

T12: Classical Chinese Writing (文言文写作)

  • Task Description: Write a passage in Classical Chinese.
  • Capability: Generation
  • Scale: 100

T13-1: Appreciation Exam Questions (赏析真题)

  • Task Description: Answer appreciation questions based on ancient poetry.
  • Capability: Understanding
  • Scale: 150

T13-2: Free Appreciation (自由赏析)

  • Task Description: Conduct a free and detailed analysis of ancient poetry.
  • Capability: Understanding
  • Scale: 100

T14-1: Poetry Writing (诗创作)

  • Task Description: Compose a poem based on the theme.
  • Capability: Generation
  • Scale: 30

T14-2: Ci Writing (词创作)

  • Task Description: Compose a ci based on the theme.
  • Capability: Generation
  • Scale: 50

T14-3: Qu Writing (曲创作)

  • Task Description: Compose a qu based on the theme.
  • Capability: Generation
  • Scale: 20

T15-1: Content Q&A (内容问答)

  • Task Description: Give the complete text of an ancient poem from its title and author.
  • Capability: Knowledge
  • Scale: 200

T15-2: Title and Author Q&A (题目作者问答)

  • Task Description: Give the title and author of an ancient poem from its content.
  • Capability: Knowledge
  • Scale: 200

T15-3: Write the Next Sentence (下句默写)

  • Task Description: Write the next sentence according to the previous sentence in the ancient poem.
  • Capability: Knowledge
  • Scale: 100

T15-4: Write the Previous Sentence (上句默写)

  • Task Description: Write the previous sentence according to the next sentence in the ancient poem.
  • Capability: Knowledge
  • Scale: 100

T15-5: Comprehension Dictation (理解性默写)

  • Task Description: Provide ancient poetry sentences that meet the requirements.
  • Capability: Knowledge
  • Scale: 30

T15-6: Genre Judgment (判断体裁)

  • Task Description: Judge the genre of ancient poetry.
  • Capability: Knowledge
  • Scale: 120

T16: Ancient Poetry Translation (古诗词翻译)

  • Task Description: Translate ancient poetry into modern Chinese.
  • Capability: Understanding
  • Scale: 200

T17: Sentiment Classification (情感分类)

  • Task Description: Judge the sentiment contained in ancient poetry.
  • Capability: Understanding
  • Scale: 200

T18: Ancient Poetry to English (古诗词英文翻译)

  • Task Description: Translate ancient poetry into English.
  • Capability: Understanding
  • Scale: 50

T19: Poet Introduction (诗人介绍)

  • Task Description: Provide a detailed introduction of the poet.
  • Capability: Knowledge
  • Scale: 110

T20: Analysis of Imagery (意象解析)

  • Task Description: Provide the meanings of the imagery.
  • Capability: Knowledge
  • Scale: 185

T21-1: Couplet Following (接下联)

  • Task Description: Create the following couplet based on the previous one.
  • Capability: Generation
  • Scale: 100

T21-2: Couplet Writing (主题创作)

  • Task Description: Write a couplet based on the theme.
  • Capability: Generation
  • Scale: 100

T21-3: HengPi Writing (拟横批)

  • Task Description: Write HengPi based on the content of a couplet.
  • Capability: Generation
  • Scale: 100

T22-1: Synonyms (近义词)

  • Task Description: Provide the synonym for the idiom.
  • Capability: Knowledge
  • Scale: 100

T22-2: The Origin of Idiom (成语出处)

  • Task Description: Provide the source of the idiom.
  • Capability: Knowledge
  • Scale: 100

T22-3: Idiom Finding (成语蕴含)

  • Task Description: Extract idioms from ancient Chinese sentences and provide their meanings.
  • Capability: Knowledge
  • Scale: 100

T22-4: Idiom Explanation (解释含义)

  • Task Description: Provide the meaning of idioms.
  • Capability: Knowledge
  • Scale: 100

T23: Riddle (谜语)

  • Task Description: Guess the answer based on clues or clever hints.
  • Capability: Knowledge
  • Scale: 100

T24: Xiehouyu (歇后语)

  • Task Description: Complete the second half of the proverb based on the first half.
  • Capability: Knowledge
  • Scale: 100

T25: Historical Chinese Phonology (古汉语音韵)

  • Task Description: Answer questions about ancient Chinese phonetics and rhymes.
  • Capability: Knowledge
  • Scale: 100

T26: Knowledge of Sinology Q&A (国学常识问答)

  • Task Description: Answer questions about Sinology.
  • Capability: Knowledge
  • Scale: 130
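As a cross-check of the task list, the sketch below copies each task's capability and scale from the entries above and tallies them. It is only an illustration of the arithmetic; the tuple layout and variable names are chosen here and do not come from the repository.

```python
from collections import Counter

# (task id, capability, scale), transcribed from the task list above.
TASKS = [
    ("T1-1", "Understanding", 18), ("T1-2", "Understanding", 32),
    ("T1-3", "Understanding", 7), ("T1-4", "Understanding", 43),
    ("T2", "Understanding", 200), ("T3", "Understanding", 200),
    ("T4", "Understanding", 200), ("T5", "Understanding", 200),
    ("T6", "Understanding", 200), ("T7", "Understanding", 100),
    ("T8", "Understanding", 100), ("T9", "Understanding", 100),
    ("T10", "Understanding", 200), ("T11", "Understanding", 200),
    ("T12", "Generation", 100),
    ("T13-1", "Understanding", 150), ("T13-2", "Understanding", 100),
    ("T14-1", "Generation", 30), ("T14-2", "Generation", 50),
    ("T14-3", "Generation", 20),
    ("T15-1", "Knowledge", 200), ("T15-2", "Knowledge", 200),
    ("T15-3", "Knowledge", 100), ("T15-4", "Knowledge", 100),
    ("T15-5", "Knowledge", 30), ("T15-6", "Knowledge", 120),
    ("T16", "Understanding", 200), ("T17", "Understanding", 200),
    ("T18", "Understanding", 50),
    ("T19", "Knowledge", 110), ("T20", "Knowledge", 185),
    ("T21-1", "Generation", 100), ("T21-2", "Generation", 100),
    ("T21-3", "Generation", 100),
    ("T22-1", "Knowledge", 100), ("T22-2", "Knowledge", 100),
    ("T22-3", "Knowledge", 100), ("T22-4", "Knowledge", 100),
    ("T23", "Knowledge", 100), ("T24", "Knowledge", 100),
    ("T25", "Knowledge", 100), ("T26", "Knowledge", 130),
]

# Total number of questions per capability.
scale_by_capability = Counter()
for task_id, capability, scale in TASKS:
    scale_by_capability[capability] += scale

# A coarse-grained task is the part of the id before the dash (T1, T2, ...).
coarse_ids = {task_id.split("-")[0] for task_id, _, _ in TASKS}

print(len(TASKS), len(coarse_ids), sum(scale_by_capability.values()))
# prints 42 26 4875
print(dict(scale_by_capability))
# prints {'Understanding': 2500, 'Generation': 500, 'Knowledge': 1875}
```

The totals agree with the 42 fine-grained tasks, 26 coarse-grained categories, and 4,875 question-answer pairs stated elsewhere in this README.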

Data Construction

The construction pipeline of WenMind includes data collection and data processing, as illustrated in Figure 2.

Figure 2: Construction pipeline of WenMind Benchmark.

Data Statistics

Table 1 provides the statistics of the WenMind dataset.

Table 1: The statistics of the WenMind Benchmark. "Q" represents "Question" and "A" represents "Answer".

| Domain | Tasks | #Q | Max. #Q | Min. #Q | Avg. Q Tokens | Avg. A Tokens |
|---|---|---|---|---|---|---|
| Ancient Prose | 15 | 1,900 | 200 | 7 | 107.51 | 62.12 |
| Ancient Poetry | 16 | 1,845 | 200 | 20 | 73.42 | 94.93 |
| Ancient Literary Culture | 11 | 1,130 | 100 | 100 | 26.68 | 14.26 |
| Overall | 42 | 4,875 | 200 | 7 | 75.87 | 63.44 |
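The Overall row of Table 1 is consistent with a question-count-weighted average of the three domain rows. The short check below reproduces it from the published figures; it assumes the token averages are per-question means, and since the inputs are already rounded to two decimals the agreement is approximate by nature.

```python
# (number of questions, avg. question tokens, avg. answer tokens) from Table 1.
ROWS = {
    "Ancient Prose": (1900, 107.51, 62.12),
    "Ancient Poetry": (1845, 73.42, 94.93),
    "Ancient Literary Culture": (1130, 26.68, 14.26),
}

total_questions = sum(n for n, _, _ in ROWS.values())

# Weight each domain's average by its number of questions.
overall_q_tokens = sum(n * q for n, q, _ in ROWS.values()) / total_questions
overall_a_tokens = sum(n * a for n, _, a in ROWS.values()) / total_questions

print(total_questions, round(overall_q_tokens, 2), round(overall_a_tokens, 2))
# prints 4875 75.87 63.44
```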


a. Obtain the model’s responses

Open-source Model

For open-source models, we perform inference locally, only requiring the model path and the output file path for the answers.

--model_path The path to the model; if it is not a local path, the model is loaded from Hugging Face
--output_path The file path for the model's answers; defaults to {model_name}_result.json


CUDA_VISIBLE_DEVICES=0,1 python Evaluation_Code/Inference/ \
    --model_path baichuan-inc/Baichuan2-7B-Chat \
    --output_path Baichuan2-7B-Chat_result.json
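To illustrate how the two arguments relate, here is a hypothetical argparse sketch in which --output_path falls back to {model_name}_result.json. It is not the repository's inference script: the parse_args helper, and the choice to take the last path component as the model name, are assumptions made for this example.

```python
import argparse
from pathlib import PurePosixPath


def parse_args(argv=None):
    """Parse the two inference arguments described above (illustrative only)."""
    parser = argparse.ArgumentParser(description="Local inference arguments (sketch)")
    parser.add_argument("--model_path", required=True,
                        help="Local path or Hugging Face model ID")
    parser.add_argument("--output_path", default=None,
                        help="Answer file; defaults to {model_name}_result.json")
    args = parser.parse_args(argv)
    if args.output_path is None:
        # Assumption: the model name is the last component of the model path.
        model_name = PurePosixPath(args.model_path).name
        args.output_path = f"{model_name}_result.json"
    return args


args = parse_args(["--model_path", "baichuan-inc/Baichuan2-7B-Chat"])
print(args.output_path)  # prints Baichuan2-7B-Chat_result.json
```

Under this assumption the default matches the output file name used in the example command above.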

API Model

For GPT-3.5 and GPT-4 models, provide two parameters: api_base and api_key.
For ERNIE-3.5 and ERNIE-4.0 models, provide two parameters: api_key and secret_key.
For Spark models, provide three parameters: api_key, secret_key, and appid.
Refer to the official documentation of each API model for details.


python \
    --API_KEY {api_key} \
    --SECRET_KEY {secret_key} \
    --output_path {output_path}

b. Use ERNIE-3.5 to score the responses

Step 1: Check whether the LLM response file is consistent with the format of the JSON/LLM_Response_Examples.json file.

Step 2: Open the Evaluation_Code/ file, input the API_KEY and SECRET_KEY for the scoring model ERNIE-3.5, replace LLM_response_path with the storage path of the LLM response file, replace LLM_score_path with the path where the scoring results will be saved, and replace LLM_prompt_path with the storage path of JSON/Task_Score_Prompt.json.

Step 3: Run the following command to obtain the scoring results:

python Evaluation_Code/ 
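The format check in Step 1 can be partly automated by comparing field names. The sketch below assumes both the example file and the response file are JSON arrays of objects that share one set of keys; this README does not specify that layout, and the "response" field in the demo data is purely hypothetical.

```python
import json


def load_records(path):
    """Load a JSON file that is assumed to hold a list of objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_format_mismatches(example_records, response_records):
    """Return (index, differing keys) for records whose keys differ from the example."""
    expected_keys = set(example_records[0])
    mismatches = []
    for index, record in enumerate(response_records):
        difference = expected_keys ^ set(record)  # keys missing or unexpected
        if difference:
            mismatches.append((index, sorted(difference)))
    return mismatches


# In-memory demo; the field names here are placeholders, not the real schema.
example = [{"id": 1, "question": "q", "answer": "a", "response": "r"}]
responses = [
    {"id": 7, "question": "q", "answer": "a", "response": "r"},
    {"id": 8, "question": "q", "answer": "a"},
]
print(find_format_mismatches(example, responses))  # prints [(1, ['response'])]
```

With real files, the two lists would come from load_records applied to JSON/LLM_Response_Examples.json and to the LLM response file.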

c. Calculate the model’s score

Step 1: Check whether the scoring file is consistent with the format of the JSON/LLM_Score_Examples.json file.

Step 2: Open the Evaluation_Code/ file and replace LLM_score_path with the storage path of the scoring file.

Step 3: Run the following command to obtain the model's score:

python Evaluation_Code/ 
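For the two evaluation scenarios (domain-oriented and capability-oriented), one plausible aggregation is a mean score per group. The sketch below assumes each scored record keeps the domain and capability fields from the data format and carries a numeric score field. Both the score field and the plain averaging are assumptions, so the repository's own script may compute results differently.

```python
from collections import defaultdict


def mean_score_by(records, field):
    """Average the (assumed) numeric "score" of records grouped by `field`."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of scores, count]
    for record in records:
        bucket = totals[record[field]]
        bucket[0] += record["score"]
        bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Made-up records for illustration only; these are not benchmark results.
scored = [
    {"domain": "ancient prose", "capability": "understanding", "score": 4},
    {"domain": "ancient prose", "capability": "generation", "score": 2},
    {"domain": "ancient poetry", "capability": "understanding", "score": 5},
]

print(mean_score_by(scored, "domain"))
# prints {'ancient prose': 3.0, 'ancient poetry': 5.0}
print(mean_score_by(scored, "capability"))
# prints {'understanding': 4.5, 'generation': 2.0}
```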

Evaluation Result

Table 2: Results of all evaluated models on different domains and capabilities.



License: MIT

The work is licensed under an MIT License.

License: CC BY-NC-SA 4.0

The WenMind benchmark is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


