
Code Evaluation Guide

This document describes how to evaluate the code-generation performance of the VibeThinker-1.5B model on LiveCodeBench.

Evaluation Process

1. Clone the Required Project

git clone git@github.com:LiveCodeBench/LiveCodeBench.git
cd LiveCodeBench

2. Install Dependencies

Requires Python 3.12.

pip install -e . -i https://mirrors.aliyun.com/pypi/simple/
# our custom version pins for running our models
pip install datasets==3.6.0 vllm==0.10.1 -i https://mirrors.aliyun.com/pypi/simple/
pip install --no-cache-dir https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
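Since the pins above matter (mismatched `datasets` or `vllm` versions are a common source of subtle evaluation differences), it can be worth confirming what is actually installed. A minimal sketch; `check_pin` is a hypothetical helper written for this guide, not part of pip:

```python
from importlib.metadata import version, PackageNotFoundError

# Sanity-check that this guide's pins (datasets==3.6.0, vllm==0.10.1)
# match the versions actually installed in the current environment.
def check_pin(pkg: str, want: str) -> str:
    try:
        got = version(pkg)
    except PackageNotFoundError:
        return f"{pkg}: NOT INSTALLED"
    return f"{pkg}: OK" if got == want else f"{pkg}: have {got}, want {want}"

for pkg, want in {"datasets": "3.6.0", "vllm": "0.10.1"}.items():
    print(check_pin(pkg, want))
```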

3. Download the LiveCodeBench Dataset

Optional, if huggingface.co is unreachable from your network: export HF_ENDPOINT=https://hf-mirror.com

from datasets import load_dataset

load_dataset("livecodebench/code_generation_lite", split="test", version_tag="release_v6", trust_remote_code=True)
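Each release tag corresponds to a date window of contest problems (e.g. release_v6 covers 2025-02-01 to 2025-05-01), which is why the run commands below also pass --start_date. A rough sketch of that filtering logic; the records here are made up, though the contest_date field follows the dataset's schema:

```python
from datetime import datetime

# Hypothetical problem records; only contest_date mirrors the real schema.
problems = [
    {"question_id": "p1", "contest_date": "2025-01-15"},
    {"question_id": "p2", "contest_date": "2025-03-10"},
    {"question_id": "p3", "contest_date": "2025-04-30"},
]

def in_window(p, start="2025-02-01", end="2025-05-01"):
    # Keep problems whose contest date falls in [start, end).
    fmt = "%Y-%m-%d"
    d = datetime.strptime(p["contest_date"], fmt)
    return datetime.strptime(start, fmt) <= d < datetime.strptime(end, fmt)

selected = [p["question_id"] for p in problems if in_window(p)]
print(selected)  # → ['p2', 'p3']
```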

4. Customize Chat Template

  1. Add the model enum

Edit lcb_runner/lm_styles.py and add a member to the LMStyle enum:

class LMStyle(Enum):
    ...
    VibeThinker = "VibeThinker"
  2. Add the chat template

    Edit lcb_runner/prompts/code_generation.py; in the format_prompt_generation function, add this snippet:
def format_prompt_generation(
    question: CodeGenerationProblem, LanguageModelStyle: LMStyle
) -> str:
    ...

    if LanguageModelStyle == LMStyle.VibeThinker:
        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained(
          # or pass a local path to VibeThinker-1.5B here
            "VibeThinker/VibeThinker-1.5B", padding_side="left", use_fast=False
        )
        prompt = f"{PromptConstants.SYSTEM_MESSAGE_GENERIC}\n\n"
        prompt += f"{get_generic_question_template_answer(question)}"
        chat_messages = [
            {
                "role": "user",
                "content": prompt,
            },
        ]
        prompt = tokenizer.apply_chat_template(
            chat_messages,
            tokenize=False,
            add_generation_prompt=True,
            truncation=False,
            padding=False,
        )
        return prompt
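For intuition, apply_chat_template wraps the user message in the model's special tokens and appends the assistant header so generation starts in the right place. A minimal sketch of the output shape, assuming a ChatML-style template (VibeThinker-1.5B is Qwen-based; verify the exact special tokens against the model's tokenizer_config.json):

```python
# Rough illustration of what apply_chat_template(..., add_generation_prompt=True)
# produces for a ChatML-style template. render_chatml is a toy stand-in,
# not part of transformers.
def render_chatml(messages, add_generation_prompt=True):
    out = ""
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        out += "<|im_start|>assistant\n"
    return out

prompt = render_chatml([{"role": "user", "content": "Write a function that ..."}])
print(prompt)
```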
  3. Add the LanguageModel entry

    Edit lcb_runner/lm_styles.py and add this model to LanguageModelList:

LanguageModelList: list[LanguageModel] = [
    ...

    LanguageModel(
        model_name="VibeThinker/VibeThinker-1.5B",
        model_repr="VibeThinker-1.5B",
        model_style=LMStyle.VibeThinker,
        release_date=datetime(2025, 11, 10),
        link="https://huggingface.co/WeiboAI/VibeThinker-1.5B",
    ),
]

5. Update Sampling Params

  1. Update the top_k and stop-word settings

    Edit lcb_runner/runner/vllm_runner.py:

class VLLMRunner(BaseRunner):
    def __init__(self, args, model):
        self.sampling_params = SamplingParams(
            top_k=-1,
            ...
            stop=[],
        )
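In vLLM's convention, top_k=-1 disables top-k truncation entirely, so the model samples from the full vocabulary distribution (subject only to temperature). A toy sketch of what top-k truncation would otherwise do; top_k_filter is an illustrative helper, not a vLLM API:

```python
# Illustration of top-k truncation on a token distribution.
# k == -1 means "keep everything" (vLLM's convention).
def top_k_filter(probs: dict, k: int) -> dict:
    if k == -1 or k >= len(probs):
        return dict(probs)  # no truncation
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    # Renormalize the surviving tokens so they sum to 1.
    return {t: probs[t] / total for t in keep}

probs = {"a": 0.5, "b": 0.3, "c": 0.2}
print(top_k_filter(probs, -1))  # full distribution, unchanged
print(top_k_filter(probs, 2))   # only 'a' and 'b', renormalized
```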

6. Run the Evaluation

  1. LiveCodeBench v6
# LiveCodeBench v6 (2025.02.01 - 2025.05.01 for release v6, 131 problems total):
export VLLM_USE_V1=0 # In our experiments, the v1 engine lowers the LCB score by about 2 points compared to v0
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
N_ROLLOUT=8
# Optionally add: --local_model_path <local model path for VibeThinker-1.5B>
python -m lcb_runner.runner.main \
    --model VibeThinker/VibeThinker-1.5B \
    --scenario codegeneration \
    --evaluate \
    --release_version release_v6 \
    --temperature 0.6 \
    --n $N_ROLLOUT \
    --codegen_n $N_ROLLOUT \
    --max_tokens 40960 \
    --start_date 2025-02-01 \
    --tensor_parallel_size 1 \
    --enable_prefix_caching \
    --num_process_evaluate 180
  2. LiveCodeBench v5
# LiveCodeBench v5 (2024.08.01 - 2025.02.01 for release v5, 279 problems total):
export VLLM_USE_V1=0 # In our experiments, the v1 engine lowers the LCB score by about 2 points compared to v0
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
N_ROLLOUT=8
# Optionally add: --local_model_path <local model path for VibeThinker-1.5B>
python -m lcb_runner.runner.main \
    --model VibeThinker/VibeThinker-1.5B \
    --scenario codegeneration \
    --evaluate \
    --release_version release_v5 \
    --temperature 0.6 \
    --n $N_ROLLOUT \
    --codegen_n $N_ROLLOUT \
    --max_tokens 40960 \
    --start_date 2024-08-01 \
    --tensor_parallel_size 1 \
    --enable_prefix_caching \
    --num_process_evaluate 180
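With --n 8 and --codegen_n 8, each problem gets N_ROLLOUT=8 sampled solutions, and the reported pass@1 averages correctness over those rollouts. The standard unbiased pass@k estimator (Chen et al., 2021) makes this concrete; for k=1 it reduces to the fraction of correct samples:

```python
from math import comb

# Unbiased pass@k estimator: with n samples per problem and c of them
# correct, pass@k = 1 - C(n - c, k) / C(n, k).
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: N_ROLLOUT=8 samples, 2 correct -> pass@1 is simply c/n = 0.25.
print(pass_at_k(8, 2, 1))  # → 0.25
```

Per-problem pass@1 values are then averaged over all problems in the release window to produce the final score.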

Acknowledgements

This evaluation guide is built upon the LiveCodeBench/LiveCodeBench project. Thanks to the original authors for their contributions.