
Code Evaluation Guide

This document describes how to evaluate the code-generation performance of the VibeThinker-1.5B model on LiveCodeBench.

Evaluation Process

1. Clone the Required Project

git clone git@github.com:LiveCodeBench/LiveCodeBench.git
cd LiveCodeBench

2. Install Dependencies

Requires Python 3.12.

pip install -e . -i https://mirrors.aliyun.com/pypi/simple/
# our custom version pins for running our models
pip install datasets==3.6.0 vllm==0.10.1 -i https://mirrors.aliyun.com/pypi/simple/
pip install --no-cache-dir https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
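Since the pins above matter (mismatched `datasets` or `vllm` versions are a common source of subtle evaluation differences), it can be worth confirming what is actually installed. A minimal sketch; `check_pin` is a hypothetical helper written for this guide, not part of pip:

```python
from importlib.metadata import version, PackageNotFoundError

# Sanity-check that this guide's pins (datasets==3.6.0, vllm==0.10.1)
# match the versions actually installed in the current environment.
def check_pin(pkg: str, want: str) -> str:
    try:
        got = version(pkg)
    except PackageNotFoundError:
        return f"{pkg}: NOT INSTALLED"
    return f"{pkg}: OK" if got == want else f"{pkg}: have {got}, want {want}"

for pkg, want in {"datasets": "3.6.0", "vllm": "0.10.1"}.items():
    print(check_pin(pkg, want))
```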

3. Download the LiveCodeBench Dataset

Optional, if huggingface.co is unreachable from your network: export HF_ENDPOINT=https://hf-mirror.com

from datasets import load_dataset

load_dataset("livecodebench/code_generation_lite", split="test", version_tag="release_v6", trust_remote_code=True)
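Each release tag corresponds to a date window of contest problems (e.g. release_v6 covers 2025-02-01 to 2025-05-01), which is why the run commands below also pass --start_date. A rough sketch of that filtering logic; the records here are made up, though the contest_date field follows the dataset's schema:

```python
from datetime import datetime

# Hypothetical problem records; only contest_date mirrors the real schema.
problems = [
    {"question_id": "p1", "contest_date": "2025-01-15"},
    {"question_id": "p2", "contest_date": "2025-03-10"},
    {"question_id": "p3", "contest_date": "2025-04-30"},
]

def in_window(p, start="2025-02-01", end="2025-05-01"):
    # Keep problems whose contest date falls in [start, end).
    fmt = "%Y-%m-%d"
    d = datetime.strptime(p["contest_date"], fmt)
    return datetime.strptime(start, fmt) <= d < datetime.strptime(end, fmt)

selected = [p["question_id"] for p in problems if in_window(p)]
print(selected)  # → ['p2', 'p3']
```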

4. Customize Chat Template

  1. Add the model enum

Edit lcb_runner/lm_styles.py and add a member to the LMStyle enum:

class LMStyle(Enum):
    ...
    VibeThinker = "VibeThinker"
  2. Add the chat template

    Edit lcb_runner/prompts/code_generation.py; in the format_prompt_generation function, add this snippet:
def format_prompt_generation(
    question: CodeGenerationProblem, LanguageModelStyle: LMStyle
) -> str:
    ...

    if LanguageModelStyle == LMStyle.VibeThinker:
        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained(
          # or pass a local path to VibeThinker-1.5B here
            "VibeThinker/VibeThinker-1.5B", padding_side="left", use_fast=False
        )
        prompt = f"{PromptConstants.SYSTEM_MESSAGE_GENERIC}\n\n"
        prompt += f"{get_generic_question_template_answer(question)}"
        chat_messages = [
            {
                "role": "user",
                "content": prompt,
            },
        ]
        prompt = tokenizer.apply_chat_template(
            chat_messages,
            tokenize=False,
            add_generation_prompt=True,
            truncation=False,
            padding=False,
        )
        return prompt
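For intuition, apply_chat_template wraps the user message in the model's special tokens and appends the assistant header so generation starts in the right place. A minimal sketch of the output shape, assuming a ChatML-style template (VibeThinker-1.5B is Qwen-based; verify the exact special tokens against the model's tokenizer_config.json):

```python
# Rough illustration of what apply_chat_template(..., add_generation_prompt=True)
# produces for a ChatML-style template. render_chatml is a toy stand-in,
# not part of transformers.
def render_chatml(messages, add_generation_prompt=True):
    out = ""
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        out += "<|im_start|>assistant\n"
    return out

prompt = render_chatml([{"role": "user", "content": "Write a function that ..."}])
print(prompt)
```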
  3. Add the LanguageModel entry

    Edit lcb_runner/lm_styles.py and add this model to LanguageModelList:

LanguageModelList: list[LanguageModel] = [
    ...

    LanguageModel(
        model_name="VibeThinker/VibeThinker-1.5B",
        model_repr="VibeThinker-1.5B",
        model_style=LMStyle.VibeThinker,
        release_date=datetime(2025, 11, 10),
        link="https://huggingface.co/WeiboAI/VibeThinker-1.5B",
    ),
]

5. Update Sampling Params

  1. Update the top_k and stop-word settings

    Edit lcb_runner/runner/vllm_runner.py:

class VLLMRunner(BaseRunner):
    def __init__(self, args, model):
        self.sampling_params = SamplingParams(
            top_k=-1,
            ...
            stop=[],
        )
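In vLLM's convention, top_k=-1 disables top-k truncation entirely, so the model samples from the full vocabulary distribution (subject only to temperature). A toy sketch of what top-k truncation would otherwise do; top_k_filter is an illustrative helper, not a vLLM API:

```python
# Illustration of top-k truncation on a token distribution.
# k == -1 means "keep everything" (vLLM's convention).
def top_k_filter(probs: dict, k: int) -> dict:
    if k == -1 or k >= len(probs):
        return dict(probs)  # no truncation
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    # Renormalize the surviving tokens so they sum to 1.
    return {t: probs[t] / total for t in keep}

probs = {"a": 0.5, "b": 0.3, "c": 0.2}
print(top_k_filter(probs, -1))  # full distribution, unchanged
print(top_k_filter(probs, 2))   # only 'a' and 'b', renormalized
```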

6. Run the Evaluation

  1. LiveCodeBench v6
# LiveCodeBench v6 (2025.02.01 - 2025.05.01 for release v6, 131 problems total):
export VLLM_USE_V1=0 # In our experiments, the v1 engine lowers the LCB score by about 2 points compared to v0
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
N_ROLLOUT=8
# Optionally add: --local_model_path <local model path for VibeThinker-1.5B>
python -m lcb_runner.runner.main \
    --model VibeThinker/VibeThinker-1.5B \
    --scenario codegeneration \
    --evaluate \
    --release_version release_v6 \
    --temperature 0.6 \
    --n $N_ROLLOUT \
    --codegen_n $N_ROLLOUT \
    --max_tokens 40960 \
    --start_date 2025-02-01 \
    --tensor_parallel_size 1 \
    --enable_prefix_caching \
    --num_process_evaluate 180
  2. LiveCodeBench v5
# LiveCodeBench v5 (2024.08.01 - 2025.02.01 for release v5, 279 problems total):
export VLLM_USE_V1=0 # In our experiments, the v1 engine lowers the LCB score by about 2 points compared to v0
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
N_ROLLOUT=8
# Optionally add: --local_model_path <local model path for VibeThinker-1.5B>
python -m lcb_runner.runner.main \
    --model VibeThinker/VibeThinker-1.5B \
    --scenario codegeneration \
    --evaluate \
    --release_version release_v5 \
    --temperature 0.6 \
    --n $N_ROLLOUT \
    --codegen_n $N_ROLLOUT \
    --max_tokens 40960 \
    --start_date 2024-08-01 \
    --tensor_parallel_size 1 \
    --enable_prefix_caching \
    --num_process_evaluate 180
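With --n 8 and --codegen_n 8, each problem gets N_ROLLOUT=8 sampled solutions, and the reported pass@1 averages correctness over those rollouts. The standard unbiased pass@k estimator (Chen et al., 2021) makes this concrete; for k=1 it reduces to the fraction of correct samples:

```python
from math import comb

# Unbiased pass@k estimator: with n samples per problem and c of them
# correct, pass@k = 1 - C(n - c, k) / C(n, k).
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: N_ROLLOUT=8 samples, 2 correct -> pass@1 is simply c/n = 0.25.
print(pass_at_k(8, 2, 1))  # → 0.25
```

Per-problem pass@1 values are then averaged over all problems in the release window to produce the final score.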

Acknowledgements

This evaluation guide is built upon the LiveCodeBench/LiveCodeBench project. Thanks to the original authors for their contributions.