This document describes the code evaluation setup for the VibeThinker-1.5B model on LiveCodeBench.
```shell
# Requires Python 3.12
git clone git@github.com:LiveCodeBench/LiveCodeBench.git
cd LiveCodeBench
pip install -e . -i https://mirrors.aliyun.com/pypi/simple/
```
```shell
# Custom settings required to run our models
pip install datasets==3.6.0 vllm==0.10.1 -i https://mirrors.aliyun.com/pypi/simple/
pip install --no-cache-dir https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl

# Optional: use a Hugging Face mirror
export HF_ENDPOINT=https://hf-mirror.com
```
Download the evaluation dataset:

```python
from datasets import load_dataset

load_dataset("livecodebench/code_generation_lite", split="test", version_tag="release_v6", trust_remote_code=True)
```

- Add the model enum

Edit `lcb_runner/lm_styles.py` and add an enum member to `LMStyle`:

```python
class LMStyle(Enum):
    ...
    VibeThinker = "VibeThinker"
```
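The `LMStyle` entry is a plain string-valued `Enum`; the runner dispatches on member identity, and the string value is what appears when the style is serialized. A minimal self-contained sketch (using a hypothetical two-member enum, not the real `lm_styles.py`):

```python
from enum import Enum

# Hypothetical miniature of LMStyle, for illustration only
class LMStyle(Enum):
    OpenAIChat = "OpenAIChat"
    VibeThinker = "VibeThinker"

# Members can be looked up by their string value,
# and comparisons use identity on the enum member.
style = LMStyle("VibeThinker")
print(style is LMStyle.VibeThinker)  # True
print(style.value)                   # VibeThinker
```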
- Add the chat template

Edit `lcb_runner/prompts/code_generation.py` and add this snippet to the `format_prompt_generation` function:

```python
def format_prompt_generation(
    question: CodeGenerationProblem, LanguageModelStyle: LMStyle
) -> str:
    ...
    if LanguageModelStyle == LMStyle.VibeThinker:
        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained(
            # or put a local path to VibeThinker-1.5B here
            "VibeThinker/VibeThinker-1.5B", padding_side="left", use_fast=False
        )
        prompt = f"{PromptConstants.SYSTEM_MESSAGE_GENERIC}\n\n"
        prompt += f"{get_generic_question_template_answer(question)}"
        chat_messages = [
            {
                "role": "user",
                "content": prompt,
            },
        ]
        prompt = tokenizer.apply_chat_template(
            chat_messages,
            tokenize=False,
            add_generation_prompt=True,
            truncation=False,
            padding=False,
        )
        return prompt
```
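The `apply_chat_template` call wraps the user prompt in the model's chat markup and appends a generation prompt. As a rough illustration only (assuming a ChatML-style template of the kind used by Qwen-family models; the authoritative template ships with the model's tokenizer), the resulting string looks like:

```python
def chatml_wrap(user_content: str) -> str:
    """Sketch of a ChatML-style chat template with a generation prompt.
    This only approximates what tokenizer.apply_chat_template(...) returns;
    the real template is bundled with the tokenizer."""
    return (
        f"<|im_start|>user\n{user_content}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(chatml_wrap("Write a function that reverses a string."))
```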
- Add the LanguageModel entry

Edit `lcb_runner/lm_styles.py` and add this model to `LanguageModelList`:

```python
LanguageModelList: list[LanguageModel] = [
    ...
    LanguageModel(
        model_name="VibeThinker/VibeThinker-1.5B",
        model_repr="VibeThinker-1.5B",
        model_style=LMStyle.VibeThinker,
        release_date=datetime(2025, 11, 10),
        link="https://huggingface.co/WeiboAI/VibeThinker-1.5B",
    ),
]
```
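For intuition, the runner resolves the `--model` argument by matching it against the entries of this list. A self-contained sketch with a hypothetical stand-in dataclass (field names mirror the entry above; the real `LanguageModel` lives in `lcb_runner/lm_styles.py`):

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical stand-in for lcb_runner's LanguageModel, for illustration only
@dataclass
class LanguageModel:
    model_name: str
    model_repr: str
    model_style: str
    release_date: datetime
    link: str

LanguageModelList = [
    LanguageModel(
        model_name="VibeThinker/VibeThinker-1.5B",
        model_repr="VibeThinker-1.5B",
        model_style="VibeThinker",
        release_date=datetime(2025, 11, 10),
        link="https://huggingface.co/WeiboAI/VibeThinker-1.5B",
    ),
]

# Resolve a --model argument to its registered entry
by_name = {m.model_name: m for m in LanguageModelList}
print(by_name["VibeThinker/VibeThinker-1.5B"].model_repr)  # VibeThinker-1.5B
```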
- Update the top_k and stop-word settings

Edit `lcb_runner/runner/vllm_runner.py`:

```python
class VLLMRunner(BaseRunner):
    def __init__(self, args, model):
        self.sampling_params = SamplingParams(
            top_k=-1,  # -1 disables top-k filtering
            ...
            stop=[],  # no extra stop strings
        )
```
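In vLLM's `SamplingParams`, `top_k=-1` means top-k truncation is disabled, so the sampler considers the full vocabulary. A dependency-free sketch of what top-k filtering does to toy logits, and why `-1` is a no-op:

```python
def top_k_filter(logits, k):
    """Keep the k largest logits; k = -1 means no filtering (vLLM convention)."""
    if k == -1:
        return list(logits)
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [x if x >= cutoff else float("-inf") for x in logits]

logits = [2.0, 0.5, 1.0, -1.0]
print(top_k_filter(logits, -1))  # unchanged: [2.0, 0.5, 1.0, -1.0]
print(top_k_filter(logits, 2))   # [2.0, -inf, 1.0, -inf]
```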
- LiveCodeBench v6

```shell
# LiveCodeBench v6 (release_v6 covers 2025-02-01 to 2025-05-01, 131 problems total)
export VLLM_USE_V1=0  # In our experiments, the v1 engine lowers the LCB score by about 2 points compared to v0
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
N_ROLLOUT=8
# Optional: add --local_model_path <local model path for VibeThinker-1.5B>
python -m lcb_runner.runner.main \
    --model VibeThinker/VibeThinker-1.5B \
    --scenario codegeneration \
    --evaluate \
    --release_version release_v6 \
    --temperature 0.6 \
    --n $N_ROLLOUT \
    --codegen_n $N_ROLLOUT \
    --max_tokens 40960 \
    --start_date 2025-02-01 \
    --tensor_parallel_size 1 \
    --enable_prefix_caching \
    --num_process_evaluate 180
```
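With `--n 8`, eight samples are generated per problem, and scores are aggregated across rollouts. As background, the standard unbiased pass@k estimator from the HumanEval/Codex evaluation methodology (LiveCodeBench's own aggregation may differ in detail) can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # any size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With N_ROLLOUT = 8 samples per problem:
print(pass_at_k(8, 0, 1))  # 0.0  (no correct samples)
print(pass_at_k(8, 4, 1))  # 0.5  (half the samples pass)
print(pass_at_k(8, 8, 1))  # 1.0  (all samples pass)
```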
- LiveCodeBench v5

```shell
# LiveCodeBench v5 (release_v5 covers 2024-08-01 to 2025-02-01, 279 problems total)
export VLLM_USE_V1=0  # In our experiments, the v1 engine lowers the LCB score by about 2 points compared to v0
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
N_ROLLOUT=8
# Optional: add --local_model_path <local model path for VibeThinker-1.5B>
python -m lcb_runner.runner.main \
    --model VibeThinker/VibeThinker-1.5B \
    --scenario codegeneration \
    --evaluate \
    --release_version release_v5 \
    --temperature 0.6 \
    --n $N_ROLLOUT \
    --codegen_n $N_ROLLOUT \
    --max_tokens 40960 \
    --start_date 2024-08-01 \
    --tensor_parallel_size 1 \
    --enable_prefix_caching \
    --num_process_evaluate 180
```

This evaluation program is built upon the LiveCodeBench/LiveCodeBench project. Thanks to the original authors for their contributions.