prepare-benchmark get xbench-ds fails with UnicodeEncodeError: 'gbk' codec can't encode character '\u2011' in position 273: illegal multibyte sequence #100

@zhoukai83

Describe the bug
Running: uv run main.py prepare-benchmark get xbench-ds
fails with the following error:
\MiroFlow\utils\prepare_benchmark\main.py:179 in get │
│ │
│ 176 │ ds_file = env.data_dir / dataset / env.meta_filename │
│ 177 │ with open(ds_file, mode="w") as f: │
│ 178 │ │ for task in ds_gen(): │
│ ❱ 179 │ │ │ f.write(task.to_json().decode() + "\n") │
│ 180 │ print("\n" + "=" * 80) │
│ 181 │ print(f" Benchmark: {dataset}") │
│ 182 │ print(f" Saved to: {ds_file}") │
│ │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ dataset = 'xbench-ds' │ │
│ │ ds_file = WindowsPath('D:/myCode/github/MiroFlow/data/xbench-ds/standardized_data.jsonl') │ │
│ │ env = _Env(data_dir=WindowsPath('D:/myCode/github/MiroFlow/data'), hf_token='') │ │
│ │ f = <_io.TextIOWrapper │ │
│ │ name='D:\myCode\github\MiroFlow\data\xbench-ds\standardized_data.jsonl' │ │
│ │ mode='w' encoding='cp936'> │ │
│ │ task = Task( │ │
│ │ │ task_id=1, │ │
│ │ │ │ │
│ │ task_question='截至2024年12月31日,2024年上海黄金交易所Au(T+D)合约的“最高价”与“最… │ │
│ │ │ ground_truth='161.27元', │ │
│ │ │ file_path=None, │ │
│ │ │ metadata={ │ │
│ │ │ │ 'reference_steps': '1. │ │
│ │ 访问官方行情页面:https://www.sge.com.cn/sjzx/quotation_daily_new\n2. │ │
│ │ 设置查询区间:在页面中,按月度查询'+223 │ │
│ │ │ } │ │
│ │ ) │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
UnicodeEncodeError: 'gbk' codec can't encode character '\u2011' in position 273: illegal multibyte sequence

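The code does not account for the fact that the default text encoding can differ between systems and environments. Here open() is called without an encoding argument, so on this Windows machine the file is opened as cp936 (GBK), which cannot represent '\u2011'.
After changing line 177 of utils/prepare_benchmark/main.py to the following, the command runs successfully:
with open(ds_file, mode="w", encoding='utf8') as f: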
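
For anyone who wants to reproduce this without the dataset, here is a minimal sketch that is independent of the MiroFlow code. It checks that '\u2011' (the character named in the error) cannot be encoded as GBK, then writes a JSON Lines file with an explicit UTF-8 encoding. The names SAMPLE, gbk_can_encode, write_jsonl and read_jsonl are hypothetical helpers made up for this illustration, and the sample record is not taken from xbench-ds.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical record; U+2011 (non-breaking hyphen) is the character
# reported in the traceback.
SAMPLE = {"task_id": 1, "task_question": "Au(T\u2011D)"}


def gbk_can_encode(text: str) -> bool:
    """Return True if every character in text is representable in GBK."""
    try:
        text.encode("gbk")
    except UnicodeEncodeError:
        return False
    return True


def write_jsonl(path: Path, rows: list[dict]) -> None:
    """Write rows as JSON Lines, always as UTF-8 regardless of the OS locale."""
    with open(path, mode="w", encoding="utf-8") as f:
        for row in rows:
            # ensure_ascii=False keeps non-ASCII characters as-is in the file,
            # so the explicit encoding above is what makes the write safe.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read a UTF-8 JSON Lines file back into a list of dicts."""
    with open(path, mode="r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    print("GBK can encode U+2011:", gbk_can_encode("\u2011"))
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "standardized_data.jsonl"
        write_jsonl(out, [SAMPLE])
        print("Round trip OK:", read_jsonl(out) == [SAMPLE])
```

Passing the encoding explicitly on both the write and the read side keeps the file contents the same on Windows, Linux and macOS, rather than depending on the locale's default.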
