Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 16 additions & 2 deletions open_instruct/dataset_transformation.py
Original file line number Diff line number Diff line change
Expand Up @@ -403,6 +403,12 @@ def visualize_token_role(tokens: list[int], masks: list[int], tokenizer: PreTrai
"{% if not has_system %}"
"{{ '<|im_start|>system\nYou are Olmo, a helpful AI assistant built by Ai2. Your date cutoff is December 2024, and your model weights are available at https://huggingface.co/allenai.<|im_end|>\n' }}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The default system prompt for olmo_thinker_no_think is missing information about function-calling capabilities. Other olmo templates include this, and the rest of this template handles functions. This inconsistency could be confusing. For consistency, I suggest updating the default system prompt to mention functions, similar to other olmo templates.

Suggested change
"{{ '<|im_start|>system\nYou are Olmo, a helpful AI assistant built by Ai2. Your date cutoff is December 2024, and your model weights are available at https://huggingface.co/allenai.<|im_end|>\n' }}"
"{{ '<|im_start|>system\nYou are Olmo, a helpful AI assistant built by Ai2. Your date cutoff is December 2024, and your model weights are available at https://huggingface.co/allenai. You do not currently have access to any functions. <functions></functions><|im_end|>\n' }}"

"{% endif %}"
"{% set last_user_index = -1 %}"
"{% for message in messages %}"
"{% if message['role'] == 'user' %}"
"{% set last_user_index = loop.index0 %}"
"{% endif %}"
"{% endfor %}"
"{% for message in messages %}"
"{% if message['role'] == 'system' %}"
"{{ '<|im_start|>system\n' + message['content'] }}"
Expand All @@ -418,10 +424,18 @@ def visualize_token_role(tokens: list[int], masks: list[int], tokenizer: PreTrai
"{{ '<|im_start|>user\n' + message['content'] + '<|im_end|>\n' }}"
"{% endif %}"
"{% elif message['role'] == 'assistant' %}"
"{% set assistant_content = message.get('content', '') %}"
"{% set reasoning_content = '' %}"
"{% if '</think>' in assistant_content %}"
"{% set think_split = assistant_content.split('</think>') %}"
"{% set reasoning_content = think_split[0].rstrip('\\n').split('<think>')[-1].lstrip('\\n') %}"
"{% set assistant_content = think_split[-1].lstrip('\\n') %}"
"{% endif %}"
Comment on lines +429 to +433
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current logic for parsing the <think> block is fragile and can lead to data loss. Specifically, any content before the <think> tag is discarded. For example, if the assistant content is Preamble <think>reasoning</think> Answer, the preamble will be lost.

I suggest a more robust parsing logic that correctly handles content before and after the <think> block, and also ensures both <think> and </think> tags are present before attempting to split.

Suggested change
"{% if '</think>' in assistant_content %}"
"{% set think_split = assistant_content.split('</think>') %}"
"{% set reasoning_content = think_split[0].rstrip('\\n').split('<think>')[-1].lstrip('\\n') %}"
"{% set assistant_content = think_split[-1].lstrip('\\n') %}"
"{% endif %}"
"{% if '<think>' in assistant_content and '</think>' in assistant_content %}"
"{% set _before_think, _after_think = assistant_content.split('<think>', 1) %}"
"{% set _reasoning, _after_reasoning = _after_think.split('</think>', 1) %}"
"{% set reasoning_content = _reasoning.strip() %}"
"{% set assistant_content = (_before_think ~ _after_reasoning).lstrip('\\n') %}{% endif %}"

"{{ '<|im_start|>assistant\n' }}"
"{% if message.get('content', none) is not none %}"
"{{ message['content'] }}"
"{% if loop.index0 > last_user_index and reasoning_content.strip() %}"
"{{ '<think>\\n' + reasoning_content.strip('\\n') + '\\n</think>\\n\\n' }}"
"{% endif %}"
"{{ assistant_content }}"
"{% if message.get('function_calls', none) is not none %}"
"{{ '<function_calls>' + message['function_calls'] + '</function_calls>' }}"
"{% endif %}"
Expand Down