BlossomData

BlossomData是一个用于处理大型语言模型（LLM）训练数据的框架。它提供了一系列工具，用于合成训练数据，包括但不限于生成、翻译、蒸馏、校验等，帮助用户快速构建高质量的训练数据集。

⚠注意：该项目仍处于原型阶段，算子正在快速更新，并且可能存在大量未知问题。建议在实验环境中进行测试。

使用示例

使用pip安装。

pip3 install git+https://github.com/Azure99/BlossomData.git

在使用之前，请在config.yaml文件中配置模型服务提供商的API密钥和相关参数。

翻译数据

下面是一个最简单的示例，用于将对话数据从英文翻译为中文。

# 示例对话数据
data = [
    ChatSchema(
        messages=[
            user("hello"),
            assistant("Hello."),
        ]
    ),
]
# 定义Pipeline
pipeline = SimplePipeline().add_operators(
    # 对话翻译，使用gpt-4o将对话数据翻译为中文
    ChatTranslate(model="gpt-4o-mini", target_language="Chinese"),
)
# 执行并打印结果
print(pipeline.execute(data))

配置参数即可实现仅翻译数据中的指令部分，不翻译代码或其他内容，并开启并行处理。

pipeline = SimplePipeline().add_operators(
    ChatTranslate(
        model="gpt-4o-mini",
        target_language="Chinese",
        instruction_only=True,
        parallel=4,
    ),
)

翻译并重新蒸馏数据

直接翻译的模型回复质量可能不佳，因此可以先翻译用户指令，再使用ChatDistill重新生成Assistant回复以提高质量。

pipeline = SimplePipeline().add_operators(
    ChatTranslate(
        model="gpt-4o-mini",
        target_language="Chinese",
        # 由于Assistant的回复会被蒸馏覆盖，此处可以仅翻译USER的消息
        roles=[ChatRole.USER],
    ),
    # 提供多种蒸馏模式，第一轮、最后一轮、所有轮次
    ChatDistill(model="gpt-4o-mini", strategy=ChatDistill.Strategy.MULTI_TURN),
)

根据答案校验

对于有确切答案的问题（GSM8K等数学数据集），我们可以蒸馏回答，并基于参考答案检查是否正确。

data = [
    ChatSchema(
        messages=[
            user("Find all roots of the polynomial $x^3+x^2-4x-4$. Enter your answer as a list of numbers separated by commas."),
            assistant("−2,−1,2"),
        ]
    )
]
pipeline = SimplePipeline().add_operators(
    ChatMathDistill(
        model="gpt-4o-mini",
        validate_mode=ChatMathDistill.ValidateMode.LLM,
        max_retry=3,
    ),
)

多模型推理校验

对于没有确切答案的问题，我们可以使用另一个模型进行推理，并由第三个模型对两个回答进行检查，过滤掉不一致的结果。

data = [
    ChatSchema(
        messages=[
            user("Who developed ChatGPT?"),
            assistant("OpenAI"),
        ]
    ),
    ChatSchema(
        messages=[
            user("Who developed ChatGPT?"),
            assistant("Google"),
        ]
    ),
]
pipeline = SimplePipeline().add_operators(
    ChatMultiReasoningFilter(review_model="gpt-4o-mini", reasoning_model="gpt-4o-mini"),
)

自定义算子

定义自己的算子，以便灵活处理和生成训练数据。下面的示例中，首先翻译英文文档为中文，然后从中抽取问答对作为训练数据。

# 自定义Map算子，进行一对一映射
@context_map_operator(parallel=4)
def self_qa_op(context, item):
    self_qa_prompt = (
        "基于给定的文本，随意生成一个问题以及对应的长答案。\n"
        "你的输出应该是一个json，包含question、answer两个字符串字段，不需要输出任何其他的无关解释。\n"
        f"给定的文本：{item.content}"
    )
    raw_result = context.chat_completion("gpt-4o-mini", [user(self_qa_prompt)])
    result = loads_markdown_first_json(raw_result)
    return ChatSchema(
        messages=[
            user(result["question"]),
            assistant(result["answer"]),
        ]
    )


# 纯文本英文数据
data = [
    TextSchema(
        content="""Tomato scrambled eggs is a common dish in Eastern cuisine. 
        Because its ingredients are easy to obtain and the cooking steps are relatively simple, 
        it is also loved by beginners in the kitchen."""
    ),
]

pipeline = SimplePipeline().add_operators(
    # 翻译英文文本
    TextTranslate(
        model="gpt-4o-mini",
        target_language="Chinese",
    ),
    # 基于翻译后的文本，生成问题和答案
    self_qa_op,
)
print(pipeline.execute(data))

你可能会得到这样的输出：

# User: 为什么厨房新手喜欢做番茄炒蛋？
# Assistant: 厨房新手喜欢做番茄炒蛋有几个主要原因。
# 首先，番茄和鸡蛋这两种食材非常容易获得，几乎在所有的超市和市场都可以买到。
# 其次，番茄炒蛋的烹饪步骤也比较简单，没有复杂的技巧要求，非常适合新手尝试。
# 步骤通常包括切番茄、打鸡蛋、热锅上油、炒熟等，整个过程比较直观。
# 再者，番茄炒蛋作为一道家常菜，口味鲜美，营养丰富，成品容易让人满意，能给新手带来成就感。
# 此外，这道菜还可以根据个人口味进行简易的调味调整，无需严格遵循复杂的配方。
# 这些因素使得番茄炒蛋成为新手下厨时的首选之一。

Name		Name	Last commit message	Last commit date
Latest commit History 68 Commits
blossom		blossom
example		example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config.yaml.example		config.yaml.example
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BlossomData

使用示例

翻译数据

翻译并重新蒸馏数据

根据答案校验

多模型推理校验

自定义算子

About

Releases 1

Packages

Languages

License

Azure99/BlossomData

Folders and files

Latest commit

History

Repository files navigation

BlossomData

使用示例

翻译数据

翻译并重新蒸馏数据

根据答案校验

多模型推理校验

自定义算子

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages