add grouper and aggregator op for system_prompt #500

Merged: 90 commits, Dec 19, 2024

Commits
63d430a
add api call
drcege Oct 23, 2024
6720da4
add call_api ops
drcege Oct 24, 2024
8daa6e1
clean
drcege Oct 29, 2024
ef11951
minor update
drcege Oct 29, 2024
5597d5c
more tests
drcege Oct 29, 2024
4b6e769
update tests
drcege Oct 29, 2024
835be22
Merge branch 'main' into dev/api_model
drcege Oct 29, 2024
325a753
update prompts
drcege Oct 29, 2024
4f04bdd
fix unittest
drcege Oct 30, 2024
0adbdcd
update tests
drcege Oct 30, 2024
0aa4069
add docs
drcege Nov 1, 2024
f007532
minor fix
drcege Nov 1, 2024
9aa7390
Merge branch 'main' into dev/api_model
drcege Nov 5, 2024
ee4f461
add API processor
drcege Nov 5, 2024
9bbfe47
Merge branch 'main' into dev/api_model
drcege Nov 5, 2024
b00b182
refine API processor
drcege Nov 5, 2024
b718de7
refine
drcege Nov 5, 2024
6d1d433
chunk and extract events
BeachWang Nov 6, 2024
4d1670f
fix bugs
drcege Nov 6, 2024
9e11aa3
fix tests
drcege Nov 6, 2024
cc40fc0
extract attribute
BeachWang Nov 7, 2024
4c262ad
Merge branch 'dev/api_model' of github.com:alibaba/data-juicer into d…
BeachWang Nov 7, 2024
347bc0f
refine tests
drcege Nov 7, 2024
c9d5051
extract nickname
BeachWang Nov 8, 2024
8a128ca
Merge branch 'dev/api_model' of github.com:alibaba/data-juicer into d…
BeachWang Nov 8, 2024
9262777
nickname test done
BeachWang Nov 8, 2024
58fc020
merge main
BeachWang Nov 8, 2024
c7dc28e
lightRAG to OP
BeachWang Nov 11, 2024
238869e
merge main
BeachWang Nov 11, 2024
0e51a43
doc done
BeachWang Nov 11, 2024
6d9d8a5
remove extra test
BeachWang Nov 11, 2024
a637a64
relavant -> relevant
BeachWang Nov 11, 2024
56e7988
fix minor error
BeachWang Nov 11, 2024
03880b7
group by op done
BeachWang Nov 12, 2024
23174fd
ValueError -> Exception
BeachWang Nov 12, 2024
e82cc06
merge main
BeachWang Nov 12, 2024
20a8dee
fix config_all error
BeachWang Nov 12, 2024
38a9511
fix prepare_api_model
BeachWang Nov 13, 2024
35f0eb3
fix rank sample None
BeachWang Nov 13, 2024
155d3dd
constant fix key
BeachWang Nov 13, 2024
f862897
aggregator op
BeachWang Nov 14, 2024
2d4da5e
merge llm_info_extract
BeachWang Nov 14, 2024
7e66057
init python_lambda_mapper
drcege Nov 20, 2024
a61859b
set default arg
drcege Nov 20, 2024
8031a31
fix init
drcege Nov 21, 2024
67711f9
add python_file_mapper
drcege Nov 21, 2024
cdeb692
support text & most relavant entities
BeachWang Nov 22, 2024
125a8f3
coverage ignore_errors
drcege Nov 25, 2024
0c68089
index sample
BeachWang Nov 25, 2024
651789d
role_playing_system_prompt_yaml
BeachWang Nov 25, 2024
c5d7b9e
merge python_file_mapper
BeachWang Nov 26, 2024
cf6a53a
Merge branch 'main' of github.com:alibaba/data-juicer into dev/group_…
BeachWang Nov 26, 2024
222790e
system_prompt begin
BeachWang Nov 27, 2024
75f2911
support batched
drcege Nov 27, 2024
11fa852
remove unforkable
BeachWang Nov 27, 2024
4af2bfb
support batched & add docs
drcege Nov 27, 2024
8867580
Merge branch 'main' into op/python_lambda
drcege Nov 28, 2024
553d5ad
add docs
drcege Nov 28, 2024
470ca19
fix docs
drcege Nov 28, 2024
399a238
update docs
drcege Nov 28, 2024
706365f
Merge branch 'main' into op/python_file
drcege Nov 28, 2024
115ab9a
pre-commit done
BeachWang Nov 28, 2024
ecb8635
fix batch bug
BeachWang Dec 2, 2024
03e3469
fix batch bug
BeachWang Dec 2, 2024
1788fa6
merge fix_batch_bug
BeachWang Dec 3, 2024
735ff4d
Merge branch 'main' of github.com:alibaba/data-juicer into debug/fix_…
BeachWang Dec 3, 2024
00ff624
fix filter batch
BeachWang Dec 3, 2024
8601519
fix filter batch
BeachWang Dec 3, 2024
eeefcab
system prompt recipe done
BeachWang Dec 3, 2024
6eaa50c
Merge branch 'main' of github.com:alibaba/data-juicer into dev/group_…
BeachWang Dec 3, 2024
1575717
not rank for filter
BeachWang Dec 5, 2024
2c5c4a1
limit pyav version
BeachWang Dec 5, 2024
5c96dd5
Merge branch 'debug/fix_batch_bug' of github.com:alibaba/data-juicer …
BeachWang Dec 5, 2024
49be467
add test for op
BeachWang Dec 5, 2024
9ab02fe
tmp
BeachWang Dec 5, 2024
ba086de
tmp
BeachWang Dec 5, 2024
f712131
doc done
BeachWang Dec 5, 2024
12b7616
Merge branch 'op/python_lambda' of github.com:alibaba/data-juicer int…
BeachWang Dec 5, 2024
e57b64a
merge python_lambda
BeachWang Dec 5, 2024
5f463cd
merge python_lambda
BeachWang Dec 5, 2024
a786070
skip api test
BeachWang Dec 6, 2024
73f4e77
merge main
BeachWang Dec 6, 2024
4b6f0b9
merge main
BeachWang Dec 6, 2024
788a212
add env dependency
BeachWang Dec 6, 2024
10242c4
install by recipe
BeachWang Dec 10, 2024
b46d105
change to dj_install
BeachWang Dec 12, 2024
a0da444
change to dj_install
BeachWang Dec 12, 2024
02f8dda
developer doc done
BeachWang Dec 12, 2024
b4d0798
merge dj-install
BeachWang Dec 12, 2024
2a52bfb
merge main
BeachWang Dec 13, 2024
83 changes: 81 additions & 2 deletions configs/config_all.yaml
@@ -79,9 +79,9 @@ process:
- clean_copyright_mapper: # remove copyright comments.
- expand_macro_mapper: # expand macro definitions in Latex text.
- extract_entity_attribute_mapper: # Extract attributes for given entities from the text.
api_model: 'gpt-4o' # API model name.
query_entities: ["孙悟空", "猪八戒"] # Entity list to be queried.
query_attributes: ["人物性格"] # Attribute list to be queried.
api_model: 'gpt-4o' # API model name.
entity_key: '__dj__entity__' # The field name to store the given main entity for attribute extraction.
entity_attribute_key: '__dj__attribute__' # The field name to store the given attribute to be extracted.
attribute_desc_key: '__dj__attribute_description__' # The field name to store the extracted attribute description.
@@ -153,6 +153,18 @@ process:
drop_text: false # If drop the text in the output.
model_params: {} # Parameters for initializing the API model.
sampling_params: {} # Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}
- extract_support_text_mapper: # extract support sub text for a summary.
api_model: 'gpt-4o' # API model name.
summary_key: '__dj__event_description__' # The field name to store the input summary. Support for nested keys such as "__dj__stats__.text_len".
support_text_key: '__dj__support_text__' # The field name to store the output support text for the summary.
api_endpoint: null # URL endpoint for the API.
response_path: null # Path to extract content from the API response. Defaults to 'choices.0.message.content'.
system_prompt: null # System prompt for the task.
input_template: null # Template for building the model input.
try_num: 3 # The number of retry attempts when there is an API call error or output parsing error.
drop_text: false # If drop the text in the output.
model_params: {} # Parameters for initializing the API model.
sampling_params: {} # Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}
- fix_unicode_mapper: # fix unicode errors in text.
- generate_qa_from_examples_mapper: # mapper to generate question and answer pairs from examples.
hf_model: 'Qwen/Qwen2.5-7B-Instruct' # Model name on huggingface to generate question and answer pairs.
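
Several of the options above take dot-separated paths: `response_path` defaults to 'choices.0.message.content', and `summary_key` accepts nested keys such as "__dj__stats__.text_len". The following is a minimal sketch of how such a path could be resolved against nested dicts and lists. The helper name `get_by_path` and the sample response are illustrative assumptions, not code from this PR.

```python
from typing import Any


def get_by_path(obj: Any, path: str) -> Any:
    """Walk a nested dict/list structure along a dot-separated path."""
    current = obj
    for part in path.split('.'):
        if isinstance(current, (list, tuple)):
            # Numeric segments index into sequences, e.g. the "0" in "choices.0".
            current = current[int(part)]
        else:
            current = current[part]
    return current


# A response shaped like a chat-completions payload (illustrative data only).
response = {'choices': [{'message': {'content': 'hello'}}]}
content = get_by_path(response, 'choices.0.message.content')
```

The same helper covers nested sample keys, since "__dj__stats__.text_len" is just a two-segment path into a dict.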
@@ -259,12 +271,27 @@ process:
model_params: {} # Parameters for initializing the API model.
sampling_params: {} # Extra parameters passed to the API call.
- punctuation_normalization_mapper: # normalize unicode punctuations to English punctuations.
- python_python_mapper: # executing Python lambda function defined in a file.
- python_file_mapper: # executing Python lambda function defined in a file.
file_path: '' # The path to the Python file containing the function to be executed.
function_name: 'process_single' # The name of the function defined in the file to be executed.
- python_lambda_mapper: # executing Python lambda function on data samples.
lambda_str: '' # A string representation of the lambda function to be executed on data samples. If empty, the identity function is used.
batched: False # A boolean indicating whether to process input data in batches.
- relation_identity_mapper: # identify the relation between two entities in the text.
api_model: 'gpt-4o' # API model name.
source_entity: '孙悟空' # The source entity of the relation to be identified.
target_entity: '猪八戒' # The target entity of the relation to be identified.
input_key: null # The input field key in the samples. Support for nested keys such as "__dj__stats__.text_len". It is text_key in default.
output_key: null # The output field key in the samples. Support for nested keys such as "__dj__stats__.text_len". It is input_key in default.
api_endpoint: null # URL endpoint for the API.
response_path: null # Path to extract content from the API response. Defaults to 'choices.0.message.content'.
system_prompt_template: null # System prompt template for the task. Need to specify by entity1 and entity2.
input_template: null # Template for building the model input.
output_pattern_template: null # Regular expression template for parsing model output.
try_num: 3 # The number of retry attempts when there is an API call error or output parsing error.
drop_text: false # If drop the text in the output.
model_params: {} # Parameters for initializing the API model.
sampling_params: {} # Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}
- remove_bibliography_mapper: # remove bibliography from Latex text.
- remove_comments_mapper: # remove comments from Latex text, code, etc.
doc_type: tex # comment type you want to remove. Only support 'tex' for now.
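
The `python_lambda_mapper` entry in this hunk takes a lambda as a string (identity when empty), and `python_file_mapper` names a function such as `process_single` inside a file. Below is a rough sketch of how a lambda string could be turned into a callable. `compile_lambda` is a hypothetical helper and may differ from what the operator actually does. The `ast` check only rejects input that is not a single lambda expression; it is not a sandbox.

```python
import ast
from typing import Any, Callable


def compile_lambda(lambda_str: str) -> Callable[[Any], Any]:
    """Turn a lambda source string into a callable; empty string means identity."""
    if not lambda_str:
        return lambda sample: sample
    tree = ast.parse(lambda_str, mode='eval')
    # Accept only a single lambda expression, not any other kind of expression.
    if not isinstance(tree.body, ast.Lambda):
        raise ValueError('lambda_str must be a single lambda expression')
    return eval(compile(tree, '<lambda_str>', 'eval'))


def process_single(sample: dict) -> dict:
    """Assumed shape of a function that a python_file_mapper file might define."""
    return {**sample, 'length': len(sample['text'])}


upper = compile_lambda("lambda d: {'text': d['text'].upper()}")
result = upper({'text': 'abc'})
```

With `batched: False`, a function like this would presumably receive one sample dict at a time; the batched layout is not shown in this diff.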
@@ -693,3 +720,55 @@ process:
top_ratio: # ratio of selected top samples
topk: # number of selected top sample
reverse: True # determine the sorting rule, if reverse=True, then sort in descending order

# Grouper ops.
- naive_grouper: # Group all samples into one batched sample.
- key_value_grouper: # Group samples into batched samples according to the values of the given keys.
group_by_keys: null # Group samples according to the values of these keys. Supports nested keys such as "__dj__stats__.text_len". Defaults to [self.text_key].
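
To make the two grouper entries concrete, here is a small plain-Python sketch of the grouping they describe: samples that share the same values under the chosen keys are merged into one batched sample, where each field holds a list. The function names and the dict-of-lists layout are assumptions for illustration, not the operators' actual code.

```python
from collections import defaultdict


def group_by_keys(samples: list[dict], keys: list[str]) -> list[dict]:
    """Group samples sharing the same values under `keys` into batched samples."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for sample in samples:
        # Key values must be hashable to serve as a grouping key.
        groups[tuple(sample[k] for k in keys)].append(sample)
    batched = []
    for members in groups.values():
        # A batched sample holds, per field, the list of values from its members.
        batched.append({field: [m[field] for m in members] for field in members[0]})
    return batched


def group_all(samples: list[dict]) -> list[dict]:
    """Naive variant: with no keys, every sample lands in one batched sample."""
    return group_by_keys(samples, keys=[])
```

Grouping by an empty key list gives every sample the same (empty) key, which is why the naive case falls out of the keyed one.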

# Aggregator ops.
- entity_attribute_aggregator: # Return conclusion of the given entity's attribute from some docs.
api_model: 'gpt-4o' # API model name.
entity: '孙悟空' # The given entity.
attribute: '人物经历' # The given attribute.
input_key: null # The input field key in the samples. Support for nested keys such as "__dj__stats__.text_len". It is text_key in default.
output_key: null # The output field key in the samples. Support for nested keys such as "__dj__stats__.text_len". It is same as the input_key in default.
word_limit: 100 # Prompt the output length.
max_token_num: null # The max token num of the total tokens of the sub documents. Without limitation if it is None.
api_endpoint: null # URL endpoint for the API.
response_path: null # Path to extract content from the API response. Defaults to 'choices.0.message.content'.
system_prompt_template: null # System prompt template for the task. Need to be specified by given entity and attribute.
example_prompt: null # The example part in the system prompt.
input_template: null # The input template.
output_pattern_template: null # The output template.
try_num: 3 # The number of retry attempts when there is an API call error or output parsing error.
model_params: {} # Parameters for initializing the API model.
sampling_params: {} # Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}
- most_relavant_entities_aggregator: # Extract entities closely related to a given entity from some texts, and sort them in descending order of importance.
api_model: 'gpt-4o' # API model name.
entity: '孙悟空' # The given entity.
query_entity_type: '人物' # The type of queried relevant entities.
input_key: null # The input field key in the samples. Support for nested keys such as "__dj__stats__.text_len". It is text_key in default.
output_key: null # The output field key in the samples. Support for nested keys such as "__dj__stats__.text_len". It is same as the input_key in default.
max_token_num: null # The max token num of the total tokens of the sub documents. Without limitation if it is None.
api_endpoint: null # URL endpoint for the API.
response_path: null # Path to extract content from the API response. Defaults to 'choices.0.message.content'.
system_prompt_template: null # System prompt template for the task. Need to be specified by given entity and entity_type.
input_template: null # The input template.
output_pattern: null # The output pattern.
try_num: 3 # The number of retry attempts when there is an API call error or output parsing error.
model_params: {} # Parameters for initializing the API model.
sampling_params: {} # Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}
- nested_aggregator: # Considering the limitation of input length, nested aggregate contents for each given number of samples.
api_model: 'gpt-4o' # API model name.
input_key: null # The input field key in the samples. Support for nested keys such as "__dj__stats__.text_len". It is text_key in default.
output_key: null # The output field key in the samples. Support for nested keys such as "__dj__stats__.text_len". It is same as the input_key in default.
max_token_num: null # The max token num of the total tokens of the sub documents. Without limitation if it is None.
api_endpoint: null # URL endpoint for the API.
response_path: null # Path to extract content from the API response. Defaults to 'choices.0.message.content'.
system_prompt: null # The system prompt.
sub_doc_template: null # The template for input text in each sample.
input_template: null # The input template.
try_num: 3 # The number of retry attempts when there is an API call error or output parsing error.
model_params: {} # Parameters for initializing the API model.
sampling_params: {} # Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}
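
The `nested_aggregator` comment describes aggregating a limited number of samples at a time because of the input length limit. One way to picture that is the sketch below: pack documents into chunks under a token budget, summarize each chunk, and repeat on the summaries until one remains. Everything here is assumed for illustration: `summarize` stands in for the API call, `count_tokens` is a crude whitespace tokenizer, and the budget must be an integer (the config allows `max_token_num: null` for no limit, which this sketch does not model).

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Crude stand-in tokenizer: whitespace-separated words."""
    return len(text.split())


def chunk_by_budget(docs: list[str], max_tokens: int) -> list[list[str]]:
    """Greedily pack docs into chunks whose total token count fits the budget."""
    chunks, current, used = [], [], 0
    for doc in docs:
        n = count_tokens(doc)
        if current and used + n > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        chunks.append(current)
    return chunks


def nested_aggregate(docs: list[str], summarize: Callable[[list[str]], str],
                     max_tokens: int) -> str:
    """Summarize chunk by chunk, then summarize the summaries, until one remains."""
    while len(docs) > 1:
        chunks = chunk_by_budget(docs, max_tokens)
        if len(chunks) == len(docs):
            # No two docs fit together; aggregate everything to guarantee progress.
            return summarize(docs)
        docs = [summarize(chunk) for chunk in chunks]
    # A single input document is returned as-is; an empty input yields ''.
    return docs[0] if docs else ''
```

Each round strictly reduces the number of documents or returns, so the loop terminates even when the summarizer does not shorten its input.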
7 changes: 6 additions & 1 deletion data_juicer/config/config.py
@@ -565,8 +565,13 @@ def sort_op_by_types_and_names(op_name_classes):
if 'deduplicator' in name]
selector_ops = [(name, c) for (name, c) in op_name_classes
if 'selector' in name]
grouper_ops = [(name, c) for (name, c) in op_name_classes
if 'grouper' in name]
aggregator_ops = [(name, c) for (name, c) in op_name_classes
if 'aggregator' in name]
ops_sorted_by_types = sorted(mapper_ops) + sorted(filter_ops) + sorted(
deduplicator_ops) + sorted(selector_ops)
deduplicator_ops) + sorted(selector_ops) + sorted(grouper_ops) + \
sorted(aggregator_ops)
return ops_sorted_by_types
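
The change above appends grouper and aggregator ops to the sorted list by matching a substring of the op name. A compact, self-contained sketch of the same ordering follows. It assumes the mapper and filter lists (defined above this hunk and not shown here) are built the same way, and it sorts by name only so that the class objects are never compared.

```python
# Order in which operator types are listed; the last two are the additions.
OP_TYPES = ['mapper', 'filter', 'deduplicator', 'selector', 'grouper', 'aggregator']


def sort_ops(op_name_classes):
    """Order (name, class) pairs by operator type, then by name within a type."""
    ordered = []
    for op_type in OP_TYPES:
        matching = [(name, c) for (name, c) in op_name_classes if op_type in name]
        ordered += sorted(matching, key=lambda pair: pair[0])
    return ordered


# Strings stand in for the operator classes in this illustration.
ops = [
    ('naive_grouper', 'G'),
    ('nested_aggregator', 'A'),
    ('topk_specified_field_selector', 'S'),
    ('clean_copyright_mapper', 'M'),
    ('key_value_grouper', 'K'),
]
names = [name for name, _ in sort_ops(ops)]
```

Because the type is detected by substring, an op whose name contained two type words would be listed once per match, so the naming convention matters.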


8 changes: 5 additions & 3 deletions data_juicer/ops/__init__.py
@@ -1,6 +1,6 @@
from . import deduplicator, filter, mapper, selector
from .base_op import (OPERATORS, UNFORKABLE, Deduplicator, Filter, Mapper,
Selector)
from . import aggregator, deduplicator, filter, grouper, mapper, selector
from .base_op import (OPERATORS, UNFORKABLE, Aggregator, Deduplicator, Filter,
Grouper, Mapper, Selector)
from .load import load_ops

__all__ = [
@@ -9,4 +9,6 @@
'Mapper',
'Deduplicator',
'Selector',
'Grouper',
'Aggregator',
]
8 changes: 8 additions & 0 deletions data_juicer/ops/aggregator/__init__.py
@@ -0,0 +1,8 @@
from .entity_attribute_aggregator import EntityAttributeAggregator
from .most_relavant_entities_aggregator import MostRelavantEntitiesAggregator
from .nested_aggregator import NestedAggregator

__all__ = [
'NestedAggregator', 'EntityAttributeAggregator',
'MostRelavantEntitiesAggregator'
]