30 commits
31c5a64
feat: add config and operator node types
ChenZiHong-Gavin Dec 3, 2025
8bcbe51
refactor: refactor readers with ray data
ChenZiHong-Gavin Dec 3, 2025
246348f
fix: delete param parallelism for readers
ChenZiHong-Gavin Dec 3, 2025
319e1e7
fix: fix import error
ChenZiHong-Gavin Dec 3, 2025
42fcb09
refactor read and chunk operators with no side effects
ChenZiHong-Gavin Dec 4, 2025
b458e48
fix: fix import error
ChenZiHong-Gavin Dec 4, 2025
95c4783
fix: fix return logic
ChenZiHong-Gavin Dec 4, 2025
c844d65
refactor: rename operator split to chunk
ChenZiHong-Gavin Dec 4, 2025
c447936
refactor: refactor build_kg to accomodate ray data
ChenZiHong-Gavin Dec 4, 2025
3edbb81
feat: add StorageFactory & global params
ChenZiHong-Gavin Dec 4, 2025
ee0639d
refactor: refactor quiz to accomodata ray data engine
ChenZiHong-Gavin Dec 5, 2025
157f0b0
fix: reload graph before quizzing
ChenZiHong-Gavin Dec 5, 2025
99a6e5f
Merge branch 'main' of https://github.com/open-sciencelab/GraphGen in…
ChenZiHong-Gavin Dec 5, 2025
ec2033b
Potential fix for pull request finding 'Unreachable code'
ChenZiHong-Gavin Dec 5, 2025
bc07222
fix: fix quiz params
ChenZiHong-Gavin Dec 5, 2025
c9435d7
refactor: refactor quiz&judge to ray actors
ChenZiHong-Gavin Dec 10, 2025
c55fc09
Merge branch 'refactor/refactor-with-ray-data' of https://github.com/…
ChenZiHong-Gavin Dec 10, 2025
d7d6c2a
fix: fix transferring quizzed data to JudgeService
ChenZiHong-Gavin Dec 10, 2025
a6aedaf
refactor: refactor partition to accomodate ray data
ChenZiHong-Gavin Dec 10, 2025
ea1603b
fix: fix lint problem
ChenZiHong-Gavin Dec 10, 2025
244deb4
refactor: refactor op generate
ChenZiHong-Gavin Dec 11, 2025
d460a2a
feat: write results in output folder
ChenZiHong-Gavin Dec 11, 2025
cd011ad
fix: raise error when no dataset is created
ChenZiHong-Gavin Dec 11, 2025
aab7438
fix: return generator in ece_partitioner
ChenZiHong-Gavin Dec 11, 2025
7643b9f
fix: return generator in ece_partitioner
ChenZiHong-Gavin Dec 11, 2025
c42b604
refactor: refactor data format to support multi-modal input
ChenZiHong-Gavin Dec 11, 2025
42dc73e
fix: delete fetching schema to avoid ray's duplicate execution
ChenZiHong-Gavin Dec 11, 2025
73f70a5
fix: fix operators' registry
ChenZiHong-Gavin Dec 11, 2025
37cbfcf
feat: refactor schema_guided_extraction & add examples
ChenZiHong-Gavin Dec 11, 2025
b400d2e
feat: seperate ray logs and service logs
ChenZiHong-Gavin Dec 12, 2025
File renamed without changes.
File renamed without changes.
File renamed without changes.
1 change: 1 addition & 0 deletions examples/extract/extract_schema_guided/README.md
@@ -0,0 +1 @@
# Extract Schema-Guided Information from Documents
@@ -0,0 +1,3 @@
python3 -m graphgen.run \
--config_file examples/extract/extract_schema_guided/schema_guided_extraction_config.yaml \
--output_dir cache/
@@ -0,0 +1,34 @@
global_params:
  working_dir: cache

nodes:
  - id: read
    op_name: read
    type: source
    dependencies: []
    params:
      input_path:
        - examples/input_examples/extract_demo.txt

  - id: chunk
    op_name: chunk
    type: map_batch
    dependencies:
      - read
    execution_params:
      replicas: 4
    params:
      chunk_size: 20480 # larger chunk size for better context
      chunk_overlap: 2000

  - id: extract
    op_name: extract
    type: map_batch
    dependencies:
      - chunk
    execution_params:
      replicas: 1
      batch_size: 128
    params:
      method: schema_guided
      schema_path: graphgen/templates/extraction/schemas/legal_contract.json
3 changes: 3 additions & 0 deletions examples/generate/generate_aggregated_qa/README.md
@@ -0,0 +1,3 @@
# Generate Aggregated QAs

Aggregated mode is one of three question-answering scenarios in GraphGen (alongside atomic and multi-hop) designed to generate synthetic training data that incorporates complex, integrated knowledge from multiple sources.
77 changes: 77 additions & 0 deletions examples/generate/generate_aggregated_qa/aggregated_config.yaml
@@ -0,0 +1,77 @@
global_params:
  working_dir: cache

nodes:
  - id: read_files # id is unique in the pipeline and can be referenced by other steps
    op_name: read
    type: source
    dependencies: []
    params:
      input_path:
        - examples/input_examples/jsonl_demo.jsonl # input file path; supports json, jsonl, txt, pdf. See examples/input_examples for examples

  - id: chunk_documents
    op_name: chunk
    type: map_batch
    dependencies:
      - read_files
    execution_params:
      replicas: 4
    params:
      chunk_size: 1024 # chunk size for text splitting
      chunk_overlap: 100 # chunk overlap for text splitting

  - id: build_kg
    op_name: build_kg
    type: map_batch
    dependencies:
      - chunk_documents
    execution_params:
      replicas: 1
      batch_size: 128

  - id: quiz
    op_name: quiz
    type: aggregate
    dependencies:
      - build_kg
    execution_params:
      replicas: 1
      batch_size: 128
    params:
      quiz_samples: 2 # number of quiz samples to generate
      concurrency_limit: 200

  - id: judge
    op_name: judge
    type: map_batch
    dependencies:
      - quiz
    execution_params:
      replicas: 1
      batch_size: 128

  - id: partition
    op_name: partition
    type: aggregate
    dependencies:
      - judge
    params:
      method: ece # ece is a custom partition method based on comprehension loss
      method_params:
        max_units_per_community: 20 # max nodes and edges per community
        min_units_per_community: 5 # min nodes and edges per community
        max_tokens_per_community: 10240 # max tokens per community
        unit_sampling: max_loss # unit sampling strategy; supports random, max_loss, min_loss

  - id: generate
    op_name: generate
    type: map_batch
    dependencies:
      - partition
    execution_params:
      replicas: 1
      batch_size: 128
    params:
      method: aggregated # atomic, aggregated, multi_hop, cot, vqa
      data_format: ChatML # Alpaca, Sharegpt, ChatML
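
Reviewer note: the `dependencies` lists in the config above describe a directed acyclic graph of nodes. The sketch below shows one way such a list could be resolved into an execution order, using the standard-library `graphlib` module. It is an illustration only: the `dependencies` dictionary is hand-copied from the config, `execution_order` is a hypothetical helper, and this diff does not show how GraphGen itself schedules nodes.

```python
from graphlib import TopologicalSorter

# Node ids and their dependencies, hand-copied from aggregated_config.yaml.
dependencies = {
    "read_files": [],
    "chunk_documents": ["read_files"],
    "build_kg": ["chunk_documents"],
    "quiz": ["build_kg"],
    "judge": ["quiz"],
    "partition": ["judge"],
    "generate": ["partition"],
}


def execution_order(deps):
    """Return node ids so that every node appears after all of its dependencies.

    Hypothetical helper; raises graphlib.CycleError if the graph has a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because this pipeline is a single chain, only one valid order exists; a config with branches could have several.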
@@ -0,0 +1,3 @@
python3 -m graphgen.run \
--config_file examples/generate/generate_aggregated_qa/aggregated_config.yaml \
--output_dir cache/
3 changes: 3 additions & 0 deletions examples/generate/generate_atomic_qa/README.md
@@ -0,0 +1,3 @@
# Generate Atomic QAs

Atomic mode generates question-answer pairs that test basic, isolated knowledge from individual facts or relationships in the knowledge graph.
53 changes: 53 additions & 0 deletions examples/generate/generate_atomic_qa/atomic_config.yaml
@@ -0,0 +1,53 @@
global_params:
  working_dir: cache

nodes:
  - id: read
    op_name: read
    type: source
    dependencies: []
    params:
      input_path:
        - examples/input_examples/json_demo.json

  - id: chunk
    op_name: chunk
    type: map_batch
    dependencies:
      - read
    execution_params:
      replicas: 4
    params:
      chunk_size: 1024
      chunk_overlap: 100

  - id: build_kg
    op_name: build_kg
    type: map_batch
    execution_params:
      replicas: 1
      batch_size: 128
    dependencies:
      - chunk

  - id: partition
    op_name: partition
    type: aggregate
    dependencies:
      - build_kg
    params:
      method: dfs
      method_params:
        max_units_per_community: 1

  - id: generate
    op_name: generate
    type: map_batch
    dependencies:
      - partition
    execution_params:
      replicas: 1
      batch_size: 128
    params:
      method: atomic
      data_format: Alpaca
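
Reviewer note: `chunk_size` and `chunk_overlap` in the `chunk` node interact as a sliding window, where each chunk starts `chunk_size - chunk_overlap` units after the previous one. The sketch below illustrates that arithmetic on plain characters. It is an assumption-laden illustration: `chunk_text` is a hypothetical function, and the real `chunk` operator may count tokens instead of characters or split on different boundaries.

```python
def chunk_text(text, chunk_size=1024, chunk_overlap=100):
    """Split text into character windows (hypothetical helper).

    Consecutive windows share `chunk_overlap` characters, so the window
    start advances by `chunk_size - chunk_overlap` each step.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# A 2500-character sample with varying content so overlaps can be compared.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

With the values from the config, the window advances 924 characters at a time, so a 2500-character input yields three chunks, the last one shorter than `chunk_size`.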
3 changes: 3 additions & 0 deletions examples/generate/generate_atomic_qa/generate_atomic.sh
@@ -0,0 +1,3 @@
python3 -m graphgen.run \
--config_file examples/generate/generate_atomic_qa/atomic_config.yaml \
--output_dir cache/
1 change: 1 addition & 0 deletions examples/generate/generate_cot_qa/README.md
@@ -0,0 +1 @@
# Generate CoT QAs
55 changes: 55 additions & 0 deletions examples/generate/generate_cot_qa/cot_config.yaml
@@ -0,0 +1,55 @@
global_params:
  working_dir: cache

nodes:
  - id: read
    op_name: read
    type: source
    dependencies: []
    params:
      input_path:
        - examples/input_examples/txt_demo.txt

  - id: chunk
    op_name: chunk
    type: map_batch
    dependencies:
      - read
    execution_params:
      replicas: 4
    params:
      chunk_size: 1024
      chunk_overlap: 100

  - id: build_kg
    op_name: build_kg
    type: map_batch
    execution_params:
      replicas: 1
      batch_size: 128
    dependencies:
      - chunk

  - id: partition
    op_name: partition
    type: aggregate
    dependencies:
      - build_kg
    params:
      method: leiden
      method_params:
        max_size: 20
        use_lcc: false
        random_seed: 42

  - id: generate
    op_name: generate
    type: map_batch
    dependencies:
      - partition
    execution_params:
      replicas: 1
      batch_size: 128
    params:
      method: cot
      data_format: Sharegpt
3 changes: 3 additions & 0 deletions examples/generate/generate_cot_qa/generate_cot.sh
@@ -0,0 +1,3 @@
python3 -m graphgen.run \
--config_file examples/generate/generate_cot_qa/cot_config.yaml \
--output_dir cache/
1 change: 1 addition & 0 deletions examples/generate/generate_multi_hop_qa/README.md
@@ -0,0 +1 @@
# Generate Multi-hop QAs
3 changes: 3 additions & 0 deletions examples/generate/generate_multi_hop_qa/generate_multi_hop.sh
@@ -0,0 +1,3 @@
python3 -m graphgen.run \
--config_file examples/generate/generate_multi_hop_qa/multi_hop_config.yaml \
--output_dir cache/
56 changes: 56 additions & 0 deletions examples/generate/generate_multi_hop_qa/multi_hop_config.yaml
@@ -0,0 +1,56 @@
global_params:
  working_dir: cache

nodes:
  - id: read
    op_name: read
    type: source
    dependencies: []
    params:
      input_path:
        - examples/input_examples/csv_demo.csv

  - id: chunk
    op_name: chunk
    type: map_batch
    dependencies:
      - read
    execution_params:
      replicas: 4
    params:
      chunk_size: 1024
      chunk_overlap: 100

  - id: build_kg
    op_name: build_kg
    type: map_batch
    dependencies:
      - chunk
    execution_params:
      replicas: 1
      batch_size: 128

  - id: partition
    op_name: partition
    type: aggregate
    dependencies:
      - build_kg
    params:
      method: ece
      method_params:
        max_units_per_community: 3
        min_units_per_community: 3
        max_tokens_per_community: 10240
        unit_sampling: random

  - id: generate
    op_name: generate
    type: map_batch
    dependencies:
      - partition
    execution_params:
      replicas: 1
      batch_size: 128
    params:
      method: multi_hop
      data_format: ChatML
1 change: 1 addition & 0 deletions examples/generate/generate_vqa/README.md
@@ -0,0 +1 @@
# Generate VQAs
3 changes: 3 additions & 0 deletions examples/generate/generate_vqa/generate_vqa.sh
@@ -0,0 +1,3 @@
python3 -m graphgen.run \
--config_file examples/generate/generate_vqa/vqa_config.yaml \
--output_dir cache/
57 changes: 57 additions & 0 deletions examples/generate/generate_vqa/vqa_config.yaml
@@ -0,0 +1,57 @@
global_params:
  working_dir: cache

nodes:
  - id: read
    op_name: read
    type: source
    dependencies: []
    params:
      input_path:
        - examples/input_examples/vqa_demo.json
      modalities:
        - text
        - image

  - id: chunk
    op_name: chunk
    type: map_batch
    dependencies:
      - read
    execution_params:
      replicas: 4
    params:
      chunk_size: 1024
      chunk_overlap: 100

  - id: build_kg
    op_name: build_kg
    type: map_batch
    dependencies:
      - chunk
    execution_params:
      replicas: 1
      batch_size: 128

  - id: partition
    op_name: partition
    type: aggregate
    dependencies:
      - build_kg
    params:
      method: anchor_bfs
      method_params:
        anchor_type: image
        max_units_per_community: 10

  - id: generate
    op_name: generate
    type: map_batch
    dependencies:
      - partition
    execution_params:
      replicas: 1
      batch_size: 128
    params:
      method: vqa
      data_format: ChatML
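
Reviewer note: every config in this PR relies on node `id` values being unique and on each `dependencies` entry naming an existing `id`. A small pre-flight check along these lines might catch typos before a run starts. This is a sketch under assumptions: `find_config_problems` is a hypothetical helper, the node list is hand-copied from vqa_config.yaml with only the `id` and `dependencies` fields kept, and GraphGen's own loader may validate configs differently.

```python
def find_config_problems(nodes):
    """Return human-readable problems found in a node list (hypothetical helper).

    Checks that ids are unique and that every dependency names a known id.
    An empty list means no problems were found.
    """
    problems = []
    seen_ids = set()
    for node in nodes:
        if node["id"] in seen_ids:
            problems.append(f"duplicate id: {node['id']}")
        seen_ids.add(node["id"])
    for node in nodes:
        for dep in node.get("dependencies", []):
            if dep not in seen_ids:
                problems.append(f"{node['id']} depends on unknown id: {dep}")
    return problems


# Ids and dependencies hand-copied from vqa_config.yaml.
vqa_nodes = [
    {"id": "read", "dependencies": []},
    {"id": "chunk", "dependencies": ["read"]},
    {"id": "build_kg", "dependencies": ["chunk"]},
    {"id": "partition", "dependencies": ["build_kg"]},
    {"id": "generate", "dependencies": ["partition"]},
]

# Same pipeline with a deliberately misspelled dependency.
broken_nodes = vqa_nodes[:-1] + [{"id": "generate", "dependencies": ["partion"]}]

print(find_config_problems(vqa_nodes))
print(find_config_problems(broken_nodes))
```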