From c4aab28958c0670c0e631f1778ece2392079fc61 Mon Sep 17 00:00:00 2001 From: ocean-zhc Date: Wed, 28 Jan 2026 11:29:33 +0800 Subject: [PATCH] [Feature][SeaTunnel Skill] add seatunnel-skills --- README.md | 653 +++++++++++++++++++- README_CN.md | 641 ++++++++++++++++++++ SKILL_SETUP_GUIDE.md | 673 +++++++++++++++++++++ seatunnel-skill/SKILL.md | 1212 ++++++++++++++++++++++++++++++++++++++ 4 files changed, 3154 insertions(+), 25 deletions(-) create mode 100644 README_CN.md create mode 100644 SKILL_SETUP_GUIDE.md create mode 100644 seatunnel-skill/SKILL.md diff --git a/README.md b/README.md index 1817050..9c1f431 100644 --- a/README.md +++ b/README.md @@ -1,38 +1,641 @@ # Apache SeaTunnel Tools -This repository hosts auxiliary tools for Apache SeaTunnel. It focuses on developer/operator productivity around configuration, conversion, packaging and diagnostics. Current modules: +**English** | [中文](README_CN.md) -- x2seatunnel: Convert configurations (e.g., DataX) into SeaTunnel configuration files. +Auxiliary tools for Apache SeaTunnel focusing on developer/operator productivity around configuration, conversion, LLM integration, packaging, and diagnostics. -More tools may be added in the future. For the main data integration engine, see the -[Apache SeaTunnel](https://github.com/apache/seatunnel) project. +## 🎯 What's Inside -## Tool 1 - SeaTunnel MCP Server +| Tool | Purpose | Status | +|------|---------|--------| +| **SeaTunnel Skill** | Claude AI integration for SeaTunnel operations | ✅ New | +| **SeaTunnel MCP Server** | Model Context Protocol for LLM integration | ✅ Available | +| **x2seatunnel** | Configuration converter (DataX → SeaTunnel) | ✅ Available | -What is MCP? -- MCP (Model Context Protocol) is an open protocol for connecting LLMs to tools, data, and systems. With SeaTunnel MCP, you can operate SeaTunnel directly from an LLM-powered interface while keeping the server-side logic secure and auditable. -- Learn more: https://github.com/modelcontextprotocol +--- -SeaTunnel MCP Server -- Source folder: [seatunnel-mcp/](seatunnel-mcp/) -- English README: [seatunnel-mcp/README.md](seatunnel-mcp/README.md) -- Chinese: [seatunnel-mcp/README_CN.md](seatunnel-mcp/README_CN.md) -- Quick Start: [seatunnel-mcp/docs/QUICK_START.md](seatunnel-mcp/docs/QUICK_START.md) -- User Guide: [seatunnel-mcp/docs/USER_GUIDE.md](seatunnel-mcp/docs/USER_GUIDE.md) -- Developer Guide: [seatunnel-mcp/docs/DEVELOPER_GUIDE.md](seatunnel-mcp/docs/DEVELOPER_GUIDE.md) +## ⚡ Quick Start -For screenshots, demo video, features, installation and usage instructions, please refer to the README in the seatunnel-mcp directory. +### For SeaTunnel Skill (Claude Code Integration) -## Tool 2 - x2seatunnel +**Installation & Setup:** -What is x2seatunnel? -- x2seatunnel is a configuration conversion tool that helps users migrate from other data integration tools (e.g., DataX) to SeaTunnel by converting existing configurations into SeaTunnel-compatible formats. -- x2seatunnel - - English: [x2seatunnel/README.md](x2seatunnel/README.md) - - Chinese: [x2seatunnel/README_zh.md](x2seatunnel/README_zh.md) +```bash +# 1. Clone this repository +git clone https://github.com/apache/seatunnel-tools.git +cd seatunnel-tools -## Contributing +# 2. Copy seatunnel-skill to Claude Code skills directory +cp -r seatunnel-skill ~/.claude/skills/ -Issues and PRs are welcome. +# 3. 
Restart Claude Code or reload skills +# Then use: /seatunnel-skill "your prompt here" +``` -Get the main project from [Apache SeaTunnel](https://github.com/apache/seatunnel) +**Quick Example:** + +```bash +# Query SeaTunnel documentation +/seatunnel-skill "How do I configure a MySQL to PostgreSQL job?" + +# Get connector information +/seatunnel-skill "List all available Kafka connector options" + +# Debug configuration issues +/seatunnel-skill "Why is my job failing with OutOfMemoryError?" +``` + +### For SeaTunnel Core (Direct Installation) + +```bash +# Download binary (recommended) +wget https://archive.apache.org/dist/seatunnel/2.3.12/apache-seatunnel-2.3.12-bin.tar.gz +tar -xzf apache-seatunnel-2.3.12-bin.tar.gz +cd apache-seatunnel-2.3.12 + +# Verify installation +./bin/seatunnel.sh --version + +# Run your first job +./bin/seatunnel.sh -c config/hello_world.conf -e spark +``` + +--- + +## 📋 Features Overview + +### SeaTunnel Skill +- 🤖 **AI-Powered Assistant**: Get instant help with SeaTunnel concepts and configurations +- 📚 **Knowledge Integration**: Query official documentation and best practices +- 🔍 **Smart Debugging**: Analyze errors and suggest fixes +- 💡 **Code Examples**: Generate configuration examples for your use case + +### SeaTunnel Core Engine +- **Multimodal Support**: Structured, unstructured, and semi-structured data +- **100+ Connectors**: Databases, data warehouses, cloud services, message queues +- **Multiple Engines**: Zeta (lightweight), Spark, Flink +- **Synchronization Modes**: Batch, Streaming, CDC (Change Data Capture) +- **Real-time Performance**: 100K - 1M records/second throughput + +--- + +## 🔧 Installation & Setup + +### Method 1: SeaTunnel Skill (AI Integration) + +**Step 1: Copy Skill File** +```bash +mkdir -p ~/.claude/skills +cp -r seatunnel-skill ~/.claude/skills/ +``` + +**Step 2: Verify Installation** +```bash +# In Claude Code, try: +/seatunnel-skill "What is SeaTunnel?" 
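+# Expected: a short overview of SeaTunnel comes back. If Claude Code
+# reports "Unknown skill: seatunnel-skill" instead, re-check that the
+# folder from Step 1 landed at ~/.claude/skills/seatunnel-skill
+# (see SKILL_SETUP_GUIDE.md for detailed troubleshooting).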
+``` + +**Step 3: Start Using** +```bash +# Help with configuration +/seatunnel-skill "Create a MySQL to Elasticsearch job config" + +# Troubleshoot errors +/seatunnel-skill "My Kafka connector keeps timing out" + +# Learn features +/seatunnel-skill "Explain CDC (Change Data Capture) in SeaTunnel" +``` + +### Method 2: SeaTunnel Binary Installation + +**Supported Platforms**: Linux, macOS, Windows + +```bash +# Download latest version +VERSION=2.3.12 +wget https://archive.apache.org/dist/seatunnel/${VERSION}/apache-seatunnel-${VERSION}-bin.tar.gz + +# Extract +tar -xzf apache-seatunnel-${VERSION}-bin.tar.gz +cd apache-seatunnel-${VERSION} + +# Set environment +export JAVA_HOME=/path/to/java +export PATH=$PATH:$(pwd)/bin + +# Verify +seatunnel.sh --version +``` + +### Method 3: Build from Source + +```bash +# Clone repository +git clone https://github.com/apache/seatunnel.git +cd seatunnel + +# Build +mvn clean install -DskipTests + +# Run from distribution +cd seatunnel-dist/target/apache-seatunnel-*-bin/apache-seatunnel-* +./bin/seatunnel.sh --version +``` + +### Method 4: Docker + +```bash +# Pull official image +docker pull apache/seatunnel:latest + +# Run container +docker run -it apache/seatunnel:latest /bin/bash + +# Run job directly +docker run -v /path/to/config:/config \ + apache/seatunnel:latest \ + seatunnel.sh -c /config/job.conf -e spark +``` + +--- + +## 💻 Usage Guide + +### Use Case 1: MySQL to PostgreSQL (Batch) + +**config/mysql_to_postgres.conf** +```hocon +env { + job.mode = "BATCH" + job.name = "MySQL to PostgreSQL" +} + +source { + Jdbc { + driver = "com.mysql.cj.jdbc.Driver" + url = "jdbc:mysql://mysql-host:3306/mydb" + user = "root" + password = "password" + query = "SELECT * FROM users" + connection_check_timeout_sec = 100 + } +} + +sink { + Jdbc { + driver = "org.postgresql.Driver" + url = "jdbc:postgresql://pg-host:5432/mydb" + user = "postgres" + password = "password" + database = "mydb" + table = "users" + primary_keys = ["id"] + connection_check_timeout_sec = 100 + } +} +``` + +**Run:** +```bash +seatunnel.sh -c config/mysql_to_postgres.conf -e spark +``` + +### Use Case 2: Kafka Streaming to Elasticsearch + +**config/kafka_to_es.conf** +```hocon +env { + job.mode = "STREAMING" + job.name = "Kafka to Elasticsearch" + parallelism = 2 +} + +source { + Kafka { + bootstrap.servers = "kafka-host:9092" + topic = "events" + consumer.group = "seatunnel-group" + format = "json" + schema = { + fields { + event_id = "bigint" + event_name = "string" + timestamp = "bigint" + } + } + } +} + +sink { + Elasticsearch { + hosts = ["es-host:9200"] + index = "events" + username = "elastic" + password = "password" + } +} +``` + +**Run:** +```bash +seatunnel.sh -c config/kafka_to_es.conf -e flink +``` + +### Use Case 3: MySQL CDC to Kafka + +**config/mysql_cdc_kafka.conf** +```hocon +env { + job.mode = "STREAMING" + job.name = "MySQL CDC to Kafka" +} + +source { + Mysql { + server_id = 5400 + hostname = "mysql-host" + port = 3306 + username = "root" + password = "password" + database = ["mydb"] + table = ["users", "orders"] + startup.mode = "initial" + } +} + +sink { + Kafka { + bootstrap.servers = "kafka-host:9092" + topic = "mysql_cdc" + format = "canal_json" + semantic = "EXACTLY_ONCE" + } +} +``` + +**Run:** +```bash +seatunnel.sh -c config/mysql_cdc_kafka.conf -e flink +``` + +--- + +## 📚 API Reference + +### Core Connector Types + +**Source Connectors** +- `Jdbc` - Generic JDBC databases (MySQL, PostgreSQL, Oracle, SQL Server) +- `Kafka` - Apache Kafka topics +- `Mysql` - 
MySQL with CDC support +- `MongoDB` - MongoDB collections +- `PostgreSQL` - PostgreSQL with CDC +- `S3` - Amazon S3 and compatible storage +- `Http` - HTTP/HTTPS endpoints +- `FakeSource` - For testing + +**Sink Connectors** +- `Jdbc` - Write to JDBC-compatible databases +- `Kafka` - Publish to Kafka topics +- `Elasticsearch` - Write to Elasticsearch indices +- `S3` - Write to S3 buckets +- `Redis` - Write to Redis +- `HBase` - Write to HBase tables +- `Console` - Output to console + +**Transform Connectors** +- `Sql` - Execute SQL transformations +- `FieldMapper` - Rename/map columns +- `JsonPath` - Extract data from JSON + +--- + +## ⚙️ Configuration & Tuning + +### Environment Variables + +```bash +# Java configuration +export JAVA_HOME=/path/to/java +export JVM_OPTS="-Xms1G -Xmx4G" + +# Spark configuration (if using Spark engine) +export SPARK_HOME=/path/to/spark +export SPARK_MASTER=spark://master:7077 + +# Flink configuration (if using Flink engine) +export FLINK_HOME=/path/to/flink + +# SeaTunnel configuration +export SEATUNNEL_HOME=/path/to/seatunnel +``` + +### Performance Tuning for Batch Jobs + +```hocon +env { + job.mode = "BATCH" + parallelism = 8 # Increase for larger clusters +} + +source { + Jdbc { + split_size = 100000 # Parallel reads + fetch_size = 5000 + } +} + +sink { + Jdbc { + batch_size = 1000 # Batch inserts + max_retries = 3 + } +} +``` + +### Performance Tuning for Streaming Jobs + +```hocon +env { + job.mode = "STREAMING" + parallelism = 4 + checkpoint.interval = 30000 # 30 seconds +} + +source { + Kafka { + consumer.group = "seatunnel-consumer" + max_poll_records = 500 + } +} +``` + +--- + +## 🛠️ Development Guide + +### Project Structure + +``` +seatunnel-tools/ +├── seatunnel-skill/ # Claude Code AI skill +├── seatunnel-mcp/ # MCP server for LLM integration +├── x2seatunnel/ # DataX to SeaTunnel converter +└── README.md +``` + +### SeaTunnel Core Architecture + +``` +seatunnel/ +├── seatunnel-api/ # Core APIs +├── seatunnel-core/ # Execution engine +├── seatunnel-engines/ # Engine implementations +│ ├── seatunnel-engine-flink/ +│ ├── seatunnel-engine-spark/ +│ └── seatunnel-engine-zeta/ +├── seatunnel-connectors/ # Connector implementations +└── seatunnel-dist/ # Distribution package +``` + +### Building SeaTunnel from Source + +```bash +# Full build +git clone https://github.com/apache/seatunnel.git +cd seatunnel +mvn clean install -DskipTests + +# Build specific module +mvn clean install -pl seatunnel-connectors/seatunnel-connectors-seatunnel-kafka -DskipTests +``` + +### Running Tests + +```bash +# Unit tests +mvn test + +# Specific test class +mvn test -Dtest=MySqlConnectorTest + +# Integration tests +mvn verify +``` + +--- + +## 🐛 Troubleshooting (6 Common Issues) + +### Issue 1: ClassNotFoundException: com.mysql.jdbc.Driver + +**Solution:** +```bash +wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.33.jar +cp mysql-connector-java-8.0.33.jar $SEATUNNEL_HOME/lib/ +seatunnel.sh -c config/job.conf -e spark +``` + +### Issue 2: OutOfMemoryError: Java heap space + +**Solution:** +```bash +export JVM_OPTS="-Xms2G -Xmx8G" +echo 'JVM_OPTS="-Xms2G -Xmx8G"' >> $SEATUNNEL_HOME/bin/seatunnel-env.sh +``` + +### Issue 3: Connection refused: connect + +**Solution:** +```bash +# Verify connectivity +ping source-host +telnet source-host 3306 + +# Check credentials +mysql -h source-host -u root -p +``` + +### Issue 4: Table not found during CDC + +**Solution:** +```sql +-- Check binlog status +SHOW VARIABLES LIKE 'log_bin'; + +-- Enable binlog 
in my.cnf +[mysqld] +log_bin = mysql-bin +binlog_format = row +``` + +### Issue 5: Slow Job Performance + +**Solution:** +```hocon +env { + parallelism = 8 # Increase parallelism +} + +source { + Jdbc { + fetch_size = 5000 + split_size = 100000 + } +} + +sink { + Jdbc { + batch_size = 2000 + } +} +``` + +### Issue 6: Kafka offset out of range + +**Solution:** +```hocon +source { + Kafka { + auto.offset.reset = "earliest" # or "latest" + } +} +``` + +--- + +## ❓ FAQ (8 Common Questions) + +**Q: What's the difference between BATCH and STREAMING mode?** + +A: +- **BATCH**: One-time execution, suitable for full database migration +- **STREAMING**: Continuous execution, suitable for real-time sync and CDC + +**Q: How do I handle schema changes during CDC?** + +A: Configure auto-detection in source: +```hocon +source { + Mysql { + schema_change_mode = "auto" + } +} +``` + +**Q: Can I transform data during synchronization?** + +A: Yes, use SQL transform: +```hocon +transform { + Sql { + sql = "SELECT id, UPPER(name) as name FROM source" + } +} +``` + +**Q: What's the maximum throughput?** + +A: Typical throughput is 100K - 1M records/second per executor. Depends on: +- Hardware (CPU, RAM, Network) +- Database configuration +- Data size per record +- Network latency + +**Q: How do I handle errors in production?** + +A: Configure restart strategy: +```hocon +env { + restart_strategy = "exponential_delay" + restart_strategy.exponential_delay.initial_delay = 1000 + restart_strategy.exponential_delay.max_delay = 30000 + restart_strategy.exponential_delay.multiplier = 2.0 +} +``` + +**Q: Is there a web UI for job management?** + +A: Yes! Use SeaTunnel Web Project: +```bash +git clone https://github.com/apache/seatunnel-web.git +cd seatunnel-web +mvn clean install +java -jar target/seatunnel-web-*.jar +# Access at http://localhost:8080 +``` + +**Q: How do I use the SeaTunnel Skill with Claude Code?** + +A: After copying to `~/.claude/skills/`, use: +```bash +/seatunnel-skill "your question about SeaTunnel" +``` + +**Q: Which engine should I use: Spark, Flink, or Zeta?** + +A: +- **Zeta**: Lightweight, no external dependencies, single machine +- **Spark**: Batch and batch-stream processing on distributed clusters +- **Flink**: Advanced streaming and CDC on distributed clusters + +--- + +## 🔗 Resources & Links + +### Official Documentation +- [SeaTunnel Website](https://seatunnel.apache.org/) +- [GitHub Repository](https://github.com/apache/seatunnel) +- [Connector List](https://seatunnel.apache.org/docs/2.3.12/connector-v2/overview) +- [HOCON Configuration Guide](https://github.com/lightbend/config/blob/main/HOCON.md) + +### Community & Support +- [Slack Channel](https://the-asf.slack.com/archives/C01CB5186TL) +- [Mailing Lists](https://seatunnel.apache.org/community/mail-lists/) +- [GitHub Issues](https://github.com/apache/seatunnel/issues) +- [Discussion Forum](https://github.com/apache/seatunnel/discussions) + +### Related Projects +- [SeaTunnel Web UI](https://github.com/apache/seatunnel-web) +- [SeaTunnel Tools](https://github.com/apache/seatunnel-tools) +- [Apache Kafka](https://kafka.apache.org/) +- [Apache Flink](https://flink.apache.org/) +- [Apache Spark](https://spark.apache.org/) + +--- + +## 📄 Individual Tools + +### 1. SeaTunnel Skill (New) +- **Purpose**: AI-powered assistant for SeaTunnel in Claude Code +- **Location**: [seatunnel-skill/](seatunnel-skill/) +- **Quick Setup**: `cp -r seatunnel-skill ~/.claude/skills/` +- **Usage**: `/seatunnel-skill "your question"` + +### 2. 
SeaTunnel MCP Server +- **Purpose**: Model Context Protocol integration for LLM systems +- **Location**: [seatunnel-mcp/](seatunnel-mcp/) +- **English**: [README.md](seatunnel-mcp/README.md) +- **Chinese**: [README_CN.md](seatunnel-mcp/README_CN.md) +- **Quick Start**: [QUICK_START.md](seatunnel-mcp/docs/QUICK_START.md) + +### 3. x2seatunnel +- **Purpose**: Convert DataX and other configurations to SeaTunnel format +- **Location**: [x2seatunnel/](x2seatunnel/) +- **English**: [README.md](x2seatunnel/README.md) +- **Chinese**: [README_zh.md](x2seatunnel/README_zh.md) + +--- + +## 🤝 Contributing + +Issues and PRs are welcome! + +For the main SeaTunnel engine, see [Apache SeaTunnel](https://github.com/apache/seatunnel). + +For these tools, please contribute to [SeaTunnel Tools](https://github.com/apache/seatunnel-tools). + +--- + +**Last Updated**: 2026-01-28 | **License**: Apache 2.0 diff --git a/README_CN.md b/README_CN.md new file mode 100644 index 0000000..5541aa1 --- /dev/null +++ b/README_CN.md @@ -0,0 +1,641 @@ +# Apache SeaTunnel 工具集 + +[English](README.md) | **中文** + +Apache SeaTunnel 辅助工具集,重点关注开发者/运维生产力,包括配置转换、LLM 集成、打包和诊断。 + +## 🎯 工具概览 + +| 工具 | 用途 | 状态 | +|------|------|------| +| **SeaTunnel Skill** | Claude AI 集成 | ✅ 新功能 | +| **SeaTunnel MCP 服务** | LLM 集成协议 | ✅ 可用 | +| **x2seatunnel** | 配置转换工具 (DataX → SeaTunnel) | ✅ 可用 | + +--- + +## ⚡ 快速开始 + +### SeaTunnel Skill (Claude Code 集成) + +**安装步骤:** + +```bash +# 1. 克隆本仓库 +git clone https://github.com/apache/seatunnel-tools.git +cd seatunnel-tools + +# 2. 复制 seatunnel-skill 到 Claude Code 技能目录 +cp -r seatunnel-skill ~/.claude/skills/ + +# 3. 重启 Claude Code 或重新加载技能 +# 然后使用: /seatunnel-skill "你的问题" +``` + +**快速示例:** + +```bash +# 查询 SeaTunnel 文档 +/seatunnel-skill "如何配置 MySQL 到 PostgreSQL 的数据同步?" + +# 获取连接器信息 +/seatunnel-skill "列出所有可用的 Kafka 连接器选项" + +# 调试配置问题 +/seatunnel-skill "为什么我的任务出现 OutOfMemoryError 错误?" +``` + +### SeaTunnel 核心引擎(直接安装) + +```bash +# 下载二进制文件(推荐) +wget https://archive.apache.org/dist/seatunnel/2.3.12/apache-seatunnel-2.3.12-bin.tar.gz +tar -xzf apache-seatunnel-2.3.12-bin.tar.gz +cd apache-seatunnel-2.3.12 + +# 验证安装 +./bin/seatunnel.sh --version + +# 运行第一个任务 +./bin/seatunnel.sh -c config/hello_world.conf -e spark +``` + +--- + +## 📋 功能概览 + +### SeaTunnel Skill +- 🤖 **AI 助手**: 获得 SeaTunnel 概念和配置的即时帮助 +- 📚 **知识集成**: 查询官方文档和最佳实践 +- 🔍 **智能调试**: 分析错误并提出修复建议 +- 💡 **代码示例**: 为您的用例生成配置示例 + +### SeaTunnel 核心引擎 +- **多模式支持**: 结构化、非结构化和半结构化数据 +- **100+ 连接器**: 数据库、数据仓库、云服务、消息队列 +- **多引擎支持**: Zeta(轻量级)、Spark、Flink +- **同步模式**: 批处理、流处理、CDC(变更数据捕获) +- **实时性能**: 每秒 100K - 1M 条记录吞吐量 + +--- + +## 🔧 安装与设置 + +### 方法 1: SeaTunnel Skill (AI 集成) + +**第一步:复制技能文件** +```bash +mkdir -p ~/.claude/skills +cp -r seatunnel-skill ~/.claude/skills/ +``` + +**第二步:验证安装** +```bash +# 在 Claude Code 中尝试: +/seatunnel-skill "什么是 SeaTunnel?" 
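+# 预期会返回 SeaTunnel 的简要介绍;如果提示 "Unknown skill: seatunnel-skill",
+# 请确认第一步已将技能目录复制到 ~/.claude/skills/seatunnel-skill
+# (详细排查步骤见 SKILL_SETUP_GUIDE.md)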
+``` + +**第三步:开始使用** +```bash +# 帮助配置 +/seatunnel-skill "创建一个 MySQL 到 Elasticsearch 的任务配置" + +# 故障排除 +/seatunnel-skill "我的 Kafka 连接器一直超时" + +# 学习功能 +/seatunnel-skill "在 SeaTunnel 中解释 CDC(变更数据捕获)" +``` + +### 方法 2: 二进制安装 + +**支持平台**: Linux、macOS、Windows + +```bash +# 下载最新版本 +VERSION=2.3.12 +wget https://archive.apache.org/dist/seatunnel/${VERSION}/apache-seatunnel-${VERSION}-bin.tar.gz + +# 解压 +tar -xzf apache-seatunnel-${VERSION}-bin.tar.gz +cd apache-seatunnel-${VERSION} + +# 设置环境 +export JAVA_HOME=/path/to/java +export PATH=$PATH:$(pwd)/bin + +# 验证 +seatunnel.sh --version +``` + +### 方法 3: 从源代码构建 + +```bash +# 克隆仓库 +git clone https://github.com/apache/seatunnel.git +cd seatunnel + +# 构建 +mvn clean install -DskipTests + +# 从分发目录运行 +cd seatunnel-dist/target/apache-seatunnel-*-bin/apache-seatunnel-* +./bin/seatunnel.sh --version +``` + +### 方法 4: Docker + +```bash +# 拉取官方镜像 +docker pull apache/seatunnel:latest + +# 运行容器 +docker run -it apache/seatunnel:latest /bin/bash + +# 直接运行任务 +docker run -v /path/to/config:/config \ + apache/seatunnel:latest \ + seatunnel.sh -c /config/job.conf -e spark +``` + +--- + +## 💻 使用指南 + +### 用例 1: MySQL 到 PostgreSQL(批处理) + +**config/mysql_to_postgres.conf** +```hocon +env { + job.mode = "BATCH" + job.name = "MySQL 到 PostgreSQL" +} + +source { + Jdbc { + driver = "com.mysql.cj.jdbc.Driver" + url = "jdbc:mysql://mysql-host:3306/mydb" + user = "root" + password = "password" + query = "SELECT * FROM users" + connection_check_timeout_sec = 100 + } +} + +sink { + Jdbc { + driver = "org.postgresql.Driver" + url = "jdbc:postgresql://pg-host:5432/mydb" + user = "postgres" + password = "password" + database = "mydb" + table = "users" + primary_keys = ["id"] + connection_check_timeout_sec = 100 + } +} +``` + +**运行:** +```bash +seatunnel.sh -c config/mysql_to_postgres.conf -e spark +``` + +### 用例 2: Kafka 流到 Elasticsearch + +**config/kafka_to_es.conf** +```hocon +env { + job.mode = "STREAMING" + job.name = "Kafka 到 Elasticsearch" + parallelism = 2 +} + +source { + Kafka { + bootstrap.servers = "kafka-host:9092" + topic = "events" + consumer.group = "seatunnel-group" + format = "json" + schema = { + fields { + event_id = "bigint" + event_name = "string" + timestamp = "bigint" + } + } + } +} + +sink { + Elasticsearch { + hosts = ["es-host:9200"] + index = "events" + username = "elastic" + password = "password" + } +} +``` + +**运行:** +```bash +seatunnel.sh -c config/kafka_to_es.conf -e flink +``` + +### 用例 3: MySQL CDC 到 Kafka + +**config/mysql_cdc_kafka.conf** +```hocon +env { + job.mode = "STREAMING" + job.name = "MySQL CDC 到 Kafka" +} + +source { + Mysql { + server_id = 5400 + hostname = "mysql-host" + port = 3306 + username = "root" + password = "password" + database = ["mydb"] + table = ["users", "orders"] + startup.mode = "initial" + } +} + +sink { + Kafka { + bootstrap.servers = "kafka-host:9092" + topic = "mysql_cdc" + format = "canal_json" + semantic = "EXACTLY_ONCE" + } +} +``` + +**运行:** +```bash +seatunnel.sh -c config/mysql_cdc_kafka.conf -e flink +``` + +--- + +## 📚 API 参考 + +### 核心连接器类型 + +**源连接器** +- `Jdbc` - 通用 JDBC 数据库(MySQL、PostgreSQL、Oracle、SQL Server) +- `Kafka` - Apache Kafka 主题 +- `Mysql` - 支持 CDC 的 MySQL +- `MongoDB` - MongoDB 集合 +- `PostgreSQL` - 支持 CDC 的 PostgreSQL +- `S3` - Amazon S3 和兼容存储 +- `Http` - HTTP/HTTPS 端点 +- `FakeSource` - 用于测试 + +**宿连接器** +- `Jdbc` - 写入 JDBC 兼容数据库 +- `Kafka` - 发布到 Kafka 主题 +- `Elasticsearch` - 写入 Elasticsearch 索引 +- `S3` - 写入 S3 存储桶 +- `Redis` - 写入 Redis +- `HBase` - 写入 HBase 表 +- `Console` - 输出到控制台 + +**转换连接器** +- 
`Sql` - 执行 SQL 转换 +- `FieldMapper` - 列重命名/映射 +- `JsonPath` - 从 JSON 提取数据 + +--- + +## ⚙️ 配置与优化 + +### 环境变量 + +```bash +# Java 配置 +export JAVA_HOME=/path/to/java +export JVM_OPTS="-Xms1G -Xmx4G" + +# Spark 配置(使用 Spark 引擎时) +export SPARK_HOME=/path/to/spark +export SPARK_MASTER=spark://master:7077 + +# Flink 配置(使用 Flink 引擎时) +export FLINK_HOME=/path/to/flink + +# SeaTunnel 配置 +export SEATUNNEL_HOME=/path/to/seatunnel +``` + +### 批处理任务性能调优 + +```hocon +env { + job.mode = "BATCH" + parallelism = 8 # 根据集群大小增加 +} + +source { + Jdbc { + split_size = 100000 # 并行读取 + fetch_size = 5000 + } +} + +sink { + Jdbc { + batch_size = 1000 # 批量插入 + max_retries = 3 + } +} +``` + +### 流处理任务性能调优 + +```hocon +env { + job.mode = "STREAMING" + parallelism = 4 + checkpoint.interval = 30000 # 30 秒 +} + +source { + Kafka { + consumer.group = "seatunnel-consumer" + max_poll_records = 500 + } +} +``` + +--- + +## 🛠️ 开发指南 + +### 项目结构 + +``` +seatunnel-tools/ +├── seatunnel-skill/ # Claude Code AI 技能 +├── seatunnel-mcp/ # LLM 集成 MCP 服务 +├── x2seatunnel/ # DataX 到 SeaTunnel 转换器 +└── README_CN.md +``` + +### SeaTunnel 核心架构 + +``` +seatunnel/ +├── seatunnel-api/ # 核心 API +├── seatunnel-core/ # 执行引擎 +├── seatunnel-engines/ # 引擎实现 +│ ├── seatunnel-engine-flink/ +│ ├── seatunnel-engine-spark/ +│ └── seatunnel-engine-zeta/ +├── seatunnel-connectors/ # 连接器实现 +└── seatunnel-dist/ # 分发包 +``` + +### 从源代码构建 SeaTunnel + +```bash +# 完整构建 +git clone https://github.com/apache/seatunnel.git +cd seatunnel +mvn clean install -DskipTests + +# 构建特定模块 +mvn clean install -pl seatunnel-connectors/seatunnel-connectors-seatunnel-kafka -DskipTests +``` + +### 运行测试 + +```bash +# 单元测试 +mvn test + +# 特定测试类 +mvn test -Dtest=MySqlConnectorTest + +# 集成测试 +mvn verify +``` + +--- + +## 🐛 故障排查(6 个常见问题) + +### 问题 1: ClassNotFoundException: com.mysql.jdbc.Driver + +**解决方案:** +```bash +wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.33.jar +cp mysql-connector-java-8.0.33.jar $SEATUNNEL_HOME/lib/ +seatunnel.sh -c config/job.conf -e spark +``` + +### 问题 2: OutOfMemoryError: Java heap space + +**解决方案:** +```bash +export JVM_OPTS="-Xms2G -Xmx8G" +echo 'JVM_OPTS="-Xms2G -Xmx8G"' >> $SEATUNNEL_HOME/bin/seatunnel-env.sh +``` + +### 问题 3: Connection refused: connect + +**解决方案:** +```bash +# 验证连接 +ping source-host +telnet source-host 3306 + +# 检查凭证 +mysql -h source-host -u root -p +``` + +### 问题 4: CDC 期间找不到表 + +**解决方案:** +```sql +-- 检查二进制日志状态 +SHOW VARIABLES LIKE 'log_bin'; + +-- 在 my.cnf 中启用二进制日志 +[mysqld] +log_bin = mysql-bin +binlog_format = row +``` + +### 问题 5: 任务性能缓慢 + +**解决方案:** +```hocon +env { + parallelism = 8 # 增加并行性 +} + +source { + Jdbc { + fetch_size = 5000 + split_size = 100000 + } +} + +sink { + Jdbc { + batch_size = 2000 + } +} +``` + +### 问题 6: Kafka 偏移量超出范围 + +**解决方案:** +```hocon +source { + Kafka { + auto.offset.reset = "earliest" # 或 "latest" + } +} +``` + +--- + +## ❓ 常见问题(8 个常见问题) + +**Q: BATCH 和 STREAMING 模式有什么区别?** + +A: +- **BATCH**: 一次性执行,适合全量数据库迁移 +- **STREAMING**: 持续执行,适合实时同步和 CDC + +**Q: 如何在 CDC 期间处理架构更改?** + +A: 在源中配置自动检测: +```hocon +source { + Mysql { + schema_change_mode = "auto" + } +} +``` + +**Q: 我能在同步期间转换数据吗?** + +A: 可以,使用 SQL 转换: +```hocon +transform { + Sql { + sql = "SELECT id, UPPER(name) as name FROM source" + } +} +``` + +**Q: 最大吞吐量是多少?** + +A: 典型吞吐量为每个执行器每秒 100K - 1M 条记录。取决于: +- 硬件(CPU、RAM、网络) +- 数据库配置 +- 每条记录的数据大小 +- 网络延迟 + +**Q: 如何在生产环境中处理错误?** + +A: 配置重启策略: +```hocon +env { + restart_strategy = "exponential_delay" + restart_strategy.exponential_delay.initial_delay = 1000 + 
restart_strategy.exponential_delay.max_delay = 30000 + restart_strategy.exponential_delay.multiplier = 2.0 +} +``` + +**Q: 是否有用于任务管理的 Web UI?** + +A: 是的!使用 SeaTunnel Web 项目: +```bash +git clone https://github.com/apache/seatunnel-web.git +cd seatunnel-web +mvn clean install +java -jar target/seatunnel-web-*.jar +# 访问 http://localhost:8080 +``` + +**Q: 如何在 Claude Code 中使用 SeaTunnel Skill?** + +A: 复制到 `~/.claude/skills/` 后,使用: +```bash +/seatunnel-skill "关于 SeaTunnel 的问题" +``` + +**Q: 应该使用哪个引擎:Spark、Flink 还是 Zeta?** + +A: +- **Zeta**: 轻量级,无外部依赖,单机 +- **Spark**: 分布式集群的批处理和批流混合 +- **Flink**: 分布式集群的高级流处理和 CDC + +--- + +## 🔗 资源与链接 + +### 官方文档 +- [SeaTunnel 官网](https://seatunnel.apache.org/) +- [GitHub 仓库](https://github.com/apache/seatunnel) +- [连接器列表](https://seatunnel.apache.org/docs/2.3.12/connector-v2/overview) +- [HOCON 配置指南](https://github.com/lightbend/config/blob/main/HOCON.md) + +### 社区与支持 +- [Slack 频道](https://the-asf.slack.com/archives/C01CB5186TL) +- [邮件列表](https://seatunnel.apache.org/community/mail-lists/) +- [GitHub Issues](https://github.com/apache/seatunnel/issues) +- [讨论论坛](https://github.com/apache/seatunnel/discussions) + +### 相关项目 +- [SeaTunnel Web UI](https://github.com/apache/seatunnel-web) +- [SeaTunnel 工具集](https://github.com/apache/seatunnel-tools) +- [Apache Kafka](https://kafka.apache.org/) +- [Apache Flink](https://flink.apache.org/) +- [Apache Spark](https://spark.apache.org/) + +--- + +## 📄 单个工具说明 + +### 1. SeaTunnel Skill(新功能) +- **用途**: Claude Code 中 SeaTunnel 的 AI 助手 +- **位置**: [seatunnel-skill/](seatunnel-skill/) +- **快速设置**: `cp -r seatunnel-skill ~/.claude/skills/` +- **使用方法**: `/seatunnel-skill "你的问题"` + +### 2. SeaTunnel MCP 服务 +- **用途**: LLM 系统的模型上下文协议集成 +- **位置**: [seatunnel-mcp/](seatunnel-mcp/) +- **英文**: [README.md](seatunnel-mcp/README.md) +- **中文**: [README_CN.md](seatunnel-mcp/README_CN.md) +- **快速开始**: [QUICK_START.md](seatunnel-mcp/docs/QUICK_START.md) + +### 3. x2seatunnel +- **用途**: 将 DataX 等配置转换为 SeaTunnel 格式 +- **位置**: [x2seatunnel/](x2seatunnel/) +- **英文**: [README.md](x2seatunnel/README.md) +- **中文**: [README_zh.md](x2seatunnel/README_zh.md) + +--- + +## 🤝 贡献 + +欢迎提交 Issues 和 Pull Requests! + +对于主要的 SeaTunnel 引擎,请参阅 [Apache SeaTunnel](https://github.com/apache/seatunnel)。 + +对于这些工具,请贡献到 [SeaTunnel 工具集](https://github.com/apache/seatunnel-tools)。 + +--- + +**最后更新**: 2026-01-28 | **许可证**: Apache 2.0 \ No newline at end of file diff --git a/SKILL_SETUP_GUIDE.md b/SKILL_SETUP_GUIDE.md new file mode 100644 index 0000000..c9922d3 --- /dev/null +++ b/SKILL_SETUP_GUIDE.md @@ -0,0 +1,673 @@ +# SeaTunnel Skill Setup Guide + +**English** | [中文](#中文版本) + +## Getting Started with SeaTunnel Skill in Claude Code + +SeaTunnel Skill is an AI-powered assistant for Apache SeaTunnel integrated directly into Claude Code. It helps you with configuration, troubleshooting, learning, and best practices. + +--- + +## Installation + +### Step 1: Clone the Repository + +```bash +git clone https://github.com/apache/seatunnel-tools.git +cd seatunnel-tools +``` + +### Step 2: Locate Skills Directory + +Claude Code stores skills in your home directory. 
Create the skills directory if it doesn't exist: + +```bash +# Create ~/.claude/skills directory if it doesn't exist +mkdir -p ~/.claude/skills +``` + +**Directory Locations by OS:** +- **macOS/Linux**: `~/.claude/skills/` +- **Windows**: `%USERPROFILE%\.claude\skills\` + +### Step 3: Copy the Skill + +```bash +# Copy seatunnel-skill to Claude Code skills directory +cp -r seatunnel-skill ~/.claude/skills/ + +# Verify installation +ls ~/.claude/skills/seatunnel-skill/ +``` + +You should see: +``` +SKILL.md # Skill definition and metadata +README.md # Skill documentation +``` + +### Step 4: Verify Installation + +**Option A: Using Claude Code Terminal** + +```bash +# In Claude Code terminal, run: +ls ~/.claude/skills/seatunnel-skill/ + +# You should see the skill files listed +``` + +**Option B: Check Skill Loading** + +In Claude Code, you might see a skill reload notification. If not: +1. Restart Claude Code +2. Or reload the skills manually through the skill menu + +### Step 5: Test the Skill + +Open Claude Code and try: + +```bash +/seatunnel-skill "What is SeaTunnel?" +``` + +You should get an AI-powered response about SeaTunnel. + +--- + +## Usage Examples + +### Getting Help with Configuration + +**Question:** How do I configure a MySQL to PostgreSQL job? + +```bash +/seatunnel-skill "Create a job configuration to sync data from MySQL to PostgreSQL with batch mode" +``` + +**Response:** The skill will provide a complete HOCON configuration example with explanations. + +### Learning SeaTunnel Concepts + +**Question:** Explain CDC mode + +```bash +/seatunnel-skill "Explain Change Data Capture (CDC) in SeaTunnel. When should I use it?" +``` + +**Response:** Comprehensive explanation of CDC, use cases, and configuration examples. + +### Troubleshooting + +**Question:** My job is failing + +```bash +/seatunnel-skill "I'm getting 'OutOfMemoryError: Java heap space' in my batch job. How do I fix it?" +``` + +**Response:** Detailed diagnosis and solutions, including: +- Root cause explanation +- Configuration fixes +- Environment variable adjustments +- Performance tuning tips + +### Connector Information + +**Question:** Available Kafka options + +```bash +/seatunnel-skill "What are all the configuration options for Kafka source connector?" +``` + +**Response:** Complete list of options with descriptions and examples. + +### Performance Optimization + +**Question:** How to optimize streaming + +```bash +/seatunnel-skill "How do I optimize a Kafka to Elasticsearch streaming job for maximum throughput?" +``` + +**Response:** Performance tuning recommendations for parallelism, batch sizes, and resource allocation. + +--- + +## Common Questions + +### Q: Why doesn't the skill show up? + +**A:** Make sure you: +1. Copied the folder to `~/.claude/skills/` (not a subdirectory) +2. Restarted Claude Code or reloaded skills +3. The folder is named exactly `seatunnel-skill` + +**Fix:** +```bash +# Verify the path +ls -la ~/.claude/skills/seatunnel-skill/SKILL.md + +# If it doesn't exist, copy it again +cp -r seatunnel-skill ~/.claude/skills/ +``` + +### Q: How do I update the skill? + +**A:** +```bash +# Navigate to seatunnel-tools directory +cd /path/to/seatunnel-tools + +# Pull latest changes +git pull origin main + +# Update the skill +rm -rf ~/.claude/skills/seatunnel-skill +cp -r seatunnel-skill ~/.claude/skills/ + +# Restart Claude Code +``` + +### Q: Can I customize the skill? + +**A:** Yes! 
Edit `seatunnel-skill/SKILL.md`: + +```bash +# Open the skill definition +nano ~/.claude/skills/seatunnel-skill/SKILL.md + +# Make your changes +# The skill will use your customizations +``` + +### Q: Where are skill responses saved? + +**A:** Skill responses are part of your Claude Code conversation history. They are saved in: +- Local Claude Code workspace +- Optionally synced to Claude.ai if configured + +--- + +## Advanced Usage + +### Chaining Questions + +You can build on previous questions in the same conversation: + +```bash +/seatunnel-skill "What is batch mode?" + +# In next message, reference previous context: +/seatunnel-skill "Show me a complete example combining batch mode with MySQL source" + +# The skill understands the context from previous messages +``` + +### Getting Code Examples + +The skill can generate complete, production-ready configurations: + +```bash +/seatunnel-skill "Generate a complete SeaTunnel job configuration that: +1. Reads from MySQL database 'sales_db' table 'orders' +2. Filters orders from last 30 days +3. Writes to PostgreSQL 'analytics_db' table 'orders_processed' +4. Uses batch mode with 4 parallelism" +``` + +### Integration with Your Workflow + +**Development Pipeline:** +```bash +# 1. Understand requirements +/seatunnel-skill "Explain how to set up CDC from MySQL" + +# 2. Design solution +/seatunnel-skill "Design a real-time data pipeline from MySQL CDC to Kafka" + +# 3. Generate configuration +/seatunnel-skill "Generate the complete HOCON configuration for the pipeline" + +# 4. Debug issues +/seatunnel-skill "My job is timing out. Debug this configuration: [paste config]" + +# 5. Optimize performance +/seatunnel-skill "How can I optimize this job for better throughput?" +``` + +--- + +## Troubleshooting + +### Issue: Skill not found error + +``` +Error: Unknown skill: seatunnel-skill +``` + +**Solution:** +```bash +# 1. Verify skill exists +ls ~/.claude/skills/seatunnel-skill/ + +# 2. Check file permissions +chmod +r ~/.claude/skills/seatunnel-skill/* + +# 3. Restart Claude Code and try again +``` + +### Issue: Outdated responses + +**Solution:** +```bash +# Update skill to latest version +cd seatunnel-tools +git pull origin main +rm -rf ~/.claude/skills/seatunnel-skill +cp -r seatunnel-skill ~/.claude/skills/ +``` + +### Issue: Responses are too generic + +**Try:** +```bash +# Be more specific in your question: +# Instead of: +/seatunnel-skill "How to configure MySQL?" + +# Try: +/seatunnel-skill "Configure MySQL source for a batch job that reads table 'users' with filters" +``` + +--- + +## Tips for Best Results + +1. **Be Specific**: More details in your question = better responses +2. **Include Context**: Mention your use case (batch/streaming, source/sink types) +3. **Show Configuration**: Paste your HOCON config for debugging +4. **Reference Versions**: Specify SeaTunnel version (e.g., 2.3.12) +5. 
**Ask Follow-ups**: The skill remembers conversation context + +--- + +## Keyboard Shortcuts + +- **Cmd+K** (macOS) / **Ctrl+K** (Windows/Linux): Quick open skill +- **Type** `/seatunnel-skill`: Invoke skill +- **Tab**: Auto-complete skill parameters +- **Esc**: Cancel skill input + +--- + +## File Locations + +``` +seatunnel-tools/ +├── seatunnel-skill/ # AI Skill +│ ├── SKILL.md # Skill definition +│ └── README.md # Documentation +├── README.md # Main documentation +├── README_CN.md # Chinese documentation +└── SKILL_SETUP_GUIDE.md # This file +``` + +--- + +## Getting Help + +- **Skill Issues**: Try `/seatunnel-skill "How do I troubleshoot..."` +- **SeaTunnel Questions**: Ask the skill directly +- **Installation Help**: See [README.md](README.md) or [README_CN.md](README_CN.md) +- **Report Issues**: [GitHub Issues](https://github.com/apache/seatunnel-tools/issues) + +--- + +## Next Steps + +1. ✅ Install skill (`cp -r seatunnel-skill ~/.claude/skills/`) +2. ✅ Test skill (`/seatunnel-skill "What is SeaTunnel?"`) +3. 📚 Explore examples in this guide +4. 🚀 Use skill for your SeaTunnel projects +5. 📝 Share feedback and improvements + +--- + +--- + +# 中文版本 + +# SeaTunnel Skill 安装使用指南 + +## 开始使用 SeaTunnel Skill + +SeaTunnel Skill 是一个集成到 Claude Code 中的 AI 助手,帮助您进行 Apache SeaTunnel 的配置、故障排查、学习和最佳实践。 + +--- + +## 安装步骤 + +### 第一步:克隆仓库 + +```bash +git clone https://github.com/apache/seatunnel-tools.git +cd seatunnel-tools +``` + +### 第二步:定位技能目录 + +Claude Code 在您的主目录中存储技能。如果目录不存在,请创建: + +```bash +# 如果目录不存在,则创建 ~/.claude/skills 目录 +mkdir -p ~/.claude/skills +``` + +**不同操作系统的目录位置:** +- **macOS/Linux**: `~/.claude/skills/` +- **Windows**: `%USERPROFILE%\.claude\skills\` + +### 第三步:复制技能文件 + +```bash +# 复制 seatunnel-skill 到 Claude Code 技能目录 +cp -r seatunnel-skill ~/.claude/skills/ + +# 验证安装 +ls ~/.claude/skills/seatunnel-skill/ +``` + +您应该看到: +``` +SKILL.md # 技能定义和元数据 +README.md # 技能文档 +``` + +### 第四步:验证安装 + +**选项 A:使用 Claude Code 终端** + +```bash +# 在 Claude Code 终端中运行: +ls ~/.claude/skills/seatunnel-skill/ + +# 您应该看到技能文件列出 +``` + +**选项 B:检查技能加载** + +在 Claude Code 中,您可能会看到技能重新加载通知。如果没有: +1. 重启 Claude Code +2. 或通过技能菜单手动重新加载 + +### 第五步:测试技能 + +打开 Claude Code 并尝试: + +```bash +/seatunnel-skill "什么是 SeaTunnel?" +``` + +您应该获得关于 SeaTunnel 的 AI 驱动响应。 + +--- + +## 使用示例 + +### 获取配置帮助 + +**问题:** 如何配置从 MySQL 到 PostgreSQL 的任务? + +```bash +/seatunnel-skill "创建一个任务配置,以批处理模式将数据从 MySQL 同步到 PostgreSQL" +``` + +**响应:** 技能将提供完整的 HOCON 配置示例和说明。 + +### 学习 SeaTunnel 概念 + +**问题:** 解释 CDC 模式 + +```bash +/seatunnel-skill "在 SeaTunnel 中解释变更数据捕获 (CDC)。何时应该使用它?" +``` + +**响应:** 关于 CDC 的全面解释、用例和配置示例。 + +### 故障排查 + +**问题:** 我的任务失败了 + +```bash +/seatunnel-skill "我的批处理任务出现 'OutOfMemoryError: Java heap space' 错误。我应该如何修复?" +``` + +**响应:** 详细的诊断和解决方案,包括: +- 根本原因说明 +- 配置修复 +- 环境变量调整 +- 性能调优建议 + +### 连接器信息 + +**问题:** 可用的 Kafka 选项 + +```bash +/seatunnel-skill "Kafka 源连接器的所有配置选项是什么?" +``` + +**响应:** 完整的选项列表,带有描述和示例。 + +### 性能优化 + +**问题:** 如何优化流处理 + +```bash +/seatunnel-skill "如何优化从 Kafka 到 Elasticsearch 的流处理任务以获得最大吞吐量?" +``` + +**响应:** 并行度、批大小和资源分配的性能调优建议。 + +--- + +## 常见问题 + +### Q: 为什么技能不显示? + +**A:** 请确保您: +1. 将文件夹复制到 `~/.claude/skills/`(不是子目录) +2. 重启了 Claude Code 或重新加载了技能 +3. 文件夹名称完全是 `seatunnel-skill` + +**修复:** +```bash +# 验证路径 +ls -la ~/.claude/skills/seatunnel-skill/SKILL.md + +# 如果不存在,再次复制 +cp -r seatunnel-skill ~/.claude/skills/ +``` + +### Q: 如何更新技能? 
+ +**A:** +```bash +# 导航到 seatunnel-tools 目录 +cd /path/to/seatunnel-tools + +# 拉取最新更改 +git pull origin main + +# 更新技能 +rm -rf ~/.claude/skills/seatunnel-skill +cp -r seatunnel-skill ~/.claude/skills/ + +# 重启 Claude Code +``` + +### Q: 我可以自定义技能吗? + +**A:** 可以!编辑 `seatunnel-skill/SKILL.md`: + +```bash +# 打开技能定义 +nano ~/.claude/skills/seatunnel-skill/SKILL.md + +# 进行更改 +# 技能将使用您的自定义设置 +``` + +### Q: 技能响应保存在哪里? + +**A:** 技能响应是您的 Claude Code 对话历史的一部分。它们保存在: +- 本地 Claude Code 工作区 +- 如果配置,可选地同步到 Claude.ai + +--- + +## 高级用法 + +### 链接问题 + +您可以在同一对话中基于之前的问题进行构建: + +```bash +/seatunnel-skill "什么是批处理模式?" + +# 在下一条消息中,参考之前的上下文: +/seatunnel-skill "展示一个结合批处理模式和 MySQL 源的完整示例" + +# 技能理解来自之前消息的上下文 +``` + +### 获取代码示例 + +技能可以生成完整的、生产就绪的配置: + +```bash +/seatunnel-skill "生成一个完整的 SeaTunnel 任务配置,该配置: +1. 从 MySQL 数据库 'sales_db' 表 'orders' 读取 +2. 过滤最近 30 天的订单 +3. 写入 PostgreSQL 'analytics_db' 表 'orders_processed' +4. 使用 4 个并行度的批处理模式" +``` + +### 与您的工作流集成 + +**开发流程:** +```bash +# 1. 了解需求 +/seatunnel-skill "解释如何从 MySQL 设置 CDC" + +# 2. 设计解决方案 +/seatunnel-skill "设计从 MySQL CDC 到 Kafka 的实时数据管道" + +# 3. 生成配置 +/seatunnel-skill "为管道生成完整的 HOCON 配置" + +# 4. 调试问题 +/seatunnel-skill "我的任务超时。调试此配置:[粘贴配置]" + +# 5. 优化性能 +/seatunnel-skill "我应该如何优化此任务以获得更好的吞吐量?" +``` + +--- + +## 故障排查 + +### 问题:技能未找到错误 + +``` +Error: Unknown skill: seatunnel-skill +``` + +**解决方案:** +```bash +# 1. 验证技能存在 +ls ~/.claude/skills/seatunnel-skill/ + +# 2. 检查文件权限 +chmod +r ~/.claude/skills/seatunnel-skill/* + +# 3. 重启 Claude Code 并重试 +``` + +### 问题:响应过时 + +**解决方案:** +```bash +# 更新技能到最新版本 +cd seatunnel-tools +git pull origin main +rm -rf ~/.claude/skills/seatunnel-skill +cp -r seatunnel-skill ~/.claude/skills/ +``` + +### 问题:响应过于笼统 + +**尝试:** +```bash +# 在问题中更具体: +# 不是: +/seatunnel-skill "如何配置 MySQL?" + +# 而是: +/seatunnel-skill "配置 MySQL 源进行批处理任务,读取 'users' 表并应用过滤器" +``` + +--- + +## 获得最佳结果的提示 + +1. **具体明确**: 问题中的细节越多 = 响应越好 +2. **包含上下文**: 提及您的用例(批/流、源/宿类型) +3. **显示配置**: 粘贴您的 HOCON 配置以进行调试 +4. **参考版本**: 指定 SeaTunnel 版本(例如 2.3.12) +5. **提出后续问题**: 技能会记住对话上下文 + +--- + +## 键盘快捷键 + +- **Cmd+K** (macOS) / **Ctrl+K** (Windows/Linux): 快速打开技能 +- **输入** `/seatunnel-skill`: 调用技能 +- **Tab**: 自动完成技能参数 +- **Esc**: 取消技能输入 + +--- + +## 文件位置 + +``` +seatunnel-tools/ +├── seatunnel-skill/ # AI 技能 +│ ├── SKILL.md # 技能定义 +│ └── README.md # 文档 +├── README.md # 主文档 +├── README_CN.md # 中文文档 +└── SKILL_SETUP_GUIDE.md # 此文件 +``` + +--- + +## 获取帮助 + +- **技能问题**: 尝试 `/seatunnel-skill "我应该如何故障排查..."` +- **SeaTunnel 问题**: 直接向技能提问 +- **安装帮助**: 查看 [README.md](README.md) 或 [README_CN.md](README_CN.md) +- **报告问题**: [GitHub Issues](https://github.com/apache/seatunnel-tools/issues) + +--- + +## 后续步骤 + +1. ✅ 安装技能 (`cp -r seatunnel-skill ~/.claude/skills/`) +2. ✅ 测试技能 (`/seatunnel-skill "什么是 SeaTunnel?"`) +3. 📚 探索本指南中的示例 +4. 🚀 将技能用于您的 SeaTunnel 项目 +5. 
📝 分享反馈和改进 + +--- + +**最后更新**: 2026-01-28 | **许可证**: Apache 2.0 \ No newline at end of file diff --git a/seatunnel-skill/SKILL.md b/seatunnel-skill/SKILL.md new file mode 100644 index 0000000..b0ceb2d --- /dev/null +++ b/seatunnel-skill/SKILL.md @@ -0,0 +1,1212 @@ +--- +name: seatunnel-skill +description: Apache SeaTunnel - A multimodal, high-performance, distributed data integration tool for massive data synchronization across 100+ connectors +author: auto-generated by repo2skill +platform: github +source: https://github.com/apache/seatunnel +tags: [data-integration, data-pipeline, etl, elt, real-time-streaming, batch-processing, cdc, distributed-computing, apache, java] +version: 2.3.13 +generated: 2026-01-28 +license: Apache 2.0 +repository: apache/seatunnel +--- + +# Apache SeaTunnel OpenCode Skill + +Apache SeaTunnel is a **multimodal, high-performance, distributed data integration tool** capable of synchronizing vast amounts of data daily. It connects hundreds of evolving data sources with support for real-time, CDC (Change Data Capture), and full database synchronization. + +## Quick Start + +### Prerequisites + +- **Java**: JDK 8 or higher +- **Maven**: 3.6.0 or higher (for building from source) +- **Python**: 3.7+ (optional, for Python API) +- **Git**: For cloning the repository + +### Installation + +#### 1. Clone Repository +```bash +git clone https://github.com/apache/seatunnel.git +cd seatunnel +``` + +#### 2. Build from Source +```bash +# Build entire project +mvn clean install -DskipTests + +# Build specific module +mvn clean install -pl seatunnel-core -DskipTests +``` + +#### 3. Download Pre-built Binary (Recommended) +Visit the [official download page](https://seatunnel.apache.org/download) and select your version: + +```bash +# Example: Download SeaTunnel 2.3.12 +wget https://archive.apache.org/dist/seatunnel/2.3.12/apache-seatunnel-2.3.12-bin.tar.gz +tar -xzf apache-seatunnel-2.3.12-bin.tar.gz +cd apache-seatunnel-2.3.12 +``` + +#### 4. Basic Configuration +```bash +# Set JAVA_HOME +export JAVA_HOME=/path/to/java + +# Add to PATH +export PATH=$PATH:/path/to/seatunnel/bin +``` + +#### 5. Verify Installation +```bash +seatunnel --version +``` + +### Hello World Example + +Create a simple data integration job: + +**config/hello_world.conf** +```hocon +env { + job.mode = "BATCH" + job.name = "Hello World" +} + +source { + FakeSource { + row.num = 100 + schema = { + fields { + id = "bigint" + name = "string" + age = "int" + } + } + } +} + +sink { + Console { + format = "json" + } +} +``` + +Run the job: +```bash +seatunnel.sh -c config/hello_world.conf -e spark +# or with flink +seatunnel.sh -c config/hello_world.conf -e flink +``` + +--- + +## Overview + +### Purpose and Target Users + +**SeaTunnel** is designed for: +- **Data Engineers**: Building large-scale data pipelines with minimal complexity +- **DevOps Teams**: Managing data integration infrastructure +- **Enterprise Platforms**: Handling 100+ billion data records daily +- **Real-time Analytics**: Supporting streaming data synchronization +- **Legacy System Migration**: CDC-based incremental sync from transactional databases + +### Core Capabilities + +1. **Multimodal Support** + - Structured data (databases, data warehouses) + - Unstructured data (video, images, binaries) + - Semi-structured data (JSON, logs, binlog streams) + +2. 
**Multiple Synchronization Methods** + - **Batch**: Full historical data transfer + - **Streaming**: Real-time data pipeline + - **CDC**: Incremental capture from databases + - **Full + Incremental**: Combined approach + +3. **100+ Pre-built Connectors** + - Databases: MySQL, PostgreSQL, Oracle, SQL Server, MongoDB + - Data Warehouses: Snowflake, BigQuery, Redshift, Iceberg + - Cloud SaaS: Salesforce, Shopify, Google Sheets + - Message Queues: Kafka, RabbitMQ, Pulsar + - Search Engines: Elasticsearch, OpenSearch + - Object Storage: S3, GCS, HDFS + +4. **Multi-Engine Support** + - **Zeta Engine**: Lightweight, standalone deployment (no Spark/Flink required) + - **Apache Flink**: Distributed streaming engine + - **Apache Spark**: Distributed batch/batch-stream processing + +--- + +## Features + +### High-Performance + +- **Distributed Snapshot Algorithm**: Ensures data consistency without locks +- **JDBC Multiplexing**: Minimizes database connections for real-time sync +- **Log Parsing**: Efficient CDC implementation with binary log analysis +- **Resource Optimization**: Reduces computing resources and I/O overhead + +### Data Quality & Reliability + +- **Real-time Monitoring**: Track synchronization progress and data metrics +- **Data Loss Prevention**: Transactional guarantees (exactly-once semantics) +- **Deduplication**: Prevents duplicate records during reprocessing +- **Error Handling**: Graceful failure recovery and retry logic + +### Developer-Friendly + +- **SQL-like Configuration**: Intuitive job definition syntax +- **Visual Web UI**: Drag-and-drop job builder (SeaTunnel Web Project) +- **Extensive Documentation**: Comprehensive guides and examples +- **Community Support**: Active community via Slack and mailing lists + +### Production Ready + +- **Proven at Scale**: Used in enterprises processing billions of records daily +- **Version Stability**: Regular releases with backward compatibility +- **Enterprise Features**: Multi-tenancy, RBAC, audit logging +- **Cloud Native**: Kubernetes-ready deployment + +--- + +## Installation + +### Installation Methods + +#### Method 1: Binary Download (Recommended for Quick Start) + +```bash +# 1. Download binary +VERSION=2.3.12 +wget https://archive.apache.org/dist/seatunnel/${VERSION}/apache-seatunnel-${VERSION}-bin.tar.gz + +# 2. Extract +tar -xzf apache-seatunnel-${VERSION}-bin.tar.gz +cd apache-seatunnel-${VERSION} + +# 3. Verify +./bin/seatunnel.sh --version +``` + +#### Method 2: Build from Source + +```bash +# 1. Clone repository +git clone https://github.com/apache/seatunnel.git +cd seatunnel + +# 2. Build with Maven +mvn clean install -DskipTests + +# 3. Navigate to distribution +cd seatunnel-dist/target/apache-seatunnel-*-bin/apache-seatunnel-* + +# 4. 
Verify +./bin/seatunnel.sh --version +``` + +#### Method 3: Docker + +```bash +# Pull official Docker image +docker pull apache/seatunnel:latest + +# Run container +docker run -it apache/seatunnel:latest /bin/bash + +# Or run a job directly +docker run -v /path/to/config:/config \ + apache/seatunnel:latest \ + seatunnel.sh -c /config/job.conf -e spark +``` + +### Environment Setup + +#### Set Java Home +```bash +# Bash/Zsh +export JAVA_HOME=/path/to/java +export PATH=$JAVA_HOME/bin:$PATH + +# Verify +java -version +``` + +#### Configure for Spark Engine +```bash +# Set Spark Home (if using Spark engine) +export SPARK_HOME=/path/to/spark + +# Verify Spark installation +$SPARK_HOME/bin/spark-submit --version +``` + +#### Configure for Flink Engine +```bash +# Set Flink Home (if using Flink engine) +export FLINK_HOME=/path/to/flink + +# Verify Flink installation +$FLINK_HOME/bin/flink --version +``` + +### System Requirements + +| Requirement | Version/Spec | +|---|---| +| Java | JDK 1.8+ | +| Memory | 2GB+ (minimum), 8GB+ (recommended) | +| Disk | 500MB (binary) + job storage | +| Network | Connectivity to source/sink systems | +| Scala | 2.12.15 (for Spark/Flink integration) | + +--- + +## Usage + +### 1. Job Configuration (HOCON Format) + +SeaTunnel uses **HOCON** (Human-Optimized Config Object Notation) for job configuration. + +**Basic Structure:** +```hocon +env { + job.mode = "BATCH" # or STREAMING + job.name = "My Job" + parallelism = 4 +} + +source { + SourceConnector { + option1 = value1 + option2 = value2 + schema = { + fields { + column1 = "type" + column2 = "type" + } + } + } +} + +# Optional: Transform data +transform { + TransformName { + option = value + } +} + +sink { + SinkConnector { + option1 = value1 + option2 = value2 + } +} +``` + +### 2. 
Common Use Cases + +#### Use Case 1: MySQL to PostgreSQL (Batch) + +**config/mysql_to_postgres.conf** +```hocon +env { + job.mode = "BATCH" + job.name = "MySQL to PostgreSQL" +} + +source { + Jdbc { + driver = "com.mysql.cj.jdbc.Driver" + url = "jdbc:mysql://mysql-host:3306/mydb" + user = "root" + password = "password" + query = "SELECT * FROM users" + connection_check_timeout_sec = 100 + } +} + +sink { + Jdbc { + driver = "org.postgresql.Driver" + url = "jdbc:postgresql://pg-host:5432/mydb" + user = "postgres" + password = "password" + database = "mydb" + table = "users" + primary_keys = ["id"] + connection_check_timeout_sec = 100 + } +} +``` + +Run: +```bash +seatunnel.sh -c config/mysql_to_postgres.conf -e spark +``` + +#### Use Case 2: Kafka Streaming to Elasticsearch + +**config/kafka_to_es.conf** +```hocon +env { + job.mode = "STREAMING" + job.name = "Kafka to Elasticsearch" + parallelism = 2 +} + +source { + Kafka { + bootstrap.servers = "kafka-host:9092" + topic = "events" + patterns = "event.*" + consumer.group = "seatunnel-group" + format = "json" + schema = { + fields { + event_id = "bigint" + event_name = "string" + timestamp = "bigint" + payload = "string" + } + } + } +} + +transform { + Sql { + sql = "SELECT event_id, event_name, FROM_UNIXTIME(timestamp/1000) as ts, payload FROM source" + } +} + +sink { + Elasticsearch { + hosts = ["es-host:9200"] + index = "events" + index_type = "_doc" + username = "elastic" + password = "password" + } +} +``` + +Run: +```bash +seatunnel.sh -c config/kafka_to_es.conf -e flink +``` + +#### Use Case 3: CDC from MySQL to Kafka + +**config/mysql_cdc_kafka.conf** +```hocon +env { + job.mode = "STREAMING" + job.name = "MySQL CDC to Kafka" +} + +source { + Mysql { + server_id = 5400 + hostname = "mysql-host" + port = 3306 + username = "root" + password = "password" + database = ["mydb"] + table = ["users", "orders"] + startup.mode = "initial" + snapshot.split.size = 8096 + incremental.snapshot.chunk.size = 1024 + snapshot_fetch_size = 1024 + snapshot_lock_timeout_sec = 10 + server_time_zone = "UTC" + } +} + +sink { + Kafka { + bootstrap.servers = "kafka-host:9092" + topic = "mysql_cdc" + format = "canal_json" + semantic = "EXACTLY_ONCE" + } +} +``` + +Run: +```bash +seatunnel.sh -c config/mysql_cdc_kafka.conf -e flink +``` + +### 3. Running Jobs + +#### Local Mode (Single Machine) +```bash +# Using Spark (default) +seatunnel.sh -c config/job.conf -e spark + +# Using Flink +seatunnel.sh -c config/job.conf -e flink + +# Using Zeta (lightweight) +seatunnel.sh -c config/job.conf -e zeta +``` + +#### Cluster Mode +```bash +# Submit to Spark cluster +seatunnel.sh -c config/job.conf -e spark -m cluster -n hadoop-master:7077 + +# Submit to Flink cluster +seatunnel.sh -c config/job.conf -e flink -m remote -s localhost 8081 +``` + +#### Verbose Output +```bash +seatunnel.sh -c config/job.conf -e spark -l DEBUG +``` + +#### Check Status +```bash +# View running jobs (Spark Cluster) +spark-submit --status + +# View running jobs (Flink Cluster) +$FLINK_HOME/bin/flink list +``` + +### 4. SQL API (Advanced) + +Use SQL for complex transformations: + +```hocon +source { + MySQL { + # Source config... + } +} + +transform { + Sql { + # Multiple SQL statements + sql = """ + SELECT + id, + UPPER(name) as name, + age + 10 as age_plus_10, + CURRENT_TIMESTAMP as created_at + FROM source + WHERE age > 18 + """ + } +} + +sink { + PostgreSQL { + # Sink config... 
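+    # url, user, password, database, table: the same options as the
+    # Jdbc sink in Use Case 1 above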
+ } +} +``` + +--- + +## API Reference + +### Core Connector Types + +#### Source Connectors +- **Jdbc**: Generic JDBC databases (MySQL, PostgreSQL, Oracle, SQL Server) +- **Kafka**: Apache Kafka topics +- **Mysql**: MySQL with CDC support +- **MongoDB**: MongoDB collections +- **PostgreSQL**: PostgreSQL with CDC +- **S3**: Amazon S3 and compatible storage +- **Http**: HTTP/HTTPS endpoints +- **FakeSource**: For testing and development + +#### Transform Connectors +- **Sql**: Execute SQL transformations +- **Dummy**: Pass-through (testing) +- **FieldMapper**: Rename/map columns +- **JsonPath**: Extract data from JSON + +#### Sink Connectors +- **Jdbc**: Write to JDBC-compatible databases +- **Kafka**: Publish to Kafka topics +- **Elasticsearch**: Write to Elasticsearch indices +- **S3**: Write to S3 buckets +- **Redis**: Write to Redis +- **HBase**: Write to HBase tables +- **StarRocks**: Write to StarRocks tables +- **Console**: Output to console (testing) + +### Configuration Options + +#### Common Source Options +```hocon +source { + ConnectorName { + # Connection + hostname = "host" + port = 3306 + username = "user" + password = "pass" + + # Data selection + database = "db_name" + table = "table_name" + + # Performance + fetch_size = 1000 + connection_check_timeout_sec = 100 + split_size = 10000 + + # Schema + schema = { + fields { + id = "bigint" + name = "string" + } + } + } +} +``` + +#### Common Sink Options +```hocon +sink { + ConnectorName { + # Connection + hostname = "host" + port = 3306 + username = "user" + password = "pass" + + # Write behavior + database = "db_name" + table = "table_name" + primary_keys = ["id"] + batch_size = 500 + + # Error handling + max_retries = 3 + retry_wait_time_ms = 1000 + on_duplicate_key_update_column_names = ["field1"] + } +} +``` + +#### Engine Options +```hocon +env { + # Execution mode + job.mode = "BATCH" # or STREAMING + + # Job identity + job.name = "My Job" + + # Parallelism + parallelism = 4 + + # Checkpoint (streaming) + checkpoint.interval = 60000 + + # Restart strategy + restart_strategy = "fixed_delay" + restart_strategy.fixed_delay.attempts = 3 + restart_strategy.fixed_delay.delay = 10000 +} +``` + +### Debugging and Monitoring + +#### View Job Metrics +```bash +# During execution, monitor logs +tail -f logs/seatunnel.log + +# Check specific log level +grep ERROR logs/seatunnel.log +``` + +#### Enable Debug Logging +```bash +# Set log level to DEBUG +seatunnel.sh -c config/job.conf -e spark -l DEBUG +``` + +#### Use Test Sources +```hocon +source { + FakeSource { + row.num = 1000 + schema = { + fields { + id = "bigint" + name = "string" + } + } + } +} + +sink { + Console { + format = "json" + } +} +``` + +--- + +## Configuration + +### Project Structure +``` +seatunnel/ +├── bin/ # Executable scripts +│ ├── seatunnel.sh # Main entry point +│ └── seatunnel-submit.sh # Spark submission script +├── config/ # Configuration examples +│ ├── flink-conf.yaml # Flink configuration +│ └── spark-conf.yaml # Spark configuration +├── connectors/ # Pre-built connectors +├── lib/ # JAR dependencies +├── logs/ # Runtime logs +└── plugin/ # Plugin directory +``` + +### Environment Variables + +```bash +# Java configuration +export JAVA_HOME=/path/to/java +export JVM_OPTS="-Xms1G -Xmx4G" + +# Spark configuration +export SPARK_HOME=/path/to/spark +export SPARK_MASTER=spark://master:7077 + +# Flink configuration +export FLINK_HOME=/path/to/flink + +# SeaTunnel configuration +export SEATUNNEL_HOME=/path/to/seatunnel +``` + +### Key Configuration Files + 
+#### `seatunnel-env.sh` +Configure runtime environment: +```bash +JAVA_HOME=/path/to/java +JVM_OPTS="-Xms1G -Xmx4G -XX:+UseG1GC" +SPARK_HOME=/path/to/spark +FLINK_HOME=/path/to/flink +``` + +#### `flink-conf.yaml` (for Flink engine) +```yaml +taskmanager.memory.process.size: 2g +taskmanager.memory.jvm-overhead.fraction: 0.1 +parallelism.default: 4 +``` + +#### `spark-conf.yaml` (for Spark engine) +```yaml +driver-memory: 2g +executor-memory: 4g +num-executors: 4 +``` + +### Performance Tuning + +#### For Batch Jobs +```hocon +env { + job.mode = "BATCH" + parallelism = 8 # Increase for larger clusters +} + +source { + Jdbc { + # Split large tables for parallel reads + split_size = 100000 + fetch_size = 5000 + } +} + +sink { + Jdbc { + # Batch inserts for better throughput + batch_size = 1000 + # Use connection pooling + max_retries = 3 + } +} +``` + +#### For Streaming Jobs +```hocon +env { + job.mode = "STREAMING" + parallelism = 4 + checkpoint.interval = 30000 # 30 seconds +} + +source { + Kafka { + # Consumer group for parallel reads + consumer.group = "seatunnel-consumer" + # Batch reading + max_poll_records = 500 + } +} +``` + +--- + +## Development + +### Project Architecture + +**SeaTunnel** follows a modular architecture: + +``` +seatunnel/ +├── seatunnel-api/ # Core APIs +├── seatunnel-core/ # Execution engine +├── seatunnel-engines/ # Engine implementations +│ ├── seatunnel-engine-flink/ +│ ├── seatunnel-engine-spark/ +│ └── seatunnel-engine-zeta/ +├── seatunnel-connectors/ # Connector implementations +│ ├── seatunnel-connectors-*/ # One per connector type +└── seatunnel-dist/ # Distribution package +``` + +### Building from Source + +#### Full Build +```bash +# Clone repository +git clone https://github.com/apache/seatunnel.git +cd seatunnel + +# Build all modules +mvn clean install -DskipTests + +# Build with tests (slower) +mvn clean install +``` + +#### Build Specific Module +```bash +# Build only Kafka connector +mvn clean install -pl seatunnel-connectors/seatunnel-connectors-seatunnel-kafka -DskipTests + +# Build only Flink engine +mvn clean install -pl seatunnel-engines/seatunnel-engine-flink -DskipTests +``` + +#### Build Distribution +```bash +# Create binary distribution +mvn clean install -DskipTests +cd seatunnel-dist +tar -tzf target/apache-seatunnel-*-bin.tar.gz | head +``` + +### Running Tests + +#### Unit Tests +```bash +# Run all tests +mvn test + +# Run specific test class +mvn test -Dtest=MySqlConnectorTest + +# Run with coverage +mvn test jacoco:report +``` + +#### Integration Tests +```bash +# Run integration tests +mvn verify + +# Run with Docker containers +mvn verify -Pintegration-tests +``` + +### Development Setup + +#### IDE Configuration (IntelliJ IDEA) + +1. **Import Project** + - File → Open → Select seatunnel directory + - Choose Maven as build system + +2. **Configure JDK** + - Project Settings → Project → JDK + - Select JDK 1.8 or higher + +3. **Enable Annotation Processing** + - Project Settings → Build → Compiler → Annotation Processors + - Enable annotation processing + +4. 
---

## Development

### Project Architecture

**SeaTunnel** follows a modular architecture:

```
seatunnel/
├── seatunnel-api/               # Core APIs
├── seatunnel-core/              # Execution engine
├── seatunnel-engines/           # Engine implementations
│   ├── seatunnel-engine-flink/
│   ├── seatunnel-engine-spark/
│   └── seatunnel-engine-zeta/
├── seatunnel-connectors/        # Connector implementations
│   └── seatunnel-connectors-*/  # One per connector type
└── seatunnel-dist/              # Distribution package
```

### Building from Source

#### Full Build
```bash
# Clone repository
git clone https://github.com/apache/seatunnel.git
cd seatunnel

# Build all modules
mvn clean install -DskipTests

# Build with tests (slower)
mvn clean install
```

#### Build Specific Module
```bash
# Build only the Kafka connector
mvn clean install -pl seatunnel-connectors/seatunnel-connectors-seatunnel-kafka -DskipTests

# Build only the Flink engine
mvn clean install -pl seatunnel-engines/seatunnel-engine-flink -DskipTests
```

#### Build Distribution
```bash
# Create the binary distribution
mvn clean install -DskipTests

# List the archive contents to verify the package
cd seatunnel-dist
tar -tzf target/apache-seatunnel-*-bin.tar.gz | head
```

### Running Tests

#### Unit Tests
```bash
# Run all tests
mvn test

# Run a specific test class
mvn test -Dtest=MySqlConnectorTest

# Run with coverage
mvn test jacoco:report
```

#### Integration Tests
```bash
# Run integration tests
mvn verify

# Run with Docker containers
mvn verify -Pintegration-tests
```

### Development Setup

#### IDE Configuration (IntelliJ IDEA)

1. **Import Project**
   - File → Open → Select the seatunnel directory
   - Choose Maven as the build system

2. **Configure JDK**
   - Project Settings → Project → JDK
   - Select JDK 1.8 or higher

3. **Enable Annotation Processing**
   - Project Settings → Build → Compiler → Annotation Processors
   - Enable annotation processing

4. **Run Configuration**
   - Run → Edit Configurations
   - Add a new "Application" configuration
   - Set the main class: `org.apache.seatunnel.core.starter.command.CommandExecuteRunner`

#### Code Style

SeaTunnel follows Apache project conventions:
- 4-space indentation (not tabs)
- Maximum line length of 120 characters
- Standard Java naming conventions
- Imports organized alphabetically

Use the provided `.editorconfig`:
```bash
# IntelliJ: install the EditorConfig plugin; the IDE then picks up the
# repository's .editorconfig and formats code to match automatically
```

### Creating Custom Connectors

#### 1. Implement the Source Interface
```java
// Simplified sketch of a custom source; see the seatunnel-api module for
// the full source interface contract in your SeaTunnel version.
import org.apache.seatunnel.api.source.SeSource;
import org.apache.seatunnel.api.table.catalog.Table;

public class CustomSource extends SeSource {
    @Override
    public String getPluginName() {
        return "Custom";   // the name job config files refer to
    }

    @Override
    public void validate() {
        // Validation logic
    }

    @Override
    public ResultSet read(Boundedness boundedness) {
        // Implementation
    }
}
```

#### 2. Create Configuration Class
```java
import org.apache.seatunnel.api.configuration.Option;
import org.apache.seatunnel.api.configuration.Options;

public class CustomSourceOptions {
    public static final Option<String> HOST =
            Options.key("host")
                    .stringType()
                    .noDefaultValue()
                    .withDescription("Source hostname");

    public static final Option<Integer> PORT =
            Options.key("port")
                    .intType()
                    .defaultValue(9200)
                    .withDescription("Source port");
}
```

#### 3. Register Connector
Register the implementation through Java's service-loader mechanism: add a service file named after the source interface and containing the fully qualified class name of your implementation:
```
META-INF/services/org.apache.seatunnel.api.source.SeSource
```
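Once registered, the string returned by `getPluginName()` is what job files reference. A usage sketch for the hypothetical `Custom` source defined above, with option keys mirroring `CustomSourceOptions`:

```hocon
# Job fragment using the hypothetical "Custom" source from the example above.
# The block name must match getPluginName().
source {
  Custom {
    host = "data-host.internal"   # CustomSourceOptions.HOST (required)
    port = 9200                   # CustomSourceOptions.PORT (default value)
  }
}
```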
### Contributing Guide

1. **Fork and Clone**
   ```bash
   git clone https://github.com/YOUR_USERNAME/seatunnel.git
   cd seatunnel
   git remote add upstream https://github.com/apache/seatunnel.git
   ```

2. **Create Feature Branch**
   ```bash
   git checkout -b feature/my-feature
   ```

3. **Make Changes**
   - Follow the code style guide
   - Add tests for new features
   - Update documentation

4. **Commit and Push**
   ```bash
   git add .
   git commit -m "feat: add new feature"
   git push origin feature/my-feature
   ```

5. **Create Pull Request**
   - Go to the GitHub repository
   - Create a PR with a clear description
   - Link any related issues
   - Wait for review

6. **Code Review Process**
   - Address feedback from maintainers
   - Update the PR with changes
   - After approval, maintainers will merge

---

## Troubleshooting

### Common Issues and Solutions

#### Issue 1: "ClassNotFoundException: com.mysql.jdbc.Driver"

**Cause**: JDBC driver JAR not on the classpath

**Solution**:
```bash
# 1. Download the MySQL JDBC driver
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.33.jar

# 2. Copy it to the lib directory
cp mysql-connector-java-8.0.33.jar $SEATUNNEL_HOME/lib/

# 3. Restart the job
seatunnel.sh -c config/job.conf -e spark
```

#### Issue 2: "OutOfMemoryError: Java heap space"

**Cause**: Insufficient JVM heap memory

**Solution**:
```bash
# Increase JVM memory
export JVM_OPTS="-Xms2G -Xmx8G"

# Or set it in seatunnel-env.sh
echo 'JVM_OPTS="-Xms2G -Xmx8G"' >> $SEATUNNEL_HOME/bin/seatunnel-env.sh
```

#### Issue 3: "Connection refused: connect"

**Cause**: Source/sink service not reachable

**Solution**:
```bash
# 1. Verify connectivity
ping source-host
telnet source-host 3306

# 2. Check credentials
mysql -h source-host -u root -p

# 3. Check firewall rules
# Ensure port 3306 is open
```

#### Issue 4: "Table not found" during CDC

**Cause**: Binlog not enabled on MySQL

**Solution**:
```sql
-- Check whether the binlog is enabled
SHOW VARIABLES LIKE 'log_bin';
```

If it is off, enable it in `my.cnf` and restart MySQL:
```ini
[mysqld]
log_bin       = mysql-bin
binlog_format = row   # row format is required for CDC
```

After the restart, verify:
```sql
SHOW MASTER STATUS;
```

#### Issue 5: Slow Job Performance

**Cause**: Suboptimal configuration

**Solutions**:
```hocon
# 1. Increase parallelism
env {
  parallelism = 8   # or higher, based on cluster size
}

# 2. Increase batch sizes
source {
  Jdbc {
    fetch_size = 5000
    split_size = 100000
  }
}

sink {
  Jdbc {
    batch_size = 2000
  }
}

# 3. Enable connection pooling
source {
  Jdbc {
    pool_size = 10
    max_idle_time = 300
  }
}
```

#### Issue 6: "Kafka topic offset out of range"

**Cause**: The requested offset no longer exists in the topic

**Solution**:
```hocon
source {
  Kafka {
    # Reset to the earliest or latest offset
    auto.offset.reset = "earliest"   # or "latest"

    # Or control the starting position explicitly
    start_mode = "earliest"
  }
}
```

### FAQ

**Q: What's the difference between BATCH and STREAMING mode?**

A:
- **BATCH**: One-time execution, suitable for full database migration
- **STREAMING**: Continuous execution, suitable for real-time data sync and CDC

**Q: How do I handle schema changes during CDC?**

A: SeaTunnel automatically detects schema changes in CDC mode. Configure this in the source:
```hocon
source {
  Mysql {
    schema_change_mode = "auto"   # auto-detect and apply
  }
}
```

**Q: Can I transform data during synchronization?**

A: Yes, use the SQL transform:
```hocon
transform {
  Sql {
    sql = "SELECT id, UPPER(name) as name FROM source"
  }
}
```

**Q: What's the maximum throughput?**

A: It depends on:
- Hardware (CPU, RAM, network)
- Source/sink database configuration
- Data size per record
- Network latency

Typical throughput is 100K - 1M records/second per executor.

**Q: How do I handle errors in production?**

A: Configure a restart strategy:
```hocon
env {
  restart_strategy = "exponential_delay"
  restart_strategy.exponential_delay.initial_delay = 1000
  restart_strategy.exponential_delay.max_delay = 30000
  restart_strategy.exponential_delay.multiplier = 2.0
  restart_strategy.exponential_delay.attempts_unlimited = true
}
```
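With these values the back-off roughly doubles after each consecutive failure: 1 s before the first restart, then 2 s, 4 s, 8 s, and 16 s, after which every retry waits the 30 s `max_delay` cap. Unlimited attempts therefore keep the job retrying indefinitely without hammering a failing endpoint.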
**Q: Is there a web UI for job management?**

A: Yes, use the **SeaTunnel Web** project:
```bash
# Check out the web UI project
git clone https://github.com/apache/seatunnel-web.git
cd seatunnel-web
mvn clean install

# Run the web UI
java -jar target/seatunnel-web-*.jar
# Access it at http://localhost:8080
```

---

## Resources

### Official Documentation
- [SeaTunnel Official Website](https://seatunnel.apache.org/)
- [GitHub Repository](https://github.com/apache/seatunnel)
- [Documentation Hub](https://seatunnel.apache.org/docs/)
- [Connector List](https://seatunnel.apache.org/docs/2.3.12/connector-v2/overview)

### Community
- [Slack Channel](https://the-asf.slack.com/archives/C01CB5186TL)
- [Mailing Lists](https://seatunnel.apache.org/community/mail-lists/)
- [Issue Tracker](https://github.com/apache/seatunnel/issues)
- [Discussion Forum](https://github.com/apache/seatunnel/discussions)

### Related Projects
- [SeaTunnel Web UI](https://github.com/apache/seatunnel-web)
- [Apache Kafka](https://kafka.apache.org/)
- [Apache Flink](https://flink.apache.org/)
- [Apache Spark](https://spark.apache.org/)

### Learning Resources
- [HOCON Configuration Guide](https://github.com/lightbend/config/blob/main/HOCON.md)
- [SQL Functions Reference](https://seatunnel.apache.org/docs/2.3.12/transform-v2/sql)
- [Change Data Capture Explained](https://en.wikipedia.org/wiki/Change_data_capture)
- [Distributed Systems Concepts](https://en.wikipedia.org/wiki/Distributed_computing)

### Version History
- **2.3.12** (Stable) - Current recommended version
- **2.3.13-SNAPSHOT** (Development)
- [All Releases](https://archive.apache.org/dist/seatunnel/)

---

## Additional Notes

### License
Apache License 2.0 - See the [LICENSE](https://github.com/apache/seatunnel/blob/master/LICENSE) file

### Security
- Report security issues via [Apache Security](https://www.apache.org/security/)
- Do NOT create public issues for security vulnerabilities

### Support & Contribution
- Join the community Slack for support
- Submit feature requests on GitHub Issues
- Contribute code via Pull Requests
- Follow the [Contributing Guide](https://github.com/apache/seatunnel/blob/master/CONTRIBUTING.md)

---

**Last Updated**: 2026-01-28
**Skill Version**: 2.3.13
**Status**: Production Ready ✓