YT Framework supports two execution modes: dev (development) and prod (production). Understanding the differences and when to use each mode is crucial for effective pipeline development.
**Start with Dev Mode**
Always develop and test your pipelines in dev mode first. It's faster, doesn't require YT credentials, and makes debugging easier.
- Dev Mode: Simulates YT operations locally using the file system. Perfect for development, testing, and debugging.
- Prod Mode: Executes operations on the actual YT cluster. Used for production workloads.
Both modes use the same code and configuration, making it easy to develop locally and deploy to production.
**Credentials Required for Prod Mode**
Production mode requires YT credentials in `configs/secrets.env`. Make sure to set up credentials before running in prod mode.
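The file holds plain `KEY=VALUE` lines. The framework reads it itself in prod mode; purely to illustrate the format, a minimal parser (a sketch, not the framework's loader) might look like:

```python
import os

def load_env_file(path: str) -> None:
    """Load simple KEY=VALUE lines into the process environment."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comments
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()
```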
Dev mode simulates YT operations using the local file system:
- Tables: Stored as `.jsonl` files in the `.dev/` directory
- Operations: Executed locally using subprocess
- Code Upload: No-op (code runs directly from local filesystem)
- YQL Operations: Executed using DuckDB for local simulation
Set mode in pipeline config:
```yaml
# configs/config.yaml
pipeline:
  mode: "dev"
```

When running in dev mode, the framework creates a `.dev/` directory:
```
my_pipeline/
├── .dev/
│   ├── table1.jsonl    # Simulated YT tables
│   ├── table2.jsonl
│   └── operation.log   # Operation logs
├── configs/
├── stages/
└── pipeline.py
```
Writing tables:
```python
# In dev mode, writes to .dev/table_name.jsonl
self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 1, "name": "Alice"}]
)
# Creates: .dev/data.jsonl
```

Reading tables:
```python
# In dev mode, reads from .dev/table_name.jsonl
rows = list(self.deps.yt_client.read_table("//tmp/my_pipeline/data"))
# Reads from: .dev/data.jsonl
```

Map operations run locally using subprocess:
- Creates sandbox directory: `.dev/sandbox_<input>-><output>/`
- Copies input table to sandbox
- Executes the `mapper.py` script
- Collects output to `.dev/<output>.jsonl`
Example:
```
# Dev mode execution
.dev/sandbox_input->output/
├── input.jsonl
├── source.tar.gz (extracted)
└── operation_wrapper_*.sh
```

Vanilla operations run locally using subprocess:
- Creates sandbox directory: `.dev/<stage_name>_sandbox/`
- Extracts source archive
- Executes the `vanilla.py` script
- Logs output to `.dev/<stage_name>.log`
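The sandboxed subprocess flow described for map and vanilla operations can be sketched as a minimal local runner. Names like `run_local_map` and the stdin/stdout mapper contract are assumptions for illustration, not the framework's real API:

```python
import shutil
import subprocess
import sys
from pathlib import Path

def run_local_map(input_table: Path, mapper_script: Path, output_table: Path) -> None:
    """Sketch of dev-mode map execution: sandbox, run mapper, collect output."""
    sandbox = Path(".dev") / f"sandbox_{input_table.stem}->{output_table.stem}"
    sandbox.mkdir(parents=True, exist_ok=True)
    # Copy the input table into the sandbox, mirroring the layout above
    shutil.copy(input_table, sandbox / "input.jsonl")
    # Run the mapper as a subprocess: JSON-lines rows in on stdin, out on stdout
    with open(sandbox / "input.jsonl") as stdin, open(output_table, "w") as stdout:
        subprocess.run([sys.executable, str(mapper_script)],
                       stdin=stdin, stdout=stdout, check=True)
```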
YQL operations are simulated using DuckDB:
- Joins, filters, aggregations run locally
- Results written to the `.dev/` directory
- Full YQL syntax supported
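As a rough illustration of this simulation strategy: load the `.jsonl` rows into a local SQL engine and run the query there. The framework uses DuckDB; the sketch below substitutes stdlib `sqlite3` to stay dependency-free, and the fixed `id`/`name` schema is an assumption:

```python
import json
import sqlite3

def run_local_sql(jsonl_path: str, query: str) -> list:
    """Simulate a YQL-style query locally by loading JSON-lines rows
    into an in-memory SQL engine and running the query against them."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE data (id INTEGER, name TEXT)")
    with open(jsonl_path) as f:
        for line in f:
            row = json.loads(line)
            con.execute("INSERT INTO data VALUES (?, ?)", (row["id"], row["name"]))
    return con.execute(query).fetchall()
```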
Use dev mode for:
- Development: Writing and testing new stages
- Debugging: Investigating issues locally
- Testing: Validating pipeline logic
- CI/CD: Running tests without YT cluster access
- Learning: Understanding framework behavior
- ✅ Fast iteration (no network latency)
- ✅ No YT cluster access required
- ✅ Easy debugging (files are local)
- ✅ Free (no cluster resources used)
- ✅ Works offline
- ❌ Not suitable for large datasets (limited by local disk)
- ❌ Some YT-specific features may differ
- ❌ Performance characteristics differ from production
Prod mode executes operations on the actual YT cluster:
- Tables: Stored on YT cluster at specified paths
- Operations: Executed on YT cluster nodes
- Code Upload: Code is packaged and uploaded to YT
- YQL Operations: Executed using YT's YQL engine
**Cluster Dependencies Required**
In prod mode, `ytjobs` code executes on YT cluster nodes. The cluster's Docker image must include required dependencies or you must use custom Docker images. See [Cluster Requirements](configuration/cluster-requirements.md) for details.
Set mode in pipeline config:
```yaml
# configs/config.yaml
pipeline:
  mode: "prod"
  build_folder: "//tmp/my_pipeline/build"
```

Required credentials (`configs/secrets.env`):

```
YT_PROXY=your-yt-proxy-url
YT_TOKEN=your-yt-token
```

Writing tables:
```python
# In prod mode, writes to YT cluster
self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 1, "name": "Alice"}]
)
# Creates: //tmp/my_pipeline/data on YT cluster
```

Reading tables:
```python
# In prod mode, reads from YT cluster
rows = list(self.deps.yt_client.read_table("//tmp/my_pipeline/data"))
# Reads from: //tmp/my_pipeline/data on YT cluster
```

Map operations run on YT cluster:
- Code is uploaded to `build_folder`
- YT creates jobs on cluster nodes
- Each job processes a portion of input table
- Results are written to output table on cluster
Vanilla operations run on YT cluster:
- Code is uploaded to `build_folder`
- YT creates job on cluster node
- Job executes vanilla.py script
- Logs available in YT web UI
YQL operations execute on YT cluster:
- Uses YT's distributed YQL engine
- Handles large datasets efficiently
- Full YT YQL syntax supported
Use prod mode for:
- Production: Running production workloads
- Large Datasets: Processing data that doesn't fit locally
- Performance: Need cluster performance and parallelism
- Integration: Integrating with other YT-based systems
- ✅ Handles large datasets (distributed storage)
- ✅ High performance (distributed processing)
- ✅ Scalability (cluster resources)
- ✅ Production-ready (real YT environment)
- ❌ Requires YT cluster access
- ❌ Slower iteration (network latency)
- ❌ Costs cluster resources
- ❌ Harder to debug (remote execution)
**Configuration**

**Dev Mode:**

```yaml
pipeline:
  mode: "dev"
```

**Prod Mode:**

```yaml
pipeline:
  mode: "prod"
  build_folder: "//tmp/my_pipeline/build"
```

**Dev Mode:**
- No credentials required
- Works offline
**Prod Mode:**
- Requires `configs/secrets.env`
- Must have YT cluster access
**Dev Mode:**
- Fast iteration
- Limited by local resources
- Sequential execution
**Prod Mode:**
- Distributed processing
- Scales with cluster size
- Parallel execution
**Dev Mode:**
- Files in `.dev/` directory
- Immediate error feedback
- Easy to inspect
**Prod Mode:**
- YT web UI for logs
- Remote debugging
- Requires cluster access
Switching between modes only requires changing the mode setting:

```yaml
# Development
pipeline:
  mode: "dev"

# Production
pipeline:
  mode: "prod"
```

**Same Code, Different Execution**
The same code and configuration work in both modes. The framework handles the differences automatically.
Important considerations:
- Table paths: Same paths work in both modes (dev mode maps them to `.dev/`)
- Credentials: Prod mode requires `secrets.env` with YT credentials
- Build folder: Prod mode requires `build_folder` for code execution
- Code changes: Dev mode uses local code; prod mode uploads code
While the framework tries to abstract away differences, some leak through:
Dev mode:
- Tables stored as `.jsonl` files
- Path `//tmp/my_pipeline/data` becomes `.dev/data.jsonl`

Prod mode:
- Tables stored on YT cluster
- Path `//tmp/my_pipeline/data` is the actual YT path
What to know:
- Same code works in both modes
- Path format is the same (`//tmp/...`)
- Dev mode automatically maps paths to local files
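Judging by the examples above, the mapping keeps only the last path component. A hypothetical sketch (the framework's actual rule may handle collisions or nesting differently):

```python
from pathlib import Path

def dev_path(yt_path: str) -> Path:
    """Map a YT table path to its dev-mode local file, keeping only the
    final path component, e.g. //tmp/my_pipeline/data -> .dev/data.jsonl."""
    table_name = yt_path.rstrip("/").rsplit("/", 1)[-1]
    return Path(".dev") / f"{table_name}.jsonl"
```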
Dev mode:
- Map operations run sequentially (one job)
- Limited parallelism
- Uses local resources
Prod mode:
- Map operations run in parallel (multiple jobs)
- Full cluster parallelism
- Uses cluster resources
What to know:
- Performance characteristics differ
- Dev mode may not catch all concurrency issues
- Test in prod mode for production workloads
Dev mode:
- Code runs directly from local filesystem
- No code upload needed
- Changes are immediately available
Prod mode:
- Code is packaged and uploaded
- Must upload before execution
- Changes require re-upload
What to know:
- Dev mode is faster for iteration
- Prod mode requires `build_folder` configuration
- Code structure must be compatible with both modes
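The packaging step might be sketched like this. The `package_source` name and the choice to bundle only `.py` files are illustrative assumptions; the framework's actual packaging (hinted at by `source.tar.gz` in dev-mode sandboxes) is internal:

```python
import tarfile
from pathlib import Path

def package_source(src_dir: str, archive_path: str = "source.tar.gz") -> str:
    """Bundle the pipeline's Python sources into a tar.gz for upload."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(Path(src_dir).rglob("*.py")):
            # Store paths relative to the source root
            tar.add(path, arcname=str(path.relative_to(src_dir)))
    return archive_path
```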
Dev mode:
- Errors show in terminal
- Stack traces are immediate
- Easy to debug
Prod mode:
- Errors in YT web UI
- Stack traces in operation logs
- Requires YT access to debug
What to know:
- Use dev mode for debugging
- Check YT web UI for prod errors
- Logs are crucial for prod debugging
Dev mode:
- Check `.dev/` directory: See generated files and tables
- Check logs: Operation logs in the `.dev/` directory
- Inspect tables: Open `.jsonl` files directly
- Add print statements: Output appears immediately
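Since dev-mode tables are plain JSON-lines files, a tiny helper (hypothetical, not part of the framework) is enough to peek at them:

```python
import json

def head_table(path: str, limit: int = 5) -> list:
    """Return (and print) the first few rows of a dev-mode .jsonl table."""
    rows = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            rows.append(json.loads(line))
    for row in rows:
        print(row)
    return rows
```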
Prod mode:
- Check YT web UI: View operations and logs
- Use logging: `self.logger` output appears in YT logs
- Check operation status: Monitor in YT web UI
- Download results: Download tables for local inspection
- Check table paths exist on YT cluster
- Verify YT credentials are correct
- Check YT proxy URL is accessible
- Code is uploaded once per pipeline run
- Changes require re-running the pipeline
- Check that `build_folder` is correct
- Check for YT-specific features
- Verify resource limits
- Test with similar data sizes
**Development Workflow**
1. Develop and test in dev mode
2. Validate in prod mode with small dataset
3. Deploy to production with full dataset
- Develop in dev mode: Faster iteration and debugging
- Test in prod mode: Validate before production deployment
- Use same configs: Keep dev and prod configs similar
- Monitor resources: Check resource usage in prod mode
- Version control: Track config changes between modes
**Test Before Production**
Always test your pipeline in prod mode with a small dataset before running on production data. This helps catch mode-specific issues early.
- Understand Cluster Requirements for production mode dependencies
- Learn about Configuration management
- Explore Operations for different operation types
- Check out Examples for mode-specific examples
- Review Troubleshooting for mode-specific issues