Dev vs Prod Modes

YT Framework supports two execution modes: dev (development) and prod (production). Understanding the differences and when to use each mode is crucial for effective pipeline development.

Overview

**Start with Dev Mode**

Always develop and test your pipelines in dev mode first. It's faster, doesn't require YT credentials, and makes debugging easier.

Dev Mode: Simulates YT operations locally using the file system. Perfect for development, testing, and debugging.
Prod Mode: Executes operations on the actual YT cluster. Used for production workloads.

Both modes use the same code and configuration, making it easy to develop locally and deploy to production.

**Credentials Required for Prod Mode**

Production mode requires YT credentials in `configs/secrets.env`. Make sure to set up credentials before running in prod mode.

Dev Mode

How It Works (dev)

Dev mode simulates YT operations using the local file system:

Tables: Stored as .jsonl files in .dev/ directory
Operations: Executed locally in a subprocess
Code Upload: No-op (code runs directly from local filesystem)
YQL Operations: Executed using DuckDB for local simulation

Configuration (dev)

Set mode in pipeline config:

# configs/config.yaml
pipeline:
  mode: "dev"

Directory Structure (dev)

When running in dev mode, the framework creates a .dev/ directory:

my_pipeline/
├── .dev/
│   ├── table1.jsonl      # Simulated YT tables
│   ├── table2.jsonl
│   └── operation.log     # Operation logs
├── configs/
├── stages/
└── pipeline.py

Table Operations (dev)

Writing tables:

# In dev mode, writes to .dev/table_name.jsonl
self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 1, "name": "Alice"}]
)
# Creates: .dev/data.jsonl

Reading tables:

# In dev mode, reads from .dev/table_name.jsonl
rows = list(self.deps.yt_client.read_table("//tmp/my_pipeline/data"))
# Reads from: .dev/data.jsonl
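
The path mapping above can be sketched in a few lines. This is a minimal illustration, not the framework's implementation: `dev_table_file`, `write_rows`, and `read_rows` are hypothetical helpers, and the sketch assumes the local file name comes from the last component of the YT path, as the `//tmp/my_pipeline/data` to `.dev/data.jsonl` example suggests.

```python
import json
import tempfile
from pathlib import Path


def dev_table_file(table_path: str, dev_dir: Path) -> Path:
    """Map a YT-style path to a local .jsonl file (assumed rule: last path component)."""
    name = table_path.rstrip("/").rsplit("/", 1)[-1]
    return dev_dir / f"{name}.jsonl"


def write_rows(table_path: str, rows: list[dict], dev_dir: Path) -> Path:
    """Write rows as JSON Lines, one object per line."""
    target = dev_table_file(table_path, dev_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
    return target


def read_rows(table_path: str, dev_dir: Path) -> list[dict]:
    """Read rows back from the JSON Lines file, skipping blank lines."""
    with dev_table_file(table_path, dev_dir).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dev_dir = Path(tmp) / ".dev"
        path = write_rows("//tmp/my_pipeline/data", [{"id": 1, "name": "Alice"}], dev_dir)
        print(path.name)  # data.jsonl
        print(read_rows("//tmp/my_pipeline/data", dev_dir))
```

Under this assumed rule, two tables whose paths end in the same component would map to the same local file, so it is worth checking how the framework resolves such collisions before relying on identical table names in different folders.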

Map Operations (dev)

Map operations run locally in a subprocess:

Creates sandbox directory: .dev/sandbox_<input>-><output>/
Copies input table to sandbox
Executes mapper.py script
Collects output to .dev/<output>.jsonl

Example:

# Dev mode execution
.dev/sandbox_input->output/
├── input.jsonl
├── source.tar.gz (extracted)
└── operation_wrapper_*.sh
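
The sandbox steps listed above can be approximated with the standard library. The sketch below is an assumption-laden illustration rather than the framework's code: `run_local_map` and `MAPPER_SOURCE` are hypothetical, the mapper is assumed to read JSON Lines on stdin and write JSON Lines on stdout, and the source archive and wrapper script are left out.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical mapper: reads JSON Lines on stdin, writes JSON Lines on stdout.
MAPPER_SOURCE = '''
import json
import sys

for line in sys.stdin:
    row = json.loads(line)
    row["name_upper"] = row["name"].upper()
    print(json.dumps(row))
'''


def run_local_map(input_file: Path, output_file: Path, sandbox: Path) -> None:
    """Copy the input into a sandbox, run the mapper in a subprocess, collect output."""
    sandbox.mkdir(parents=True, exist_ok=True)
    mapper = sandbox / "mapper.py"
    mapper.write_text(MAPPER_SOURCE, encoding="utf-8")

    # Copy the input table into the sandbox so the job only sees sandbox files.
    sandbox_input = sandbox / "input.jsonl"
    sandbox_input.write_bytes(input_file.read_bytes())

    with sandbox_input.open("rb") as stdin, output_file.open("wb") as stdout:
        subprocess.run(
            [sys.executable, mapper.name],
            cwd=sandbox,
            stdin=stdin,
            stdout=stdout,
            check=True,  # raise CalledProcessError if the mapper exits non-zero
        )
```

Because the mapper runs as a single local process, a failure surfaces immediately as a `CalledProcessError` in the terminal, which is part of what makes dev mode convenient for debugging.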

Vanilla Operations (dev)

Vanilla operations run locally in a subprocess:

Creates sandbox directory: .dev/<stage_name>_sandbox/
Extracts source archive
Executes vanilla.py script
Logs output to .dev/<stage_name>.log
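
The logging step can be pictured as redirecting the subprocess output into a file. The sketch below assumes nothing about the framework's internals: `run_with_log` is a hypothetical helper, and the script and log locations are chosen by the caller.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_log(script: Path, log_file: Path) -> int:
    """Run a script in a subprocess and write its stdout and stderr to a log file."""
    log_file.parent.mkdir(parents=True, exist_ok=True)
    with log_file.open("w", encoding="utf-8") as log:
        result = subprocess.run(
            [sys.executable, str(script)],
            stdout=log,
            stderr=subprocess.STDOUT,  # keep errors in the same log as normal output
        )
    return result.returncode
```

Returning the exit code instead of raising lets the caller decide whether a non-zero status should fail the stage, while the log file keeps the full output for inspection.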

YQL Operations (dev)

YQL operations are simulated using DuckDB:

Joins, filters, aggregations run locally
Results written to .dev/ directory
Full YQL syntax supported

When to Use Dev Mode

Development: Writing and testing new stages
Debugging: Investigating issues locally
Testing: Validating pipeline logic
CI/CD: Running tests without YT cluster access
Learning: Understanding framework behavior

Advantages (dev)

✅ Fast iteration (no network latency)
✅ No YT cluster access required
✅ Easy debugging (files are local)
✅ Free (no cluster resources used)
✅ Works offline

Limitations (dev)

❌ Not suitable for large datasets (limited by local disk)
❌ Some YT-specific features may differ
❌ Performance characteristics differ from production

Prod Mode

How It Works (prod)

Prod mode executes operations on the actual YT cluster:

Tables: Stored on YT cluster at specified paths
Operations: Executed on YT cluster nodes
Code Upload: Code is packaged and uploaded to YT
YQL Operations: Executed using YT's YQL engine

**Cluster Dependencies Required**

In prod mode, `ytjobs` code executes on YT cluster nodes. Either the cluster's Docker image must include the required dependencies, or you must use a custom Docker image. See [Cluster Requirements](configuration/cluster-requirements.md) for details.

Configuration (prod)

Set mode in pipeline config:

# configs/config.yaml
pipeline:
  mode: "prod"
  build_folder: "//tmp/my_pipeline/build"

Required credentials (configs/secrets.env):

YT_PROXY=your-yt-proxy-url
YT_TOKEN=your-yt-token
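
A quick pre-flight check can catch missing credentials before a prod run starts. The sketch below is a hypothetical helper, not part of the framework: `parse_env_file` and `missing_credentials` are illustrative names, and the parser assumes plain `KEY=VALUE` lines without quoting or `export` prefixes.

```python
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("YT_PROXY", "YT_TOKEN")


def parse_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_credentials(path: Path) -> list[str]:
    """Return the required keys that are absent or empty in the env file."""
    values = parse_env_file(path) if path.exists() else {}
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

Calling `missing_credentials(Path("configs/secrets.env"))` returns an empty list when both values are set, and the names of the missing keys otherwise.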

Table Operations (prod)

Writing tables:

# In prod mode, writes to YT cluster
self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 1, "name": "Alice"}]
)
# Creates: //tmp/my_pipeline/data on YT cluster

Reading tables:

# In prod mode, reads from YT cluster
rows = list(self.deps.yt_client.read_table("//tmp/my_pipeline/data"))
# Reads from: //tmp/my_pipeline/data on YT cluster

Map Operations (prod)

Map operations run on YT cluster:

Code is uploaded to build_folder
YT creates jobs on cluster nodes
Each job processes a portion of the input table
Results are written to the output table on the cluster

Vanilla Operations (prod)

Vanilla operations run on YT cluster:

Code is uploaded to build_folder
YT creates a job on a cluster node
The job executes the vanilla.py script
Logs are available in the YT web UI

YQL Operations (prod)

YQL operations execute on YT cluster:

Uses YT's distributed YQL engine
Handles large datasets efficiently
Full YT YQL syntax supported

When to Use Prod Mode

Production: Running production workloads
Large Datasets: Processing data that doesn't fit locally
Performance: Running workloads that need cluster performance and parallelism
Integration: Integrating with other YT-based systems

Advantages (prod)

✅ Handles large datasets (distributed storage)
✅ High performance (distributed processing)
✅ Scalability (cluster resources)
✅ Production-ready (real YT environment)

Limitations (prod)

❌ Requires YT cluster access
❌ Slower iteration (network latency)
❌ Consumes cluster resources
❌ Harder to debug (remote execution)

Quick Comparison

Configuration

**Dev Mode:**

```yaml
pipeline:
  mode: "dev"
```

**Prod Mode:**

```yaml
pipeline:
  mode: "prod"
  build_folder: "//tmp/my_pipeline/build"
```

Credentials

**Dev Mode:**
- No credentials required
- Works offline

**Prod Mode:**
- Requires `configs/secrets.env`
- Must have YT cluster access

Performance

**Dev Mode:**
- Fast iteration
- Limited by local resources
- Sequential execution

**Prod Mode:**
- Distributed processing
- Scales with cluster size
- Parallel execution

Debugging

**Dev Mode:**
- Files in `.dev/` directory
- Immediate error feedback
- Easy to inspect

**Prod Mode:**
- YT web UI for logs
- Remote debugging
- Requires cluster access

Switching Between Modes

To switch between modes, change the mode setting in the pipeline config:

# Development
pipeline:
  mode: "dev"

# Production
pipeline:
  mode: "prod"

**Same Code, Different Execution**

The same code and configuration work in both modes. The framework handles the differences automatically.

Important considerations:

Table paths: Same paths work in both modes (dev mode maps them to .dev/)
Credentials: Prod mode requires secrets.env with YT credentials
Build folder: Prod mode requires build_folder, the YT location where code is uploaded before execution
Code changes: Dev mode uses local code, prod mode uploads code
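
One way to picture "same code, different execution" is a small factory that picks a backend from the config, so stage code never checks the mode itself. The sketch below is illustrative only: `DevBackend`, `ProdBackend`, and `make_backend` are hypothetical names, and the dev path rule (last path component becomes the file name) is an assumption based on the examples in this document.

```python
from dataclasses import dataclass


@dataclass
class DevBackend:
    dev_dir: str = ".dev"

    def resolve(self, table_path: str) -> str:
        # Assumed rule: the last path component becomes the local file name.
        name = table_path.rstrip("/").rsplit("/", 1)[-1]
        return f"{self.dev_dir}/{name}.jsonl"


@dataclass
class ProdBackend:
    build_folder: str

    def resolve(self, table_path: str) -> str:
        # In prod, the path is used as-is on the cluster.
        return table_path


def make_backend(config: dict):
    """Pick a backend from the pipeline config and validate mode-specific settings."""
    pipeline = config.get("pipeline", {})
    mode = pipeline.get("mode")
    if mode == "dev":
        return DevBackend()
    if mode == "prod":
        build_folder = pipeline.get("build_folder")
        if not build_folder:
            raise ValueError("prod mode requires pipeline.build_folder")
        return ProdBackend(build_folder=build_folder)
    raise ValueError(f"unknown mode: {mode!r}")


backend = make_backend({"pipeline": {"mode": "dev"}})
print(backend.resolve("//tmp/my_pipeline/data"))  # .dev/data.jsonl
```

Validating `build_folder` at construction time means a misconfigured prod run fails before any code is uploaded, rather than partway through the pipeline.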

Leaky Abstractions

While the framework tries to abstract away differences, some leak through:

File Paths

Dev mode:

Tables stored as .jsonl files
Path //tmp/my_pipeline/data becomes .dev/data.jsonl

Prod mode:

Tables stored on YT cluster
Path //tmp/my_pipeline/data is actual YT path

What to know:

Same code works in both modes
Path format is the same (//tmp/...)
Dev mode automatically maps paths to local files

Operation Execution

Dev mode:

Map operations run sequentially (one job)
Limited parallelism
Uses local resources

Prod mode:

Map operations run in parallel (multiple jobs)
Full cluster parallelism
Uses cluster resources

What to know:

Performance characteristics differ
Dev mode may not catch all concurrency issues
Test in prod mode for production workloads
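
A mapper that keeps per-process state is a typical case that looks correct with one job and breaks when the input is split. The sketch below simulates both situations in plain Python; `make_mapper` and `run_job` are hypothetical helpers, and the two-way split is a simplified stand-in for how a cluster might partition the input table.

```python
def make_mapper():
    """Return a mapper that numbers rows using state local to one job."""
    counter = 0

    def mapper(row: dict) -> dict:
        nonlocal counter
        out = {**row, "row_index": counter}
        counter += 1
        return out

    return mapper


def run_job(rows: list[dict]) -> list[dict]:
    """Simulate one job: a fresh mapper instance processes its share of the rows."""
    mapper = make_mapper()
    return [mapper(row) for row in rows]


rows = [{"id": i} for i in range(4)]

# One job over the whole table, as in dev mode.
single = run_job(rows)

# Two jobs, each with half of the table, as a simplified picture of prod mode.
split = run_job(rows[:2]) + run_job(rows[2:])

print([r["row_index"] for r in single])  # [0, 1, 2, 3]
print([r["row_index"] for r in split])   # [0, 1, 0, 1]
```

The indices are unique with a single job but repeat once the input is split, which is why logic that depends on global ordering or counters should be validated in prod mode on a small dataset.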

Code Execution

Dev mode:

Code runs directly from local filesystem
No code upload needed
Changes are immediately available

Prod mode:

Code is packaged and uploaded
Must upload before execution
Changes require re-upload

What to know:

Dev mode is faster for iteration
Prod mode requires build_folder configuration
Code structure must be compatible with both modes

Error Handling

Dev mode:

Errors show in terminal
Stack traces are immediate
Easy to debug

Prod mode:

Errors appear in the YT web UI
Stack traces are in the operation logs
Requires YT access to debug

What to know:

Use dev mode for debugging
Check YT web UI for prod errors
Logs are crucial for prod debugging

Debugging Tips

Dev Mode Debugging

Check .dev/ directory: See generated files and tables
Check logs: Operation logs in .dev/ directory
Inspect tables: Open .jsonl files directly
Add print statements: Output appears immediately

Prod Mode Debugging

Check YT web UI: View operations and logs
Use logging: self.logger output appears in YT logs
Check operation status: Monitor in YT web UI
Download results: Download tables for local inspection

Common Issues

Issue: Tables not found in prod mode

Check table paths exist on YT cluster
Verify YT credentials are correct
Check YT proxy URL is accessible

Issue: Code not updating in prod mode

Code is uploaded once per pipeline run
Changes require re-running pipeline
Check build_folder is correct

Issue: Different behavior in dev vs prod

Check for YT-specific features
Verify resource limits
Test with similar data sizes

Best Practices

**Development Workflow**

1. Develop and test in dev mode
2. Validate in prod mode with small dataset
3. Deploy to production with full dataset

Develop in dev mode: Faster iteration and debugging
Test in prod mode: Validate before production deployment
Use same configs: Keep dev and prod configs similar
Monitor resources: Check resource usage in prod mode
Version control: Track config changes between modes

**Test Before Production**

Always test your pipeline in prod mode with a small dataset before running on production data. This helps catch mode-specific issues early.

Next Steps

Understand Cluster Requirements for production mode dependencies
Learn about Configuration management
Explore Operations for different operation types
Check out Examples for mode-specific examples
Review Troubleshooting for mode-specific issues
