Definitions explanation is not clear or understandable (and other suggestions) #25717

Open
cobrienbeam opened this issue Nov 3, 2024 · 5 comments
Labels
area: docs Related to documentation in general

Comments

@cobrienbeam

cobrienbeam commented Nov 3, 2024

What's the issue or suggestion?

A Definitions object is a set of Dagster definitions available and loadable by Dagster tools.

This is a circular sentence. If a Definitions object is a set of available Dagster definitions, then what are the Dagster definitions, and what makes them available vs. not available? It's totally unclear.

Additionally, the added explanation does not really help explain:

The Definitions object is used to assign definitions to a code location, and each code location can only have a single Definitions object. This object maps to one code location. With code locations, users isolate multiple Dagster projects from each other without requiring multiple deployments. You’ll learn more about code locations a bit later in this lesson.

What are code locations, and why can they have only a single Definitions object? Okay, so the cardinality between Definitions objects and code locations is 1:1, but that doesn't really explain the rest of it.

Additional information

A Definitions object is like a project manifest for Dagster - it bundles together all the assets, jobs, schedules, and other components that make up a single Dagster project. It's like a menu that tells Dagster exactly what's available to run in this specific project. Each separate project (called a code location) needs its own Definitions object, and you can't have multiple Definitions objects in the same location. This setup lets you keep different Dagster projects completely separate from each other, without needing to set up multiple Dagster deployments.

Why do we need this?

Two main reasons:

  1. Project Isolation: Let's say you have two different data projects:
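# analytics/definitions.py
from dagster import Definitions

# revenue_dashboard and customer_metrics are assets defined elsewhere in this project
defs = Definitions(
    assets=[revenue_dashboard, customer_metrics]
)

# marketing/definitions.py
from dagster import Definitions

# email_campaigns and social_media_stats are assets defined elsewhere in this project
defs = Definitions(
    assets=[email_campaigns, social_media_stats]
)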

Each project has its own Definitions, so they don't interfere with each other.

  2. Discovery: When Dagster starts up, it looks for these Definitions objects to know what assets, jobs, and resources are available to run.

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

@cobrienbeam added the "area: docs" label on Nov 3, 2024
@cobrienbeam
Author

cobrienbeam commented Nov 4, 2024

Additionally, maybe there could be a link out to a page that discusses the use of projects vs deployments. I like how the Next.js and React documentation link out to different sections that discuss the potential tradeoffs of one choice vs another.

That page could discuss when to use additional deployments rather than additional projects:

  1. Security/Compliance Requirements:

Company Infrastructure
├── Production Deployment (PCI Compliant)
│   └── Financial Projects
│       ├── payment_processing
│       └── customer_billing
│
└── Standard Deployment
    ├── Marketing Projects
    └── Analytics Projects

  • If some projects need stricter security or compliance requirements (like PCI for payment data), separating them into different deployments helps with compliance.

  2. Resource Isolation:

Infrastructure
├── Heavy Computing Deployment (32 CPU, 128GB RAM)
│   └── ML Training Projects
│       ├── model_training
│       └── batch_inference
│
└── Light Computing Deployment (4 CPU, 16GB RAM)
    └── ETL Projects
        ├── daily_reports
        └── data_ingestion

  • When projects have vastly different resource needs, separate deployments prevent resource contention.

  3. Team/Organization Structure:

Company
├── Team A Deployment
│   └── Projects with specific permissions/access
│
└── Team B Deployment
    └── Different security groups/access patterns

  • When teams need complete isolation or different access patterns.

  4. Environment Criticality:

Business Critical Deployment
├── Revenue impacting jobs
└── Customer-facing data pipelines

Non-Critical Deployment
├── Internal analytics
└── Experimental projects

  • When downtime impact varies significantly between projects.

  5. Scale/Performance:

  • When you have so many projects that the UI becomes slow
  • When job runs start queueing too much
  • When the deployment's database gets too large

The key question is: "Do these projects NEED to be separate?" rather than "CAN they be separate?"

Using a single deployment has the following benefits:

  • Easier maintenance
  • Centralized monitoring
  • Shared resources
  • Simpler infrastructure

The docs could then provide more information on workspaces that use definitions.py files instead of __init__.py:

You need to explicitly tell Dagster where to find your definitions through the workspace.yaml file:

load_from:
  - python_file: marketing/definitions.py
    location_name: marketing_tools
  
  - python_file: finance/definitions.py
    location_name: finance_tools

@cobrienbeam changed the title from "Definitions explanation is not clear or understandable" to "Definitions explanation is not clear or understandable (and other suggestions)" on Nov 4, 2024
@cobrienbeam
Author

cobrienbeam commented Nov 4, 2024

I didn't quite understand the use of the unpacking operator notation in the definition example:

The asterisk * in Python is the "unpacking operator".

# Let's say trip_assets contains these assets:
trip_assets = [taxi_trips, taxi_zones, taxi_trips_file]

# And metric_assets contains:
metric_assets = [revenue_by_day, trips_by_day]

# When you use * it "unpacks" the lists:
defs = Definitions(
    assets=[*trip_assets, *metric_assets]
)

# This is equivalent to writing:
defs = Definitions(
    assets=[
        taxi_trips,
        taxi_zones, 
        taxi_trips_file,
        revenue_by_day,
        trips_by_day
    ]
)

Without the *, you'd get nested lists:

# Without unpacking (WRONG):
defs = Definitions(
    assets=[trip_assets, metric_assets]
)
# This would be like:
assets = [[taxi_trips, taxi_zones, taxi_trips_file], [revenue_by_day, trips_by_day]]  # Nested lists!

# With unpacking (CORRECT):
defs = Definitions(
    assets=[*trip_assets, *metric_assets]
)
# This correctly flattens to:
assets = [taxi_trips, taxi_zones, taxi_trips_file, revenue_by_day, trips_by_day]  # Flat list!

You'll often see this pattern when you want to combine multiple lists into a single flat list.

It's like saying "take everything out of these lists and put them all together in one new list."
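The difference between the flat and nested results can be checked in plain Python. This is only a sketch: the strings below are placeholders standing in for the Dagster asset objects, so it runs without Dagster installed.

```python
# Placeholder strings stand in for Dagster asset objects (an assumption for
# illustration, so the snippet runs without Dagster installed).
trip_assets = ["taxi_trips", "taxi_zones", "taxi_trips_file"]
metric_assets = ["revenue_by_day", "trips_by_day"]

# With unpacking: every element of both lists lands in one new flat list.
flat = [*trip_assets, *metric_assets]

# Without unpacking: the outer list holds the two original lists as its elements.
nested = [trip_assets, metric_assets]

print(len(flat))    # 5 items
print(len(nested))  # 2 items, each of which is itself a list

# Unpacking into a new list gives the same result as list concatenation.
assert flat == trip_assets + metric_assets
```

Writing `trip_assets + metric_assets` would give the same flat list; the `*` form is handy when individual items and whole lists are mixed in one literal.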

@cobrienbeam
Author

I wish the explanation on os.getenv and EnvVar was a little bit clearer:

With os.getenv:

  1. Start Dagster server
  2. Value of DUCKDB_DATABASE is read once, when the code is loaded, and is locked in
  3. Change environment variable
  4. Run asset → still uses old database path
  5. Must restart server to pick up new value

With EnvVar:

  1. Start Dagster server
  2. Run asset → checks DUCKDB_DATABASE value
  3. Change environment variable
  4. Run asset again → uses new database path
  5. No server restart needed!

It's especially useful for:

  • Switching between development/staging/production databases
  • Updating API keys
  • Changing resource configurations without downtime
  • Testing with different configurations
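The two sequences above come down to reading a value eagerly versus deferring the lookup until it is needed. Here is a minimal plain-Python sketch of that idea. The `LazyEnv` class and the database paths are hypothetical, made up for illustration; this is not how Dagster's EnvVar is implemented, only the same general pattern.

```python
import os


class LazyEnv:
    """Hypothetical stand-in for a deferred lookup (NOT Dagster's EnvVar)."""

    def __init__(self, name):
        # Only the variable's name is stored; nothing is read yet.
        self.name = name

    def get_value(self):
        # The environment is consulted each time the value is requested.
        return os.environ.get(self.name)


# Hypothetical paths, used only for this illustration.
os.environ["DUCKDB_DATABASE"] = "data/staging.duckdb"

eager = os.getenv("DUCKDB_DATABASE")  # value is read right now and stored
lazy = LazyEnv("DUCKDB_DATABASE")     # only the name is remembered

# The variable changes after both objects were created.
os.environ["DUCKDB_DATABASE"] = "data/production.duckdb"

print(eager)             # data/staging.duckdb (the old value)
print(lazy.get_value())  # data/production.duckdb (the current value)
```

In this sketch the eager string never changes after it is read, while the deferred object sees whatever the variable holds at lookup time. Exactly when Dagster resolves an EnvVar is up to Dagster and is not modeled here.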

@lydialimlh

You seem to have understood the unpacking operator of python quite well, you've correctly explained how it works. (I'm just a rando, not from the Dagster team)

@cobrienbeam
Author

> You seem to have understood the unpacking operator of python quite well, you've correctly explained how it works. (I'm just a rando, not from the Dagster team)

That was my proposed wording for the documentation (in a callout, side link, etc.) regarding the asterisk notation in the example.
