0.19.0
Simplified backend integration
To provide the best multi-cloud experience and GPU availability, dstack integrates with many cloud GPU providers including AWS, Azure, GCP, RunPod, Lambda, Vultr, and others. As we'd like to see even more GPU providers supported by dstack, this release comes with a major internal refactoring aimed at simplifying the process of adding new integrations. See the Backend integration guide for more details. Join our Discord if you have any questions about the integration process.
MPI workloads and NCCL tests
dstack now configures internode SSH connectivity for distributed tasks. You can log in to any node from any node via SSH with a simple ssh <node_ip> command. The out-of-the-box SSH connectivity also allows running mpirun. See the NCCL Tests example.
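As a rough illustration of the inter-node SSH connectivity, the task below has the first node log in to every node and print its hostname. This is a minimal sketch, not the NCCL Tests example itself: the run name is hypothetical, DSTACK_NODE_RANK and DSTACK_NODES_IPS are assumed to be the environment variables dstack sets for distributed tasks, and the sleep on the other nodes is only a placeholder that keeps them alive while the first node connects.

```yaml
type: task
name: ssh-check   # hypothetical run name
nodes: 2          # distributed task running on two nodes

commands:
  - |
    # Assumption: dstack exposes the node rank and the list of node IPs
    # through DSTACK_NODE_RANK and DSTACK_NODES_IPS.
    if [ "${DSTACK_NODE_RANK}" -eq 0 ]; then
      # The first node logs in to each node over SSH and prints its hostname.
      for ip in ${DSTACK_NODES_IPS}; do
        ssh "${ip}" hostname
      done
    else
      # Placeholder: keep the other nodes running while the first node connects.
      sleep 60
    fi
```

An mpirun launch would follow the same pattern, with the first node building a hostfile from the node IPs and the other nodes waiting for it to finish.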
Cost and usage metrics
In addition to DCGM metrics, dstack now exports a set of Prometheus metrics for cost and usage tracking. Here's how it may look in the Grafana dashboard:
See the documentation for a full list of metrics and labels.
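To chart these metrics in Grafana, Prometheus first has to scrape the dstack server. The fragment below is a minimal sketch of a Prometheus scrape configuration. The job name, the localhost:3000 target, and the /metrics path are assumptions rather than details from these notes, so verify them against the documentation and your deployment.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: dstack            # hypothetical job name
    metrics_path: /metrics      # assumed path of the dstack server metrics endpoint
    static_configs:
      - targets:
          - "localhost:3000"    # assumed address of the dstack server
```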
Cursor IDE support
dstack can now launch Cursor dev environments. Just specify ide: cursor in the run configuration:
type: dev-environment
ide: cursor
Deprecations
- The Python API methods get_plan(), exec_plan(), and submit() are deprecated in favor of get_run_plan(), apply_plan(), and apply_configuration(). The deprecated methods had clumsy signatures with many top-level parameters. The new signatures align better with the CLI and HTTP API.
Breaking changes
The 0.19.0 release drops several previously deprecated or undocumented features. There are no other significant breaking changes. The 0.19.0 server continues to support 0.18.x CLI versions. However, the 0.19.0 CLI does not work with older 0.18.x servers, so update the server first, or update the server and the clients simultaneously.
- Drop the dstack run CLI command.
- Drop the --attach mode for the dstack logs CLI command.
- Drop Pools functionality:
  - The dstack pool CLI commands.
  - The /api/project/{project_name}/runs/get_offers, /api/project/{project_name}/runs/create_instance, /api/pools/list_instances, and /api/project/{project_name}/pool/* API endpoints.
  - The pool_name and instance_name parameters in profiles and run configurations.
- Remove retry_policy from profiles.
- Remove termination_idle_time and termination_policy from profiles and fleet configurations.
- Drop RUN_NAME and REPO_ID run environment variables.
- Drop the /api/backends/config_values endpoint used for interactive configuration.
- The API accepts and returns azure_config["regions"] instead of azure_config["locations"] (unified with server/config.yml).
What's Changed
- Fix gateways with a previously used IP address by @jvstme in #2388
- Simplify backend configurators and models by @r4victor in #2389
- Store BackendType as string instead of enum in the DB by @r4victor in #2393
- Introduce ComputeWith classes to detect compute features by @r4victor in #2392
- Move backend/compute configs from config.py to models.py by @r4victor in #2395
- Provide default run_job implementation for VM backends by @r4victor in #2396
- Configure inter-node SSH on multi-node tasks by @un-def in #2394
- [Blog] Using SSH fleets with TensorWave's private AMD cloud by @peterschmidt85 in #2391
- Add script to generate boilerplate code for new backend by @r4victor in #2397
- Add datacenter-gpu-manager-4-proprietary to CUDA images by @un-def in #2399
- Drop pools by @r4victor in #2401
- Transition high-level Python runs API to new methods by @r4victor in #2403
- Drop dstack run by @r4victor in #2404
- Drop dstack logs --attach by @r4victor in #2405
- Remove retry_policy from profiles by @r4victor in #2406
- Remove termination_idle_time and termination_policy by @r4victor in #2407
- Clean up models backward compatibility code by @r4victor in #2408
- Restore removed models fields for compatibility with 0.18 clients by @r4victor in #2414
- Clean up legacy repo fields by @jvstme in #2411
- Switch AWS gateways from t2.micro to t3.micro by @r4victor in #2416
- Remove old client excludes by @r4victor in #2417
- Use new JobTerminationReason values by @r4victor in #2418
- Drop RUN_NAME and REPO_ID env vars by @r4victor in #2419
- Drop irrelevant Nebius backend implementation by @jvstme in #2421
- [Feature]: Support the cursor IDE #2412 by @peterschmidt85 in #2413
- Simplify implementation of new backends #2372 by @olgenn in #2423
- Support multiple domains with Entra login by @r4victor in #2424
- Support setting project members by email by @r4victor in #2429
- Fix json schema reference and invalid properties errors by @r4victor in #2433
- [Blog]: DeepSeek R1 inference performance: MI300X vs. H200 by @peterschmidt85 in #2425
- Add new metrics by @un-def in #2434
- Add instance and job cost/usage Prometheus metrics by @un-def in #2432
- [Docker] Add dstackai/efa image by @un-def in #2422
- Restore fleet termination_policy for 0.18 backward compatibility by @r4victor in #2436
- [Bug]: Search over users doesn't work by @olgenn in #2439
- [Feature]: Support activating/deactivating users via the UI by @olgenn in #2440
- [Feature]: Display Assigned Gateway Information on Run Pages by @olgenn in #2438
- [Docs]: Update the Metrics guide by @peterschmidt85 in #2441
- [Examples] Update nccl-tests by @un-def in #2415
Full Changelog: 0.18.44...0.19.0