From 7154424847f18f8210fddf029a43f8d63ca47460 Mon Sep 17 00:00:00 2001
From: Anda Zhou <83614683+azhou-determined@users.noreply.github.com>
Date: Fri, 22 Nov 2024 13:03:14 -0800
Subject: [PATCH] chore: release notes 0.38.0 (#10231)

---
 docs/release-notes.rst                        | 175 ++++++++++++++++++
 docs/release-notes/9966-fix-grid.rst          |   7 -
 .../add-host-port-scheme-to-helm.rst          |   9 -
 docs/release-notes/api-cli-access-token.rst   |  28 ---
 docs/release-notes/config-policies.rst        |  15 --
 docs/release-notes/helm-db-snapshot.rst       |   6 -
 docs/release-notes/log-signal.rst             |  10 -
 .../pytorch-tensorboard-plugin.rst            |  10 -
 .../rbac-new-tokenCreator-role.rst            |   7 -
 docs/release-notes/remove-custom-searcher.rst |   7 -
 .../searcher-context-removal.rst              |  72 -------
 docs/release-notes/ssh-crypto-system.rst      |   8 -
 .../unsupport-aurora-postgres-reminder.rst    |  19 --
 13 files changed, 175 insertions(+), 198 deletions(-)
 delete mode 100644 docs/release-notes/9966-fix-grid.rst
 delete mode 100644 docs/release-notes/add-host-port-scheme-to-helm.rst
 delete mode 100644 docs/release-notes/api-cli-access-token.rst
 delete mode 100644 docs/release-notes/config-policies.rst
 delete mode 100644 docs/release-notes/helm-db-snapshot.rst
 delete mode 100644 docs/release-notes/log-signal.rst
 delete mode 100644 docs/release-notes/pytorch-tensorboard-plugin.rst
 delete mode 100644 docs/release-notes/rbac-new-tokenCreator-role.rst
 delete mode 100644 docs/release-notes/remove-custom-searcher.rst
 delete mode 100644 docs/release-notes/searcher-context-removal.rst
 delete mode 100644 docs/release-notes/ssh-crypto-system.rst
 delete mode 100644 docs/release-notes/unsupport-aurora-postgres-reminder.rst

diff --git a/docs/release-notes.rst b/docs/release-notes.rst
index 98bf8843186..c86a80c14a2 100644
--- a/docs/release-notes.rst
+++ b/docs/release-notes.rst
@@ -6,6 +6,181 @@
  Release Notes
 ###############
 
+**************
+ Version 0.38
+**************
+
+Version 0.38.0
+==============
+
+**Release Date:** November 22, 2024
+
+**Breaking Changes**
+
+- ASHA: All experiments using ASHA hyperparameter search must now configure ``max_time`` and
+  ``time_metric`` in the experiment config, instead of ``max_length``. Additionally, training code
+  must report the configured ``time_metric`` in validation metrics. As a convenience, Determined
+  training loops now automatically report ``batches`` and ``epochs`` with metrics, which you can
+  use as your ``time_metric``. ASHA experiments without this modification will no longer run.
+
+- Custom Searchers: All custom searchers including DeepSpeed Autotune were deprecated in ``0.36.0``
+  and are now being removed. Users are encouraged to use a preset searcher, which can be easily
+  :ref:`configured ` for any experiment.
+
+- API: Custom Searcher (including DeepSpeed AutoTune) was deprecated in 0.36.0 and is now removed.
+  We will maintain first-class support for a variety of preset searchers, which can be easily
+  configured for any experiment. Visit :ref:`search-methods` for details.
+
+**New Features**
+
+- API/CLI: Add support for access tokens. Add the ability to create and administer access tokens
+  for users to authenticate in automated workflows. Users can define the lifespan of these tokens,
+  making it easier to securely authenticate and run processes. Users can set global defaults and
+  limits for the validity of access tokens by configuring ``default_lifespan_days`` and
+  ``max_lifespan_days`` in the master configuration. Setting ``max_lifespan_days`` to ``-1``
+  indicates an **infinite** lifespan for the access token. This feature enhances automation while
+  maintaining strong security protocols by allowing tighter control over token usage and
+  expiration. This feature requires Determined Enterprise Edition.
+
+  - CLI:
+
+    - ``det token create``: Create a new access token.
+    - ``det token login``: Sign in with an access token.
+    - ``det token edit``: Update an access token's description.
+    - ``det token list``: List all active access tokens, with options for displaying revoked
+      tokens.
+    - ``det token describe``: Show details of specific access tokens.
+    - ``det token revoke``: Revoke an access token.
+
+  - API:
+
+    - ``POST /api/v1/tokens``: Create a new access token.
+    - ``GET /api/v1/tokens``: Retrieve a list of access tokens.
+    - ``PATCH /api/v1/tokens/{token_id}``: Edit an existing access token.
+
+- API: Introduce ``keras.DeterminedCallback``, a new high-level training API for TF Keras that
+  integrates Keras training code with Determined through a single :ref:`Keras Callback
+  `.
+
+- API: Introduce ``deepspeed.Trainer``, a new high-level training API for DeepSpeedTrial that
+  allows for Python-side training loop configurations and includes support for local training.
+
+- Cluster: In the enterprise edition of Determined, add :ref:`config policies ` to
+  enable administrators to set limits on how users can define workloads (e.g., experiments,
+  notebooks, TensorBoards, shells, and commands). Administrators can define two types of
+  configurations:
+
+  - **Invariant Configs for Experiments**: Settings applied to all experiments within a specific
+    scope (global or workspace). Invariant configs for other tasks (e.g. notebooks, TensorBoards,
+    shells, and commands) is not yet supported.
+
+  - **Constraints**: Restrictions that prevent users from exceeding resource limits within a
+    scope. Constraints can be set independently for experiments and tasks.
+
+- Helm: Support configuring ``determined_master_host``, ``determined_master_port``, and
+  ``determined_master_scheme``. These control how tasks address the Determined API server and are
+  useful when installations span multiple Kubernetes clusters or there are proxies in between tasks
+  and the master. Also, ``determined_master_host`` now defaults to the service host,
+  ``..svc.cluster.local``, instead of the service IP.
+
+- Helm: Add support for capturing and restoring snapshots of the database persistent volume. Visit
+  :ref:`helm-config-reference` for more details.
+
+- New RBAC role: In the enterprise edition of Determined, add a ``TokenCreator`` RBAC role, which
+  allows users to create, view, and revoke their own :ref:`access tokens `. This
+  role can only be assigned globally.
+
+- Experiments: Add a ``name`` field to ``log_policies``. When a log policy matches, its name shows
+  as a label in the WebUI, making it easy to spot specific issues during a run. Labels appear in
+  both the run table and run detail views.
+
+  In addition, there is a new format: ``name`` is required, and ``action`` is now a plain string.
+  For more details, refer to :ref:`log_policies `.
+
+**Improvements**
+
+- Master Configuration: Add support for crypto system configuration for ssh connection.
+  ``security.key_type`` now accepts ``RSA``, ``ECDSA`` or ``ED25519``. Default key type is changed
+  from ``1024-bit RSA`` to ``ED25519``, since ``ED25519`` keys are faster and more secure than the
+  old default, and ``ED25519`` is also the default key type for ``ssh-keygen``.
+
+**Removed Features**
+
+- WebUI: "Continue Training" no longer supports configurable number of batches in the Web UI and
+  will simply resume the trial from the last checkpoint.
+
+**Known Issues**
+
+- PyTorch has `deprecated
+  `
+  their Profiler TensorBoard Plugin (``tb_plugin``), so some features may not be compatible with
+  PyTorch 2.0 and above. Our current default environment image comes with PyTorch 2.3. If users are
+  experiencing issues with this plugin, we suggest using an image with a PyTorch version earlier
+  than 2.0.
+
+**Bug Fixes**
+
+- Previously, during a grid search, if a hyperparameter contained an empty nested hyperparameter
+  (that is, just an empty map), that hyperparameter would not appear in the hparams passed to the
+  trial.
+
+**Deprecations**
+
+- Experiment Config: The ``max_length`` field of the searcher configuration section has been
+  deprecated for all experiments and searchers. Users are expected to configure the desired
+  training length directly in training code.
+
+- Experiment Config: The ``optimizations`` config has been deprecated. Please see :ref:`Training
+  APIs ` to configure supported optimizations through training code directly.
+
+- Experiment Config: The ``scheduling_unit``, ``min_checkpoint_period``, and
+  ``min_validation_period`` config fields have been deprecated. Instead, these configuration
+  options should be specified in training code.
+
+- Experiment Config: The ``entrypoint`` field no longer accepts ``model_def:TrialClass`` as trial
+  definitions. Please invoke your training script directly (``python3 train.py``).
+
+- Core API: The ``SearcherContext`` (``core.searcher``) has been deprecated. Training code no
+  longer requires ``core.searcher.operations`` to run, and progress should be reported through
+  ``core.train.report_progress``.
+
+- DeepSpeed: The ``num_micro_batches_per_slot`` and ``train_micro_batch_size_per_gpu`` attributes
+  on ``DeepSpeedContext`` have been replaced with ``get_train_micro_batch_size_per_gpu()`` and
+  ``get_num_micro_batches_per_slot()``.
+
+- Horovod: The Horovod distributed training backend has been deprecated. Users are encouraged to
+  migrate to the native distributed backend of their training framework (``torch.distributed`` or
+  ``tf.distribute``).
+
+- Trial APIs: ``TFKerasTrial`` has been deprecated. Users are encouraged to migrate to the new
+  :ref:`Keras Callback `.
+
+- Launchers: The ``--trial`` argument in Determined launchers has been deprecated. Please invoke
+  your training script directly.
+
+- ASHA: The ``stop_once`` field of the ``searcher`` config for ASHA searchers has been deprecated.
+  All ASHA searches are now early-stopping based (``stop_once: true``) instead of promotion based.
+
+- CLI: The ``--test`` and ``--local`` flags for ``det experiment create`` have been deprecated. All
+  training APIs now support local execution (``python3 train.py``). Please see ``training apis``
+  for details specific to your framework.
+
+- Web UI: Previously, trials that reported an ``epoch`` metric enabled an epoch X-axis in the Web
+  UI metrics tab. This metric name has been changed to ``epochs``, with ``epoch`` as a fallback
+  option.
+
+- Database: After Amazon Aurora V1 reaches End of Life, support for Amazon Aurora V1 in ``det
+  deploy aws`` will be removed. Future deployments will default to the ``simple-rds`` type, which
+  uses Amazon RDS for PostgreSQL. We recommend that users migrate to Amazon RDS for PostgreSQL. For
+  more information, visit the `migration instructions
+  `_.
+
+- Database: As a follow-up to the earlier notice, PostgreSQL 12 will reach End of Life on November
+  14, 2024. Instances still using PostgreSQL 12 or earlier should upgrade to PostgreSQL 13 or later
+  to maintain compatibility. The application will log a warning if it detects a connection to any
+  PostgreSQL version older than 12, and this warning will be updated to include PostgreSQL 12 once
+  it is End of Life.
+
 **************
  Version 0.37
 **************
diff --git a/docs/release-notes/9966-fix-grid.rst b/docs/release-notes/9966-fix-grid.rst
deleted file mode 100644
index f36dc1b8dc6..00000000000
--- a/docs/release-notes/9966-fix-grid.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-:orphan:
-
-**Fixes**
-
-- Previously, during a grid search, if a hyperparameter contained an empty nested hyperparameter
-  (that is, just an empty map), that hyperparameter would not appear in the hparams passed to the
-  trial.
diff --git a/docs/release-notes/add-host-port-scheme-to-helm.rst b/docs/release-notes/add-host-port-scheme-to-helm.rst
deleted file mode 100644
index d0f49a72c86..00000000000
--- a/docs/release-notes/add-host-port-scheme-to-helm.rst
+++ /dev/null
@@ -1,9 +0,0 @@
-:orphan:
-
-**New Features**
-
-- Helm: Support configuring ``determined_master_host``, ``determined_master_port``, and
-  ``determined_master_scheme``. These control how tasks address the Determined API server and are
-  useful when installations span multiple Kubernetes clusters or there are proxies in between tasks
-  and the master. Also, ``determined_master_host`` now defaults to the service host,
-  ``..svc.cluster.local``, instead of the service IP.
diff --git a/docs/release-notes/api-cli-access-token.rst b/docs/release-notes/api-cli-access-token.rst
deleted file mode 100644
index 67fb614c350..00000000000
--- a/docs/release-notes/api-cli-access-token.rst
+++ /dev/null
@@ -1,28 +0,0 @@
-:orphan:
-
-**New Features**
-
-- API/CLI: Add support for access tokens. Add the ability to create and administer access tokens
-  for users to authenticate in automated workflows. Users can define the lifespan of these tokens,
-  making it easier to securely authenticate and run processes. Users can set global defaults and
-  limits for the validity of access tokens by configuring ``default_lifespan_days`` and
-  ``max_lifespan_days`` in the master configuration. Setting ``max_lifespan_days`` to ``-1``
-  indicates an **infinite** lifespan for the access token. This feature enhances automation while
-  maintaining strong security protocols by allowing tighter control over token usage and
-  expiration. This feature requires Determined Enterprise Edition.
-
-  - CLI:
-
-    - ``det token create``: Create a new access token.
-    - ``det token login``: Sign in with an access token.
-    - ``det token edit``: Update an access token's description.
-    - ``det token list``: List all active access tokens, with options for displaying revoked
-      tokens.
-    - ``det token describe``: Show details of specific access tokens.
-    - ``det token revoke``: Revoke an access token.
-
-  - API:
-
-    - ``POST /api/v1/tokens``: Create a new access token.
-    - ``GET /api/v1/tokens``: Retrieve a list of access tokens.
-    - ``PATCH /api/v1/tokens/{token_id}``: Edit an existing access token.
diff --git a/docs/release-notes/config-policies.rst b/docs/release-notes/config-policies.rst
deleted file mode 100644
index 66e768b62d8..00000000000
--- a/docs/release-notes/config-policies.rst
+++ /dev/null
@@ -1,15 +0,0 @@
-:orphan:
-
-**New Features**
-
-- Cluster: In the enterprise edition of Determined, add :ref:`config policies ` to
-  enable administrators to set limits on how users can define workloads (e.g., experiments,
-  notebooks, TensorBoards, shells, and commands). Administrators can define two types of
-  configurations:
-
-  - **Invariant Configs for Experiments**: Settings applied to all experiments within a specific
-    scope (global or workspace). Invariant configs for other tasks (e.g. notebooks, TensorBoards,
-    shells, and commands) is not yet supported.
-
-  - **Constraints**: Restrictions that prevent users from exceeding resource limits within a
-    scope. Constraints can be set independently for experiments and tasks.
diff --git a/docs/release-notes/helm-db-snapshot.rst b/docs/release-notes/helm-db-snapshot.rst
deleted file mode 100644
index c9e276d68e1..00000000000
--- a/docs/release-notes/helm-db-snapshot.rst
+++ /dev/null
@@ -1,6 +0,0 @@
-:orphan:
-
-**New Features**
-
-- Helm: Add support for capturing and restoring snapshots of the database persistent volume. Visit
-  :ref:`helm-config-reference` for more details.
diff --git a/docs/release-notes/log-signal.rst b/docs/release-notes/log-signal.rst
deleted file mode 100644
index 743b0c6c56b..00000000000
--- a/docs/release-notes/log-signal.rst
+++ /dev/null
@@ -1,10 +0,0 @@
-:orphan:
-
-**New Features**
-
-- Experiments: Add a ``name`` field to ``log_policies``. When a log policy matches, its name shows
-  as a label in the WebUI, making it easy to spot specific issues during a run. Labels appear in
-  both the run table and run detail views.
-
-  In addition, there is a new format: ``name`` is required, and ``action`` is now a plain string.
-  For more details, refer to :ref:`log_policies `.
diff --git a/docs/release-notes/pytorch-tensorboard-plugin.rst b/docs/release-notes/pytorch-tensorboard-plugin.rst
deleted file mode 100644
index c0f06b2118c..00000000000
--- a/docs/release-notes/pytorch-tensorboard-plugin.rst
+++ /dev/null
@@ -1,10 +0,0 @@
-:orphan:
-
-**Known Issue**
-
-- PyTorch has `deprecated
-  `
-  their Profiler TensorBoard Plugin (``tb_plugin``), so some features may not be compatible with
-  PyTorch 2.0 and above. Our current default environment image comes with PyTorch 2.3. If users are
-  experiencing issues with this plugin, we suggest using an image with a PyTorch version earlier
-  than 2.0.
diff --git a/docs/release-notes/rbac-new-tokenCreator-role.rst b/docs/release-notes/rbac-new-tokenCreator-role.rst
deleted file mode 100644
index 0813232b0e9..00000000000
--- a/docs/release-notes/rbac-new-tokenCreator-role.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-:orphan:
-
-**New Features**
-
-- New RBAC role: In the enterprise edition of Determined, add a ``TokenCreator`` RBAC role, which
-  allows users to create, view, and revoke their own :ref:`access tokens `. This
-  role can only be assigned globally.
diff --git a/docs/release-notes/remove-custom-searcher.rst b/docs/release-notes/remove-custom-searcher.rst
deleted file mode 100644
index 3e6c1a642e5..00000000000
--- a/docs/release-notes/remove-custom-searcher.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-:orphan:
-
-**Breaking Changes**
-
-- API: Custom Searcher (including DeepSpeed AutoTune) was deprecated in 0.36.0 and is now removed.
-  We will maintain first-class support for a variety of preset searchers, which can be easily
-  configured for any experiment. Visit :ref:`search-methods` for details.
diff --git a/docs/release-notes/searcher-context-removal.rst b/docs/release-notes/searcher-context-removal.rst
deleted file mode 100644
index 74c81a746b2..00000000000
--- a/docs/release-notes/searcher-context-removal.rst
+++ /dev/null
@@ -1,72 +0,0 @@
-:orphan:
-
-**Breaking Changes**
-
-- ASHA: All experiments using ASHA hyperparameter search must now configure ``max_time`` and
-  ``time_metric`` in the experiment config, instead of ``max_length``. Additionally, training code
-  must report the configured ``time_metric`` in validation metrics. As a convenience, Determined
-  training loops now automatically report ``batches`` and ``epochs`` with metrics, which you can
-  use as your ``time_metric``. ASHA experiments without this modification will no longer run.
-
-- Custom Searchers: all custom searchers including DeepSpeed Autotune were deprecated in ``0.36.0``
-  and are now being removed. Users are encouraged to use a preset searcher, which can be easily
-  :ref:`configured ` for any experiment.
-
-**New Features**
-
-- API: introduce ``keras.DeterminedCallback``, a new high-level training API for TF Keras that
-  integrates Keras training code with Determined through a single :ref:`Keras Callback
-  `.
-
-- API: introduce ``deepspeed.Trainer``, a new high-level training API for DeepSpeedTrial that
-  allows for Python-side training loop configurations and includes support for local training.
-
-**Deprecations**
-
-- Experiment Config: the ``max_length`` field of the searcher configuration section has been
-  deprecated for all experiments and searchers. Users are expected to configure the desired
-  training length directly in training code.
-
-- Experiment Config: the ``optimizations`` config has been deprecated. Please see :ref:`Training
-  APIs ` to configure supported optimizations through training code directly.
-
-- Experiment Config: the ``scheduling_unit``, ``min_checkpoint_period``, and
-  ``min_validation_period`` config fields have been deprecated. Instead, these configuration
-  options should be specified in training code.
-
-- Experiment Config: the ``entrypoint`` field no longer accepts ``model_def:TrialClass`` as trial
-  definitions. Please invoke your training script directly (``python3 train.py``).
-
-- Core API: the ``SearcherContext`` (``core.searcher``) has been deprecated. Training code no
-  longer requires ``core.searcher.operations`` to run, and progress should be reported through
-  ``core.train.report_progress``.
-
-- DeepSpeed: the ``num_micro_batches_per_slot`` and ``train_micro_batch_size_per_gpu`` attributes
-  on ``DeepSpeedContext`` have been replaced with ``get_train_micro_batch_size_per_gpu()`` and
-  ``get_num_micro_batches_per_slot()``.
-
-- Horovod: the horovod distributed training backend has been deprecated. Users are encouraged to
-  migrate to the native distributed backend of their training framework (``torch.distributed`` or
-  ``tf.distribute``).
-
-- Trial APIs: ``TFKerasTrial`` has been deprecated. Users are encouraged to migrate to the new
-  :ref:`Keras Callback `.
-
-- Launchers: the ``--trial`` argument in Determined launchers has been deprecated. Please invoke
-  your training script directly.
-
-- ASHA: the ``stop_once`` field of the ``searcher`` config for ASHA searchers has been deprecated.
-  All ASHA searches are now early-stopping based (``stop_once: true``) instead of promotion based.
-
-- CLI: The ``--test`` and ``--local`` flags for ``det experiment create`` have been deprecated. All
-  training APIs now support local execution (``python3 train.py``). Please see ``training apis``
-  for details specific to your framework.
-
-- Web UI: previously, trials that reported an ``epoch`` metric enabled an epoch X-axis in the Web
-  UI metrics tab. This metric name has been changed to ``epochs``, with ``epoch`` as a fallback
-  option.
-
-**Removed Features**
-
-- WebUI: "Continue Training" no longer supports configurable number of batches in the Web UI and
-  will simply resume the trial from the last checkpoint.
diff --git a/docs/release-notes/ssh-crypto-system.rst b/docs/release-notes/ssh-crypto-system.rst
deleted file mode 100644
index acd54812832..00000000000
--- a/docs/release-notes/ssh-crypto-system.rst
+++ /dev/null
@@ -1,8 +0,0 @@
-:orphan:
-
-**Improvements**
-
-- Master Configuration: Add support for crypto system configuration for ssh connection.
-  ``security.key_type`` now accepts ``RSA``, ``ECDSA`` or ``ED25519``. Default key type is changed
-  from ``1024-bit RSA`` to ``ED25519``, since ``ED25519`` keys are faster and more secure than the
-  old default, and ``ED25519`` is also the default key type for ``ssh-keygen``.
diff --git a/docs/release-notes/unsupport-aurora-postgres-reminder.rst b/docs/release-notes/unsupport-aurora-postgres-reminder.rst
deleted file mode 100644
index b82c739d064..00000000000
--- a/docs/release-notes/unsupport-aurora-postgres-reminder.rst
+++ /dev/null
@@ -1,19 +0,0 @@
-:orphan:
-
-**Deprecations**
-
-- Cluster: A reminder that Amazon Aurora V1 will reach End of Life at the end of 2024. It is no
-  longer supported as the default persistent storage for AWS Determined deployments. We recommend
-  that users migrate to Amazon RDS for PostgreSQL. For more information, visit the `migration
-  instructions `_.
-
-- Cluster: After Amazon Aurora V1 reaches End of Life, support for Amazon Aurora V1 in ``det deploy
-  aws`` will be removed. Future deployments will default to the ``simple-rds`` type, which uses
-  Amazon RDS for PostgreSQL. Changes to the deployment code will ensure this transition to the new
-  default.
-
-- Database: As a follow-up to the earlier notice, PostgreSQL 12 will reach End of Life on November
-  14, 2024. Instances still using PostgreSQL 12 or earlier should upgrade to PostgreSQL 13 or later
-  to maintain compatibility. The application will log a warning if it detects a connection to any
-  PostgreSQL version older than 12, and this warning will be updated to include PostgreSQL 12 once
-  it is End of Life.