Merge pull request #1087 from miguelcsx:fix/typos-spelling
PiperOrigin-RevId: 681872884
OptaxDev committed Oct 3, 2024
2 parents 159b2f1 + 32443cc commit 25485d1
Showing 40 changed files with 179 additions and 176 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -15,12 +15,12 @@ Our goals are to

* Provide simple, well-tested, efficient implementations of core components.
* Improve research productivity by enabling to easily combine low-level
-ingredients into custom optimisers (or other gradient processing components).
+ingredients into custom optimizers (or other gradient processing components).
* Accelerate adoption of new ideas by making it easy for anyone to contribute.

-We favour focusing on small composable building blocks that can be effectively
+We favor focusing on small composable building blocks that can be effectively
combined into custom solutions. Others may build upon these basic components
-in more complicated abstractions. Whenever reasonable, implementations prioritise
+in more complicated abstractions. Whenever reasonable, implementations prioritize
readability and structuring code to match standard equations, over code reuse.

An initial prototype of this library was made available in JAX's experimental
2 changes: 1 addition & 1 deletion docs/api/stochastic_gradient_estimators.rst
@@ -2,7 +2,7 @@ Stochastic Gradient Estimators
==============================

.. warning::
-This module has been depreated and will be removed in optax 0.3.0.
+This module has been deprecated and will be removed in optax 0.3.0.

.. currentmodule:: optax.monte_carlo

4 changes: 2 additions & 2 deletions docs/gallery.rst
@@ -218,7 +218,7 @@ Examples that make use of the :doc:`api/contrib` module.

.. raw:: html

<div class="sphx-glr-thumbcontainer" tooltip="Example usage of reduce_on_plateau learing rate scheduler.">
<div class="sphx-glr-thumbcontainer" tooltip="Example usage of reduce_on_plateau learning rate scheduler.">

.. only:: html

@@ -229,7 +229,7 @@ Examples that make use of the :doc:`api/contrib` module.

.. raw:: html

<div class="sphx-glr-thumbnail-title">Example usage of reduce_on_plateau learing rate scheduler.</div>
<div class="sphx-glr-thumbnail-title">Example usage of reduce_on_plateau learning rate scheduler.</div>
</div>


64 changes: 32 additions & 32 deletions docs/getting_started.ipynb

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions docs/index.rst
@@ -6,19 +6,19 @@ Optax

Optax is a gradient processing and optimization library for JAX. It is designed
to facilitate research by providing building blocks that can be recombined in
-custom ways in order to optimise parametric models such as, but not limited to,
+custom ways in order to optimize parametric models such as, but not limited to,
deep neural networks.

Our goals are to

* Provide readable, well-tested, efficient implementations of core components,
* Improve researcher productivity by making it possible to combine low level
-ingredients into custom optimiser (or other gradient processing components).
+ingredients into custom optimizer (or other gradient processing components).
* Accelerate adoption of new ideas by making it easy for anyone to contribute.

-We favour focusing on small composable building blocks that can be effectively
+We favor focusing on small composable building blocks that can be effectively
combined into custom solutions. Others may build upon these basic components
-more complicated abstractions. Whenever reasonable, implementations prioritise
+more complicated abstractions. Whenever reasonable, implementations prioritize
readability and structuring code to match standard equations, over code reuse.

Installation
4 changes: 2 additions & 2 deletions examples/contrib/sam.ipynb
@@ -70,7 +70,7 @@
"id": "7-p_W8vkhnO1"
},
"source": [
"To actually use SAM then, you create your adversarial optimizer, here SGD with normalized gradients, an outer optimzer, and then wrap them with SAM.\n",
"To actually use SAM then, you create your adversarial optimizer, here SGD with normalized gradients, an outer optimizer, and then wrap them with SAM.\n",
"\n",
"The drop-in SAM optimizer described in the paper uses SGD for both optimizers."
]
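A hedged sketch (not part of this diff) of the construction described above, assuming the `optax.contrib.sam` and `optax.contrib.normalize` APIs and illustrative hyperparameter values:

```python
import optax
from optax import contrib

rho = 0.1             # adversarial step size (illustrative)
learning_rate = 0.01  # outer step size (illustrative)

# Adversarial optimizer: SGD on normalized gradients.
adv_optimizer = optax.chain(contrib.normalize(), optax.sgd(rho))
# Outer optimizer: plain SGD, matching the drop-in SAM described in the paper.
outer_optimizer = optax.sgd(learning_rate)
# Wrap both with SAM; sync_period sets how many adversarial steps per outer step.
sam_optimizer = contrib.sam(outer_optimizer, adv_optimizer, sync_period=2)
```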
@@ -648,7 +648,7 @@
"id": "NLAJFJ-SjmIK"
},
"source": [
"The behavior is identical to transparent mode, but the percieved number of gradient steps is half as many as in transparent mode (or 1/5 as many for SAM Adam)."
"The behavior is identical to transparent mode, but the perceived number of gradient steps is half as many as in transparent mode (or 1/5 as many for SAM Adam)."
]
},
{
21 changes: 12 additions & 9 deletions examples/flax_example.ipynb

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions examples/lbfgs.ipynb
@@ -8,7 +8,7 @@
"source": [
"# L-BFGS\n",
"\n",
"L-BFGS is a classical optimization method that uses past gradients and parameters informations to iteratively refine a solution to a minimization problem. In this notebook, we illustrate\n",
"L-BFGS is a classical optimization method that uses past gradients and parameters information to iteratively refine a solution to a minimization problem. In this notebook, we illustrate\n",
"1. how to use L-BFGS as a simple gradient transformation,\n",
"2. how to wrap L-BFGS in a solver, and how linesearches are incorporated,\n",
"3. how to debug the solver if needed,\n"
@@ -123,7 +123,7 @@
"source": [
"## L-BFGS as a solver\n",
"\n",
"L-BFGS is a stample in numerical optimization to solve medium scale problems. It is often the backend of generic minimization functions in software libraries like [scipy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html#scipy.optimize.minimize). A key ingredient to make it a simple optimization blackbox, is to remove the need of tuning the stepsize, a.k.a. learning rate in machine learning. In a deterministic setting (no additional varying inputs like inputs/labels), such automatic tuning of the stepsize is done by means of linesearches reviewed below."
"L-BFGS is a sample in numerical optimization to solve medium scale problems. It is often the backend of generic minimization functions in software libraries like [scipy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html#scipy.optimize.minimize). A key ingredient to make it a simple optimization blackbox, is to remove the need of tuning the stepsize, a.k.a. learning rate in machine learning. In a deterministic setting (no additional varying inputs like inputs/labels), such automatic tuning of the stepsize is done by means of linesearches reviewed below."
]
},
{
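A hedged sketch of the solver usage just described (not part of this diff), assuming `optax.lbfgs` and `optax.value_and_grad_from_state`, which lets the linesearch reuse values it already computed during the step:

```python
import jax.numpy as jnp
import optax

def fun(w):
  return jnp.sum((w - 1.0) ** 2)  # illustrative objective

opt = optax.lbfgs()  # the default configuration includes a linesearch
params = jnp.zeros(3)
state = opt.init(params)
value_and_grad = optax.value_and_grad_from_state(fun)

for _ in range(20):
  value, grad = value_and_grad(params, state=state)
  updates, state = opt.update(
      grad, state, params, value=value, grad=grad, value_fn=fun
  )
  params = optax.apply_updates(params, updates)
```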
@@ -146,7 +146,7 @@
"\n",
"where $c_1$ is some constant set to $10^{-4}$ by default. Consider for example the update direction to be $u_k = -g_k$, i.e., moving along the negative gradient direction. In that case the criterion above reduces to $f(w_k - \\eta_k g_k) \\leq f(w_k) - c_1 \\eta_k ||g_k||_2^2$. The criterion amounts then to choosing the stepsize such that it decreases the objective by an amount proportional to the squared gradient norm.\n",
"\n",
"As long as the update direction is a *descent direction*, that is, $\\langle u_k, g_k\\rangle \u003c 0$ the above criterion is guaranteed to be satisfied by some sufficiently small stepsize.\n",
"As long as the update direction is a *descent direction*, that is, $\\langle u_k, g_k\\rangle < 0$ the above criterion is guaranteed to be satisfied by some sufficiently small stepsize.\n",
"A simple linesearch technique to ensure a sufficient decrease is then to decrease a candidate stepsize by a constant factor up until the criterion is satisfied. This amounts to the backtracking linesearch implemented in [optax.scale_by_backtracking_linesearch](https://optax.readthedocs.io/en/latest/api/transformations.html#optax.scale_by_backtracking_linesearch) and briefly reviewed below.\n",
"\n",
"#### Small curvature (Strong wolfe criterion)\n",
@@ -303,7 +303,7 @@
" iter_num = otu.tree_get(state, 'count')\n",
" grad = otu.tree_get(state, 'grad')\n",
" err = otu.tree_l2_norm(grad)\n",
" return (iter_num == 0) | ((iter_num \u003c max_iter) \u0026 (err \u003e= tol))\n",
" return (iter_num == 0) | ((iter_num < max_iter) & (err >= tol))\n",
"\n",
" init_carry = (init_params, opt.init(init_params))\n",
" final_params, final_state = jax.lax.while_loop(\n",
6 changes: 3 additions & 3 deletions examples/meta_learning.ipynb
@@ -34,7 +34,7 @@
"bounded between 0 and 1, we parametrize the learning rate as a sigmoid\n",
"over the meta parameter $\\eta$.\n",
"\n",
"In the following snippts, we will solve the problem using optax. To begin with, we define a generator that samples from the hidden underlying distribution."
"In the following snippets, we will solve the problem using optax. To begin with, we define a generator that samples from the hidden underlying distribution."
]
},
{
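A sketch of that sigmoid parametrization (not part of this diff; shapes and values are illustrative):

```python
import jax
import jax.numpy as jnp
import optax

eta = jnp.array(0.0)                  # unconstrained meta-parameter
learning_rate = jax.nn.sigmoid(eta)   # squashed into (0, 1)
inner_opt = optax.sgd(learning_rate)  # inner optimizer driven by the meta-parameter
```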
@@ -63,7 +63,7 @@
},
"outputs": [],
"source": [
"def generator() -\u003e Iterator[Tuple[chex.Array, chex.Array]]:\n",
"def generator() -> Iterator[Tuple[chex.Array, chex.Array]]:\n",
" rng = jax.random.PRNGKey(0)\n",
"\n",
" while True:\n",
@@ -114,7 +114,7 @@
},
"outputs": [],
"source": [
"def f(theta: chex.Array, x: chex.Array) -\u003e chex.Array:\n",
"def f(theta: chex.Array, x: chex.Array) -> chex.Array:\n",
" return x * theta\n",
"\n",
"theta = jax.random.normal(jax.random.PRNGKey(42))"
60 changes: 30 additions & 30 deletions examples/nanolm.ipynb

Large diffs are not rendered by default.

38 changes: 19 additions & 19 deletions examples/ogda_example.ipynb
@@ -51,14 +51,14 @@
"y_{k+1} = y_k + \\eta_k \\nabla_y f(x_k, y_k),\n",
"$$\n",
"\n",
"where $\\eta_k$ is a step size. However, it's well-documented that GDA can fail to converge in this setting. This is an important issue because gradient-based min-max optimisation is increasingly prevalent in machine learning (e.g., GANs, constrained RL). *Optimistic* GDA (OGDA) addresses this shortcoming by introducing a form of memory-based negative momentum: \n",
"where $\\eta_k$ is a step size. However, it's well-documented that GDA can fail to converge in this setting. This is an important issue because gradient-based min-max optimization is increasingly prevalent in machine learning (e.g., GANs, constrained RL). *Optimistic* GDA (OGDA) addresses this shortcoming by introducing a form of memory-based negative momentum: \n",
"\n",
"$$\n",
"x_{k+1} = x_k - 2 \\eta_k \\nabla_x f(x_k, y_k) + \\eta_k \\nabla_x f(x_{k-1}, y_{k-1}) \\\\\n",
"y_{k+1} = y_k + 2 \\eta_k \\nabla_y f(x_k, y_k) - \\eta_k \\nabla_y f(x_{k-1}, y_{k-1})).\n",
"$$\n",
"\n",
"Thus, to implement OGD (or OGA), the optimiser needs to keep track of the gradient from the previous step. OGDA has been formally shown to converge to the optimum $(x_k, y_k) \\to (x^\\star, y^\\star)$ in this setting. The generalised form of the OGDA update rule is given by\n",
"Thus, to implement OGD (or OGA), the optimizer needs to keep track of the gradient from the previous step. OGDA has been formally shown to converge to the optimum $(x_k, y_k) \\to (x^\\star, y^\\star)$ in this setting. The generalized form of the OGDA update rule is given by\n",
"\n",
"$$\n",
"x_{k+1} = x_k - (\\alpha + \\beta) \\eta_k \\nabla_x f(x_k, y_k) + \\beta \\eta_k \\nabla_x f(x_{k-1}, y_{k-1}) \\\\\n",
@@ -84,7 +84,7 @@
"\\mu^{k+1} = \\mu^k + 2\\tau_\\mu^k \\nabla_\\mu \\mathcal L(\\pi^k_k, \\mu^k)+ \\tau_\\mu^k \\nabla_\\mu \\mathcal L(\\pi^{k-1}, \\mu^{k-1})\n",
"$$\n",
"\n",
"where $\\eta_k$ is a step size. However, it's well-documented that GDA can fail to converge in this setting. This is an important issue because gradient-based min-max optimisation is increasingly prevalent in machine learning (e.g., GANs, constrained RL). *Optimistic* GDA (OGDA) addresses this shortcoming by introducing a form of memory-based negative momentum:\n",
"where $\\eta_k$ is a step size. However, it's well-documented that GDA can fail to converge in this setting. This is an important issue because gradient-based min-max optimization is increasingly prevalent in machine learning (e.g., GANs, constrained RL). *Optimistic* GDA (OGDA) addresses this shortcoming by introducing a form of memory-based negative momentum:\n",
"\n",
"$$\n",
"x_{k+1} = x_k - 2 \\eta_k \\nabla_x f(x_k, y_k) + \\eta_k \\nabla_x f(x_{k-1}, y_{k-1}) \\\\\n",
@@ -120,7 +120,7 @@
"id": "G-4JMKlgs-Lr"
},
"source": [
"Define an optimisation loop."
"Define an optimization loop."
]
},
{
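Optax ships this update rule directly; a hedged sketch of the generalized form above (not part of this diff), assuming `optax.optimistic_gradient_descent` exposes `alpha` and `beta`:

```python
import optax

# alpha = beta = 1 recovers the 2 * grad_now - grad_prev update shown earlier.
ogd = optax.optimistic_gradient_descent(learning_rate=0.1, alpha=1.0, beta=1.0)
```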
@@ -131,20 +131,20 @@
},
"outputs": [],
"source": [
"def optimise(params: optax.Params, x_optimiser: optax.GradientTransformation, y_optimiser: optax.GradientTransformation, n_steps: int = 1000, display_every: int = 100) -> optax.Params:\n",
" \"\"\"An optimisation loop minimising x and maximising y.\"\"\"\n",
"def optimize(params: optax.Params, x_optimizer: optax.GradientTransformation, y_optimizer: optax.GradientTransformation, n_steps: int = 1000, display_every: int = 100) -> optax.Params:\n",
" \"\"\"An optimization loop minimizing x and maximizing y.\"\"\"\n",
"\n",
" x_opt_state = x_optimiser.init(params[\"x\"])\n",
" y_opt_state = y_optimiser.init(params[\"y\"])\n",
" x_opt_state = x_optimizer.init(params[\"x\"])\n",
" y_opt_state = y_optimizer.init(params[\"y\"])\n",
" param_hist = [params]\n",
" f_hist = []\n",
"\n",
" @jax.jit\n",
" def step(params, x_opt_state, y_opt_state):\n",
" f_value, grads = jax.value_and_grad(f)(params)\n",
" x_update, x_opt_state = x_optimiser.update(grads[\"x\"], x_opt_state, params[\"x\"])\n",
" # note that we\"re maximising y so we feed in the negative gradient to the OGD update\n",
" y_update, y_opt_state = y_optimiser.update(-grads[\"y\"], y_opt_state, params[\"y\"])\n",
" x_update, x_opt_state = x_optimizer.update(grads[\"x\"], x_opt_state, params[\"x\"])\n",
" # note that we\"re maximizing y so we feed in the negative gradient to the OGD update\n",
" y_update, y_opt_state = y_optimizer.update(-grads[\"y\"], y_opt_state, params[\"y\"])\n",
" updates = {\"x\": x_update, \"y\": y_update}\n",
" params = optax.apply_updates(params, updates)\n",
" return params, x_opt_state, y_opt_state, f_value\n",
@@ -165,7 +165,7 @@
"id": "gDtB7gJdtPZj"
},
"source": [
"Initialise $x$ and $y$, as well as optimisers for each. "
"Initialize $x$ and $y$, as well as optimizers for each. "
]
},
{
@@ -182,12 +182,12 @@
"}\n",
"\n",
"# GDA\n",
"x_gd_optimiser = optax.sgd(learning_rate=0.1)\n",
"y_ga_optimiser = optax.sgd(learning_rate=0.1)\n",
"x_gd_optimizer = optax.sgd(learning_rate=0.1)\n",
"y_ga_optimizer = optax.sgd(learning_rate=0.1)\n",
"\n",
"# OGDA\n",
"x_ogd_optimiser = optax.optimistic_gradient_descent(learning_rate=0.1)\n",
"y_oga_optimiser = optax.optimistic_gradient_descent(learning_rate=0.1)"
"x_ogd_optimizer = optax.optimistic_gradient_descent(learning_rate=0.1)\n",
"y_oga_optimizer = optax.optimistic_gradient_descent(learning_rate=0.1)"
]
},
{
@@ -207,7 +207,7 @@
},
"outputs": [],
"source": [
"gda_hist, gda_f_hist = optimise(initial_params, x_gd_optimiser, y_ga_optimiser)"
"gda_hist, gda_f_hist = optimize(initial_params, x_gd_optimizer, y_ga_optimizer)"
]
},
{
@@ -218,7 +218,7 @@
},
"outputs": [],
"source": [
"ogda_hist, ogda_f_hist = optimise(initial_params, x_ogd_optimiser, y_oga_optimiser)"
"ogda_hist, ogda_f_hist = optimize(initial_params, x_ogd_optimizer, y_oga_optimizer)"
]
},
{
@@ -227,7 +227,7 @@
"id": "S504XrZrtXNe"
},
"source": [
"Visualise the optimisation trajectories. The optimal solution is $(0, 0)$. "
"Visualize the optimization trajectories. The optimal solution is $(0, 0)$. "
]
},
{
4 changes: 2 additions & 2 deletions optax/_src/alias.py
@@ -834,7 +834,7 @@ def amsgrad(
eps_root: float = 0.0,
mu_dtype: Optional[Any] = None,
) -> base.GradientTransformation:
"""The AMSGrad optimiser.
"""The AMSGrad optimizer.
The original Adam can fail to converge to the optimal solution in some cases.
AMSGrad guarantees convergence by using a long-term memory of past gradients.
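A minimal usage sketch (not part of this diff) of the standard optax init/update cycle with this optimizer; the objective and values are illustrative:

```python
import jax
import jax.numpy as jnp
import optax

opt = optax.amsgrad(learning_rate=1e-3)
params = jnp.zeros(3)
state = opt.init(params)
grads = jax.grad(lambda p: jnp.sum((p - 1.0) ** 2))(params)
updates, state = opt.update(grads, state, params)
params = optax.apply_updates(params, updates)
```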
@@ -1194,7 +1194,7 @@ def sign_sgd(
References:
-Bernstein et al., `signSGD: Compressed Optimisation for Non-Convex Problems
+Bernstein et al., `signSGD: Compressed optimization for Non-Convex Problems
<https://arxiv.org/abs/1802.04434>`_, 2018
Balles et al.`The Geometry of Sign Gradient Descent
Expand Down
6 changes: 3 additions & 3 deletions optax/_src/base.py
@@ -85,7 +85,7 @@ class TransformUpdateFn(Protocol):
The `update` step takes a tree of candidate parameter `updates` (e.g. their
gradient with respect to some loss), an arbitrary structured `state`, and the
-current `params` of the model being optimised. The `params` argument is
+current `params` of the model being optimized. The `params` argument is
optional, it must however be provided when using transformations that require
access to the current values of the parameters.
@@ -171,9 +171,9 @@ class GradientTransformation(NamedTuple):
passed to the next call to the gradient transformation.
Since gradient transformations are pure, idempotent functions, the only way
-to change the behaviour of a gradient transformation between steps, is to
+to change the behavior of a gradient transformation between steps, is to
change the values in the optimizer state. To see an example of mutating the
-optimizer state in order to control the behaviour of an optax gradient
+optimizer state in order to control the behavior of an optax gradient
transformation see the `meta-learning example <https://optax.readthedocs.io/en/latest/_collections/examples/meta_learning.html>`_ in the optax documentation.
Attributes:
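A minimal sketch of the init/update protocol documented above (not part of this diff; names and values are illustrative):

```python
import jax.numpy as jnp
import optax

tx = optax.scale(-0.1)  # any GradientTransformation follows the same protocol
params = {'w': jnp.ones(2)}
state = tx.init(params)                # arbitrary structured state
grads = {'w': jnp.array([0.5, -0.5])}  # candidate updates, e.g. gradients
updates, new_state = tx.update(grads, state, params)  # params optional for scale
new_params = optax.apply_updates(params, updates)
```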
2 changes: 1 addition & 1 deletion optax/_src/base_test.py
@@ -65,7 +65,7 @@ def test_set_to_zero_is_stateless(self):
class ExtraArgsTest(chex.TestCase):

def test_isinstance(self):
"""Locks in behaviour for comparing transformations."""
"""Locks in behavior for comparing transformations."""

def init_fn(params):
del params
