Merge pull request #1087 from miguelcsx:fix/typos-spelling
PiperOrigin-RevId: 681872884
OptaxDev committed Oct 3, 2024
2 parents 159b2f1 + 32443cc commit 25485d1
Showing 40 changed files with 179 additions and 176 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -15,12 +15,12 @@ Our goals are to

* Provide simple, well-tested, efficient implementations of core components.
* Improve research productivity by enabling to easily combine low-level
-ingredients into custom optimisers (or other gradient processing components).
+ingredients into custom optimizers (or other gradient processing components).
* Accelerate adoption of new ideas by making it easy for anyone to contribute.

-We favour focusing on small composable building blocks that can be effectively
+We favor focusing on small composable building blocks that can be effectively
combined into custom solutions. Others may build upon these basic components
-in more complicated abstractions. Whenever reasonable, implementations prioritise
+in more complicated abstractions. Whenever reasonable, implementations prioritize
readability and structuring code to match standard equations, over code reuse.

An initial prototype of this library was made available in JAX's experimental
2 changes: 1 addition & 1 deletion docs/api/stochastic_gradient_estimators.rst
@@ -2,7 +2,7 @@ Stochastic Gradient Estimators
==============================

.. warning::
-This module has been depreated and will be removed in optax 0.3.0.
+This module has been deprecated and will be removed in optax 0.3.0.

.. currentmodule:: optax.monte_carlo

4 changes: 2 additions & 2 deletions docs/gallery.rst
@@ -218,7 +218,7 @@ Examples that make use of the :doc:`api/contrib` module.

.. raw:: html

<div class="sphx-glr-thumbcontainer" tooltip="Example usage of reduce_on_plateau learing rate scheduler.">
<div class="sphx-glr-thumbcontainer" tooltip="Example usage of reduce_on_plateau learning rate scheduler.">

.. only:: html

@@ -229,7 +229,7 @@ Examples that make use of the :doc:`api/contrib` module.

.. raw:: html

<div class="sphx-glr-thumbnail-title">Example usage of reduce_on_plateau learing rate scheduler.</div>
<div class="sphx-glr-thumbnail-title">Example usage of reduce_on_plateau learning rate scheduler.</div>
</div>


64 changes: 32 additions & 32 deletions docs/getting_started.ipynb

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions docs/index.rst
@@ -6,19 +6,19 @@ Optax

Optax is a gradient processing and optimization library for JAX. It is designed
to facilitate research by providing building blocks that can be recombined in
-custom ways in order to optimise parametric models such as, but not limited to,
+custom ways in order to optimize parametric models such as, but not limited to,
deep neural networks.

Our goals are to

* Provide readable, well-tested, efficient implementations of core components,
* Improve researcher productivity by making it possible to combine low level
-ingredients into custom optimiser (or other gradient processing components).
+ingredients into custom optimizer (or other gradient processing components).
* Accelerate adoption of new ideas by making it easy for anyone to contribute.

-We favour focusing on small composable building blocks that can be effectively
+We favor focusing on small composable building blocks that can be effectively
combined into custom solutions. Others may build upon these basic components
-more complicated abstractions. Whenever reasonable, implementations prioritise
+more complicated abstractions. Whenever reasonable, implementations prioritize
readability and structuring code to match standard equations, over code reuse.

Installation
4 changes: 2 additions & 2 deletions examples/contrib/sam.ipynb
@@ -70,7 +70,7 @@
"id": "7-p_W8vkhnO1"
},
"source": [
"To actually use SAM then, you create your adversarial optimizer, here SGD with normalized gradients, an outer optimzer, and then wrap them with SAM.\n",
"To actually use SAM then, you create your adversarial optimizer, here SGD with normalized gradients, an outer optimizer, and then wrap them with SAM.\n",
"\n",
"The drop-in SAM optimizer described in the paper uses SGD for both optimizers."
]
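A hedged sketch (not part of this diff) of the construction described above, assuming the `optax.contrib.sam` and `optax.contrib.normalize` APIs and illustrative hyperparameter values:

```python
import optax
from optax import contrib

rho = 0.1             # adversarial step size (illustrative)
learning_rate = 0.01  # outer step size (illustrative)

# Adversarial optimizer: SGD on normalized gradients.
adv_optimizer = optax.chain(contrib.normalize(), optax.sgd(rho))
# Outer optimizer: plain SGD, matching the drop-in SAM described in the paper.
outer_optimizer = optax.sgd(learning_rate)
# Wrap both with SAM; sync_period sets how many adversarial steps per outer step.
sam_optimizer = contrib.sam(outer_optimizer, adv_optimizer, sync_period=2)
```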
@@ -648,7 +648,7 @@
"id": "NLAJFJ-SjmIK"
},
"source": [
"The behavior is identical to transparent mode, but the percieved number of gradient steps is half as many as in transparent mode (or 1/5 as many for SAM Adam)."
"The behavior is identical to transparent mode, but the perceived number of gradient steps is half as many as in transparent mode (or 1/5 as many for SAM Adam)."
]
},
{
21 changes: 12 additions & 9 deletions examples/flax_example.ipynb

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions examples/lbfgs.ipynb
@@ -8,7 +8,7 @@
"source": [
"# L-BFGS\n",
"\n",
"L-BFGS is a classical optimization method that uses past gradients and parameters informations to iteratively refine a solution to a minimization problem. In this notebook, we illustrate\n",
"L-BFGS is a classical optimization method that uses past gradients and parameters information to iteratively refine a solution to a minimization problem. In this notebook, we illustrate\n",
"1. how to use L-BFGS as a simple gradient transformation,\n",
"2. how to wrap L-BFGS in a solver, and how linesearches are incorporated,\n",
"3. how to debug the solver if needed,\n"
@@ -123,7 +123,7 @@
"source": [
"## L-BFGS as a solver\n",
"\n",
"L-BFGS is a stample in numerical optimization to solve medium scale problems. It is often the backend of generic minimization functions in software libraries like [scipy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html#scipy.optimize.minimize). A key ingredient to make it a simple optimization blackbox, is to remove the need of tuning the stepsize, a.k.a. learning rate in machine learning. In a deterministic setting (no additional varying inputs like inputs/labels), such automatic tuning of the stepsize is done by means of linesearches reviewed below."
"L-BFGS is a sample in numerical optimization to solve medium scale problems. It is often the backend of generic minimization functions in software libraries like [scipy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html#scipy.optimize.minimize). A key ingredient to make it a simple optimization blackbox, is to remove the need of tuning the stepsize, a.k.a. learning rate in machine learning. In a deterministic setting (no additional varying inputs like inputs/labels), such automatic tuning of the stepsize is done by means of linesearches reviewed below."
]
},
{
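A hedged sketch of the solver usage just described (not part of this diff), assuming `optax.lbfgs` and `optax.value_and_grad_from_state`, which lets the linesearch reuse values it already computed during the step:

```python
import jax.numpy as jnp
import optax

def fun(w):
  return jnp.sum((w - 1.0) ** 2)  # illustrative objective

opt = optax.lbfgs()  # the default configuration includes a linesearch
params = jnp.zeros(3)
state = opt.init(params)
value_and_grad = optax.value_and_grad_from_state(fun)

for _ in range(20):
  value, grad = value_and_grad(params, state=state)
  updates, state = opt.update(
      grad, state, params, value=value, grad=grad, value_fn=fun
  )
  params = optax.apply_updates(params, updates)
```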
@@ -146,7 +146,7 @@
"\n",
"where $c_1$ is some constant set to $10^{-4}$ by default. Consider for example the update direction to be $u_k = -g_k$, i.e., moving along the negative gradient direction. In that case the criterion above reduces to $f(w_k - \\eta_k g_k) \\leq f(w_k) - c_1 \\eta_k ||g_k||_2^2$. The criterion amounts then to choosing the stepsize such that it decreases the objective by an amount proportional to the squared gradient norm.\n",
"\n",
"As long as the update direction is a *descent direction*, that is, $\\langle u_k, g_k\\rangle \u003c 0$ the above criterion is guaranteed to be satisfied by some sufficiently small stepsize.\n",
"As long as the update direction is a *descent direction*, that is, $\\langle u_k, g_k\\rangle < 0$ the above criterion is guaranteed to be satisfied by some sufficiently small stepsize.\n",
"A simple linesearch technique to ensure a sufficient decrease is then to decrease a candidate stepsize by a constant factor up until the criterion is satisfied. This amounts to the backtracking linesearch implemented in [optax.scale_by_backtracking_linesearch](https://optax.readthedocs.io/en/latest/api/transformations.html#optax.scale_by_backtracking_linesearch) and briefly reviewed below.\n",
"\n",
"#### Small curvature (Strong wolfe criterion)\n",
@@ -303,7 +303,7 @@
" iter_num = otu.tree_get(state, 'count')\n",
" grad = otu.tree_get(state, 'grad')\n",
" err = otu.tree_l2_norm(grad)\n",
" return (iter_num == 0) | ((iter_num \u003c max_iter) \u0026 (err \u003e= tol))\n",
" return (iter_num == 0) | ((iter_num < max_iter) & (err >= tol))\n",
"\n",
" init_carry = (init_params, opt.init(init_params))\n",
" final_params, final_state = jax.lax.while_loop(\n",
6 changes: 3 additions & 3 deletions examples/meta_learning.ipynb
@@ -34,7 +34,7 @@
"bounded between 0 and 1, we parametrize the learning rate as a sigmoid\n",
"over the meta parameter $\\eta$.\n",
"\n",
"In the following snippts, we will solve the problem using optax. To begin with, we define a generator that samples from the hidden underlying distribution."
"In the following snippets, we will solve the problem using optax. To begin with, we define a generator that samples from the hidden underlying distribution."
]
},
{
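A sketch of that sigmoid parametrization (not part of this diff; shapes and values are illustrative):

```python
import jax
import jax.numpy as jnp
import optax

eta = jnp.array(0.0)                  # unconstrained meta-parameter
learning_rate = jax.nn.sigmoid(eta)   # squashed into (0, 1)
inner_opt = optax.sgd(learning_rate)  # inner optimizer driven by the meta-parameter
```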
@@ -63,7 +63,7 @@
},
"outputs": [],
"source": [
"def generator() -\u003e Iterator[Tuple[chex.Array, chex.Array]]:\n",
"def generator() -> Iterator[Tuple[chex.Array, chex.Array]]:\n",
" rng = jax.random.PRNGKey(0)\n",
"\n",
" while True:\n",
@@ -114,7 +114,7 @@
},
"outputs": [],
"source": [
"def f(theta: chex.Array, x: chex.Array) -\u003e chex.Array:\n",
"def f(theta: chex.Array, x: chex.Array) -> chex.Array:\n",
" return x * theta\n",
"\n",
"theta = jax.random.normal(jax.random.PRNGKey(42))"
60 changes: 30 additions & 30 deletions examples/nanolm.ipynb

Large diffs are not rendered by default.

38 changes: 19 additions & 19 deletions examples/ogda_example.ipynb
@@ -51,14 +51,14 @@
"y_{k+1} = y_k + \\eta_k \\nabla_y f(x_k, y_k),\n",
"$$\n",
"\n",
"where $\\eta_k$ is a step size. However, it's well-documented that GDA can fail to converge in this setting. This is an important issue because gradient-based min-max optimisation is increasingly prevalent in machine learning (e.g., GANs, constrained RL). *Optimistic* GDA (OGDA) addresses this shortcoming by introducing a form of memory-based negative momentum: \n",
"where $\\eta_k$ is a step size. However, it's well-documented that GDA can fail to converge in this setting. This is an important issue because gradient-based min-max optimization is increasingly prevalent in machine learning (e.g., GANs, constrained RL). *Optimistic* GDA (OGDA) addresses this shortcoming by introducing a form of memory-based negative momentum: \n",
"\n",
"$$\n",
"x_{k+1} = x_k - 2 \\eta_k \\nabla_x f(x_k, y_k) + \\eta_k \\nabla_x f(x_{k-1}, y_{k-1}) \\\\\n",
"y_{k+1} = y_k + 2 \\eta_k \\nabla_y f(x_k, y_k) - \\eta_k \\nabla_y f(x_{k-1}, y_{k-1})).\n",
"$$\n",
"\n",
"Thus, to implement OGD (or OGA), the optimiser needs to keep track of the gradient from the previous step. OGDA has been formally shown to converge to the optimum $(x_k, y_k) \\to (x^\\star, y^\\star)$ in this setting. The generalised form of the OGDA update rule is given by\n",
"Thus, to implement OGD (or OGA), the optimizer needs to keep track of the gradient from the previous step. OGDA has been formally shown to converge to the optimum $(x_k, y_k) \\to (x^\\star, y^\\star)$ in this setting. The generalized form of the OGDA update rule is given by\n",
"\n",
"$$\n",
"x_{k+1} = x_k - (\\alpha + \\beta) \\eta_k \\nabla_x f(x_k, y_k) + \\beta \\eta_k \\nabla_x f(x_{k-1}, y_{k-1}) \\\\\n",
@@ -84,7 +84,7 @@
"\\mu^{k+1} = \\mu^k + 2\\tau_\\mu^k \\nabla_\\mu \\mathcal L(\\pi^k_k, \\mu^k)+ \\tau_\\mu^k \\nabla_\\mu \\mathcal L(\\pi^{k-1}, \\mu^{k-1})\n",
"$$\n",
"\n",
"where $\\eta_k$ is a step size. However, it's well-documented that GDA can fail to converge in this setting. This is an important issue because gradient-based min-max optimisation is increasingly prevalent in machine learning (e.g., GANs, constrained RL). *Optimistic* GDA (OGDA) addresses this shortcoming by introducing a form of memory-based negative momentum:\n",
"where $\\eta_k$ is a step size. However, it's well-documented that GDA can fail to converge in this setting. This is an important issue because gradient-based min-max optimization is increasingly prevalent in machine learning (e.g., GANs, constrained RL). *Optimistic* GDA (OGDA) addresses this shortcoming by introducing a form of memory-based negative momentum:\n",
"\n",
"$$\n",
"x_{k+1} = x_k - 2 \\eta_k \\nabla_x f(x_k, y_k) + \\eta_k \\nabla_x f(x_{k-1}, y_{k-1}) \\\\\n",
@@ -120,7 +120,7 @@
"id": "G-4JMKlgs-Lr"
},
"source": [
"Define an optimisation loop."
"Define an optimization loop."
]
},
{
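Optax ships this update rule directly; a hedged sketch of the generalized form above (not part of this diff), assuming `optax.optimistic_gradient_descent` exposes `alpha` and `beta`:

```python
import optax

# alpha = beta = 1 recovers the 2 * grad_now - grad_prev update shown earlier.
ogd = optax.optimistic_gradient_descent(learning_rate=0.1, alpha=1.0, beta=1.0)
```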
@@ -131,20 +131,20 @@
},
"outputs": [],
"source": [
"def optimise(params: optax.Params, x_optimiser: optax.GradientTransformation, y_optimiser: optax.GradientTransformation, n_steps: int = 1000, display_every: int = 100) -> optax.Params:\n",
" \"\"\"An optimisation loop minimising x and maximising y.\"\"\"\n",
"def optimize(params: optax.Params, x_optimizer: optax.GradientTransformation, y_optimizer: optax.GradientTransformation, n_steps: int = 1000, display_every: int = 100) -> optax.Params:\n",
" \"\"\"An optimization loop minimizing x and maximizing y.\"\"\"\n",
"\n",
" x_opt_state = x_optimiser.init(params[\"x\"])\n",
" y_opt_state = y_optimiser.init(params[\"y\"])\n",
" x_opt_state = x_optimizer.init(params[\"x\"])\n",
" y_opt_state = y_optimizer.init(params[\"y\"])\n",
" param_hist = [params]\n",
" f_hist = []\n",
"\n",
" @jax.jit\n",
" def step(params, x_opt_state, y_opt_state):\n",
" f_value, grads = jax.value_and_grad(f)(params)\n",
" x_update, x_opt_state = x_optimiser.update(grads[\"x\"], x_opt_state, params[\"x\"])\n",
" # note that we\"re maximising y so we feed in the negative gradient to the OGD update\n",
" y_update, y_opt_state = y_optimiser.update(-grads[\"y\"], y_opt_state, params[\"y\"])\n",
" x_update, x_opt_state = x_optimizer.update(grads[\"x\"], x_opt_state, params[\"x\"])\n",
" # note that we\"re maximizing y so we feed in the negative gradient to the OGD update\n",
" y_update, y_opt_state = y_optimizer.update(-grads[\"y\"], y_opt_state, params[\"y\"])\n",
" updates = {\"x\": x_update, \"y\": y_update}\n",
" params = optax.apply_updates(params, updates)\n",
" return params, x_opt_state, y_opt_state, f_value\n",
@@ -165,7 +165,7 @@
"id": "gDtB7gJdtPZj"
},
"source": [
"Initialise $x$ and $y$, as well as optimisers for each. "
"Initialize $x$ and $y$, as well as optimizers for each. "
]
},
{
@@ -182,12 +182,12 @@
"}\n",
"\n",
"# GDA\n",
"x_gd_optimiser = optax.sgd(learning_rate=0.1)\n",
"y_ga_optimiser = optax.sgd(learning_rate=0.1)\n",
"x_gd_optimizer = optax.sgd(learning_rate=0.1)\n",
"y_ga_optimizer = optax.sgd(learning_rate=0.1)\n",
"\n",
"# OGDA\n",
"x_ogd_optimiser = optax.optimistic_gradient_descent(learning_rate=0.1)\n",
"y_oga_optimiser = optax.optimistic_gradient_descent(learning_rate=0.1)"
"x_ogd_optimizer = optax.optimistic_gradient_descent(learning_rate=0.1)\n",
"y_oga_optimizer = optax.optimistic_gradient_descent(learning_rate=0.1)"
]
},
{
@@ -207,7 +207,7 @@
},
"outputs": [],
"source": [
"gda_hist, gda_f_hist = optimise(initial_params, x_gd_optimiser, y_ga_optimiser)"
"gda_hist, gda_f_hist = optimize(initial_params, x_gd_optimizer, y_ga_optimizer)"
]
},
{
@@ -218,7 +218,7 @@
},
"outputs": [],
"source": [
"ogda_hist, ogda_f_hist = optimise(initial_params, x_ogd_optimiser, y_oga_optimiser)"
"ogda_hist, ogda_f_hist = optimize(initial_params, x_ogd_optimizer, y_oga_optimizer)"
]
},
{
@@ -227,7 +227,7 @@
"id": "S504XrZrtXNe"
},
"source": [
"Visualise the optimisation trajectories. The optimal solution is $(0, 0)$. "
"Visualize the optimization trajectories. The optimal solution is $(0, 0)$. "
]
},
{
4 changes: 2 additions & 2 deletions optax/_src/alias.py
@@ -834,7 +834,7 @@ def amsgrad(
eps_root: float = 0.0,
mu_dtype: Optional[Any] = None,
) -> base.GradientTransformation:
"""The AMSGrad optimiser.
"""The AMSGrad optimizer.
The original Adam can fail to converge to the optimal solution in some cases.
AMSGrad guarantees convergence by using a long-term memory of past gradients.
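A minimal usage sketch (not part of this diff) of the standard optax init/update cycle with this optimizer; the objective and values are illustrative:

```python
import jax
import jax.numpy as jnp
import optax

opt = optax.amsgrad(learning_rate=1e-3)
params = jnp.zeros(3)
state = opt.init(params)
grads = jax.grad(lambda p: jnp.sum((p - 1.0) ** 2))(params)
updates, state = opt.update(grads, state, params)
params = optax.apply_updates(params, updates)
```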
@@ -1194,7 +1194,7 @@ def sign_sgd(
References:
-Bernstein et al., `signSGD: Compressed Optimisation for Non-Convex Problems
+Bernstein et al., `signSGD: Compressed optimization for Non-Convex Problems
<https://arxiv.org/abs/1802.04434>`_, 2018
Balles et al.`The Geometry of Sign Gradient Descent
Expand Down
6 changes: 3 additions & 3 deletions optax/_src/base.py
@@ -85,7 +85,7 @@ class TransformUpdateFn(Protocol):
The `update` step takes a tree of candidate parameter `updates` (e.g. their
gradient with respect to some loss), an arbitrary structured `state`, and the
-current `params` of the model being optimised. The `params` argument is
+current `params` of the model being optimized. The `params` argument is
optional, it must however be provided when using transformations that require
access to the current values of the parameters.
@@ -171,9 +171,9 @@ class GradientTransformation(NamedTuple):
passed to the next call to the gradient transformation.
Since gradient transformations are pure, idempotent functions, the only way
-to change the behaviour of a gradient transformation between steps, is to
+to change the behavior of a gradient transformation between steps, is to
change the values in the optimizer state. To see an example of mutating the
-optimizer state in order to control the behaviour of an optax gradient
+optimizer state in order to control the behavior of an optax gradient
transformation see the `meta-learning example <https://optax.readthedocs.io/en/latest/_collections/examples/meta_learning.html>`_ in the optax documentation.
Attributes:
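A minimal sketch of the init/update protocol documented above (not part of this diff; names and values are illustrative):

```python
import jax.numpy as jnp
import optax

tx = optax.scale(-0.1)  # any GradientTransformation follows the same protocol
params = {'w': jnp.ones(2)}
state = tx.init(params)                # arbitrary structured state
grads = {'w': jnp.array([0.5, -0.5])}  # candidate updates, e.g. gradients
updates, new_state = tx.update(grads, state, params)  # params optional for scale
new_params = optax.apply_updates(params, updates)
```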
2 changes: 1 addition & 1 deletion optax/_src/base_test.py
@@ -65,7 +65,7 @@ def test_set_to_zero_is_stateless(self):
class ExtraArgsTest(chex.TestCase):

def test_isinstance(self):
"""Locks in behaviour for comparing transformations."""
"""Locks in behavior for comparing transformations."""

def init_fn(params):
del params
