
Commit

chore: onboarding doc clarification (#7109)
rb-determined-ai authored Jun 14, 2023
1 parent 6f98aac commit afff53e
Showing 1 changed file with 13 additions and 12 deletions.
25 changes: 13 additions & 12 deletions onboarding/mlsys/1-gradients.md
@@ -295,7 +295,7 @@ print(train_loop(m=0, lr=0.0001, dataset=dataset, iterations=10000))
```

Now that we have noise in our dataset, we find that some values are "noisier"
-than others. Notice that at `m=1` our model is off by `1`, but at `x=4` our
+than others. Notice that at `x=1` our model is off by `1`, but at `x=4` our
model is off by `2.25`. These noisier points can cause training to become
unstable at higher learning rates.
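
For concreteness, here is a minimal sketch that reproduces those error values; it assumes the model from earlier in the doc is `ypred = m * x` and uses the four noisy `(x, ytrue)` points tabulated in Q10 below.

```python
# A minimal sketch, not part of the original file: assumes ypred = m * x and
# the four noisy (x, ytrue) points from the Q10 table below.
dataset = [(1, 0), (2, 1), (3, 1), (4, 6.25)]

m = 1  # assumed slope of the underlying noiseless data
for x, ytrue in dataset:
    ypred = m * x
    print(f"x={x}: model is off by {abs(ypred - ytrue)}")
# x=1 is off by 1, while x=4 is off by 2.25 -- the noisier point.
```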

@@ -326,16 +326,16 @@ the gradients from each point.
This is called "batching" your inputs. The number of inputs that you include
in each training iteration is called your "batch size".
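
As a rough sketch of what this looks like in code, assuming a squared-error loss on `ypred = m * x`, so that `grad_point = 2 * x * (m * x - ytrue)` (substitute your own `grad_point` if the one defined earlier in the doc differs):

```python
# Sketch only: this grad_point assumes a squared-error loss on ypred = m * x.
def grad_point(m, x, ytrue):
    return 2 * x * (m * x - ytrue)

def grad_batch(m, batch):
    # The batch gradient is just the average of the per-point gradients.
    return sum(grad_point(m, x, ytrue) for x, ytrue in batch) / len(batch)

# batch_size=2: one update uses the averaged gradient from both points.
print(grad_batch(m=0, batch=[(1, 0), (2, 1)]))
```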

-### Q11
+### Q10

Let `m=2` and fill the following table of per-data-point gradients:

-| `(x, ytrue)` | `grad_point` |
-| ------------- | ------------ |
-| (1, 0) | |
-| (2, 1) | |
-| (3, 1) | |
-| (4, 6.25) | |
+| `(x, ytrue)` | `grad_point(N=1)` |
+| ------------- | ----------------- |
+| (1, 0) | |
+| (2, 1) | |
+| (3, 1) | |
+| (4, 6.25) | |

Then, with `m=2`, fill the following table of `batch_size=2` batch gradients:

@@ -344,10 +344,11 @@ Then, with `m=2`, fill the following table of `batch_size=2` batch gradients:
| (1, 0) | (2, 1) | |
| (3, 1) | (4, 6.25) | |

-Do you see that by averaging gradients from multiple points, we end up with
-smoother overall gradients?
+What is the scale between the largest and smallest gradient values in the first
+table? What about the second table? Do you see that by averaging gradients
+from multiple points, we end up with smoother overall gradients?

-### Q12
+### Q11

Write a new training loop that takes batches instead of individual points.

@@ -409,7 +410,7 @@ worker calculates gradients for its batch. Then all workers communicate their
gradients to all other workers, after which each worker has the average of all
worker gradients.
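
As a minimal illustration of just that communication step (this averaging is what an allreduce gives you; the full training loop is the exercise below), assuming each worker's batch gradient is a plain float:

```python
# Sketch only: simulate the "everyone ends up with the average" step in-process.
def allreduce_average(worker_grads):
    # Each worker contributes its batch gradient and receives the mean of them all.
    avg = sum(worker_grads) / len(worker_grads)
    return [avg for _ in worker_grads]

print(allreduce_average([1.0, 2.0, 3.0, 6.0]))  # every worker now holds 3.0
```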

-### Q13
+### Q12

Write a training loop that simulates data-parallel distributed training. No
need for actual distributed mechanics; just do it inside a single process.
