
Commit

chore: onboarding doc clarification (#7109)
rb-determined-ai authored Jun 14, 2023
1 parent 6f98aac commit afff53e
Showing 1 changed file with 13 additions and 12 deletions.
25 changes: 13 additions & 12 deletions onboarding/mlsys/1-gradients.md
@@ -295,7 +295,7 @@ print(train_loop(m=0, lr=0.0001, dataset=dataset, iterations=10000))
```

Now that we have noise in our dataset, we find that some values are "noisier"
-than others. Notice that at `m=1` our model is off by `1`, but at `x=4` our
+than others. Notice that at `x=1` our model is off by `1`, but at `x=4` our
model is off by `2.25`. These noisier points can cause training to become
unstable at higher learning rates.
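
For concreteness, here is a minimal sketch that reproduces those error values; it assumes the model from earlier in the doc is `ypred = m * x` and uses the four noisy `(x, ytrue)` points tabulated in Q10 below.

```python
# A minimal sketch, not part of the original file: assumes ypred = m * x and
# the four noisy (x, ytrue) points from the Q10 table below.
dataset = [(1, 0), (2, 1), (3, 1), (4, 6.25)]

m = 1  # assumed slope of the underlying noiseless data
for x, ytrue in dataset:
    ypred = m * x
    print(f"x={x}: model is off by {abs(ypred - ytrue)}")
# x=1 is off by 1, while x=4 is off by 2.25 -- the noisier point.
```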

@@ -326,16 +326,16 @@ the gradients from each point.
This is called "batching" your inputs. The number of inputs that you include
in each training iteration is called your "batch size".
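
As a rough sketch of what this looks like in code, assuming a squared-error loss on `ypred = m * x`, so that `grad_point = 2 * x * (m * x - ytrue)` (substitute your own `grad_point` if the one defined earlier in the doc differs):

```python
# Sketch only: this grad_point assumes a squared-error loss on ypred = m * x.
def grad_point(m, x, ytrue):
    return 2 * x * (m * x - ytrue)

def grad_batch(m, batch):
    # The batch gradient is just the average of the per-point gradients.
    return sum(grad_point(m, x, ytrue) for x, ytrue in batch) / len(batch)

# batch_size=2: one update uses the averaged gradient from both points.
print(grad_batch(m=0, batch=[(1, 0), (2, 1)]))
```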

-### Q11
+### Q10

Let `m=2` and fill the following table of per-data-point gradients:

-| `(x, ytrue)` | `grad_point` |
-| ------------- | ------------ |
-| (1, 0) | |
-| (2, 1) | |
-| (3, 1) | |
-| (4, 6.25) | |
+| `(x, ytrue)` | `grad_point(N=1)` |
+| ------------- | ----------------- |
+| (1, 0) | |
+| (2, 1) | |
+| (3, 1) | |
+| (4, 6.25) | |

Then, with `m=2`, fill the following table of `batch_size=2` batch gradients:

@@ -344,10 +344,11 @@ Then, with `m=2`, fill the following table of `batch_size=2` batch gradients:
| (1, 0) | (2, 1) | |
| (3, 1) | (4, 6.25) | |

-Do you see that by averaging gradients from multiple points, we end up with
-smoother overall gradients?
+What is the scale between the largest and smallest gradient values in the first
+table? What about the second table? Do you see that by averaging gradients
+from multiple points, we end up with smoother overall gradients?

-### Q12
+### Q11

Write a new training loop that takes batches instead of individual points.

@@ -409,7 +410,7 @@ worker calculates gradients for its batch. Then all workers communicate their
gradients to all other workers, after which each worker has the average of all
worker gradients.
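
As a minimal illustration of just that communication step (this averaging is what an allreduce gives you; the full training loop is the exercise below), assuming each worker's batch gradient is a plain float:

```python
# Sketch only: simulate the "everyone ends up with the average" step in-process.
def allreduce_average(worker_grads):
    # Each worker contributes its batch gradient and receives the mean of them all.
    avg = sum(worker_grads) / len(worker_grads)
    return [avg for _ in worker_grads]

print(allreduce_average([1.0, 2.0, 3.0, 6.0]))  # every worker now holds 3.0
```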

-### Q13
+### Q12

Write a training loop that simulates data-parallel distributed training. No
need for actual distributed mechanics; just do it inside a single process.
