[WIP] Artifact Evaluation README #315

Open
wants to merge 51 commits into
base: master
7b67c6a
Check in Artifact Evaluation README>
luomai Sep 3, 2020
e694820
checkpoint
luomai Sep 3, 2020
357dc69
add VM setup instructions
lgarithm Sep 3, 2020
9b64a1f
updte
luomai Sep 4, 2020
ff1effc
monitoring checkpoint.
luomai Sep 4, 2020
68867b3
added adaptive comm strategy readme section
kfertakis Sep 4, 2020
553db8f
Check in scalability README.
luomai Sep 4, 2020
a5f0f33
Merge branch 'osdi20-artifact' of github.com:lsds/KungFu into osdi20-…
luomai Sep 4, 2020
0022e05
minor additional information in adaptive comm strategy part of readme
kfertakis Sep 4, 2020
b1871db
Merge branch 'osdi20-artifact' of github.com:lsds/KungFu into osdi20-…
kfertakis Sep 4, 2020
ac32805
minor
luomai Sep 4, 2020
c5617c6
Merge branch 'osdi20-artifact' of github.com:lsds/KungFu into osdi20-…
luomai Sep 4, 2020
6ca4222
Address Andrie's comments.
luomai Sep 4, 2020
fdbf119
Starting Scaling Performance.
luomai Sep 4, 2020
76a13f4
add first rough version of Adaptive resource provisioning
marwage Sep 4, 2020
6dadfb4
merge Adaptive resource provisioning
marwage Sep 4, 2020
abab533
add instructions for Fig.4
lgarithm Sep 4, 2020
5aa4ed5
add text to Adaptive resource provisioning
marwage Sep 4, 2020
59201b6
merge Adaptive resource provisioning
marwage Sep 4, 2020
83a507b
change git clone from ssh to https
marwage Sep 4, 2020
a6bdc33
Figure 7 (wip)
lgarithm Sep 4, 2020
ad0eea5
Pass over README.md (#316)
prp Sep 4, 2020
3c3de56
rename file and fix typo
marwage Sep 4, 2020
f035fc2
add plot script in the readme
marwage Sep 4, 2020
94f1326
fix typo
marwage Sep 4, 2020
e767118
fix another typo
marwage Sep 4, 2020
377c1f4
3. Scaling Performance (Figure 7)
lgarithm Sep 4, 2020
f392557
Merge branch 'master' into osdi20-artifact
lgarithm Sep 4, 2020
7d7da8c
add time estimation
lgarithm Sep 4, 2020
5045e6c
Address Peter's comments.
luomai Sep 4, 2020
373b1a4
update
luomai Sep 4, 2020
a5fb3ca
placeholder for Figure 10.
luomai Sep 4, 2020
8f5fd34
numbering
luomai Sep 4, 2020
418baf4
more on 3.3. Scaling Performance (Figure 7)
lgarithm Sep 4, 2020
1622b3f
add nccl schedulers.
luomai Sep 5, 2020
5521443
update doc
lgarithm Sep 5, 2020
e768475
add more comments fog Fig 4
lgarithm Sep 5, 2020
f716088
fix
lgarithm Sep 5, 2020
6b52e66
update
luomai Sep 5, 2020
9b5f222
Merge branch 'osdi20-artifact' of github.com:lsds/KungFu into osdi20-…
luomai Sep 5, 2020
856485a
Pass over S3.3 and S3.4
prp Sep 5, 2020
fc932ed
small fix
lgarithm Sep 5, 2020
c092d61
mention kungfu-remote-install
lgarithm Sep 5, 2020
3ac1389
-strategy MULTI_BINARY_TREE_STAR
lgarithm Sep 5, 2020
9e6b99b
update 3.2
lgarithm Sep 5, 2020
19b7b2d
update scalability result.
luomai Sep 5, 2020
1bf5908
fix
luomai Sep 5, 2020
09b85f6
add an example
lgarithm Sep 5, 2020
415570a
update
luomai Sep 5, 2020
ac21ac7
submission checkpoint
luomai Sep 5, 2020
f858789
add more description
lgarithm Sep 5, 2020
add nccl schedulers.
luomai committed Sep 5, 2020
commit 1622b3fa065bc529ddaea94be752952d24487ada
55 changes: 53 additions & 2 deletions osdi20/README.md
@@ -223,7 +223,7 @@ need to replace `--model=ResNet50` with `--model=MobileNetV2`.
The same change can be applied to clusters with any number (i.e.,
8, 16, 32, ...) of VMs.

### 3.3. Scaling Performance (Figure 7)
### 3.3. Dynamic scaling (Figure 7)

In this experiment, we demonstrate KungFu's ability to change the number of workers dynamically.
In addition to installing KungFu, you need to install the example config server.
@@ -300,7 +300,58 @@ kungfu-run-scaling-experiments -u $USER -nic eth0 -hostfile hosts.txt -resize-sc

### 3.4. NCCL scheduler (Figure 10)

[...]
The NCCL scheduler is designed to fully exploit machines that have NV-Link.
We no longer have access to the DGX-1 machines and cannot find a similar multi-GPU VM that has NV-Link.
To run this experiment, we provide a local machine with 4 Titan-X GPUs, of which only 2 (i.e., a subset of the GPUs)
are interconnected using NV-Link. Please contact the authors to gain SSH access to this machine.

The machine is shared by multiple users. After you SSH into this machine, you need to
clone the artifact and create a Python virtual environment as follows:

```bash
# Clone the artifact
git clone --branch osdi20-artifact https://github.com/lsds/KungFu.git
cd KungFu

# Create a virtual environment
virtualenv -p python3 env
source env/bin/activate

# Install TensorFlow
pip3 install -U numpy==1.16 tensorflow-gpu==1.13.2

# Install KungFu with NCCL (i.e., KUNGFU_ENABLE_NCCL=1)
KUNGFU_ENABLE_NCCL=1 pip3 install -U .
```
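
Because the machine is shared, it can help to confirm that the virtual environment is active before running `pip3 install`, so that packages do not land in a system-wide location. The following is a minimal sketch of such a check; the helper name `in_virtualenv` is hypothetical and not part of KungFu.

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter belongs to a virtual environment."""
    # Legacy virtualenv releases (before 20) set sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and virtualenv >= 20 make sys.prefix differ from sys.base_prefix.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

Run it with `python3` after `source env/bin/activate`; if it reports `False`, the environment was probably not activated in the current shell.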

To train the ResNet-50 model using a synthetic ImageNet dataset, you can use the following command.
The training uses the NCCL scheduler to exploit NV-Link.

```bash
kungfu-run -allow-nvlink -np 4 python3 benchmarks/system/benchmark_kungfu.py --kf-optimizer=sync-sgd-nccl --model=ResNet50 --batch-size=64
```

The `-allow-nvlink` option allows `kungfu-run` to enable NV-Link.
The `sync-sgd-nccl` optimizer allows the benchmark program to delegate
all-reduce requests to the NCCL scheduler.
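
If you want to repeat the run with other settings (for example, `--model=MobileNetV2`), the command line can be assembled programmatically. The sketch below only uses the flags shown above; the helper name `kungfu_benchmark_command` is hypothetical, and the assumption that leaving out `-allow-nvlink` is how NV-Link is disabled has not been verified against `kungfu-run`.

```python
import shlex


def kungfu_benchmark_command(np=4, model="ResNet50", batch_size=64,
                             optimizer="sync-sgd-nccl", allow_nvlink=True):
    """Build the argv list for the benchmark command shown in this section."""
    argv = ["kungfu-run"]
    if allow_nvlink:
        # Assumption: omitting this flag leaves NV-Link disabled.
        argv.append("-allow-nvlink")
    argv += [
        "-np", str(np),
        "python3", "benchmarks/system/benchmark_kungfu.py",
        "--kf-optimizer={}".format(optimizer),
        "--model={}".format(model),
        "--batch-size={}".format(batch_size),
    ]
    return argv


def to_shell(argv):
    """Render an argv list as a single shell-safe command string."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    print(to_shell(kungfu_benchmark_command()))
    print(to_shell(kungfu_benchmark_command(model="MobileNetV2")))
```

The returned list can be passed directly to `subprocess.run` instead of being printed.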

You should expect output similar to the following:

```text
[127.0.0.1.10000::stdout] Iter #4: 180.5 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #5: 182.3 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #6: 181.6 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #7: 179.1 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #8: 181.1 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #9: 180.3 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Img/sec per /gpu:0: 181.3 +-2.4
[127.0.0.1.10000::stdout] RESULT: 181.336683 +-2.371972 {"framework":"kungfu","np":4,"strategy":"BINARY_TREE_STAR","bs":64,"model":"ResNet50","xla":false,"kf-opt":"sync-sgd-nccl","fuse":false,"nvlink":"true"}
[I] all 4/4 local peers finished, took 55.146302161s
```
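
To compare several runs, it may be convenient to extract the throughput from this output automatically. The following is a minimal sketch that parses the line formats shown above; the function names are hypothetical, and treating the aggregate throughput as `np` times the per-GPU figure is an assumption based on the per-GPU wording of the log, not something the benchmark reports itself.

```python
import json
import re

# Patterns for the two line formats that appear in the sample output.
ITER_RE = re.compile(r"Iter #(\d+): ([0-9.]+) img/sec per (\S+)")
RESULT_RE = re.compile(r"RESULT: ([0-9.]+) \+-([0-9.]+) (\{.*\})")

SAMPLE_LOG = """\
[127.0.0.1.10000::stdout] Iter #4: 180.5 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #5: 182.3 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #6: 181.6 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #7: 179.1 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #8: 181.1 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #9: 180.3 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Img/sec per /gpu:0: 181.3 +-2.4
[127.0.0.1.10000::stdout] RESULT: 181.336683 +-2.371972 {"framework":"kungfu","np":4,"strategy":"BINARY_TREE_STAR","bs":64,"model":"ResNet50","xla":false,"kf-opt":"sync-sgd-nccl","fuse":false,"nvlink":"true"}
"""


def parse_benchmark_log(text):
    """Return (per-iteration throughputs, parsed RESULT dict or None)."""
    iterations = []
    result = None
    for line in text.splitlines():
        match = ITER_RE.search(line)
        if match:
            iterations.append(float(match.group(2)))
            continue
        match = RESULT_RE.search(line)
        if match:
            result = {
                "mean": float(match.group(1)),
                "spread": float(match.group(2)),
                "config": json.loads(match.group(3)),
            }
    return iterations, result


def aggregate_throughput(result):
    """Assumption: total img/sec is the per-GPU mean times the number of peers."""
    return result["mean"] * result["config"]["np"]


if __name__ == "__main__":
    iterations, result = parse_benchmark_log(SAMPLE_LOG)
    print("iterations parsed:", len(iterations))
    print("per-GPU mean:", result["mean"])
    print("estimated aggregate:", aggregate_throughput(result))
```

To use it on a real run, redirect the output of `kungfu-run` to a file and pass the file contents to `parse_benchmark_log`.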

To reproduce the advantage of the NCCL scheduler over KungFu's asynchronous communication layer
(as shown in Figure 10),
evaluators need access to a DGX-1 machine and must repeat the above steps on it.

## 4. Adaptation Policies