Use DeepHyper to find the maximum achievable FLOPS for various PyTorch layers.
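As a rough illustration of the objective, the sketch below estimates achieved FLOPS for a fully connected layer from its shape and a measured runtime. It assumes the common convention of counting one multiply-add as two floating-point operations and ignores the bias term. The function names are hypothetical and are not taken from this repository.

```python
def linear_flop_count(batch_size: int, in_features: int, out_features: int) -> int:
    """FLOPs for one forward pass of a dense layer.

    Assumption: each multiply-add counts as 2 FLOPs; the bias is ignored.
    """
    return 2 * batch_size * in_features * out_features


def achieved_flops(flop_count: int, elapsed_sec: float) -> float:
    """FLOPs per second; returns 0.0 for a non-positive runtime."""
    if elapsed_sec <= 0.0:
        return 0.0
    return flop_count / elapsed_sec


if __name__ == "__main__":
    # Hypothetical layer shape and runtime, for illustration only.
    flops = linear_flop_count(batch_size=128, in_features=1024, out_features=1024)
    print(f"{achieved_flops(flops, elapsed_sec=0.001) / 1e9:.1f} GFLOPS")
```
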
- Benchmark DeepHyper+Balsam throughput on various platforms. Taylor reports low utilization on Theta.
- Compute and/or measure memory usage on GPU and KNL
- Extend the pairplot from DeepHyper's HPS analytics notebook to compute equal-width objective bins (thirds of the objective range, not equal-frequency terciles)
- Try different `kappa` values with the Random Forest surrogate model
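The distinction between equal-width bins and equal-frequency terciles in the pairplot item above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: both helper names are hypothetical, and the sample objectives are made up to show a skewed distribution.

```python
from typing import List, Sequence


def equal_width_bins(values: Sequence[float], n_bins: int = 3) -> List[int]:
    """Assign each value to one of n_bins intervals of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum would fall just past the last interval, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values: Sequence[float], n_bins: int = 3) -> List[int]:
    """Assign each value to a bin by rank, so bins hold (nearly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_bins // len(values)
    return labels


# Hypothetical, strongly skewed objective values.
objectives = [0.0, 1.0, 2.0, 3.0, 100.0]
print(equal_width_bins(objectives))      # -> [0, 0, 0, 0, 2]
print(equal_frequency_bins(objectives))  # -> [0, 0, 1, 1, 2]
```

With a skewed objective like this one, equal-width bins leave most evaluations in the lowest third, whereas terciles force roughly equal counts per bin.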
Run failures in `conv2d/`, `conv1d/`, `conv3d/`; surprisingly, `conv3d/` has fewer `objective=0.0` evaluations than the other two.
`conv2d/` on Traverse: several instances of the following error:
Traceback (most recent call last):
File "/home/kfelker/deephyper_pytorch_layers/conv2d/conv2d_run.py", line 36, in run
outputs = layer(inputs)
File "/home/kfelker/.conda/envs/frnn/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/kfelker/.conda/envs/frnn/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 343, in forward
return self.conv2d_forward(input, self.weight)
File "/home/kfelker/.conda/envs/frnn/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 340, in conv2d_forward
self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input.
torch version: 1.2.0 torch file: /home/kfelker/.conda/envs/frnn/lib/python3.6/site-packages/torch/__init__.py
received exception: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input.
None
runtime= 8.049392938613892
DH-OUTPUT: 0.0
- Compare frequency to errors due to memory exhaustion:
CUDA out of memory. Tried to allocate 22.79 GiB (GPU 0; 31.75 GiB total capacity; 28.66 GiB already allocated; 1.93 GiB free; 21.30 MiB cached; 0 bytes inactive)
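One way to make that comparison is to tally the two failure modes from the evaluation logs. The sketch below is an assumption-laden example: it supposes that every failure is reported on a `received exception:` line, as in the cuDNN log above (this is not confirmed for the out-of-memory case), and the function name and labels are hypothetical.

```python
from collections import Counter
from typing import Iterable

# Substrings taken from the two error messages quoted above.
FAILURE_PATTERNS = {
    "cudnn_not_supported": "CUDNN_STATUS_NOT_SUPPORTED",
    "cuda_out_of_memory": "CUDA out of memory",
}


def tally_failures(lines: Iterable[str]) -> Counter:
    """Count log lines that report each known failure mode."""
    counts: Counter = Counter()
    for line in lines:
        # Assumption: each failed evaluation prints one "received exception:"
        # summary line, so the traceback's copy of the message is skipped.
        if not line.startswith("received exception:"):
            continue
        for label, pattern in FAILURE_PATTERNS.items():
            if pattern in line:
                counts[label] += 1
    return counts


if __name__ == "__main__":
    sample_log = [
        "RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED.",
        "received exception: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED.",
        "received exception: CUDA out of memory. Tried to allocate 22.79 GiB",
    ]
    print(dict(tally_failures(sample_log)))
```
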
For example, out of 101 evaluations of `conv2d/` on Traverse, 15 were run failures:
(base) ➜ conv2d git:(feature/cuda) ✗ sort -t , -k 7,7 -n results.csv
batch_size,height,in_channels,kernel_size,out_channels,width,objective,elapsed_sec
128,512,3,3,64,512,0.0,19.287363052368164
146,985,64,12,58,329,0.0,1273.2861964702606
276,422,64,5,57,330,0.0,1067.2843222618103
283,655,64,14,64,976,0.0,1077.281128168106
290,844,54,11,7,176,0.0,447.2913315296173
331,193,52,12,6,777,0.0,181.2829658985138
377,559,57,10,23,224,0.0,281.2877972126007
383,458,64,7,7,215,0.0,241.2843005657196
403,184,55,7,8,646,0.0,347.2854845523834
405,887,64,10,58,135,0.0,997.284960269928
413,683,61,12,6,133,0.0,211.2883608341217
476,139,64,13,40,1021,0.0,1269.2893743515015
488,129,63,7,63,791,0.0,815.2774884700775
506,135,62,13,55,995,0.0,977.2954721450806
510,184,61,11,56,1021,0.0,307.28731894493103
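The same count can be obtained programmatically. The sketch below treats `objective == 0.0` as a failed evaluation, which matches the listing above; the function name is hypothetical, and the third sample row is made up to show a successful evaluation.

```python
import csv
import io
from typing import TextIO, Tuple


def count_failures(csv_file: TextIO) -> Tuple[int, int]:
    """Return (failed, total), treating objective == 0.0 as a failed evaluation."""
    failed = total = 0
    for row in csv.DictReader(csv_file):
        total += 1
        if float(row["objective"]) == 0.0:
            failed += 1
    return failed, total


# Two rows copied from the listing above plus one hypothetical successful row.
sample = io.StringIO(
    "batch_size,height,in_channels,kernel_size,out_channels,width,objective,elapsed_sec\n"
    "128,512,3,3,64,512,0.0,19.287363052368164\n"
    "146,985,64,12,58,329,0.0,1273.2861964702606\n"
    "64,256,3,3,32,256,1.5e12,12.0\n"  # hypothetical, for illustration only
)
failed, total = count_failures(sample)
print(f"{failed}/{total} evaluations failed")  # -> 2/3 evaluations failed
```

To run it on real output, pass an open file instead, e.g. `count_failures(open("results.csv", newline=""))`.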
`cifar10/` errors:
Traceback (most recent call last):
File "/home/kfelker/deephyper_pytorch_layers/cifar10/cifar10_run.py", line 106, in run
outputs = net(inputs)
File "/home/kfelker/.conda/envs/frnn/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/kfelker/deephyper_pytorch_layers/cifar10/cifar10_run.py", line 84, in forward
x = x.view(-1, self.view_size)
RuntimeError: shape '[-1, 936]' is invalid for input of size 20800
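The `view` call fails because 20800 is not a multiple of 936, so no first dimension can satisfy `[-1, 936]`. A possible remedy, sketched below, is to derive the flattened size from the layer hyperparameters instead of fixing it independently. The output-size formula is the one given in the `torch.nn.Conv2d` documentation; the helper names are hypothetical, and the example network is the familiar 32x32 CIFAR-10 tutorial layout, which is only assumed to resemble the model in `cifar10_run.py`.

```python
def conv_output_size(size: int, kernel_size: int, stride: int = 1,
                     padding: int = 0, dilation: int = 1) -> int:
    """Output length along one spatial dimension of a conv or pooling layer."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def flattened_features(channels: int, height: int, width: int) -> int:
    """Number of features per sample after flattening a (C, H, W) activation."""
    return channels * height * width


# The failing call: 20800 elements cannot be split into rows of 936.
print(20800 % 936)  # -> 208

# Assumed example: conv(k=5) -> pool(2) -> conv(k=5) -> pool(2) on a 32x32 image.
side = 32
for kernel_size, stride in [(5, 1), (2, 2), (5, 1), (2, 2)]:
    side = conv_output_size(side, kernel_size, stride=stride)
print(flattened_features(16, side, side))  # -> 400
```
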
- `BatchNorm1d`
- `BatchNorm2d`
- `BatchNorm3d`
- `Softmax`
- `Tanh`
- `Sigmoid`
- `ReLU`
- Pooling?
- Embedding?
- `RNN`
- `LSTM`
- `GRU`
- `Transformer`, `TransformerEncoder`, `TransformerDecoder`
See https://pytorch.org/docs/stable/nn.html
All table entries were measured
Machine | Balsam job mode | Nodes | DH numworkers | Time limit (min) | PyTorch layer or model | Evaluations |
---|---|---|---|---|---|---|
ALCF Theta | MPI | 8 | 8 | 60 | Linear | 1189 |
ALCF Theta | MPI | 8 | 8 | 60 | Conv2D | 186 |
Princeton Traverse | Serial | 1 | 5 | 60 | Linear | 185 |
Princeton Traverse | MPI | 2 | 2 | 60 | Cifar10 | 237 (mostly errors) |
Princeton Traverse | MPI | 2 | 2 | 60 | Conv3D | 226 |
Princeton TigerGPU | Serial | 1 | 5 | 120 | Linear | 245 |
Princeton TigerGPU | Serial | 3 | 10 | 120 | Linear | |