Mostly repeating private email and in-person communication on this topic for reference notes and posterity.
FRNN performance on the V100s of the two IBM AC922 systems, OLCF Summit and Princeton's Traverse cluster, is about 3x slower than on the P100s of Princeton's TigerGPU cluster. See the table below, which measures the performance of `d3d_0D` training on both machines as a function of batch size (as suggested by @jnkh). I have also run these tests with 1, 2, and 8 GPUs and with several other datasets.
| Machine (GPU Model) | N_node | N_GPU | Examples/sec | Sec/batch | Batch size |
|---------------------|--------|-------|--------------|-----------|------------|
| Traverse (V100)     | 1      | 4     | 1.35e3       | 0.75      | 1024       |
| Traverse (V100)     | 1      | 4     | 2.53e3       | 0.80      | 2048       |
| Traverse (V100)     | 1      | 4     | 5.20e3       | 0.80      | 4096       |
| TigerGPU (P100)     | 1      | 4     | 4.30e3       | 0.24      | 1024       |
| TigerGPU (P100)     | 1      | 4     | 7.70e3       | 0.26      | 2048       |
| TigerGPU (P100)     | 1      | 4     | 1.38e4       | 0.30      | 4096       |
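As a quick cross-check of the table (my assumption here is simply that Examples/sec ≈ batch size / Sec/batch; the snippet below is only that arithmetic, not part of FRNN):

```python
# Sanity check: examples/sec should be roughly batch_size / sec_per_batch,
# and the V100 slowdown is the ratio of per-batch times at equal batch size.
rows = {
    # batch_size: (sec_per_batch on Traverse V100, sec_per_batch on TigerGPU P100)
    1024: (0.75, 0.24),
    2048: (0.80, 0.26),
    4096: (0.80, 0.30),
}

for batch_size, (t_v100, t_p100) in rows.items():
    print(f"batch={batch_size}: V100 {batch_size / t_v100:.2e} ex/s, "
          f"P100 {batch_size / t_p100:.2e} ex/s, "
          f"slowdown ~{t_v100 / t_p100:.1f}x")
```

The per-batch times reproduce the reported throughputs and give a slowdown of roughly 2.7x to 3.1x across batch sizes.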
At first, I suspected some issue with my Conda / MPI environment on the POWER9 architecture. However, @ge-dong and I compared figures and confirmed that we are both independently observing this behavior. In fact, the original modules on Traverse were even slower (by about 20%).
@ASvyatkovskiy identified the primary issue: the TensorFlow backend for `tf.keras` or external Keras does not run the cuDNN autotuner, unlike vanilla TensorFlow architecture definitions. See my notes about the autotuner in #51. The default implementations of our layers might be slower on V100 than on P100.
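For concreteness, here is a minimal sketch of what the cuDNN-backed drop-in swap looks like in external Keras 2.x; the layer width, sequence length, and feature count are placeholders, not FRNN's actual `d3d_0D` configuration:

```python
# Placeholder sizes; the real FRNN model is defined elsewhere in the repo.
from keras.models import Sequential
from keras.layers import LSTM, CuDNNLSTM, Dense

def build_model(use_cudnn=False, timesteps=128, n_features=14):
    model = Sequential()
    if use_cudnn:
        # Fused cuDNN kernel: requires a GPU, the stock gate activations,
        # and no recurrent dropout or masking.
        model.add(CuDNNLSTM(200, return_sequences=True,
                            input_shape=(timesteps, n_features)))
    else:
        # Generic Keras implementation; this is the slow path we use today.
        model.add(LSTM(200, return_sequences=True,
                       input_shape=(timesteps, n_features)))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam")
    return model
```

Note that `CuDNNLSTM` fixes the gate activations to cuDNN's `tanh`/`sigmoid` and does not support recurrent dropout, so it is only a drop-in replacement if we are not relying on those options.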
He opened issues about this when he first ran on Summit over 1.5 years ago:
tensorflow/tensorflow#18913, keras-team/keras#9825. Related: keras-team/keras#9321
He also proposed the following optimizations, especially for V100s:
Also, I am systematically benchmarking the Keras `LSTM` layer definition vs. `CuDNNLSTM`, which seems to be at least an order of magnitude faster; a rough timing sketch is below.
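A minimal version of that timing comparison, assuming external Keras 2.x with the TensorFlow backend; the shapes, layer width, and synthetic data are placeholders rather than the FRNN training loop:

```python
# Rough timing of keras.layers.LSTM vs. keras.layers.CuDNNLSTM on random data.
# Sizes are arbitrary placeholders, not FRNN's real configuration.
import time
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, CuDNNLSTM

timesteps, n_features, n_units, batch_size = 128, 14, 200, 1024
x = np.random.rand(8 * batch_size, timesteps, n_features).astype("float32")
y = np.random.rand(8 * batch_size, n_units).astype("float32")

for layer_cls in (LSTM, CuDNNLSTM):
    model = Sequential([layer_cls(n_units, input_shape=(timesteps, n_features))])
    model.compile(loss="mse", optimizer="adam")
    model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)  # warm-up epoch
    start = time.time()
    model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
    elapsed = time.time() - start
    print(f"{layer_cls.__name__}: {x.shape[0] / elapsed:.2e} examples/sec")
```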
IBM AC922 "Traverse" architecture details: