There are serial, Intel MKL dgemm(), OpenMP, MPI, hybrid(MPI+OpenMP), and hybrid(MPI+OpenACC) versions.
The MPI version is based on Cannon's algorithm.
The Intel compiler and the Intel MKL library are needed.
The input matrices are filled with pseudorandom numbers generated by the Intel MKL Mersenne Twister (MT19937) generator.
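For reference, generating such input with the MKL VSL MT19937 generator looks roughly like the minimal sketch below; the actual create_input source may differ (seed value, output file name, and layout here are assumptions).

program create_input_sketch
  use mkl_vsl_type
  use mkl_vsl
  implicit none
  integer, parameter :: imax = 4096              ! matrix size, see param.f
  type(vsl_stream_state) :: stream
  real(8), allocatable :: a(:)
  integer :: ierr
  allocate(a(imax*imax))                         ! one imax x imax matrix, column-major
  ! MT19937 stream with a fixed seed so runs are reproducible (seed value is an assumption)
  ierr = vslnewstream(stream, VSL_BRNG_MT19937, 5489)
  ! fill with uniform random numbers in [0,1)
  ierr = vdrnguniform(VSL_RNG_METHOD_UNIFORM_STD, stream, imax*imax, a, 0.0d0, 1.0d0)
  ierr = vsldeletestream(stream)
  open(10, file='a.dat', form='unformatted')     ! output file name is an assumption
  write(10) a                                    ! the real create_input presumably writes all input matrices
  close(10)
end program create_input_sketch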
-
binary names:
- serial: seri
- OpenMP: omp
- Intel MKL dgemm(): dgemm (a call sketch follows this list)
- MPI: can
- hybrid(MPI+OpenMP): can_hyb
- hybrid(MPI+OpenACC): can_acc
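The dgemm binary presumably wraps a single call to MKL's dgemm; a minimal sketch of such a call follows (array names, placeholder data, and the trace printout are assumptions, not the actual source).

program dgemm_sketch
  implicit none
  integer, parameter :: imax = 4096
  integer :: i
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  allocate(a(imax,imax), b(imax,imax), c(imax,imax))
  a = 1.0d0; b = 1.0d0; c = 0.0d0     ! placeholder data; the real code reads the generated input
  ! C := 1.0*A*B + 0.0*C, no transposes, leading dimension imax
  call dgemm('N', 'N', imax, imax, imax, 1.0d0, a, imax, b, imax, 0.0d0, c, imax)
  print *, 'trace:', sum([(c(i,i), i=1,imax)])
end program dgemm_sketch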
-
matrix size: imax x imax (imax is set in param.f)
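A hypothetical param.f for the 4096x4096 runs below (the real file's contents may differ):

      integer imax
      parameter (imax = 4096)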
-
Some notes for the MPI and hybrid versions (see the layout sketch after this list):
- imax/sqrt(np) must be an integer (it is the local block size).
- sqrt(np) must be an integer (the ranks form a sqrt(np) x sqrt(np) process grid).
- The Intel compiler and Intel MPI are required.
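The layout behind those constraints, as a minimal Cannon-style sketch (illustrative only: variable names are assumptions, and the pre-skew bookkeeping of the real can source is reduced to a comment).

program cannon_sketch
  use mpi
  implicit none
  integer, parameter :: imax = 4096
  integer :: np, rank, q, nb, k, ierr
  integer :: comm2d, src_a, dst_a, src_b, dst_b
  integer :: dims(2), coords(2)
  logical :: periods(2)
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)

  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world, np, ierr)

  q  = nint(sqrt(dble(np)))     ! sqrt(np) must be an integer: q x q process grid
  nb = imax / q                 ! imax/sqrt(np) must be an integer: local block size
  if (q*q /= np .or. nb*q /= imax) call mpi_abort(mpi_comm_world, 1, ierr)

  dims = (/ q, q /)
  periods = .true.              ! Cannon shifts wrap around, so the grid is a torus
  call mpi_cart_create(mpi_comm_world, 2, dims, periods, .true., comm2d, ierr)
  call mpi_comm_rank(comm2d, rank, ierr)
  call mpi_cart_coords(comm2d, rank, 2, coords, ierr)
  call mpi_cart_shift(comm2d, 1, -1, src_a, dst_a, ierr)   ! A blocks shift left along the row
  call mpi_cart_shift(comm2d, 0, -1, src_b, dst_b, ierr)   ! B blocks shift up along the column

  allocate(a(nb,nb), b(nb,nb), c(nb,nb))
  ! ... load the local blocks of A and B here and pre-skew them by coords(1)/coords(2) steps ...
  a = 1.0d0; b = 1.0d0; c = 0.0d0

  do k = 1, q
     c = c + matmul(a, b)       ! local block product (dgemm in the tuned versions)
     call mpi_sendrecv_replace(a, nb*nb, mpi_double_precision, dst_a, 0, &
                               src_a, 0, comm2d, mpi_status_ignore, ierr)
     call mpi_sendrecv_replace(b, nb*nb, mpi_double_precision, dst_b, 0, &
                               src_b, 0, comm2d, mpi_status_ignore, ierr)
  end do

  call mpi_finalize(ierr)
end program cannon_sketch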
$ make
$ ./create_input
$ ./seri
or
$ ./omp
or
$ ./dgemm
or
$ mpirun -np $NP ./can
or
$ mpirun -np $NP ./can_hyb
Performance comparison (matrix size: 4096x4096, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, 14 cores/socket, 2 sockets/node, 4 nodes, Intel OPA interconnect):
- serial
$ ./seri
serial time: 6.99500107765198 19.6481675908668 Gflops
trace: 4196462.48061815
- MKL dgemm() (single thread)
$ MKL_NUM_THREADS=1 ./dgemm
dgemm time: 3.69211506843567 37.2249918879782 Gflops
trace: 4196462.48061815
- MKL dgemm() (28 threads)
$ MKL_NUM_THREADS=28 KMP_AFFINITY=compact ./dgemm
dgemm time: 1.08629608154297 126.520711808868 Gflops
trace: 4196462.48061815
- OpenMP (28 threads)
$ OMP_NUM_THREADS=28 KMP_AFFINITY=compact ./omp
omp time: 0.852473020553589 161.223816071913 Gflops
trace: 4196462.48061815
- MPI
$ mpiexec.hydra -ppn 16 -np 64 ./can
MPI time: 0.405706882476807 338.764165480622 Gflops
trace: 4196462.48061815
- hybrid(MPI+OpenMP)
$ OMP_NUM_THREADS=$((28/4)) KMP_AFFINITY=compact mpiexec.hydra -ppn 4 -np 16 ./can_hyb
MPI time: 0.325567960739136 422.151347939683 Gflops
trace: 4196462.48061815
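The Gflops figures above appear to be 2*imax**3 floating-point operations divided by the elapsed time: for the serial run, 2*4096**3 / 6.995 s is about 19.6 Gflops, matching the reported value.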
$ ./check c.seri c.dgemm
maximum error: 9.094947017729282E-012
$ ./check c.seri c.omp
maximum error: 0.000000000000000E+000
$ ./check c.seri c.can
maximum error: 1.409716787748039E-011
$ ./check c.seri c.can_hyb
maximum error: 1.409716787748039E-011
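For reference, the maximum-error check can be done with something like the sketch below (the actual check source, and the assumption that the c.* files are plain unformatted dumps of the result matrix, may not match the real tool).

program check_sketch
  implicit none
  integer, parameter :: imax = 4096
  real(8), allocatable :: c1(:,:), c2(:,:)
  character(len=256) :: f1, f2
  allocate(c1(imax,imax), c2(imax,imax))
  call get_command_argument(1, f1)            ! e.g. c.seri
  call get_command_argument(2, f2)            ! e.g. c.can
  open(10, file=trim(f1), form='unformatted')
  read(10) c1
  close(10)
  open(11, file=trim(f2), form='unformatted')
  read(11) c2
  close(11)
  print *, 'maximum error:', maxval(abs(c1 - c2))
end program check_sketch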
Notes for the hybrid(MPI+OpenACC) version:
- The PGI compiler, OpenMPI, and Intel MKL are required.
- The CPU and interconnect are the same as in the runs above; the GPUs are 4 NVIDIA P100s per node.
- GPUDirect is used (see the sketch after these notes).
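The OpenACC version presumably keeps the local blocks resident on the GPU and hands device addresses to MPI so that GPUDirect can move them without staging through the host. A sketch of that pattern, reusing the names from the Cannon sketch above plus integer loop indices i, j, l (directive placement in the real can_acc source may differ):

!$acc data copyin(a, b) copy(c)
do k = 1, q
   ! local block product on the GPU
!$acc parallel loop collapse(2) present(a, b, c)
   do j = 1, nb
      do i = 1, nb
!$acc loop seq
         do l = 1, nb
            c(i,j) = c(i,j) + a(i,l) * b(l,j)
         end do
      end do
   end do
   ! pass device addresses to MPI; with PSM2_CUDA/PSM2_GPUDIRECT set,
   ! the shift goes GPU-to-GPU without a host copy
!$acc host_data use_device(a, b)
   call mpi_sendrecv_replace(a, nb*nb, mpi_double_precision, dst_a, 0, &
                             src_a, 0, comm2d, mpi_status_ignore, ierr)
   call mpi_sendrecv_replace(b, nb*nb, mpi_double_precision, dst_b, 0, &
                             src_b, 0, comm2d, mpi_status_ignore, ierr)
!$acc end host_data
end do
!$acc end data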
$ make -f makefile.acc.mk
$ ./create_input
$ ./seri
serial time: 51.68619100000000 2.659103927236580 Gflops
trace: 4196462.480618147
$ mpirun -x LD_LIBRARY_PATH -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -npernode 4 -np 16 ./can_acc
MPI time: 0.1217727372422814 1128.651261230572 Gflops
trace: 4196462.480618146
$ ./check c.seri c.can_acc
maximum error: 1.2278178473934531E-011
Performance comparison with a larger matrix (matrix size: 16384x16384, same systems as above):
- flat MPI, 64 cores, Intel compiler and Intel MPI
$ mpiexec.hydra -ppn 16 -np 64 ./can
MPI time: 82.6075530052185 106.480493637819 Gflops
trace: 67116321.7059676
- hybrid(MPI+OpenMP), 112 cores, Intel compiler and Intel MPI
$ OMP_NUM_THREADS=$((28/4)) KMP_AFFINITY=compact mpiexec.hydra -ppn 4 -np 16 ./can_hyb
MPI time: 40.3734800815582 217.868090747666 Gflops
trace: 67116321.7059676
- hybrid(MPI+OpenACC), 16 GPUs, PGI compiler and OpenMPI
$ mpirun -x LD_LIBRARY_PATH -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -npernode 4 -np 16 ./can_acc
MPI time: 4.504744562320411 1952.628589816666 Gflops
trace: 67116321.70596765