
Optimization_4x4_12

Jianyu Huang edited this page Aug 2, 2016 · 4 revisions

Copy the contents of file MMult_4x4_11.c into a file named MMult_4x4_12.c and change the contents:

|| from || to ||
||<^> -Include(HowToOptimizeGemm/Details/MMult_4x4_11)- ||<^> -Include(HowToOptimizeGemm/Details/MMult_4x4_12)- ||

Change the first lines in the makefile to
{{{
OLD := MMult_4x4_11
NEW := MMult_4x4_12
}}}

  • make run
{{{
octave:3> PlotAll % this will create the plot
}}}

This time the performance graph will look something like

ImageLink(http://www.cs.utexas.edu/users/rvdg/HowToOptimizeGemm/Graphs/compare_MMult-4x4-11_MMult-4x4-12.png,http://www.cs.utexas.edu/users/rvdg/HowToOptimizeGemm/Graphs/compare_MMult-4x4-11_MMult-4x4-12.png,width=40%)

We now pack each 4xk block of A before calling AddDot4x4. We see a performance drop. If one examines the inner kernel
{{{
void InnerKernel( int m, int n, int k, double *a, int lda,
                                       double *b, int ldb,
                                       double *c, int ldc )
{
  int i, j;
  double packedA[ m * k ];

  for ( j=0; j<n; j+=4 ){        /* Loop over the columns of C, unrolled by 4 */
    for ( i=0; i<m; i+=4 ){      /* Loop over the rows of C */
      /* Update C( i,j ), C( i,j+1 ), C( i,j+2 ), and C( i,j+3 ) in
         one routine (four inner products) */
      PackMatrixA( k, &A( i, 0 ), lda, &packedA[ i*k ] );
      AddDot4x4( k, &packedA[ i*k ], 4, &B( 0,j ), ldb, &C( i,j ), ldc );
    }
  }
}
}}}

one notices that each 4xk block of A is packed repeatedly, once for every time the outer loop is executed.
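
To make the cost of the repeated packing concrete, here is a minimal sketch in C. It is not the code from MMult_4x4_12.c: `pack_4xk_sketch` is a hypothetical stand-in for PackMatrixA that assumes column-major storage (as the `A( i,j )` macro suggests), and the two counting functions are illustrative helpers that only reproduce the loop structure of InnerKernel. The second counter shows one possible way to avoid the redundancy, namely packing only during the first pass of the outer loop; treat it as an assumption, not as the author's fix.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major indexing; assumed to match the A( i,j ) macro of this series. */
#define A(i, j) a[(size_t)(j) * lda + (i)]

/* Hypothetical packing routine: copies a 4 x k block of A into contiguous
   memory so that the four entries of each column sit next to each other. */
void pack_4xk_sketch(int k, const double *a, int lda, double *a_to)
{
    assert(lda >= 4);                 /* a 4-row block must fit in a column */
    for (int j = 0; j < k; j++) {     /* loop over the columns of the block */
        const double *a_ij_pntr = &A(0, j);
        *a_to++ = a_ij_pntr[0];
        *a_to++ = a_ij_pntr[1];
        *a_to++ = a_ij_pntr[2];
        *a_to++ = a_ij_pntr[3];
    }
}

/* Number of packing calls made by the loop nest as written in InnerKernel:
   one call per (i, j) pair, i.e. (m/4) * (n/4) calls. */
int count_packs_every_iteration(int m, int n)
{
    int packs = 0;
    for (int j = 0; j < n; j += 4)
        for (int i = 0; i < m; i += 4)
            packs++;
    return packs;
}

/* Hypothetical variant: pack only while j == 0 and reuse packedA afterwards,
   which needs just m/4 calls. */
int count_packs_first_pass_only(int m, int n)
{
    int packs = 0;
    for (int j = 0; j < n; j += 4)
        for (int i = 0; i < m; i += 4)
            if (j == 0)
                packs++;
    return packs;
}
```

For m = n = 256, the first counter returns 4096 while the second returns 64, so under these assumptions each 4xk block of A is packed 64 times where a single pack would do.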
