Optimization_4x4_12
Copy the contents of file MMult_4x4_11.c into a file named MMult_4x4_12.c and change the contents:
|| from || to ||
||<^> -Include(HowToOptimizeGemm/Details/MMult_4x4_11)- ||<^> -Include(HowToOptimizeGemm/Details/MMult_4x4_12)- ||
Change the first lines in the makefile to
{{{
OLD := MMult_4x4_11
NEW := MMult_4x4_12
}}}
Then execute make run, followed by, in Octave,
{{{
octave:3> PlotAll   % this will create the plot
}}}
This time the performance graph will look something like
We now pack each 4xk block of A before calling AddDot4x4. We see a performance drop. If one examines the inner kernel
{{{
void InnerKernel( int m, int n, int k, double *a, int lda,
                                       double *b, int ldb,
                                       double *c, int ldc )
{
  int i, j;
  double
    packedA[ m * k ];

  for ( j=0; j<n; j+=4 ){        /* Loop over the columns of C, unrolled by 4 */
    for ( i=0; i<m; i+=4 ){        /* Loop over the rows of C */
      /* Update C( i,j ), C( i,j+1 ), C( i,j+2 ), and C( i,j+3 ) in
         one routine (four inner products) */
      PackMatrixA( k, &A( i, 0 ), lda, &packedA[ i*k ] );
      AddDot4x4( k, &packedA[ i*k ], 4, &B( 0,j ), ldb, &C( i,j ), ldc );
    }
  }
}
}}}
one notices that each 4xk block of A is packed repeatedly, once for every iteration of the outer loop (the loop over j).
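To make the repeated packing concrete, here is a minimal sketch that reuses the loop structure of InnerKernel but only counts the packing calls. It is not the code from MMult_4x4_12.c. It assumes column-major storage with the macro A( i,j ) defined as a[ (j)*lda + (i) ], and it assumes that PackMatrixA copies the 4xk block one column of four elements at a time. The counter pack_calls and the routine CountPackCalls are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumption: column-major storage, element (i,j) of A at a[ j*lda + i ]. */
#define A(i, j) a[(j) * lda + (i)]

/* Hypothetical counter, added only to make the redundant work visible. */
static int pack_calls = 0;

/* Sketch of a packing routine: copy the 4 x k block that starts at a
   into contiguous storage a_to, one column of four elements at a time. */
static void PackMatrixA(int k, const double *a, int lda, double *a_to)
{
    int i, j;

    pack_calls++;
    for (j = 0; j < k; j++)        /* loop over the columns of the block */
        for (i = 0; i < 4; i++)    /* four rows per column */
            *a_to++ = A(i, j);
}

/* Same loop nest as InnerKernel with the AddDot4x4 call left out.
   Returns how many times PackMatrixA was called. */
static int CountPackCalls(int m, int n, int k, const double *a, int lda)
{
    int i, j;
    double *packedA;

    assert(m > 0 && n > 0 && k > 0 && m % 4 == 0 && n % 4 == 0);
    packedA = malloc(sizeof(double) * (size_t)m * (size_t)k);
    assert(packedA != NULL);

    pack_calls = 0;
    for (j = 0; j < n; j += 4)         /* loop over the columns of C */
        for (i = 0; i < m; i += 4)     /* loop over the rows of C */
            PackMatrixA(k, &A(i, 0), lda, &packedA[i * k]);

    free(packedA);
    return pack_calls;
}
```

With m = 8, n = 12, and k = 3, CountPackCalls( 8, 12, 3, a, 8 ) returns 6, although A contains only 8/4 = 2 distinct 4xk blocks: each block is packed 12/4 = 3 times, once per iteration of the loop over j.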