Optimize tinygemm by using smem to dequantize 2 x any4 codes

any4 LUT dequantization is currently via warp shuffle in the GEMM core, but higher throughput might be achievable by using smem to dequantize 2 x any4 codes (1 byte) at a time instead at the possible expense of added bank conflights.