Matrix Multiplication

One of the friendliest problems for vectorization is matrix multiplication. Given an M × K matrix 𝐀 and a K × N matrix 𝐁, multiplying them is like performing M * N dot products of length K. We need M*K + K*N + M*N total memory, but M*K*N multiplications and additions, so there's a lot more arithmetic we can do relative to the memory needed.
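
To make the arithmetic-intensity claim concrete, here is a small illustrative calculation (the helper names flops and elements are made up for this example):

flops(M, K, N) = 2 * M * K * N                  # one multiply and one add per (m, k, n) triple
elements(M, K, N) = M*K + K*N + M*N             # elements touched across 𝐀, 𝐁, and 𝐂
flops(100, 100, 100) / elements(100, 100, 100)  # ≈ 66.7 flops per element of memory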

LoopVectorization currently doesn't do any memory-modeling or memory-based optimizations, so it will still run into problems as the size of the matrices increases. But at smaller sizes, it's capable of achieving a healthy percentage of potential GFLOPS. We can write a single function:

using LoopVectorization   # provides @turbo and indices

function A_mul_B!(C, A, B)
    # indices returns the shared iteration range over the paired axes of its arguments
    @turbo for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
        Cmn = zero(eltype(C))
        for k ∈ indices((A,B), (2,1))
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end

and this single definition can handle all transposed/not-transposed permutations. LoopVectorization will change the loop order and strategy as appropriate based on the types of the input matrices (see the usage sketch below). For each of the other libraries and compilers, I wrote separate functions to handle each case. Letting all three matrices be square and Size × Size, we attain the benchmark results described below.
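
As a quick usage sketch (the sizes and variable names here are arbitrary, and this is illustration rather than the benchmark harness), the same function can be passed transpose wrappers, and LoopVectorization specializes on their types:

using LinearAlgebra

A  = rand(72, 72);   B  = rand(72, 72);   C = similar(A)
At = permutedims(A); Bt = permutedims(B)   # materialized transposed copies

A_mul_B!(C, A, B);                          C ≈ A * B                          # C = A * B
A_mul_B!(C, A, transpose(Bt));              C ≈ A * transpose(Bt)              # C = A * Bᵀ
A_mul_B!(C, transpose(At), B);              C ≈ transpose(At) * B              # C = Aᵀ * B
A_mul_B!(C, transpose(At), transpose(Bt));  C ≈ transpose(At) * transpose(Bt)  # C = Aᵀ * Bᵀ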

AmulB: This is classic GEMM, 𝐂 = 𝐀 * 𝐁. GFortran's intrinsic matmul function does fairly well, but all the compilers are well behind LoopVectorization here, which in turn falls behind MKL's gemm beyond 70×70 or so. The penalty for misalignment is also striking: performance is much higher when the sizes are integer multiples of 8. Padding arrays so that each column is aligned regardless of the number of rows can thus be very profitable. PaddedMatrices.jl offers just such arrays in Julia. I believe that is also what the -pad compiler flag does when using Intel's compilers.
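
As a rough sketch of the padding idea (this is not PaddedMatrices.jl's actual API; padded_rows is a hypothetical helper), one can allocate the parent array with its row count rounded up to a multiple of 8 and compute with a view of the first M rows:

padded_rows(M) = 8 * cld(M, 8)        # hypothetical helper: round the row count up to a multiple of 8
M, N = 71, 71
parentA = zeros(padded_rows(M), N)    # 72 rows of Float64: each column spans a multiple of 64 bytes
A = view(parentA, 1:M, 1:N)           # work with the leading M×N block; columns stay mutually aligned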

AmulBt: The optimal pattern for 𝐂 = 𝐀 * 𝐁ᵀ is almost identical to that for 𝐂 = 𝐀 * 𝐁. Yet gfortran's matmul intrinsic stumbles, surprisingly doing much worse than gfortran + loops, and almost certainly worse than allocating memory for 𝐁ᵀ and creating the explicit copy.

ifort did equally well whether or not 𝐁 was transposed, while LoopVectorization's performance degraded slightly faster as a function of size in the transposed case, because the strides between memory accesses are larger when 𝐁 is transposed. Still, it performed best of all the compiled loops over this size range, losing out only to MKL and eventually OpenBLAS. icc, interestingly, does better when 𝐁 is transposed.

GEMM is easiest when the matrix 𝐀 is not transposed (assuming column-major memory layouts), because then you can sum up columns of 𝐀 to store into 𝐂. If 𝐀 is transposed, we cannot efficiently load contiguous elements from 𝐀 that can be stored directly into 𝐂. So for 𝐂 = 𝐀ᵀ * 𝐁, contiguous vectors along the k-loop have to be reduced, adding some overhead (see the sketch below).

AtmulB: Packing is critical for performance here. LoopVectorization does not pack, so it falls well behind MKL and OpenBLAS, which do. Eigen packs, but is poorly optimized for this CPU architecture.
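
For reference, a hypothetical explicit kernel for 𝐂 = 𝐀ᵀ * 𝐁 (with the operand stored as a K × M array At) might look like the following. The generic A_mul_B! above already covers this case via transpose wrappers; this version just makes the k-contiguous loads and the final reduction explicit:

function At_mul_B!(C, At, B)
    @turbo for n ∈ indices((C,B), 2), m ∈ indices((C,At), (1,2))
        Cmn = zero(eltype(C))
        # At and B are both walked contiguously along k, so each Cmn is a
        # dot product whose vector accumulator is reduced to a scalar at the end
        for k ∈ indices((At,B), 1)
            Cmn += At[k,m] * B[k,n]
        end
        C[m,n] = Cmn
    end
end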

AtmulBt: When both 𝐀 and 𝐁 are transposed, we have 𝐂 = 𝐀ᵀ * 𝐁ᵀ = (𝐁 * 𝐀)ᵀ. Julia, Clang, and gfortran all struggled to vectorize this, because the contiguous axis differs for each array: m for 𝐂, k for 𝐀ᵀ, and n for 𝐁ᵀ. However, LoopVectorization and all the specialized matrix multiplication functions managed to do about as well as normal; transposing while storing the results takes a negligible amount of time relative to the matrix multiplication itself. The ifort-loop version also did fairly well.