Matrix Multiplication
One of the friendliest problems for vectorization is matrix multiplication. Given an M × K matrix 𝐀 and a K × N matrix 𝐁, multiplying them is like performing M * N dot products of length K. We need M*K + K*N + M*N total memory, but M*K*N multiplications and additions, so there's a lot more arithmetic we can do relative to the memory needed.
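To make that ratio concrete, here is a tiny helper (the name is invented for illustration), counting each multiply and add as separate flops and measuring memory in elements rather than bytes:

```julia
# Flops grow as M*K*N while memory grows as M*K + K*N + M*N, so the
# arithmetic-to-memory ratio improves roughly linearly with size.
arithmetic_intensity(M, K, N) = 2M * K * N / (M * K + K * N + M * N)

arithmetic_intensity(72, 72, 72)  # 48.0 flops per element touched
```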
LoopVectorization currently doesn't do any memory modeling or memory-based optimizations, so it will still run into problems as the size of the matrices increases. But at smaller sizes, it's capable of achieving a healthy percentage of potential GFLOPS. We can write a single function:
```julia
using LoopVectorization

function A_mul_B!(C, A, B)
    @turbo for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
        Cmn = zero(eltype(C))
        # k ranges over the shared axis: columns of A and rows of B
        for k ∈ indices((A,B), (2,1))
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end
```
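As a quick sanity check (the sizes below are arbitrary), the same method also accepts adjoint-wrapped arguments, since `indices` matches the shared axes through the wrapper:

```julia
A = rand(100, 60); B = rand(60, 80);
C = Matrix{Float64}(undef, 100, 80);

A_mul_B!(C, A, B)
C ≈ A * B           # true: plain 𝐂 = 𝐀 * 𝐁

At = rand(60, 100)  # 𝐀 stored transposed, K × M
A_mul_B!(C, At', B) # same method dispatches fine on the Adjoint wrapper
C ≈ At' * B         # true
```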
This single definition handles all transposed/not-transposed permutations; LoopVectorization will change loop orders and strategy as appropriate based on the types of the input matrices. For each of the others, I wrote separate functions to handle each case. Letting all three matrices be square and Size × Size, we attain the following benchmark results:
This is classic GEMM, 𝐂 = 𝐀 * 𝐁. GFortran's intrinsic matmul function does fairly well. But all the compilers are well behind LoopVectorization here, which itself falls behind MKL's gemm beyond 70×70 or so. The problem imposed by alignment is also striking: performance is much higher when the sizes are integer multiples of 8. Padding arrays so that each column is aligned regardless of the number of rows can thus be very profitable. PaddedMatrices.jl offers just such arrays in Julia. I believe that is also what the -pad compiler flag does when using Intel's compilers.
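As a rough sketch of the idea behind such padding (this is not PaddedMatrices' actual API; `padded_zeros` is a name invented here), one can over-allocate the leading dimension and work with a view:

```julia
# Illustrative sketch only. Round the leading dimension up to a multiple
# of 8 so that (assuming the allocation itself is suitably aligned, as
# Julia's larger arrays are) every column starts on the same boundary.
function padded_zeros(::Type{T}, M, N; multiple = 8) where {T}
    Mpad = cld(M, multiple) * multiple  # e.g. M = 70 rounds up to 72
    parent = zeros(T, Mpad, N)
    return view(parent, 1:M, :)         # an M × N view over padded storage
end

Apad = padded_zeros(Float64, 70, 70)
stride(Apad, 2)  # 72: columns are spaced a multiple of 8 elements apart
```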
The optimal pattern for 𝐂 = 𝐀 * 𝐁ᵀ is almost identical to that for 𝐂 = 𝐀 * 𝐁. Yet, gfortran's matmul intrinsic stumbles, surprisingly doing much worse than gfortran + loops, and almost certainly worse than allocating memory for 𝐁ᵀ and creating the explicit copy.
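To see why the optimal pattern is almost identical, here is what a hand-written special case might look like (the name `A_mul_Bt!` is hypothetical; 𝐁 is passed as its stored N × K matrix). Only the indexing of 𝐁 changes relative to the untransposed kernel:

```julia
function A_mul_Bt!(C, A, B)
    # 𝐂 = 𝐀 * 𝐁ᵀ, with 𝐁 passed as its stored N × K matrix:
    # C[m,n] = Σₖ A[m,k] * B[n,k]
    @turbo for n ∈ indices((C,B), (2,1)), m ∈ indices((C,A), 1)
        Cmn = zero(eltype(C))
        for k ∈ indices((A,B), 2)
            Cmn += A[m,k] * B[n,k]  # only 𝐁's index order differs
        end
        C[m,n] = Cmn
    end
end
```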
ifort did equally well whether or not 𝐁 was transposed, while LoopVectorization's performance degraded slightly faster as a function of size in the transposed case, because strides between memory accesses are larger when 𝐁 is transposed. But it still performed best of all the compiled loops over this size range, losing out only to MKL and eventually OpenBLAS. icc, interestingly, does better when 𝐁 is transposed.
GEMM is easiest when the matrix 𝐀 is not transposed (assuming column-major memory layouts), because then you can sum up columns of 𝐀 to store into 𝐂. If 𝐀 is transposed, we cannot efficiently load contiguous elements from 𝐀 that can be stored directly into 𝐂. So for 𝐂 = 𝐀ᵀ * 𝐁, contiguous vectors along the k-loop have to be reduced, adding some overhead. Packing is critical for performance here. LoopVectorization does not pack, so it falls well behind MKL and OpenBLAS, which do. Eigen packs, but is poorly optimized for this CPU architecture.
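A sketch of what such a special-cased loop might look like (the name `At_mul_B!` is hypothetical; 𝐀 is passed as its stored K × M matrix), showing the k-loop that must be reduced:

```julia
function At_mul_B!(C, A, B)
    # 𝐂 = 𝐀ᵀ * 𝐁, with 𝐀 passed as its stored K × M matrix:
    # C[m,n] = Σₖ A[k,m] * B[k,n]
    @turbo for n ∈ indices((C,B), 2), m ∈ indices((C,A), (1,2))
        Cmn = zero(eltype(C))
        # k indexes the contiguous first axis of both A and B, so loads are
        # contiguous, but each SIMD accumulator must be reduced to the
        # scalar Cmn before the store — the overhead mentioned above.
        for k ∈ indices((A,B), 1)
            Cmn += A[k,m] * B[k,n]
        end
        C[m,n] = Cmn
    end
end
```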
When both 𝐀 and 𝐁 are transposed, we now have 𝐂 = 𝐀ᵀ * 𝐁ᵀ = (𝐁 * 𝐀)ᵀ. Julia, Clang, and gfortran all struggled to vectorize this, because none of the matrices share a contiguous access pattern: the m-loop is contiguous for 𝐂, the k-loop for 𝐀ᵀ, and the n-loop for 𝐁ᵀ. However, LoopVectorization and all the specialized matrix multiplication functions managed to do about as well as normal; transposing while storing the results takes a negligible amount of time relative to the matrix multiplication itself. The ifort-loop version also did fairly well.
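For reference, a hypothetical fused version of the (𝐁 * 𝐀)ᵀ trick might look like this (`At_mul_Bt!` is a name invented here; 𝐀 is stored K × M and 𝐁 is stored N × K), with the transpose happening in the final store:

```julia
function At_mul_Bt!(C, A, B)
    # 𝐂 = 𝐀ᵀ * 𝐁ᵀ, with 𝐀 stored K × M and 𝐁 stored N × K:
    # C[m,n] = Σₖ A[k,m] * B[n,k] = (𝐁 * 𝐀)[n,m]
    @turbo for n ∈ indices((C,B), (2,1)), m ∈ indices((C,A), (1,2))
        Cmn = zero(eltype(C))
        for k ∈ indices((A,B), (1,2))
            Cmn += A[k,m] * B[n,k]
        end
        C[m,n] = Cmn  # storing (𝐁 * 𝐀)[n,m] into C[m,n] transposes on the fly
    end
end
```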