# Array Interface
`LoopVectorization` uses ArrayInterface.jl to describe the memory layout of arrays, which lets it support any compatible `AbstractArray` type. StaticArrays.jl and HybridArrays.jl are two example libraries providing array types that support the interface. `StaticArrays.SArray` itself is not compatible, because `LoopVectorization` needs access to a pointer; `StaticArrays.MArray`s, however, are compatible. Loops featuring an incompatible `StaticArrays.StaticArray` will execute a fallback loop that was not optimized by `LoopVectorization`, but merely had `@inbounds @fastmath` applied. This can often still yield reasonable to good performance, saving you from having to write more than one version of the loop just to get good performance and correct behavior when the array types happen to differ.
Because it builds on this interface, `LoopVectorization` can simplify implementing many operations, such as matrix multiplication, while still delivering good performance. For example, instead of the few hundred lines of code used to define matrix multiplication in `StaticArrays`, one could simply write:
```julia
using StaticArrays, LoopVectorization

@inline function AmulB!(C, A, B)
    @turbo for n ∈ axes(C,2), m ∈ axes(C,1)
        Cmn = zero(eltype(C))
        for k ∈ axes(B,1)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
    C
end
@inline AmulB(A::MMatrix{M,K,T}, B::MMatrix{K,N,T}) where {M,K,N,T} = AmulB!(MMatrix{M,N,T}(undef), A, B)
@inline AmulB(A::SMatrix, B::SMatrix) = SMatrix(AmulB(MMatrix(A), MMatrix(B)))
```
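As a quick sanity check (the sizes here are arbitrary), we can verify the result against `Base`'s `*` and confirm that the `MMatrix` temporaries do not cause heap allocations:

```julia
using StaticArrays, LoopVectorization, BenchmarkTools

A = @SMatrix rand(7, 7);
B = @SMatrix rand(7, 7);
@assert AmulB(A, B) ≈ A * B # agrees with Base's generic multiply up to rounding
@btime AmulB($A, $B);       # expect 0 allocations: the MMatrix temporaries stay on the stack
```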
By converting back and forth between `SMatrix` and `MMatrix`, we can still use `LoopVectorization` to implement `SMatrix` multiplication, and in most cases get better performance than the unrolled methods from the library. This is still suboptimal, because the compiler is unable to elide the copies, but the temporaries are all stack-allocated, keeping the code allocation-free. We can benchmark our simple implementation against the `StaticArrays.SMatrix` and `StaticArrays.MMatrix` methods:
```julia
using BenchmarkTools, LinearAlgebra, DataFrames, VegaLite
BLAS.set_num_threads(1);

matdims(x::Integer) = (x, x, x)
matdims(x::NTuple{3}) = x
matflop(x::Integer) = 2x^3
matflop(x::NTuple{3}) = 2prod(x)

function runbenches(sr, ::Type{T}, fa = identity, fb = identity) where {T}
    bench_results = Matrix{Float64}(undef, length(sr), 4);
    for (i,s) ∈ enumerate(sr)
        M, K, N = matdims(s)
        Am = @MMatrix rand(T, M, K)
        Bm = @MMatrix rand(T, K, N)
        As = Ref(SMatrix(Am));
        Bs = Ref(SMatrix(Bm));
        Css = fa(As[]) * fb(Bs[]);
        Csl = AmulB(fa(As[]), fb(Bs[]))
        Cms = similar(Css); mul!(Cms, fa(Am), fb(Bm));
        Cml = similar(Css); AmulB!(Cml, fa(Am), fb(Bm));
        @assert Array(Css) ≈ Array(Csl) ≈ Array(Cms) ≈ Array(Cml) # once upon a time, Julia crashed on ≈ for large static arrays
        bench_results[i,1] = @belapsed $fa($As[]) * $fb($Bs[])
        bench_results[i,2] = @belapsed AmulB($fa($As[]), $fb($Bs[]))
        bench_results[i,3] = @belapsed mul!($Cms, $fa($Am), $fb($Bm))
        bench_results[i,4] = @belapsed AmulB!($Cml, $fa($Am), $fb($Bm))
        @show s, bench_results[i,:]
    end
    gflops = @. 1e-9 * matflop(sr) / bench_results
    array_type = append!(fill("Static", 2length(sr)), fill("Mutable", 2length(sr)))
    sa = fill("StaticArrays", length(sr)); lv = fill("LoopVectorization", length(sr));
    matmul_lib = vcat(sa, lv, sa, lv);
    sizes = reduce(vcat, (sr for _ ∈ 1:4))
    DataFrame(
        Size = sizes, Time = vec(bench_results), GFLOPS = vec(gflops),
        ArrayType = array_type, MatmulLib = matmul_lib, MulType = array_type .* ' ' .* matmul_lib
    )
end

df = runbenches(1:24, Float64);
df |> @vlplot(:line, x = :Size, y = :GFLOPS, color = :MulType, height = 640, width = 960) |> save("sarraymatmul.svg")
```
This yields the plot saved to `sarraymatmul.svg`:

![Static array matrix-multiplication benchmarks](sarraymatmul.svg)

Our `AmulB!` for `MMatrix`es was the fastest at all sizes except 2×2, where it lost out to `AmulB` for `SMatrix`, which in turn was faster than the hundreds of lines of `StaticArrays` code at all sizes except 3×3, 5×5, and 6×6.
Additionally, `HybridArrays.jl` can be used when we have arrays mixing dynamically and statically sized dimensions. Say we want to multiply two matrices whose elements are themselves 3×3 matrices:
```julia
using HybridArrays, StaticArrays, LoopVectorization, BenchmarkTools

A_static = [@SMatrix(rand(3,3)) for i in 1:32, j in 1:32];
B_static = [@SMatrix(rand(3,3)) for i in 1:32, j in 1:32];
C_static = similar(A_static);

A_hybrid = HybridArray{Tuple{StaticArrays.Dynamic(),StaticArrays.Dynamic(),3,3}}(
    permutedims(reshape(reinterpret(Float64, A_static), (3,3,size(A_static)...)), (3,4,1,2)));
B_hybrid = HybridArray{Tuple{StaticArrays.Dynamic(),StaticArrays.Dynamic(),3,3}}(
    permutedims(reshape(reinterpret(Float64, B_static), (3,3,size(B_static)...)), (3,4,1,2)));
C_hybrid = HybridArray{Tuple{StaticArrays.Dynamic(),StaticArrays.Dynamic(),3,3}}(
    permutedims(reshape(reinterpret(Float64, C_static), (3,3,size(C_static)...)), (3,4,1,2)));

# C is M x N x I x J
# A is M x K x I x L
# B is K x N x L x J
function bmul!(C, A, B)
    @turbo for n in axes(C,2), m in axes(C,1), j in axes(C,4), i in axes(C,3)
        Cmnji = zero(eltype(C))
        for k in axes(B,1), l in axes(B,3)
            Cmnji += A[m,k,i,l] * B[k,n,l,j]
        end
        C[m,n,i,j] = Cmnji
    end
end
```
This yields:

```julia-repl
julia> @benchmark bmul!($C_hybrid, $A_hybrid, $B_hybrid)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.550 μs (0.00% GC)
  median time:      15.663 μs (0.00% GC)
  mean time:        15.685 μs (0.00% GC)
  maximum time:     50.286 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C_static, $A_static, $B_static)
BenchmarkTools.Trial:
  memory estimate:  336 bytes
  allocs estimate:  6
  --------------
  minimum time:     277.736 μs (0.00% GC)
  median time:      278.035 μs (0.00% GC)
  mean time:        278.310 μs (0.00% GC)
  maximum time:     299.259 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> all(I -> C_hybrid[Tuple(I)[1], Tuple(I)[2], :, :] ≈ C_static[I], CartesianIndices(C_static))
true

julia> length(C_hybrid) * size(B_hybrid,1) * size(B_hybrid,3) * 2e-9 / 15.55e-6 # GFLOPS, loops + HybridArrays
113.79241157556271

julia> length(C_hybrid) * size(B_hybrid,1) * size(B_hybrid,3) * 2e-9 / 277.736e-6 # GFLOPS, LinearAlgebra.mul! + StaticArrays
6.371057407034018
```
When using `LoopVectorization` with `HybridArrays`, you may find that you get the best performance when the leading dimensions are either a multiple of 8 or relatively large. This often means not leading with a small static dimension, which is otherwise commonly best practice when not using `LoopVectorization`; a sketch of the alternative layout follows.
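To illustrate (a sketch only; `A_lead`, `B_lead`, `C_lead`, and `bmul_lead!` are hypothetical names, and the arrays are those defined above), we can store the same data with the small static 3×3 dimensions leading and adapt the loop accordingly. One would typically expect this layout to benchmark slower than the dynamic-leading layout used by `bmul!`:

```julia
# Same data as above, but with the static 3x3 dimensions leading.
A_lead = HybridArray{Tuple{3,3,StaticArrays.Dynamic(),StaticArrays.Dynamic()}}(
    permutedims(Array(A_hybrid), (3,4,1,2)));
B_lead = HybridArray{Tuple{3,3,StaticArrays.Dynamic(),StaticArrays.Dynamic()}}(
    permutedims(Array(B_hybrid), (3,4,1,2)));
C_lead = similar(A_lead);

# C is I x J x M x N; A is I x L x M x K; B is L x J x K x N
function bmul_lead!(C, A, B)
    @turbo for n in axes(C,4), m in axes(C,3), j in axes(C,2), i in axes(C,1)
        Cijmn = zero(eltype(C))
        for k in axes(B,3), l in axes(B,1)
            Cijmn += A[i,l,m,k] * B[l,j,k,n]
        end
        C[i,j,m,n] = Cijmn
    end
end
# @benchmark bmul_lead!($C_lead, $A_lead, $B_lead) # compare with bmul! above; expect this layout to be slower
```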
If you happen to like tensor operations such as the one in this last example, you are also strongly encouraged to check out Tullio.jl, which provides index notation that is both more convenient and less error-prone than writing out loops, and uses both `LoopVectorization` (if you run `using LoopVectorization` before calling `@tullio`) and multiple threads to maximize performance.
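For instance, the batched multiplication above might be written with `@tullio` roughly as follows (a sketch reusing the hybrid arrays defined earlier; indices that appear only on the right-hand side, here `k` and `l`, are summed over):

```julia
using LoopVectorization # load before invoking @tullio so Tullio can use it
using Tullio

# Equivalent to bmul! above: k and l are contracted.
@tullio C_hybrid[m,n,i,j] = A_hybrid[m,k,i,l] * B_hybrid[k,n,l,j]
```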