Here, we convolve a small matrix
kern with a larger matrix
A, storing the results in
out, using Julia's generic Cartesian Indexing:
using LoopVectorization, OffsetArrays, Images kern = Images.Kernel.gaussian((1, 1), (3, 3)) A = rand(130,130); out = OffsetArray(similar(A, size(A) .- size(kernel) .+ 1), -1 .- kernel.offsets); function filter2davx!(out::AbstractMatrix, A::AbstractMatrix, kern) @turbo for J in CartesianIndices(out) tmp = zero(eltype(out)) for I ∈ CartesianIndices(kern) tmp += A[I + J] * kern[I] end out[J] = tmp end out end
These are effectively four nested loops. For all the benchmarks,
kern was 3 by 3, making it too small for vectorizing these loops to be particularly profitable. By vectorizing an outer loop instead, it can benefit from SIMD and also avoid having to do a reduction (horizontal addition) of a vector before storing in
out, as the vectors can then be stored directly.
LoopVectorization achieved much better performance than all the alternatives, which tried vectorizing the inner loops. By making the compilers aware that the inner loops are too short to be worth vectorizing, we can get them to vectorize an outer loop instead. By defining the size of
kern as constant in
Fortran, and using size parameters in Julia, we can inform the compilers: Now all are doing much better than they were before, although still well shy of the 131.2 GFLOPS theoretical limit for the host CPU cores. While they all improved, two are lagging behind the main group:
ifortlags behind all the others except base Julia. I'll need to do more investigating to find out why.
- Base Julia. While providing static size information was enough for it to realize vectorizing the inner loops was not worth it, base Julia was seemingly the only one that didn't decide to vectorize an outer loop instead.
Manually unrolling the inner loops allows base Julia to vectorize, while the performance of all non-Julia variants was unchanged: LoopVectorization is currently limited to only unrolling two loops (but a third may be vectorized, effectively unrolling it by the length of the vectors). Manually unrolling two of the loops lets up to four loops be unrolled.