API reference

LoopVectorization — Module

LoopVectorization provides macros and functions that combine SIMD vectorization and loop-reordering so as to improve performance:

@turbo: transform for-loops and broadcasting
vmapreduce: vectorized version of mapreduce
vreduce: vectorized version of reduce
vsum: vectorized version of sum
vmap and vmap!: vectorized versions of map and map!
vmapnt and vmapnt!: non-temporal variants of vmap and vmap!
vmapntt and vmapntt!: threaded variants of vmapnt and vmapnt!
vfilter and vfilter!: vectorized versions of filter and filter!

Macros

LoopVectorization.@turbo — Macro

@turbo

Annotate a for loop, or a set of nested for loops whose bounds are constant across iterations, to optimize the computation. For example:

function AmulB!(C, A, B)
    @turbo for m ∈ indices((A,C), 1), n ∈ indices((B,C), 2) # indices((A,C),1) == axes(A,1) == axes(C,1)
        Cₘₙ = zero(eltype(C))
        for k ∈ indices((A,B), (2,1)) # indices((A,B), (2,1)) == axes(A,2) == axes(B,1)
            Cₘₙ += A[m,k] * B[k,n]
        end
        C[m,n] = Cₘₙ
    end
end

The macro models the set of nested loops, and chooses an ordering of the three loops to minimize predicted computation time.
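
As a quick sanity check, the kernel above can be compared against Base's matrix product. This is a minimal sketch: the matrix sizes are arbitrary, and the trailing `return C` is a convenience added here that is not part of the original example.

```julia
using LoopVectorization

function AmulB!(C, A, B)
    @turbo for m ∈ indices((A,C), 1), n ∈ indices((B,C), 2)
        Cₘₙ = zero(eltype(C))
        for k ∈ indices((A,B), (2,1))
            Cₘₙ += A[m,k] * B[k,n]
        end
        C[m,n] = Cₘₙ
    end
    return C  # added for convenience; the original returns nothing
end

# Arbitrary small sizes, chosen only for illustration.
A = rand(8, 5); B = rand(5, 7); C = zeros(8, 7)
AmulB!(C, A, B)

# Compare with the reference product up to floating-point rounding.
matches = C ≈ A * B
```

Because no bounds checks are performed, the dimensions of C, A, and B must agree before the call.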

Current limitations:

It assumes that loop iterations are independent.
It does not perform bounds checks.
It assumes that each loop iterates at least once. (Use @turbo check_empty=true to lift this assumption.)
It assumes that there is only one loop at each level of the nest.

It may also be applied to broadcasts:

julia> using LoopVectorization

julia> a = rand(100);

julia> b = @turbo exp.(2 .* a);

julia> c = similar(b);

julia> @turbo @. c = exp(2a);

julia> b ≈ c
true

Extended help

Advanced users can customize the implementation of the @turbo-annotated block using keyword arguments:

@turbo inline = false unroll = 2 thread = 4 body

where body is the code of the block (e.g., for ... end).

thread is either a Boolean, or an integer. The integer's value indicates the number of threads to use. It is clamped to be between 1 and min(Threads.nthreads(),LoopVectorization.num_cores()). false is equivalent to 1, and true is equivalent to min(Threads.nthreads(),LoopVectorization.num_cores()).
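
The thread keyword might be used as in the following sketch. The function name `scale!` and the choice of two threads are illustrative assumptions; as described above, the requested count is clamped, so the loop should still run correctly in a single-threaded session.

```julia
using LoopVectorization

# Hypothetical helper: y[i] = a * x[i], requesting up to 2 threads.
function scale!(y, x, a)
    @turbo thread=2 for i ∈ eachindex(x, y)
        y[i] = a * x[i]
    end
    return y
end

x = rand(10_000)
y = similar(x)
scale!(y, x, 3.0)

same = y ≈ 3.0 .* x
```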

safe (defaults to true) will cause @turbo to fall back to @inbounds @fastmath if can_turbo returns false for any of the functions called in the loop. You can disable the associated warning with warn_check_args=false.

Setting the keyword argument warn_check_args=true (e.g. @turbo warn_check_args=true for ...) in a loop or broadcast statement will cause it to warn once if LoopVectorization.check_args fails and the fallback loop is executed instead of the LoopVectorization-optimized loop. Setting it to an integer > 0 will warn that many times, while setting it to a negative integer will warn an unlimited number of times. The default is warn_check_args = 1. Failure means that there may have been an array with an unsupported type, unsupported element types, or (if safe=true) a function for which can_turbo returned false.

inline is a Boolean. When true, body will be directly inlined into the function (via a forced-inlining call to _turbo_!). When false, it won't force inlining of the call to _turbo_!, instead letting Julia's own inlining engine determine whether the call to _turbo_! should be inlined. (Typically, it won't.) Sometimes not inlining can lead to substantially worse code generation, and >40% regressions, even in very large problems (2-d convolutions are a case where this has been observed). One can find some circumstances where inline=true is faster, and other circumstances where inline=false is faster, so the best setting may require experimentation. By default, the macro tries to guess. Currently the algorithm is simple: roughly, if there are more than two dynamically sized loops and no convolutions, it will probably not force inlining. Otherwise, it probably will.

check_empty (default is false) determines whether it will check if any of the iterators are empty. If false, you must ensure yourself that they are not empty; otherwise the behavior of the loop is undefined and (as with @inbounds) segmentation faults are likely.
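
A minimal sketch of check_empty=true, assuming a hypothetical reduction helper named `turbo_sum`. With the check enabled, an empty input should skip the loop and leave the accumulator at its initial value.

```julia
using LoopVectorization

# Hypothetical helper: sums x, guarding against empty iterators.
function turbo_sum(x)
    s = zero(eltype(x))
    @turbo check_empty=true for i ∈ eachindex(x)
        s += x[i]
    end
    return s
end

empty_result = turbo_sum(Float64[])  # loop body is never entered

v = rand(1_000)
full_result = turbo_sum(v)
```

Without check_empty=true, calling such a function on an empty array falls under the undefined behavior described above.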

unroll is an integer that specifies the loop unrolling factor, or a tuple such as (u₁, u₂) = (4, 2) signaling that the generated code should unroll more than one loop. u₁ is the unrolling factor for the first unrolled loop and u₂ for the next (if present), but it applies to the loop ordering and unrolling that will be chosen by LoopVectorization, not the order in body. uᵢ=0 (the default) indicates that LoopVectorization should pick its own value, and uᵢ=-1 disables unrolling for the corresponding loop.
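
The keyword arguments can be combined on a single loop. The sketch below requests forced inlining and an unrolling factor of 4 for a dot product; the function name `dot_unrolled` and the chosen values are illustrative assumptions, not recommended settings.

```julia
using LoopVectorization

# Hypothetical helper: dot product with explicit inline and unroll settings.
function dot_unrolled(a, b)
    s = zero(eltype(a))
    @turbo inline=true unroll=4 for i ∈ eachindex(a, b)
        s += a[i] * b[i]
    end
    return s
end

a = rand(4_096); b = rand(4_096)
result = dot_unrolled(a, b)

# The settings only change code generation, not the value (up to rounding).
agrees = result ≈ sum(a .* b)
```

Whether these settings help is problem dependent, so benchmarking against the defaults is the only reliable guide.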

The @turbo macro also checks the array arguments using LoopVectorization.check_args to try to determine if they are compatible with the macro. If check_args returns false, a fallback loop annotated with @inbounds and @fastmath is generated. Note that VectorizationBase provides functions such as vadd and vmul that will ignore @fastmath, preserving IEEE semantics both within @turbo and @fastmath. check_args currently returns false for some wrapper types like LinearAlgebra.UpperTriangular, requiring you to use their parent. Triangular loops aren't yet supported.
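
The check can also be called directly to see which path a given array would take. This is a small sketch based on the behavior described above; the variable names are illustrative.

```julia
using LoopVectorization, LinearAlgebra

A = rand(4, 4)
U = UpperTriangular(A)

# A plain dense Matrix{Float64} is expected to pass the check.
ok_dense = LoopVectorization.check_args(A)

# Per the description above, the triangular wrapper is rejected ...
ok_wrapped = LoopVectorization.check_args(U)

# ... while its parent, the underlying Matrix, is accepted.
ok_parent = LoopVectorization.check_args(parent(U))
```

Note that the parent holds every stored entry, including those below the diagonal, so a loop over it does not respect the triangular structure.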

LoopVectorization.@tturbo — Macro

@tturbo

Equivalent to @turbo, except it adds thread=true as the first keyword argument. Note that later arguments take precedence.

Meant for convenience, as @tturbo is shorter than @turbo thread=true.
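
A minimal sketch of @tturbo on a reduction, assuming a hypothetical function name `tdot`. The number of threads actually used depends on how Julia was started.

```julia
using LoopVectorization

# Hypothetical helper: threaded, vectorized dot product.
function tdot(a, b)
    s = zero(eltype(a))
    @tturbo for i ∈ eachindex(a, b)
        s += a[i] * b[i]
    end
    return s
end

a = rand(100_000); b = rand(100_000)
threaded = tdot(a, b)

# Summation order differs from the serial version, so compare approximately.
agrees = threaded ≈ sum(a .* b)
```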

map-like constructs

LoopVectorization.vmap — Function

vmap(f, a::AbstractArray)
vmap(f, a::AbstractArray, b::AbstractArray, ...)

SIMD-vectorized map, applying f to each element of a (or paired elements of a, b, ...) and returning a new array.
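
Both call forms might be used as follows. This is a sketch with arbitrary inputs; the function `g` is an illustrative assumption and is written so that it works on scalars and SIMD vectors alike.

```julia
using LoopVectorization

g(x) = 2x + 1  # type-agnostic, so it accepts SIMD inputs

a = rand(1_000); b = rand(1_000)

single = vmap(g, a)     # one input array
paired = vmap(+, a, b)  # paired elements of a and b

single_ok = single ≈ map(g, a)
paired_ok = paired ≈ a .+ b
```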

LoopVectorization.vmap! — Function

vmap!(f, destination, a::AbstractArray)
vmap!(f, destination, a::AbstractArray, b::AbstractArray, ...)

Vectorized-map!, applying f to batches of elements of a (or paired batches of a, b, ...) and storing the result in destination.

The function f must accept VectorizationBase.AbstractSIMD inputs. Ideally, all this requires is making sure that f is defined to be agnostic with respect to input types, but if the function f contains branches or loops, more work will probably be needed. For example, a function

f(x) = x > 0 ? log(x) : inv(x)

can be rewritten into

using IfElse
f(x) = IfElse.ifelse(x > 0, log(x), inv(x))
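
A sketch of calling vmap! with the branch-free form, compared against the branching form applied with Base's map. The names `f_simd` and `f_scalar` are introduced here for illustration. The scalar reference keeps the ternary because IfElse.ifelse evaluates both arguments, and log of a negative scalar throws.

```julia
using LoopVectorization, IfElse

f_simd(x) = IfElse.ifelse(x > 0, log(x), inv(x))  # branch-free, for vmap!
f_scalar(x) = x > 0 ? log(x) : inv(x)             # branching, for the reference

x = randn(1_000)
y = similar(x)
vmap!(f_simd, y, x)

agrees = y ≈ map(f_scalar, x)
```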

LoopVectorization.vmapt — Function

vmapt(f, a::AbstractArray)
vmapt(f, a::AbstractArray, b::AbstractArray, ...)

A threaded variant of vmap.

LoopVectorization.vmapt! — Function

vmapt!(::Function, dest, args...)

A threaded variant of vmap!.

LoopVectorization.vmapnt — Function

vmapnt(f, a::AbstractArray)
vmapnt(f, a::AbstractArray, b::AbstractArray, ...)

A "non-temporal" variant of vmap. This can improve performance in cases where the returned array will not be needed soon.

LoopVectorization.vmapnt! — Function

vmapnt!(::Function, dest, args...)

This is a vectorized map implementation using nontemporal store operations. This means that the write operations to the destination will not go to the CPU's cache. If you will not immediately be reading from these values, this can improve performance because the writes won't pollute your cache. This can especially be the case if your arguments are very long.

julia> using LoopVectorization, BenchmarkTools

julia> x = rand(10^8); y = rand(10^8); z = similar(x);

julia> f(x, y) = exp(-0.5abs2(x - y))
f (generic function with 1 method)

julia> @benchmark map!(f, $z, $x, $y)

julia> @benchmark vmap!(f, $z, $x, $y)

julia> @benchmark vmapnt!(f, $z, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     439.613 ms (0.00% GC)
  median time:      440.729 ms (0.00% GC)
  mean time:        440.695 ms (0.00% GC)
  maximum time:     441.665 ms (0.00% GC)
  --------------
  samples:          12
  evals/sample:     1
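
Independently of timing, the variants should produce the same values. The sketch below checks this on smaller arrays than the benchmark uses; the sizes and the output names are arbitrary assumptions.

```julia
using LoopVectorization

f(x, y) = exp(-0.5abs2(x - y))

x = rand(10_000); y = rand(10_000)
z_base = similar(x); z_vmap = similar(x); z_nt = similar(x)

map!(f, z_base, x, y)    # Base reference
vmap!(f, z_vmap, x, y)   # vectorized
vmapnt!(f, z_nt, x, y)   # vectorized with nontemporal stores

vmap_ok = z_vmap ≈ z_base
nt_ok = z_nt ≈ z_base
```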

LoopVectorization.vmapntt — Function

vmapntt(f, a::AbstractArray)
vmapntt(f, a::AbstractArray, b::AbstractArray, ...)

A threaded variant of vmapnt.

LoopVectorization.vmapntt! — Function

vmapntt!(::Function, dest, args...)

A threaded variant of vmapnt!.

filter-like constructs

LoopVectorization.vfilter — Function

vfilter(f, a::AbstractArray)

SIMD-vectorized filter, returning an array containing the elements of a for which f returns true.

This function is only faster than Base.filter on CPUs with AVX512, because it relies on the compressstore instructions that AVX512 adds.
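
A minimal usage sketch, assuming a hypothetical predicate `keep` that is written so that it also works on SIMD inputs. The result is expected to match Base.filter regardless of whether the CPU makes it faster.

```julia
using LoopVectorization

keep(x) = x > 0.5  # comparison works on scalars and SIMD vectors

a = rand(1_000)
kept = vfilter(keep, a)

same = kept == filter(keep, a)
```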

LoopVectorization.vfilter! — Function

vfilter!(f, a::AbstractArray)

SIMD-vectorized filter!, removing the elements of a for which f returns false.

reduce-like constructs

VectorizationBase.vsum — Function

vsum(A::DenseArray)
vsum(f, A::DenseArray)

Vectorized version of sum. Providing a function as the first argument will apply the function to each element of A before summing.
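
Both forms might be called as in this sketch. The name is module-qualified here because this page does not state whether vsum is exported; that qualification is an assumption made for safety.

```julia
using LoopVectorization

a = rand(1_000)

plain = LoopVectorization.vsum(a)          # sum of the elements
squared = LoopVectorization.vsum(abs2, a)  # sum of abs2 applied to each element

plain_ok = plain ≈ sum(a)
squared_ok = squared ≈ sum(abs2, a)
```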

LoopVectorization.vreduce — Function

vreduce(op, A::DenseArray; [dims::Int])

Vectorized version of reduce. Reduces the array A using the operator op. At most one dimension may be supplied via the dims keyword argument.
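
A short sketch of whole-array reductions with two common operators. The dims keyword is left out here, and the input is arbitrary.

```julia
using LoopVectorization

a = rand(1_000)

total = vreduce(+, a)
largest = vreduce(max, a)

total_ok = total ≈ sum(a)          # summation order may differ, so use ≈
largest_ok = largest == maximum(a)
```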

LoopVectorization.vmapreduce — Function

vmapreduce(f, op, A::DenseArray...)

Vectorized version of mapreduce. Applies f to each element of the arrays A, and reduces the result with op.
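
The single-array and multi-array forms could look like the following sketch. The anonymous function and the inputs are illustrative assumptions.

```julia
using LoopVectorization

a = rand(1_000); b = rand(1_000)

sum_sq = vmapreduce(abs2, +, a)                   # sum of squares of a
dot_ab = vmapreduce((x, y) -> x * y, +, a, b)     # dot product of a and b

sum_sq_ok = sum_sq ≈ sum(abs2, a)
dot_ok = dot_ab ≈ sum(a .* b)
```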

Operators

LoopVectorization.:*ˡ — Function

A *ˡ B

A lazy product of A and B. While functionally identical to A * B, this may avoid the need for intermediate storage for any computations in A or B. Example:

@turbo @. a + B *ˡ (c + d')

which is equivalent to

a .+ B * (c .+ d')

It should only be used inside an @turbo block, and it cannot be the final operation, because a subsequent operation is needed to materialize the result.
