Annotate a `for` loop, or a set of nested `for` loops whose bounds are constant across iterations, to optimize the computation. For example:
```julia
function AmulB!(C, A, B)
    @turbo for m ∈ indices((A,C), 1), n ∈ indices((B,C), 2) # indices((A,C),1) == axes(A,1) == axes(C,1)
        Cₘₙ = zero(eltype(C))
        for k ∈ indices((A,B), (2,1)) # indices((A,B), (2,1)) == axes(A,2) == axes(B,1)
            Cₘₙ += A[m,k] * B[k,n]
        end
        C[m,n] = Cₘₙ
    end
end
```
The macro models the set of nested loops, and chooses an ordering of the three loops to minimize predicted computation time.
- It assumes that loop iterations are independent.
- It does not perform bounds checks.
- It assumes that each loop iterates at least once. (Use `@turbo check_empty=true` to lift this assumption.)
- It assumes that there is only one loop at each level of the nest.
It may also be applied to broadcasts:
```julia
julia> using LoopVectorization

julia> a = rand(100);

julia> b = @turbo exp.(2 .* a);

julia> c = similar(b);

julia> @turbo @. c = exp(2a);

julia> b ≈ c
true
```
Advanced users can customize the implementation of the `@turbo`-annotated block using keyword arguments:
```julia
@turbo inline=false unroll=2 body
```
where `body` is the code of the block (e.g., `for ... end`).
`thread` is either a Boolean or an integer. The integer's value indicates the number of threads to use. It is clamped to be between `1` and `min(Threads.nthreads(), LoopVectorization.num_cores())`. `false` is equivalent to `1`, and `true` is equivalent to `min(Threads.nthreads(), LoopVectorization.num_cores())`.
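For illustration, here is a minimal sketch of passing `thread` to a simple dot-product kernel (the function `dot_threaded` and the kernel itself are illustrative, not part of the documented API):

```julia
using LoopVectorization

function dot_threaded(x, y)
    s = zero(promote_type(eltype(x), eltype(y)))
    @turbo thread=true for i ∈ indices((x, y), 1)  # thread=4 would cap the thread count at 4
        s += x[i] * y[i]
    end
    return s
end
```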
`inline` is a Boolean. When `true`, `body` will be directly inlined into the function (via a forced-inlining call to `_turbo_!`). When `false`, it won't force inlining of the call to `_turbo_!`, instead letting Julia's own inlining engine determine whether the call to `_turbo_!` should be inlined. (Typically, it won't.) Sometimes not inlining can lead to substantially worse code generation, and >40% regressions, even in very large problems (2-d convolutions are a case where this has been observed). One can find some circumstances where `inline=true` is faster, and other circumstances where `inline=false` is faster, so the best setting may require experimentation. By default, the macro tries to guess. Currently the algorithm is simple: roughly, if there are more than two dynamically sized loops and no convolutions, it will probably not force inlining. Otherwise, it probably will.
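As a sketch of how one might experiment with this keyword (the kernel `selfdot` is hypothetical):

```julia
using LoopVectorization

function selfdot(x)
    s = zero(eltype(x))
    @turbo inline=true for i ∈ eachindex(x)  # force inlining of the `_turbo_!` call
        s += x[i] * x[i]
    end
    return s
end
```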
`check_empty` (default is `false`) determines whether or not it will check if any of the iterators are empty. If `false`, you must ensure yourself that they are not empty, else the behavior of the loop is undefined and (like with `@inbounds`) segmentation faults are likely.
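For example, a minimal sketch where the iterator might be empty (the function `safesum` is illustrative):

```julia
using LoopVectorization

function safesum(x)
    s = zero(eltype(x))
    @turbo check_empty=true for i ∈ eachindex(x)  # safe even when `x` is empty
        s += x[i]
    end
    return s
end
```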
`unroll` is an integer that specifies the loop unrolling factor, or a tuple `(u₁, u₂) = (4, 2)` signaling that the generated code should unroll more than one loop. `u₁` is the unrolling factor for the first unrolled loop and `u₂` for the next (if present), but it applies to the loop ordering and unrolling that will be chosen by LoopVectorization, not the order in `body`. `uᵢ=0` (the default) indicates that LoopVectorization should pick its own value, and `uᵢ=-1` disables unrolling for the corresponding loop.
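For instance, a hedged sketch reusing the `AmulB!` pattern from above with an explicit unrolling hint (the factors shown are illustrative, not a recommendation):

```julia
using LoopVectorization

function AmulB_unrolled!(C, A, B)
    @turbo unroll=(4, 2) for m ∈ indices((A, C), 1), n ∈ indices((B, C), 2)
        Cₘₙ = zero(eltype(C))
        for k ∈ indices((A, B), (2, 1))
            Cₘₙ += A[m, k] * B[k, n]
        end
        C[m, n] = Cₘₙ
    end
    return C
end
```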
The `@turbo` macro also checks the array arguments using `LoopVectorization.check_args` to try to determine if they are compatible with the macro. If `check_args` returns `false`, a fallback loop annotated with `@inbounds` and `@fastmath` is generated. Note that `VectorizationBase` provides functions such as `vmul` that will ignore `@fastmath`, preserving IEEE semantics both within `@turbo` and `@fastmath`. `check_args` currently returns `false` for some wrapper types like `LinearAlgebra.UpperTriangular`, requiring you to use their `parent` instead. Triangular loops aren't yet supported.
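As a small illustration of the wrapper-type caveat (assuming `check_args` is called directly on the arrays; this snippet is a sketch, not from the original docs):

```julia
using LoopVectorization, LinearAlgebra

U = UpperTriangular(rand(8, 8))
LoopVectorization.check_args(U)          # false: the wrapper is rejected, so @turbo would fall back
LoopVectorization.check_args(parent(U))  # true: the underlying dense Matrix is supported
```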
Setting the keyword argument `warn_check_args=true` (e.g., `@turbo warn_check_args=true for ...`) in a loop or broadcast statement will cause it to warn once if `LoopVectorization.check_args` fails and the fallback loop is executed instead of the LoopVectorization-optimized loop. Setting it to an integer > 0 will warn that many times, while setting it to a negative integer will warn an unlimited number of times. The default is `warn_check_args = 0`.
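A sketch of opting into the warning (the function `double!` is illustrative):

```julia
using LoopVectorization

function double!(y, x)
    @turbo warn_check_args=true for i ∈ eachindex(y, x)  # warn if the fallback loop is taken
        y[i] = 2 * x[i]
    end
    return y
end
```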
`@tturbo` is equivalent to `@turbo`, except it adds `thread=true` as the first keyword argument. Note that later arguments take precedence. It is meant for convenience, as `@tturbo` is shorter than `@turbo thread=true`.
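As a quick sketch (the function `dot_tturbo` is illustrative), the macro is used exactly like `@turbo`:

```julia
using LoopVectorization

function dot_tturbo(x, y)
    s = zero(promote_type(eltype(x), eltype(y)))
    @tturbo for i ∈ indices((x, y), 1)  # same as `@turbo thread=true for ...`
        s += x[i] * y[i]
    end
    return s
end
```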
```julia
vmap(f, a::AbstractArray)
vmap(f, a::AbstractArray, b::AbstractArray, ...)
```
SIMD-vectorized `map`, applying `f` to each element of `a` (or paired elements of `a`, `b`, ...) and returning a new array.
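For example (an illustrative usage sketch, not from the docstring):

```julia
using LoopVectorization

x = rand(1000); y = rand(1000);
z = vmap(+, x, y)   # SIMD-vectorized elementwise sum
z ≈ x .+ y          # true
```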
```julia
vmap!(f, destination, a::AbstractArray)
vmap!(f, destination, a::AbstractArray, b::AbstractArray, ...)
```
Vectorized `map!`, applying `f` to batches of elements of `a` (or paired batches of `a`, `b`, ...) and storing the result in `destination`.
The function `f` must accept `VectorizationBase.AbstractSIMD` inputs. Ideally, all this requires is making sure that `f` is defined to be agnostic with respect to input types, but if the function `f` contains branches or loops, more work will probably be needed. For example, a function
```julia
f(x) = x > 0 ? log(x) : inv(x)
```
can be rewritten into
```julia
using IfElse
f(x) = IfElse.ifelse(x > 0, log(x), inv(x))
```
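A sketch of calling `vmap!` with a branchless function like the one above (the names `g`, `a`, and `dest` are illustrative):

```julia
using LoopVectorization, IfElse

g(x) = IfElse.ifelse(x > 0, log(x), inv(x))  # branchless, SIMD-compatible

a = randn(10^4);
dest = similar(a);
vmap!(g, dest, a)   # applies g to SIMD batches of a, writing into dest
```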
```julia
vmapnt(f, a::AbstractArray)
vmapnt(f, a::AbstractArray, b::AbstractArray, ...)
```
A "non-temporal" variant of
vmap. This can improve performance in cases where
destination will not be needed soon.
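A minimal out-of-place usage sketch (illustrative):

```julia
using LoopVectorization

x = rand(10^6); y = rand(10^6);
z = vmapnt(+, x, y)  # like vmap, but the result is written with non-temporal stores
```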
```julia
vmapnt!(::Function, dest, args...)
```
This is a vectorized map implementation using nontemporal store operations. This means that the write operations to the destination will not go to the CPU's cache. If you will not immediately be reading from these values, this can improve performance because the writes won't pollute your cache. This can especially be the case if your arguments are very long.
```julia
julia> using LoopVectorization, BenchmarkTools

julia> x = rand(10^8); y = rand(10^8); z = similar(x);

julia> f(x,y) = exp(-0.5abs2(x - y))
f (generic function with 1 method)

julia> @benchmark map!(f, $z, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     439.613 ms (0.00% GC)
  median time:      440.729 ms (0.00% GC)
  mean time:        440.695 ms (0.00% GC)
  maximum time:     441.665 ms (0.00% GC)
  --------------
  samples:          12
  evals/sample:     1

julia> @benchmark vmap!(f, $z, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     178.147 ms (0.00% GC)
  median time:      178.381 ms (0.00% GC)
  mean time:        178.430 ms (0.00% GC)
  maximum time:     179.054 ms (0.00% GC)
  --------------
  samples:          29
  evals/sample:     1

julia> @benchmark vmapnt!(f, $z, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     144.183 ms (0.00% GC)
  median time:      144.338 ms (0.00% GC)
  mean time:        144.349 ms (0.00% GC)
  maximum time:     144.641 ms (0.00% GC)
  --------------
  samples:          35
  evals/sample:     1
```
```julia
vmapntt(f, a::AbstractArray)
vmapntt(f, a::AbstractArray, b::AbstractArray, ...)
```
A threaded variant of `vmapnt`.
```julia
vmapntt!(::Function, dest, args...)
```
A threaded variant of `vmapnt!`.
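A short usage sketch (illustrative; Julia must be started with multiple threads for the threading to matter):

```julia
using LoopVectorization

h(a, b) = exp(-0.5abs2(a - b))

x = rand(10^7); y = rand(10^7); z = similar(x);
vmapntt!(h, z, x, y)   # threaded, with non-temporal stores into z
```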
`vfilter(f, a::AbstractArray)` is a SIMD-vectorized `filter`, returning an array containing the elements of `a` for which `f` returns `true`. This function requires AVX512 to be faster than `Base.filter`, because it relies on the compressed-store instructions that AVX512 adds.
`vfilter!(f, a)` is a SIMD-vectorized `filter!`, removing the elements of `a` for which `f` is `false`.
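An illustrative sketch of both forms (the speedup over `Base.filter` assumes an AVX512-capable CPU; the results are the same either way):

```julia
using LoopVectorization

x = randn(10^5);
pos = vfilter(x -> x > 0, x)   # new array holding only the positive elements
vfilter!(x -> x > 0, x)        # in place: x now holds only its positive elements
pos == x                       # true
```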
```julia
vreduce(op, A::DenseArray...)
```
Vectorized version of `reduce`. Reduces the array `A` using the operator `op`.
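For example (illustrative):

```julia
using LoopVectorization

x = rand(10^6);
vreduce(+, x) ≈ sum(x)         # true
vreduce(max, x) == maximum(x)  # true
```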
```julia
vmapreduce(f, op, A::DenseArray...)
```
Vectorized version of `mapreduce`. Applies `f` to each element of the arrays `A`, and reduces the result with `op`.
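For example (illustrative):

```julia
using LoopVectorization

x = rand(10^6); y = rand(10^6);
vmapreduce(abs2, +, x) ≈ sum(abs2, x)   # true
vmapreduce(*, +, x, y) ≈ sum(x .* y)    # a SIMD-vectorized dot product
```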