Convenient Vectorized Functions
vmap
This is simply a vectorized map function.
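For example (illustrative names; f, x, and y are just placeholders), the non-mutating form returns a newly allocated result, while vmap!, used in the benchmarks below, writes into a preallocated destination:
julia> using LoopVectorization
julia> f(x, y) = exp(-0.5abs2(x - y))
f (generic function with 1 method)
julia> x = rand(8); y = rand(8);
julia> vmap(f, x, y) ≈ map(f, x, y)   # ≈ because SIMD special functions may differ from Base in the last bit
true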
vmapnt and vmapntt
These are like vmap, but use non-temporal (streaming) stores into the destination to avoid polluting the cache, which is likely to yield a performance increase if you won't be reading the stored values again soon. The extra t in vmapntt indicates that it is additionally threaded.
julia> using LoopVectorization, BenchmarkTools
julia> f(x,y) = exp(-0.5abs2(x - y))
f (generic function with 1 method)
julia> x = rand(10^8); y = rand(10^8); z = similar(x);
julia> @benchmark map!(f, $z, $x, $y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 442.614 ms (0.00% GC)
median time: 443.750 ms (0.00% GC)
mean time: 443.664 ms (0.00% GC)
maximum time: 444.730 ms (0.00% GC)
--------------
samples: 12
evals/sample: 1
julia> @benchmark vmap!(f, $z, $x, $y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 177.257 ms (0.00% GC)
median time: 177.380 ms (0.00% GC)
mean time: 177.423 ms (0.00% GC)
maximum time: 177.956 ms (0.00% GC)
--------------
samples: 29
evals/sample: 1
julia> @benchmark vmapnt!(f, $z, $x, $y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 143.521 ms (0.00% GC)
median time: 143.639 ms (0.00% GC)
mean time: 143.645 ms (0.00% GC)
maximum time: 143.821 ms (0.00% GC)
--------------
samples: 35
evals/sample: 1
julia> Threads.nthreads()
36
julia> @benchmark vmapntt!(f, $z, $x, $y)
BenchmarkTools.Trial:
memory estimate: 25.69 KiB
allocs estimate: 183
--------------
minimum time: 30.065 ms (0.00% GC)
median time: 30.130 ms (0.00% GC)
mean time: 30.146 ms (0.00% GC)
maximum time: 31.277 ms (0.00% GC)
--------------
samples: 166
evals/sample: 1
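A typical pattern for the streaming variants is to fill a large output once and only read it again much later (for example in a separate pass over memory), so the cache lines holding the destination are not displaced needlessly. A minimal sketch reusing the definitions above; the comparison is only a correctness check, not part of the benchmark:
julia> using LoopVectorization
julia> f(x, y) = exp(-0.5abs2(x - y));
julia> x = rand(10^6); y = rand(10^6); z = similar(x);
julia> vmapnt!(f, z, x, y) ≈ map(f, x, y)   # non-temporal stores still produce the same values
true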
vfilter
This function requires LLVM 7 or greater, and it is only likely to give better performance if your CPU has AVX512. This is because it uses the compressed store intrinsic, which was added in LLVM 7. AVX512 provides a corresponding instruction, making the operation fast, while other instruction sets must emulate it, and are thus likely to see performance from LoopVectorization.vfilter similar to that of Base.filter.
julia> using LoopVectorization, BenchmarkTools
julia> x = rand(997);
julia> y1 = filter(a -> a > 0.7, x);
julia> y2 = vfilter(a -> a > 0.7, x);
julia> y1 == y2
true
julia> @benchmark filter(a -> a > 0.7, $x)
BenchmarkTools.Trial:
memory estimate: 7.94 KiB
allocs estimate: 1
--------------
minimum time: 955.389 ns (0.00% GC)
median time: 1.050 μs (0.00% GC)
mean time: 1.191 μs (9.72% GC)
maximum time: 82.799 μs (94.92% GC)
--------------
samples: 10000
evals/sample: 18
julia> @benchmark vfilter(a -> a > 0.7, $x)
BenchmarkTools.Trial:
memory estimate: 7.94 KiB
allocs estimate: 1
--------------
minimum time: 477.487 ns (0.00% GC)
median time: 575.166 ns (0.00% GC)
mean time: 711.526 ns (17.87% GC)
maximum time: 9.257 μs (79.17% GC)
--------------
samples: 10000
evals/sample: 193
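The predicate can be any scalar function returning a Bool; for instance, a closure over a threshold works just as well as the anonymous function above (threshold is a hypothetical helper, named only for this sketch):
julia> using LoopVectorization
julia> threshold(t) = a -> a > t;   # hypothetical helper: returns a predicate closing over t
julia> x = rand(997);
julia> vfilter(threshold(0.5), x) == filter(threshold(0.5), x)
true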
vmapreduce
Vectorized version of mapreduce. vmapreduce(f, op, a, b, c) applies f(a[i], b[i], c[i]) for i in eachindex(a, b, c), reducing the results to a scalar with op.
julia> using LoopVectorization, BenchmarkTools
julia> x = rand(127); y = rand(127);
julia> @btime vmapreduce(hypot, +, $x, $y)
191.420 ns (0 allocations: 0 bytes)
96.75538300513509
julia> @btime mapreduce(hypot, +, $x, $y)
1.777 μs (5 allocations: 1.25 KiB)
96.75538300513509
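Assuming the single-array method vmapreduce(f, op, a) is also available, the same pattern gives, for example, a sum of squares; ≈ rather than == is used because SIMD evaluation may reorder the additions:
julia> using LoopVectorization
julia> x = rand(127);
julia> vmapreduce(abs2, +, x) ≈ mapreduce(abs2, +, x)   # assumes the single-array method; reduction order may differ
true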
vsum
Vectorized version of sum. vsum(f, a) applies f(a[i]) for i in eachindex(a), then sums the results.
julia> using LoopVectorization, BenchmarkTools
julia> x = rand(127);
julia> @btime vsum(hypot, $x)
12.095 ns (0 allocations: 0 bytes)
66.65246070098374
julia> @btime sum(hypot, $x)
16.992 ns (0 allocations: 0 bytes)
66.65246070098372
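Since vsum(f, a) is a +-reduction of f over a, it should agree with vmapreduce from the previous section up to floating-point reassociation; a quick sanity check:
julia> using LoopVectorization
julia> x = rand(127);
julia> vsum(hypot, x) ≈ vmapreduce(hypot, +, x)   # same reduction, possibly different summation order
true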