# Multithreading

LoopVectorization can multithread loops if you pass the argument `@turbo thread=true for ... end`

or equivalently use `@tturbo`

. By default, `thread = false`

, which runs only a single thread. You can also supply a numerical argument to set an upper bound on the number of threads, e.g. `@turbo thread=8 for ... end`

will use up to `min(8,Threads.nthreads(),VectorizationBase.num_cores())`

threads. `VectorizationBase.num_cores()`

uses Hwloc.jl to get the number of physical cores. Currently, this only works for `for`

loops, but support for broadcasting will come.

Lets look at a few benchmarks.

Taking the first example from the ThreadsX.jl README:

```
function relative_prime_count(x, N)
c = 0
@tturbo for i ∈ 1:N
c += gcd(x, i) == 1
end
c
end
```

Benchmarking them:

```
julia> @btime ThreadsX.sum(gcd(42, i) == 1 for i in 1:10_000)
130.928 μs (3097 allocations: 240.39 KiB)
2857
julia> @btime relative_prime_count(42, 10_000)
3.376 μs (0 allocations: 0 bytes)
2857
```

Note that much of the performance difference here is thanks to SIMD, which requires AVX512 for good performance (`trailing_zeros`

, required by `gcd`

, needs AVX512 for a SIMD version). LoopVectorization is a good choice for loops (a) amenable to SIMD (b) where all arrays are dense and (c) a static schedule would work well. Generally, this means loops built up of relatively primitive arithmetic operations (e.g. `+`

, `/`

, or `log`

), and not, for example, solving differential equations.

I'll make comparisons with OpenMP through the rest of this, starting with a simple dot product to focus on threading overhead:

```
function dot_tturbo(a::AbstractArray{T}, b::AbstractArray{T}) where {T <: Real}
s = zero(T)
@tturbo for i ∈ eachindex(a,b)
s += a[i] * b[i]
end
s
end
function dotbaseline(a::AbstractArray{T}, b::AbstractArray{T}) where {T}
s = zero(T)
@fastmath @inbounds @simd for i ∈ eachindex(a,b)
s += a[i]' * b[i]
end
s
end
```

In `C`

:

```
#include<omp.h>
// gcc -Ofast -march=native -mprefer-vector-width=512 -fopenmp -shared -fPIC openmp.c -o libomptest.so
double dot(double* a, double* b, long N){
double s = 0.0;
#pragma omp parallel for reduction(+: s)
for(long n = 0; n < N; n++){
s += a[n]*b[n];
}
return s;
}
```

Wrapping it in Julia is straightforward, after compiling:

```
using Libdl; const OPENMPTEST = joinpath(pkgdir(LoopVectorization), "benchmark", "libomptest.$(Libdl.dlext)");
cdot(a::AbstractVector{Float64},b::AbstractVector{Float64}) = @ccall OPENMPTEST.dot(a::Ref{Float64}, b::Ref{Float64}, length(a)::Clong)::Float64
```

Trying out one size to give a perspective on scale:

```
julia> N = 10_000; x = rand(N); y = rand(N);
julia> @btime dot($x, $y) # LinearAlgebra
1.114 μs (0 allocations: 0 bytes)
2480.296446711209
julia> @btime dot_turbo($x, $y)
761.621 ns (0 allocations: 0 bytes)
2480.296446711209
julia> @btime dot_tturbo($x, $y)
622.723 ns (0 allocations: 0 bytes)
2480.296446711209
julia> @btime dot_baseline($x, $y)
1.294 μs (0 allocations: 0 bytes)
2480.2964467112097
julia> @btime cdot($x, $y)
6.109 μs (0 allocations: 0 bytes)
2480.2964467112092
```

All these times are fairly fast; `wait(Threads.@spawn 1+1)`

will typically take much longer than even `@cdot`

did here.

Now let's look at a more complex example:

```
function dot_tturbo(ca::AbstractVector{Complex{T}}, cb::AbstractVector{Complex{T}}) where {T}
a = reinterpret(reshape, T, ca)
b = reinterpret(reshape, T, cb)
re = zero(T); im = zero(T)
@tturbo for i ∈ axes(a,2) # adjoint(a[i]) * b[i]
re += a[1,i] * b[1,i] + a[2,i] * b[2,i]
im += a[1,i] * b[2,i] - a[2,i] * b[1,i]
end
Complex(re, im)
end
```

LoopVectorization currently only supports arrays of type `T <: Union{Bool,Base.HWReal}`

. So to support `Complex{T}`

, we reinterpret the arrays and then write out the corresponding operations. The plan is to eventually have LoopVectorization do this automatically, but for now we require this workaround. Corresponding `C`

:

```
void cdot(double* c, double* a, double* b, long N){
double r = 0.0, i = 0.0;
#pragma omp parallel for reduction(+: r, i)
for(long n = 0; n < N; n++){
r += a[2*n] * b[2*n ] + a[2*n+1] * b[2*n+1];
i += a[2*n] * b[2*n+1] - a[2*n+1] * b[2*n ];
}
c[0] = r;
c[1] = i;
return;
}
```

The Julia wrapper:

```
function cdot(x::AbstractVector{Complex{Float64}}, y::AbstractVector{Complex{Float64}})
c = Ref{Complex{Float64}}()
@ccall OPENMPTEST.cdot(c::Ref{Complex{Float64}}, x::Ref{Complex{Float64}}, y::Ref{Complex{Float64}}, length(x)::Clong)::Cvoid
c[]
end
```

The complex dot product is more compute bound. Given the same number of elements, we require `2x`

the memory for complex numbers, `4x`

the floating point arithmetic, and as we have an array of structs rather than structs of arrays, we need additional instructions to shuffle the data.

If we take this further to the three-argument dot product, which isn't implemented in BLAS, `@tturbo`

now holds a substantial advantage over the competition:

```
function dot3(x::AbstractVector{Complex{T}}, A::AbstractMatrix{Complex{T}}, y::AbstractVector{Complex{T}}) where {T}
xr = reinterpret(reshape, T, x);
yr = reinterpret(reshape, T, y);
Ar = reinterpret(reshape, T, A);
sre = zero(T)
sim = zero(T)
@tturbo for n in axes(Ar,3)
tre = zero(T)
tim = zero(T)
for m in axes(Ar,2)
tre += xr[1,m] * Ar[1,m,n] + xr[2,m] * Ar[2,m,n]
tim += xr[1,m] * Ar[2,m,n] - xr[2,m] * Ar[1,m,n]
end
sre += tre * yr[1,n] - tim * yr[2,n]
sim += tre * yr[2,n] + tim * yr[1,n]
end
Complex(sre, sim)
end
```

```
void cdot3(double* c, double* x, double* A, double* y, long M, long N){
double sr = 0.0, si = 0.0;
#pragma omp parallel for reduction(+: sr, si)
for (long n = 0; n < N; n++){
double tr = 0.0, ti = 0.0;
for(long m = 0; m < M; m++){
tr += x[2*m] * A[2*m + 2*n*N] + x[2*m+1] * A[2*m+1 + 2*n*N];
ti += x[2*m] * A[2*m+1 + 2*n*N] - x[2*m+1] * A[2*m + 2*n*N];
}
sr += tr * y[2*n ] - ti * y[2*n+1];
si += tr * y[2*n+1] + ti * y[2*n ];
}
c[0] = sr;
c[1] = si;
return;
}
```

The wrapper is more or less the same as before:

```
function cdot(x::AbstractVector{Complex{Float64}}, A::AbstractMatrix{Complex{Float64}}, y::AbstractVector{Complex{Float64}})
c = Ref{Complex{Float64}}()
M, N = size(A)
@ccall OPENMPTEST.cdot3(c::Ref{Complex{Float64}}, x::Ref{Complex{Float64}}, A::Ref{Complex{Float64}}, y::Ref{Complex{Float64}}, M::Clong, N::Clong)::Cvoid
c[]
end
```

When testing on my laptop, the `C`

implentation ultimately won, but I will need to investigate further to tell whether this benchmark benefits from hyperthreading, or if it's because LoopVectorization's memory access patterns are less friendly. I plan to work on cache-level blocking to increase memory friendliness eventually, and will likely also allow it to take advantage of hyperthreading/simultaneous multithreading, although I'd prefer a few motivating test problems to look at first. Note that a single core of this CPU is capable of exceeding 100 GFLOPS of double precision compute. The execution units are spending most of their time idle. So the question of whether hypthreading helps may be one of whether or not we are memory-limited.

For a more compute-limited operation, lets look at matrix multiplication, which requires `O(N³)`

compute for `O(N²)`

memory. Note that it's still easy to be memory-starved in matrix multiplication, especially for larger matrices. While the total memory required may be `O(N²)`

, if the memory doesn't fit in the high cache levels, it will have to churn through it. The memory bandwidth requirements are thus `O(N³)`

, but cache-level blocking can give it a small enough coefficient that you can make the most of your CPU's theoretical compute. Unlike all the dot product cases (including the 3-argument dot product), which force you to stream most of the memory through the cores. There is no reuse on `x`

and `y`

for the 2-arg dot products, or on memory from `A`

in the the 3-arg dot product.

Here, I compare against other libraries: Intel MKL, OpenBLAS (Julia's default), and two Julia libraries: Tullio.jl and Octavian.jl.

```
function A_mul_B!(C, A, B)
@tturbo for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
Cmn = zero(eltype(C))
for k ∈ indices((A,B), (2,1))
Cmn += C[m,k] * B[k,n]
end
C[m,n] = Cmn
end
end
```

Benchmarks over the size range `10:5:300`

:

Because LoopVectorization doesn't do cache optimizations yet, MKL, OpenBLAS, and Octavian will all pull ahead for larger matrices. This CPU has a 1 MiB L2 cache per core and 18 cores:

```
julia> doubles_per_l2 = (2 ^ 20) ÷ 8
131072
julia> total_doubles_in_l2 = doubles_per_l2 * (Sys.CPU_THREADS ÷ 2) # doubles_per_l2 * 18
2359296
julia> doubles_per_mat = total_doubles_in_l2 ÷ 3 # divide up amoung 3 matrices
786432
julia> sqrt(ans)
886.8100134752651
```

Meaning we could fit three 886x886 matrices in our L2 cache by splitting them up among the cores. The largest matrices benchmarked above, at 300x300, fit comfortably.

Aside from the fact that LoopVectorization did much better than OpenBLAS–Julia's default library–over this size range, LoopVectorization's major advantage that it should perform similarly well for a wide variety of comparable operations and not just GEMM (GEneral Matrix-Matrix multiplication) specifically. GEMM has long been a motivating benchmark, as it's one of the best optimized routines available to compare against and get a sense of how well you're doing vs hand-tuned limits optimized in assembly.

Because it is so well optimized, a standard trick for implementing more general optimized routines is to convert them into GEMM calls. For example, this is commonly done for tensor operations (see, e.g., TensorOperations.jl) as well as for convolutions, e.g. in NNlib's `conv_im2col!`

, their default optimized convolution function.

Lets take a look at convolutions as our next example. We create a batch of a hundred 256x256 images with 3 input channels, and convolve them with a 5x5 kernel producing 6 output channels.

```
using NNlib, LoopVectorization, Static
img = rand(Float32, 260, 260, 3, 100);
kern = rand(Float32, 5, 5, 3, 6);
out1 = Array{Float32}(undef, size(img,1)+1-size(kern,1), size(img,2)+1-size(kern,2), size(kern,4), size(img,4));
out2 = similar(out1);
dcd = NNlib.DenseConvDims(img, kern, flipkernel = true);
function kernaxes(::DenseConvDims{2,K,C_in, C_out}) where {K,C_in, C_out} # LoopVectorization can take advantage of static size information
K₁ = StaticInt(1):StaticInt(K[1])
K₂ = StaticInt(1):StaticInt(K[2])
Cᵢₙ = StaticInt(1):StaticInt(C_in)
Cₒᵤₜ = StaticInt(1):StaticInt(C_out)
(K₁, K₂, Cᵢₙ, Cₒᵤₜ)
end
function convlayer!(out::AbstractArray{<:Any,4}, img, kern, dcd::DenseConvDims)
(K₁, K₂, Cᵢₙ, Cₒᵤₜ) = kernaxes(dcd)
@tturbo for j₁ ∈ axes(out,1), j₂ ∈ axes(out,2), d ∈ axes(out,4), o ∈ Cₒᵤₜ
s = zero(eltype(out))
for k₁ ∈ K₁, k₂ ∈ K₂, i ∈ Cᵢₙ
s += img[j₁ + k₁ - 1, j₂ + k₂ - 1, i, d] * kern[k₁, k₂, i, o]
end
out[j₁, j₂, o, d] = s
end
out
end
```

LoopVectorization likes to take advantage of any static size information when available, so we write `kernaxes`

to extract them from the `DenseConvDims`

object and produce statically sized axes. Otherwise, this code is simply writing the convolutions as a bunch of loops.

This yields:

```
julia> NNlib.conv!(out2, img, kern, dcd);
julia> convlayer!(out1, img, kern, dcd);
julia> out1 ≈ out2
true
julia> @benchmark convlayer!($out1, $img, $kern, $dcd)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 5.377 ms (0.00% GC)
median time: 5.432 ms (0.00% GC)
mean time: 5.433 ms (0.00% GC)
maximum time: 5.682 ms (0.00% GC)
--------------
samples: 920
evals/sample: 1
julia> @benchmark NNlib.conv!($out2, $img, $kern, $dcd)
BenchmarkTools.Trial:
memory estimate: 675.02 MiB
allocs estimate: 195
--------------
minimum time: 182.749 ms (0.00% GC)
median time: 190.472 ms (0.60% GC)
mean time: 197.527 ms (4.98% GC)
maximum time: 300.536 ms (35.82% GC)
--------------
samples: 26
evals/sample: 1
```

By default, the BLAS library called uses multiple threads, but `NNlib`

also threads over the batches using `Threads.@threads`

. This oversubscribes the threads. We thus improve performance by forcing `BLAS`

to use just a single thread, favoring the more granular threading across batches:

```
julia> using LinearAlgebra
julia> BLAS.set_num_threads(1)
julia> @benchmark NNlib.conv!($out2, $img, $kern, $dcd)
BenchmarkTools.Trial:
memory estimate: 675.02 MiB
allocs estimate: 195
--------------
minimum time: 124.177 ms (0.00% GC)
median time: 128.609 ms (0.93% GC)
mean time: 133.574 ms (5.36% GC)
maximum time: 235.760 ms (45.17% GC)
--------------
samples: 38
evals/sample: 1
```

This still nets `@tturbo`

a 23x advantage on this machine!

## FAQ

#### If I do `@turbo thread=true for ... end`

, how many threads will it use? Or if I do `@turbo thread=4 for ... end`

, what then?

LoopVectorization will choose how many threads to use based on the length of the loop ranges and how expensive it estimates evaluating the loop to be. It will at most use one thread per physical core of the system.

#### How do I get answers to my questions?

Feel free to ask on Discourse, Zulip, Slack, or GitHub Discussions! I can also add it to the FAQ here, or one in an appropriate section.