Polyester
Polyester.reset_threads! — Method
Polyester.reset_threads!()
Resets the threads used by Polyester.jl.
Polyester.@batch — Macro
@batch for i in Iter; ...; end
Evaluate the loop on multiple threads.
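As a minimal sketch of the basic form, the following runs an in-place update across threads. The function name axpy_batch! and the sample arrays are made up for illustration; only @batch comes from Polyester.

```julia
using Polyester

# Hypothetical helper: y[i] = a * x[i] + y[i], with iterations split across threads.
# Each iteration writes to a distinct y[i], so there are no data dependencies.
function axpy_batch!(y, a, x)
    @batch for i in eachindex(y, x)
        y[i] = muladd(a, x[i], y[i])
    end
    return y
end

x = collect(1.0:8.0)
y = ones(8)
axpy_batch!(y, 2.0, x)
```

Start Julia with more than one thread (for example via --threads) for the loop to actually run in parallel.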
@batch minbatch=N for i in Iter; ...; end
Evaluate at least N iterations per thread. Will use at most length(Iter) ÷ N threads.
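To illustrate the arithmetic, here is a small sketch assuming a cheap loop body where threading overhead would dominate for short ranges. The function name square_all! and the sizes are illustrative choices, not part of Polyester.

```julia
using Polyester

# Hypothetical helper: square every element in parallel.
function square_all!(out, x)
    # With minbatch=1000 and 2500 iterations, at most 2500 ÷ 1000 = 2 threads are used.
    @batch minbatch=1000 for i in eachindex(out, x)
        out[i] = x[i]^2
    end
    return out
end

x = collect(1:2500)
out = similar(x)
square_all!(out, x)
```

A suitable value of N depends on how expensive each iteration is, so it is usually found by benchmarking.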
@batch threadlocal=init() for i in Iter; ...; end
Create thread-local storage for use in the loop. The init function is called once at the start on each thread. Inside the loop, threadlocal refers to the storage local to the current thread. After the loop, threadlocal is a vector containing all the thread-local values. A type can be specified with threadlocal=init()::Type.
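The following sketch shows one way the thread-local storage might be used to accumulate a sum. The function name threadlocal_sum and the choice of Float64 are assumptions made for this example; the type annotation follows the threadlocal=init()::Type form described above.

```julia
using Polyester

# Hypothetical helper: each thread accumulates its own partial sum.
function threadlocal_sum(x::Vector{Float64})
    # 0.0 is the per-thread initial value; ::Float64 fixes the storage type.
    @batch threadlocal=0.0::Float64 for i in eachindex(x)
        threadlocal += x[i]
    end
    # After the loop, threadlocal is a vector of the per-thread partial sums.
    return sum(threadlocal)
end

total = threadlocal_sum(collect(1.0:100.0))
```

The final sum over the vector of partial values is done serially after the loop has finished.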
@batch reduction=((op1, var1), (op2, var2), ...) for i in Iter; ...; end
Perform OpenMP-esque reduction on the isbits variables var1, var2, ... using the operations op1, op2, .... The variables have to be initialized before the loop and cannot be a field access or indexing expression such as x.y or x[i]. Supported operations are +, *, min, max, &, and |. The type does not have to be provided, since it is inferred from the initialized variables; take care that the type remains consistent throughout the loop. While threadlocal can do the same thing, reduction does not incur additional allocations and is generally more efficient for its purpose. It is up to the user to ensure that there are no data dependencies between iterations, which could lead to incorrect results.
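A short sketch of two reductions in one loop, assuming a Polyester version that supports the reduction option. The function and variable names are illustrative only.

```julia
using Polyester

# Hypothetical helper: sum and product of 1:9 computed in a single parallel loop.
function sum_and_product()
    s = 0   # initialized before the loop, as required
    p = 1
    @batch reduction=((+, s), (*, p)) for i in 1:9
        s += i   # stays an Int throughout, so the type is consistent
        p *= i
    end
    return s, p
end

s, p = sum_and_product()   # (45, 362880)
```

Both variables are plain local integers, which satisfies the isbits requirement and avoids the x.y or x[i] forms that are not allowed.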
@batch per=core for i in Iter; ...; end
@batch per=thread for i in Iter; ...; end
Use at most one thread per physical core, or one thread per CPU thread, respectively. One thread per core means fewer threads competing for the cache, while (for example) if there are two hardware threads per physical core, then using each thread means that there are two independent instruction streams feeding the CPU's execution units. When one of these streams isn't enough to make the most of out-of-order execution, this could increase total throughput.
Which performs better will depend on the workload, so if you're not sure it may be worth benchmarking both.
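One rough way to compare the two settings is sketched below. The function names and the workload are hypothetical, and @elapsed gives only a coarse single-shot timing; a dedicated benchmarking package would give more reliable numbers.

```julia
using Polyester

# Two hypothetical variants of the same loop, differing only in the per option.
function scale_per_core!(y, x)
    @batch per=core for i in eachindex(y, x)
        y[i] = 3 * x[i]
    end
    return y
end

function scale_per_thread!(y, x)
    @batch per=thread for i in eachindex(y, x)
        y[i] = 3 * x[i]
    end
    return y
end

x = rand(100_000)
y_core = scale_per_core!(similar(x), x)      # first calls also compile the functions
y_thread = scale_per_thread!(similar(x), x)

# Time a second call of each so compilation is excluded.
t_core = @elapsed scale_per_core!(y_core, x)
t_thread = @elapsed scale_per_thread!(y_thread, x)
```

Both variants compute the same result; only the number of threads they may use differs, so any timing difference reflects the workload and the machine.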
LoopVectorization.jl currently only uses up to 1 thread per physical core. Because there is some overhead to switching the number of threads used, per=core is @batch's default, so that Polyester.@batch and LoopVectorization.@tturbo work well together by default.
Threads are not pinned to a given CPU core, and the total number of available threads is still governed by --threads or JULIA_NUM_THREADS.
You can pass both per=(core/thread) and minbatch=N options at the same time, e.g.
@batch per=thread minbatch=2000 for i in Iter; ...; end
@batch minbatch=5000 per=core for i in Iter; ...; end
@batch stride=true for i in Iter; ...; end
This may be better for load balancing if iterations close to each other take a similar amount of time, but iterations far apart take different lengths of time. Setting this also forces per=thread. The default is stride=false.
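The sketch below shows the kind of workload this describes: a triangular loop in which iteration i does work proportional to i, so neighbouring iterations cost about the same while early and late iterations differ a lot. The function name lower_row_sums! and the test matrix are made up for illustration.

```julia
using Polyester

# Hypothetical helper: out[i] = sum of A[i, 1:i], so row i costs about i operations.
function lower_row_sums!(out, A)
    @batch stride=true for i in 1:size(A, 1)
        s = zero(eltype(A))
        for j in 1:i
            s += A[i, j]
        end
        out[i] = s
    end
    return out
end

A = reshape(collect(1.0:16.0), 4, 4)
out = lower_row_sums!(zeros(4), A)
```

With contiguous batches, the thread that receives the last rows would get most of the work; whether striding actually helps for a given problem is best confirmed by timing both settings.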