Polyester

Polyester.@batch (Macro)
@batch for i in Iter; ...; end

Evaluate the loop on multiple threads.
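As a minimal sketch of the basic form, the following fills one vector from another inside a function. The helper name `double!` and the test data are hypothetical and only serve as illustration; it assumes Polyester is installed and that Julia was started with more than one thread if parallel execution is wanted.

```julia
using Polyester

# Hypothetical helper: writes y[i] = 2 * x[i], splitting the iterations across threads.
function double!(y, x)
    @batch for i in eachindex(y, x)
        y[i] = 2 * x[i]   # each iteration writes a distinct element, so iterations are independent
    end
    return y
end

x = collect(1.0:1000.0)
y = similar(x)
double!(y, x)
```

Each iteration touches only its own index, so the result does not depend on how the iterations are divided among threads.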

@batch minbatch=N for i in Iter; ...; end

Evaluate at least N iterations per thread, using at most length(Iter) ÷ N threads.
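A small sketch of `minbatch`, using a hypothetical helper `add_one!` and an arbitrary batch size chosen only for illustration:

```julia
using Polyester

# Hypothetical helper: increments every element of v.
function add_one!(v)
    # With 1000 iterations and minbatch=250, at most 1000 ÷ 250 = 4 threads are used,
    # even if more threads are available.
    @batch minbatch=250 for i in eachindex(v)
        v[i] += 1
    end
    return v
end

v = zeros(Int, 1000)
add_one!(v)
```

A larger `minbatch` is typically worth trying when each iteration is cheap, so that the threading overhead is spread over more work per thread.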

@batch threadlocal=init() for i in Iter; ...; end

Create thread-local storage for use in the loop.

The init function will be called at the start on each thread. Inside the loop, threadlocal refers to the storage local to the current thread. After the loop, threadlocal is a vector containing all of the thread-local values. A type can be specified with threadlocal=init()::Type.
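The following sketch shows the pattern described above with a per-thread counter. The function name `count_iterations` is hypothetical, and `zero(Int)` stands in for the `init()` call.

```julia
using Polyester

# Hypothetical helper: counts loop iterations using one counter per thread.
function count_iterations(n)
    # zero(Int) plays the role of init(); ::Int declares the storage type.
    @batch threadlocal=zero(Int)::Int for i in 1:n
        threadlocal += 1   # updates only the current thread's counter
    end
    # After the loop, threadlocal is a vector with one entry per thread.
    return sum(threadlocal)
end

total = count_iterations(1000)
```

Summing the vector after the loop combines the per-thread counts into a single total.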

@batch reduction=((op1, var1), (op2, var2), ...) for i in Iter; ...; end

Perform an OpenMP-esque reduction on the isbits variables var1, var2, ... using the operations op1, op2, ... . The variables have to be initialized before the loop and cannot be a field or indexing expression such as x.y or x[i]. Supported operations are +, *, min, max, &, and |. The type does not have to be provided, since it is inferred from the initialized variables. However, care has to be taken to ensure that the type remains consistent throughout the loop. While threadlocal can do the same thing, reduction does not incur additional allocations and is generally more efficient for its purpose. It is up to the user to ensure that there are no data dependencies between iterations, which could lead to incorrect results.
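As a sketch of the reduction form, assuming a Polyester version that supports the `reduction` option, the following accumulates a sum and a product in one loop. The function name `sum_and_product` is hypothetical.

```julia
using Polyester

# Hypothetical helper: returns the sum and the product of 1:n.
function sum_and_product(n)
    s = 0   # both variables are initialized before the loop, as required
    p = 1   # Int literals, so the types stay Int throughout the loop
    @batch reduction=((+, s), (*, p)) for i in 1:n
        s += i
        p *= i
    end
    return s, p
end

s, p = sum_and_product(9)
```

Both accumulators are plain local variables rather than fields or array elements, and each is updated only with its declared operation.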

@batch per=core for i in Iter; ...; end
@batch per=thread for i in Iter; ...; end

Use at most 1 thread per physical core, or 1 thread per CPU thread, respectively. One thread per core means fewer threads competing for the cache. On the other hand, if there are (for example) two hardware threads per physical core, then using every thread means that there are two independent instruction streams feeding the CPU's execution units. When one of these streams isn't enough to make the most of out-of-order execution, this could increase total throughput.

Which performs better will depend on the workload, so if you're not sure it may be worth benchmarking both.
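One rough way to compare the two settings is sketched below. The helper names are hypothetical, and `@elapsed` from Base gives only a crude single-shot timing; a dedicated benchmarking package would give more reliable numbers.

```julia
using Polyester

# Hypothetical helpers: the same loop, once per physical core and once per CPU thread.
function scale_per_core!(y, a, x)
    @batch per=core for i in eachindex(y, x)
        y[i] = a * x[i]
    end
    return y
end

function scale_per_thread!(y, a, x)
    @batch per=thread for i in eachindex(y, x)
        y[i] = a * x[i]
    end
    return y
end

x = rand(10_000)
# The first calls also trigger compilation, so they are kept out of the timings.
y_core = scale_per_core!(similar(x), 2.0, x)
y_thread = scale_per_thread!(similar(x), 2.0, x)

# Crude timings of the already-compiled functions.
t_core = @elapsed scale_per_core!(y_core, 2.0, x)
t_thread = @elapsed scale_per_thread!(y_thread, 2.0, x)
```

Both variants compute the same result; only the number of threads that may be used differs.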

LoopVectorization.jl currently uses at most 1 thread per physical core. Because there is some overhead to switching the number of threads used, per=core is @batch's default, so that Polyester.@batch and LoopVectorization.@tturbo work well together by default.

Threads are not pinned to a given CPU core and the total number of available threads is still governed by --threads or JULIA_NUM_THREADS.

You can pass both per=(core/thread) and minbatch=N options at the same time, e.g.

@batch per=thread minbatch=2000 for i in Iter; ...; end
@batch minbatch=5000 per=core   for i in Iter; ...; end

@batch stride=true for i in Iter; ...; end

Setting stride=true may be better for load balancing if iterations close to each other take a similar amount of time, but iterations far apart take different lengths of time. Setting this also forces per=thread. The default is stride=false.
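The sketch below uses a workload of the kind described above: the cost of iteration i grows with i, so neighbouring iterations cost about the same while distant ones differ. The function name `triangular_sums!` and the problem size are hypothetical.

```julia
using Polyester

# Hypothetical workload: iteration i performs i additions.
function triangular_sums!(out)
    @batch stride=true for i in 1:length(out)
        s = 0
        for j in 1:i      # the inner loop gets longer as i grows
            s += j
        end
        out[i] = s
    end
    return out
end

out = zeros(Int, 200)
triangular_sums!(out)
```

Whether `stride=true` actually helps here depends on the machine and problem size, so it is worth timing against the default.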
