Constructing LoopSets

Loop expressions

When applying @turbo to a loop expression, it creates a LoopSet without awareness to type information, and then condenses the information into a summary which is passed as type information to a generated function.

julia> @macroexpand @turbo for m ∈ 1:M, n ∈ 1:N
           C[m,n] = zero(eltype(B))
           for k ∈ 1:K
               C[m,n] += A[m,k] * B[k,n]
           end
       end
quote
    var"##vptr##_C" = LoopVectorization.stridedpointer(C)
    var"##vptr##_A" = LoopVectorization.stridedpointer(A)
    var"##vptr##_B" = LoopVectorization.stridedpointer(B)
    begin
        $(Expr(:gc_preserve, :(LoopVectorization._turbo_!(Val{(0, 0)}(), Tuple{:numericconstant, Symbol("##zero#270"), LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x01), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000007, LoopVectorization.memstore, 0x01, 0x02), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000013, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x03), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000032, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x04), :numericconstant, Symbol("##reductzero#274"), LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000003, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x05), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000132, 0x0000000000000003, 0x0000000000000000, 0x0000000000030405, LoopVectorization.compute, 0x00, 0x05), :LoopVectorization, :reduce_to_add, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000003, 0x0000000000000000, 0x0000000000000601, LoopVectorization.compute, 0x00, 0x01)}, Tuple{LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000102, 0xffffffffffffe03b), LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000103, 0xffffffffffffffd6), LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000302, 0xffffffffffffe056), LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000102, 0xffffffffffffffd6)}, Tuple{0, Tuple{}, Tuple{}, Tuple{}, Tuple{}, Tuple{(1, LoopVectorization.IntOrFloat), (5, LoopVectorization.IntOrFloat)}, Tuple{}}, (LoopVectorization.StaticLowerUnitRange{0}(M), LoopVectorization.StaticLowerUnitRange{0}(N), LoopVectorization.StaticLowerUnitRange{0}(K)), var"##vptr##_C", var"##vptr##_A", var"##vptr##_B", var"##vptr##_C")), :C, :A, :B))
    end
end

When the corresponding method gets compiled for specific type of A, B, and C, the call to the @generated function _turbo_! get compiled. This causes the summary to be reconstructed using the available type information. This type information can be used, for example, to realize an array has been transposed, and thus correctly identify which axis contains contiguous elements that are efficient to load from. This kind of information cannot be extracted from the raw expression, which is why these decisions are made when the method gets compiled for specific types via the @generated function _turbo_!.

The three chief components of the summaries are the definitions of operations, e.g.:

:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000013, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x03)

the referenced array objects:

LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000102, 0xffffffffffffe03b)

and the set of loop bounds:

(LoopVectorization.StaticLowerUnitRange{0}(M), LoopVectorization.StaticLowerUnitRange{0}(N), LoopVectorization.StaticLowerUnitRange{0}(K))

Broadcasting

When applying the @turbo macro to a broadcast expression, there are no explicit loops, and even the dimensionality of the operation is unknown. Consequently the LoopSet object must be constructed at compile time. The function and involved operations are their relationships are straightforward to infer from the structure of nested broadcasts:

julia> Meta.@lower @. f(g(a,b) + c) / d
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope'
1 ─ %1 = Base.broadcasted(g, a, b)
│   %2 = Base.broadcasted(+, %1, c)
│   %3 = Base.broadcasted(f, %2)
│   %4 = Base.broadcasted(/, %3, d)
│   %5 = Base.materialize(%4)
└──      return %5
))))

julia> @macroexpand @turbo @. f(g(a,b) + c) / d
quote
    var"##262" = Base.broadcasted(g, a, b)
    var"##263" = Base.broadcasted(+, var"##262", c)
    var"##264" = Base.broadcasted(f, var"##263")
    var"##265" = Base.broadcasted(/, var"##264", d)
    var"##266" = LoopVectorization.vmaterialize(var"##265", Val{:Main}())
end

These nested broadcasted objects already express information very similar to what the LoopSet objects hold. The dimensionality of the objects provides the information on the associated loop dependencies, but again this information is available only when the method is compiled for specific types. The @generated function vmaterialize constructs the LoopSet by recursively evaluating add_broadcast! on all the fields.