VectorizationBase.jl

VectorizationBase.MM - Type

The name of the MM type refers to MM registers such as XMM, YMM, and ZMM. MMX, from the original MMX SIMD instruction set, is a [meaningless initialism](https://en.wikipedia.org/wiki/MMX_(instruction_set)#Naming).

The MM{W,X} type is used to represent SIMD indexes of width W with stride X.
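As a rough illustration of "width W with stride X", the sketch below expands a starting index into the lane indexes it would cover. It assumes MM{W,X}(i) stands for the W indexes i, i+X, ..., i+(W-1)X; `mm_indexes` is a hypothetical helper written in plain Julia, not a VectorizationBase function.

```julia
# Hypothetical helper, not part of VectorizationBase: expand a starting index
# into the W lane indexes that a width-W, stride-X SIMD index would cover.
mm_indexes(::Val{W}, ::Val{X}, i::Integer) where {W,X} =
    ntuple(w -> i + (w - 1) * X, Val(W))

mm_indexes(Val(4), Val(1), 1)  # (1, 2, 3, 4)
mm_indexes(Val(4), Val(2), 1)  # (1, 3, 5, 7)
```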

VectorizationBase.Unroll - Type

Unroll{AU,F,N,AV,W,M,X}(i::I)

  • AU: Unrolled axis
  • F: Factor, the step size per unroll. If AU == AV, F == W means successive loads. 1 would mean offset by 1, e.g. x[1:8], x[2:9], and x[3:10].
  • N: Number of times it is unrolled
  • AV: Vectorized axis. 0 means not vectorized (some sort of reduction).
  • W: Vector width
  • M: Bitmask indicating whether each factor is masked
  • X: Stride between loads of vectors along axis AV
  • i::I: Index
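
The effect of F is easiest to see by listing the index ranges each unrolled load would touch. The sketch below is only an illustration in plain Julia, assuming AU == AV and a contiguous vector stride (X == 1); `unrolled_ranges` is a hypothetical helper, not part of VectorizationBase.

```julia
# Hypothetical helper, not part of VectorizationBase. Assumes AU == AV and
# X == 1: unroll n starts F elements after unroll n-1, and each load spans W.
unrolled_ranges(i::Integer, F::Integer, N::Integer, W::Integer) =
    [(i + (n - 1) * F):(i + (n - 1) * F + W - 1) for n in 1:N]

unrolled_ranges(1, 1, 3, 8)  # [1:8, 2:9, 3:10], overlapping loads offset by 1
unrolled_ranges(1, 8, 3, 8)  # [1:8, 9:16, 17:24], successive loads (F == W)
```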
VectorizationBase.VecUnroll - Type
VecUnroll{N,W,T,V<:Union{NativeTypes,AbstractSIMD{W,T}}} <: AbstractSIMD{W,T}

VecUnroll supports optimizations when interleaving instructions across different memory storage schemes. VecUnroll{N,W,T} is typically a tuple of N+1 AbstractSIMDVector{W,T}s. For example, a VecUnroll{3,8,Float32} is a collection of 4×Vec{8,Float32}.

Examples

julia> rgbs = [
         (
           R = Float32(i) / 255,
           G = Float32(i + 100) / 255,
           B = Float32(i + 200) / 255
         ) for i = 0:7:49
       ]
8-element Vector{NamedTuple{(:R, :G, :B), Tuple{Float32, Float32, Float32}}}:
 (R = 0.0, G = 0.39215687, B = 0.78431374)
 (R = 0.02745098, G = 0.41960785, B = 0.8117647)
 (R = 0.05490196, G = 0.44705883, B = 0.8392157)
 (R = 0.08235294, G = 0.4745098, B = 0.8666667)
 (R = 0.10980392, G = 0.5019608, B = 0.89411765)
 (R = 0.13725491, G = 0.5294118, B = 0.92156863)
 (R = 0.16470589, G = 0.5568628, B = 0.9490196)
 (R = 0.19215687, G = 0.58431375, B = 0.9764706)

julia> ret = vload(
         stridedpointer(reinterpret(reshape, Float32, rgbs)),
         Unroll{1,1,3,2,8,zero(UInt),1}((1, 1))
       )
3 x Vec{8, Float32}
Vec{8, Float32}<0.0f0, 0.02745098f0, 0.05490196f0, 0.08235294f0, 0.10980392f0, 0.13725491f0, 0.16470589f0, 0.19215687f0>
Vec{8, Float32}<0.39215687f0, 0.41960785f0, 0.44705883f0, 0.4745098f0, 0.5019608f0, 0.5294118f0, 0.5568628f0, 0.58431375f0>
Vec{8, Float32}<0.78431374f0, 0.8117647f0, 0.8392157f0, 0.8666667f0, 0.89411765f0, 0.92156863f0, 0.9490196f0, 0.9764706f0>

julia> typeof(ret)
VecUnroll{2, 8, Float32, Vec{8, Float32}}

While the R, G, and B are interleaved in rgbs, they have effectively been split out in ret (the first contains all 8 R values, with G and B in the second and third, respectively).

To optimize for the user's CPU, in real code it would typically be better to use Int(pick_vector_width(Float32)) in place of 8 (W) in the Unroll construction.

VectorizationBase._vrangeincr - Method

vrange(::Val{W}, i::I, ::Val{O}, ::Val{F})

  • W: Vector width
  • i::I: Dynamic offset
  • O: Static offset
  • F: Static multiplicative factor

VectorizationBase.align - Function
align(x::Union{Int,Ptr}, [n])

Return aligned memory address with minimum increment. align assumes n is a power of 2.
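
The power-of-2 assumption is what allows alignment to be done with a mask instead of a division. The sketch below shows the usual round-up trick on plain integers; `align_sketch` is a hypothetical stand-in for illustration and is not claimed to be the library's implementation.

```julia
# Hypothetical stand-in for illustration; not the library's implementation.
# Round x up to the next multiple of n, where n must be a power of 2.
# Adding n - 1 overshoots into the next block unless x is already aligned,
# and masking with -n clears the low bits.
align_sketch(x::Integer, n::Integer) = (x + n - 1) & -n

align_sketch(13, 8)   # 16
align_sketch(16, 8)   # 16 (already aligned, unchanged)
align_sketch(17, 64)  # 64
```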

VectorizationBase.bitselect - Method

bitselect(m::Unsigned, x::Unsigned, y::Unsigned)

If you have AVX512, setbits of vector arguments will select bits according to mask m, selecting from x if 0 and from y if 1. For scalar arguments, or vector arguments without AVX512, setbits additionally requires that y be 0 in every bit for which m is 1. That is, in those cases it requires that ((y ⊻ m) & m) == m.
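
The selection rule and the restriction can both be written out on scalars. This is a minimal reference sketch in plain Julia, following the description above; `bitselect_sketch` and `restriction_holds` are hypothetical names and are not claimed to match how the library implements either path.

```julia
# Hypothetical reference version for illustration only.
# Each result bit comes from x where the mask bit is 0 and from y where it is 1.
bitselect_sketch(m::T, x::T, y::T) where {T<:Unsigned} = (x & ~m) | (y & m)

# The restriction quoted above: y must be 0 wherever m is 1.
restriction_holds(m::T, y::T) where {T<:Unsigned} = ((y ⊻ m) & m) == m

m = 0b11110000
x = 0b10101010
y = 0b01010101
bitselect_sketch(m, x, y)         # 0b01011010 (0x5a)

restriction_holds(m, 0b00000101)  # true: no set bits of y under the mask
restriction_holds(m, y)           # false: y has set bits where m is 1
```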

VectorizationBase.ifmahi - Method
ifmahi(v1, v2, v3)

Multiply unsigned integers v1 and v2, adding the upper 52 bits to v3.

Requires has_feature(Val(:x86_64_avx512ifma)) to be fast.

VectorizationBase.ifmalo - Method
ifmalo(v1, v2, v3)

Multiply unsigned integers v1 and v2, adding the lower 52 bits to v3.

Requires has_feature(Val(:x86_64_avx512ifma)) to be fast.
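
To make "lower 52 bits" and "upper 52 bits" concrete, the scalar sketch below models ifmalo and ifmahi on UInt64 values. It assumes the AVX-512 IFMA behavior: only the low 52 bits of v1 and v2 are multiplied, giving a product of up to 104 bits, whose low or high half is added to v3. The `_sketch` functions and `MASK52` are hypothetical names, not the library's fallback code.

```julia
# Hypothetical scalar models for illustration; assumes AVX-512 IFMA semantics.
const MASK52 = (UInt64(1) << 52) - 1   # low 52 bits set

function ifmalo_sketch(v1::UInt64, v2::UInt64, v3::UInt64)
    p = widemul(v1 & MASK52, v2 & MASK52)   # UInt128 product, at most 104 bits
    return v3 + ((p % UInt64) & MASK52)     # add bits 0..51 of the product
end

function ifmahi_sketch(v1::UInt64, v2::UInt64, v3::UInt64)
    p = widemul(v1 & MASK52, v2 & MASK52)
    return v3 + ((p >> 52) % UInt64)        # add bits 52..103 of the product
end

a = UInt64(1) << 51                   # 2^51
ifmalo_sketch(a, UInt64(4), UInt64(10))   # 10: the product 2^53 has no low bits set
ifmahi_sketch(a, UInt64(4), UInt64(10))   # 12: 2^53 >> 52 == 2
```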

VectorizationBase.inv_approx - Method

Fast approximate reciprocal.

Guaranteed accurate to at least 2^-14 ≈ 6.103515625e-5.

Useful for special function implementations.

VectorizationBase.offset_ptr - Method

An omnibus offset constructor.

The general motivation for generating the memory addresses as LLVM IR rather than combining multiple llvmcall Julia functions is that we want to minimize the inttoptr and ptrtoint calculations as we go back and forth. These can get in the way of some optimizations, such as memory address calculations. It is particularly important for gathers and scatters, as these functions take a Vec{W,Ptr{T}} argument to load/store a Vec{W,T} to/from. If sizeof(T) < sizeof(Int), converting the <W x $(typ)*> vectors of pointers in LLVM to integer vectors as they're represented in Julia will likely make them too large to fit in a single register, splitting the operation into multiple operations and forcing a corresponding split of the Vec{W,T} vector as well. This would all be avoided by not promoting/widening the <W x $(typ)> into a vector of Ints.

For this last issue, an alternate workaround would be to wrap a Vec of 32-bit integers with a type that defines it as a pointer for use with internal llvmcall functions, but I haven't really explored this optimization.

VectorizationBase.preserve_buffer - Method

For structs wrapping arrays, using GC.@preserve can trigger heap allocations. preserve_buffer attempts to extract the heap-allocated part. Isolating it by itself will often allow the heap allocations to be elided. For example:

julia> using StaticArrays, BenchmarkTools

julia> # Needed until a release is made featuring https://github.com/JuliaArrays/StaticArrays.jl/commit/a0179213b741c0feebd2fc6a1101a7358a90caed
       Base.elsize(::Type{<:MArray{S,T}}) where {S,T} = sizeof(T)

julia> @noinline foo(A) = unsafe_load(A, 1)
foo (generic function with 1 method)

julia> function alloc_test_1()
         A = view(MMatrix{8,8,Float64}(undef), 2:5, 3:7)
         A[begin] = 4
         GC.@preserve A foo(pointer(A))
       end
alloc_test_1 (generic function with 1 method)

julia> function alloc_test_2()
         A = view(MMatrix{8,8,Float64}(undef), 2:5, 3:7)
         A[begin] = 4
         pb = parent(A) # or `LoopVectorization.preserve_buffer(A)`; `preserve_buffer(::SubArray)` calls `parent`
         GC.@preserve pb foo(pointer(A))
       end
alloc_test_2 (generic function with 1 method)

julia> @benchmark alloc_test_1()
BenchmarkTools.Trial:
  memory estimate:  544 bytes
  allocs estimate:  1
  --------------
  minimum time:     17.227 ns (0.00% GC)
  median time:      21.352 ns (0.00% GC)
  mean time:        26.151 ns (13.33% GC)
  maximum time:     571.130 ns (78.53% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark alloc_test_2()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.275 ns (0.00% GC)
  median time:      3.493 ns (0.00% GC)
  mean time:        3.491 ns (0.00% GC)
  maximum time:     4.998 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000