In my last post on a vectorised bitwise-OR function for kdb+, I wondered towards the end what the next interesting step would be: should it be the production of an AVX variant of the same code, or the modification of the code to handle short as well as byte values? Well, I opted for the former, and found that it really wasn’t worth the trouble; the biggest benefit was realising some of the limitations of the AVX instruction set. It’s worth sharing, though, since it may help others in the future and will help produce a faster version when the AVX2 instructions come out… which is a hint.
When someone starts talking about “AVX”, it’s usual to think of 256-bit, 32-byte, half-cache-line-wide vector registers and instructions.
The thing is, only some of the AVX instructions can operate on 32-byte YMM registers, and pending the arrival of Haswell and the AVX2 instructions, there’s no obvious advantage to some of them over the SSE instructions. In his “Optimizing subroutines in assembly language” manual (PDF), Agner Fog states (s. 13.6):
The AVX instruction set supports floating point operations in 256-bit vectors. Integer operations in 256-bit vectors require the AVX2 instruction set.
However, even where the AVX instructions don’t provide access to wider registers, they’re still necessary, since there is a cost to mixing AVX and legacy SSE instructions. Again, per Agner Fog:
All the existing XMM instructions have two versions in the AVX instruction set. A legacy version which leaves the upper half (bit 128-255) of the target YMM register unchanged, and a VEX-prefix version which sets the upper half to zero in order to remove false dependences. If a code sequence contains both 128-bit and 256-bit instructions then it is important to use the VEX-prefix version of the 128-bit instructions. Mixing 256 bit instructions with legacy 128-bit instructions without VEX prefix will cause the processor to switch the register file between different states which may cost many clock cycles on Intel’s Sandy Bridge processor.
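To make that concrete, here is a minimal sketch (my own illustration, not code taken from the kdb+ function) of the legacy and VEX-encoded forms of the same 128-bit operation in AT&T syntax, along with the vzeroupper guard used on entry to and exit from the function below:
pand %xmm1, %xmm0 # Legacy SSE encoding: leaves bits 128-255 of YMM0 unchanged (the false dependency)
vpand %xmm1, %xmm0, %xmm0 # VEX-encoded 128-bit form: same operation, but zeroes bits 128-255 of YMM0
vzeroupper # Zero the upper halves of all YMM registers before any legacy SSE code runs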
So to take my current use-case, the logical, vectorised comparison of individual byte values: there’s no benefit to using vpcmpeqb, vpxor and vpand over the SSE pcmpeqb, pxor and pand instructions, since without AVX2 they all operate only on XMM registers; it’s simply a question of cost-avoidance. Some performance testing of the converted bitwise-OR function bears this out, and the AVX version is, in fact, ever so slightly slower (although I’m happy to take responsibility for that).
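To spell out that one-to-one correspondence, the sketch below pairs the XMM-width comparison kernel in both encodings, using the register roles from the listing further down (XMM0 holds the broadcast mask, XMM1 zero, XMM2 all-ones, XMM3 bytes of 0x01 and XMM4 the input data); the SSE lines are a paraphrase of the earlier post rather than a verbatim copy:
pand %xmm0, %xmm4 # SSE: xmm4 &= mask
pcmpeqb %xmm1, %xmm4 # SSE: 0xff where the masked byte equals zero
pxor %xmm2, %xmm4 # SSE: invert, giving 0xff where a mask bit was set
pand %xmm3, %xmm4 # SSE: squash 0xff down to a boolean 0x01

vpand %xmm0, %xmm4, %xmm4 # AVX: same work, VEX-encoded, still XMM-wide
vpcmpeqb %xmm1, %xmm4, %xmm4
vpxor %xmm2, %xmm4, %xmm4
vpand %xmm3, %xmm4, %xmm4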
The conversion of the existing SSE code was fairly straightforward, and once I realised that I wasn’t going to have to deal with YMM registers, the loop accounting remained identical; by that I mean handling fewer than a full YMM register’s worth of byte-values, and copying those fewer-than-32 bytes to the result vector efficiently. The steps were quite simple:
- ensure that memory-reads and writes are properly aligned; and
- ensure that only AVX instructions are used.
kdbavxor.s
# (c) Michael Guyver, 2012, all rights reserved. Please see the copy on GitHub for the full licence.
.section .rodata
.align 0x10
.LbadTypeStr:
.asciz "vectype"
.section .text
.globl mgBitwiseOr
.type mgBitwiseOr, STT_FUNC
.LhandleBadType: # Handler for invalid input vector-type
lea .LbadTypeStr(%rip), %rdi # Load address of error message
callq krr@PLT # Call kdb's error-reporting function
jmp .LrestoreStackForExit # Exit via the clean-up function
.LsuccessExit:
mov %r14, %rax # Copy result address to RAX
.LrestoreStackForExit:
mov %rbp, %rsp # Restore SP
pop %r15 # Restore non-durable registers
pop %r14
pop %r13
pop %r12
pop %rbx
pop %rbp
vzeroupper # Guard against expensive AVX->SSE transition
ret # Exit to caller
.align 0x10 # Align entry-point to 16-byte boundary
mgBitwiseOr:
push %rbp # Push all non-durable registers
push %rbx
push %r12
push %r13
push %r14
push %r15
mov %rsp, %rbp # Store updated SP
sub $0x10, %rsp # Ensure alignment reserves at least 16 bytes
and $-0x20, %rsp # Align SP to 32-byte boundary
.LstoreInputArgs:
mov %rdx, 0x08(%rsp) # Store arg2
mov %rcx, 0x00(%rsp) # Store arg3
mov %rdi, %r12 # Copy/store arg0
mov %rsi, %r13 # Copy/store arg1
vzeroupper # Guard against expensive SSE->AVX transition
.LtestInputType:
cmpb $4, 2(%rdi) # Compare input vec-type with 4
jne .LhandleBadType # branch if not-equal
.LcreateResultObject:
mov $1, %rdi # arg0: vector-type (1 = boolean)
movq 8(%r12), %rsi # arg1: copy input vector-length
callq ktn@PLT # Create result vector
mov %rax, %r14 # Safe-store result pointer
.LsetupAvxConstants:
xor %rax, %rax # Zero register
movb 8(%r13), %al # Copy mask byte-value
movq %rax, %xmm0 # Copy again to XMM0
vpxor %xmm1, %xmm1, %xmm1 # Clear XMM1: arg[2] = arg[0] ^ arg[1]
vpshufb %xmm1, %xmm0, %xmm0 # Byte-wise shuffle: msk=arg[0], val=arg[1], dst=arg[2]; 2B/4-282, p. 284
vpcmpeqw %xmm2, %xmm2, %xmm2 # Set all bits in XMM2
vpcmpeqw %xmm3, %xmm3, %xmm3 # Set all bits in XMM3: 0xffff_ffff_ffff_ffff_...
vpsrlw $15, %xmm3, %xmm3 # Shifts each word right by 15 bits: 0x0001_0001_0001_0001_...
vpackuswb %xmm3, %xmm3, %xmm3 # Packs: 0x0101_0101_0101_0101_...
.LsetupLoopCounter:
xor %rdx, %rdx # Clear loop-counter
mov 8(%r14), %r15 # Copy result->length to R15
.LloopStart:
mov %r15, %rax # Copy vec-len
sub %rdx, %rax # Calculate remaining
je .LsuccessExit # branch to exit-handler if remaining == 0
cmp $0x10, %rax # Compare remaining with 16
jl .LsubXmmRemaining # branch if < 16
je .LprocessOneXmm # branch if == 16
lea 0x10(%r12,%rdx), %rbx # Calculate address of read cursor
test $0x3f, %rbx # Test for 64-byte alignment
jnz .LprocessOneXmm # if not aligned, jump to generic handler
cmp $0x40, %rax # Read is aligned, check num bytes remaining
jge .LprocessCacheLine # if >= 64, jump to cache-line handler
.LprocessOneXmm:
vmovaps 0x10(%r12,%rdx,1), %xmm4 # Load 16 bytes of input
vpand %xmm0, %xmm4, %xmm4 # AND with the broadcast mask
vpcmpeqb %xmm1, %xmm4, %xmm4 # Set bytes to 0xff where the masked result was zero
vpxor %xmm2, %xmm4, %xmm4 # Flip the bytes
vpand %xmm3, %xmm4, %xmm4 # Convert 0xff to boolean 0x01
vmovaps %xmm4, 0x10(%r14,%rdx,1) # Store 16 bytes to the result vector
add $0x10, %rdx # Loop accounting: add 16 to counter
jmp .LloopStart
The next section contains some minor differences from the original SSE version which I thought worth mentioning. In an attempt to lower the work done by the memory controller, I have issued two vmovaps instructions instead of four movaps instructions. The fact that the memory controller does “less” work is measurable, insofar as this code issues 60.0% of the number of MEM_UOP_RETIRED.LOADS events that the SSE version does.
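For illustration only (these lines are a sketch, not an extract from either function), the load pattern for one 64-byte cache line looks like this in the two versions; the offsets mirror the addressing used in the listing:
movaps 0x10(%r12,%rdx,1), %xmm4 # SSE: four 16-byte loads cover one cache line
movaps 0x20(%r12,%rdx,1), %xmm5
movaps 0x30(%r12,%rdx,1), %xmm6
movaps 0x40(%r12,%rdx,1), %xmm7

vmovaps 0x10(%r12,%rdx,1), %ymm4 # AVX: two 32-byte loads cover the same cache line
vmovaps 0x30(%r12,%rdx,1), %ymm6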
In order to unpack the 32-byte YMM registers into XMM registers, I’ve used the vperm2f128 instruction with an imm8 selector to copy the high double-quad-word/octo-word of a YMM register to the low DQW/OW of another YMM register, in other words into an XMM register. The same instruction, with a different selector, packs the two halves back together so that the result is written to memory in two 32-byte chunks.
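Since the imm8 selector of vperm2f128 is easy to misread, here is a short annotated sketch of the two selector values used below; it paraphrases the instruction’s documented operation rather than quoting the function. Bits 1:0 of the immediate choose which 128-bit lane is written to the destination’s low half and bits 5:4 which lane is written to its high half, with 0/1 selecting the low/high lane of the first source and 2/3 the low/high lane of the second source (remembering that AT&T syntax lists the operands as imm8, src2, src1, dst):
vperm2f128 $0x01, %ymm4, %ymm4, %ymm5 # dst.lo = src1.hi, dst.hi = src1.lo: XMM5 gets YMM4's high 16 bytes
vperm2f128 $0x20, %ymm5, %ymm4, %ymm4 # dst.lo = src1.lo (XMM4), dst.hi = src2.lo (XMM5): re-pack the halves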
.align 0x10 # Align jump target to 16-byte boundary
.LprocessCacheLine:
vmovaps 0x10(%r12,%rdx,1), %ymm4 # Copy 32 bytes to YMM4
vmovaps 0x30(%r12,%rdx,1), %ymm6 # Copy 32 bytes to YMM6
prefetcht0 0x310(%r12,%rdx,1) # Hint to prefetch cache line several loops hence
vperm2f128 $0x01, %ymm4, %ymm4, %ymm5 # Unpack high-bytes from YMM4 to XMM5
vperm2f128 $0x01, %ymm6, %ymm6, %ymm7 # Unpack high-bytes from YMM6 to XMM7
vpand %xmm0, %xmm4, %xmm4 # Perform bitwise comparison
vpand %xmm0, %xmm5, %xmm5 #
vpand %xmm0, %xmm6, %xmm6 #
vpand %xmm0, %xmm7, %xmm7 #
vpcmpeqb %xmm1, %xmm4, %xmm4 # Set bytes to 0xff where the masked result was zero
vpcmpeqb %xmm1, %xmm5, %xmm5 #
vpcmpeqb %xmm1, %xmm6, %xmm6 #
vpcmpeqb %xmm1, %xmm7, %xmm7 #
vpxor %xmm2, %xmm4, %xmm4 # Flip the bytes
vpxor %xmm2, %xmm5, %xmm5 #
vpxor %xmm2, %xmm6, %xmm6 #
vpxor %xmm2, %xmm7, %xmm7 #
vpand %xmm3, %xmm4, %xmm4 # Convert 0xff to boolean 0x01
vpand %xmm3, %xmm5, %xmm5 #
vpand %xmm3, %xmm6, %xmm6 #
vpand %xmm3, %xmm7, %xmm7 #
vperm2f128 $0x20, %ymm5, %ymm4, %ymm4 # Pack: XMM4 to the low, XMM5 to the high half of YMM4
vperm2f128 $0x20, %ymm7, %ymm6, %ymm6 # Pack: XMM6 to the low, XMM7 to the high half of YMM6
vmovaps %ymm4, 0x10(%r14,%rdx,1) # Copy 32 bytes to result
vmovaps %ymm6, 0x30(%r14,%rdx,1) # Copy 32 bytes to result
add $0x40, %rdx # Loop accounting: add 64 to counter
mov %r15, %rax # Copy result length
sub %rdx, %rax # Subtract counter from result.len
cmp $0x40, %rax # Compare result with 64
jge .LprocessCacheLine # if greater or equal, loop again
jmp .LloopStart # Otherwise, enter generic loop handler
The following code is, again, functionally identical to the SSE variant except for the replacement of legacy SSE instructions with their AVX counterparts. This section copies fewer than a full XMM register’s worth of bytes to the result vector. If, for example, there are 7 bytes remaining to copy, it skips the quad-word copy (8 bytes) but copies a double-word (4 bytes), a word (2 bytes) and a single byte from the stacked XMM value to the result vector (4 + 2 + 1 = 7).
.LsubXmmRemaining:
vmovaps 0x10(%r12,%rdx,1), %xmm4 # Load 16-bytes of input
vpand %xmm0, %xmm4, %xmm4 # compare
vpcmpeqb %xmm1, %xmm4, %xmm4 # set to 0xff if equal
vpxor %xmm2, %xmm4, %xmm4 # flip bytes
vpand %xmm3, %xmm4, %xmm4 # convert 0xff to 0x01
vmovaps %xmm4, (%rsp) # Copy boolean results to stack
xor %rbx, %rbx # Zero RBX for counting bytes copied
mov %r15, %rax # Copy veclen(result)
sub %rdx, %rax # Calc remaining
cmp $0x08, %rax # Compare remaining with 8
jl .LltEightRemaining # branch < 8
movq (%rsp,%rbx,1), %r8 # Copy QW from stack to QW reg
movq %r8, 0x10(%r14,%rdx,1) # Copy the same to the result
add $0x08, %rdx # Add 8 to counter
add $0x08, %rbx # Add 8 to src-counter
sub $0x08, %rax # Subtract 8 from remaining
.LltEightRemaining: # Handle < 8
cmp $0x04, %rax # Compare remaining with 4
jl .LltFourRemaining # branch < 4
movl (%rsp,%rbx,1), %r8d # Copy DW from stack to DW reg
movl %r8d, 0x10(%r14,%rdx,1) # Copy DW to result
add $0x04, %rdx # Add 4 to counter
add $0x04, %rbx # Add 4 to src-counter
sub $0x04, %rax # Subtract 4 from remaining
.LltFourRemaining: # Handle < 4
cmp $0x02, %rax # Compare remaining with 2
jl .LltTwoRemaining # branch < 2
movw (%rsp,%rbx,1), %r8w # Copy W from stack to W reg
movw %r8w, 0x10(%r14,%rdx,1) # Copy W to result
add $0x02, %rdx # Add 2 to counter
add $0x02, %rbx # Add 2 to src-counter
sub $0x02, %rax # Subtract 2 from remaining
.LltTwoRemaining: # Handle < 2
cmp $0x01, %rax # Compare remaining with 1
jl .LnoneRemaining # branch if none remaining
movb (%rsp,%rbx,1), %r8b # Copy B from stack to B reg
movb %r8b, 0x10(%r14,%rdx,1) # Copy B to result
.LnoneRemaining:
jmp .LsuccessExit
I’m not going to go into an analysis of this code in any great depth, but a comparison of the two different functions from the same q session shows the following results (the AVX version is on top):
q).pmc.avgres:{[t;n] flip key[d]!enlist each value d:avg[n#t] }
q).pmc.setAvx[]
q)avx:.pmc.script4`usr;
q).pmc.setSse[]
q)sse:.pmc.script4`usr;
q).pmc.avgres[avx;-1024] uj .pmc.avgres[sse;-1024]
instAny clkCore clkRef UopsAny MemUopsLd L3Miss StallCycles MHz nanos
-----------------------------------------------------------------------------------
468950 208697 165777.4 485184.3 46915.16 1.169922 31703.37 3399.318 61399.37
468954 208664.1 165706.5 515918.6 78157.2 3.637695 42506.97 3400.265 61372.37
Some items of note:
- an insignificant difference in the number of instructions executed between the two variants (instAny);
- a noticeable difference in the number of µ-ops issued (UopsAny);
- a very significant difference in the number of memory-load µ-ops issued (MemUopsLd);
- a very significant difference in the number of L3 misses (L3Miss);
- a very significant difference in the number of stall-cycles experienced by the two different functions (StallCycles);
- an entirely equivalent average CPU clock speed for each test run (MHz, derived from clkCore and clkRef)… yet
- an almost identical number of reference clocks (clkRef), and hence wall-time, for both variants.

In light of the above, I find this very difficult to explain: the AVX version has everything going for it. It’s like playing Top Trumps but losing in spite of holding a card with all the best scores. If you’ve got any ideas, please drop me a line.
Like the last time, I’ve included the code for this post on GitHub, complete with licence details etc.