A long time ago I started writing some shared-library functions for kdb+ which would offer bitwise comparison of integer vector types using SSE vector registers. I came up with something which worked well enough but wanted to know if it could be made to go any faster. I wondered in particular about Intel’s prefetch
instructions and needed a way of verifying whether they made any difference.
Please also see the follow-up post Another vectorised bitwise-OR function, AVX style.
In order to find out, I downloaded, built and figured out how to use Agner Fog’s unique testp library. I say unique because I am unaware of any other non-commercial software which aims to help programmers accurately measure the performance of a section of code. The invariant RDTSC
instruction just doesn’t cut the mustard in an age of speed-stepping and turbo boost. I have to admit to finding Agner’s library a little bit confusing, but I also got useful results with it. And then… well, then I got side-tracked, as I thought it could be simplified for the single-threaded use-case; after all, how hard could it be? Once the MSR kernel module is in place, it should be easy to roll your own controller. Then I had the idea of using kdb+ to run the controller and manage the results… and so I spent quite a bit of time not writing vectorised bitwise comparison functions and instead wrote a performance monitoring tool (entirely for my own selfish needs, which is why it currently only targets the 2nd generation Core i7 processor family).
But now I’m back on the original project and have been able to determine that the prefetch
instruction does make a measurable difference, and I think I’ve found the sweet spot which knocks about 10% off the execution time of the vector-register code-section. I’ll present the code in chunks because it’s quite long: not because of the comparison code itself, but because of having to handle insufficiently well-aligned data and end-of-vector data which doesn’t fill a whole XMM register. The supporting code for this blog-post can be found on my space on GitHub.
/*
* (c) Michael Guyver, 2012, all rights reserved.
*/
.include "savereg.macro.s"
.section .rodata
.align 0x10 # Align LbadTypeStr to 16-byte boundary
.LbadTypeStr:
.asciz "vectype"
.section .text
.globl bitwiseOr # Declare global function 'bitwiseOr'
.type bitwiseOr, STT_FUNC
.LbadType: # Handler for invalid input vector-type
lea .LbadTypeStr(%rip), %rdi # Load address of error message
callq krr@PLT # Call kdb's error function
jmp .LrestoreStackForExit # Exit via the clean-up function
.LsuccessExit:
mov %r14, %rax # Copy result address to RAX
.LrestoreStackForExit:
mov %rbp, %rsp # Restore poppable SP
pop %r15 # Restore non-volatile registers
pop %r14
pop %r13
pop %r12
pop %rbx
pop %rbp
ret # Exit to calling code
bitwiseOr:
push %rbp # Push all non-volatile registers
push %rbx
push %r12
push %r13
push %r14
push %r15
mov %rsp, %rbp # Store updated SP
The code above contains the simple error-reporting function .LbadType
. This notifies the q
binary that an exception has occurred which is reported on return from the library function. The .LrestoreStackForExit
code epilogue, which is executed on any exit-path from bitwiseOr, restores the non-volatile registers from the stack and finally executes the ret
instruction. Note also the .LsuccessExit
label, which sets the result value in the RAX
register (and which then flows into the function epilogue). I’ve shown the prologue of the bitwiseOr
function for comparison with the function epilogue.
It’s probably worth mentioning that I’ve stored the stack-pointer in the RBP
register to make it simple to recover the position from which the values can be popped from the stack. This has the advantage that I don’t need to think too hard about stack-pointer alignment: I can simply align it down to the next 16-byte boundary, as the following code does by ANDing it with the bitwise NOT of 0x0F
. Note also that this code has been modified to accommodate two additional parameters, namely the function pointers to libpmc.c
’s start_counters
and stop_counters
.
sub $0x10, %rsp # Reserve stack space
and $~0xF, %rsp # Align to 16-byte boundary
push %rdx # Store arg2: start_counters address: 0x08(%rsp)
push %rcx # Store arg3: stop_counters address: 0x00(%rsp)
mov %rdi, %r12 # Save arg0: byte-vec
mov %rsi, %r13 # Save arg1: byte-mask

movb 2(%rdi), %al # Copy arg0->t to lo-byte
mov %al, %bl # Copy type
sub $4, %bl # Check lower-bound
jl .LbadType # Branch if below
sub $4, %al # Check upper-bound
jg .LbadType # Branch if above
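Just to make the alignment arithmetic concrete, here is the same trick in C (a standalone sketch of mine with an arbitrary example address, not part of the library): clearing the low four bits aligns a value down to a 16-byte boundary, and the 0x3F mask used further down plays the same role for the 64-byte cache-line test.
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uintptr_t sp = 0x7fffffffe0d8;                /* arbitrary example value      */
    uintptr_t aligned16 = sp & ~(uintptr_t)0xF;   /* what 'and $~0xF, %rsp' does  */
    int cacheline_aligned = (sp & 0x3F) == 0;     /* the later 64-byte test       */
    printf("%#lx -> %#lx (64-byte aligned: %d)\n",
           (unsigned long)sp, (unsigned long)aligned16, cacheline_aligned);
    return 0;
}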
After the stack alignment and storing the two function pointers on the stack (keeping the 16-byte alignment), the code then tests the type of the first argument. Have a look at the k.h
header file for the definition of the 3.0 struct k0
. The third byte in the struct defines the type. At the moment the type-check tests a lower and an upper-bound of 4; this isn’t nuts, it’s just waiting till I implement the code for short
, int
and possibly long
values whose type-values are 5, 6 and 7. If the test fails either comparison the code branches to the error handler function. The code then continues to instantiate the result object, using functions exposed by the q
binary:
# Create result object
mov $1, %rdi # Specify type (bool-vec)
movq 8(%r12), %rsi # Copy veclen(arg0)
callq ktn@PLT # Create result struct
mov %rax, %r14 # Store result address in non-vol register
It’s a simple code-snippet: it requests the construction of a boolean vector with the same length as input vector arg0
. See code.kx.com for information on the ktn
function for creating lists. The code then stores the pointer to the result struct in a spare non-volatile register for later use.
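For anyone following along in C rather than assembly, the type check and result construction amount to something like this sketch against k.h (the function name is mine; KG and KB are the byte and boolean type codes):
#include "k.h"                        /* kdb+ C API: K, krr, ktn, KG, KB, kG */

/* Sketch only: the checks and allocation the assembly performs. */
K bitwiseOrSketch(K vec, K mask)
{
    if (vec->t != KG)                 /* third byte of struct k0 holds the type */
        return krr("vectype");        /* the .LbadType path                     */
    K result = ktn(KB, vec->n);       /* boolean vector, same length as arg0    */
    /* ... the SSE loop would fill kG(result) from kG(vec) and mask->g ...      */
    return result;                    /* pointer ends up in RAX                 */
}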
We’re finally getting to some of the comparison code. The basic comparison I’ve come up with is as follows:
example XMM value under test: | 000102030405060708090a0b0c0d0e0f |
test against vectorised mask byte: | 01010101010101010101010101010101 |
(interim result) | 00010001000100010001000100010001 |
compare with zeroed-operand: | 00000000000000000000000000000000 |
(interim result) | FF00FF00FF00FF00FF00FF00FF00FF00 |
XOR with fully-set operand: | FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF |
(interim result) | 00FF00FF00FF00FF00FF00FF00FF00FF |
Reduce 0xFF to boolean ‘true’ 0x01: | 00010001000100010001000100010001 |
It’s a bit long-winded, and for the special case where you’re testing against a byte-mask of 0x01
you could skip the final three steps (note to self), but it works for the general case. In order to perform these comparisons and transformations, I need to set up a vector containing 16 copies of the byte-mask passed in as the second argument to bitwiseOr
, a fully-zeroed XMM register (easy), a fully-set XMM (easy, as it turns out) and an XMM register with each byte set to 0x01
(easy if you know that Agner Fog has taken the time and trouble to publish in his manuals the steps necessary to generate different constant values):
.LpreLoopByte:
# shuffle mask byte into all 128 bits of XMM
xor %rax, %rax # zero register
movb 8(%r13), %al # load mask-byte, arg1->g
movq %rax, %xmm0 # copy byte to XMM
pxor %xmm1, %xmm1 # clear mask bytes
pshufb %xmm1, %xmm0 # XMM0 contains 16 copies of arg1->g
pcmpeqw %xmm2, %xmm2 # set all bits
# Initialise 0xFF->0x01 conversion mask
pcmpeqw %xmm3, %xmm3 # Generates: 0xffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff
psrlw $15, %xmm3 # Shifts: 0x0001_0001_0001_0001_0001_0001_0001_0001
packuswb %xmm3, %xmm3 # Packs: 0x0101_0101_0101_0101_0101_0101_0101_0101
movq 8(%r14), %r15 # Copy veclen(result) -> R15
xor %rdx, %rdx # Clear counter
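For reference, the same four constants can be produced with SSE intrinsics; this is just a sketch of mine which mirrors the trick above (in practice _mm_set1_epi8 would do for the broadcast and the 0x01 vector):
#include <tmmintrin.h>                            /* SSSE3 for _mm_shuffle_epi8 */

static void build_constants(unsigned char mask_byte, __m128i *vmask,
                            __m128i *vzero, __m128i *vones, __m128i *v01)
{
    __m128i m = _mm_cvtsi32_si128(mask_byte);     /* mask byte into lane 0      */
    *vzero = _mm_setzero_si128();                 /* pxor   %xmm1, %xmm1        */
    *vmask = _mm_shuffle_epi8(m, *vzero);         /* pshufb: broadcast the byte */
    *vones = _mm_cmpeq_epi16(*vzero, *vzero);     /* pcmpeqw: all bits set      */
    __m128i w = _mm_srli_epi16(*vones, 15);       /* psrlw $15: 0x0001 words    */
    *v01 = _mm_packus_epi16(w, w);                /* packuswb: 0x01 bytes       */
}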
Having set up the constant values needed to do the comparisons, there now follows the main-loop controller code. It’s complicated by the fact that it needs to test bounding-conditions and counts of elements remaining, as well as cache-line alignment, before it can branch to the high-throughput code. Read the comments for more detail:
.LloopByte:
mov %r15, %rax # Copy vec-len
sub %rdx, %rax # Calculate remaining
jle .LsuccessExit # branch to exit if <= 0
cmp $0x10, %rax # Compare remaining with 16
jl .LsubXmmRemaining # branch if < 16
je .LsingleReg # branch if == 16
lea 0x10(%r12,%rdx), %rbx # Load address of next read
test $0x3f, %rbx # Test for 64-byte alignment
jnz .LsingleReg # if not aligned, jump to generic handler
cmp $0x40, %rax # Read is aligned, check num bytes available
jge .LquadRegPre # if >= 64, handle 64-bytes at a time
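Expressed as C, the decision the controller makes on each trip round the loop is roughly this (a sketch, with names of my own invention):
#include <stddef.h>
#include <stdint.h>

/* One trip round .LloopByte, deciding which handler to branch to. */
enum handler { DONE, QUAD_REG, SINGLE_REG, SUB_XMM };

static enum handler dispatch(const unsigned char *src, size_t veclen, size_t i)
{
    size_t remaining = veclen - i;
    if ((ptrdiff_t)remaining <= 0)
        return DONE;                          /* .LsuccessExit                   */
    if (remaining < 16)
        return SUB_XMM;                       /* .LsubXmmRemaining               */
    if (remaining == 16)
        return SINGLE_REG;                    /* .LsingleReg                     */
    if (((uintptr_t)(src + i) & 0x3F) != 0)
        return SINGLE_REG;                    /* not yet 64-byte aligned         */
    if (remaining >= 64)
        return QUAD_REG;                      /* .LquadRegPre: whole cache line  */
    return SINGLE_REG;                        /* aligned but < 64 bytes left     */
}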
The code starting at label .LsingleReg
handles the case where there is exactly one XMM register of data remaining, or the input data is not yet aligned to a 64-byte boundary. The result is then copied into the boolean result-vector, which is 16-byte aligned (enabling the use of movaps
without further alignment checks). Finally it jumps back to the loop controller code.
.LsingleReg:
movaps 0x10(%r12,%rdx,1), %xmm4 # Load 16-bytes of input
pand %xmm0, %xmm4 # compare with mask
pcmpeqb %xmm1, %xmm4 # set to 0xFF if equal
pxor %xmm2, %xmm4 # flip bits
pand %xmm3, %xmm4 # convert 0xFF to 0x01
movaps %xmm4, 0x10(%r14,%rdx,1) # Given >= 16 bytes, copy result
add $0x10, %rdx
jmp .LloopByte
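The body of .LsingleReg corresponds to the following intrinsics sketch (reusing the constants from the build_constants sketch above; the aligned load and store are left to the caller):
#include <emmintrin.h>                             /* SSE2 */

/* 0x01 where (byte & mask) != 0, 0x00 otherwise, for one 16-byte block. */
static __m128i test_block(__m128i block, __m128i vmask,
                          __m128i vzero, __m128i vones, __m128i v01)
{
    __m128i r = _mm_and_si128(block, vmask);       /* pand:    apply byte-mask  */
    r = _mm_cmpeq_epi8(r, vzero);                  /* pcmpeqb: 0xFF where zero  */
    r = _mm_xor_si128(r, vones);                   /* pxor:    flip the bytes   */
    return _mm_and_si128(r, v01);                  /* pand:    0xFF -> 0x01     */
}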
The next chunk of code is longer as I’ve unrolled a loop such that each loop iteration addresses one whole cache-line of data by using four XMM registers. A couple of things to point out are that the .LquadRegPre
label makes it easy to skip the code I’ve added to start the PMC timer in the line:
call *0xa8(%rsp)
The value 0xa8(%rsp)
references the function pointer start_counters
and the 0xa8
value allows for the values stored to the stack by the macro m_save_regs
, the code for which you can review on GitHub, if you like.
Then, of course, comes the prefetcht0
instruction, the cause of the delay in this post seeing the light of day. There’s not a great deal of information available about its operation, but some guidance about its use suggests ensuring that the data it references is sufficiently far ahead of the current read-point that it will have been loaded by the time it’s needed. Furthermore, since it may result in a load-instruction being issued, it should appear after any explicit load-instructions as it may negatively impact their performance. There are several types of prefetch
instruction, and although the whole area is slightly murky, I seem to get the best results with prefetcht0
.
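In intrinsics terms the placement I’m describing looks something like this (a sketch of mine; the 0x200-byte distance, eight cache lines ahead, is the knob I ended up tuning):
#include <emmintrin.h>                             /* SSE2 loads, _mm_prefetch */
#include <stddef.h>

/* Load one cache line of source into four registers, then prefetch the
 * line 0x200 bytes ahead, issued after the explicit loads. */
static void load_line_and_prefetch(const unsigned char *src, size_t i, __m128i r[4])
{
    r[0] = _mm_load_si128((const __m128i *)(src + i));
    r[1] = _mm_load_si128((const __m128i *)(src + i + 16));
    r[2] = _mm_load_si128((const __m128i *)(src + i + 32));
    r[3] = _mm_load_si128((const __m128i *)(src + i + 48));
    _mm_prefetch((const char *)(src + i + 0x200), _MM_HINT_T0);  /* prefetcht0 */
}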
.LquadRegPre:
m_save_regs
call *0xa8(%rsp)
m_restore_regs
.LquadReg:
movaps 0x10(%r12,%rdx,1), %xmm4 # Copy cache line into registers
movaps 0x20(%r12,%rdx,1), %xmm5
movaps 0x30(%r12,%rdx,1), %xmm6
movaps 0x40(%r12,%rdx,1), %xmm7
prefetcht0 0x210(%r12,%rdx,1) # Prefetch src cache line 8 loops hence (after movaps above)
pand %xmm0, %xmm4 # Bitwise-AND 16-byte value against the vectorised byte-mask
pand %xmm0, %xmm5
pand %xmm0, %xmm6
pand %xmm0, %xmm7
pcmpeqb %xmm1, %xmm4 # Compare byte-values with zero, each now 0xFF or 0x00
pcmpeqb %xmm1, %xmm5
pcmpeqb %xmm1, %xmm6
pcmpeqb %xmm1, %xmm7
pxor %xmm2, %xmm4 # Flip bytes using an operand with all bits set
pxor %xmm2, %xmm5
pxor %xmm2, %xmm6
pxor %xmm2, %xmm7
pand %xmm3, %xmm4 # Filter bits so those set 0xFF now equal 0x01
pand %xmm3, %xmm5
pand %xmm3, %xmm6
pand %xmm3, %xmm7
movaps %xmm4, 0x10(%r14,%rdx,1) # Copy result to output
movaps %xmm5, 0x20(%r14,%rdx,1)
movaps %xmm6, 0x30(%r14,%rdx,1)
movaps %xmm7, 0x40(%r14,%rdx,1)
add $0x40, %rdx # Add 64 to counter
mov %r15, %rax # Copy veclen
sub %rdx, %rax # Subtract counter from veclen
cmp $0x40, %rax # Compare remainder against 64 bytes
jge .LquadReg # if >= 64, repeat aligned branch
m_save_regs
call *0xa0(%rsp)
m_restore_regs
jmp .LloopByte # Branch to generic loop
The following series of instructions handles the case where there are fewer than 16 bytes remaining. It still loads 16 bytes into the XMM register, trusting in memory-alignment niceties to assume that we’re not going to attempt to load data off the side of the virtual page, but it takes care to ensure that it only writes the right number of bytes to the result. It does this by comparing the element-count successively with 8, then 4, 2 and finally 1, decrementing the count appropriately as bytes are written to the result vector. By way of example, assume that 15 bytes need to be processed: in that case, each of the sections which deal with 8, 4, 2 and 1 byte will execute, since 8 + 4 + 2 + 1 = 15. With only 7 bytes remaining, the code sections for 4, 2 and 1 would execute, and so on. A C sketch of the same logic follows the assembly below.
.LsubXmmRemaining:
movaps 0x10(%r12,%rdx,1), %xmm4 # Load 16-bytes of input
pand %xmm0, %xmm4 # compare
pcmpeqb %xmm1, %xmm4 # set to 0xff if equal
pxor %xmm2, %xmm4 # flip bytes
pand %xmm3, %xmm4 # convert 0xff to 0x01
movaps %xmm4, (%rsp) # Copy boolean results to stack
xor %rbx, %rbx # Zero RBX for counting bytes copied
mov %r15, %rax # Copy veclen(result)
sub %rdx, %rax # Calc remaining
cmp $0x08, %rax # Compare remaining with 8
jl .LltEightRemaining # branch < 8
movq (%rsp,%rbx,1), %r8 # Copy QW from stack to QW reg
movq %r8, 0x10(%r14,%rdx,1) # Copy the same to the result
add $0x08, %rdx # Add 8 to counter
add $0x08, %rbx # Add 8 to src-counter
sub $0x08, %rax # Subtract 8 from remaining
.LltEightRemaining: # Handle < 8
cmp $0x04, %rax # Compare remaining with 4
jl .LltFourRemaining # branch < 4
movl (%rsp,%rbx,1), %r8d # Copy result to DW reg
movl %r8d, 0x10(%r14,%rdx,1) # Copy DW to result
add $0x04, %rdx # Add 4 to counter
add $0x04, %rbx # Add 4 to src-counter
sub $0x04, %rax # Subtract 4 from remaining
.LltFourRemaining:
cmp $0x02, %rax
jl .LltTwoRemaining
movw (%rsp,%rbx,1), %r8w
movw %r8w, 0x10(%r14,%rdx,1) # Copy W to result
add $0x02, %rdx # Add 2 to counter
add $0x02, %rbx # Add 2 to src-counter
sub $0x02, %rax # Subtract 2 from remaining
.LltTwoRemaining:
cmp $0x01, %rax
jl .LnoneRemaining
movb (%rsp,%rbx,1), %r8b
movb %r8b, 0x10(%r14,%rdx,1) # Copy byte to result
.LnoneRemaining:
jmp .LsuccessExit
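Here’s the promised C sketch of that tail handling (names are mine; tmp is the 16-byte staging copy written to the stack with movaps, and rem is the number of bytes left to write):
#include <stddef.h>
#include <string.h>

/* Copy the final rem (< 16) result bytes in 8-, 4-, 2- and 1-byte chunks. */
static void copy_tail(unsigned char *dst, const unsigned char *tmp, size_t rem)
{
    size_t off = 0;
    if (rem >= 8) { memcpy(dst + off, tmp + off, 8); off += 8; rem -= 8; }
    if (rem >= 4) { memcpy(dst + off, tmp + off, 4); off += 4; rem -= 4; }
    if (rem >= 2) { memcpy(dst + off, tmp + off, 2); off += 2; rem -= 2; }
    if (rem >= 1) { dst[off] = tmp[off]; }
}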
PREFETCH performance data
I chose the .pmc.script4
set of performance counters to illustrate the effect of the prefetch
instruction since it shows µ-ops as well as some memory-related counters. Here are the results for the code with the prefetcht0
instruction in place:
q)t0:1_.pmc.script4`usr
q)-15#t0
instAny clkCore clkRef UopsAny MemUopsLd MemUopsSv StallCycles MHz nanos
------------------------------------------------------------------------
46860 15056 50816 51589 7826 6248 596 800 18821
46860 14858 50141 51599 7826 6248 526 800 18571
46860 14814 50006 51596 7826 6248 479 800 18521
46860 15130 51086 51589 7826 6248 695 800 18921
46860 14914 50330 51589 7826 6248 585 800 18641
46860 14810 49979 51599 7826 6248 501 800 18511
46860 14877 50195 51595 7826 6248 526 800 18591
46860 14854 50141 51599 7826 6248 550 800 18571
46860 14738 49736 51578 7826 6248 508 800 18421
46860 14737 49736 51593 7826 6248 505 800 18421
46860 14656 49466 51596 7826 6248 480 800 18321
46860 15008 50654 51593 7826 6248 642 800 18761
46860 14527 49007 51578 7826 6248 384 800 18151
46860 15015 50681 51589 7826 6248 614 800 18771
46860 14804 49979 51593 7826 6248 556 800 18511
And here are the results with the prefetcht0
instruction removed:
q)t1:1_.pmc.script4`usr
q)-15#t1
instAny clkCore clkRef UopsAny MemUopsLd MemUopsSv StallCycles MHz nanos
------------------------------------------------------------------------
45298 19202 64806 50025 6263 6248 1379 800 24002
45298 18919 63861 50021 6263 6248 1265 800 23652
45298 19291 65103 50025 6263 6248 1397 800 24112
45298 19034 64239 50029 6263 6248 1234 800 23792
45298 19296 65103 50041 6263 6248 1380 800 24112
45298 19176 64698 50021 6263 6248 1331 800 23962
45298 19337 65265 50041 6264 6248 1423 800 24172
45298 18977 64050 50021 6262 6248 1255 800 23722
45298 19254 64995 50025 6264 6248 1374 800 24072
45298 18942 63915 50021 6262 6248 1262 800 23672
45298 19293 65103 50001 6263 6248 1371 800 24112
45298 19074 64374 50203 6263 6248 1244 800 23842
45298 19395 65481 50017 6263 6248 1417 800 24252
45298 18976 64050 50026 6263 6248 1242 800 23722
45298 19277 65076 50005 6264 6248 1371 800 24102
Here are the averages of the performance runs joined using uj
or “union join”:
q)avgres:{[t;n] flip key[d]!enlist each value d:avg[n#t]}
q)avgres[t0;-100] uj avgres[t1;-100]
instAny clkCore clkRef UopsAny MemUopsLd MemUopsSv StallCycles nanos
---------------------------------------------------------------------------
46860 14987.26 50581.71 51595.45 7825.774 6248 602.7419 18734.23
45298 19240.52 64937.52 50030.77 6262.806 6248 1359.903 24050.71
OK, some things to note:
- the addition of the prefetcht0
instruction increases the number of instructions retired (instAny
column) and µ-ops issued. Despite the increased work, the code section appears to complete in a shorter time;
- the number of stall cycles is significantly higher in the version without the prefetcht0
instruction;
- the prefetcht0
instruction affects the results returned by the combination of PMC event MEM_UOP_RETIRED.ALL
filtered by MEM_UOP_RETIRED.LOADS
.
- the performance improvement factor appears to be around 1.28, but
- the processor is clearly executing at a lower clock frequency (helpfully calculated in the results tables as 800 MHz) and hence is not exercising the memory sub-system as hard as it would do, were it executing at full speed. What would happen in that case? Would the prefetcht0
instruction help, or would the memory sub-system reach saturation point, making no difference at all?
Well, by increasing the size of the first parameter to 1 million (from 100k), and increasing the test iterations, I managed to get the CPU to try a bit harder. Not only that, but also the memory sub-system: while 1 million bytes will still fit in the L3 cache, it’s making it work harder. If I use an average from the last 1024 test runs (of 2048), I get the following:
q)avgres[t0;-1024] uj avgres[t1;-1024]
instAny clkCore clkRef UopsAny MemUopsLd L3Miss StallCycles MHz nanos
-----------------------------------------------------------------------------------
468720 208939.4 166570.2 515684.1 78138.16 15.08691 45108.97 3387.897 61692.64
453096 265471.6 211694.3 500098.8 62515.32 6.955078 51533.47 3386.98 78405.09
So…
- similar proportions of MemUopsLd
, but a vastly different stall-cycle ratio to the low-power runs;
- when playing with different iteration counts, the L3Miss
value was significantly higher than 15.0 for the run with the prefetcht0
instruction;
- the prefetcht0
version is faster by a factor of 1.27, which is nice;
- if the nanos
column really can be treated as wall time, then a function which (once warmed-up etc) handles 1mm bitwise comparisons in 61 micros is probably going to work out OK.
A production version?
I have put together a non-commercial version for you to play with which you can download from here. You can load it using the kdbasm.q
scriptlet:
/ load the shared library 'libbwcmp.so' and assign the 2-arg function 'bitwiseOr' to .mg.or
.mg.or:`libbwcmp 2:(`bitwiseOr;2);
An easy way to execute it and see its results is as follows:
q)n:2000
q).mg.or[n#`byte$ til 0xFF;0x01]
01010101010101010101010101010101010101010101010101010101010101010101010101010..
q).mg.or[n#`byte$ til 0xFF;0x02]
00110011001100110011001100110011001100110011001100110011001100110011001100110..
q).mg.or[n#`byte$ til 0xFF;0x03]
01110111011101110111011101110111011101110111011101110111011101110111011101110..
q).mg.or[n#`byte$ til 0xFF;0x08]
00000000111111110000000011111111000000001111111100000000111111110000000011111..
q).mg.or[n#`byte$ til 0xFF;0x10]
00000000000000001111111111111111000000000000000011111111111111110000000000000..
q).mg.or[n#`byte$ til 0xFF;0xFF]
01111111111111111111111111111111111111111111111111111111111111111111111111111..
q)\t .mg.or[n#`byte$ til 0xFF;0xFF]
0
q)/ \:D/
Next steps?
Well, there’s clearly scope for extending this function to support more integer types, though I do wonder about the utility of writing something for long
values. That’s a lot of flag bits which you’d be storing. I think after perhaps writing something for short
values, targeting the AVX registers is the obvious thing to do. It shouldn’t be too hard, I don’t think (famous last words?), since apart from more stringent memory-alignment requirements, the approach ought to be almost identical. It will be interesting to see how much faster it goes, how many instructions and µ-ops are issued, and how it affects the memory controllers, since if the code executes faster they will have to work harder to service the load-requests. Anyway, that’s for another day. Time to put this one to bed.