This post is part of the series on performance monitoring with Intel MSRs on Linux:
- A Linux Module For Reading/Writing MSRs
- Intel MSR Performance Monitoring Basics
- Fun with MSRs: Counting Performance Events On Intel
- Scripting MSR Performance Tests With kdb+
- Scripting MSR Performance Tests With kdb+: Part 2
- Intel Performance Monitoring: Loose Ends (this post)
If you haven’t already, you’ll need to download the q 64-bit evaluation version for Linux from Kx Systems. You will also need root access to your Linux system, since the next step is to download the source code for the MSR kernel driver from GitHub and install it onto your system. The source code for this post is also hosted on GitHub and can be found here. Both directories have a Makefile and so should be easy enough to use; if you’re doing performance monitoring on Linux you can probably work a Makefile.
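A rough sketch of that build-and-install step, assuming the usual kernel-module workflow (the directory and module names here are placeholders - check the repository's Makefile for the real targets):

cd msr-driver                   # placeholder for the cloned driver directory
make                            # build the module
sudo insmod ./msr-driver.ko     # load it; the real .ko name depends on the repository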
The next part is “installing” the q instance. This is as simple as unzipping the linux.zip file in some vaguely sensible location and then setting the environment variable QHOME to the absolute path of the directory containing the l64 directory (which is very easy to find). After that you can either add the l64 directory to your PATH or invoke q with a relative prefix.
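For example, something along these lines (the paths are just placeholders for wherever you unpacked the zip):

unzip linux.zip -d ~/            # unpack somewhere sensible
export QHOME=~/q                 # whatever directory ends up containing l64
export PATH="$PATH:$QHOME/l64"   # or invoke q with a relative/absolute prefix instead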
One thing which will make your life easier on Linux is using rlwrap to give you a sensible console experience - e.g. using the cursor keys to navigate the command line rather than emitting ASCII control codes to the screen, and so on. I alias the q command as follows:
alias qq='rlwrap q'
And that’s all there is to it.
Assuming you’ve built the source and installed the driver…
The next step is to set the LD_LIBRARY_PATH variable to contain (or equal) the path of the directory containing the libpmc.so file. After that, you can start q by issuing the command
rlwrap q pmc.q
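where LD_LIBRARY_PATH has already been set in that shell, e.g. along these lines (the path is just a placeholder for wherever you built libpmc.so):

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/libpmc/build/dir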
The script automatically loads the pmcdata.csv file and you can view its contents by simply typing the following at the q prompt:
q).pmc.evt
But of course, you only need to do that if you want to see what it contains.
I’ve canned a couple of PMC configurations but you should definitely consider configuring your own. I wouldn’t go so far as to say they’re the most coherent set of PMC choices, but hey, you can mix and match and still see something useful on each run. I’m talking about the functions .pmc.script1, .pmc.script2, .pmc.script3 and .pmc.script4. Each of these functions takes what I’ve called a domain argument, which is a symbol atom or vector with possible values `usr`os. These are flag values (in the sense that the IA32_PERFEVTSELx register has individual bit-flags): by specifying one, the other, or both, you determine whether the PMCs and FFCs count in ring 0, in rings 1, 2 & 3, or in all of them. I have to say that there’s not much point counting cycles etc. in the kernel (ring 0) with the trite testcode.c I uploaded to GitHub, since it only executes a user-land function (gettimeofday) and you get nonsense back when specifying just `os! Interesting, nonetheless.
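For the PMCs those domain symbols correspond to the USR (bit 16) and OS (bit 17) flags of IA32_PERFEVTSELx; the FFCs have their own enable bits, laid out differently, in IA32_FIXED_CTR_CTRL. As a rough q illustration of the PERFEVTSEL mapping (the function name is mine, and this isn't necessarily how the .pmc code builds its register values):

q)domainFlags:{sum 65536 131072 where `usr`os in x}  / bit 16 = USR (rings 1-3), bit 17 = OS (ring 0)
q)domainFlags[`usr`os]
196608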
There’s an added enhancement to the libpmc.c code as well. Instead of relying on functions declared as extern and then packaged together by the linker, it uses the dynamic loader to look for the dynamic shared object (DSO) libtest.so. It loads this on each execution of a test run (and pre-links the symbols to avoid some ugly skew on the first run), which means that it’s possible to leave kdb+ running, recompile your code with a minor change in it, and re-run the script you ran before. This way you can see immediately what effect your change has had on the execution characteristics. As my almost-three-year-old would say: “Cool, huh?” (note to self: stop saying things like that when he’s around).
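Mechanically it's the standard dlopen/dlsym pattern. A minimal C sketch of the idea (the path, symbol name and signature below are placeholders rather than the actual libpmc.c code):

#include <dlfcn.h>
#include <stdio.h>

/* Sketch: load libtest.so afresh for a test run, resolving all symbols up
   front with RTLD_NOW so the first timed call doesn't pay the lazy-binding
   penalty, then drop the handle so a recompiled DSO is picked up next time. */
static int run_test_once(void)
{
    void *h = dlopen("./libtest.so", RTLD_NOW);
    if (!h) { fprintf(stderr, "%s\n", dlerror()); return -1; }

    void (*test_entry)(void) = (void (*)(void))dlsym(h, "test_entry");  /* placeholder symbol */
    if (test_entry) test_entry();

    dlclose(h);
    return 0;
}

Opening with RTLD_NOW is what does the pre-linking: every undefined symbol is resolved at load time rather than on first use, which would otherwise skew the first set of counter readings.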
Anyway, for the non-q programmers amongst you, you could execute a script in the following ways:
.pmc.script1[`os]
.pmc.script1`os
.pmc.script1[`os`usr]
.pmc.script1`os`usr
The first two lines are functionally equivalent, as are the final two.
I’ll probably keep remembering things I need to add to these instructions, but one in particular is in the .pmc.runscript function. I have hard-coded (in the sense that anything in an ASCII text file is hard-coded) the reference clock-speed of my CPU: 2.7 GHz. The script uses the ratio between the actual and reference clock cycle counts (from the FFCs) to show a rough number representing the current core speed. I’ve found it surprisingly accurate, and as mentioned in a previous post, the core spends most of its time at 800 MHz. The other results column worth treating with some suspicion is the “nanos” column. This basically takes the number of reference clock ticks counted by the FFC (careful about whether it’s counting in usr/os) and divides it by 2.7 to come up with a spurious “wall-time” nanosecond value. Little more than a trinket, to be honest, but it amused me at the time.
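In q terms the two derived columns appear to come out along these lines, where results stands for the table of raw counter readings (my reconstruction from the output below, not necessarily the exact .pmc.runscript code):

q)update MHz:2700*clkCore%clkRef, nanos:clkRef%2.7 from results  / 2.7 GHz reference clock, i.e. 2.7 reference ticks per nanosecond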
It’s definitely worth pondering, if you get the chance, whether the load-pressure exerted on your memory sub-system by the CPU will do something interesting if your CPU changes its speed-stepping to start running at full speed (or even in turbo mode). In more complex code with hefty workloads, after the first few iterations the results may show your CPU increasing its clock-speed and hence issuing more load (and store) requests to the QPI and LLC/IO Controller per time unit. Now, you could get to this point by disabling speed-stepping - probably your preferred option if you’re trying to do something vaguely forensic - but it’s very interesting to see how the profile of some code changes with the CPU-to-Bus ratio. The system which coped amazingly well with the load put on it by a quiesced CPU may suddenly show up as a bottleneck when the CPU’s speed increases.
As a parting example, here’s what I get when I execute .pmc.script1[`usr]:
instAny clkCore clkRef UopsAny UopsP015 UopsP234 L3Miss MHz nanos
-----------------------------------------------------------------
83      497     1685   131     68       47       0      796 624
83      112     362    124     52       39       0      835 134
83      74      227    108     45       36       0      880 84
83      74      254    108     41       39       0      787 94
83      73      254    112     47       34       0      776 94
83      74      227    112     47       35       0      880 84
83      78      281    112     48       36       0      749 104
83      77      254    112     49       36       0      819 94
83      79      254    112     51       36       0      840 94
83      77      254    112     47       31       0      819 94
83      79      281    112     51       36       0      759 104
83      74      254    112     47       35       0      787 94
83      79      281    112     48       37       0      759 104