This post is part of the series on performance monitoring with Intel MSRs on Linux:
- A Linux Module For Reading/Writing MSRs
- Intel MSR Performance Monitoring Basics
- Fun with MSRs: Counting Performance Events On Intel: this post
- Scripting MSR Performance Tests With kdb+
- Scripting MSR Performance Tests With kdb+: Part 2
- Intel Performance Monitoring: Loose Ends
Hi, the last two posts have laid some groundwork for this post, in which I hope to show how you can measure various performance-related events using Intel’s MSRs. This post assumes you have at least installed the MSR kernel module discussed in this earlier post. All we’re going to do this time is record two MSR configuration scripts to memory and execute some arbitrary code to measure some performance metrics. One script will configure the MSRs and reset the counter values to zero, while the other will read the accumulated values after the test code has executed.
The code we’re going to profile is going to be enormously simple: we’re going to print something using the printf
function. At least this should be very easy to alter to test how different argument types require more processing than others.
Here’s the code:
msrdrv.c
#include "msrdrv.h"
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>
#include <stdio.h>
static int loadDriver()
{
int fd;
= open("/dev/" DEV_NAME, O_RDWR);
fd if (fd == -1) {
("Failed to open /dev/" DEV_NAME);
perror}
return fd;
}
static void closeDriver(int fd)
{
int e;
= close(fd);
e if (e == -1) {
("Failed to close fd");
perror}
}
/*
* Reference:
* Intel Software Developer's Manual Vol 3B "253669.pdf" August 2012
* Intel Software Developer's Manual Vol 3C "326019.pdf" August 2012
*/
int main(void)
{
int fd;
struct MsrInOut msr_start[] = {
{ MSR_WRITE, 0x38f, 0x00, 0x00 }, // ia32_perf_global_ctrl: disable 4 PMCs & 3 FFCs
{ MSR_WRITE, 0xc1, 0x00, 0x00 }, // ia32_pmc0: zero value (35-5)
{ MSR_WRITE, 0xc2, 0x00, 0x00 }, // ia32_pmc1: zero value (35-5)
{ MSR_WRITE, 0xc3, 0x00, 0x00 }, // ia32_pmc2: zero value (35-5)
{ MSR_WRITE, 0xc4, 0x00, 0x00 }, // ia32_pmc3: zero value (35-5)
{ MSR_WRITE, 0x309, 0x00, 0x00 }, // ia32_fixed_ctr0: zero value (35-17)
{ MSR_WRITE, 0x30a, 0x00, 0x00 }, // ia32_fixed_ctr1: zero value (35-17)
{ MSR_WRITE, 0x30b, 0x00, 0x00 }, // ia32_fixed_ctr2: zero value (35-17)
{ MSR_WRITE, 0x186, 0x004101c2, 0x00 }, // ia32_perfevtsel1, UOPS_RETIRED.ALL (19-28)
{ MSR_WRITE, 0x187, 0x0041010e, 0x00 }, // ia32_perfevtsel0, UOPS_ISSUED.ANY (19.22)
{ MSR_WRITE, 0x188, 0x01c1010e, 0x00 }, // ia32_perfevtsel2, UOPS_ISSUED.ANY-stalls (19-22)
{ MSR_WRITE, 0x189, 0x004101a2, 0x00 }, // ia32_perfevtsel3, RESOURCE_STALLS.ANY (19-27)
{ MSR_WRITE, 0x38d, 0x222, 0x00 }, // ia32_perf_fixed_ctr_ctrl: ensure 3 FFCs enabled
{ MSR_WRITE, 0x38f, 0x0f, 0x07 }, // ia32_perf_global_ctrl: enable 4 PMCs & 3 FFCs
{ MSR_STOP, 0x00, 0x00 }
};
struct MsrInOut msr_stop[] = {
{ MSR_WRITE, 0x38f, 0x00, 0x00 }, // ia32_perf_global_ctrl: disable 4 PMCs & 3 FFCs
{ MSR_WRITE, 0x38d, 0x00, 0x00 }, // ia32_perf_fixed_ctr_ctrl: clean up FFC ctrls
{ MSR_READ, 0xc1, 0x00 }, // ia32_pmc0: read value (35-5)
{ MSR_READ, 0xc2, 0x00 }, // ia32_pmc1: read value (35-5)
{ MSR_READ, 0xc3, 0x00 }, // ia32_pmc2: read value (35-5)
{ MSR_READ, 0xc4, 0x00 }, // ia32_pmc3: read value (35-5)
{ MSR_READ, 0x309, 0x00 }, // ia32_fixed_ctr0: read value (35-17)
{ MSR_READ, 0x30a, 0x00 }, // ia32_fixed_ctr1: read value (35-17)
{ MSR_READ, 0x30b, 0x00 }, // ia32_fixed_ctr2: read value (35-17)
{ MSR_STOP, 0x00, 0x00 }
};
= loadDriver();
fd (fd, IOCTL_MSR_CMDS, (long long)msr_start);
ioctl("This is a hex number 0x%x\n", -1);
printf(fd, IOCTL_MSR_CMDS, (long long)msr_stop);
ioctl("uops retired: %7lld\n", msr_stop[2].value);
printf("uops issued: %7lld\n", msr_stop[3].value);
printf("stalled cycles: %7lld\n", msr_stop[4].value);
printf("resource stalls: %7lld\n", msr_stop[5].value);
printf("instr retired: %7lld\n", msr_stop[6].value);
printf("core cycles: %7lld\n", msr_stop[7].value);
printf("ref cycles: %7lld\n", msr_stop[8].value);
printf(fd);
closeDriverreturn 0;
}
The code (rather than the content of the MsrInOut
arrays) is straighforward. It opens a the file /dev/msrdrv
, the char-driver handle set up in the install.sh
script by the mknod
command. The code uses the ioctl
command to communicate through this file-handle with the driver. If you looked through the driver source-code (see the previous posts, link above), you would see that the final parameter contains the address of the MSR controller script we want to execute. To compile the above code, I used: gcc -g -I. -O2 -o msrtest msrtest.c
Once the msr_start
script has executed, the performance counters will be ticking. The program calls the printf
function and the promptly invokes the ioclt
function with the address of the msr_stop
script. Since the stop script also contains the result-reading commands, the program then simply prints out the values now stored in various elements of the msr_stop
array.
Some discussion of the contents of the msr_start
and msr_stop
is now necessary. I’ve tried to comment the code well enough to make the scripts intlligible, but here goes anyway:
The msr_start
script
MSR_WRITE, 0x38f
: this is the global control switch for both the fixed-function and general-purpose counters. This instruction clears the “enable” bits so no further counting will take place.MSR_WRITE, 0xc1
: also0xc2
,0xc3
,0xc4
. This zeros the values in the general-purpose counters.MSR_WRITE, 0x309
: also0x30a
,0x30b
. This zeros the values in the fixed-function counters.MSR_WRITE, 0x186
: configures the events counted intoPMC0
. The value0x004101c2
enables the counter, sets it to count when executing in user-mode, and selects the eventUOPS_RETIRED.ALL
(event0x0e
, umask0x01
). This counts the µ-ops which were actually executed (some will not be due to branch mis-prediction). See theIA32_PERFEVTSELx
image below for help in deciphering what the bit-fields mean.MSR_WRITE, 0x187
: configures the events counted intoPMC1
:UOPS_ISSUED.ANY
. This counts the µ-ops issued by the instruction decoder.MSR_WRITE, 0x188
: configures the events counted intoPMC2
:UOPS_ISSUED.ANY
— stall cycles. You should read the Intel docs about how settingcmask = 1
,inv = 1
(and possiblyany = 1
) inverts the counter so that it counts cycles when no µ-ops are issued.MSR_WRITE, 0x189
: configures the events counted intoPMC3
:RESOURCE_STALLS.ANY
. The manual states that this counts “Cycles Allocation is stalled due to Resource Related reason [sic]”… I think I know what it means, but wouldn’t put money on it.MSR_WRITE, 0x38d
: configures the fixed-function counters to be enabled when the logical core is executing user-land code. See Vol. 3B, pp. 18-8, 18-5.MSR_WRITE, 0x38f
: flips the bits in the global MSR control register to enable all four PMCs (0x0f
) and all three FFCs (0x07
). Once these values have been written by theWRMSR
command, the counters will start accumulating values.MSR_STOP
: exits the loop within the MSR driver.
The msr_stop
script
MSR_WRITE, 0x38f
: stop the counters using the global control. Issue this first to stop the counters as soon as possible.MSR_WRITE, 0x38d
: some housekeeping. This clears the enabled-flags in the fixed-function configuration MSR.MSR_READ, 0xc1
: also0xc2
,0xc3
,0xc4
. These read the values from each of the four PMCs into theMsrInOut
array.MSR_READ, 0x309
: also0x30a
,0x30b
. These read the values from each of the three FFCs into theMsrInOut
array.MSR_STOP
: exits the loop within the MSR driver.
It might be quite instructive to view the contents of the msr_start
and msr_stop
arrays to see how the MsrInOut
structs pack into memory. While the kernel module has a compile-time debug flag, it’s still useful to be able to see what it is that’s being sent to the driver:
(gdb) x/60xw &msr_start
0x7fffffffdf20: 0x00000002 0x0000038f 0x00000000 0x00000000
0x7fffffffdf30: 0x00000002 0x000000c1 0x00000000 0x00000000
0x7fffffffdf40: 0x00000002 0x000000c2 0x00000000 0x00000000
0x7fffffffdf50: 0x00000002 0x000000c3 0x00000000 0x00000000
0x7fffffffdf60: 0x00000002 0x000000c4 0x00000000 0x00000000
0x7fffffffdf70: 0x00000002 0x00000309 0x00000000 0x00000000
0x7fffffffdf80: 0x00000002 0x0000030a 0x00000000 0x00000000
0x7fffffffdf90: 0x00000002 0x0000030b 0x00000000 0x00000000
0x7fffffffdfa0: 0x00000002 0x00000186 0x004101c2 0x00000000
0x7fffffffdfb0: 0x00000002 0x00000187 0x0041010e 0x00000000
0x7fffffffdfc0: 0x00000002 0x00000188 0x01c1010e 0x00000000
0x7fffffffdfd0: 0x00000002 0x00000189 0x004101a2 0x00000000
0x7fffffffdfe0: 0x00000002 0x0000038d 0x00000222 0x00000000
0x7fffffffdff0: 0x00000002 0x0000038f 0x0000000f 0x00000007
0x7fffffffe000: 0x00000003 0x00000000 0x00000000 0x00000000
And the stop script before it is “executed”:
(gdb) x/40xw &msr_stop
0x7fffffffe010: 0x00000002 0x0000038f 0x00000000 0x00000000
0x7fffffffe020: 0x00000002 0x0000038d 0x00000000 0x00000000
0x7fffffffe030: 0x00000001 0x000000c1 0x00000000 0x00000000
0x7fffffffe040: 0x00000001 0x000000c2 0x00000000 0x00000000
0x7fffffffe050: 0x00000001 0x000000c3 0x00000000 0x00000000
0x7fffffffe060: 0x00000001 0x000000c4 0x00000000 0x00000000
0x7fffffffe070: 0x00000001 0x00000309 0x00000000 0x00000000
0x7fffffffe080: 0x00000001 0x0000030a 0x00000000 0x00000000
0x7fffffffe090: 0x00000001 0x0000030b 0x00000000 0x00000000
0x7fffffffe0a0: 0x00000003 0x00000000 0x00000000 0x00000000
And after it has been “executed” (in which I’ve mixed the output of “x/40xw &msr_stop
” and “x/20xg &msr_stop
” to make it clearer):
0x7fffffffe010: 0x00000002 0x0000038f 0x0000000000000000
0x7fffffffe020: 0x00000002 0x0000038d 0x0000000000000000
0x7fffffffe030: 0x00000001 0x000000c1 0x0000000000002866
0x7fffffffe040: 0x00000001 0x000000c2 0x0000000000002bfa
0x7fffffffe050: 0x00000001 0x000000c3 0x0000000000017ca7
0x7fffffffe060: 0x00000001 0x000000c4 0x00000000000003ea
0x7fffffffe070: 0x00000001 0x00000309 0x0000000000000dd3
0x7fffffffe080: 0x00000001 0x0000030a 0x00000000000079c4
0x7fffffffe090: 0x00000001 0x0000030b 0x0000000000019b1e
0x7fffffffe0a0: 0x00000003 0x00000000 0x0000000000000000
The Results
These are the results printed out on my system, for one iteration through the code, with no accounting for the fixed costs of starting and stopping the performance-counters. It wasn’t run under gdb
so the results won’t tally with the values printed above. However, it suffices for the purposes of showing how you can count with the PMCs: it’s never been about measuring the performance of the printf
command!
$ ./msrtest
0xffffffff
This is a hex number : 10043
uops retired: 10956
uops issued: 83345
stalled cycles: 371
resource stalls: 3538
instr retired: 25225
core cycles: 85131 ref cycles
As alluded to above, there are many potential sources of error here: that only one iteration was performed, that I didn’t establish some baseline, fixed-cost values for invoking the ioctl
function, that I haven’t disabled hyper-threading, that I didn’t bind my code to a core, that I didn’t disable speed-stepping or turbo boost. All this is up to you and I’ll show how you do some of them in a future post.