It’s not often you get a really nasty surprise when writing software, and even less often that the nasty surprise is lurking in the compiler (or assembler). It turns out that the Gnu assembler does not treat the operands of instructions such as `fsub` and `fdiv` uniformly; in fact, some of the time it does the opposite of what you instruct it to do.
Were I to start ranting about this state of affairs, I would probably mention in passing that one of the attractions of writing assembly was that you get exactly what you expect, rather than a compiler writer’s interpretation of your interpretation of a higher-level language, and that it might be a shame if someone managed to corrupt your intention when you wrote “subtract y from x” and instead substituted “subtract x from y”.
The bug, as it was christened long before I happened upon it, occurs when using the Gnu assembler with a non-commutative x87 FPU instruction where the source operand is `%st(0)`, and is documented fairly quietly in the Gnu Assembler documentation. Let’s have a look at the `fsub` instruction.
The Intel manual provides the following:
Opcode | Instruction | Description |
---|---|---|
D8 E0+i | FSUB ST(0), ST(i) | Subtract ST(i) from ST(0) and store result in ST(0). |
DC E8+i | FSUB ST(i), ST(0) | Subtract ST(0) from ST(i) and store result in ST(i). |
DE E8+i | FSUBP ST(i), ST(0) | Subtract ST(0) from ST(i), store result in ST(i), and pop register stack. |
It continues:
Subtracts the source operand from the destination operand and stores the difference in the destination location. The destination operand is always an FPU data register; the source operand can be a register or a memory location. Source operands in memory can be in single-precision or double-precision floating-point format or in word or doubleword integer format.
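To make the two register-register encodings concrete, here is a minimal Python sketch of the first two rows of the table. The function name `encode_fsub` and its error handling are my own illustrative assumptions; only the byte values come from the Intel table.

```python
# Register-register forms of FSUB, as listed in the Intel table.
# encode_fsub is a hypothetical helper, not part of any real tool.

def encode_fsub(dest: int, src: int) -> bytes:
    """Encode FSUB ST(dest), ST(src), Intel operand order (destination first)."""
    if not (0 <= dest <= 7 and 0 <= src <= 7):
        raise ValueError("x87 stack registers are ST(0)..ST(7)")
    if dest == 0:
        # D8 E0+i: ST(0) <- ST(0) - ST(i)
        return bytes([0xD8, 0xE0 + src])
    if src == 0:
        # DC E8+i: ST(i) <- ST(i) - ST(0)
        return bytes([0xDC, 0xE8 + dest])
    raise ValueError("one of the two operands must be ST(0)")

print(encode_fsub(0, 1).hex())  # FSUB ST(0), ST(1)
print(encode_fsub(1, 0).hex())  # FSUB ST(1), ST(0)
```

The point to keep in mind is that the destination register decides the first opcode byte: `D8` when the result lands in ST(0), `DC` when it lands in ST(i).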
So far, so good. There is of course the usual operand reversal for gas:
; Intel
fsub st(0), st(1) ; st(0) <-- st(0)-st(1)
# Gnu Assembler
fsub %st(1), %st(0) # st(0) <-- st(0)-st(1)
In the example above, the source operand is `%st(1)`, so all’s well and you get what you expected. However, if you write
fsub %st(0), %st(1) # st(1) <-- st(1)-st(0) ... or is it?
what you get is not `st(1) <-- st(1)-st(0)`, but rather `st(1) <-- st(0)-st(1)`. Has it reversed the operands for the subtraction (but at least had the decency to store the result in the right register)?
Unfortunately, this is exactly what it does, as you can see for yourself if you assemble and execute the `fsubtest.s` application whose listing appears below (a simple `Makefile` is listed too). The output of the program is as follows:
input: st(0): 5.0
st(1): 7.0
instr: fsub %st(0), %st(1)
output: st(0): 5.0
st(1): -2.0 [<- No: right register, wrong result, Ed.]
input: st(0): 5.0
st(1): 7.0
instr: fsub %st(1), %st(0)
output: st(0): -2.0
st(1): 7.0
input: st(0): 5.0
st(1): 7.0
instr: fsubr %st(0), %st(1)
output: st(0): 5.0
st(1): 2.0 [<- No: right register, wrong result, Ed.]
input: st(0): 5.0
st(1): 7.0
instr: fsubr %st(1), %st(0)
output: st(0): 2.0
st(1): 7.0
I don’t know whether the following table makes things any easier to follow, but the expected and actual outcomes are made explicit:
instruction | outcome | operation | example | behaviour |
---|---|---|---|---|
fsub %st(0), %st(1) | expected | %st(1) = %st(1)-%st(0) | %st(1) = 7-5 = 2 | |
fsub %st(0), %st(1) | actual | %st(1) = %st(0)-%st(1) | %st(1) = 5-7 = -2 | “wrong” |
fsub %st(1), %st(0) | expected | %st(0) = %st(0)-%st(1) | %st(0) = 5-7 = -2 | |
fsub %st(1), %st(0) | actual | %st(0) = %st(0)-%st(1) | %st(0) = 5-7 = -2 | |
fsubr %st(0), %st(1) | expected | %st(1) = %st(0)-%st(1) | %st(1) = 5-7 = -2 | |
fsubr %st(0), %st(1) | actual | %st(1) = %st(1)-%st(0) | %st(1) = 7-5 = 2 | “wrong” |
fsubr %st(1), %st(0) | expected | %st(0) = %st(1)-%st(0) | %st(0) = 7-5 = 2 | |
fsubr %st(1), %st(0) | actual | %st(0) = %st(1)-%st(0) | %st(0) = 7-5 = 2 | |
The table hopefully shows how the assembler fails to produce the correct opcodes for the `fsub` and `fsubr` instructions where `%st(0)` is the source operand. Put another way, the assembler has `fsub %st(0), %st(n)` mixed up with `fsubr %st(0), %st(n)`.
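If it helps, the mix-up can be modelled in a few lines of Python. This is only a toy model of the behaviour observed in the test output, with hypothetical function names; it is not how gas is implemented. The list `st` stands in for the register stack, with `st[0]` as `%st(0)` and `st[1]` as `%st(1)`.

```python
# Toy model of the observed behaviour; all names here are illustrative.

def fsub(st, dest, src):
    """What the Intel manual specifies: dest <- dest - src."""
    st[dest] = st[dest] - st[src]

def fsubr(st, dest, src):
    """Reverse subtract: dest <- src - dest."""
    st[dest] = st[src] - st[dest]

def gas_default(mnemonic, st, src, dest):
    """Apply `mnemonic src, dest` (AT&T operand order) as default gas does.

    When the source is %st(0) and the destination is %st(i) with i > 0,
    fsub and fsubr trade places; otherwise they behave as documented.
    """
    documented = {"fsub": fsub, "fsubr": fsubr}
    swapped = {"fsub": fsubr, "fsubr": fsub}
    table = swapped if (src == 0 and dest != 0) else documented
    table[mnemonic](st, dest, src)
    return st

for mnemonic, src, dest in [("fsub", 0, 1), ("fsub", 1, 0),
                            ("fsubr", 0, 1), ("fsubr", 1, 0)]:
    result = gas_default(mnemonic, [5.0, 7.0], src, dest)
    print(f"{mnemonic} %st({src}), %st({dest}) ->", result)
```

Run with the same inputs as the test program (5.0 in `%st(0)`, 7.0 in `%st(1)`), the model gives the same four results as the "actual" rows of the table.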
I only found out about the bug as I was trying to translate some FASM code into gas and couldn’t understand the reason for the incorrect result. The problem was worse as the only logical explanation was that the compiler was broken — which is usually a good indicator that your logic is flawed and you need to check it again.
The other thing which surprises me is that this bug isn’t the top result for a search such as “(Gnu assembler|gas) fsub”. I couldn’t find mention of it in Blum’s Professional Assembly Language book, a very good reference regardless. It turns out that the `fdiv` instruction family is also mangled by the Gnu assembler. Alan Modra, who has dealt with this bug on the Sourceware and GCC mailing lists since at least 1999, writes:
Here are examples of `broken' opcodes. You might like to verify that your
Unixware assemblers produce the same.
1 0000 DCE3 fsub %st,%st(3)
2 0002 DCEB fsubr %st,%st(3)
3 0004 DCF3 fdiv %st,%st(3)
4 0006 DCFB fdivr %st,%st(3)
5 0008 DEE3 fsubp %st,%st(3)
6 000a DEEB fsubrp %st,%st(3)
7 000c DEF3 fdivp %st,%st(3)
8 000e DEFB fdivrp %st,%st(3)
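To see what those bytes mean to the processor, here is a small Python sketch which decodes each pair against the Intel manual. The lookup table is my transcription of the `DC` and `DE` register forms from the manual's FSUB, FSUBR, FDIV and FDIVR entries (only the FSUB rows are quoted in this article), and `intel_decode` is a hypothetical helper name.

```python
# Intel-manual meaning of the DC/DE register forms, keyed by
# (first byte, second byte with the register number masked off).
INTEL = {
    (0xDC, 0xE0): "fsubr",  (0xDC, 0xE8): "fsub",
    (0xDC, 0xF0): "fdivr",  (0xDC, 0xF8): "fdiv",
    (0xDE, 0xE0): "fsubrp", (0xDE, 0xE8): "fsubp",
    (0xDE, 0xF0): "fdivrp", (0xDE, 0xF8): "fdivp",
}

def intel_decode(code: bytes) -> str:
    """Return the Intel-syntax instruction for a two-byte DC/DE opcode."""
    first, second = code[0], code[1]
    name = INTEL[(first, second & 0xF8)]
    return f"{name} st({second & 0x07}), st(0)"

# (bytes, mnemonic as written in the listing quoted above)
LISTING = [("DCE3", "fsub"), ("DCEB", "fsubr"), ("DCF3", "fdiv"),
           ("DCFB", "fdivr"), ("DEE3", "fsubp"), ("DEEB", "fsubrp"),
           ("DEF3", "fdivp"), ("DEFB", "fdivrp")]

for hexbytes, written in LISTING:
    print(hexbytes, "written as", written, "-> Intel:",
          intel_decode(bytes.fromhex(hexbytes)))
```

In every one of the eight cases the mnemonic written in the source and the mnemonic the Intel manual assigns to the emitted bytes differ by exactly the `r`.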
Here’s the short test program I mentioned earlier. It passes a function pointer as a parameter to a routine which loads two values onto the FPU stack, calls the function through the pointer, and then prints the inputs alongside the resulting `%st(0)` and `%st(1)`. It’s very simple:
fsubtest.s
.section .data
msgfmt:
.ascii "input: st(0): %.1f\n"
.ascii " st(1): %.1f\n"
.ascii "instr: %s\n"
.ascii "output: st(0): %.1f\n"
.asciz " st(1): %.1f\n\n"
s_subst0st1:
.asciz "fsub %st(0), %st(1)"
s_subst1st0:
.asciz "fsub %st(1), %st(0)"
s_subrst0st1:
.asciz "fsubr %st(0), %st(1)"
s_subrst1st0:
.asciz "fsubr %st(1), %st(0)"
st0:
.double 5.0
st1:
.double 7.0
.section .bss
.lcomm result, 0x8
.section .text
.globl _start
_start:
nop
lea fsubst0st1, %rdi
call finvoker
lea fsubst1st0, %rdi
call finvoker
lea fsubrst0st1, %rdi
call finvoker
lea fsubrst1st0, %rdi
call finvoker
xor %rdi, %rdi
call exit
fsubst0st1:
fsub %st(0), %st(1)
lea s_subst0st1, %rsi
ret
fsubst1st0:
fsub %st(1), %st(0)
lea s_subst1st0, %rsi
ret
fsubrst0st1:
fsubr %st(0), %st(1)
lea s_subrst0st1, %rsi
ret
fsubrst1st0:
fsubr %st(1), %st(0)
lea s_subrst1st0, %rsi
ret
finvoker:
push %rbp # Store base-pointer
and $~0xF, %rsp # Align stack-pointer for call to printf
finit
fldl st1 # Push value at st1 onto FP stack
fldl st0 # Push value at st0 onto FP stack
call *%rdi # Invoke function pointer
movsd st0, %xmm0 # Copy value at st0 to XMM0
movsd st1, %xmm1 # Copy value at st1 to XMM1
fstpl result # Copy st(0) to result and pop FP stack
movsd result, %xmm2 # Copy value from FP stack to XMM2
fstpl result # Repeat for next top-of-stack
movsd result, %xmm3 # Copy value to XMM3
lea msgfmt, %rdi # Load address of msgfmt into RDI
mov $0x5, %al # Set varargs-count in AL
call printf
pop %rbp # Restore base pointer prior to return
ret
And its `Makefile`:
Makefile
all: fsubtest.o
	ld --dynamic-linker /lib/ld-linux-x86-64.so.2 -o fsubtest -lc fsubtest.o
fsubtest.o: fsubtest.s
	as -gstabs -o fsubtest.o fsubtest.s
clean:
	rm -f *.o fsubtest
So that’s it?
Well, actually, there’s more. It turns out that you can’t inspect what the assembler has done for you by using `objdump -D myprog`, as it mistranslates the opcodes again. A broken implementation of `objdump` would swear blind that the opcodes you are looking at are, in fact, the ones which you asked for:
00000000004002c5 <fsubst0st1>:
4002c5: dc e1 fsub %st,%st(1) # NO! Those opcodes are not FSUB!
4002c7: 48 8d 34 25 15 05 60 lea 0x600515,%rsi
4002ce: 00
4002cf: c3 retq
00000000004002d0 <fsubst1st0>:
4002d0: d8 e1 fsub %st(1),%st
4002d2: 48 8d 34 25 29 05 60 lea 0x600529,%rsi
4002d9: 00
4002da: c3 retq
00000000004002db <fsubrst0st1>:
4002db: dc e9 fsubr %st,%st(1) # NO! Those opcodes are not FSUBR!
4002dd: 48 8d 34 25 3d 05 60 lea 0x60053d,%rsi
4002e4: 00
4002e5: c3 retq
00000000004002e6 <fsubrst1st0>:
4002e6: d8 e9 fsubr %st(1),%st
4002e8: 48 8d 34 25 52 05 60 lea 0x600552,%rsi
4002ef: 00
4002f0: c3 retq
However, if you use a “fixed” version of `objdump`, it shows you the true state of affairs. When I say “fixed” I mean that it’s been compiled with the `SYSV386_COMPAT` preprocessor value `#define`d as `0`, about which more follows below.
00000000004002c5 <fsubst0st1>:
4002c5: dc e1 fsubr %st,%st(1) # fsubr now correctly reflects the opcodes
4002c7: 48 8d 34 25 15 05 60 lea 0x600515,%rsi
4002ce: 00
4002cf: c3 retq
00000000004002d0 <fsubst1st0>:
4002d0: d8 e1 fsub %st(1),%st
4002d2: 48 8d 34 25 29 05 60 lea 0x600529,%rsi
4002d9: 00
4002da: c3 retq
00000000004002db <fsubrst0st1>:
4002db: dc e9 fsub %st,%st(1) # fsub now correctly reflects the opcodes
4002dd: 48 8d 34 25 3d 05 60 lea 0x60053d,%rsi
4002e4: 00
4002e5: c3 retq
00000000004002e6 <fsubrst1st0>:
4002e6: d8 e9 fsubr %st(1),%st
4002e8: 48 8d 34 25 52 05 60 lea 0x600552,%rsi
4002ef: 00
4002f0: c3 retq
Of course, what we really want is to generate the correct opcodes in the first place. To do so, you need to have assembled the source file with a “fixed” `as` binary or be using one of the other work-arounds outlined below, and you also have to use a “fixed” version of `objdump` to inspect the result.
00000000004002cd <fsubst0st1>:
4002cd: dc e9 fsub %st,%st(1) # Correct
4002cf: 48 8d 34 25 5d 05 60 lea 0x60055d,%rsi
4002d6: 00
4002d7: c3 retq
00000000004002d8 <fsubst1st0>:
4002d8: d8 e1 fsub %st(1),%st
4002da: 48 8d 34 25 71 05 60 lea 0x600571,%rsi
4002e1: 00
4002e2: c3 retq
00000000004002e3 <fsubrst0st1>:
4002e3: dc e1 fsubr %st,%st(1) # Correct
4002e5: 48 8d 34 25 85 05 60 lea 0x600585,%rsi
4002ec: 00
4002ed: c3 retq
00000000004002ee <fsubrst1st0>:
4002ee: d8 e9 fsubr %st(1),%st
4002f0: 48 8d 34 25 9a 05 60 lea 0x60059a,%rsi
4002f7: 00
4002f8: c3 retq
So what does one do to fix it?
The first mention of the issue I found is on the binutils mailing list by Alan Modra in an exchange with Horst von Brand. It seems that Modra was aware of the issue prior to this as he says:
FYI, here's a comment I added to binutils/include/opcode/i386.h, just
to make you aware of a horrible kludge.

/* The UnixWare assembler, and probably other AT&T derived ix86 Unix
   assemblers, generate floating point instructions with reversed
   source and destination registers in certain cases. Unfortunately,
   gcc and possibly many other programs use this reversed syntax, so
   we're stuck with it.

   eg. `fsub %st(3),%st' results in st <- st - st(3) as expected, but
   `fsub %st,%st(3)' results in st(3) <- st - st(3), rather than
   the expected st(3) <- st(3) - st !

   This happens with all the non-commutative arithmetic floating point
   operations with two register operands, where the source register is
   %st, and destination register is %st(i). Look for FloatDR below. */

#ifndef UNIXWARE_COMPAT
/* Set non-zero for broken, compatible instructions. Set to zero for
   non-broken opcodes at your peril. gcc generates UnixWare
   compatible instructions. */
#define UNIXWARE_COMPAT 1
#endif

I would love to get rid of this stupidity, but that needs a
synchronised update of both gcc and binutils.
So that is the reason why it hasn’t been fixed: GCC is coded to expect the broken behaviour. However, somewhat confusingly, Modra later posts a patch to GCC in March 2000 and renames the preprocessor value from `UNIXWARE_COMPAT` to `SYSV386_COMPAT` (I guess that Sourceware binutils != Gnu GCC). If the value is set to `0` it causes GCC to emit the correct instructions to its assembler (in the hope that it is expecting them). Just to be clear: setting `SYSV386_COMPAT` to `0` also fixes the `as` binary. To compile binutils in this way you need to set the `CPPFLAGS` option to the `configure` script as follows:
./configure CPPFLAGS=-DSYSV386_COMPAT=0 --prefix=/path/to/basedir/etc
The `CPPFLAGS` option is the preferred way of setting preprocessor flags and will permit the default `CFLAGS` options to be set, eg: with `-g -O2`.
The subject pops up from time to time on the mailing list, with gems such as this:
> I was reading the manual (vol 2a) and it looks like this is supposed
> to assemble as de f9,
> am I nuts or reading something wrong?

See the comment at the start of include/opcode/i386.h. gcc is nuts. :)
It all seems to go quiet until 2007, when an H.J. Lu gets involved and proposes to make the output selection a runtime option. This is now incorporated into `gas`, and specifying `-mmnemonic=intel` as an argument to `as` causes the test-case to produce the correct output bytes. However, I’m not sure what else this switch changes, as in the patch there’s a comment which reads:
+ /* intel_mnemonic implies intel_syntax. */
+ intel_mnemonic = intel_syntax = mnemonic_flag;
Is that something that I want? What are the effects on my code if `intel_syntax` is enabled? The Gnu `as` docs suggest another (probably better) way of causing the assembler to behave in the way you expect it to, and that’s through using the `.intel_mnemonic` directive in the source. It’s not clear whether the directive is scope-limited to the source-file being assembled or until the next appearance of the `.att_mnemonic` directive. The effect of the `.intel_mnemonic` directive is described in the docs as follows:
9.13.5 AT&T Mnemonic versus Intel Mnemonic
`as` supports assembly using Intel mnemonic. `.intel_mnemonic` selects Intel mnemonic with Intel syntax, and `.att_mnemonic` switches back to the usual AT&T mnemonic with AT&T syntax for compatibility with the output of `gcc`. Several x87 instructions, `fadd`, `fdiv`, `fdivp`, `fdivr`, `fdivrp`, `fmul`, `fsub`, `fsubp`, `fsubr` and `fsubrp`, are implemented in AT&T System V/386 assembler with different mnemonics from those in Intel IA32 specification. `gcc` generates those instructions with AT&T mnemonic.
To be honest, none of this is very clear: “`.intel_mnemonic` selects Intel mnemonic with Intel syntax”. What does that mean? The various side-effects of compiling with `SYSV386_COMPAT` defined as `0`, assembling with `-mmnemonic=intel` or using the `.intel_mnemonic` directive are completely opaque and render the use of `gas` questionable for those writing assembler, unless they tip-toe around the set of FPU instructions which are the known trouble-makers and write “wrong” assembler to produce “right” opcodes, which is a horrible idea. There should be a dedicated switch for putting `as` into a sane, predictable mode for fixing these instructions. Or, if that is exactly what the Intel mnemonic switches do, could someone please make this clear in the docs and describe exactly what you get when using the switches described? Please?
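In the meantime, if you do have to tip-toe, it at least helps to know where the trouble-makers are. The following is a rough Python sketch of a checker which flags the affected pattern in an AT&T-syntax source: a non-commutative x87 instruction with `%st` or `%st(0)` as the source and `%st(i)` as the destination. The regular expression and the name `find_affected` are my own assumptions, and it makes no attempt to handle macros, `;`-separated statements or sources already switched to Intel mnemonics.

```python
import re

# fsub, fsubr, fsubp, fsubrp, fdiv, fdivr, fdivp, fdivrp with %st or
# %st(0) as the AT&T source and %st(1)..%st(7) as the destination.
AFFECTED = re.compile(
    r"^\s*(?:\w+:\s*)?(f(?:sub|div)r?p?)\s+%st(?:\(0\))?\s*,\s*%st\(([1-7])\)"
)

def find_affected(source: str):
    """Yield (line_number, mnemonic, dest_register) for each suspicious line."""
    for number, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # drop a trailing gas comment
        match = AFFECTED.match(code)
        if match:
            yield number, match.group(1), int(match.group(2))

sample = """\
fsub %st(0), %st(1)
fsub %st(1), %st(0)
fdivrp %st, %st(3)   # source is %st, destination is %st(3)
fadd %st(0), %st(1)  # commutative, so not interesting
"""

for hit in find_affected(sample):
    print(hit)
```

A checker like this only tells you where to look; it cannot tell you whether the author of the code already wrote the “wrong” mnemonic on purpose.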