Michael S <
already5chosen@yahoo.com> writes:
On Fri, 06 Sep 2024 07:25:35 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
What gcc produces for both formulations is longer than
dec %rdi
jno ...
>
>
Good trick.
Thanks. It's not from me. I published it in 2015
<
https://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_2015_submission_29.pdf>,
but unfortunately did not give a reference to where I have it from (I
read it elsewhere).
The same trick in non-destructive form would be 1 byte longer.
cmp $1, %rdi
jno ...
>
But I was not able to force any of compilers currently installed on my
home desktop (gcc 13.2, clang 18.1, MSVC 19.30.30706 == VS2022) to
produce such code.
>
The closest was MSVC that sometimes (not in all circumstances) produces
2 bytes longer versiin:
49 8d 49 ff lea -0x1(%r9),%rcx
4c 3b c9 cmp %rcx,%r9
>
Of course, it's still good deal shorter than
48 ba 00 00 00 00 00 00 00 80 movabs $0x8000000000000000,%rdx
4c 3b ca cmp %rdx,%r9
>
Both gcc and clang [under -fwrapv] insisted on turning x<=x-1 into
x==LLONG_MIN.
>
However even if we were able to force compiler to produce desired code,
the space saving is architecture-specific.
With this gcc-specific code we can force it:
extern long foo1(long);
extern long foo2(long);
long bar(long a, long b)
{
long c;
if (__builtin_sub_overflow(b,1,&c))
return foo1(a);
else
return foo2(a);
}
gcc -O3 -c and gcc -Os -c (gcc-12.2) produce, on AMD64:
0: 48 83 c6 ff add $0xffffffffffffffff,%rsi
4: 70 05 jo b <bar+0xb>
6: e9 00 00 00 00 jmp b <bar+0xb>
b: e9 00 00 00 00 jmp 10 <bar+0x10>
So, even though %rsi is dead afterwards, it does not use dec, but it's
certainly better than the other variants.
On Arch A64 both gcc invocations (gcc-10.2) produce:
0: f1000421 subs x1, x1, #0x1
4: 54000046 b.vs c <bar+0xc>
8: 14000000 b 0 <foo2>
c: 14000000 b 0 <foo1>
On RV64GC bith gcc invocations (gcc-10.3) produce:
0000000000000000 <bar>:
0: fff58793 addi a5,a1,-1
4: 00f5c663 blt a1,a5,10 <.L6>
8: 00000317 auipc t1,0x0
c: 00030067 jr t1 # 8 <bar+0x8>
0000000000000010 <.L6>:
10: 00000317 auipc t1,0x0
14: 00030067 jr t1 # 10 <.L6>
So on RISC-V gcc manages to actually translate the if back into "if (b
< b-1)" without pessimising that (but gcc-10 does not pessimize this
code on AMD64, either.
E.g. I expect no saving on ARM64 where both variants occupie 8 bytes.
Here we have the three variants:
#include <limits.h>
extern long foo1(long);
extern long foo2(long);
long bar(long a, long b)
{
long c;
if (__builtin_sub_overflow(b,1,&c))
return foo1(a);
else
return foo2(a);
}
long bar2(long a, long b)
{
if (b < b-1)
return foo1(a);
else
return foo2(a);
}
long bar3(long a, long b)
{
if (b == LONG_MIN)
return foo1(a);
else
return foo2(a);
}
And here is what gcc-10 -Os -fwrapv -Wall -c produces:
ARM A64:
subs x1, x1, #0x1 sub x2, x1, #0x1 mov x2, #0x8000000000000000
b.vs c <bar+0xc> cmp x2, x1 cmp x1, x2
b.le 20 <bar2+0x10> b.ne 34 <bar3+0x10>
RV64GC:
addi a5,a1,-1 addi a5,a1,-1 li a5,-1
bge a1,a5,10 <.L4> bge a1,a5,28 <.L6> slli a5,a5,0x3f
bne a1,a5,40 <.L8>
8 Bytes 8 Bytes 8 Bytes
AMD64:
add $-1,%rsi lea -0x1(%rsi),%rax mov $0x1,%eax
jo b <bar+0xb> cmp %rsi,%rax shl $0x3f,%rax
jle 1e <bar2+0xe> cmp %rax,%rsi
jne 36 <bar3+0x13>
6 Bytes 9 Bytes 14 Bytes
With gcc-12 on AMD64:
add -1,%rsi mov $0x1,%eax mov $0x1,%eax
jo b <bar+0xb> shl $0x3f,%rax shl $0x3f,%rax
cmp %rax,%rsi cmp %rax,%rsi
jne 23 <bar2+0x13> jne 23 <bar2+0x13>
6 Bytes 14 Bytes 14 Bytes
(Actually in the latter case gcc recognizes that bar2 and bar3 are
equivalent and jumps from bar3 to bar2, but I am sure that without
bar2, bar3 would look the same as bar2 does now).
So when gcc does not pessimize "b < b-1" into "b == LONG_MIN", the
straightforward code for the former has the same or smaller size, and
the same or smaller number of instructions on these architectures.
The "__builtin_sub_overflow(b,1,&c)" has the same or fewer bytes than
"b < b-1" and the same or fewer instructions. So, with
straightforward translations "__builtin_sub_overflow(b,1,&c)"
dominates "b < b-1", which dominates "b == LONG_MIN".
As a new feature, gcc-12 recognizes "b < b-1" and pessimizes it into
the same code as "b == LONG_MIN".
Interestingly, the first idiom is a case where gcc recognizes what the
intention of the programmer is, and warns that it is going to
miscompile it. The warning is good, the miscompilation not (but it
would be worse without the warning).
>
>
You had more luck with warnings than I did.
In all my test cases both gcc and clang [in absence of -fwrapv]
silently dropped the check and depended code.
Interesting. I tried both "b < b-1" and "b >= b+1" and got no warning
(with gcc-10 and gcc-12), but I have seen a warning with one of those
idioms in the past. Maybe someone decided that warning about this
idiom is unnecessary, while "optimizing" it is.
- anton
-- 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.' Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>