David Brown <david.brown@hesbynett.no> writes:
> Anton writes code that seriously pushes the boundary of what can be
> achieved. For at least some of the things he does (such as GForth) he
> is trying to squeeze every last drop of speed out of the target. And he
> is /really/ good at it. But that means he is forever relying on nuances
> about code generation. His code, at least for efficiency if not for
> correctness, is dependent on details far beyond what is specified and
> documented for C and for the gcc compiler. He might spend a long time
> working with his code and a version of gcc, fine-tuning the details of
> his source code to get out exactly the assembly he wants from the
> compiler.
No. We distribute Gforth as source code. It works for a wide variety
of architectures and compilers. So unlike what you suggest and what
some people have suggested earlier to avoid problems with new
"optimizations" in newer releases of gcc, we don't concentrate on a
specific version of gcc.
> Of course it is frustrating for him when the next version of
> gcc generates very different assembly from that same source, but he is
> not really programming at the level of C, and he should not expect
> consistency from C compilers like he does.
It's normal and no problem when the next version of gcc generates
different assembly language. There are some basic assumptions that
our code relies on, and that mostly does not change between gcc
versions.
An essential assumption is that, when we have:
A:
C code
B:
... then &&A and &&B (labels as values, documented in the GNU C
manual) give us the addresses of the start and the end of the
machine code corresponding to the C code. Starting with gcc-3.0, we
found that gcc reordered the basic blocks within loops, so we moved
the loops out of the part of the code that relies on this assumption
and into separate functions. Around gcc-7, gcc started to
compile
A: C-code1
B: C-code2
C: goto *...
to the same code as
A: C-code1; C-code2; goto *...;
B: C-code2; goto *...;
C: goto *...;
I found a workaround that avoids this kind of code generation.
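A minimal GNU C sketch of the label-address assumption described above. All names (engine, cell, the three primitives, snippet_length) are made up for illustration and are not Gforth's actual engine; the noinline/noclone attributes follow the GCC manual's advice for functions whose label addresses escape.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A threaded-code cell holds either a label address or a literal. */
typedef union cell { void *code; long value; } cell;

enum { LIT, ADD, HALT, NPRIMS };   /* hypothetical primitives */

/* Label addresses are only meaningful within one copy of the function,
   so ask gcc not to inline or clone it. */
__attribute__((noinline, noclone))
long engine(cell *ip, void *starts[NPRIMS], void *ends[NPRIMS])
{
    long stack[16];
    long *sp = stack;

    if (ip == NULL) {              /* setup call: only report the labels */
        starts[LIT]  = &&lit;   ends[LIT]  = &&lit_end;
        starts[ADD]  = &&add;   ends[ADD]  = &&add_end;
        starts[HALT] = &&halt;  ends[HALT] = &&halt_end;
        return 0;
    }
    goto *(ip++)->code;            /* dispatch the first primitive */

lit:                               /* push the literal that follows */
    assert(sp < stack + 16);
    *sp++ = (ip++)->value;
lit_end:
    goto *(ip++)->code;

add:                               /* replace the top two items by their sum */
    sp--;
    sp[-1] += sp[0];
add_end:
    goto *(ip++)->code;

halt:
halt_end:
    return sp[-1];
}

/* Byte length of a snippet. This is only meaningful if the compiler
   really placed the machine code for the C code between the two labels,
   which is exactly the assumption under discussion. */
intptr_t snippet_length(void *start, void *end)
{
    return (intptr_t)end - (intptr_t)start;
}
```

A caller would first invoke engine(NULL, starts, ends) to collect the label addresses and then build the threaded code from starts[]. Nothing in the C standard or the GNU C manual promises that snippet_length() returns the size of the corresponding machine code; that is what the basic-block reordering and code duplication described above can break.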
Another problem from gcc-3.1 to at least gcc-4.4 (intermittently) is
that gcc compiled
goto *ca;
into the equivalent of
goto gotoca;
/* and elsewhere */
gotoca: goto *ca;
We reported that repeatedly. At one point a gcc maintainer gave us
some bullshit about a possible performance advantage from this
transformation, of course without presenting any empirical support,
while we saw a big slowdown on our code. We developed workarounds for
that, and they are in Gforth to this day. We have not encountered a
new gcc version with this problem for over a decade, but new Gforth
releases should also work on old gcc.
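To make the two shapes concrete, here is a small GNU C sketch with a toy bytecode (INC, DBL, HALT are invented for this example). The first function gives every primitive its own indirect jump; the second funnels all of them through one shared jump, which is the form the transformation above produces. The usual explanation for the slowdown, stated here as background from the threaded-code literature and not as something measured in this post, is that a single shared indirect branch is harder to predict than one branch per primitive.

```c
#include <assert.h>

enum { INC, DBL, HALT };           /* hypothetical toy opcodes */

/* Every primitive ends in its own "goto *". */
long run_replicated(const unsigned char *ip, long acc)
{
    static void *const labels[] = { &&inc, &&dbl, &&halt };
#define NEXT goto *labels[*ip++]
    NEXT;
inc:  acc += 1; NEXT;
dbl:  acc *= 2; NEXT;
halt: return acc;
#undef NEXT
}

/* All primitives jump to one shared "goto *". */
long run_shared(const unsigned char *ip, long acc)
{
    static void *const labels[] = { &&inc, &&dbl, &&halt };
    goto next;
inc:  acc += 1; goto next;
dbl:  acc *= 2; goto next;
halt: return acc;
next: goto *labels[*ip++];
}
```

Both functions compute the same result for the same program, so the difference is purely one of code generation. Whether a given compiler keeps the replicated jumps in the first function is not guaranteed by the source form.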
Another assumption is that when we concatenate the code snippet
between label A and B (which contains C-code1) and the code snippet
between label X and Y (which contains C-code3), executing the result
will behave like the concatenation of C-code1 and C-code3 in source
code. This assumption has two aspects:
1) Do the register assignments at the labels fit together? It turns
out that we never had a problem with that, and I think that the
reason for that is that the "goto *" can jump to any of those
labels (all their addresses are taken), and so the register
assignment must be the same right after each label.
What guarantees that the assignments are the same right before each
label? Probably that after the label, there is not much between
the label and the next goto*, and that makes all registers at
potential targets live.
2) If we have two pieces of machine code produced in this fashion,
does the architecture guarantee that such a concatenation works?
It turns out that among general-purpose architectures, all but one do.
That includes IA-64. The exception is MIPS with its architectural
load-delay slot (and there are also scheduling restrictions having
to do with the hilo register that may be problematic): the first
code snippet may end in a load, and the next code snippet may start
with an instruction that reads the result of the load. So we just
disabled this concatenation on MIPS.
We do a number of things to achieve stability: We do sanity-checking
on the resulting machine code snippets and fall back to plain threaded
code if the snippets turn out not to be relocatable.
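The post does not say how the sanity check is done. One conceivable check, shown here purely as an assumption and not as Gforth's implementation, is to compare the bytes of the same snippet taken from two copies of the engine placed at different addresses: if the bytes differ, the snippet probably contains something position-dependent (for example a pc-relative call) and should not be copied. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A snippet is the byte range between a start and an end label. */
typedef struct {
    const unsigned char *start;
    const unsigned char *end;
} snippet;

/* Hypothetical heuristic: the same primitive from two engine copies
   must have a positive length and identical bytes in both copies. */
bool looks_relocatable(snippet a, snippet b)
{
    if (a.end <= a.start || b.end <= b.start)
        return false;              /* labels reordered or empty snippet */
    size_t len_a = (size_t)(a.end - a.start);
    size_t len_b = (size_t)(b.end - b.start);
    return len_a == len_b && memcmp(a.start, b.start, len_a) == 0;
}
```

A caller would use plain threaded code for any primitive where this returns false. A passing comparison is only evidence, not proof, that the snippet can be copied.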
Also, we enable all the flags for defining behaviour in gcc that we
find (unfortunately, in the documentation they are intermixed with
other options). For good measure, this includes
-fno-delete-null-pointer-checks, although I doubt that it makes a
difference for our code either way.
One thing that came up about a year ago was that gcc auto-vectorizes
adjacent memory accesses on AMD64 (apparently the AMD64 port
maintainers are unhappy because AMD64 does not have instructions like
ARM A64's ldp and stp:-), which did not impact correctness, but led to
worse performance (not just in Gforth; I have also seen it in the
bubble benchmark from John Hennessy's Stanford small integer
benchmarks; I'm sure there is some SPEC benchmark that benefits). A
quick addition of -fno-tree-vectorize fixed that.
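The kind of code affected looks roughly like the inner loop of a bubble sort, where a[i] and a[i + 1] are adjacent in memory. The sketch below is written for illustration and is not the Stanford benchmark source; whether a given gcc combines the two accesses into one wider access depends on the version, target and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Plain bubble sort. The swap reads and writes two adjacent
   elements, which is the access pattern discussed above. */
void bubble(int *a, size_t n)
{
    for (size_t top = n; top > 1; top--)
        for (size_t i = 0; i + 1 < top; i++)
            if (a[i] > a[i + 1]) {
                int t = a[i];
                a[i] = a[i + 1];
                a[i + 1] = t;
            }
}
```

Comparing the assembly from gcc -S with and without -fno-tree-vectorize shows whether a particular compiler version treats the swap this way.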
We have been thinking about moving from C to a better-defined
language, namely assembly language, but have not yet taken the plunge,
and it has not been necessary yet. Gcc has not been as crazy in our
experience as the UB rhetoric might make one think. Why is that? I
think the reasons are:
1) Gforth and a lot of other "irrelevant" (to the gcc maintainers)
projects sail in the slipstream of "relevant" code like SPEC and
the Linux kernel that are all full of undefined behaviour (Linux
defines many of them with flags, like Gforth does), so gcc does not
"optimize" as crazily as a UB fan might wish.
2) The code snippets are very short, with many in-edges on the
preceding and following label, which tends to destroy any
"knowledge" that the compiler might have derived from the
assumption that the program does not exercise undefined behaviour.
This severely limits the distance over which such "optimizations"
can be performed.
Nevertheless, the last time I tried compiling without
the behaviour-defining options, the result did not work; I did not
investigate this further.
- anton
-- 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.' Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>