antispam@fricas.org (Waldek Hebisch) writes:
> A potential alternative is a pair of operations, say PUSH and POP, and
> a Forth compiler that replaces a pair like V1 @ by PUSH(V1). Note that
> here the address of V1 is intended to be part of PUSH (so it will take
> as much space as separate V1 and @, but is only a single primitive).
In Gforth variables are compiled as "lit <addr>", and Gforth has a
primitive LIT@, and a generalized constant-folding optimization that
replaces "lit <addr> @" with "LIT@ <addr>". In the Gforth image there
are 490 uses of lit@ (out of 33611 uses of primitives) and 76
occurrences of "lit @" (in parts that are compiled before the
generalized constant-folding is active). There are also 293
occurrences of "lit !".
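The folding described above can be sketched as a peephole pass over a
threaded-code image. This is an illustrative toy, not Gforth's actual
implementation; the cell representation (primitive names interleaved
with immediate arguments) is an assumption made for the example.

```python
def fold_lit_fetch(code):
    """Fold "lit <addr> @" (3 cells) into "lit@ <addr>" (2 cells).

    code is a list of cells: primitive names interleaved with immediate
    arguments, e.g. ["lit", 0x1234, "@", "+"].  Toy model, not Gforth.
    """
    out = []
    i = 0
    while i < len(code):
        # "lit" is followed by its immediate argument; if the primitive
        # after that argument is "@", emit the combined "lit@" instead.
        if i + 2 < len(code) and code[i] == "lit" and code[i + 2] == "@":
            out += ["lit@", code[i + 1]]
            i += 3
        else:
            out.append(code[i])
            i += 1
    return out
```

The same shape of pass handles "lit !" by folding to a combined
store primitive.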
However, given the minimal difference between the code produced for
"LIT@" and "LIT @", LIT@ is no longer beneficial. E.g.:
variable v ok
: foo1 v @ + ; ok
: foo2 v [ basic-block-end ] @ + ; ok
see-code foo1
$7F27184A08B0 lit@ 1->2
$7F27184A08B8 v
7F271806B580: mov rax,$08[rbx]
7F271806B584: mov r15,[rax]
$7F27184A08C0 + 2->1
7F271806B587: add r13,r15
$7F27184A08C8 ;s 1->1
...
see-code foo2
$7F27184A08F8 lit 1->2
$7F27184A0900 v
7F271806B596: mov r15,$08[rbx]
$7F27184A0908 @ 2->2
7F271806B59A: mov r15,[r15]
$7F27184A0910 + 2->1
7F271806B59D: add r13,r15
$7F27184A0918 ;s 1->1
...
> More generally, a simple "optimizer" that replaces short sequences of
> Forth primitives by different, shorter sequences of primitives is
> likely to give a similar gain. However, the chance of a match
> decreases with the length of the sequence.
Gforth has that as static superinstructions. You can see the sequences
in <http://git.savannah.gnu.org/cgit/gforth.git/tree/peeprules.vmg>, in
the lines before the first occurrence of prim-states or something
similar. As you can see, many of the formerly-used sequences are now
commented out, because static superinstructions do not play well with
a) static stack caching (currently static superinstructions only work
for the default stack cache state) and b) IP-update optimization (if
one of the primitives in the sequence has an immediate argument (e.g.,
LIT), you would need additional variants for various IP offsets, or
update the IP before the sequence).
The remaining static superinstructions
* have to do with stacks where we do not have stack caching (FP stack,
locals stack, return stack),
* are combinations of comparison primitives and ?BRANCH (this avoids
the need to reify the result of the comparison in a general-purpose
register), or
* are sequences of typical memory-access words (not because they occur
so often, but because it's better to have a small number of words
that can be combined, and a number of combinations in the optimizer
than to have a combinatorial explosion of words).
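The replacement of known sequences by superinstructions can be
sketched as a longest-match-first table lookup. The rule table and the
superinstruction names below are invented for the example (the real
rules live in peeprules.vmg); this is not Gforth's implementation.

```python
# Hypothetical rule set in the spirit of peeprules.vmg: a sequence of
# primitives maps to one combined superinstruction.
RULES = {
    ("0=", "?branch"): "0=-?branch",   # comparison fused with ?BRANCH
    (">r", ">r"): "2>r",               # return-stack access pair
}

def superinst(code, rules=RULES):
    """Replace known primitive sequences, preferring longer matches."""
    maxlen = max(len(k) for k in rules)
    out, i = [], 0
    while i < len(code):
        for n in range(maxlen, 1, -1):
            key = tuple(code[i:i + n])
            if key in rules:
                out.append(rules[key])
                i += n
                break
        else:  # no rule matched at this position
            out.append(code[i])
            i += 1
    return out
```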
> Above you bet on relatively long sequences (and on the programmer
> writing an alternative sequence). Shorter sequences have more chance
> of matching, so you need a smaller number of them for a similar gain.
That's certainly our experience. Long sequences with high dynamic
counts often come out of the inner loop of a single benchmark, and do
not help other programs at all. We later preferred to go with static
usage counts (i.e., the sequence occurs several times in the code),
and this naturally leads to short sequences.
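Selecting candidates by static usage counts can be sketched as simple
sequence counting: tally every adjacent pair of primitives in the
compiled code and keep the pairs that occur at least some threshold
number of times. The threshold is an assumption for the example.

```python
from collections import Counter

def candidate_pairs(code, threshold=2):
    """Return length-2 primitive sequences that occur statically at
    least `threshold` times in `code` (a list of primitive names)."""
    counts = Counter(zip(code, code[1:]))
    return {pair for pair, n in counts.items() if n >= threshold}
```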
> One can do better than using the machine stack, namely keeping things
> in registers, but that means generating machine code and doing
> optimization.
Gforth does stack caching at the level of primitives, by having
several variants of the primitives for different start and end states
of the primitives, and using a shortest-path search for finding out
which combination of these variants to use. However, for multiple
stacks this leads to a large number of states, and the shortest-path
algorithm becomes too expensive. For now we only stack-cache the data
stack.
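The shortest-path variant selection can be sketched as dynamic
programming over cache states within a basic block: each primitive has
several variants with different input and output states, and we pick
the combination with minimal total cost. The states, costs, and
variant tables below are invented for the example; Gforth's real
machinery is generated by vmgen.

```python
INF = float("inf")

def best_variants(block, variants, states, start=0, end=0):
    """Minimal total cost of executing `block` (a list of primitive
    names), given variants[prim] = [(in_state, out_state, cost), ...].
    States number how many stack items are cached in registers; start
    and end are the required states at block entry and exit."""
    cost = {s: (0 if s == start else INF) for s in states}
    for prim in block:
        nxt = {s: INF for s in states}
        # Relax every variant edge from each reachable state.
        for (i, o, c) in variants[prim]:
            if cost[i] + c < nxt[o]:
                nxt[o] = cost[i] + c
        cost = nxt
    return cost[end]
```

With one stack this table stays small; with several independently
cached stacks the state space is the product of the per-stack states,
which is where the search becomes expensive.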
For extending this to multiple stacks, I see several alternatives:
* Use a greedy algorithm instead of an optimal shortest-path
algorithm. The difference is probably non-existent in most cases.
* Manage the stack cache using register allocation techniques instead
of representing it as an abstract state. This would often produce
similar results as the greedy technique, but it can also handle
stack manipulation words cheaply without having an explosion of
stack states and the related complexity in the generator that
generates the states and the tables for the state-handling.
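The greedy alternative can be sketched as committing to the locally
cheapest variant at each primitive instead of searching all paths.
Again a toy under the same invented variant tables, omitting explicit
state-change instructions, and not Gforth's implementation.

```python
def greedy_variants(block, variants, start=0):
    """Greedily pick, for each primitive, the cheapest variant whose
    input state matches the current cache state.  Returns (total cost,
    final state).  State-change costs are omitted in this toy version."""
    state, total = start, 0
    for prim in block:
        choices = [(c, o) for (i, o, c) in variants[prim] if i == state]
        c, state = min(choices)  # cheapest matching variant
        total += c
    return total, state
```

On straight-line code with a single cached stack this usually reaches
the same total as the shortest-path search, which is the sense in which
the difference is probably non-existent in most cases.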
- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/