zbigniew2011@gmail.com (LIT) writes:
> [Anton Ertl:]
> Earlier you wrote about performance, now you switch to clarity of the
> code. What is the goal?
>
> Both — one isn't contrary to another.
> Sometimes the clearer code is slower and the faster code is less clear
> (as in the FAXPY-NOSTRIDE example).
> What I have in mind is: by performing OOS operation
> we don't have to employ the whole "Forth machine" to
> do the usual things (I mean, of course, the usual
> steps described by Brad Rodriguez in his "Moving
> Forth" paper).
What does "OOS" stand for? What do you mean with "the usual steps"; I
am not going to read the whole paper and guess which of the code shown
there you have in mind.
> It comes with a cost: usual Forth words, that use
> the stack, are versatile, while such OOS words
> aren't that versatile anymore — yet (at least in
> the case of ITC non-optimizing Forths) they should
> be faster.
One related thing is the work on "register"-based virtual machines.
For interpreted implementations the VM registers are in memory, but
they are accessed by "register number"; these usually correspond to
locals slots on machines like the JavaVM. A well-known example of
that is the switch of the Lua VM from stack-based to register-based.
A later example is Android's Dalvik VM for Java, in contrast to the
stack-based JavaVM.
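To make the distinction concrete, here is a toy C sketch of the two
dispatch styles (not code from the Lua, Dalvik, or JavaVM
implementations; the opcodes and encodings are invented for
illustration):

#include <stdint.h>

enum { OP_ADD, OP_HALT };

/* Stack VM: ADD takes its operands implicitly from the top of the stack. */
void run_stack(const uint8_t *ip, intptr_t *sp)
{
  for (;;) {
    switch (*ip++) {                  /* one dispatch per VM instruction */
    case OP_ADD:  sp[1] += sp[0]; sp++; break;
    case OP_HALT: return;
    }
  }
}

/* Register VM: ADD carries three register numbers that index a memory
   array, comparable to locals slots. */
void run_reg(const uint8_t *ip, intptr_t *reg)
{
  for (;;) {
    switch (*ip++) {
    case OP_ADD: {                    /* add rs1 rs2 -> rd */
      uint8_t rd = ip[0], rs1 = ip[1], rs2 = ip[2];
      ip += 3;
      reg[rd] = reg[rs1] + reg[rs2];
      break;
    }
    case OP_HALT: return;
    }
  }
}

The register form needs fewer VM instructions (and therefore fewer
dispatches) for the same source code, but each instruction fetches more
operand information, and the registers themselves are memory locations.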
There is a paper [shi+08] that provides an academic justification for
this approach. The gist of it is that, with some additional compiler
complexity, the register-based machine can reduce the number of NEXTs
(in Forth threaded-code terminology); depending on the implementation
approach and the hardware, the NEXTs could be the major cost at the
time. However, already at the time dynamic superinstructions (an
implementation technique for virtual-machine interpreters) reduced the
number of NEXTs to one per basic block, and VM registers did nothing
to reduce NEXTs in that case; Shi et al. also showed that with a lot
of compiler sophistication (data flow analysis etc.) VM registers can
be as fast as stacks even with dynamic superinstructions.
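To make the cost model concrete: in a threaded-code interpreter every
primitive ends in NEXT, an indirect branch through the instruction
pointer. A minimal sketch in GNU C (computed gotos, in the general
style of such engines; this is a toy example, not Gforth source):

#include <stdio.h>

typedef void *Inst;           /* one cell of threaded code */
#define NEXT goto **ip++      /* fetch the next cell and jump to it:
                                 one indirect branch per VM instruction */

long run(void)
{
  long stack[16], *sp = stack + 16;
  /* threaded code for: 3 4 + halt */
  Inst prog[] = { &&lit, (Inst)3L, &&lit, (Inst)4L, &&plus, &&halt };
  Inst *ip = prog;
  NEXT;                              /* start the interpreter */

 lit:  *--sp = (long)*ip++; NEXT;    /* push inline literal */
 plus: sp[1] += sp[0]; sp++; NEXT;
 halt: return sp[0];
}

int main(void) { printf("%ld\n", run()); return 0; }

Dynamic superinstructions copy the native code of a basic block's
primitives (without their NEXTs) into a buffer, so execution falls
through from one primitive to the next and only the NEXT at the end of
the basic block remains.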
However, given that dynamic superinstructions are easier to implement
and that VM registers do not give a benefit when dynamic
superinstructions are employed, why would one go for VM registers?
Of course, in the Forth setting one could offload the optimization
onto the programmer, but even Chuck Moore did not go there.
In any case, here's an example extracted from Figure 6 of the paper:
   Java VM (stack)     VM registers
19 iload_1
20 bipush #31          iconst #31 -> r1
21 imul                imul r6 r1 -> r3
22 aload_0
23 getfield value      getfield r0.value -> r5
24 iload_3
26 caload              caload r5 r7 -> r5
27 iadd                iadd r3 r5 -> r6
28 istore_1
So yes, the VM register code contains fewer VM instructions. Is it
clearer?
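Before looking at the Gforth version, here is a rough C rendering of
that register block under the assumption that the VM registers are
memory slots (the r[] array, the Obj layout, and the byte-sized caload
are simplifications invented here); it makes the load/store tally used
further down easy to check:

#include <stdint.h>

typedef struct { intptr_t some_field; char *value; } Obj;  /* assumed layout */

void block(intptr_t *r)                  /* r[] = the VM register file, in memory */
{
  r[1] = 31;                             /* iconst #31 -> r1        : 0 loads, 1 store */
  r[3] = r[6] * r[1];                    /* imul r6 r1 -> r3        : 2 loads, 1 store */
  r[5] = (intptr_t)((Obj *)r[0])->value; /* getfield r0.value -> r5 : 2 loads, 1 store */
  r[5] = ((char *)r[5])[r[7]];           /* caload r5 r7 -> r5      : 3 loads, 1 store */
  r[6] = r[3] + r[5];                    /* iadd r3 r5 -> r6        : 2 loads, 1 store */
}

That tallies to 9 loads and 5 stores; the dispatch and operand
decoding for each VM instruction come on top.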
The corresponding Gforth code is the stuff between IF and THEN in the
following:
0
value: some-field
value: value
constant some-struct
: foo
{: r0 r1 r3 :}
if
r1 31 * r0 value r3 + c@ + to r1
then
r1 ;
The code that Gforth produces for the basic block under consideration
is:
$7FC624AA0958 @local1 1->1
7FC62464A5BA: mov [r10],r13
7FC62464A5BD: sub r10,$08
7FC62464A5C1: mov r13,$08[rbp]
$7FC624AA0960 lit 1->2
$7FC624AA0968 #31
7FC62464A5C5: sub rbx,$50
7FC62464A5C9: mov r15,-$08[rbx]
$7FC624AA0970 * 2->1
7FC62464A5CD: imul r13,r15
$7FC624AA0978 @local0 1->2
7FC62464A5D1: mov r15,$00[rbp]
$7FC624AA0980 lit+ 2->2
$7FC624AA0988 #8
7FC62464A5D5: add r15,$18[rbx]
$7FC624AA0990 @ 2->2
7FC62464A5D9: mov r15,[r15]
$7FC624AA0998 @local2 2->3
7FC62464A5DC: mov r9,$10[rbp]
$7FC624AA09A0 + 3->2
7FC62464A5E0: add r15,r9
$7FC624AA09A8 c@ 2->2
7FC62464A5E3: movzx r15d,byte PTR [r15]
$7FC624AA09B0 + 2->1
7FC62464A5E7: add r13,r15
$7FC624AA09B8 !local1 1->1
7FC62464A5EA: add r10,$08
7FC62464A5EE: mov $08[rbp],r13
7FC62464A5F2: mov r13,[r10]
7FC62464A5F5: add rbx,$50
There are 8 loads and 2 stores in that code. If the VM registers are
held in memory (as they usually are, and as the Gforth locals are),
the VM register code performs at least 9 loads (7 register accesses,
the getfield, and the caload) and 5 stores. Of course, in Forth one
would write the block as:
: foo1 ( n3 a0 n1 -- n )
31 * swap value rot + c@ + ;
and the code for that is (without the ";"):
$7FC624AA0A10 lit 1->2
$7FC624AA0A18 #31
7FC62464A617: mov r15,$08[rbx]
$7FC624AA0A20 * 2->1
7FC62464A61B: imul r13,r15
$7FC624AA0A28 swap 1->2
7FC62464A61F: mov r15,$08[r10]
7FC62464A623: add r10,$08
$7FC624AA0A30 lit+ 2->2
$7FC624AA0A38 #8
7FC62464A627: add r15,$28[rbx]
$7FC624AA0A40 @ 2->2
7FC62464A62B: mov r15,[r15]
$7FC624AA0A48 rot 2->3
7FC62464A62E: mov r9,$08[r10]
7FC62464A632: add r10,$08
$7FC624AA0A50 + 3->2
7FC62464A636: add r15,r9
$7FC624AA0A58 c@ 2->2
7FC62464A639: movzx r15d,byte PTR [r15]
$7FC624AA0A60 + 2->1
7FC62464A63D: add r13,r15
6 loads, 0 stores.
And if we feed the equivalent standard code
0
field: some-field
field: value-addr
constant some-struct
: foo1 ( n3 a0 n1 -- n )
31 * swap value-addr @ rot + c@ + ;
into other Forth systems, some produce even better code:
VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
FOO1
( 0050A310 486BDB1F ) IMUL RBX, RBX, # 1F
( 0050A314 488B5500 ) MOV RDX, [RBP]
( 0050A318 488B4D08 ) MOV RCX, [RBP+08]
( 0050A31C 48034A08 ) ADD RCX, [RDX+08]
( 0050A320 480FB609 ) MOVZX RCX, Byte 0 [RCX]
( 0050A324 4803D9 ) ADD RBX, RCX
( 0050A327 488D6D10 ) LEA RBP, [RBP+10]
( 0050A32B C3 ) RET/NEXT
( 28 bytes, 8 instructions )
5 loads, 0 stores. And VFX does not do data-flow analysis across
basic blocks, unlike the JavaVM -> VM-register compiler that Shi et
al. used; i.e., VFX's compiler is probably simpler than theirs.
@Article{shi+08,
  author =       {Yunhe Shi and Kevin Casey and M. Anton Ertl and
                  David Gregg},
  title =        {Virtual machine showdown: Stack versus registers},
  journal =      {ACM Transactions on Architecture and Code
                  Optimization (TACO)},
  year =         {2008},
  volume =       {4},
  number =       {4},
  pages =        {21:1--21:36},
  month =        jan,
  url =          {http://doi.acm.org/10.1145/1328195.1328197},
  abstract =     {Virtual machines (VMs) enable the distribution of
                  programs in an architecture-neutral format, which
                  can easily be interpreted or compiled. A
                  long-running question in the design of VMs is
                  whether a stack architecture or register
                  architecture can be implemented more efficiently
                  with an interpreter. We extend existing work on
                  comparing virtual stack and virtual register
                  architectures in three ways. First, our translation
                  from stack to register code and optimization are
                  much more sophisticated. The result is that we
                  eliminate an average of more than 46\% of
                  executed VM instructions, with the bytecode size of
                  the register machine being only 26\% larger
                  than that of the corresponding stack one. Second, we
                  present a fully functional virtual-register
                  implementation of the Java virtual machine (JVM),
                  which supports Intel, AMD64, PowerPC and Alpha
                  processors. This register VM supports
                  inline-threaded, direct-threaded, token-threaded,
                  and switch dispatch. Third, we present experimental
                  results on a range of additional optimizations such
                  as register allocation and elimination of redundant
                  heap loads. On the AMD64 architecture the register
                  machine using switch dispatch achieves an average
                  speedup of 1.48 over the corresponding stack
                  machine. Even using the more efficient
                  inline-threaded dispatch, the register VM achieves a
                  speedup of 1.15 over the equivalent stack-based VM.}
}
Clarity of the code comes as a "bonus" :) yes, we've
got VALUEs and I use them when needed, but their use
still means employing the "Forth machine".
What do you mean with 'the "Forth machine"', and how does "OOS"
(whatever that is) avoid it?
- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/