zbigniew2011@gmail.com (LIT) writes:
> [Anton Ertl:]
> Earlier you wrote about performance, now you switch to clarity of the
> code. What is the goal?
>
> Both — one isn't contrary to another.
> Sometimes the clearer code is slower and the faster code is less clear
> (as in the FAXPY-NOSTRIDE example).
> What I have in mind is: by performing OOS operation
> we don't have to employ the whole "Forth machine" to
> do the usual things (I mean, of course, the usual
> steps described by Brad Rodriguez in his "Moving
> Forth" paper).
What does "OOS" stand for? What do you mean with "the usual steps"; I
am not going to read the whole paper and guess which of the code shown
there you have in mind.
> It comes with a cost: usual Forth words, that use
> the stack, are versatile, while such OOS words
> aren't that versatile anymore — yet (at least in
> the case of ITC non-optimizing Forths) they should
> be faster.
One related thing is the work on "register"-based virtual machines.
For interpreted implementations the VM registers are in memory, but
they are accessed by "register number"; these usually correspond to
locals slots on machines like the JavaVM. A well-known example of
that is the switch of the Lua VM from stack-based to register-based.
A later example is Android's Dalvik VM for Java, in contrast to the
stack-based JavaVM.
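To make the distinction concrete, here is a toy C sketch of the two
dispatch styles (not code from the Lua, Dalvik, or JavaVM
implementations; the opcodes and encodings are invented for
illustration):

#include <stdint.h>

enum { OP_ADD, OP_HALT };

/* Stack VM: ADD takes its operands implicitly from the top of the stack. */
void run_stack(const uint8_t *ip, intptr_t *sp)
{
  for (;;) {
    switch (*ip++) {                  /* one dispatch per VM instruction */
    case OP_ADD:  sp[1] += sp[0]; sp++; break;
    case OP_HALT: return;
    }
  }
}

/* Register VM: ADD carries three register numbers that index a memory
   array, comparable to locals slots. */
void run_reg(const uint8_t *ip, intptr_t *reg)
{
  for (;;) {
    switch (*ip++) {
    case OP_ADD: {                    /* add rs1 rs2 -> rd */
      uint8_t rd = ip[0], rs1 = ip[1], rs2 = ip[2];
      ip += 3;
      reg[rd] = reg[rs1] + reg[rs2];
      break;
    }
    case OP_HALT: return;
    }
  }
}

The register form needs fewer VM instructions (and therefore fewer
dispatches) for the same source code, but each instruction fetches more
operand information, and the registers themselves are memory locations.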
There is a paper [shi+08] that provides an academic justification for
this approach. The gist of it is that, with some additional compiler
complexity, the register-based machine can reduce the number of NEXTs
(in Forth threaded-code terminology); depending on the implementation
approach and the hardware, the NEXTs could be the major cost at the
time. However, already at the time dynamic superinstructions (an
implementation technique for virtual-machine interpreters) reduced the
number of NEXTs to one per basic block, and VM registers did nothing
to reduce NEXTs in that case; Shi et al. also showed that with a lot
of compiler sophistication (data flow analysis etc.) VM registers can
be as fast as stacks even with dynamic superinstructions.
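To make the cost model concrete: in a threaded-code interpreter every
primitive ends in NEXT, an indirect branch through the instruction
pointer. A minimal sketch in GNU C (computed gotos, in the general
style of such engines; this is a toy example, not Gforth source):

#include <stdio.h>

typedef void *Inst;           /* one cell of threaded code */
#define NEXT goto **ip++      /* fetch the next cell and jump to it:
                                 one indirect branch per VM instruction */

long run(void)
{
  long stack[16], *sp = stack + 16;
  /* threaded code for: 3 4 + halt */
  Inst prog[] = { &&lit, (Inst)3L, &&lit, (Inst)4L, &&plus, &&halt };
  Inst *ip = prog;
  NEXT;                              /* start the interpreter */

 lit:  *--sp = (long)*ip++; NEXT;    /* push inline literal */
 plus: sp[1] += sp[0]; sp++; NEXT;
 halt: return sp[0];
}

int main(void) { printf("%ld\n", run()); return 0; }

Dynamic superinstructions copy the native code of a basic block's
primitives (without their NEXTs) into a buffer, so execution falls
through from one primitive to the next and only the NEXT at the end of
the basic block remains.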
However, given that dynamic superinstructions are easier to implement
and that VM registers do not give a benefit when dynamic
superinstructions are employed, why would one go for VM registers?
Of course, in the Forth setting one could offload the optimization
onto the programmer, but even Chuck Moore did not go there.
In any case, here's an example extracted from Figure 6 of the paper:
   Java VM (stack)     VM registers
19 iload_1
20 bipush #31          iconst #31 -> r1
21 imul                imul r6 r1 -> r3
22 aload_0
23 getfield value      getfield r0.value -> r5
24 iload_3
26 caload              caload r5 r7 -> r5
27 iadd                iadd r3 r5 -> r6
28 istore_1
So yes, the VM register code contains fewer VM instructions. Is it
clearer?
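Before looking at the Gforth version, here is a rough C rendering of
that register block under the assumption that the VM registers are
memory slots (the r[] array, the Obj layout, and the byte-sized caload
are simplifications invented here); it makes the load/store tally used
further down easy to check:

#include <stdint.h>

typedef struct { intptr_t some_field; char *value; } Obj;  /* assumed layout */

void block(intptr_t *r)                  /* r[] = the VM register file, in memory */
{
  r[1] = 31;                             /* iconst #31 -> r1        : 0 loads, 1 store */
  r[3] = r[6] * r[1];                    /* imul r6 r1 -> r3        : 2 loads, 1 store */
  r[5] = (intptr_t)((Obj *)r[0])->value; /* getfield r0.value -> r5 : 2 loads, 1 store */
  r[5] = ((char *)r[5])[r[7]];           /* caload r5 r7 -> r5      : 3 loads, 1 store */
  r[6] = r[3] + r[5];                    /* iadd r3 r5 -> r6        : 2 loads, 1 store */
}

That tallies to 9 loads and 5 stores; the dispatch and operand
decoding for each VM instruction come on top.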
The corresponding Gforth code is the stuff between IF and THEN in the
following:
0
value: some-field
value: value
constant some-struct
: foo
{: r0 r1 r3 :}
if
r1 31 * r0 value r3 + c@ + to r1
then
r1 ;
The code that Gforth produces for the basic block under consideration
is:
$7FC624AA0958 @local1 1->1
7FC62464A5BA: mov [r10],r13
7FC62464A5BD: sub r10,$08
7FC62464A5C1: mov r13,$08[rbp]
$7FC624AA0960 lit 1->2
$7FC624AA0968 #31
7FC62464A5C5: sub rbx,$50
7FC62464A5C9: mov r15,-$08[rbx]
$7FC624AA0970 * 2->1
7FC62464A5CD: imul r13,r15
$7FC624AA0978 @local0 1->2
7FC62464A5D1: mov r15,$00[rbp]
$7FC624AA0980 lit+ 2->2
$7FC624AA0988 #8
7FC62464A5D5: add r15,$18[rbx]
$7FC624AA0990 @ 2->2
7FC62464A5D9: mov r15,[r15]
$7FC624AA0998 @local2 2->3
7FC62464A5DC: mov r9,$10[rbp]
$7FC624AA09A0 + 3->2
7FC62464A5E0: add r15,r9
$7FC624AA09A8 c@ 2->2
7FC62464A5E3: movzx r15d,byte PTR [r15]
$7FC624AA09B0 + 2->1
7FC62464A5E7: add r13,r15
$7FC624AA09B8 !local1 1->1
7FC62464A5EA: add r10,$08
7FC62464A5EE: mov $08[rbp],r13
7FC62464A5F2: mov r13,[r10]
7FC62464A5F5: add rbx,$50
There are 8 loads and 2 stores in that code. If the VM registers are
held in memory (as they usually are, and as the Gforth locals are),
the VM register code performs at least 9 loads (7 register accesses,
the getfield, and the caload) and 5 stores. Of course, in Forth one
would write the block as:
: foo1 ( n3 a0 n1 -- n )
31 * swap value rot + c@ + ;
and the code for that is (without the ";"):
$7FC624AA0A10 lit 1->2
$7FC624AA0A18 #31
7FC62464A617: mov r15,$08[rbx]
$7FC624AA0A20 * 2->1
7FC62464A61B: imul r13,r15
$7FC624AA0A28 swap 1->2
7FC62464A61F: mov r15,$08[r10]
7FC62464A623: add r10,$08
$7FC624AA0A30 lit+ 2->2
$7FC624AA0A38 #8
7FC62464A627: add r15,$28[rbx]
$7FC624AA0A40 @ 2->2
7FC62464A62B: mov r15,[r15]
$7FC624AA0A48 rot 2->3
7FC62464A62E: mov r9,$08[r10]
7FC62464A632: add r10,$08
$7FC624AA0A50 + 3->2
7FC62464A636: add r15,r9
$7FC624AA0A58 c@ 2->2
7FC62464A639: movzx r15d,byte PTR [r15]
$7FC624AA0A60 + 2->1
7FC62464A63D: add r13,r15
6 loads, 0 stores.
And if we feed the equivalent standard code
0
field: some-field
field: value-addr
constant some-struct
: foo1 ( n3 a0 n1 -- n )
31 * swap value-addr @ rot + c@ + ;
into other Forth systems, some produce even better code:
VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
FOO1
( 0050A310 486BDB1F ) IMUL RBX, RBX, # 1F
( 0050A314 488B5500 ) MOV RDX, [RBP]
( 0050A318 488B4D08 ) MOV RCX, [RBP+08]
( 0050A31C 48034A08 ) ADD RCX, [RDX+08]
( 0050A320 480FB609 ) MOVZX RCX, Byte 0 [RCX]
( 0050A324 4803D9 ) ADD RBX, RCX
( 0050A327 488D6D10 ) LEA RBP, [RBP+10]
( 0050A32B C3 ) RET/NEXT
( 28 bytes, 8 instructions )
5 loads, 0 stores. And VFX does not do data-flow analysis across
basic blocks, unlike the JavaVM -> VM-register compiler that Shi et
al. used; i.e., VFX's compiler is probably simpler than theirs.
@Article{shi+08,
  author =       {Yunhe Shi and Kevin Casey and M. Anton Ertl and
                  David Gregg},
  title =        {Virtual machine showdown: Stack versus registers},
  journal =      {ACM Transactions on Architecture and Code
                  Optimization (TACO)},
  year =         {2008},
  volume =       {4},
  number =       {4},
  pages =        {21:1--21:36},
  month =        jan,
  url =          {http://doi.acm.org/10.1145/1328195.1328197},
  abstract =     {Virtual machines (VMs) enable the distribution of
                  programs in an architecture-neutral format, which
                  can easily be interpreted or compiled. A
                  long-running question in the design of VMs is
                  whether a stack architecture or register
                  architecture can be implemented more efficiently
                  with an interpreter. We extend existing work on
                  comparing virtual stack and virtual register
                  architectures in three ways. First, our translation
                  from stack to register code and optimization are
                  much more sophisticated. The result is that we
                  eliminate an average of more than 46\% of
                  executed VM instructions, with the bytecode size of
                  the register machine being only 26\% larger
                  than that of the corresponding stack one. Second, we
                  present a fully functional virtual-register
                  implementation of the Java virtual machine (JVM),
                  which supports Intel, AMD64, PowerPC and Alpha
                  processors. This register VM supports
                  inline-threaded, direct-threaded, token-threaded,
                  and switch dispatch. Third, we present experimental
                  results on a range of additional optimizations such
                  as register allocation and elimination of redundant
                  heap loads. On the AMD64 architecture the register
                  machine using switch dispatch achieves an average
                  speedup of 1.48 over the corresponding stack
                  machine. Even using the more efficient
                  inline-threaded dispatch, the register VM achieves a
                  speedup of 1.15 over the equivalent stack-based VM.}
}
Clarity of the code comes as a "bonus" :) yes, we've
got VALUEs and I use them when needed, but their use
still means employing the "Forth machine".
What do you mean with 'the "Forth machine"', and how does "OOS"
(whatever that is) avoid it?
- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/