Re: Stack vs stackless operation

Subject: Re: Stack vs stackless operation
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Date: 25 Feb 2025, 10:07:19
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Feb25.100719@mips.complang.tuwien.ac.at>
User-Agent: xrn 10.11
zbigniew2011@gmail.com (LIT) writes:
>[Anton Ertl:]
>>Earlier you wrote about performance, now you switch to clarity of the
>>code.  What is the goal?
>
>Both — one isn't contrary to another.

Sometimes the clearer code is slower and the faster code is less clear
(as in the FAXPY-NOSTRIDE example).

>What I have in mind is: by performing OOS operation
>we don't have to employ the whole "Forth machine" to
>do the usual things (I mean, of course, the usual
>steps described by Brad Rodriguez in his "Moving
>Forth" paper).

What does "OOS" stand for?  What do you mean with "the usual steps"; I
am not going to read the whole paper and guess which of the code shown
there you have in mind.

>It comes with a cost: usual Forth words, that use
>the stack, are versatile, while such OOS words
>aren't that versatile anymore — yet (at least in
>the case of ITC non-optimizing Forths) they should
>be faster.

One related thing is the work on "register"-based virtual machines.
For interpreted implementations the VM registers are in memory, but
they are accessed by "register number"; these usually correspond to
locals slots on machines like the JavaVM.  A well-known example of
that is the switch of the Lua VM from stack-based to register-based.
A later example is Android's Dalvik VM for Java, in contrast to the
stack-based JavaVM.
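
To make the distinction concrete, here is a minimal C sketch
(illustration only, not code from Lua, Dalvik, or any JavaVM) of the
same addition as a stack-VM instruction and as a register-VM
instruction; the "registers" are slots in a memory array, selected by
register numbers decoded from the instruction:

/* Illustration only, not code from any of these VMs.  The register VM's
   "registers" live in a memory array, much like locals slots. */
#include <stdint.h>

typedef intptr_t cell;

/* Stack VM: ADD has no operands to decode; it pops two cells, pushes one. */
cell *stack_add(cell *sp)
{
  sp[1] += sp[0];               /* sp[0] is the top of stack */
  return sp + 1;                /* stack shrinks by one cell */
}

/* Register VM: ADD rA rB -> rC; each operand selects a memory slot. */
struct reg_insn { uint8_t opcode, a, b, c; };

void reg_add(cell *regs, struct reg_insn insn)
{
  regs[insn.c] = regs[insn.a] + regs[insn.b];   /* 2 loads, 1 store */
}

The register form needs fewer VM instructions per expression, but each
instruction decodes more operands and, with the registers held in
memory, does more loads and stores; that trade-off is what the numbers
below are about.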

There is a paper [shi+08] that provides an academic justification for
this approach.  The gist of it is that, with some additional compiler
complexity, the register-based machine can reduce the number of NEXTs
(in Forth threaded-code terminology); depending on the implementation
approach and the hardware, the NEXTs could be the major cost at the
time.  However, already at the time dynamic superinstructions (an
implementation technique for virtual-machine interpreters) reduced the
number of NEXTs to one per basic block, and VM registers did nothing
to reduce NEXTs in that case; Shi et al. also showed that with a lot
of compiler sophistication (data flow analysis etc.) VM registers can
be as fast as stacks even with dynamic superinstructions.
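
For those unfamiliar with the terminology: NEXT is the dispatch at the
end of every VM instruction.  The following is a minimal sketch of
direct-threaded dispatch in GNU C (computed goto); it is not Gforth
source, just an illustration of where NEXT occurs:

#include <stdio.h>

typedef long cell;

int main(void)
{
  /* threaded code for "3 4 + print": instruction addresses interleaved
     with inline literal operands */
  static void *prog[] = { &&lit, (void *)3, &&lit, (void *)4,
                          &&add, &&print, &&halt };
  void **ip = prog;                     /* VM instruction pointer */
  cell stack[16], *sp = stack + 16;

#define NEXT goto **ip++                /* the dispatch in question */

  NEXT;
 lit:   *--sp = (cell)*ip++;    NEXT;   /* push inline literal */
 add:   sp[1] += sp[0]; sp++;   NEXT;
 print: printf("%ld\n", *sp++); NEXT;
 halt:  return 0;
}

Here NEXT runs after every one of lit, lit, add, and print.  A dynamic
superinstruction concatenates copies of the native code of those bodies
at compile time, so only the NEXT at the end of the basic block
remains; that is why VM registers buy nothing on that front once
dynamic superinstructions are in place.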

However, given that dynamic superinstructions are easier to implement
and VM registers give no benefit when dynamic superinstructions are
employed, why
would one go for VM registers?  Of course, in the Forth setting one
could offload the optimization onto the programmer, but even Chuck
Moore did not go there.

In any case, here's an example extracted from Figure 6 of the paper:

   Java VM (stack)     VM registers
19 iload_1
20 bipush #31          iconst #31 -> r1
21 imul                imul r6 r1 -> r3
22 aload_0
23 getfield value      getfield r0.value -> r5
24 iload_3
26 caload              caload r5 r7 -> r5
27 iadd                iadd r3 r5 -> r6
28 istore_1

So yes, the VM register code contains fewer VM instructions.  Is it
clearer?
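
To judge that, it may help to see what the block computes, written as C
(names taken from the bytecode; the struct layout is assumed only for
this illustration, and the Java original would use a 16-bit char array
rather than bytes):

#include <stdint.h>

struct obj { uint8_t *value; };         /* layout assumed for illustration */

/* h is local 1, o is local 0, i is local 3 in the bytecode above */
int32_t block(int32_t h, struct obj *o, int32_t i)
{
  return 31 * h + o->value[i];
}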

The corresponding Gforth code is the stuff between IF and THEN in the
following:

0
value: some-field
value: value
constant some-struct

: foo
  {: r0 r1 r3 :}
  if
    r1 31 * r0 value r3 + c@ + to r1
  then
  r1 ;

The code that Gforth produces for the basic block under consideration
is:

$7FC624AA0958 @local1    1->1
7FC62464A5BA:   mov     [r10],r13
7FC62464A5BD:   sub     r10,$08
7FC62464A5C1:   mov     r13,$08[rbp]
$7FC624AA0960 lit    1->2
$7FC624AA0968 #31
7FC62464A5C5:   sub     rbx,$50
7FC62464A5C9:   mov     r15,-$08[rbx]
$7FC624AA0970 *    2->1
7FC62464A5CD:   imul    r13,r15
$7FC624AA0978 @local0    1->2
7FC62464A5D1:   mov     r15,$00[rbp]
$7FC624AA0980 lit+    2->2
$7FC624AA0988 #8
7FC62464A5D5:   add     r15,$18[rbx]
$7FC624AA0990 @    2->2
7FC62464A5D9:   mov     r15,[r15]
$7FC624AA0998 @local2    2->3
7FC62464A5DC:   mov     r9,$10[rbp]
$7FC624AA09A0 +    3->2
7FC62464A5E0:   add     r15,r9
$7FC624AA09A8 c@    2->2
7FC62464A5E3:   movzx   r15d,byte PTR [r15]
$7FC624AA09B0 +    2->1
7FC62464A5E7:   add     r13,r15
$7FC624AA09B8 !local1    1->1
7FC62464A5EA:   add     r10,$08
7FC62464A5EE:   mov     $08[rbp],r13
7FC62464A5F2:   mov     r13,[r10]
7FC62464A5F5:   add     rbx,$50

There are 8 loads and 2 stores in that code.  If the VM registers are
held in memory (as they usually are, and as the Gforth locals are),
the VM register code performs at least 9 loads (7 register accesses,
the getfield, and the caload) and 5 stores.  Of course, in Forth one
would write the block as:

: foo1 ( n3 a0 n1 -- n )
  31 * swap value rot + c@ + ;

and the code for that is (without the ";"):

$7FC624AA0A10 lit    1->2
$7FC624AA0A18 #31
7FC62464A617:   mov     r15,$08[rbx]
$7FC624AA0A20 *    2->1
7FC62464A61B:   imul    r13,r15
$7FC624AA0A28 swap    1->2
7FC62464A61F:   mov     r15,$08[r10]
7FC62464A623:   add     r10,$08
$7FC624AA0A30 lit+    2->2
$7FC624AA0A38 #8
7FC62464A627:   add     r15,$28[rbx]
$7FC624AA0A40 @    2->2
7FC62464A62B:   mov     r15,[r15]
$7FC624AA0A48 rot    2->3
7FC62464A62E:   mov     r9,$08[r10]
7FC62464A632:   add     r10,$08
$7FC624AA0A50 +    3->2
7FC62464A636:   add     r15,r9
$7FC624AA0A58 c@    2->2
7FC62464A639:   movzx   r15d,byte PTR [r15]
$7FC624AA0A60 +    2->1
7FC62464A63D:   add     r13,r15

6 loads, 0 stores.

And if we feed the equivalent standard code

0
field: some-field
field: value-addr
constant some-struct

: foo1 ( n3 a0 n1 -- n )
  31 * swap value-addr @ rot + c@ + ;

into other Forth systems, some produce even better code:

VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
FOO1
( 0050A310    486BDB1F )              IMUL    RBX, RBX, # 1F
( 0050A314    488B5500 )              MOV     RDX, [RBP]
( 0050A318    488B4D08 )              MOV     RCX, [RBP+08]
( 0050A31C    48034A08 )              ADD     RCX, [RDX+08]
( 0050A320    480FB609 )              MOVZX   RCX, Byte 0 [RCX]
( 0050A324    4803D9 )                ADD     RBX, RCX
( 0050A327    488D6D10 )              LEA     RBP, [RBP+10]
( 0050A32B    C3 )                    RET/NEXT
( 28 bytes, 8 instructions )

5 loads, 0 stores.  And VFX does not do data-flow analysis across
basic blocks, unlike the Java VM -> VM register compiler that Shi et
al. used; i.e., VFX is probably simpler than that compiler.

@Article{shi+08,
  author =       {Yunhe Shi and Kevin Casey and M. Anton Ertl and
                  David Gregg},
  title =        {Virtual machine showdown: Stack versus registers},
  journal =      {ACM Transactions on Architecture and Code
                  Optimization (TACO)},
  year =         {2008},
  volume =       {4},
  number =       {4},
  pages =        {21:1--21:36},
  month =        jan,
  url =          {http://doi.acm.org/10.1145/1328195.1328197},
  abstract =     {Virtual machines (VMs) enable the distribution of
                  programs in an architecture-neutral format, which
                  can easily be interpreted or compiled. A
                  long-running question in the design of VMs is
                  whether a stack architecture or register
                  architecture can be implemented more efficiently
                  with an interpreter. We extend existing work on
                  comparing virtual stack and virtual register
                  architectures in three ways. First, our translation
                  from stack to register code and optimization are
                  much more sophisticated. The result is that we
                  eliminate an average of more than 46\% of
                  executed VM instructions, with the bytecode size of
                  the register machine being only 26\% larger
                  than that of the corresponding stack one. Second, we
                  present a fully functional virtual-register
                  implementation of the Java virtual machine (JVM),
                  which supports Intel, AMD64, PowerPC and Alpha
                  processors. This register VM supports
                  inline-threaded, direct-threaded, token-threaded,
                  and switch dispatch. Third, we present experimental
                  results on a range of additional optimizations such
                  as register allocation and elimination of redundant
                  heap loads. On the AMD64 architecture the register
                  machine using switch dispatch achieves an average
                  speedup of 1.48 over the corresponding stack
                  machine. Even using the more efficient
                  inline-threaded dispatch, the register VM achieves a
                  speedup of 1.15 over the equivalent stack-based VM.}
}

>Clarity of the code comes as a "bonus" :) yes, we've
>got VALUEs and I use them when needed, but their use
>still means employing the "Forth machine".

What do you mean by 'the "Forth machine"', and how does "OOS"
(whatever that is) avoid it?

- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
