Re: Performance benefits of primitive-centric code

Subject: Re: Performance benefits of primitive-centric code
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Date: 13 Jun 2025, 06:09:39
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Jun13.070939@mips.complang.tuwien.ac.at>
References: 1 2 3 4 5 6
User-Agent: xrn 10.11

minforth@gmx.net (minforth) writes:
>It looks like the biggest improvement came from switching
>to the benchmark engine. What does that mean?

The speedup from switching to the benchmark engine means that the
debugging features of the debugging engine have a cost.  See below.

However, the speedup factors from adding dynamic superinstructions
with replication and from optimizing away IP updates are higher for
several benchmarks.

As for the cost of debugging features, let's look at the code for

: squared dup * ;

for the two engines compared in this step, and for default gforth-fast
(all optimizations enabled):

debugging               benchmarking            benchmarking
with ip updates, no multi-state stack caching   all optimizations
dup    0->0             dup    1->1             dup    1->2        
  mov     $50[r13],r15    add     rbx,$08         mov     r15,r13  
  add     r15,$08         mov     [r10],r13                        
  mov     rax,[r14]       sub     r10,$08                          
  sub     r14,$08                                                  
  mov     [r14],rax                                                
*    0->0               *    1->1               *    2->1          
  mov     $50[r13],r15    add     rbx,$08         imul    r13,r15  
  add     r15,$08         imul    r13,$08[r10]                     
  mov     rax,$08[r14]    add     r10,$08                          
  imul    rax,[r14]                                                
  add     r14,$08                                                  
  mov     [r14],rax                                                
;s    0->0              ;s    1->1              ;s    1->1         
  mov     $50[r13],r15    mov     rbx,[r14]       mov     rbx,[r14]
  mov     rax,$58[r13]    add     r14,$08         add     r14,$08  
  mov     r10,[rax]       mov     rax,[rbx]       mov     rax,[rbx]
  add     rax,$08         jmp     eax             jmp     eax      
  mov     $58[r13],rax
  mov     r15,r10
  mov     rcx,[r15]
  jmp     ecx

In the debugging engine, at the start of each primitive, you see the
instruction

mov     $50[r13],r15

This saves the current instruction pointer.  If there is a signal,
e.g., because a stack underflow produces a segmentation violation, the
signal handler can then save the instruction pointer in the backtrace
and cause a Forth-level THROW, and the system CATCH handler can then
report exactly where the stack underflow happened.
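
A rough C sketch of this mechanism (not Gforth's code; the struct, the
names, and the way the THROW is triggered are simplified for
illustration): each primitive publishes its IP in a structure, and the
SIGSEGV handler only has to read it from there.

#include <signal.h>
#include <setjmp.h>
#include <string.h>

typedef long Cell;

/* Hypothetical analogue of the structure that r13 points to. */
struct engine_state {
  Cell *saved_ip;    /* the $50[r13] slot: IP of the current primitive */
  Cell *saved_rp;    /* the $58[r13] slot: return stack pointer */
};

static struct engine_state state;
static sigjmp_buf throw_point;
static Cell *backtrace_ip;    /* recorded for the backtrace */

/* Every primitive of the debugging engine starts with this store
   (the "mov $50[r13],r15" in the first column). */
static void publish_ip(Cell *ip)
{
  state.saved_ip = ip;
}

/* The signal handler reads the published IP, records it, and unwinds
   to the Forth-level exception handling (sketched here as siglongjmp;
   the real system classifies the fault and THROWs the matching code). */
static void segv_handler(int sig, siginfo_t *si, void *ctx)
{
  (void)sig; (void)si; (void)ctx;
  backtrace_ip = state.saved_ip;
  siglongjmp(throw_point, 1);
}

static void install_handler(void)
{
  struct sigaction sa;
  memset(&sa, 0, sizeof sa);
  sa.sa_sigaction = segv_handler;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGSEGV, &sa, NULL);
}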

In order for that to work, the signal handler also needs to know the
return stack pointer, so in the debugging engine we don't keep the
return stack pointer in a local variable (which ideally is kept in a
register), but keep it in a struct, and we see the accesses to this
struct in ";S":

mov     rax,$58[r13]
...
mov     $58[r13],rax

The benchmarking engine does not have all these memory accesses.
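
In C terms the difference looks roughly like this (a sketch with
made-up names, not the generated engine code): the benchmarking engine
keeps rp in a local variable that gcc can put in a register, while the
debugging engine keeps it in the struct and therefore has to load it
and store it back:

typedef long Cell;

struct engine_state { Cell *saved_ip; Cell *saved_rp; };

/* Benchmarking engine: rp is a local variable of the interpreter
   function and ends up in a register (r14 in the second column), so
   ;S is just
       ip = (Cell *)*rp++;     (mov rbx,[r14]; add r14,$08)
   with no memory traffic for rp itself. */

/* Debugging engine: rp lives in the engine struct so that the signal
   handler can see it, so ;S additionally loads and stores it. */
static Cell *semis_debug(struct engine_state *st)
{
  Cell *rp = st->saved_rp;    /* mov rax,$58[r13] */
  Cell *ip = (Cell *)*rp;     /* mov r10,[rax] */
  rp++;                       /* add rax,$08 */
  st->saved_rp = rp;          /* mov $58[r13],rax */
  return ip;                  /* execution continues at the new IP */
}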

Moreover, in order to report stack underflows even in cases like DUP,
the debugging engine keeps no stack item in a register across
primitive boundaries, while the benchmarking engine keeps one stack
item in a register in the second column and 0-3 in the third column.
So we see all these accesses through [r14] (the data stack pointer) in
the debugging engine, fewer accesses through [r10] (the data stack
pointer in this engine) in the second column, and no data stack memory
accesses at all in the third column.
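
Sketched in C (a rough illustration with simplified names and stack
conventions, not the generated engine code): without caching, DUP
reads the old top of stack from memory and writes the copy back; with
the top of stack cached in a register, only the push of the cached
value to memory remains.

typedef long Cell;

/* Debugging engine: nothing is cached; sp (r14) points at the top item
   and the stack grows towards lower addresses.  DUP loads the top from
   memory and pushes a copy. */
static Cell *dup_uncached(Cell *sp)
{
  Cell top = sp[0];           /* mov rax,[r14] */
  --sp;                       /* sub r14,$08 */
  sp[0] = top;                /* mov [r14],rax */
  return sp;
}

/* Benchmarking engine with fixed 1->1 caching: the top of stack lives
   in a register (r13) and sp (r10) points at the first free slot, so
   DUP only spills the cached top into that slot; the register copy
   stays the top of stack. */
static Cell *dup_cached(Cell *sp, Cell cached_top)
{
  sp[0] = cached_top;         /* mov [r10],r13 */
  return sp - 1;              /* sub r10,$08 */
}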

In addition, the debugging engine keeps the item below the stack
bottom in inaccessible memory, so that every stack underflow produces
a signal.  This has no additional run-time cost.
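
It costs nothing per primitive because only the memory mapping of the
stack changes; there is no run-time check.  A rough POSIX sketch of
the idea (not Gforth's actual allocation code; page and stack sizes
are made up):

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE        4096
#define STACK_BYTES (64 * 1024)

/* The data stack grows towards lower addresses (pushes do "sub sp,8"),
   so the item below the stack bottom lives just past the high end of
   the stack area.  Mapping that page PROT_NONE turns every
   underflowing access into a SIGSEGV without any run-time check. */
static long *alloc_guarded_stack(void)
{
  char *base = mmap(NULL, STACK_BYTES + PAGE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED) { perror("mmap"); exit(1); }
  if (mprotect(base + STACK_BYTES, PAGE, PROT_NONE) != 0) {
    perror("mprotect"); exit(1);
  }
  return (long *)(base + STACK_BYTES);   /* sp of the empty stack */
}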

The bottom line is that in the debugging engine every stack underflow
causes a SIGSEGV, and we get a backtrace that includes the primitive
that caused the stack underflow:

: squared dup * ;  ok
.s <0>  ok
squared
*the terminal*:3:1: error: Stack underflow
squared<<<
Backtrace:
*terminal*:1:11:                         0 $7FFB668A0DA0 dup

Gforth also keeps information about the
source-code-to-instruction-pointer mapping, and reports the location
of the source code ("*terminal*:1:11:") in addition to decompiling the
involved word ("dup").  The "0" is the index of the backtrace entry
(if you want to look at the code for this backtrace entry), and the
"$7FFB668A0DA0" is the actual value of the return stack item in the
backtrace.

By contrast, the benchmarking engine does not notice the stack
underflow in this case, and even in cases where a primitive causes a
signal (e.g., when @ tries to access inaccessible memory), neither the
instruction pointer nor the return stack pointer is available to the
signal handler, so you get no backtrace from THROWs due to signals
caused by primitives.

These are the differences between the first and second column, i.e.,
between the debugging and the benchmarking engine at otherwise the
same optimization level (without optimizing IP updates away, and
without multi-state stack caching).  Let's look at the differences
between the second and third column (all optimizations).

The first difference is that the threaded-code instruction pointer
updates are optimized away in the third column.  In the second column,
they are still present:

add     rbx,$08

at the start of DUP and *.  At the start of ;S there is no such
update, because the instruction pointer rbx is overwritten by the load
of the instruction pointer from the return stack.
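
In C terms, roughly (a sketch, not the generated code): the second
column corresponds to a DUP whose native code begins with the IP
increment even though DUP never uses the IP; in the third column the
native code of DUP, * and ;S is concatenated, nothing before ;S needs
the IP, and ;S overwrites it anyway, so the increments are simply
omitted.

typedef long Cell;

/* Second column: every primitive advances the threaded-code IP (rbx),
   even though DUP itself never reads it. */
static Cell *dup_with_ip_update(Cell **ip, Cell *sp, Cell cached_top)
{
  ++*ip;                      /* add rbx,$08 */
  sp[0] = cached_top;         /* mov [r10],r13 */
  return sp - 1;              /* sub r10,$08 */
}

/* Third column: within the concatenated native code for "dup * ;s",
   the IP is not needed until ;S reloads it from the return stack, so
   the per-primitive increment is omitted.  (The third column also
   keeps the duplicate in a register; see below.) */
static Cell *dup_without_ip_update(Cell *sp, Cell cached_top)
{
  sp[0] = cached_top;
  return sp - 1;
}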

The other difference is that the second column always has one stack
item in a register, whereas the third column supports different stack
representations.  In particular, the "dup 1->2" means that DUP starts
with one stack item in a register, and finishes with two stack items
in registers; "* 2->1" means that * starts with two stack items in
registers and ends with one stack item in a register.  This
multi-state stack caching eliminates the overhead of storing stack
items to memory, loading stack items from memory, and updating the
data stack pointer (r10 in column 2).
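
As a rough C sketch of these two state transitions (illustration only;
the real engine generates a variant of each primitive for each cache
state):

typedef long Cell;

/* Cache state 1: one stack item in a register (tos0).
   Cache state 2: two stack items in registers (tos0 = top, tos1 = second). */
struct cache2 { Cell tos0, tos1; };

/* dup 1->2: the cached top is copied into a second register; no memory
   access, no stack-pointer update (cf. "mov r15,r13"). */
static struct cache2 dup_1_to_2(Cell tos0)
{
  struct cache2 out = { tos0, tos0 };
  return out;
}

/* * 2->1: both operands are already in registers, so the multiplication
   is a single instruction and the result stays cached (cf. "imul r13,r15"). */
static Cell star_2_to_1(struct cache2 in)
{
  return in.tos0 * in.tos1;
}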

- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
