Subject: Re: Stack vs stackless operation
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Date: 25 Feb 2025, 08:26:58
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Feb25.082658@mips.complang.tuwien.ac.at>
References: 1 2 3
User-Agent: xrn 10.11
zbigniew2011@gmail.com (LIT) writes:
>>Probably because the case where the two operands
>>of a + are in memory, and the result is needed
>>in memory is not that frequent.
>
>One example could be matrix multiplication.
>It's rather trivial but cumbersome operation,
>where usually a few transitional variables are
>used to maintain clarity of the code.
Earlier you wrote about performance, now you switch to clarity of the
code. What is the goal?
If we stick with performance, the fastest version in
<http://theforth.net/package/matmul/current-view/matmul.4th> on all
systems I measured (among the versions that do not use a primitive
FAXPY) is version 2, and it spends most of its time in:
: faxpy-nostride ( ra f_x f_y ucount -- )
    \ vy=ra*vx+vy
    dup >r 3 and 0 ?do          \ first handle ucount mod 4 elements
        fdup over f@ f* dup f+! float+ swap float+ swap
        \ vy[i] := ra*vx[i]+vy[i], then advance both pointers
    loop
    r> 2 rshift 0 ?do           \ then ucount/4 iterations, unrolled 4x
        fdup over f@ f* dup f+! float+ swap float+ swap
        fdup over f@ f* dup f+! float+ swap float+ swap
        fdup over f@ f* dup f+! float+ swap float+ swap
        fdup over f@ f* dup f+! float+ swap float+ swap
    loop
    2drop fdrop ;               \ drop the two addresses and ra
It's not the clearest code, and certainly the version without
unrolling is clearer (and may be almost as fast in the newer versions
of SwiftForth and VFX, which make counted loops significantly faster):
: faxpy-nostride ( ra f_x f_y ucount -- )
    \ vy=ra*vx+vy
    0 ?do
        fdup over f@ f* dup f+! float+ swap float+ swap
    loop
    2drop fdrop ;
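A usage sketch (assuming a Gforth-style system where F, works after
CREATE; the vector names and values here are just illustrative):

```forth
create vx 1e f, 2e f, 3e f, 4e f      \ source vector
create vy 10e f, 20e f, 30e f, 40e f  \ destination vector
2e vx vy 4 faxpy-nostride             \ vy[i] := 2*vx[i] + vy[i]
vy f@ f.                              \ prints 12.
```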
Each iteration performs 2 FP loads and 1 FP store. With
memory-to-memory variants of F* and F+ that would be 4 FP loads and 2
FP stores, and I don't think it would be any clearer. And if you use
memory-to-memory variants of the address computation, things would
become even slower. And I doubt that they would become clearer.
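To make that count concrete, here is a sketch of what the loop body
might look like in a memory-to-memory style. The words F*> and F+> and
the variables RA-V and TMP are all made up for this sketch (they are
not standard and not anyone's proposal); note the 4 FP loads and 2 FP
stores per element, and that the address juggling is no clearer than
the stack version:

```forth
fvariable ra-v                     \ ra lives in memory in this style
fvariable tmp                      \ scratch cell for the intermediate

\ Hypothetical memory-to-memory operators:
: f*> ( fa1 fa2 fa3 -- ) >r f@ f@ f* r> f! ;  \ [fa3] := [fa1]*[fa2]
: f+> ( fa1 fa2 fa3 -- ) >r f@ f@ f+ r> f! ;  \ [fa3] := [fa1]+[fa2]

: faxpy-mem ( f_x f_y ucount -- )  \ vy := ra*vx+vy, ra taken from ra-v
    0 ?do
        ra-v 2 pick tmp f*>        \ tmp := ra*vx[i]    (2 loads, 1 store)
        dup tmp swap dup f+>       \ vy[i] := tmp+vy[i] (2 loads, 1 store)
        float+ swap float+ swap    \ advance both pointers
    loop
    2drop ;
```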
Some time later I worked on how SIMD could be integrated into Forth,
and used matrix multiplication as an example. With the wordset I
proposed, this whole loop became
( v1 r addr ) v@ f*vs f+v ( v2 )
Only one memory access is visible here at all; there are some more in
the implementation of these words, however. You can find the paper
about that at <http://www.euroforth.org/ef17/papers/ertl.pdf>. A
further refinement of that work can be found at
<https://www.complang.tuwien.ac.at/papers/ertl18manlang.pdf>
(presented in a Java setting for the audience of the conference, but
the implementation was in a Forth setting, see
<https://github.com/AntonErtl/vectors>). This work eliminates many of
the memory accesses that the earlier implementation performs,
demonstrating that the memory accesses are not fundamental in the
model. In particular, Figure 11 shows code corresponding to
( v1 r1 addr1 r2 addr2 ) v@ f*vs v@ f+v v@ f*vs f+v ( v2 )
i.e., the code above unrolled by a factor of 2; it has 3 SIMD loads
and 1 SIMD store per SIMD-granule processed (the SIMD granule is 4
doubles for AVX). Further unrolling results in even fewer loads and
stores per FLOP (FP multiplication and FP addition).
>Probably "bigger" Forth compilers are indeed
>already "too good" for the difference to be
>(practically) noticeable — still maybe for
>simpler Forths, I mean like the ones for DOS
>or even for 8-bit machines it would make sense?
Forth was designed for small machines and very simple implementations.
We have words like "1+" that are beneficial in that setting. We also
have "+!", which is the closest to what you have in mind. But even in
those times nobody went for a word like "+> ( addr1 addr2 addr3 -- )",
because it is not useful often enough.
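Such a word is easy enough to define, which underlines that usefulness,
not difficulty, is why it never caught on. A sketch, assuming the
plausible reading [addr3] := [addr1] + [addr2] (the word and its stack
order are hypothetical):

```forth
: +> ( addr1 addr2 addr3 -- ) >r @ swap @ + r> ! ;

\ Usage sketch:
variable a  variable b  variable c
3 a !  4 b !
a b c +>        \ c now holds 7
c @ .           \ prints 7
```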
- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/