Subject: Re: Stack vs stackless operation
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Date: 25 Feb 2025, 08:26:58
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Feb25.082658@mips.complang.tuwien.ac.at>
References: 1 2 3
User-Agent: xrn 10.11
zbigniew2011@gmail.com (LIT) writes:
>>Probably because the case where the two operands
>>of a + are in memory, and the result is needed
>>in memory is not that frequent.
>
>One example could be matrix multiplication.
>It's rather trivial but cumbersome operation,
>where usually a few transitional variables are
>used to maintain clarity of the code.
Earlier you wrote about performance, now you switch to clarity of the
code. What is the goal?
If we stick with performance, the fastest version in
<http://theforth.net/package/matmul/current-view/matmul.4th> on all
systems I measured (among the versions that do not use a primitive
FAXPY) is version 2, and it spends most of its time in:
: faxpy-nostride ( ra f_x f_y ucount -- )
    \ vy=ra*vx+vy
    dup >r 3 and 0 ?do          \ first handle ucount mod 4 elements
        fdup over f@ f* dup f+! float+ swap float+ swap
        \ vy[i] := ra*vx[i]+vy[i], then advance both pointers
    loop
    r> 2 rshift 0 ?do           \ then ucount/4 iterations, unrolled 4x
        fdup over f@ f* dup f+! float+ swap float+ swap
        fdup over f@ f* dup f+! float+ swap float+ swap
        fdup over f@ f* dup f+! float+ swap float+ swap
        fdup over f@ f* dup f+! float+ swap float+ swap
    loop
    2drop fdrop ;               \ drop the two addresses and ra
It's not the clearest code, and certainly the version without
unrolling is clearer (and may be almost as fast in the newer versions
of SwiftForth and VFX, which make counted loops significantly faster):
: faxpy-nostride ( ra f_x f_y ucount -- )
    \ vy=ra*vx+vy
    0 ?do
        fdup over f@ f* dup f+! float+ swap float+ swap
    loop
    2drop fdrop ;
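A usage sketch (assuming a Gforth-style system where F, works after
CREATE; the vector names and values here are just illustrative):

```forth
create vx 1e f, 2e f, 3e f, 4e f      \ source vector
create vy 10e f, 20e f, 30e f, 40e f  \ destination vector
2e vx vy 4 faxpy-nostride             \ vy[i] := 2*vx[i] + vy[i]
vy f@ f.                              \ prints 12.
```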
Each iteration performs 2 FP loads and 1 FP store. With
memory-to-memory variants of F* and F+ that would be 4 FP loads and 2
FP stores, and I don't think it would be any clearer. And if you use
memory-to-memory variants of the address computation, things would
become even slower. And I doubt that they would become clearer.
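To make that count concrete, here is a sketch of what the loop body
might look like in a memory-to-memory style. The words F*> and F+> and
the variables RA-V and TMP are all made up for this sketch (they are
not standard and not anyone's proposal); note the 4 FP loads and 2 FP
stores per element, and that the address juggling is no clearer than
the stack version:

```forth
fvariable ra-v                     \ ra lives in memory in this style
fvariable tmp                      \ scratch cell for the intermediate

\ Hypothetical memory-to-memory operators:
: f*> ( fa1 fa2 fa3 -- ) >r f@ f@ f* r> f! ;  \ [fa3] := [fa1]*[fa2]
: f+> ( fa1 fa2 fa3 -- ) >r f@ f@ f+ r> f! ;  \ [fa3] := [fa1]+[fa2]

: faxpy-mem ( f_x f_y ucount -- )  \ vy := ra*vx+vy, ra taken from ra-v
    0 ?do
        ra-v 2 pick tmp f*>        \ tmp := ra*vx[i]    (2 loads, 1 store)
        dup tmp swap dup f+>       \ vy[i] := tmp+vy[i] (2 loads, 1 store)
        float+ swap float+ swap    \ advance both pointers
    loop
    2drop ;
```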
Some time later I worked on how SIMD could be integrated into Forth,
and used matrix multiplication as an example. With the wordset I
proposed, this whole loop became
( v1 r addr ) v@ f*vs f+v ( v2 )
Only one memory access is visible here at all; there are some more in
the implementation of these words, however. You can find the paper
about that at <http://www.euroforth.org/ef17/papers/ertl.pdf>. A
further refinement of that work can be found at
<https://www.complang.tuwien.ac.at/papers/ertl18manlang.pdf>
(presented in a Java setting for the audience of the conference, but
the implementation was in a Forth setting, see
<https://github.com/AntonErtl/vectors>). This work eliminates many of
the memory accesses that the earlier implementation performs,
demonstrating that the memory accesses are not fundamental in the
model. In particular, Figure 11 shows code corresponding to
( v1 r1 addr1 r2 addr2 ) v@ f*vs v@ f+v v@ f*vs f+v ( v2 )
i.e., the code above unrolled by a factor of 2; it has 3 SIMD loads
and 1 SIMD store per SIMD-granule processed (the SIMD granule is 4
doubles for AVX). Further unrolling results in even fewer loads and
stores per FLOP (FP multiplication and FP addition).
>Probably "bigger" Forth compilers are indeed
>already "too good" for the difference to be
>(practically) noticeable — still maybe for
>simpler Forths, I mean like the ones for DOS
>or even for 8-bit machines it would make sense?
Forth was designed for small machines and very simple implementations.
We have words like "1+" that are beneficial in that setting. We also
have "+!", which is the closest to what you have in mind. But even in
those times nobody went for a word like "+> ( addr1 addr2 addr3 -- )",
because it is not useful often enough.
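Such a word is easy enough to define, which underlines that usefulness,
not difficulty, is why it never caught on. A sketch, assuming the
plausible reading [addr3] := [addr1] + [addr2] (the word and its stack
order are hypothetical):

```forth
: +> ( addr1 addr2 addr3 -- ) >r @ swap @ + r> ! ;

\ Usage sketch:
variable a  variable b  variable c
3 a !  4 b !
a b c +>        \ c now holds 7
c @ .           \ prints 7
```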
- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/