mhx@iae.nl (mhx) writes:
>On Mon, 14 Jul 2025 7:50:04 +0000, Anton Ertl wrote:
>>C) Perform tree addition
>>
>>a) Using 80-bit addition.  This will be faster than sequential
>>addition because in many cases several additions can run in
>>parallel.  It will also be quite accurate because it uses 80-bit
>>addition, and because the addition chains are reduced to
>>ld(length(vector)).
>>
>This looks very interesting.  I can find Kahan and Neumaier, but
>"tree addition" didn't turn up (there is a suspicious-looking
>reliability paper about the approach which surely is not what
>you meant).  Or is it pairwise addition that I should look for?
Yes, "tree addition" is not a common term, and Wikipedia calls it
pairwise summation.  Except that, unlike suggested in
<https://en.wikipedia.org/wiki/Pairwise_summation>, I would not switch
to a sequential approach for small n, for both accuracy and
performance reasons.
In any case the idea is to turn the evaluation tree from a degenerate
tree into a balanced tree. E.g., if you add up a, b, c, and d, then
the naive evaluation
a b f+ c f+ d f+
has the evaluation tree
a b
\ /
f+ c
\ /
f+ d
\ /
f+
with the three F+ each depending on the previous one, and also
increasing the rounding errors. If you balance the tree
a b c d
\ / \ /
f+ f+
\ /
f+
corresponding to
a b f+ c d f+ f+
the first two f+ can run in parallel (increasing performance), and the
rounding errors tend to be less.
So how to implement this for an arbitrary N? We had an extensive
discussion of a similar problem in the thread on the subject "balanced
REDUCE: a challenge for the brave", and you can find that discussion
at
<https://comp.lang.forth.narkive.com/GIg9V9HK/balanced-reduce-a-challenge-for-the-brave>
But I decided to use a recursive approach (recursive-sum, REC) that
uses the largest 2^k<n as the left child and the rest as the right
child, and as base cases for the recursion use a straight-line
balanced-tree evaluation for 2^k with k<=7 (and combine these for n
that are not 2^k). For systems with tiny FP stacks, I added the
option to save intermediate results on a software stack in the
recursive word. Concerning the straight-line code, it turned out that
the highest k I could use on sf64 and vfx64 is 5 (corresponding to 6
FP stack items); it's not clear to me why; on lxf I can use k=7 (and
it uses the 387 stack, too).
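To make the recursive idea concrete, here is a minimal sketch (my
sketch, not the code from pairwise-sum.4th): it splits the range in
the middle rather than at the largest 2^k<n, and has no straight-line
base cases, but it produces the same kind of balanced evaluation tree:

```forth
\ Sketch only, not the REC code from pairwise-sum.4th: sum u FLOATs
\ at addr with a balanced evaluation tree, splitting in the middle.
: rec-sum ( addr u -- ) ( F: -- r )
  dup 2 u< if
    if f@ else drop 0e then exit \ base cases: one element or none
  then
  dup 2/ >r                  \ R: half
  over r@ recurse            \ F: sum of the first half elements
  swap r@ floats + swap r> - \ advance addr by half, shrink u by half
  recurse f+ ;               \ F: sum of the whole range
```

The FP stack depth it needs is the recursion depth, i.e.,
ld(u) or so, which is why small base cases (or a software stack)
matter on systems with tiny FP stacks.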
I also coded the shift-reduce-sum algorithm (shift-reduce-sum, SR)
described in <https://en.wikipedia.org/wiki/Pairwise_summation> in
Forth, because it can make use of Forth's features (such as the FP
stack) where the C code has to hand-code it. It uses the FP stack
beyond 8 elements if there are more than 128 elements in the array, so
it does not work for the benchmark (with 100_000 elements in the
array) on lxf, sf64, and vfx64. As you will see, this is no loss.
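The shift-reduce idea can be sketched like this (again my sketch, not
the SR code from the file): the partial sums live on the FP stack;
after pushing element i (counting from 1), perform one F+ for every
trailing zero bit of i, and fold whatever partial sums remain at the
end:

```forth
\ Sketch of shift-reduce summation, not the SR code from the file.
\ After pushing element i (1-based), one F+ per trailing 0 bit of i;
\ popcount(u) partial sums remain at the end and are folded with F+.
: sr-sum ( addr u -- ) ( F: -- r )
  dup 0= if 2drop 0e exit then
  dup >r 0 ?do
    dup i floats + f@                                \ push a[i]
    i 1+ begin dup 1 and 0= while f+ 2/ repeat drop  \ merge partials
  loop drop
  r> 0 swap                                          \ count 1 bits of u
  begin dup while dup 1 and if swap 1+ swap then 2/ repeat drop
  1- 0 ?do f+ loop ;                                 \ fold leftovers
```

The maximal FP stack depth is about ld(i)+1 while processing element
i, which is consistent with the >8-deep usage for arrays beyond 128
elements mentioned above.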
I also coded the naive, sequential approach (naive-sum, NAI).
One might argue that the straight-line stuff in REC puts REC at an
advantage, so I also produced an unrolled version of the naive code
(unrolled-sum, UNR) that uses straight-line sequences for adding up to
2^7 elements to the intermediate result.
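The unrolling idea, sketched for a factor-4 unroll (hypothetical
word name; the file unrolls up to 2^7 elements):

```forth
\ Sketch of the UNR idea: add four consecutive FLOATs at addr to the
\ running sum r, sequentially, so each F+ still depends on the
\ previous one (latency-bound, unlike the balanced tree).
: unr4+ ( addr -- ) ( F: r -- r' )
  dup f@ f+  dup float+ f@ f+
  dup 2 floats + f@ f+  3 floats + f@ f+ ;
```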
You can find a file containing all these versions, compatibility
configurations for various Forth systems, and testing and benchmarking
code and data, on
https://www.complang.tuwien.ac.at/forth/programs/pairwise-sum.4th

I did not do any accuracy measurements, but I did performance
measurements on a Ryzen 5800X:
cycles:u
gforth-fast iforth lxf SwiftForth VFX
3_057_979_501 6_482_017_334 6_087_130_593 6_021_777_424 6_034_560_441 NAI
6_601_284_920 6_452_716_125 7_001_806_497 6_606_674_147 6_713_703_069 UNR
3_787_327_724 2_949_273_264 1_641_710_689 7_437_654_901 1_298_257_315 REC
9_150_679_812 14_634_786_781 SR
instructions:u
gforth-fast iforth lxf SwiftForth VFX
13_113_842_702 6_264_132_870 9_011_308_923 11_011_828_048 8_072_637_768 NAI
6_802_702_884 2_553_418_501 4_238_099_417 11_277_658_203 3_244_590_981 UNR
9_370_432_755 4_489_562_792 4_955_679_285 12_283_918_226 3_915_367_813 REC
51_113_853_111 29_264_267_850 SR
The versions used are:
Gforth 0.7.9_20250625
iForth 5.1-mini
lxf 1.7-172-983
SwiftForth x64-Linux 4.0.0-RC89
VFX Forth 64 5.43 [build 0199] 2023-11-09
The ":u" means that I measured what happened at the user-level, not at
the kernel-level.
Each benchmark run performs 1G f@ and f+ operations, and the naive
approach performs 1G iterations of the loop.
The NAIve and UNRolled results show that performance of both is
limited by the latency of the F+: 3 cycles for the DP SSE2 addition in
gforth-fast (1G dependent F+ at 3 cycles each accounts for the ~3G
cycles of NAI there), and 6 cycles for the 80-bit 387 fadd on the
other systems (accounting for their ~6G cycles).  It's unclear to me
why UNR is much slower than NAI on gforth-fast.
The RECursive balanced-tree sum is faster on iForth, lxf and VFX than
the NAIve and UNRolled versions. It is slower on Gforth: My guess is
that, despite all hardware advances, the lack of multi-state stack
caching in Gforth means that the hardware of the Ryzen 5800X does not
just see the real data flow, but a lot of additional dependences; or
it may be related to whatever causes the slowdown for UNRolled.
The SR (shift-reduce) sum looks cute, but performs so many additional
instructions, even on iForth, that it is uncompetitive.  It's unclear
to me what slows it down so much on iForth, however.
I expect that vectorized implementations using AVX will be several
times faster than the fastest scalar stuff we see here.
- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html