mhx@iae.nl (mhx) writes:
>On Mon, 14 Jul 2025 6:04:13 +0000, Anton Ertl wrote:
>>
>>[..] if your implementation performs the same
>>bit-exact operations for computing a transcendental function on two
>>IEEE 754 compliant platforms, the result will be bit-identical (if it
>>is a number). So just use the same implementations of transcendental
>>functions, and your results will be bit-identical; concerning the
>>NaNs, if you find a difference, check if the involved values are NaNs.
>
>When e.g. summing the elements of a DP vector, it is hard to see why
>that couldn't be done on the FPU stack (with 80 bits) before (possibly)
>storing the result to a DP variable in memory. I am not sure that Forth
>users would be able to resist that approach.
The question is: What properties do you want your computation to have?
1) Bit-identical result to a naively-coded IEEE 754 DP computation?
2) A more accurate result? How much more accuracy?
3) More performance?
If you want 1), there is little alternative to actually performing the
operations sequentially, using scalar SSE2 operations.
If you can live without 1), there's a wide range of options:
A) Perform the naive summation, but using 80-bit addition. This will
produce higher accuracy, but limit performance to typically 4
cycles or so per addition (as does the naive SSE2 approach),
because the latency of the floating-point addition is 4 cycles or
so (depending on the actual processor).
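For concreteness, option A might look like this in C, assuming an x86
platform where gcc or clang map long double to the 80-bit extended
format (the function name is my own):

```c
#include <stddef.h>

/* Naive sequential summation, but with an 80-bit intermediate
   accumulator (x86 long double).  Only the final store rounds
   to DP. */
double sum80(const double *v, size_t n)
{
    long double s = 0.0L;   /* 80-bit extended on x86 gcc/clang */
    for (size_t i = 0; i < n; i++)
        s += v[i];          /* each addition in extended precision */
    return (double)s;       /* single rounding to DP at the end */
}
```

On platforms where long double is just DP (e.g., MSVC), this degenerates
into the naive DP loop.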
B) Perform vectorized summation using SIMD instructions (e.g.,
AVX-512), with enough parallel additions (beyond the vector size)
that either the load unit throughput, the FPU throughput, or the
instruction issue rate will limit the performance. Reduce the n
intermediate results to one intermediate result in the end. If I
give the naive loop to gcc -O3 and allow it to pretend that
floating-point addition is associative, it produces such a
computation automatically. The result will typically be a little
more accurate than the result of 1), because the length of the
     addition chains is length(vector)/lanes+ld(lanes) rather than
length(vector).
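A scalar C model of what the vectorizer produces for option B, with four
independent accumulators standing in for the SIMD lanes (the names are
mine); reassociating the sum this way is exactly what the compiler needs
permission for:

```c
#include <stddef.h>

/* Scalar model of option B: four independent accumulators play
   the role of SIMD lanes; the reduction at the end combines
   them.  This changes the association order of the additions. */
double sum_lanes(const double *v, size_t n)
{
    double acc[4] = {0.0, 0.0, 0.0, 0.0};
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        acc[0] += v[i];     /* the four chains are independent, */
        acc[1] += v[i + 1]; /* so they can run in parallel      */
        acc[2] += v[i + 2];
        acc[3] += v[i + 3];
    }
    for (; i < n; i++)      /* leftover elements */
        acc[0] += v[i];
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```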
C) Perform tree addition
a) Using 80-bit addition. This will be faster than sequential
addition because in many cases several additions can run in
parallel. It will also be quite accurate because it uses 80-bit
addition, and because the addition chains are reduced to
ld(length(vector)).
  b) Using DP addition. This allows the use of SIMD instructions for
increased performance (except near the root of the tree), but the
accuracy is not as good as with 80-bit addition. It is still
good because the length of the addition chains is only
ld(length(vector)).
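Variant b) can be sketched as recursive pairwise summation in plain DP
C (function name mine):

```c
#include <stddef.h>

/* Option C b): recursive pairwise (tree) summation in DP.
   The addition-chain depth is ld(n) instead of n. */
double sum_tree(const double *v, size_t n)
{
    if (n == 0) return 0.0;
    if (n == 1) return v[0];
    size_t half = n / 2;
    /* the two halves are independent and can execute in parallel */
    return sum_tree(v, half) + sum_tree(v + half, n - half);
}
```

A production version would switch to an unrolled loop below some block
size rather than recursing down to single elements.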
D) Use Kahan summation (you must not allow the compiler to pretend
that FP addition is associative, or this will not work) or one of
its enhancements. This provides a very high accuracy, but (in case
of the original Kahan summation) requires four FP operations for
each summand, and each operation depends on the previous one. So
you get the latency of 4 FP additions per iteration for a version
that goes across the array sequentially. You can apply
vectorization to eliminate the effect of these latencies, but you
will still see the increased resource consumption. If the vector
   resides in a distant cache or in main memory, memory bandwidth
   may limit performance more than a lack of FPU resources, however.
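The original Kahan summation of option D, as a C sketch; note the four
FP operations per summand, each depending on the previous one:

```c
#include <stddef.h>

/* Option D: Kahan (compensated) summation.  The compensation c
   carries the rounding error of each addition into the next
   step.  Do NOT compile this with -ffast-math: a compiler that
   may pretend FP addition is associative will simplify c away. */
double sum_kahan(const double *v, size_t n)
{
    double s = 0.0, c = 0.0;
    for (size_t i = 0; i < n; i++) {
        double y = v[i] - c;    /* corrected summand */
        double t = s + y;       /* low-order bits of y are lost here */
        c = (t - s) - y;        /* recover the lost low-order bits */
        s = t;
    }
    return s;
}
```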
E) Sort the vector, then start with the element closest to 0. At
every step, add the element of the sign other than the current
intermediate sum that is closest to 0. If there is no such element
left, add the remaining elements in order, starting with the one
   closest to 0. This is pretty accurate, but slower than naive
   addition. At the current relative costs of sorting and FP
   operations, Kahan summation probably dominates this approach.
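A possible C sketch of option E, following my own reading of the
description above (names mine, error handling omitted, O(n^2) worst
case for the scans on top of the O(n log n) sort):

```c
#include <stdlib.h>
#include <math.h>

static int cmp_abs(const void *a, const void *b)
{
    double x = fabs(*(const double *)a);
    double y = fabs(*(const double *)b);
    return (x > y) - (x < y);
}

/* Option E: sort by magnitude, start with the element closest
   to 0, then greedily add the smallest-magnitude element whose
   sign differs from the current sum; when none is left, fall
   back to the remaining elements in magnitude order.
   Sorts v in place. */
double sum_sorted(double *v, size_t n)
{
    if (n == 0) return 0.0;
    qsort(v, n, sizeof *v, cmp_abs);
    char *used = calloc(n, 1);
    double s = v[0];                /* element closest to 0 */
    used[0] = 1;
    for (size_t added = 1; added < n; added++) {
        size_t i = 0;
        /* smallest-magnitude unused element of the opposite sign */
        while (i < n && (used[i] || (v[i] < 0) == (s < 0)))
            i++;
        if (i == n)                 /* none left: magnitude order */
            for (i = 0; used[i]; i++)
                ;
        s += v[i];
        used[i] = 1;
    }
    free(used);
    return s;
}
```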
So, as you can see, depending on your objectives there may be more
attractive ways to add a vector than what you suggested. Your
suggestion actually looks pretty unattractive, except if your
objectives are "ease of implementation" and "more accuracy than the
naive approach".
- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/