peter <peter.noreply@tin.it> writes:
>I did a test coding the sum128 as a code word with avx-512 instructions
>and got the following results
>
>285,584,376 cycles:u
>941,856,077 instructions:u
>
>timing was
>timer-reset ' recursive-sum bench .elapsed 51 ms elapsed
>
>so half the time of the original recursive.
>with 32 zmm registers I could have done a sum256 also

One could do sum128 with just 8 registers by performing the adds ASAP,
i.e., for sum32

  vmovapd zmm0, [rbx]
  vmovapd zmm1, [rbx+64]
  vaddpd zmm0, zmm0, zmm1
  vmovapd zmm1, [rbx+128]
  vmovapd zmm2, [rbx+192]
  vaddpd zmm1, zmm1, zmm2
  vaddpd zmm0, zmm0, zmm1
  ; and then the Horizontal sum

And you can code this as:

  vmovapd zmm0, [rbx]
  vaddpd zmm0, zmm0, [rbx+64]
  vmovapd zmm1, [rbx+128]
  vaddpd zmm1, zmm1, [rbx+192]
  vaddpd zmm0, zmm0, zmm1
  ; and then the Horizontal sum

>; Horizontal sum of zmm0
>
>vextractf64x4 ymm1, zmm0, 1
>vaddpd ymm2, ymm1, ymm0
>
>vextractf64x2 xmm3, ymm2, 1
>vaddpd ymm4, ymm3, ymm2
>
>vhaddpd xmm0, xmm4, xmm4

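For readers who do not have the AVX-512 mnemonics at hand, the halving pattern of this horizontal sum can be sketched in plain scalar C. This is only an illustration of the data flow, not the code that was measured; the name hsum8 and the array representation of a zmm register are assumptions made for the sketch.

```c
/* Sum 8 doubles by repeated halving, mirroring the
   512 -> 256 -> 128 -> 64 bit steps of the assembly version.
   hsum8 is a hypothetical name; v[] stands in for one zmm register. */
static double hsum8(const double v[8])
{
    double h4[4], h2[2];

    /* upper 256 bits + lower 256 bits (vextractf64x4, vaddpd) */
    for (int i = 0; i < 4; i++)
        h4[i] = v[i] + v[i + 4];

    /* upper 128 bits + lower 128 bits (vextractf64x2, vaddpd) */
    for (int i = 0; i < 2; i++)
        h2[i] = h4[i] + h4[i + 2];

    /* final pairwise add (vhaddpd) */
    return h2[0] + h2[1];
}
```

The three steps form a dependency chain, which is why the cost of doing this once per sum128 is worth considering.
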
Instead of doing the horizontal sum once for every sum128, it might be
more efficient (assuming the whole thing is not
cache-bandwidth-limited) to have the result of sum128 be a full SIMD
width, and then add them up with vaddpd instead of addsd, and do the
horizontal sum once in the end.
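
A minimal sketch of that idea in portable scalar C follows. It is a model of the data flow only: vec8, vec8_add, block_sum, vec8_hsum and sum_deferred are hypothetical names, the struct stands in for one zmm register, and the outer summation is a simple loop over blocks of 128 rather than the recursive summation being benchmarked.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8   /* doubles per 512-bit register (assumption: zmm width) */
#define BLOCK 128 /* elements handled by one sum128 */

/* Hypothetical stand-in for a SIMD register holding LANES partial sums. */
typedef struct { double lane[LANES]; } vec8;

/* Lane-wise add, the scalar equivalent of one vaddpd. */
static vec8 vec8_add(vec8 a, vec8 b)
{
    vec8 r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* sum128 variant that returns a full-width result, with no horizontal step. */
static vec8 block_sum(const double *p)
{
    vec8 acc = {{0}};
    for (size_t i = 0; i < BLOCK; i += LANES) {
        vec8 v;
        for (int j = 0; j < LANES; j++)
            v.lane[j] = p[i + j];
        acc = vec8_add(acc, v);
    }
    return acc;
}

/* The single horizontal sum, done once at the very end. */
static double vec8_hsum(vec8 v)
{
    double s = 0.0;
    for (int i = 0; i < LANES; i++)
        s += v.lane[i];
    return s;
}

/* Sum n doubles, n a multiple of BLOCK, combining blocks lane-wise. */
static double sum_deferred(const double *p, size_t n)
{
    assert(n % BLOCK == 0);
    vec8 total = {{0}};
    for (size_t i = 0; i < n; i += BLOCK)
        total = vec8_add(total, block_sum(p + i));
    return vec8_hsum(total);
}
```

Note that this changes the order in which the floating-point additions are performed, so the rounding of the result can differ from a version that reduces each block to a scalar first.
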
But if the recursive part is to be programmed in Forth, we would need
a way to represent a SIMD width of data in Forth, maybe with a SIMD
stack. I see a few problems there:
* What to do about the mask registers of AVX-512? In the RISC-V
  vector extension masks are stored in regular SIMD registers.
* There is a trend visible in ARM SVE and the RISC-V vector extension
  to have support for dealing with loops across longer vectors. Do we
  also need to support something like that?
For the RISC-V vector extension, see
<https://riscv.org/wp-content/uploads/2024/12/15.20-15.55-18.05.06.VEXT-bcn-v1.pdf>
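
To illustrate what "dealing with loops across longer vectors" can look like, here is a scalar C sketch loosely modelled on the strip-mining style of the RISC-V vector extension, where each iteration asks how many elements it may process. VLMAX, set_vl and vec_add are hypothetical names for this sketch, and set_vl is only a simplified analogue of what the hardware does.

```c
#include <assert.h>
#include <stddef.h>

#define VLMAX 8 /* hypothetical maximum number of elements per vector op */

/* Simplified analogue of a "set vector length" instruction:
   grant at most VLMAX of the remaining elements. */
static size_t set_vl(size_t remaining)
{
    return remaining < VLMAX ? remaining : VLMAX;
}

/* c[i] = a[i] + b[i] for 0 <= i < n, for any n.
   The last iteration simply runs with a shorter vl,
   so no separate scalar tail loop is needed. */
static void vec_add(double *c, const double *a, const double *b, size_t n)
{
    assert(n == 0 || (a && b && c));
    while (n > 0) {
        size_t vl = set_vl(n);
        for (size_t i = 0; i < vl; i++) /* stands in for one vector add */
            c[i] = a[i] + b[i];
        a += vl;
        b += vl;
        c += vl;
        n -= vl;
    }
}
```

A fixed-width SIMD stack would not express this directly, which is part of the question raised above.
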
One way to deal with all that would be to have a long-vector stack and
have something like my vector wordset
<https://github.com/AntonErtl/vectors>, where the sum of a vector
would be a word that is implemented in some lower-level way (e.g.,
assembly language); the sum of a vector is actually a planned, but not
yet existing feature of this wordset.
An advantage of having a (short) SIMD stack would be that one could
use SIMD operations for other uses where the long-vector wordset looks
too heavy-weight (or would need optimizations to get rid of the
long-vector overhead). The question is if enough such uses exist to
justify adding such a stack.
- anton