Subject: Re: Parsing timestamps?
From: peter.noreply (at) *nospam* tin.it (peter)
Newsgroups: comp.lang.forth
Date: 17. Jul 2025, 21:48:25
Organization: A noiseless patient Spider
Message-ID: <20250717224825.00007b8c@tin.it>
References: 1 2 3 4 5 6 7 8 9 10 11
User-Agent: Claws Mail 4.3.0 (GTK 3.24.42; x86_64-w64-mingw32)
On Thu, 17 Jul 2025 12:54:29 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
peter <peter.noreply@tin.it> writes:
Ryzen 9950X
>
lxf64
5,010,566,495 NAI cycles:u
2,011,359,782 UNR cycles:u
646,926,001 REC cycles:u
3,589,863,082 SR cycles:u
>
lxf64
7,019,247,519 NAI instructions:u
4,128,689,843 UNR instructions:u
4,643,499,656 REC instructions:u
25,019,182,759 SR instructions:u
>
>
gforth-fast 20250219
2,048,316,578 NAI cycles:u
7,157,520,448 UNR cycles:u
3,589,638,677 REC cycles:u
17,199,889,916 SR cycles:u
>
gforth-fast 20250219
13,107,999,739 NAI instructions:u
6,789,041,049 UNR instructions:u
9,348,969,966 REC instructions:u
50,108,032,223 SR instructions:u
>
lxf
6,005,617,374 NAI cycles:u
6,004,157,635 UNR cycles:u
1,303,627,835 REC cycles:u
9,187,422,499 SR cycles:u
>
lxf
9,010,888,196 NAI instructions:u
4,237,679,129 UNR instructions:u
4,955,258,040 REC instructions:u
26,018,680,499 SR instructions:u
lxf uses the x87 built-in FP stack; lxf64 uses SSE4 and a large FP stack.
Apparently the latency of ADDSD (SSE2) is down to 2 cycles on Zen5
(visible in lxf64 UNR and gforth-fast NAI) while the latency of FADD
(387) is still 6 cycles (lxf NAI and UNR). I have no explanation why
on lxf64 NAI performs so much worse than UNR, and in gforth-fast UNR
so much worse than NAI.
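The latency reading above can be checked with a little arithmetic. This is only a sketch: it assumes each run performs about 10^9 floating-point additions (the thread does not state the count; it is what the quoted F+/cycle figures imply), and the names `cycles`, `ADDS_PER_RUN` and `cycles_per_add` are made up for the illustration.

```python
# Cycle counts copied from the perf output quoted above.
cycles = {
    "lxf64":       {"NAI": 5_010_566_495, "UNR": 2_011_359_782},
    "gforth-fast": {"NAI": 2_048_316_578, "UNR": 7_157_520_448},
    "lxf":         {"NAI": 6_005_617_374, "UNR": 6_004_157_635},
}

# Assumption (not stated in the thread): about 1e9 F+ per benchmark run.
ADDS_PER_RUN = 1_000_000_000


def cycles_per_add(system, variant):
    """Average cycles spent per addition in one benchmark run."""
    return cycles[system][variant] / ADDS_PER_RUN


if __name__ == "__main__":
    for system, variants in cycles.items():
        for variant in variants:
            value = cycles_per_add(system, variant)
            print(f"{system:12s} {variant}: {value:.2f} cycles per F+")
```

Under that assumption, lxf64 UNR and gforth-fast NAI come out at roughly 2 cycles per addition and both lxf rows at roughly 6, which is where the 2-cycle ADDSD and 6-cycle FADD figures come from.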
For REC the latency should not play a role. There lxf64 performs at
7.2 IPC and 1.55 F+/cycle, whereas lxf performs only at 3.8 IPC and 0.77
F+/cycle. My guess is that FADD can only be performed by one FPU, and
that's connected to one dispatch port, and other instructions also
need or are at least assigned to this dispatch port.
- anton
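For reference, the IPC and F+/cycle figures for REC can be reproduced from the perf counts. A minimal sketch, again assuming about 10^9 additions per run (an inference from the quoted numbers, not something stated in the thread); `rec`, `ADDS_PER_RUN`, `ipc` and `adds_per_cycle` are illustrative names.

```python
# REC rows copied from the perf output quoted above.
rec = {
    "lxf64": {"cycles": 646_926_001, "instructions": 4_643_499_656},
    "lxf":   {"cycles": 1_303_627_835, "instructions": 4_955_258_040},
}

# Assumption (not stated in the thread): about 1e9 F+ per benchmark run.
ADDS_PER_RUN = 1_000_000_000


def ipc(system):
    """Instructions retired per cycle."""
    return rec[system]["instructions"] / rec[system]["cycles"]


def adds_per_cycle(system):
    """Floating-point additions completed per cycle."""
    return ADDS_PER_RUN / rec[system]["cycles"]


if __name__ == "__main__":
    for system in rec:
        print(f"{system}: {ipc(system):.2f} IPC, "
              f"{adds_per_cycle(system):.2f} F+/cycle")
```

With these inputs the ratios round to 7.2 IPC and 1.55 F+/cycle for lxf64, and 3.8 IPC and 0.77 F+/cycle for lxf, matching the figures in the text.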
I did a test, coding sum128 as a code word with AVX-512 instructions,
and got the following results:
285,584,376 cycles:u
941,856,077 instructions:u
The timing was
timer-reset ' recursive-sum bench .elapsed 51 ms elapsed
so half the time of the original recursive version.
With 32 zmm registers I could also have done a sum256.
The code is below for reference:
r13 is the FP stack pointer
rbx is the top of stack
xmm0 is the top of the FP stack
code asum128
movsd [r13-0x8], xmm0   ; save the old top of the fp stack to memory
lea r13, [r13-0x8]      ; grow the fp stack by one cell
vmovapd zmm0, [rbx]
vmovapd zmm1, [rbx+64]
vmovapd zmm2, [rbx+128]
vmovapd zmm3, [rbx+192]
vmovapd zmm4, [rbx+256]
vmovapd zmm5, [rbx+320]
vmovapd zmm6, [rbx+384]
vmovapd zmm7, [rbx+448]
vmovapd zmm8, [rbx+512]
vmovapd zmm9, [rbx+576]
vmovapd zmm10, [rbx+640]
vmovapd zmm11, [rbx+704]
vmovapd zmm12, [rbx+768]
vmovapd zmm13, [rbx+832]
vmovapd zmm14, [rbx+896]
vmovapd zmm15, [rbx+960]
vaddpd zmm0, zmm0, zmm1
vaddpd zmm2, zmm2, zmm3
vaddpd zmm4, zmm4, zmm5
vaddpd zmm6, zmm6, zmm7
vaddpd zmm8, zmm8, zmm9
vaddpd zmm10, zmm10, zmm11
vaddpd zmm12, zmm12, zmm13
vaddpd zmm14, zmm14, zmm15
vaddpd zmm0, zmm0, zmm2
vaddpd zmm4, zmm4, zmm6
vaddpd zmm8, zmm8, zmm10
vaddpd zmm12, zmm12, zmm14
vaddpd zmm0, zmm0, zmm4
vaddpd zmm8, zmm8, zmm12
vaddpd zmm0, zmm0, zmm8
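; Horizontal sum of zmm0
vextractf64x4 ymm1, zmm0, 1   ; ymm1 = upper 256 bits of zmm0
vaddpd ymm2, ymm1, ymm0       ; 8 partial sums -> 4
vextractf64x2 xmm3, ymm2, 1   ; xmm3 = upper 128 bits of ymm2
vaddpd ymm4, ymm3, ymm2       ; 4 -> 2 (only the low 128 bits are used)
vhaddpd xmm0, xmm4, xmm4      ; 2 -> 1, result in xmm0 (top of fp stack)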
ret
end-code
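To make the order of the additions easier to follow, here is a small scalar model of the same reduction. It is only a sketch in Python, not the lxf64 code: `vadd` stands in for vaddpd, a list of 8 floats stands in for a zmm register, and the names `LANES`, `vadd` and `tree_sum` are invented for the illustration.

```python
LANES = 8  # doubles per 512-bit zmm register


def vadd(a, b):
    """Lane-wise addition, standing in for vaddpd."""
    return [x + y for x, y in zip(a, b)]


def tree_sum(data):
    """Sum the data the way asum128 does: pairwise over registers, then lanes."""
    nregs, rest = divmod(len(data), LANES)
    if rest != 0 or nregs == 0 or nregs & (nregs - 1) != 0:
        raise ValueError("need a power-of-two number of 8-element registers")

    # The vmovapd loads: one list per zmm register.
    regs = [list(data[i * LANES:(i + 1) * LANES]) for i in range(nregs)]

    # Pairwise reduction over registers: 16 -> 8 -> 4 -> 2 -> 1 for 128 items.
    while len(regs) > 1:
        regs = [vadd(regs[i], regs[i + 1]) for i in range(0, len(regs), 2)]

    # Horizontal sum of the last register: 8 -> 4 -> 2 -> 1 lanes.
    v = regs[0]
    v = vadd(v[4:], v[:4])   # upper 256 bits + lower 256 bits
    v = vadd(v[2:], v[:2])   # upper 128 bits + lower 128 bits
    return v[0] + v[1]       # the final vhaddpd


if __name__ == "__main__":
    print(tree_sum([float(i) for i in range(128)]))
```

Passing 256 values models the sum256 variant with 32 registers. Because the additions are grouped differently from a simple left-to-right loop, the result can differ from a naive sum in the last bits for general floating-point data.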
lxf64 uses a modified fasm as the backend assembler,
so it has full support for all instructions.
BR
Peter