On Wed, 16 Jul 2025 15:39:26 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
I did not do any accuracy measurements, but I did performance
measurements on a Ryzen 5800X:
cycles:u
gforth-fast iforth lxf SwiftForth VFX
3_057_979_501 6_482_017_334 6_087_130_593 6_021_777_424 6_034_560_441 NAI
6_601_284_920 6_452_716_125 7_001_806_497 6_606_674_147 6_713_703_069 UNR
3_787_327_724 2_949_273_264 1_641_710_689 7_437_654_901 1_298_257_315 REC
9_150_679_812 14_634_786_781 SR
This second table is about instructions:u
gforth-fast iforth lxf SwiftForth VFX
13_113_842_702 6_264_132_870 9_011_308_923 11_011_828_048 8_072_637_768 NAI
6_802_702_884 2_553_418_501 4_238_099_417 11_277_658_203 3_244_590_981 UNR
9_370_432_755 4_489_562_792 4_955_679_285 12_283_918_226 3_915_367_813 REC
51_113_853_111 29_264_267_850 SR
- anton
I have now run this test on my Ryzen 9950X for lxf, lxf64 and a snapshot of gforth.
Here are the results:
Ryzen 9950X
lxf64
5,010,566,495 NAI cycles:u
2,011,359,782 UNR cycles:u
646,926,001 REC cycles:u
3,589,863,082 SR cycles:u
lxf64
7,019,247,519 NAI instructions:u
4,128,689,843 UNR instructions:u
4,643,499,656 REC instructions:u
25,019,182,759 SR instructions:u
gforth-fast 20250219
2,048,316,578 NAI cycles:u
7,157,520,448 UNR cycles:u
3,589,638,677 REC cycles:u
17,199,889,916 SR cycles:u
gforth-fast 20250219
13,107,999,739 NAI instructions:u
6,789,041,049 UNR instructions:u
9,348,969,966 REC instructions:u
50,108,032,223 SR instructions:u
lxf
6,005,617,374 NAI cycles:u
6,004,157,635 UNR cycles:u
1,303,627,835 REC cycles:u
9,187,422,499 SR cycles:u
lxf
9,010,888,196 NAI instructions:u
4,237,679,129 UNR instructions:u
4,955,258,040 REC instructions:u
26,018,680,499 SR instructions:u
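To make the cycle and instruction counts easier to compare, they can be combined into instructions per cycle (IPC). The following is only a small sketch in Python; the names RESULTS_9950X and ipc_table are made up for illustration, and the numbers are copied from the 9950X results above. For lxf64 it gives roughly 1.4 IPC for NAI and roughly 7.2 for REC, while gforth-fast reaches about 6.4 for NAI.

```python
# (cycles:u, instructions:u) copied from the Ryzen 9950X results above.
# The dictionary layout and all names here are illustrative assumptions.
RESULTS_9950X = {
    "lxf64": {
        "NAI": (5_010_566_495, 7_019_247_519),
        "UNR": (2_011_359_782, 4_128_689_843),
        "REC": (646_926_001, 4_643_499_656),
        "SR": (3_589_863_082, 25_019_182_759),
    },
    "gforth-fast 20250219": {
        "NAI": (2_048_316_578, 13_107_999_739),
        "UNR": (7_157_520_448, 6_789_041_049),
        "REC": (3_589_638_677, 9_348_969_966),
        "SR": (17_199_889_916, 50_108_032_223),
    },
    "lxf": {
        "NAI": (6_005_617_374, 9_010_888_196),
        "UNR": (6_004_157_635, 4_237_679_129),
        "REC": (1_303_627_835, 4_955_258_040),
        "SR": (9_187_422_499, 26_018_680_499),
    },
}


def ipc_table(results):
    """Return instructions per cycle for every system and variant."""
    return {
        system: {
            variant: instructions / cycles
            for variant, (cycles, instructions) in rows.items()
        }
        for system, rows in results.items()
    }


# Print one line per system with the IPC of each variant.
for system, rows in ipc_table(RESULTS_9950X).items():
    line = "  ".join(f"{variant} {value:5.2f}" for variant, value in rows.items())
    print(f"{system:22s} {line}")
```

The counts cover the whole run as measured, so the ratios are only a rough guide to what the inner loops achieve.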
Doing the milliseconds timing gives
lxf64 native code
timer-reset ' naive-sum bench .elapsed 889 ms elapsed ok
timer-reset ' unrolled-sum bench .elapsed 360 ms elapsed ok
timer-reset ' recursive-sum bench .elapsed 114 ms elapsed ok
timer-reset ' shift-reduce-sum bench .elapsed 647 ms elapsed ok
lxf64 token code
timer-reset ' naive-sum bench .elapsed 2´284 ms elapsed ok
timer-reset ' unrolled-sum bench .elapsed 2´723 ms elapsed ok
timer-reset ' recursive-sum bench .elapsed 3´474 ms elapsed ok
timer-reset ' shift-reduce-sum bench .elapsed 6´842 ms elapsed ok
lxf
timer-reset ' naive-sum bench .elapsed 1073 milli-seconds ok
timer-reset ' unrolled-sum bench .elapsed 1103 milli-seconds ok
timer-reset ' recursive-sum bench .elapsed 234 milli-seconds ok
timer-reset ' shift-reduce-sum bench .elapsed 1632 milli-seconds ok
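As a plausibility check, the cycles:u counts can be divided by the elapsed times to get an implied clock rate. This is a sketch only, written in Python, and it rests on two assumptions that the numbers themselves do not prove: that the cycle counts and the millisecond timings cover the same work, and that the lxf64 cycle counts belong to the native code runs. The names RUNS and implied_ghz are made up for illustration.

```python
# (cycles:u, elapsed ms) pairs copied from the 9950X results above.
# Pairing the lxf64 counts with the "native code" timings is an assumption.
RUNS = {
    "lxf64 native code": {
        "NAI": (5_010_566_495, 889),
        "UNR": (2_011_359_782, 360),
        "REC": (646_926_001, 114),
        "SR": (3_589_863_082, 647),
    },
    "lxf": {
        "NAI": (6_005_617_374, 1073),
        "UNR": (6_004_157_635, 1103),
        "REC": (1_303_627_835, 234),
        "SR": (9_187_422_499, 1632),
    },
}


def implied_ghz(cycles, elapsed_ms):
    """Clock rate in GHz implied by a cycle count and a wall-clock time."""
    seconds = elapsed_ms / 1000.0
    return cycles / seconds / 1e9


# Print the implied clock rate for every system and variant.
for system, rows in RUNS.items():
    for variant, (cycles, elapsed_ms) in rows.items():
        print(f"{system:18s} {variant:3s} {implied_ghz(cycles, elapsed_ms):.2f} GHz")
```

Under those assumptions every pair comes out between about 5.4 and 5.7 GHz, so the two sets of measurements look consistent with each other.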
It is interesting to note how the "best algorithm" changes depending
on the underlying system implementation.
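To illustrate, here is a small sketch in Python that picks the variant with the fewest cycles for each system in the 5800X cycles:u table quoted above. The names are made up for illustration. The SR row is left out because it lists only two values; both are larger than every system's best count in the other rows, so it cannot change the winner whichever systems they belong to.

```python
# Column order as in the quoted 5800X table header.
SYSTEMS = ["gforth-fast", "iforth", "lxf", "SwiftForth", "VFX"]

# cycles:u per variant, one value per system, copied from the quoted table.
# SR is omitted: only two values are given for it (see the note above).
CYCLES_5800X = {
    "NAI": [3_057_979_501, 6_482_017_334, 6_087_130_593, 6_021_777_424, 6_034_560_441],
    "UNR": [6_601_284_920, 6_452_716_125, 7_001_806_497, 6_606_674_147, 6_713_703_069],
    "REC": [3_787_327_724, 2_949_273_264, 1_641_710_689, 7_437_654_901, 1_298_257_315],
}


def fastest_variant(table, systems):
    """Map each system to the variant with the lowest cycle count."""
    winners = {}
    for column, system in enumerate(systems):
        winners[system] = min(table, key=lambda variant: table[variant][column])
    return winners


# Print the winning variant for every system.
for system, variant in fastest_variant(CYCLES_5800X, SYSTEMS).items():
    print(f"{system:12s} {variant}")
```

On these numbers NAI wins on gforth-fast and SwiftForth, while REC wins on iforth, lxf and VFX.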
lxf uses the built-in x87 FP stack; lxf64 uses SSE4 and a large FP stack.
Thanks for these tests, they uncovered a problem with the lxf64 code
generator. It could only handle 114 immediate values in a basic block!
Both sum128 and nsum128 compile to gigantic functions of over 2k of compiled code.
Best Regards
Peter