On 8/13/2024 12:24 PM, MitchAlsup1 wrote:
On Tue, 13 Aug 2024 3:50:04 +0000, BGB wrote:
On 8/12/2024 8:23 PM, MitchAlsup1 wrote:
On Tue, 13 Aug 2024 0:34:55 +0000, BGB wrote:
>
On 8/12/2024 5:35 PM, MitchAlsup1 wrote:
On Mon, 12 Aug 2024 20:58:45 +0000, BGB wrote:
>
On 8/12/2024 3:12 PM, MitchAlsup1 wrote:
>
See polpak:: r8_erf()
>
>
r8_erf: ; @r8_erf
<snip>
>
Why don't you download polpak, compile it, and state how many
instructions it takes and how many words of storage it takes ??
>
Found what I assume you are talking about.
>
Needed to add "polpak_test.c" as otherwise BGBCC lacks a main and prunes
everything;
Also needed to hack over some compiler holes related to "double
_Complex" to get it to build;
Also needed to stub over some library functions that were added in C99
but missing in my C library.
>
I only ask for r8_erf()
>
<snip>
>
As for "r8_erf()":
>
<===
>
r8_erf:
<snip>
>
I count 283 instructions compared to my 85, including the 104 instructions
it takes your compiler to get to the 1st instruction in My 66000 code !!
>
>
Yeah, this is a compiler issue...
Why not sit down and code it in ASM to see what your ISA can really do?
Feel free to use My 66000 code as an example.
Assuming I use all of the ISA features that currently exist:
r8_erf: ; @r8_erf
MOV R4, R1
FABS R1,R2
FCMPGT 0x3780, R2 //Half
BF .LBB141_5
FCMPGT 0x4400, R2 //Half
BF .LBB141_6
FCMPGE 0x403A8B020C49BA5E, R2
BT .LBB141_7
FMUL R1, R1, R3
FLDCH 0x3C00, R2
FDIV R2, R3, R3
MOV 0x3F90B4FB18B485C7, R4
MOV 0x3FD38A78B9F065F6, R16
FMAC R3, R16, R4, R4
FADD R3, 0x40048C54508800DB, R5
MOV 0x3FD70FE40E2425B8, R16
FMAC R3, R16, R4, R4
MOV 0x3FFDF79D6855F0AD, R16
FMAC R3, R16, R5, R5
MOV 0x3FC0199D980A842F, R16
FMAC R3, R16, R4, R4
MOV 0x3FE0E4993E122C39, R16
FMAC R3, R16, R5, R5
MOV 0x3F9078448CD6C5B5, R16
FMAC R3, R16, R4, R4
MOV 0x3FAEFC42917D7DE7, R16
FMAC R3, R16, R5, R5
MOV 0x3F4595FD0D71E33C, R16
FMAC R3, R16, R4, R4
FMUL R4,R3,R4
MOV 0x3F632147A014BAD1, R16
FMAC R5, R3, R16, R3
FDIV R4, R3, R3
FNEG R3, R3
FADD R3, 0x3FE20DD750429B6D, R3
FDIV R3, R2, R3
BRA .LBB141_4
.LBB141_5:
FMUL R1, R1, R3
MOV 0, R4
FCMPGT 0x3C9FFE5AB7E8AD5E, R2
CSELT R3, R4, R2
MOV 0x3FC7C7905A31C322, R3
MOV 0x400949FB3ED443E9, R16
fmac R2, R16, R3, R3
FADD R2,#0x403799EE342FB2DE, R4
MOV 0x405C774E4D365DA3, R16
FMAC R2, R16, R3, R3
MOV 0x406E80C9D57E55B8, R16
FMAC R2, R16, R4, R4
MOV 0x407797C38897528B, R16
FMAC R2, R16, R3, R3
MOV 0x40940A77529CADC8, R16
FMAC R2, R16, R4, R4
MOV 0x40A912C1535D121A, R16
FMAC R2, R16, R3, R3
FMUL R3, R1, R1
MOV 0x40A63879423B87AD, R16
FMAC R2, R16, R4, R2
FDIV R1, R2, R2
RTS
.LBB141_6:
MOV 0x3E571E703C5F5815, R3
fmac r3,r2,r3,#0x3FE20DD508EB103E
fadd r4,r2,#0x402F7D66F486DED5
fmac r3,r2,r3,#0x4021C42C35B8BC02
fmac r4,r2,r4,#0x405D6C69B0FFCDE7
fmac r3,r2,r3,#0x405087A0D1C420D0
fmac r4,r2,r4,#0x4080C972E588749E
fmac r3,r2,r3,#0x4072AA2986ABA462
fmac r4,r2,r4,#0x4099558EECA29D27
fmac r3,r2,r3,#0x408B8F9E262B9FA3
fmac r4,r2,r4,#0x40A9B599356D1202
fmac r3,r2,r3,#0x409AC030C15DC8D7
fmac r4,r2,r4,#0x40B10A9E7CB10E86
fmac r3,r2,r3,#0x40A0062821236F6B
fmac r4,r2,r4,#0x40AADEBC3FC90DBD
fmac r3,r2,r3,#0x4093395B7FD2FC8E
fmac r4,r2,r4,#0x4093395B7FD35F61
fdiv r3,r3,r4
.LBB141_4:
FMUL R2, 0x40300000, R4
FMUL R4, 0x3FB00000, R4
FSTCI R4, R4
FLDCI R4, R4
FNEG R4, R6
fadd R2, R6, R5
fadd R2, R4, R2
fmul R4, R6, R4
fexp r4,r4 //?
fmul R2,R7, R2
fexp r2,r2
fmul R4, R2, R2
FNEG R2, R2
fmac r2,r2,r3,#0x3F000000
fadd r2,r2,#0x3F000000
pdlt r1,T //?
fadd r2,#0,-r2
RTS
.LBB141_7:
FLDCH 0xBC00, R2
FLDCH 0x3C00, R3
FCMPGT 0, R1
CSELT R2,R3,R2
RTS
Well, that is a partial conversion at least, but I ran out of time.
It is not guaranteed to be correct; it only goes as far as I could figure out.
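For anyone trying to follow the LBB141_4 tail: as far as I can tell, it is the usual trick of splitting exp(-y*y) into exp(-xsq*xsq)*exp(-del), where xsq is y truncated to a multiple of 1/16 and del is the small remainder. A minimal C sketch of just the split follows. The helper name split_square is made up for illustration, and this is my reading of the listing, not something checked against the polpak source.

```c
#include <assert.h>

/* Hypothetical helper: split y*y into a coarse part and a small correction.
   xsq is y truncated to a multiple of 1/16; del is y*y - xsq*xsq,
   computed as (y - xsq)*(y + xsq) to avoid cancellation. */
void split_square(double y, double *xsq, double *del)
{
    assert(y >= 0.0 && y < 1.0e6);           /* keep (long)(y*16) in range */
    *xsq = (double)(long)(y * 16.0) / 16.0;  /* truncate toward zero */
    *del = (y - *xsq) * (y + *xsq);
}
```

The result would then be scaled by exp(-xsq*xsq) * exp(-del), which keeps the argument of the second exp small. Note that the truncation has to happen between the multiply by 16 and the divide by 16.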
FMAC does exist but is, granted, slower than FMUL+FADD, and doesn't currently have an immediate form.
So, say:
fmac r3,r2,r3,#0x3FE20DD508EB103E
Translated to:
FMUL R2, 0x3FE20DD508EB103E, R16
FADD R3, R16, R3
Is potentially faster than trying to use the FMAC op.
It might have been less if the code was like:
static const double somearr[8]={ ... };
>
But, this would still have used memory loads.
Getting the constants into expressions would likely require using
#define or similar...
>
This is admittedly more how I would have imagined performance-oriented
code to be written. Not so much with dynamically initialized arrays.
That particular piece of code was originally written in FORTRAN
probably late 1960s or early 1970s then ported to C a while back.
That doesn't change the fact that they are putting constant values into dynamically initialized arrays.
Like, "static const" exists for a reason...
Say, so that the compiler can know that the array's contents won't be modified, and is allowed to put the array in read-only memory or optimize it away.
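To make the "static const" point concrete, here is a minimal C sketch. The names demo_coef, poly_eval and demo_poly, and the coefficient values, are placeholders for illustration, not the actual r8_erf tables. Whether a given compiler then folds the table loads into immediates is up to that compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder coefficients, highest power first (NOT the r8_erf values).
   Being static const, the table may live in read-only memory, and the
   compiler may fold the values into the code if it chooses to. */
static const double demo_coef[3] = { 3.0, 2.0, 1.0 };

/* Horner evaluation: ((c[0]*x + c[1])*x + c[2]) ... */
static double poly_eval(const double *c, size_t n, double x)
{
    assert(n > 0);
    double acc = c[0];
    for (size_t i = 1; i < n; i++)
        acc = acc * x + c[i];   /* one multiply-add per coefficient */
    return acc;
}

/* Hypothetical wrapper: evaluates 3*x*x + 2*x + 1 from the const table. */
double demo_poly(double x)
{
    return poly_eval(demo_coef, sizeof demo_coef / sizeof demo_coef[0], x);
}
```

With the table declared this way there is no per-call initialization of a local array, which is the overhead being complained about above.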
<snip>
>
But, as I will note, even with this general level of lackluster code
generation, have still been managing to often beat RV64G performance...
Anybody claiming RISC-V has a good ISA should have their degree revoked.
Either way...
I have also noted that I can often get a decent performance improvement by rewriting code in ASM (though the gain is rather variable, and carefully written C can approach ASM performance).
Not so much with RISC-V, where rewriting stuff in ASM is typically ineffective (the ASM often can't do much better than what the C does already).