On Mon, 12 Aug 2024 19:27:22 +0000, BGB wrote:
FWIW:
On 8/12/2024 12:36 PM, MitchAlsup1 wrote:
See polpak:: r8_erf()
On Mon, 12 Aug 2024 6:29:36 +0000, Anton Ertl wrote:
Brett <ggtgp@yahoo.com> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market: that 4%
that could benefit have no options to pick from.
They had:

SPARC: Ok, only 32 GPRs available at a time, but more in hardware
through the Window mechanism.

AMD29K: IIRC a 128-register stack and 64 additional registers

IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
files to make good use of them.
All antiques no longer available.
SPARC is still available: <https://en.wikipedia.org/wiki/SPARC> says:

|Fujitsu will also discontinue their SPARC production [...] end-of-sale
|in 2029, of UNIX servers and a year later for their mainframe.

No word of when Oracle will discontinue (or has discontinued) sales,
but both companies introduced their last SPARC CPUs in 2017.

In any case, my point still stands: these architectures were
available, and the large number of registers failed to give them a
decisive advantage. Maybe it even gave them a decisive disadvantage:
AMD29K and IA-64 never had OoO implementations, and SPARC got them
only with the Fujitsu SPARC64 V in 2002 and the Oracle SPARC T4 in
2011, years after Intel, MIPS, and HP switched to OoO in 1995/1996 and
Power and Alpha switched in 1998 (POWER3, 21264).
Where is your 4% number coming from?
The 4% number is poor memory and a guess.
Here is an antique paper on the issue:

https://www.eecs.umich.edu/techreports/cse/00/CSE-TR-434-00.pdf
Interesting. I only skimmed the paper, but I read a lot about
inlining and interprocedural register allocation. SPARC's register
windows and AMD29K's and IA-64's register stacks were intended to be
useful for that, but somehow the other architectures did not suffer a
big-enough disadvantage to make them adopt one of these concepts, and
that's despite register windows/stacks working even for indirect calls
(e.g., method calls in the general case), where interprocedural
register allocation or inlining don't help.

It seems to me that with OoO the cycle cost of spilling and refilling
on call boundaries was lowered: the spills can be delayed until the
computation is complete, and the refills can start early because the
stack pointer tends to be available early.

And recent OoO CPUs even have zero-cycle store-to-load forwarding, so
even if the called function is short, the spilling and refilling
around it (if any) does not increase the latency of the value that's
spilled and refilled. But that consideration is only relevant for
Intel APX; ARM A64 and RISC-V went for 32 registers several years
before zero-cycle store-to-load forwarding was implemented.

One other optimization that they use the additional registers for is
"register promotion", i.e., putting values from memory into registers
for a while (if absence of aliasing can be proven). One interesting
aspect here is that register promotion with 64 or 256 registers (RP-64
and RP-256) is usually not much better (if better at all) than
register promotion with 32 registers (RP-32); see Figure 1. So
register promotion does not make a strong case for more registers,
either, at least in this paper.
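To make "register promotion" concrete, here is a minimal source-level sketch in C. It is not taken from the paper; the function names sum_in_memory and sum_promoted are made up for illustration, and the promotion is done by hand rather than by a compiler. The restrict qualifiers stand in for the "absence of aliasing can be proven" condition.

```c
#include <assert.h>
#include <stddef.h>

/* Unpromoted form: *sum may alias an element of a[], so a compiler
 * that cannot prove otherwise has to load and store *sum on every
 * iteration. */
void sum_in_memory(const int *a, size_t n, int *sum)
{
    assert(a != NULL && sum != NULL);
    for (size_t i = 0; i < n; i++)
        *sum += a[i];
}

/* Hand-promoted form: the running value lives in a local variable
 * (a register candidate) for the duration of the loop and is written
 * back once. This is only valid if sum does not alias a[], which the
 * restrict qualifiers assert on the caller's behalf. */
void sum_promoted(const int *restrict a, size_t n, int *restrict sum)
{
    assert(a != NULL && sum != NULL);
    int acc = *sum;            /* one load before the loop */
    for (size_t i = 0; i < n; i++)
        acc += a[i];
    *sum = acc;                /* one store after the loop */
}
```

Each promoted value occupies a register for the whole region, which is where the register-count question comes in.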
With full access to constants, there is even less need to promote
addresses or immediates into registers, as you can simply poof them
up any time you want one.

There are tradeoffs still, if constants need space to encode...

Inline is still better than a memory load, granted.

May make sense to consolidate multiple uses of a value into a register
rather than try encoding them as an immediate each time.
r8_erf: ; @r8_erf
; %bb.0:
fabs r2,r1
fcmp r3,r2,#0x3EF00000
bngt r3,.LBB141_5
; %bb.1:
fcmp r3,r2,#4
bngt r3,.LBB141_6
; %bb.2:
fcmp r3,r2,#0x403A8B020C49BA5E
bnlt r3,.LBB141_7
; %bb.3:
fmul r3,r1,r1
fdiv r3,#1,r3
mov r4,#0x3F90B4FB18B485C7
fmac r4,r3,r4,#0x3FD38A78B9F065F6
fadd r5,r3,#0x40048C54508800DB
fmac r4,r3,r4,#0x3FD70FE40E2425B8
fmac r5,r3,r5,#0x3FFDF79D6855F0AD
fmac r4,r3,r4,#0x3FC0199D980A842F
fmac r5,r3,r5,#0x3FE0E4993E122C39
fmac r4,r3,r4,#0x3F9078448CD6C5B5
fmac r5,r3,r5,#0x3FAEFC42917D7DE7
fmac r4,r3,r4,#0x3F4595FD0D71E33C
fmul r4,r3,r4
fmac r3,r3,r5,#0x3F632147A014BAD1
fdiv r3,r4,r3
fadd r3,#0x3FE20DD750429B6D,-r3
fdiv r3,r3,r2
br .LBB141_4
.LBB141_5:
fmul r3,r1,r1
fcmp r2,r2,#0x3C9FFE5AB7E8AD5E
sra r2,r2,#8,#1
cvtsd r4,#0
mux r2,r2,r3,r4
mov r3,#0x3FC7C7905A31C322
fmac r3,r2,r3,#0x400949FB3ED443E9
fadd r4,r2,#0x403799EE342FB2DE
fmac r3,r2,r3,#0x405C774E4D365DA3
fmac r4,r2,r4,#0x406E80C9D57E55B8
fmac r3,r2,r3,#0x407797C38897528B
fmac r4,r2,r4,#0x40940A77529CADC8
fmac r3,r2,r3,#0x40A912C1535D121A
fmul r1,r3,r1
fmac r2,r2,r4,#0x40A63879423B87AD
fdiv r2,r1,r2
mov r1,r2
ret
.LBB141_6:
mov r3,#0x3E571E703C5F5815
fmac r3,r2,r3,#0x3FE20DD508EB103E
fadd r4,r2,#0x402F7D66F486DED5
fmac r3,r2,r3,#0x4021C42C35B8BC02
fmac r4,r2,r4,#0x405D6C69B0FFCDE7
fmac r3,r2,r3,#0x405087A0D1C420D0
fmac r4,r2,r4,#0x4080C972E588749E
fmac r3,r2,r3,#0x4072AA2986ABA462
fmac r4,r2,r4,#0x4099558EECA29D27
fmac r3,r2,r3,#0x408B8F9E262B9FA3
fmac r4,r2,r4,#0x40A9B599356D1202
fmac r3,r2,r3,#0x409AC030C15DC8D7
fmac r4,r2,r4,#0x40B10A9E7CB10E86
fmac r3,r2,r3,#0x40A0062821236F6B
fmac r4,r2,r4,#0x40AADEBC3FC90DBD
fmac r3,r2,r3,#0x4093395B7FD2FC8E
fmac r4,r2,r4,#0x4093395B7FD35F61
fdiv r3,r3,r4
.LBB141_4:
fmul r4,r2,#16
fmul r4,r4,#0x3D800000
rnd r4,r4,#5
fadd r5,r2,-r4
fadd r2,r2,r4
fmul r4,r4,-r4
fexp r4,r4
fmul r2,r2,-r5
fexp r2,r2
fmul r2,r4,r2
fadd r2,#0,-r2
fmac r2,r2,r3,#0x3F000000
fadd r2,r2,#0x3F000000
pdlt r1,T
fadd r2,#0,-r2
mov r1,r2
ret
.LBB141_7:
fcmp r1,r1,#0
sra r1,r1,#8,#1
cvtsd r2,#-1
cvtsd r3,#1
mux r2,r1,r3,r2
mov r1,r2
ret
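The hexadecimal immediates in the listing read as IEEE-754 bit patterns. The following C sketch decodes a few of them; the helper names bits_to_double and bits_to_float are made up for illustration, and treating the 32-bit immediates as single-precision values is an assumption based on how the listing reads, not something stated in the post.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as an IEEE-754 double.
 * memcpy avoids the aliasing problems of a pointer cast. */
double bits_to_double(uint64_t bits)
{
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Reinterpret a 32-bit pattern as an IEEE-754 float. */
float bits_to_float(uint32_t bits)
{
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Under that assumption, bits_to_float(0x3EF00000) gives 0.46875 (the first threshold compared against), bits_to_float(0x3D800000) gives 0.0625, bits_to_float(0x3F000000) gives 0.5, and bits_to_double(0x3FE20DD750429B6D) gives about 0.5641895835, which is 1/sqrt(pi).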
All of the constants are used once!
RISC-V takes 240 instructions and uses 342 words of
memory {.text, .data, .rodata}
My 66000 takes 85 instructions and uses 169 words of
memory {.text, .data, .rodata}
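The fmac chains in the listing above look like Horner's rule, which would explain why every coefficient appears exactly once. Below is a minimal C sketch of that pattern. The function name horner is made up for illustration, and the reading of fmac rd,ra,rb,rc as ra*rb+rc is an assumption taken from how the listing is laid out.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate c[0]*x^(n-1) + c[1]*x^(n-2) + ... + c[n-1] by Horner's rule.
 * Each coefficient is consumed exactly once, in one multiply-add step.
 * On a target with a fused multiply-add, a compiler may turn each step
 * into a single instruction with the coefficient as an operand. */
double horner(const double *c, size_t n, double x)
{
    assert(c != NULL && n > 0);
    double acc = c[0];                 /* like the initial mov */
    for (size_t i = 1; i < n; i++)
        acc = acc * x + c[i];          /* like one fmac step */
    return acc;
}
```

For example, horner((double[]){2.0, 3.0, 4.0}, 3, 2.0) evaluates 2*x*x + 3*x + 4 at x = 2. In the listing, the denominator chains start with an fadd rather than a mov, which is consistent with a leading coefficient of 1.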