On 8/12/2024 12:36 PM, MitchAlsup1 wrote:
On Mon, 12 Aug 2024 6:29:36 +0000, Anton Ertl wrote:
Brett <ggtgp@yahoo.com> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market; that 4%
that could benefit have no options to pick from.
>
They had:
>
SPARC: Ok, only 32 GPRs available at a time, but more in hardware
through the Window mechanism.
>
AMD29K: IIRC a 128-register stack and 64 additional registers
>
IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
files to make good use of them.
>
All antiques no longer available.
>
SPARC is still available: <https://en.wikipedia.org/wiki/SPARC> says:
>
|Fujitsu will also discontinue their SPARC production [...] end-of-sale
|in 2029, of UNIX servers and a year later for their mainframe.
>
No word of when Oracle will discontinue (or has discontinued) sales,
but both companies introduced their last SPARC CPUs in 2017.
>
In any case, my point still stands: these architectures were
available, and the large number of registers failed to give them a
decisive advantage. Maybe it even gave them a decisive disadvantage:
AMD29K and IA-64 never had OoO implementations, and SPARC got them
only with the Fujitsu SPARC64 V in 2002 and the Oracle SPARC T4 in
2011, years after Intel, MIPS, and HP switched to OoO in 1995/1996 and
Power and Alpha switched in 1998 (POWER3, 21264).
>
Where is your 4% number coming from?
>
The 4% number is poor memory and a guess.
Here is an antique paper on the issue:
>
https://www.eecs.umich.edu/techreports/cse/00/CSE-TR-434-00.pdf
>
Interesting. I only skimmed the paper, but I read a lot about
inlining and interprocedural register allocation. SPARCs register
windows and AMD29K's and IA-64's register stacks were intended to be
useful for that, but somehow the other architectures did not suffer a
big-enough disadvantage to make them adopt one of these concepts, and
that's despite register windows/stacks working even for indirect calls
(e.g., method calls in the general case), where interprocedural
register allocation or inlining don't help.
>
It seems to me that with OoO the cycle cost of spilling and refilling
on call boundaries was lowered: the spills can be delayed until the
computation is complete, and the refills can start early because the
stack pointer tends to be available early.
>
And recent OoO CPUs even have zero-cycle store-to-load forwarding, so
even if the called function is short, the spilling and refilling
around it (if any) does not increase the latency of the value that's
spilled and refilled. But that consideration is only relevant for
Intel APX; ARM A64 and RISC-V went for 32 registers several years
before zero-cycle store-to-load forwarding was implemented.
>
One other optimization that they use the additional registers for is
"register promotion", i.e., putting values from memory into registers
for a while (if absence of aliasing can be proven). One interesting
aspect here is that register promotion with 64 or 256 registers (RP-64
and RP-256) is usually not much better (if better at all) than
register promotion with 32 registers (RP-32); see Figure 1. So
register promotion does not make a strong case for more registers,
either, at least in this paper.
With full access to constants, there is even less need to promote
addresses or immediates into registers, as you can simply poof up
anything you want.
There are tradeoffs still, if constants need space to encode...
Inline is still better than a memory load, granted.
It may make sense to consolidate multiple uses of a value into a register rather than trying to encode it as an immediate each time.
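In C terms, the no-aliasing case of register promotion looks like the following sketch (a hypothetical example, not taken from the paper): with `restrict` proving the absence of aliasing, the accumulator can live in a register for the whole loop.

```c
#include <stddef.h>

/* Without promotion, *sum would be reloaded and stored on every
 * iteration, because a[i] might alias it.  Here 'restrict' proves
 * the absence of aliasing, so the compiler may promote *sum into a
 * register for the duration of the loop: */
void accumulate(const int *restrict a, size_t n, int *restrict sum)
{
    int s = *sum;            /* one load ...                       */
    for (size_t i = 0; i < n; i++)
        s += a[i];           /* ... register-only accumulation ... */
    *sum = s;                /* ... and one store at the end.      */
}
```

Without the `restrict` qualifiers, the compiler has to conservatively keep `*sum` in memory across the loop, which is exactly the traffic promotion removes.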
...
For example, this came up when I was working on adding the code to display HDR pixels to the screen (which needs conversion to RGB555).
First attempt:
TKGDI_CopyPixelSpan_GetRGB24H:
MOVU.L (R4), R6
PSHUF.W R5, 0x00, R20 //word shuffle
PSHUF.W R5, 0x55, R21
PLDCM8UH R6, R16 //FP8U to Binary16
PMUL.H R16, R20, R16 //Scale
PADD.H R16, R21, R16 //Bias
MOV 0x3C003C003C003C00, R17 // 4x 1.0
PADD.H R16, R17, R18 // Map to 1.0 .. 1.999
TSTQ 0x0000C000C000C000, R18
BF .L1
.L0:
MOV 0xFFFF000000000000, R7 //alpha ones
PCVTH2UW R18, R19 //Convert to packed word
OR R19, R7, R5 //Set alpha all ones
RGB5PCK64 R5, R2 //convert to RGB555
RTS
.L1:
TSTQ 0x00000000C000, R18
AND?F 0xFFFFFFFF0000, R18
OR?F 0x000000003BFF, R18
TSTQ 0x0000C0000000, R18
AND?F 0xFFFF0000FFFF, R18
OR?F 0x00003BFF0000, R18
TSTQ 0xC00000000000, R18
AND?F 0x0000FFFFFFFF, R18
OR?F 0x3BFF00000000, R18
BRA .L0
This was valid ASM in my case, but the constants are still bulky.
Unsigned SIMD convert works over the 1.0 to 1.999 range or so. The RGB555 converter needs alpha set so that it knows the pixel is opaque (otherwise, it may try to use the alpha encoding and reduce color fidelity).
For now, it assumes opaque images for screen and window framebuffers.
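The biasing into the 1.0 to 1.999 range works because, over [1.0, 2.0), the exponent field is constant and the mantissa bits are exactly the fraction. The same trick in single precision, as a C sketch (`frac_bits` is a hypothetical helper, not part of the actual code):

```c
#include <stdint.h>
#include <string.h>

/* Map x in [0.0, 1.0) to a 23-bit fixed-point fraction without an
 * explicit float->int conversion: adding 1.0 places x in [1.0, 2.0),
 * where the exponent is constant and the mantissa bits ARE the
 * fraction. */
static uint32_t frac_bits(float x)
{
    float biased = x + 1.0f;       /* now in [1.0, 2.0)            */
    uint32_t u;
    memcpy(&u, &biased, sizeof u); /* reinterpret the float's bits */
    return u & 0x7FFFFFu;          /* 23 mantissa bits = fraction  */
}
```

In the Binary16 case the mantissa is 10 bits, which is why the code can clamp against 0x3C00 (1.0) and 0x3FFF (just under 2.0) and treat the low bits as the pixel value.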
Then noted that I already had a few instructions for the purpose of range clamping, so it became:
TKGDI_CopyPixelSpan_GetRGB24H:
MOVU.L (R4), R6
PSHUF.W R5, 0x00, R20
PSHUF.W R5, 0x55, R21
PLDCM8UH R6, R16
MOV 0x3C003C003C003C00, R17
PMUL.H R16, R20, R16
MOV 0x3FFF3FFF3FFF3FFF, R22
PADD.H R16, R21, R16
PADD.H R16, R17, R18
PCMPGT.H R22, R18
PCSELT.W R22, R18, R18
PCMPGT.H R18, R17
PCSELT.W R17, R18, R18
MOV 0xFFFF000000000000, R7
PCVTH2UW R18, R19
OR R19, R7, R5
RGB5PCK64 R5, R2
RTS
Worked a little better...
But, still, not very fast.
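The PCMPGT.H/PCSELT.W pairs above implement a branchless per-lane clamp of the biased values into [0x3C00, 0x3FFF], i.e. [1.0, ~1.999]. Per 16-bit lane, the assumed behavior is equivalent to this C sketch (`clamp_lane` is a hypothetical scalar helper):

```c
#include <stdint.h>

/* Branchless clamp of one 16-bit lane to [lo, hi]: the scalar
 * equivalent of a packed compare followed by a conditional select,
 * with no actual branches in the SIMD form. */
static uint16_t clamp_lane(uint16_t x, uint16_t lo, uint16_t hi)
{
    uint16_t t = (x > hi) ? hi : x; /* select hi where x exceeds it  */
    return (t < lo) ? lo : t;       /* select lo where x falls below */
}
```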
Then ended up doing a version that converted 4 pixels in parallel, had a special case for no scaling or biasing of the HDR values, and added alternate entry points for 32-bit and 24-bit pixels.
No longer as concise, but around 3x faster.
Not much bundling, as a lot of these are "Lane 1 only" ops, or cases where bundling would not have any benefit.
<===
TKGDI_CopyPixelSpan_GetRGB32x4H:
TST R5, R5
BT TKGDI_CopyPixelSpan_GetRGB32x4H_NoScale
MOVU.L (R4, 0), R16
MOVU.L (R4, 4), R17
MOVU.L (R4, 8), R18
MOVU.L (R4, 12), R19
TKGDI_CopyPixelSpan_GetRGB32x4H_P1:
PLDCM8UH R16, R16
PLDCM8UH R17, R17
PLDCM8UH R18, R18
PLDCM8UH R19, R19
PSHUF.W R5, 0x00, R21
PSHUF.W R5, 0x55, R23
MOV 0x3C003C003C003C00, R20
MOV 0x3FFF3FFF3FFF3FFF, R22
PMUL.H R16, R21, R16
PMUL.H R17, R21, R17
PMUL.H R18, R21, R18
PMUL.H R19, R21, R19
PADD.H R16, R23, R16
PADD.H R17, R23, R17
PADD.H R18, R23, R18
PADD.H R19, R23, R19
PADD.H R16, R20, R16
PADD.H R17, R20, R17
PADD.H R18, R20, R18
PADD.H R19, R20, R19
PCMPGT.H R22, R16
PCSELT.W R22, R16, R16
PCMPGT.H R22, R17
PCSELT.W R22, R17, R17
PCMPGT.H R22, R18
PCSELT.W R22, R18, R18
PCMPGT.H R22, R19
PCSELT.W R22, R19, R19
PCMPGT.H R16, R20
PCSELT.W R20, R16, R16
PCMPGT.H R17, R20
PCSELT.W R20, R17, R17
PCMPGT.H R18, R20
PCSELT.W R20, R18, R18
PCMPGT.H R19, R20
PCSELT.W R20, R19, R19
MOV 0xFFFF000000000000, R3
PCVTH2UW R16, R4
PCVTH2UW R17, R5
OR R4, R3, R4 | PCVTH2UW R18, R6
OR R5, R3, R5 | PCVTH2UW R19, R7
OR R6, R3, R6 | RGB5PCK64 R4, R4
OR R7, R3, R7 | RGB5PCK64 R5, R5
RGB5PCK64 R6, R6
RGB5PCK64 R7, R7
MOVLLW R5, R4, R4
MOVLLW R7, R6, R6
MOVLLD R6, R4, R2
RTS
.balign 4
TKGDI_CopyPixelSpan_GetRGB32x4H_NoScale:
MOVU.L (R4, 0), R16
MOVU.L (R4, 4), R17
MOVU.L (R4, 8), R18
MOVU.L (R4, 12), R19
TKGDI_CopyPixelSpan_GetRGB32x4H_P1NS:
MOV 0x3C003C003C003C00, R20
MOV 0x3FFF3FFF3FFF3FFF, R22
MOV 0xFFFF000000000000, R3
PLDCM8UH R16, R16
PLDCM8UH R17, R17
PLDCM8UH R18, R18
PLDCM8UH R19, R19
PADD.H R16, R20, R16
PADD.H R17, R20, R17
PADD.H R18, R20, R18
PADD.H R19, R20, R19
PCMPGT.H R22, R16
PCSELT.W R22, R16, R16
PCMPGT.H R22, R17
PCSELT.W R22, R17, R17
PCMPGT.H R22, R18
PCSELT.W R22, R18, R18
PCMPGT.H R22, R19
PCSELT.W R22, R19, R19
PCVTH2UW R16, R4
PCVTH2UW R17, R5
OR R4, R3, R4 | PCVTH2UW R18, R6
OR R5, R3, R5 | PCVTH2UW R19, R7
OR R6, R3, R6 | RGB5PCK64 R4, R4
OR R7, R3, R7 | RGB5PCK64 R5, R5
RGB5PCK64 R6, R6
RGB5PCK64 R7, R7
MOVLLW R5, R4, R4
MOVLLW R7, R6, R6
MOVLLD R6, R4, R2
RTS
TKGDI_CopyPixelSpan_GetRGB24x4H:
ADD R4, 3, R21 //(*1)
ADD R4, 6, R22
ADD R4, 9, R23
MOVU.L (R4 ), R16
MOVU.L (R21), R17
MOVU.L (R22), R18
MOVU.L (R23), R19
TST R5, R5
BT TKGDI_CopyPixelSpan_GetRGB32x4H_P1NS
BRA TKGDI_CopyPixelSpan_GetRGB32x4H_P1
// *1: Needed because BJX2 lacks misaligned Load/Store displacements.
===>
This case makes it tempting to redefine "PCVTH2UW" to perform its own range clamping (at present, out-of-range values will wrap).
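A clamping variant could saturate instead of wrapping. Assuming PCVTH2UW maps the biased [1.0, 2.0) Binary16 range to a 16-bit fraction (mantissa shifted up), a saturating per-lane version might look like this C sketch (the instruction semantics here are my guess, not the actual definition):

```c
#include <stdint.h>

/* Hypothetical clamping convert of one Binary16 lane, biased into
 * [1.0, 2.0), to a 16-bit unsigned fraction.  Out-of-range inputs
 * saturate instead of wrapping: */
static uint16_t cvt_h2uw_sat(uint16_t h)
{
    if (h & 0x8000) return 0x0000;        /* negative: clamp to 0       */
    if (h < 0x3C00) return 0x0000;        /* below 1.0: clamp to 0      */
    if (h > 0x3FFF) return 0xFFFF;        /* 2.0 and above: saturate    */
    return (uint16_t)((h & 0x03FF) << 6); /* 10 mantissa bits -> high   */
}
```

With saturation built into the convert, the explicit PCMPGT/PCSELT clamp sequences above would no longer be needed.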
>
- anton