Subject: Re: Microarch Club
From: mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Newsgroups: comp.arch
Date: 26. Mar 2024, 20:16:07
Organization: Rocksolid Light
Message-ID: <c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
References: 1 2 3 4
User-Agent: Rocksolid Light
BGB wrote:
On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
BGB-Alt wrote:
Say, "we have an instruction, but it is a boat anchor" isn't an ideal situation (unless to be a placeholder for if/when it is not a boat anchor).
If the boat anchor is a required unit of functionality, and I believe
IDIV and FPDIV are, it should be defined in the ISA, and if you can't afford
it, find some way to trap rapidly so you can fix it up without excessive
overhead, like a MIPS TLB reload. If you can't get trap-and-emulate to
sufficient performance, then add the HW to perform the instruction.
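A minimal sketch of that fixup path in C (hypothetical trap-frame layout, decode of the faulting instruction assumed already done, no particular ISA):

#include <stdint.h>

struct trap_frame { uint64_t regs[32]; uint64_t pc; };

/* emulate an unimplemented IDIV: read the source registers out of the
   saved frame, do the divide in software, write the result back, and
   step the PC past the trapped instruction */
static void emulate_idiv(struct trap_frame *tf,
                         unsigned rd, unsigned rs1, unsigned rs2)
{
    int64_t a = (int64_t)tf->regs[rs1];
    int64_t b = (int64_t)tf->regs[rs2];
    tf->regs[rd] = (b != 0) ? (uint64_t)(a / b)
                            : ~(uint64_t)0;  /* divide-by-zero policy is ISA-specific */
    tf->pc += 4;                             /* assumes a 4-byte instruction */
}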
Might also make sense to add an architectural zero register, and eliminate some number of encodings which exist merely because of the lack of a zero register (though, encodings are comparably cheap, as the
I got an effective zero register without having to waste a register name to "get it". My 66000 gives you 32 registers of 64-bits each and you can put any bit pattern in any register and treat it as you like.
Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
available.
I guess offloading this to the compiler can also make sense.
Least common denominator would be, say, not providing things like NEG instructions and similar (pretending as if one had a zero register), and if a program needs to do a NEG or similar, it can load 0 into a register itself.
In the extreme case (say, one also lacks a designated "load immediate" instruction or similar), there is still the "XOR Rn, Rn, Rn" strategy to zero a register...
MOV Rd,#imm16
Costs 1 instruction of 32-bits in size and can be performed in 0 cycles
Say:
XOR R14, R14, R14 //Designate R14 as pseudo-zero...
...
ADD R14, 0x123, R8 //Load 0x123 into R8
Though, likely still makes sense in this case to provide some "convenience" instructions.
internal uArch has a zero register, and effectively treats immediate values as a special register as well, ...). Some of the debate is more related to the logic cost of dealing with some things in the decoder.
The problem is universal constants. RISCs are notably poor in their
support--however, this is still better than addressing modes which require
µCode.
Yeah.
I ended up with jumbo-prefixes. Still not perfect, and not perfectly orthogonal, but mostly works.
Allows, say:
ADD R4, 0x12345678, R6
To be performed in potentially 1 clock-cycle and with a 64-bit encoding, which is better than, say:
LUI X8, 0x12345
ADDI X8, X8, 0x678
ADD X12, X10, X8
This strategy completely fails when the constant contains more than 32 bits:
FDIV R9,#3.141592653589793,R17
When you have universal constants (including 5-bit immediates), you rarely
need a register containing 0.
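To make the contrast concrete, a small C-level illustration (a sketch, not actual compiler output from either side):

double half_angle(double x)
{
    /* on a typical RISC this pi constant lands in .rodata and costs a
       PC-relative address computation plus a load before the divide;
       with universal constants it rides inside the FDIV encoding itself */
    return x / 3.141592653589793;
}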
Though, for jumbo-prefixes, I did end up adding a special case in the compiler where it will try to figure out if a constant will be used multiple times in a basic-block and, if so, will load it into a register rather than use a jumbo-prefix form.
This is a delicate balance:: while each use of the constant takes a
unit or 2 of the instruction stream, each use costs 0 extra instructions.
The breakeven point in My 66000 is about 4 uses in a small area (a loop);
past that, it should be hoisted into a register.
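A rough C-level picture of that tradeoff (a sketch with a made-up constant, not actual output from either compiler):

#include <stddef.h>

void mix(unsigned *dst, const unsigned *src, size_t n,
         unsigned x, unsigned *y)
{
    /* used once: cheaper to carry the constant inside the instruction */
    *y = x + 0x12345678u;

    /* used on every iteration: past the roughly-4-use breakeven it pays
       to hoist the constant into a register ahead of the loop */
    for (size_t i = 0; i < n; i++)
        dst[i] = (src[i] ^ 0x12345678u) + 0x12345678u;
}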
It could maybe make sense to have function-scope statically-assigned constants, but I have not done so yet.
Though, it appears as if one of the "top contenders" here would be 0, mostly because things like:
foo->x=0;
And:
bar[i]=0;
I see no need for a zero register:: the following are 1 instruction each!
ST #0,[Rfoo,offset(x)]
ST #0,[Rbar,Ri]
Are semi-common, and as-is end up needing to load 0 into a register each time they appear.
Had already ended up with a similar sort of special case to optimize "return 0;" and similar, mostly because this was common enough to be worth handling directly:
BRA .lbl_ret //if function does not end with "return 0;"
.lbl_ret_zero:
MOV 0, R2
.lbl_ret:
... epilog ...
For many functions, this allowed "return 0;" to be emitted as:
BRA .lbl_ret_zero
Rather than:
MOV 0, R2
BRA .lbl_ret
Which on average ended up as a net win when there are more than around 3 of them per function.
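A C-level illustration of where the shared tail pays off (a sketch, not actual BGBCC output):

int contains(const int *arr, int n, int key)
{
    if (arr == 0) return 0;        /* each "return 0;" site becomes a    */
    if (n <= 0)   return 0;        /* single BRA .lbl_ret_zero instead   */
    for (int i = 0; i < n; i++)    /* of MOV 0, R2 plus BRA .lbl_ret     */
        if (arr[i] == key) return 1;
    return 0;
}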
Special defined tails......
Though, another possibility could be to allow constants to be included in the "statically assign variables to registers" logic (as-is, they are excluded except in "tiny leaf" functions).
Though, would likely still make a few decisions differently from those in RISC-V. Things like indexed load/store,
Absolutely
predicated ops (with a designated flag bit),
Predicated then and else clauses which are branch free.
{{Also good for constant time crypto in need of flow control...}}
I have per-instruction predication:
CMPxx ...
OP?T //if-true
OP?F //if-false
Or:
OP?T | OP?F //both in parallel, subject to encoding and ISA rules
CMP Rt,Ra,#whatever
PLE Rt,TTTTTEEE
// This begins the then-clause 5Ts -> 5 instructions
OP1
OP2
OP3
OP4
OP5
// this begins the else-clause 3Es -> 3 instructions
OP6
OP7
OP8
// we are now back at the join point.
Notice no internal flow control instructions.
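For a concrete picture of the kind of source that maps onto such a predicate shadow, a hedged C sketch (arm lengths chosen to echo the 5-T/3-E mask above, otherwise arbitrary):

int select_path(int a, int b, int c, int d, int m)
{
    int t, u, v, w, s;
    /* short then/else arms with a nearby join point: a candidate for a
       CMP + PLE predicate shadow rather than a conditional branch */
    if (a <= b) { t = a + c; u = t << 2; v = u - 1; w = v & m; s = w + d; }
    else        { t = b + c; u = t << 1; s = u + d; }
    return s;
}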
Performance gains are modest, but still noticeable (part of why predication ended up as a core ISA feature). Effect on pipeline seems to be small in its current form (it is handled along with register fetch, mostly turning non-executed instructions into NOPs during the EX stages).
The effect is that one uses predication whenever you will already have
fetched the instructions at the join point by the time you have determined
the predicate value of the {then, else} clauses. The PARSE and DECODE do the
flow control without bothering FETCH.
For the most part, 1 bit seems sufficient.
How do you do && and || predication with 1 bit ??
More complex schemes generally ran into issues (had experimented with allowing a second predicate bit, or handling predicates as a stack-machine, but these ideas were mostly dead on arrival).
Also note: the instructions in the then and else clauses know NOTHING
about being under a predicate mask (or not). Thus, they waste no bits
while retaining the ability to run under predication.
and large-immediate encodings,
Nothing else is so poorly served in typical ISAs.
Probably true.
help enough with performance (relative to cost)
+40%
I am mostly seeing around 30% or so, for Doom and similar.
A few other programs are still closer to break-even at present.
Things are a bit more contentious in terms of code density:
With size-minimizing options to GCC:
".text" is slightly larger with BGBCC vs GCC (around 11%);
However, the GCC output has significantly more ".rodata".
A lot of this .rodata becomes constants in .text with universal constants.
A reasonable chunk of the code-size difference could be attributed to jumbo prefixes making the average instruction size slightly bigger.
Size is one thing and it primarily diddles in cache footprint statistics.
Instruction count is another and primarily diddles in pipeline cycles
to execute statistics. Fewer instructions wins almost all the time.
More could be possible with more compiler optimization effort. Currently, a few recent optimization cases are disabled as they seem to be causing bugs that I haven't figured out yet.
to be worth keeping (though, mostly because the alternatives are not so good in terms of performance).
Damage to pipeline ability: less than 5%.
Yeah.