Re: Microarch Club

Subject : Re: Microarch Club
From : bohannonindustriesllc (at) *nospam* gmail.com (BGB-Alt)
Newsgroups : comp.arch
Date : 26. Mar 2024, 22:59:57
Organization : A noiseless patient Spider
Message-ID : <utvggu$2cgkl$1@dont-email.me>
References : 1 2 3 4 5
User-Agent : Mozilla Thunderbird
On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
BGB wrote:
 
On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
BGB-Alt wrote:
>
 
Say, "we have an instruction, but it is a boat anchor" isn't an ideal situation (unless to be a placeholder for if/when it is not a boat anchor).
 If the boat anchor is a required unit of functionality, and I believe
IDIV and FPDIV are, it should be defined in ISA and if you can't afford
it find some way to trap rapidly so you can fix it up without excessive
overhead. Like a MIPS TLB reload. If you can't get trap and emulate at
sufficient performance, then add the HW to perform the instruction.
 
Though, 32-bit ARM managed OK without integer divide.
In my case, I ended up supporting it mostly for sake of the RV64 'M' extension, but it is in this case a little faster than a pure software solution (unlike on the K10 and Piledriver).
Still costs around 1.5 kLUTs though for 64-bit MUL/DIV support, and a little more to route FDIV through it.
Cheapest FPU approach is still the "ADD/SUB/MUL only" route.

again. Might also make sense to add an architectural zero register, and eliminate some number of encodings which exist merely because of the lack of a zero register (though, encodings are comparably cheap, as the
>
I got an effective zero register without having to waste a register name to "get it". My 66000 gives you 32 registers of 64-bits each and you can put any bit pattern in any register and treat it as you like.
Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
available.
>
 
I guess offloading this to the compiler can also make sense.
 
Least common denominator would be, say, not providing things like NEG instructions and similar (pretending as-if one had a zero register), and if a program needs to do a NEG or similar, it can load 0 into a register itself.
 
In the extreme case (say, one also lacks a designated "load immediate" instruction or similar), there is still the "XOR Rn, Rn, Rn" strategy to zero a register...
      MOV   Rd,#imm16
 Costs 1 instruction of 32 bits in size and can be performed in 0 cycles
 
Though, RV had skipped this:
   ADD Xd, Zero, Imm12s
Or:
   LUI Xd, ImmHi20
   ADD Xd, Xd, ImmLo12s
One can argue for this on the basis of not needing an immediate-load instruction (nor a MOV instruction, nor NEG, nor ...).
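
The LUI + ADD pair above only works if the upper part is adjusted for the sign-extension of the 12-bit low part. A minimal sketch of that split, assuming a 32-bit (RV32-style) wraparound model; the function names `split_lui_addi` and `rebuild` are made up for illustration:

```python
def split_lui_addi(value):
    """Split a signed 32-bit constant into (hi20, lo12) for a LUI + ADD pair."""
    if not -(1 << 31) <= value < (1 << 31):
        raise ValueError("only signed 32-bit constants are handled in this sketch")
    # The 12-bit immediate is sign-extended, so read the low 12 bits as signed.
    lo12 = value & 0xFFF
    if lo12 >= 0x800:
        lo12 -= 0x1000
    # LUI supplies whatever is left, wrapped to 20 bits.
    hi20 = ((value - lo12) >> 12) & 0xFFFFF
    return hi20, lo12


def rebuild(hi20, lo12):
    """Model LUI followed by ADD with 32-bit wraparound (assumption: RV32)."""
    result = ((hi20 << 12) + lo12) & 0xFFFFFFFF
    return result - (1 << 32) if result >= (1 << 31) else result


if __name__ == "__main__":
    for constant in (0x12345678, 0x12345FFF, -1):
        hi, lo = split_lui_addi(constant)
        print(hex(constant & 0xFFFFFFFF), "->", hex(hi), lo)
```

When bit 11 of the constant is set, the low part becomes negative and the high part is bumped by one to compensate.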
Though, yeah, in my case I ended up with more variety here:
   LDIZ   Imm10u, Rn  //10-bit, zero-extend, Imm12u (XG2)
   LDIN   Imm10n, Rn  //10-bit, one-extend, Imm12n (XG2)
   LDIMIZ Imm10u, Rn  //Rn=Imm10u<<16 (newish)
   LDIMIN Imm10n, Rn  //Rn=Imm10n<<16 (newish)
   LDIHI  Imm10u, Rn  //Rn=Imm10u<<22
   LDIQHI Imm10u, Rn  //Rn=Imm10u<<54
   LDIZ   Imm16u, Rn  //16-bit, zero-extend
   LDIN   Imm16n, Rn  //16-bit, one-extend
Then 64-bit jumbo forms:
   LDI    Imm33s, Rn  //33-bit, sign-extend
   LDIHI  Imm33s, Rn  //Rn=Imm33s<<16
   LDIQHI Imm33s, Rn  //Rn=Imm33s<<32
Then, 96 bit:
   LDI    Imm64, Rn  //64-bit, sign-extend
And, some special cases:
   FLDCH  Imm16u, Rn //Binary16->Binary64
One could argue though that this is wild extravagance...
The recent addition of LDIMIx was mostly because otherwise one needed a 64-bit encoding to load constants like 262144 or similar (and a lot of bit-masks).
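
As a rough illustration of why the extra forms help, the sketch below checks which of the non-jumbo forms listed above could hold a given constant. It assumes "one-extend" means the upper bits are filled with ones (so the value is negative), and that the shifted forms simply shift the extended immediate left; the function name `pick_load_form` and the order in which forms are tried are invented for this example, not taken from the compiler.

```python
def pick_load_form(value):
    """Return the first listed non-jumbo form that can hold value, else None."""
    def fits_unsigned(v, bits):
        return 0 <= v < (1 << bits)

    def fits_ones(v, bits):
        # Assumption: one-extended immediates cover -(2**bits) .. -1.
        return -(1 << bits) <= v < 0

    if fits_unsigned(value, 10):
        return "LDIZ Imm10u"
    if fits_ones(value, 10):
        return "LDIN Imm10n"
    # Shifted forms: the low bits must be zero and the rest must fit in 10 bits.
    for name, shift in (("LDIMIZ", 16), ("LDIHI", 22), ("LDIQHI", 54)):
        if value % (1 << shift) == 0 and fits_unsigned(value >> shift, 10):
            return name
    if value % (1 << 16) == 0 and fits_ones(value >> 16, 10):
        return "LDIMIN"
    if fits_unsigned(value, 16):
        return "LDIZ Imm16u"
    if fits_ones(value, 16):
        return "LDIN Imm16n"
    return None  # would need a jumbo (64-bit or 96-bit) encoding


if __name__ == "__main__":
    for constant in (0x123, 262144, -5, 0xFFFF, 0x12345678):
        print(constant, pick_load_form(constant))
```

With these assumptions, 262144 (4<<16) lands in the LDIMIZ form rather than falling through to a jumbo encoding.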
At one point I did evaluate a more ARM32-like approach (effectively using a small value and a rotate). But, this cost more than the other options (would have required the great evil of effectively being able to feed two immediate values into the integer-shift unit, whereas many of the others could be routed through logic I already have for other ops).
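
For comparison, the classic ARM32 data-processing immediate is an 8-bit value rotated right by an even amount (a 4-bit rotate field, doubled). A small sketch of searching for such an encoding; the function name is made up, and this models only the ARM32 scheme, not anything that was implemented here:

```python
def arm32_rotated_imm(value):
    """Return (imm8, rot4) with value == ROR(imm8, 2*rot4), or None if impossible."""
    value &= 0xFFFFFFFF
    for rot in range(16):
        shift = 2 * rot
        # Rotating left by 'shift' undoes a rotate-right by the same amount.
        imm8 = ((value << shift) | (value >> (32 - shift))) & 0xFFFFFFFF
        if imm8 < 256:
            return imm8, rot
    return None


if __name__ == "__main__":
    for constant in (0xFF000000, 262144, 0x12345678):
        print(hex(constant), arm32_rotated_imm(constant))
```

Both the small value and the rotate amount are immediates, which is the "two immediate values into the shift unit" cost mentioned above.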
Though, one can argue that the drawback is that one does end up with more instructions in the ISA listing.

Say:
   XOR R14, R14, R14  //Designate R14 as pseudo-zero...
   ...
   ADD R14, 0x123, R8  //Load 0x123 into R8
 
Though, likely still makes sense in this case to provide some "convenience" instructions.
 
internal uArch has a zero register, and effectively treats immediate values as a special register as well, ...). Some of the debate is more related to the logic cost of dealing with some things in the decoder.
>
The problem is universal constants. RISCs being notably poor in their
support--however this is better than addressing modes which require
µCode.
>
 
Yeah.
 
I ended up with jumbo-prefixes. Still not perfect, and not perfectly orthogonal, but mostly works.
 
Allows, say:
   ADD R4, 0x12345678, R6
 
To be performed in potentially 1 clock-cycle and with a 64-bit encoding, which is better than, say:
   LUI X8, 0x12345
   ADD X8, X8, 0x678
   ADD X12, X10, X8
 This strategy completely fails when the constant contains more than 32-bits
      FDIV   R9,#3.141592653589247,R17
 When you have universal constants (including 5-bit immediates), you rarely
need a register containing 0.
 
The jumbo prefixes at least allow for a 64-bit constant load, but as-is not for 64-bit immediate values to 3RI ops. The latter could be done, but would require 128-bit fetch and decode, which doesn't seem worth it.
There is the limbo feature of allowing for 57-bit immediate values, but this is optional.
OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.
Typical GCC response on RV64 seems to be to turn nearly all of the big-constant cases into memory loads, which kinda sucks.
Even something like a "LI Xd, Imm17s" instruction, would notably reduce the number of constants loaded from memory (as GCC seemingly prefers to use a LHU or LW or similar rather than encode it using LUI+ADD).
I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping them enabled in the CPU core (they involved the non-zero cost of repacking them into Binary16 in ID1 and then throwing a Binary16->Binary64 converter into the ID2 stage).
Generally, the "FLDCH Imm16, Rn" instruction works well enough here (and can leverage a more generic Binary16->Binary64 converter path).
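
To get a feel for which constants a 16-bit Binary16 immediate can carry, here is a small sketch using Python's `struct` half-precision format (`"e"`). It only models the Binary16 to Binary64 widening; the helper names are hypothetical and nothing here reflects the actual converter logic.

```python
import struct


def half_bits_to_double(bits):
    """Widen a 16-bit Binary16 pattern to a Python float (Binary64)."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def fits_in_half(x):
    """True if x survives a round trip through Binary16 unchanged."""
    try:
        packed = struct.pack("<e", x)
    except OverflowError:
        return False  # magnitude too large for Binary16
    # Note: NaN compares unequal to itself, so this reports False for NaN.
    return struct.unpack("<e", packed)[0] == x


if __name__ == "__main__":
    print(half_bits_to_double(0x3C00))  # 1.0
    for x in (0.5, 2.0, 0.1, 1e10):
        print(x, fits_in_half(x))
```

Constants such as 0.5, 1.0, or 2.0 fit exactly, while something like 0.1 does not and would need a wider encoding.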
For FPU compare with zero, can almost leverage the integer compare ops, apart from the annoying edge cases of -0.0 and NaN leading to "not strictly equivalent" behavior (though, an ASM programmer could more easily get away with this). But, not common enough to justify adding FPU specific ops for this.
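
The -0.0 edge case can be seen directly by looking at the bit patterns. A minimal sketch, with hypothetical helper names, comparing an integer-style zero test against the floating-point one:

```python
import struct


def double_bits(x):
    """Reinterpret a Binary64 value as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<d", x))[0]


def int_is_zero(x):
    """What an integer compare-with-zero would report for the raw bits."""
    return double_bits(x) == 0


def int_gt_zero(x):
    """Integer 'greater than zero' on the raw bits. A NaN with a clear sign
    bit looks like a large positive integer here, although every ordered
    floating-point comparison involving NaN is false."""
    return double_bits(x) > 0


if __name__ == "__main__":
    for x in (0.0, -0.0, 1.5, -1.5):
        print(x, "fp ==0:", x == 0.0, "int ==0:", int_is_zero(x))
```

For ordinary values the two tests agree; -0.0 is equal to zero as a float but has the sign bit set as an integer.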

Though, for jumbo-prefixes, did end up adding a special case in the compiler where it will try to figure out if a constant will be used multiple times in a basic-block and, if so, will load it into a register rather than use a jumbo-prefix form.
 This is a delicate balance: while each use of the constant takes a
unit or 2 of the instruction stream, each use costs 0 more instructions.
The breakeven point in My 66000 is about 4 uses in a small area (loop);
at that point it should be hoisted into a register.
 
IIRC, I had set it as x>2, since 2 seemed break-even, 3+ favoring using a register, and 1 favoring a constant.
This did require adding logic to look forward over the current basic block to check for additional usage, which does seem a little like a hack. But, similar logic had already been used by the register allocator to try to prioritize which values should be evicted.
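
The look-ahead can be pictured roughly as follows. This is only a sketch under assumed data structures: a basic block is modelled as a list of `(opcode, operands)` tuples, integer operands stand for large constants, and the name `plan_constants` is invented; the real compiler's IR is certainly different.

```python
from collections import Counter


def plan_constants(block, threshold=2):
    """Decide per constant whether to load it into a register or encode it inline.

    block: list of (opcode, operands) tuples; int operands are constants,
    str operands are registers (a hypothetical representation).
    A constant goes into a register when it is used more than 'threshold'
    times in the block (the x>2 rule mentioned above).
    """
    uses = Counter(
        operand
        for _opcode, operands in block
        for operand in operands
        if isinstance(operand, int)
    )
    return {
        constant: ("register" if count > threshold else "inline")
        for constant, count in uses.items()
    }


if __name__ == "__main__":
    block = [
        ("ADD", ("R4", 0x12345678, "R6")),
        ("AND", ("R6", 0x12345678, "R7")),
        ("XOR", ("R7", 0x12345678, "R8")),
        ("OR", ("R8", 0x7F000000, "R9")),
    ]
    print(plan_constants(block))
```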

It could maybe make sense to have function-scale static-assigned constants, but have not done so yet.
 
Though, it appears as if one of the "top contenders" here would be 0, mostly because things like:
   foo->x=0;
And:
   bar[i]=0;
 I see no need for a zero register: the following are 1 instruction each!
     ST     #0,[Rfoo,offset(x)]
     ST     #0,[Rbar,Ri]
 
In my case, there is no direct equivalent in the core ISA.
But, in retrospect, a zero-register is probably not worth it if one is not trying for RISC-V style minimalism in the core ISA (never mind if they seemingly promptly disregard minimalism as soon as one gets outside of 'I', but then fail to address the shortcomings of said minimalism in 'I').

Are semi-common, and as-is end up needing to load 0 into a register each time they appear.
 
Had already ended up with a similar sort of special case to optimize "return 0;" and similar, mostly because this was common enough that it made more sense to have a special case:
   BRA .lbl_ret  //if function does not end with "return 0;"
   .lbl_ret_zero:
   MOV 0, R2
   .lbl_ret:
   ... epilog ...
 
For many functions, which allowed "return 0;" to be emitted as:
   BRA .lbl_ret_zero
Rather than:
   MOV 0, R2
   BRA .lbl_ret
Which on average ended up as a net-win when there are more than around 3 of them per function.
 Special defined tails......
 
I didn't really go much beyond 0, as supporting more than 0 ends up adding cost in terms of needing additional branches.
But, yeah, 0/1/NULL/etc, seem to be fairly common return values.

 
Though, another possibility could be to allow constants to be included in the "statically assign variables to registers" logic (as-is, they are excluded except in "tiny leaf" functions).
 
Went and tweaked the compiler rules, so now they are included in the ranking in normal functions but given a lower priority than the local variables (by roughly "one loop nesting level"). Seemed to help slightly (making it 1 level less appeared to be the local optimum judging by binary size).
Seems that generally 0 still isn't quite common enough to justify having one register fewer for variables though (or to have a designated zero register), but otherwise it seems there is not much to justify trying to exclude the "implicit zero" ops from the ISA listing.

 
Though, would likely still make a few decisions differently from those in RISC-V. Things like indexed load/store,
>
Absolutely
>
                                           predicated ops (with a designated flag bit),
>
Predicated then and else clauses which are branch free.
{{Also good for constant time crypto in need of flow control...}}
>
 
I have per instruction predication:
   CMPxx ...
   OP?T  //if-true
   OP?F  //if-false
Or:
   OP?T | OP?F  //both in parallel, subject to encoding and ISA rules
     CMP  Rt,Ra,#whatever
     PLE  Rt,TTTTTEEE
     // This begins the then-clause 5Ts -> 5 instructions
     OP1
     OP2
     OP3
     OP4
     OP5
     // this begins the else-clause 3Es -> 3 instructions
     OP6
     OP7
     OP8
     // we are now back at the join point.
 Notice no internal flow control instructions.
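
The mask-based scheme above can be modelled in a few lines. A sketch only, assuming each letter of the mask covers exactly one following instruction, with T slots executing when the condition holds and E slots when it does not; `run_predicated` is a made-up name and says nothing about how the hardware does it.

```python
def run_predicated(mask, condition, ops):
    """Return the ops that execute under a then/else predicate mask.

    mask: string of 'T' and 'E', one letter per shadowed instruction.
    condition: outcome of the compare feeding the predicate.
    ops: the instructions in program order (any objects).
    """
    if len(mask) != len(ops):
        raise ValueError("mask must cover exactly the shadowed instructions")
    if set(mask) - {"T", "E"}:
        raise ValueError("mask may only contain 'T' and 'E'")
    # A slot executes when its letter matches the condition; the rest are skipped.
    return [op for slot, op in zip(mask, ops) if (slot == "T") == condition]


if __name__ == "__main__":
    ops = ["OP1", "OP2", "OP3", "OP4", "OP5", "OP6", "OP7", "OP8"]
    print(run_predicated("TTTTTEEE", True, ops))
    print(run_predicated("TTTTTEEE", False, ops))
```

The instructions themselves carry no predicate bits in this model; only the mask decides which ones take effect.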
 
It can be similar in my case, with the ?T / ?F encoding scheme.
While poking at it, did go and add a check to exclude large struct-copy operations from predication, as it is slower to turn a large struct copy into NOPs than to branch over it.
Did end up leaving struct-copies where sz<=64 as allowed though (where a 64 byte copy at least has the merit of achieving full pipeline saturation and being roughly break-even with a branch-miss, whereas a 128 byte copy would cost roughly twice as much as a branch miss).

Performance gains are modest, but still noticeable (part of why predication ended up as a core ISA feature). Effect on pipeline seems to be small in its current form (it is handled along with register fetch, mostly turning non-executed instructions into NOPs during the EX stages).
 The effect is that one uses Predication whenever you will have already
fetched instructions at the join point by the time you have determined
the predicate value {then, else} clauses. The PARSE and DECODE do the
flow control without bothering FETCH.
 
Yeah, though in my pipeline, it is still a tradeoff of the relative cost of a missed branch, vs the cost of sliding over both the THEN and ELSE branches as a series of NOPs.
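
One way to reason about this tradeoff is a back-of-envelope cost comparison. Everything in the sketch below is an assumption for illustration: one instruction per cycle, both clauses equally likely, and a miss rate and miss penalty supplied by the caller. None of these numbers come from the actual core.

```python
def prefer_predication(then_len, else_len, miss_rate, miss_penalty):
    """Compare rough cycle costs of predicating vs branching (hypothetical model).

    Predicated: every instruction of both clauses flows through the pipeline,
    the non-selected ones as NOPs.
    Branched: only one clause runs (average of the two lengths here), plus the
    expected cost of a mispredicted branch.
    """
    predicated_cost = then_len + else_len
    branched_cost = (then_len + else_len) / 2 + miss_rate * miss_penalty
    return predicated_cost <= branched_cost


if __name__ == "__main__":
    print(prefer_predication(1, 1, 0.5, 8))     # short clauses, unpredictable branch
    print(prefer_predication(16, 16, 0.1, 8))   # long clauses, predictable branch
```

Short clauses behind a poorly predicted branch favor predication; long clauses (such as a large struct copy) favor branching.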

For the most part, 1-bit seems sufficient.
 How do you do && and || predication with 1 bit ??
 
Originally, it didn't.
Now I added some 3R and 3RI CMPxx encodings.
This allows, say:
   CMPGT R8, R10, R4
   CMPGT R8, R11, R5
   TST R4, R5
   ...
The stack-machine predicate model could have worked, but would have been more of an issue to deal with in the compiler (in addition to the non-zero LUT cost, and not really saving any clock-cycles as compared with the newer bit-twiddling in registers strategy).
Compiler does need some additional handling here, to not try this if the expressions could lead to side-effects.
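
The idea of building && and || out of compare results held in registers, and why side-effects matter, can be sketched as below. The operand order of CMPGT and the flag polarity of TST are not spelled out in this post, so the sketch models only "each compare yields 0 or 1 in a register, then the two are combined bitwise"; all function names are invented.

```python
def cmpgt(lhs, rhs):
    """Three-register style compare: produce 1 if lhs > rhs, else 0."""
    return 1 if lhs > rhs else 0


def and_condition(x, bound_a, bound_b):
    """(x > bound_a) && (x > bound_b), computed without any branch."""
    r4 = cmpgt(x, bound_a)
    r5 = cmpgt(x, bound_b)
    return (r4 & r5) != 0


def or_condition(x, bound_a, bound_b):
    """(x > bound_a) || (x > bound_b), computed without any branch."""
    return (cmpgt(x, bound_a) | cmpgt(x, bound_b)) != 0


def eager_and(first, second):
    """Both operands are always evaluated, as in the branch-free form."""
    a = first()
    b = second()
    return bool(a) and bool(b)


def short_circuit_and(first, second):
    """C-style &&: the second operand runs only if the first is true."""
    return bool(first()) and bool(second())


if __name__ == "__main__":
    calls = []
    def falsy():
        calls.append("first")
        return 0
    def side_effect():
        calls.append("second")
        return 1
    eager_and(falsy, side_effect)
    print(calls)  # both operands ran, unlike with short-circuit evaluation
```

Because the branch-free form evaluates both sides unconditionally, it is only a valid replacement when the second operand has no side-effects.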

More complex schemes generally ran into issues (had experimented with allowing a second predicate bit, or handling predicates as a stack-machine, but these ideas were mostly dead on arrival).
 Also note: the instructions in the then and else clauses know NOTHING about being under a predicate mask (or not). Thus, they waste no bits
while retaining the ability to run under predication.
 
It is a tradeoff. In my case they cost 2 bits in the encoding, which is shared with the WEX scheme.

 
                      and large-immediate encodings,
>
Nothing else is so poorly served in typical ISAs.
>
 
Probably true.
 
                                                     help enough with performance (relative to cost)
>
+40%
>
 
I am mostly seeing around 30% or so, for Doom and similar.
   A few other programs still being closer to break-even at present.
 
Things are a bit more contentious in terms of code density:
   With size-minimizing options to GCC:
     ".text" is slightly larger with BGBCC vs GCC (around 11%);
     However, the GCC output has significantly more ".rodata".
 A lot of this .rodata becomes constants in .text with universal constants.
 
Yeah.

A reasonable chunk of the code-size difference could be attributed to jumbo prefixes making the average instruction size slightly bigger.
 Size is one thing and it primarily diddles in cache footprint statistics.
Instruction count is another and primarily diddles in pipeline cycles
to execute statistics. Fewer instructions win almost all the time.
 
Yeah, pretty much...

More could be possible with more compiler optimization effort. Currently, a few recent optimization cases are disabled as they seem to be causing bugs that I haven't figured out yet.
 
                               to be worth keeping (though, mostly because the alternatives are not so good in terms of performance).
>
Damage to pipeline ability less than -5%.
 
Yeah.
