Re: Microarch Club

Subject: Re: Microarch Club
From: cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups: comp.arch
Date: 27 Mar 2024, 20:21:04
Organization: A noiseless patient Spider
Message-ID: <uu1o2p$30cnr$1@dont-email.me>
User-Agent: Mozilla Thunderbird
On 3/26/2024 7:02 PM, MitchAlsup1 wrote:
BGB-Alt wrote:
 
On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
BGB wrote:
>
I ended up with jumbo-prefixes. Still not perfect, and not perfectly orthogonal, but mostly works.
>
Allows, say:
   ADD R4, 0x12345678, R6
>
To be performed in potentially 1 clock-cycle and with a 64-bit encoding, which is better than, say:
   LUI X8, 0x12345
   ADDI X8, X8, 0x678
   ADD X12, X10, X8
>
This strategy completely fails when the constant contains more than 32-bits
>
     FDIV   R9,#3.141592653589247,R17
>
When you have universal constants (including 5-bit immediates), you rarely
need a register containing 0.
>
 
The jumbo prefixes at least allow for a 64-bit constant load, but as-is not for 64-bit immediate values to 3RI ops. The latter could be done, but would require 128-bit fetch and decode, which doesn't seem worth it.
 
There is the limbo feature of allowing for 57-bit immediate values, but this is optional.
 
OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.
 Which the LLVM compiler for RISC-V does not do; instead it uses an AUIPC
and a LD to get the value from data memory within ±2GB of IP. This takes
3 instructions and 2 words in memory, where universal constants do it in
1 instruction and 2 words in the code stream.
 
I was mostly testing with GCC for RV64, but, yeah, it just does memory loads.

Typical GCC response on RV64 seems to be to turn nearly all of the big-constant cases into memory loads, which kinda sucks.
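
To make the trade-off concrete, here is a rough C sketch of the roughly 6-step inline synthesis that a fixed 32-bit-instruction ISA ends up doing for an arbitrary 64-bit constant (immediate-range and sign-extension details glossed over; an illustration only, not what either compiler actually emits), which is part of why the compilers fall back to a PC-relative load:

  #include <stdint.h>

  /* Build 0x123456789ABCDEF0 from two 32-bit halves; each statement
     corresponds loosely to one LUI/ADDI/SLLI/ADD-style instruction.  */
  uint64_t synth_const(void)
  {
      uint64_t hi, lo;
      hi  = 0x12345UL << 12;  /* LUI : upper 20 bits of the high half */
      hi += 0x678;            /* ADDI: low 12 bits of the high half   */
      lo  = 0x9ABCDUL << 12;  /* LUI : upper 20 bits of the low half  */
      lo += 0xEF0;            /* ADDI: low 12 bits of the low half    */
      hi <<= 32;              /* SLLI: move the high half into place  */
      return hi + lo;         /* ADD : 0x123456789ABCDEF0             */
  }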
 This is typical when the underlying architecture is not very extensible to 64-bit virtual address spaces; they have to waste a portion of the 32-bit space to get access to all of the 64-bit space. Universal constants
make this problem vanish.
 
Yeah.
It at least seems worthwhile to have non-suck fallback strategies.

Even something like a "LI Xd, Imm17s" instruction, would notably reduce the number of constants loaded from memory (as GCC seemingly prefers to use a LHU or LW or similar rather than encode it using LUI+ADD).
 Reduced when compared to RISC-V, but increased when compared to My 66000.
My 66000 (at the 99% level) uses no instructions to fetch or create constants, nor does it waste any register (or registers) to hold use-once constants.
 
Yeah, but the issue here is mostly with RISC-V and its lack of a general constant-load mechanism; it burns lots of encoding space (on LUI and AUIPC), but still lacks good general-purpose options.
FWIW, if they had:
   LI     Xd, Imm17s
   SHORI  Xd, Imm16u   //Xd=(Xd<<16)|Imm16u
That would have still allowed a 32-bit constant in 2 ops, but would have also allowed 64-bit in 4 ops (within the limits of fixed-length 32-bit instructions), while also needing significantly less encoding space (both could fit into the remaining space in the OP-IMM-32 block).
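
A small C model of the two hypothetical ops and of the 2-op / 4-op constant builds (the mnemonics and widths are the proposed ones from above, not an existing extension):

  #include <stdint.h>

  /* Hypothetical LI Xd, Imm17s : load a sign-extended 17-bit immediate. */
  static int64_t li_imm17s(int32_t imm17) { return (int64_t)imm17; }

  /* Hypothetical SHORI Xd, Imm16u : Xd = (Xd << 16) | Imm16u. */
  static int64_t shori(int64_t xd, uint16_t imm16)
  { return (int64_t)(((uint64_t)xd << 16) | imm16); }

  int64_t demo(void)
  {
      /* 32-bit constant in 2 ops: */
      int64_t a = shori(li_imm17s(0x1234), 0x5678);  /* 0x12345678 */

      /* 64-bit constant in 4 ops: */
      int64_t b = li_imm17s(0x0123);
      b = shori(b, 0x4567);
      b = shori(b, 0x89AB);
      b = shori(b, 0xCDEF);                    /* 0x0123456789ABCDEF */
      return a + b;
  }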

I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping them enabled in the CPU core (they involved the non-zero cost of repacking them into Binary16 in ID1 and then throwing a Binary16->Binary64 converter into the ID2 stage).
 
Generally, the "FLDCH Imm16, Rn" instruction works well enough here (and can leverage a more generic Binary16->Binary64 converter path).
 Sometimes I see a::
      CVTSD     R2,#5
 Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed in register R2 so it can be accessed as an argument in the subroutine call
that happens a few instructions later.
 
I had looked into, say:
   FADD Rm, Imm5fp, Rn
Where, despite Imm5fp being severely limited, it had an OK hit rate.
Unpacking imm5fp to Binary16 being, essentially:
   aee.fff -> 0.aAAee.fff0000000
OTOH, can note that a majority of typical floating point constants can be represented exactly in Binary16 (well, excluding "0.1" or similar), so it works OK as an immediate format.
This allows a single 32-bit op to be used for constant loads (nevermind if one needs a 96 bit encoding for 0.1, or PI, or ...).
IIRC, I originally had it in the CONV2 path, which would give it a 2-cycle latency and only allow it in Lane 1.
I later migrated the logic to the "MOV_IR" path which also deals with "Rn=(Rn<<16)|Imm" and similar, and currently allows 1-cycle Binary16 immediate-loads in all 3 lanes.
Though, BGBCC still assumes it is Lane-1 only unless the FPU Immediate extension is enabled (as with the other FP converters).
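
For reference, the Binary16->Binary64 promotion itself is simple; a C sketch of the normal-number cases (subnormal/Inf/NaN handling omitted; this is just an illustration of the operation, not the actual converter logic):

  #include <stdint.h>

  /* Promote an IEEE Binary16 bit pattern to a Binary64 bit pattern.
     Handles +/-0 and normal numbers only.                            */
  static uint64_t fp16_to_fp64_bits(uint16_t h)
  {
      uint64_t sign = (uint64_t)(h >> 15) << 63;
      uint64_t exp  = (h >> 10) & 0x1F;          /* 5-bit exponent, bias 15  */
      uint64_t frac = h & 0x3FF;                 /* 10-bit fraction          */
      if (exp == 0 && frac == 0)
          return sign;                           /* +/- zero                 */
      exp = exp - 15 + 1023;                     /* rebias for Binary64      */
      return sign | (exp << 52) | (frac << 42);  /* fraction: 10 -> 52 bits  */
  }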

Mostly, a floating point immediate is available from a 32-bit constant
container. When accessed in a float calculation it is used as IEEE32;
when accessed by a double calculation, IEEE32->IEEE64 promotion is
performed in the constant delivery path. So, one can use almost any
floating point constant that is representable in float as a double
without eating cycles and while saving code footprint.
 
Don't currently have the encoding space for this.
Could in theory pull off truncated Binary32 as an Imm29s form, but it is not likely worth it. Would also require putting a converter in the ID2 stage, so not free.
In this case, the issue is more one of LUT cost to support these cases.

For FPU compare with zero, can almost leverage the integer compare ops, apart from the annoying edge cases of -0.0 and NaN leading to "not strictly equivalent" behavior (though, an ASM programmer could more easily get away with this). But, not common enough to justify adding FPU specific ops for this.
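
The problem cases are easy to show in C (a sketch that treats the double's bit pattern as a signed integer, the way an integer compare-with-zero would see it):

  #include <stdint.h>
  #include <string.h>

  /* "x == 0.0" done as an integer compare on the bit pattern. */
  static int int_is_zero(double x)
  {
      int64_t bits;
      memcpy(&bits, &x, sizeof bits);
      return bits == 0;   /* misses -0.0 (bits = 0x8000000000000000) */
  }

  /* "x < 0.0" done the same way. */
  static int int_is_neg(double x)
  {
      int64_t bits;
      memcpy(&bits, &x, sizeof bits);
      return bits < 0;    /* claims -0.0 < 0.0, and also claims any
                             NaN with the sign bit set is < 0.0      */
  }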
 Actually, the edge/noise cases are not that many gates.
a) once you are separating out NaNs, infinities are free !!
b) once you are checking denorms for zero, infinities become free !!
 Having structured a compare-to-zero circuit based on the fields in double, you can compose the terms to do all signed and unsigned integers and get a gate count; the number of gates you then add to cover all 10 cases of floating point is 12% over the simple integer version. Also
note:: this circuit is about 10% of the gate count of an integer adder.
 
I could add them, but, is it worth it?...
In this case, it is more a question of encoding space than logic cost.
It is semi-common in FP terms, but likely not common enough to justify dedicated compare-and-branch ops and similar (vs the minor annoyance at the integer ops not quite working correctly due to edge cases).

-----------------------
 
Seems that generally 0 still isn't quite common enough to justify having one register fewer for variables though (or to have a designated zero register), but otherwise it seems there is not much to justify trying to exclude the "implicit zero" ops from the ISA listing.
  It is common enough, but there are lots of ways to get a zero where you want it for a return.
 
I think the main use case for a zero register is mostly that it allows using it as a special case for pseudo-ops. I guess, not quite the same if it is a normal GPR that just so happens to be 0.
Recently ended up fixing a bug where:
   y=-x;
Was misbehaving with "unsigned int":
   "NEG" produces a value which falls outside of UInt range;
   But, "NEG; EXTU.L" is a 2-op sequence.
   It had the EXT for SB/UB/SW/UW, but not for UL.
     For SL, bare NEG almost works, apart from ((-1)<<31).
Could encode it as:
   SUBU.L  Zero, Rs, Rn
But, without a zero register, the compiler needs to special-case provision this (or, in theory, add a "NEGU.L" instruction, but doesn't seem common enough to justify this).
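
In C terms the problem looks like this (a sketch of what the 64-bit register ends up holding, assuming 32-bit 'unsigned int' values are kept in 64-bit registers):

  #include <stdint.h>

  /* y = -x for a 32-bit unsigned value held in a 64-bit register. */
  static uint64_t neg_only(uint64_t reg)   /* bare NEG                 */
  {
      return 0u - reg;                     /* 64-bit two's complement  */
  }

  static uint64_t neg_extu(uint64_t reg)   /* NEG, then EXTU.L         */
  {
      return (uint32_t)(0u - reg);         /* zero-extend low 32 bits  */
  }

  /* For x = 1: neg_only gives 0xFFFFFFFFFFFFFFFF, outside the UInt
     range; neg_extu gives 0x00000000FFFFFFFF, the properly wrapped
     value of (unsigned int)-1.                                        */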
...

>
Though, would likely still make a few decisions differently from those in RISC-V. Things like indexed load/store,
>
Absolutely
>
                                           predicated ops (with a designated flag bit),
>
Predicated then and else clauses which are branch free.
{{Also good for constant time crypto in need of flow control...}}
>
>
I have per instruction predication:
   CMPxx ...
   OP?T  //if-true
   OP?F  //if-false
Or:
   OP?T | OP?F  //both in parallel, subject to encoding and ISA rules
>
     CMP  Rt,Ra,#whatever
     PLE  Rt,TTTTTEEE
     // This begins the then-clause 5Ts -> 5 instructions
     OP1
     OP2
     OP3
     OP4
     OP5
     // this begins the else-clause 3Es -> 3 instructions
     OP6
     OP7
     OP8
     // we are now back at the join point.
>
Notice no internal flow control instructions.
>
 
It can be similar in my case, with the ?T / ?F encoding scheme.
 Except you eat that/those bits in OpCode encoding.
 
It is less bad than 32-bit ARM, where I only burnt 2 bits, rather than 4.
Also seems like a reasonable tradeoff, as the 2 bits effectively gain:
   Per-instruction predication;
   WEX / Bundle encoding;
   Jumbo prefixes;
   ...
But, maybe otherwise could have justified slightly bigger immediate fields, dunno.

While poking at it, did go and add a check to exclude large struct-copy operations from predication, as it is slower to turn a large struct copy into NOPs than to branch over it.
 
Did end up leaving struct-copies where sz<=64 as allowed though (where a 64 byte copy at least has the merit of achieving full pipeline saturation and being roughly break-even with a branch-miss, whereas a 128 byte copy would cost roughly twice as much as a branch miss).
 I decided to bite the bullet and have LDM, STM and MM so the compiler does
not have to do any analysis. This puts the onus on the memory unit designer
to process these at least as fast as a series of LDs and STs. Done right
this saves ~40% of the power of the caches by avoiding ~70% of tag accesses
and 90% of TLB accesses. You access the tag only when/after crossing a line boundary, and you access the TLB only after crossing a page boundary.
OK.
In my case, it was more a case of noting that sliding over, say, 1kB worth of memory loads/stores, is slower than branching around it.
Previously this case was not checked.
Had also changed some of the rules to allow FPU stuff to be predicated, since the current form of the ISA has no issue predicating FPU ops (IIRC, in an early form of the ISA, the FPU ops could not be predicated).
Do still need to exclude any cases which may result in a function call though, ...

Performance gains are modest, but still noticeable (part of why predication ended up as a core ISA feature). Effect on pipeline seems to be small in its current form (it is handled along with register fetch, mostly turning non-executed instructions into NOPs during the EX stages).
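
In C-ish terms the EX-stage handling amounts to something like this (a sketch of the general idea, not the actual pipeline logic):

  #include <stdint.h>
  #include <stdbool.h>

  /* One predicate bit (SR.T); ?T ops execute when it is set, ?F when clear. */
  typedef enum { PRED_NONE, PRED_T, PRED_F } pred_t;

  static bool executes(pred_t tag, bool sr_t)
  {
      if (tag == PRED_NONE) return true;
      return (tag == PRED_T) ? sr_t : !sr_t;
  }

  /* Non-executed instructions keep the old register value, i.e. they
     flow down the pipeline as NOPs.                                   */
  static uint64_t writeback(pred_t tag, bool sr_t,
                            uint64_t old_value, uint64_t ex_result)
  {
      return executes(tag, sr_t) ? ex_result : old_value;
  }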
>
The effect is that one uses Predication whenever you will have already
fetched instructions at the join point by the time you have determined
the predicate value of the {then, else} clauses. The PARSE and DECODE do the
flow control without bothering FETCH.
>
 
Yeah, though in my pipeline, it is still a tradeoff of the relative cost of a missed branch, vs the cost of sliding over both the THEN and ELSE branches as a series of NOPs.
 
For the most part, 1-bit seems sufficient.
>
How do you do && and || predication with 1 bit ??
>
 
Originally, it didn't.
Now I added some 3R and 3RI CMPxx encodings.
 
This allows, say:
   CMPGT R8, R10, R4
   CMPGT R8, R11, R5
   TST   R4, R5
....
  All I had to do was to make the second predication overwrite the first
predication's mask, and the compiler did the rest.
Not so simple in my case, but the hardware is simpler, since it just cares about the state of 1 bit (which is explicitly saved/restored along with the rest of the status-register if an interrupt occurs).
Various cases get kind of annoying in the compiler, as it needs to figure out to some extent what the conditional is doing, and how to translate this through the IR stages.
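
Roughly, the CMPGT/TST sequence above corresponds to the following C (a sketch; the operand order of CMPGT and the exact polarity of TST are assumed here, and || composes the same way with an OR in place of the AND):

  #include <stdint.h>
  #include <stdbool.h>

  /* "(r8 > r10) && (r8 > r11)" via compares into registers plus TST. */
  static bool and_compose(int64_t r8, int64_t r10, int64_t r11)
  {
      uint64_t r4 = (r8 > r10) ? 1 : 0;   /* CMPGT R8, R10, R4 */
      uint64_t r5 = (r8 > r11) ? 1 : 0;   /* CMPGT R8, R11, R5 */
      return (r4 & r5) != 0;              /* TST R4, R5 -> drives the
                                             1-bit predicate          */
  }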
...
