Re: Microarch Club

Subject : Re: Microarch Club
From : mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Newsgroups : comp.arch
Date : 27. Mar 2024, 02:02:05
Organization : Rocksolid Light
Message-ID : <3dd12c0fe2471bf4b9fcaffaed8256ab@www.novabbs.org>
User-Agent : Rocksolid Light
BGB-Alt wrote:

On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
BGB wrote:
 
I ended up with jumbo-prefixes. Still not perfect, and not perfectly orthogonal, but mostly works.
 
Allows, say:
   ADD R4, 0x12345678, R6
 
To be performed in potentially 1 clock-cycle and with a 64-bit encoding, which is better than, say:
   LUI X8, 0x12345
   ADD X8, X8, 0x678
   ADD X12, X10, X8
 This strategy completely fails when the constant contains more than 32 bits:
      FDIV   R9,#3.141592653589247,R17
 When you have universal constants (including 5-bit immediates), you rarely
need a register containing 0.
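
For readers following the LUI+ADD sequence quoted above, here is a minimal sketch of how a 32-bit constant gets split into the two immediates. It assumes the standard RISC-V behaviour that the 12-bit add immediate is sign-extended, and it models 32-bit arithmetic only (no RV64 sign extension). The function names are made up for illustration.

```python
def split_lui_addi(value: int) -> tuple[int, int]:
    """Split a 32-bit constant into (hi20, lo12) for a LUI + ADDI pair."""
    value &= 0xFFFFFFFF
    lo12 = value & 0xFFF
    # The 12-bit immediate is sign-extended, so when bit 11 is set the
    # low part is really negative and the high part must absorb a +1.
    if lo12 >= 0x800:
        lo12 -= 0x1000
    hi20 = ((value - lo12) >> 12) & 0xFFFFF
    return hi20, lo12


def rebuild(hi20: int, lo12: int) -> int:
    """Recombine the two immediates the way LUI followed by ADDI would."""
    return ((hi20 << 12) + lo12) & 0xFFFFFFFF


def fits_in_32_bits(value: int) -> bool:
    """True if the constant can be handled by the two-instruction split."""
    return 0 <= value <= 0xFFFFFFFF


hi, lo = split_lui_addi(0x12345678)
print(hex(hi), hex(lo))
```

Anything that fails `fits_in_32_bits`, such as the bit pattern of a double-precision constant, is outside what this split can express, which is the failure case described above.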
 

The jumbo prefixes at least allow for a 64-bit constant load, but as-is not for 64-bit immediate values to 3RI ops. The latter could be done, but would require 128-bit fetch and decode, which doesn't seem worth it.

There is the limbo feature of allowing for 57-bit immediate values, but this is optional.

OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.
Which the LLVM compiler for RISC-V does not do; instead it uses an AUIPC
and an LD to get the value from data memory within ±2GB of IP. This takes
3 instructions and 2 words in memory, where universal constants do it in
1 instruction and 2 words in the code stream.

Typical GCC response on RV64 seems to be to turn nearly all of the big-constant cases into memory loads, which kinda sucks.
This is typical when the underlying architecture is not very extensible to 64-bit virtual address spaces; they have to waste a portion of the 32-bit space to get access to all the 64-bit space. Universal constants
make this problem vanish.

Even something like a "LI Xd, Imm17s" instruction, would notably reduce the number of constants loaded from memory (as GCC seemingly prefers to use a LHU or LW or similar rather than encode it using LUI+ADD).
Reduced when compared to RISC-V, but increased when compared to My 66000.
My 66000 (at the 99% level) uses no instructions to fetch or create
constants, nor does it waste any register (or registers) to hold use-once
constants.

I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping them enabled in the CPU core (they involved the non-zero cost of repacking them into Binary16 in ID1 and then throwing a Binary16->Binary64 converter into the ID2 stage).

Generally, the "FLDCH Imm16, Rn" instruction works well enough here (and can leverage a more generic Binary16->Binary64 converter path).
Sometimes I see a::
     CVTSD     R2,#5
Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed in register R2 so it can be accessed as an argument in the subroutine call
that happens a few instructions later.
Mostly, a floating point immediate is available from a 32-bit constant
container. When accessed in a float calculation it is used as IEEE32;
when accessed by a double calculation, IEEE32->IEEE64 promotion is
performed in the constant delivery path. So, one can use almost any
floating point constant that is representable in float as a double
without eating cycles and while saving code footprint.
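
A minimal sketch of the test a compiler could apply to decide whether a double constant can live in a 32-bit container and be promoted on delivery. This is an illustration only, not My 66000's actual toolchain logic; `fits_in_float32` and `container_bits` are hypothetical names.

```python
import struct


def fits_in_float32(x: float) -> bool:
    """True if the double x survives a round trip through IEEE 754 binary32."""
    try:
        narrowed = struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:
        # Finite value too large in magnitude for binary32.
        return False
    # NaN never compares equal to itself, so treat NaN -> NaN as a match.
    return narrowed == x or (x != x and narrowed != narrowed)


def container_bits(x: float) -> int:
    """Size of the constant container this sketch would choose for x."""
    return 32 if fits_in_float32(x) else 64


for value in (5.0, 0.5, 0.1, 3.141592653589793):
    print(value, container_bits(value))
```

Values such as 5.0 or 0.5 round-trip exactly and would take the 32-bit container; 0.1 and pi do not, so they would need the full 64 bits.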

For FPU compare with zero, can almost leverage the integer compare ops, apart from the annoying edge cases of -0.0 and NaN leading to "not strictly equivalent" behavior (though, an ASM programmer could more easily get away with this). But, not common enough to justify adding FPU specific ops for this.
Actually, the edge/noise cases are not that many gates.
a) once you are separating out NaNs, infinities are free !!
b) once you are checking denorms for zero, infinities become free !!
Having structured a Compare-to-zero circuit based on the fields in double,
you can compose the terms to do all signed and unsigned integers and get
a gate count; then the number of gates you add to cover all 10 cases of floating point is 12% gate count over the simple integer version. Also
note:: this circuit is about 10% of the gate count of an integer adder.
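
To make the -0.0 and NaN edge cases concrete, here is a small software sketch of a compare-to-zero built from the sign, exponent and fraction fields of a double, next to a plain signed-integer compare on the same bit pattern. It only models the behaviour; it says nothing about gate counts, and the function names are invented for the example.

```python
import struct


def double_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, exponent, fraction) of an IEEE 754 binary64 value."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


def fp_compare_zero(x: float) -> str:
    """Floating-point compare against zero, decided from the fields."""
    sign, exponent, fraction = double_fields(x)
    if exponent == 0x7FF and fraction != 0:
        return "unordered"  # NaN
    if exponent == 0 and fraction == 0:
        return "equal"      # +0.0 and -0.0 both compare equal to zero
    # Denormals, normals and infinities are all ordered by sign alone.
    return "less" if sign else "greater"


def int_compare_zero(x: float) -> str:
    """Signed 64-bit integer compare applied to the same bit pattern."""
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    if bits == 0:
        return "equal"
    return "less" if bits < 0 else "greater"


for value in (1.5, -2.0, 0.0, -0.0, float("inf"), float("nan")):
    print(value, fp_compare_zero(value), int_compare_zero(value))
```

The two functions agree on ordinary values and on infinities, and disagree only for -0.0 and NaN, which is the "not strictly equivalent" behaviour mentioned above.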
-----------------------

Seems that generally 0 still isn't quite common enough to justify having one register fewer for variables though (or to have a designated zero register), but otherwise it seems there is not much to justify trying to exclude the "implicit zero" ops from the ISA listing.
It is common enough, but there are lots of ways to get a zero where you
want it for a return.

 
Though, would likely still make a few decisions differently from those in RISC-V. Things like indexed load/store,
>
Absolutely
>
                                           predicated ops (with a designated flag bit),
>
Predicated then and else clauses which are branch free.
{{Also good for constant time crypto in need of flow control...}}
>
 
I have per instruction predication:
   CMPxx ...
   OP?T  //if-true
   OP?F  //if-false
Or:
   OP?T | OP?F  //both in parallel, subject to encoding and ISA rules
     CMP  Rt,Ra,#whatever
     PLE  Rt,TTTTTEEE
     // This begins the then-clause 5Ts -> 5 instructions
     OP1
     OP2
     OP3
     OP4
     OP5
     // this begins the else-clause 3Es -> 3 instructions
     OP6
     OP7
     OP8
     // we are now back at the join point.
 Notice no internal flow control instructions.
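
A rough behavioural model of the then/else shadow described above: the mask string says which of the following instructions belong to the then-clause (T) and which to the else-clause (E), and the predicate picks one set. This is a sketch of the idea only, not the actual PLE semantics or encoding; `run_predicated` is a made-up helper.

```python
def run_predicated(condition: bool, mask: str, ops: list[str]) -> list[str]:
    """Return the ops that execute under a then/else mask such as 'TTTTTEEE'."""
    if len(mask) != len(ops):
        raise ValueError("mask must cover exactly the shadowed instructions")
    if set(mask) - {"T", "E"}:
        raise ValueError("mask may contain only 'T' and 'E'")
    executed = []
    for kind, op in zip(mask, ops):
        # 'T' slots run when the predicate holds, 'E' slots when it does not.
        if (kind == "T") == condition:
            executed.append(op)
    return executed


ops = ["OP1", "OP2", "OP3", "OP4", "OP5", "OP6", "OP7", "OP8"]
print(run_predicated(True, "TTTTTEEE", ops))
print(run_predicated(False, "TTTTTEEE", ops))
```

Either way the instruction stream is walked straight through to the join point; the only thing that changes is which slots take effect.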
 

It can be similar in my case, with the ?T / ?F encoding scheme.
Except you eat that/those bits in OpCode encoding.

While poking at it, did go and add a check to exclude large struct-copy operations from predication, as it is slower to turn a large struct copy into NOPs than to branch over it.

Did end up leaving struct-copies where sz<=64 as allowed though (where a 64 byte copy at least has the merit of achieving full pipeline saturation and being roughly break-even with a branch-miss, whereas a 128 byte copy would cost roughly twice as much as a branch miss).
I decided to bite the bullet and have LDM, STM and MM so the compiler does
not have to do any analysis. This puts the onus on the memory unit designer
to process these at least as fast as a series of LDs and STs. Done right
this saves ~40% of the power of the caches, avoiding ~70% of tag accesses
and 90% of TLB accesses. You access the tag only when/after crossing a line boundary and you access TLB only after crossing a page boundary.
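
A back-of-envelope sketch of why a multi-register or memory-to-memory operation needs fewer lookups than the equivalent run of individual loads. The 8-byte access size, 64-byte line and 4 KiB page are assumptions chosen for the example, accesses are assumed aligned, and the function is hypothetical; it is not meant to reproduce the percentages quoted above.

```python
def lookup_counts(base: int, nbytes: int, access: int = 8,
                  line: int = 64, page: int = 4096) -> tuple[int, int, int]:
    """Count lookups for a sequential run of aligned accesses.

    Returns (per_access, per_line, per_page):
      per_access - one tag and one TLB lookup for every individual LD/ST
      per_line   - tag lookups if the tag is checked only on a new line
      per_page   - TLB lookups if translation is done only on a new page
    """
    addresses = range(base, base + nbytes, access)
    per_access = len(addresses)
    per_line = len({a // line for a in addresses})
    per_page = len({a // page for a in addresses})
    return per_access, per_line, per_page


print(lookup_counts(0, 256))
print(lookup_counts(4096 - 64, 128))
```

A 256-byte aligned copy done as 32 separate 8-byte accesses touches only 4 lines and 1 page, so most of the per-access tag and TLB lookups are redundant.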
Performance gains are modest, but still noticeable (part of why predication ended up as a core ISA feature). Effect on pipeline seems to be small in its current form (it is handled along with register fetch, mostly turning non-executed instructions into NOPs during the EX stages).
 The effect is that one uses predication whenever the instructions at the
join point will already have been fetched by the time the predicate value
for the {then, else} clauses has been determined. PARSE and DECODE do the
flow control without bothering FETCH.
 

Yeah, though in my pipeline, it is still a tradeoff of the relative cost of a missed branch, vs the cost of sliding over both the THEN and ELSE branches as a series of NOPs.

For the most part, 1-bit seems sufficient.
 How do you do && and || predication with 1 bit ??
 

Originally, it didn't.
Now I added some 3R and 3RI CMPxx encodings.

This allows, say:
CMPGT R8, R10, R4
CMPGT R8, R11, R5
TST R4, R5
....
All I had to do was to make the second predication overwrite the first
predication's mask, and the compiler did the rest.
