Re: Microarch Club

Subject : Re: Microarch Club
From : mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Newsgroups : comp.arch
Date : 27. Mar 2024, 23:03:52
Organization : Rocksolid Light
Message-ID : <727d2b0fa197393309db7fefd76591e0@www.novabbs.org>
References : 1 2 3 4 5 6 7 8
User-Agent : Rocksolid Light
BGB wrote:

On 3/26/2024 7:02 PM, MitchAlsup1 wrote:
 Sometimes I see a::
      CVTSD     R2,#5
 Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed in register R2 so it can be accessed as an argument in the subroutine call
to happen in a few instructions.
 

I had looked into, say:
   FADD Rm, Imm5fp, Rn
Where, despite Imm5fp being severely limited, it had an OK hit rate.

Unpacking imm5fp to Binary16 being, essentially:
   aee.fff -> 0.aAAee.fff0000000
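
As a rough illustration of the unpacking pattern above, here is a minimal Python sketch. It assumes the input is the 6-bit field a.ee.fff exactly as written, that "A" stands for the complement of "a" (the post does not spell this out), and that the sign bit is simply left at 0. The names unpack_imm_fp and half_bits_to_float are hypothetical helpers, not part of any ISA or toolchain.

```python
import struct

def unpack_imm_fp(imm):
    """Expand a 6-bit a.ee.fff pattern to Binary16 bits: 0.aAAee.fff0000000.

    Assumption: 'A' is the complement of 'a'; the sign bit stays 0.
    """
    a = (imm >> 5) & 1
    ee = (imm >> 3) & 3
    fff = imm & 7
    not_a = a ^ 1
    # 5-bit Binary16 exponent built as a, ~a, ~a, e, e
    exponent = (a << 4) | (not_a << 3) | (not_a << 2) | ee
    # 3 fraction bits land at the top of the 10-bit fraction field
    return (exponent << 10) | (fff << 7)

def half_bits_to_float(bits):
    """Decode a Binary16 bit pattern into a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

for imm in (0b010000, 0b011000, 0b100000, 0b110010):
    print(f"{imm:06b} -> {half_bits_to_float(unpack_imm_fp(imm))}")
```

Under these assumptions the four patterns decode to 0.5, 1.0, 2.0 and 10.0.
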
realistically ±{0, 1, 2, 3, 4, 5, .., 31} only misses a few of the often
used fp constants--but does include 0, 1, 2, and 10. Also, realistically
the missing cases are the 0.5s.

OTOH, can note that a majority of typical floating point constants can be represented exactly in Binary16 (well, excluding "0.1" or similar), so it works OK as an immediate format.
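
One quick way to check which constants are exact in Binary16 is a round trip through Python's struct module ("e" is its half-precision format). This is only a sketch for experimenting; fits_binary16 is a hypothetical helper name.

```python
import struct

def fits_binary16(x):
    """Return True if x survives a round trip through IEEE Binary16 unchanged."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0] == x
    except OverflowError:
        # Magnitude is too large to pack as Binary16.
        return False

for c in (0.0, 0.5, 1.0, 1.5, 2.0, 10.0, 100.0, 0.1, 3.141592653589793):
    print(c, fits_binary16(c))
```

Small integers and simple binary fractions pass; 0.1 and PI do not, which matches the exceptions mentioned above.
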

This allows a single 32-bit op to be used for constant loads (nevermind if one needs a 96 bit encoding for 0.1, or PI, or ...).

Mostly, a floating point immediate is available from a 32-bit constant
container. When accessed in a float calculation it is used as IEEE32;
when accessed by a double calculation, IEEE32->IEEE64 promotion is
performed in the constant delivery path. So, one can use almost any
floating point constant that is representable in float as a double
without eating cycles and while saving code footprint.
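
The promotion described above can be mimicked in software as a pure bit-pattern transformation. The sketch below leans on Python's struct module rather than modelling any hardware path; promote_f32_bits_to_f64_bits is a hypothetical name.

```python
import struct

def promote_f32_bits_to_f64_bits(bits32):
    """Widen an IEEE Binary32 bit pattern to the Binary64 pattern of the same value."""
    value = struct.unpack("<f", struct.pack("<I", bits32))[0]
    return struct.unpack("<Q", struct.pack("<d", value))[0]

print(hex(promote_f32_bits_to_f64_bits(0x3F800000)))  # 1.0f
print(hex(promote_f32_bits_to_f64_bits(0x3DCCCCCD)))  # nearest float to 0.1
```

Note that the second constant widens to the float-rounded value of 0.1, not to the double nearest 0.1, so the trick only preserves constants that are already exact in float.
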
 

Don't currently have the encoding space for this.

Could in theory pull off truncated Binary32 as an Imm29s form, but not likely worth it. Would also require putting a converter in the ID2 stage, so not free.

In this case, the issue is more one of LUT cost to support these cases.

For FPU compare with zero, can almost leverage the integer compare ops, apart from the annoying edge cases of -0.0 and NaN leading to "not strictly equivalent" behavior (though, an ASM programmer could more easily get away with this). But, not common enough to justify adding FPU specific ops for this.
 Actually, the edge/noise cases are not that many gates.
a) once you are separating out NaNs, infinities are free !!
b) once you are checking denorms for zero, infinities become free !!
 Having structured a Compare-to-zero circuit based on the fields in double;
You can compose the terms to do all signed and unsigned integers and get
a gate count, then the number of gates you add to cover all 10 cases of floating point is 12% gate count over the simple integer version. Also
note:: this circuit is about 10% of the gate count of an integer adder.
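
To make the edge cases concrete, here is a small behavioural sketch (not the circuit described above) that compares a Binary64 bit pattern against zero using only its sign, exponent and fraction fields, next to what a signed integer compare of the same bits would report. The function names and the string results are hypothetical.

```python
SIGN = 1 << 63
EXP_MASK = 0x7FF << 52
FRAC_MASK = (1 << 52) - 1

def fcmp_zero(bits):
    """Compare a Binary64 bit pattern against zero using only its fields."""
    exp_all_ones = (bits & EXP_MASK) == EXP_MASK
    frac_nonzero = (bits & FRAC_MASK) != 0
    if exp_all_ones and frac_nonzero:
        return "unordered"            # NaN of either sign
    if (bits & (SIGN - 1)) == 0:
        return "eq"                   # +0.0 or -0.0
    return "lt" if bits & SIGN else "gt"

def icmp_zero(bits):
    """Same bit pattern treated as a signed 64-bit integer."""
    if bits == 0:
        return "eq"
    return "lt" if bits & SIGN else "gt"

patterns = {
    "+0.0": 0x0000000000000000,
    "-0.0": 0x8000000000000000,
    "+1.0": 0x3FF0000000000000,
    "-1.0": 0xBFF0000000000000,
    "+inf": 0x7FF0000000000000,
    "NaN":  0x7FF8000000000000,
}
for name, bits in patterns.items():
    print(name, fcmp_zero(bits), icmp_zero(bits))
```

The two functions disagree only on -0.0 and on NaNs, which are the "not strictly equivalent" cases mentioned above.
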
 

I could add them, but, is it worth it?...
Whether to add them or not is on you.
I found things like this to be more straw on the camel's back {where the camel collapses to a unified register file model.}

In this case, it is more a question of encoding space than logic cost.

It is semi-common in FP terms, but likely not common enough to justify dedicated compare-and-branch ops and similar (vs the minor annoyance at the integer ops not quite working correctly due to edge cases).
My model requires about ½ the instruction count when processing FP
comparisons compared to RISC-V {big, no; around 5% in FP code and 0 elsewhere.} Where it wins big is compare against a non-zero FP constant. My 66000 uses 2 instructions {FCMP, BB} whereas RISC-V uses 4 {AUIPC, LD, FCMP, BC}

-----------------------
 
Seems that generally 0 still isn't quite common enough to justify having one register fewer for variables though (or to have a designated zero register), but otherwise it seems there is not much to justify trying to exclude the "implicit zero" ops from the ISA listing.
  It is common enough,
But there are lots of ways to get a zero where you want it for a return.
 

I think the main use case for a zero register is mostly that it allows using it as a special case for pseudo-ops. I guess, not quite the same if it is a normal GPR that just so happens to be 0.

Recently ended up fixing a bug where:
   y=-x;
Was misbehaving with "unsigned int":
   "NEG" produces a value which falls outside of UInt range;
   But, "NEG; EXTU.L" is a 2-op sequence.
   It had the EXT for SB/UB/SW/UW, but not for UL.
     For SL, bare NEG almost works, apart from ((-1)<<31).
Could encode it as:
   SUBU.L  Zero, Rs, Rn
     ADD  Rn,#0,-Rs
But notice::
     y = -x;
     a = b + y;
can be performed as if it had been written::
     y = -x;
     a = b + (-x);
Which is encoded as::
     ADD  Ry,#0,-Rx
     ADD  Ra,Rb,-Rx
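
The "unsigned int" negation problem above can be reproduced with a small model. This sketch assumes 64-bit registers and that EXTU.L zero-extends the low 32 bits; neg64, extu_l and exts_l are hypothetical helper names, not real instructions.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1

def neg64(x):
    """Two's-complement negate in a 64-bit register."""
    return (-x) & MASK64

def extu_l(x):
    """Zero-extend the low 32 bits (assumed meaning of EXTU.L)."""
    return x & MASK32

def exts_l(x):
    """Sign-extend the low 32 bits to 64 bits."""
    x &= MASK32
    return x | (MASK64 ^ MASK32) if x & 0x80000000 else x

# unsigned int x = 1; y = -x;
print(hex(neg64(1)))           # bare NEG leaves bits set above bit 31
print(hex(extu_l(neg64(1))))   # NEG; EXTU.L gives the expected 0xffffffff

# signed int: the one case where bare NEG leaves a non-canonical value
int_min = exts_l(0x80000000)
print(hex(neg64(int_min)), hex(exts_l(neg64(int_min))))
```

For every other sign-extended 32-bit input the bare negate already yields a correctly sign-extended result, which is the "almost works" case noted above.
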

But, without a zero register,
#0 is not a register, but its value is 0x0000000000000000 anyway.
You missed the point entirely, if you can get easy access to #0
then you no longer need a register to hold this simple bit pattern.
In fact a large portion of My 66000 ISA over RISC-V comes from this
mechanism.

                              the compiler needs to special-case
The compiler needs easy access to #0 and the compiler needs to know
that #0 exists, but the compiler does not need to know if some register
contains that same bit pattern.

provision this (or, in theory, add a "NEGU.L" instruction, but doesn't seem common enough to justify this).

....

It is less bad than 32-bit ARM, where I only burnt 2 bits, rather than 4.
I burned 0 per instruction, but you can claim I burned 1 instruction PRED and
6.4 bits of that instruction are used to create masks that project upon up to
8 following instructions.

Also seems like a reasonable tradeoff, as the 2 bits effectively gain:
   Per-instruction predication;
   WEX / Bundle encoding;
   Jumbo prefixes;
   ...

But, maybe otherwise could have justified slightly bigger immediate fields, dunno.

While poking at it, did go and add a check to exclude large struct-copy operations from predication, as it is slower to turn a large struct copy into NOPs than to branch over it.
 
Did end up leaving struct-copies where sz<=64 as allowed though (where a 64 byte copy at least has the merit of achieving full pipeline saturation and being roughly break-even with a branch-miss, whereas a 128 byte copy would cost roughly twice as much as a branch miss).
 I decided to bite the bullet and have LDM, STM and MM so the compiler does
not have to do any analysis. This puts the onus on the memory unit designer
to process these at least as fast as a series of LDs and STs. Done right
this saves ~40% of the power of the caches avoiding ~70% of tag accesses
and 90% of TLB accesses. You access the tag only when/after crossing a line boundary and you access TLB only after crossing a page boundary.
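
A back-of-envelope sketch of the counting argument above: it assumes 8-byte registers, 64-byte cache lines and 4 KiB pages (none of these sizes are stated in the post), and that a series of separate LDs/STs probes the tags and TLB once per access. count_accesses is a hypothetical helper, and the numbers it prints are not meant to reproduce the percentages quoted above.

```python
def count_accesses(start, nbytes, access_size=8, line=64, page=4096):
    """Return ((tags, tlb) for separate LD/STs, (tags, tlb) for one LDM/STM).

    Assumptions: every separate access probes both tag and TLB; the
    multi-register form probes the tag once per line touched and the
    TLB once per page touched.
    """
    n_accesses = nbytes // access_size
    separate = (n_accesses, n_accesses)
    last = start + nbytes - 1
    lines = last // line - start // line + 1
    pages = last // page - start // page + 1
    return separate, (lines, pages)

print(count_accesses(0x1000, 128))  # 16 registers, line-aligned
print(count_accesses(0x0FC0, 128))  # same size, straddling a page boundary
```
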

OK.

In my case, it was more a case of noting that sliding over, say, 1kB worth of memory loads/stores, is slower than branching around it.
This is why My 66000 predication has use limits. Once you can get where you want faster with a branch, then a branch is what you should use.
I reasoned that my 1-wide machine would fetch 16-bytes (4 words) per
cycle and that the minimum DECODE time is 2 cycles, so Predication
wins when the number of instructions <= FWidth × Dcycles = 4 × 2 = 8.
Use predication and save cycles by not disrupting the front end.
Use branching   and save cycles by     disrupting the front end.
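
The rule of thumb above reduces to a one-line test. In this sketch the defaults (4 words fetched per cycle, 2 decode cycles) come from the paragraph above, while the function name and the assumption of 4-byte instructions in the 1 kB example are mine.

```python
def prefer_predication(n_insts, fetch_width=4, decode_cycles=2):
    """True when the skipped span fits in fetch_width * decode_cycles instructions."""
    return n_insts <= fetch_width * decode_cycles

for n in (1, 8, 9, 1024 // 4):
    choice = "predicate" if prefer_predication(n) else "branch"
    print(n, choice)
```

With these numbers, a span of 8 instructions is still predicated, while sliding over 1 kB of loads and stores (256 instructions) is clearly a case for branching.
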

 All I had to do was to make the second predication overwrite the first
predication's mask, and the compiler did the rest.

Not so simple in my case, but the hardware is simpler, since it just cares about the state of 1 bit (which is explicitly saved/restored along with the rest of the status-register if an interrupt occurs).
Simpler than 8 flip-flops used as a right-shift register ??
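
For readers who want to picture the shift-register approach, here is a toy model. It is only a sketch: the class PredMask, its method names and the "1 = suppress" bit convention are hypothetical and do not describe the actual My 66000 PRED encoding. It does show the two properties mentioned in the thread: the mask covers up to 8 following instructions, and a second predication simply overwrites the first one's mask.

```python
class PredMask:
    """Toy 8-bit predication shadow: bit 0 governs the next instruction."""

    def __init__(self):
        self.mask = 0  # a set bit means "suppress that following instruction"

    def pred(self, suppress_bits):
        """A PRED-like op: overwrite any earlier mask with a new one."""
        self.mask = suppress_bits & 0xFF

    def step(self):
        """Return True if the next instruction executes, then shift right."""
        execute = (self.mask & 1) == 0
        self.mask >>= 1  # zeros shift in, so the shadow expires on its own
        return execute

p = PredMask()
# then-clause of 3 instructions executes, else-clause of 2 is suppressed
p.pred(0b11000)
print([p.step() for _ in range(6)])
```

After the five covered instructions the mask has shifted out to zero, so the sixth instruction runs unconditionally.
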

