Re: Misc: Ongoing status...

From : cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups : comp.arch
Date : 31 Jan 2025, 07:50:24
Organization : A noiseless patient Spider
Message-ID : <vnhrrj$3d7i0$1@dont-email.me>
References : 1 2
User-Agent : Mozilla Thunderbird
On 1/30/2025 5:48 PM, MitchAlsup1 wrote:
> On Thu, 30 Jan 2025 20:00:22 +0000, BGB wrote:
 
>> So, recent features added to my core ISA: None.
>> Reason: Not a whole lot that brings much benefit.
>>
>>
>> Have ended up recently more working on the RISC-V side of things,
>> because there are still gains to be made there (stuff is still more
>> buggy, less complete, and slower than XG2).
>>
>>
>> On the RISC-V side, did experiment with Branch-compare-Immediate
>> instructions, but unclear if I will carry them over:
>>    Adds a non-zero cost to the decoder;
>>      Cost primarily associated with dealing with a second immed.
>>    Effect on performance is very small (< 1%).
>
> I find this a little odd--My 66000 has a lot of CMP #immed-BC
> a) so I am sensitive as this is break even wrt RISC-V
> b) But perhaps the small gain is due to something about
> .. how the pair runs down the pipe as opposed to how the
> .. single runs down the pipe.
 
Issue I had seen is mostly, "How often does it come up?":
Seemingly, there are around 100-150 instructions between each occurrence on average (excluding cases where the constant is zero; comparing with zero is more common).
What does it save:
Typically 1 cycle that might otherwise be spent loading the value into a register (if this instruction doesn't end up getting run in parallel with another prior instruction).
In the BGBCC output, it comes up primarily in "for()" loops (followed by the occasional if-statement), so one might expect this to increase the likelihood of it having an effect.
But, seemingly, there are not enough tight "for()" loops and similar in use for it to have a more significant effect.
So, in the great "if()" ranking:
   if(x COND 0) ...   //first place
   if(x COND y) ...   //second place
   if(x COND imm) ... //third place
However, a construct like:
   for(i=0; i<10; i++)
     { ... }
Will emit two of them, so they are not *that* rare either.
Still, a lot rarer in use than:
   val=ptr[idx];
Though...
Have noted though that simple constant for-loops are a minority; far more often they are something like:
   for(i=0; i<n; i++)
     { ... }
Which doesn't use any.
Or:
   while(i--)
     { ... }
Which uses a compare with zero (in RV, this can be encoded with the zero register; in BJX2 it has its own dedicated instruction due to the lack of a zero register; some of these were formally dropped in XG3, which does have access to a zero register, where encoding the op with ZR is considered preferable).

>> In my case, I added them as jumbo-prefixed forms, so:
>>    BEQI Imm17s, Rs, Disp12s
>
> IMM17 should be big enough.
 
It is what falls out of my encoding scheme:
* 1iiiiii-iiiii-zzzzV-100-omnpp-01-11111  J21O (Prefix, BGB)
So, there is 11 bits of immediate-extension, which (if V=1) combines with the 5-bit Ro/Rs2 field, and the Wo extension bit, to give an Imm17s (this applies to most 3R instructions, though for reasons, only a subset are supported in JX2VM; but pretty much any normal 3R op in the Verilog core).
I made it a rule that for some Disp12 ops, namely Bcc and Store, if V=1, then rather than combine with the existing Disp12, it forms a separate Imm field with Ro (as in the case of a 3R instruction).
In the decoder, it ends up routing the second immediate through Lane 3 (similar to the additional register ports for Store; so if the Rx or Ry ports are used to fetch the immediate, it gets routed back into the op). Had to add a special case to get the Lane 3's immediate back into Lane 1's ALU though (via the Rs and Rt register ports).
The main cost of this feature is mostly in the additional signal routing.

>> Also added Store-with-Immediate, with a similar mechanism:
>>    MOV.L  Imm17s, (Rm, Disp12s*1)
>> As, it basically dropped out for free.
>>
>> Also unclear if it will be carried over. Also gains little, as in most
>> of the store-with-immediate scenarios, the immediate is 0.
>
> Most of my ST w/immediate is floating point data--imm17 is not
> going to cut it there.
 
Well, unless it is Binary16 and gets auto-converted to Binary32 or Binary64.
Though, IIRC, I currently only have this mechanism in place in Lane 1 (mostly intended for FPU immediate values); and for Store it would need to happen in Lane 3.
Note that in both BJX2 and RV+Jx, it is theoretically possible to do Immediate synthesis on an FPU op to get an FPU immediate.
Though, the possible functional overlaps between BJX2 Jumbo Prefixes and my Jumbo Prefix extension for RISC-V are "not exactly subtle" (main drastic difference being that the RV prefixes only have 22 bits available rather than 25).

>> Instructions with a less than 1% gain and no compelling edge case are
>> essentially clutter.
>>
>> I can note that some of the niche ops I did add, like special-case
>> RGB555 to Index8 or RGBI, were because at least they had a significant
>> effect in one use-case (such as, speeding up how quickly the GUI can do
>> redraw operations).
>>
>> My usual preference in these cases is to assign 64-bit encodings, as the
>> instructions might only be used in a few edge cases, so it becomes a
>> waste to assign them spots in the more valuable 32-bit encoding space.
>>
>>
>> The more popular option was seemingly another person's option, to define
>> them as 32-bit encodings.
>>    Their proposal was effectively:
>>      Bcc Imm5, Rs1', Disp12
>>        (IOW: a 3-bit register field, in a 32-bit instruction)
>>    I don't like this; it is very off-balance.
>>      Better IMO: Bcc Imm6s, Rs1, Disp9s (+/- 512B)
>
> This is the case where fusing of CMP #imm16-BC into one op is
> better, unless you can use a 64-bit encoding to directly
> encode that.
 
I am using a 64-bit Jumbo encoding...
   But, not much enthusiasm in RV land for jumbo prefixes.
It is infrequent enough that a 64-bit encoding makes sense.
Huawei had a "less bad" encoding, but they burnt basically the entire User-1 block on it, so that isn't going to fly.
Generally, around 95% of function-local branches can hit in a Disp9, vs 98% for Disp12, so dropping to Disp9 costs relatively little.
An Imm6s that encodes -32..31 is "mostly adequate".
The combination of Disp9, Imm6, and a 5-bit register field, is likely the "statistically best case" given the available encoding bits; assuming one really must go with a 32-bit instruction format.
For Baseline (which I am half-tempted at this point to re-dub as XG1) and XG2, the Branch-and-compare ops use a Disp8, but for XG3 it was reworked as Disp10 (at the cost of losing the L/Q distinction). Almost half-tempting to switch XG2 to using Disp10 for Branch-and-Compare, apart from the issue that this would break binary compatibility with existing binaries.
Comparably, Disp8 has a somewhat worse hit-rate than Disp9 or Disp10.
But, some changes for XG3 I ended up rolling back on:
   Had intended to move BRA and BSR into the F8 block;
   And make Disp33s use an unscaled displacement (like RV).
But:
   Ended up just moving back to the BRA/BSR space in the F0 block;
   And, keeping the same displacement scale rules as XG2;
     Though, with a tweak to both XG2 and XG3:
       Setting WO forces an unscaled/byte displacement
         (regardless of base register).
For XG3 mode, the layout of the displacement in the F0 block was changed to match that of the planned F8 block branches. Mostly this was realizing that realistically, I can't get rid of the BRA/BSR blocks in F0, and having a difference here between XG2 and XG3 was more hassle than it was worth.

>> The 3-bit register field also makes it nearly useless with my compiler,
>> as my compiler (in its RV mode) primarily uses X18..X27 for variables
>> (IOW: the callee save registers). But, maybe moot, as either way it
>> would still save less than 1%.
>>
>> Also, as for any ops with 3-bit registers:
>>    Would make superscalar harder and more expensive;
>>    Would add ugly edge cases and cost to the instruction decoder;
>>    ...
>
> 3-bit register specifier is not much better than dedicated registers
> {like x86 DIV}.
 
Yeah.
In RV, it encodes X8..X15, which mostly covers the argument registers.
The argument was mostly that GCC tends to use the scratch/argument registers for register allocation.
But, in my case, BGBCC is not GCC, and uses callee-save registers for local variables (and switching to scratch registers in non-leaf functions would be a non-trivial set of design changes).
Similarly, I am not super keen on proposals that would overly constrain implementation choices and register usage in the compiler.
BGBCC seemingly also has a wider spread in the register usage than GCC in this case (and very different patterns for which registers are most frequently used).
Also, if register allocation can be spread over more registers, it is generally better for being able to shuffle things around and extract some usable ILP.
But, can note that comparably, the RV ABI is balanced towards having more scratch registers than callee save (vs the BJX2 ABI being closer to even, but having slightly more callee-save registers than scratch registers).

>> I would prefer it if people did not go that route (and tried to keep
>> things at least mostly consistent, trying to avoid making a dog chewed
>> mess of the
>>      already dog chewed
>>             32-bit ISA).
>>
>> If you really feel the need for 3-bit register fields... Maybe, go to a
>> larger encoding?...
>
> I suggest a psychiatrist.
 
People are pointing to charts gathered by mining binaries and being like: "X10 and X11 are the two most commonly used registers".
But, this is like pointing at x86 and being like:
"EAX and ECX are the top two registers, who needs such obscure registers as ESI and EDI"?...

>> When I defined my own version of BccI (with a 64-bit encoding), how many
>> new instructions did I need to define in the 32-bit base ISA: Zero.
>
> How many 64-bit encodings did My 66000 need: zero.
> {Hint: the words following the instruction specifier have no internal
> format}
 
I consider the combination of Jumbo-Prefix and Suffix instruction to be a 64-bit instruction.
However, in some cases, the jumbo-prefix can supply additional opcode bits, or modify the instruction in ways to represent a new instruction form.
Nevermind whether or not jumbo prefixes are the most elegant solution.
They do allow leveraging the existing decoder logic, without too significant of additions or changes.
In some ways, it is better to selectively modify a whole "category" of instruction than to address each encoding individually.
So, with the prefix, all the Imm12 ops "become" Imm33 ops, because the contents of the decoded Immed field were changed, not because I went in and added N new instructions to the decoder tables, or defined N new instructions in the base ISA.
If you take the Bcc op, and the decoder logic simply replaces the Ro register field with an "IMMB" virtual register, and directs an immediate to a secondary output that heads to Lane3, ...
Then, no new instructions needed to be added to the decoder.
But, can note:
The RV people seem highly averse to "change the instruction by modifying the behavior of a whole category in the decoder".
They would much rather keep adding N new instructions, one at a time, as entirely new encodings.
But, the point is:
Modifying a category by causing the decoder to decode a larger immediate is *not* the same as re-adding the whole ISA over again, just with a larger immediate field...
The logic for looking up what instruction it is decoding, for the most part, *does not care* about the contents of the register or immediate fields; it cares about the opcode bits.
The logic that is decoding the register fields cares about the register fields. And, the logic that is decoding immediate values, cares about immediate values.
This whole soup can then get MUX'ed together right at the end, with each instruction selecting what category of instruction it wants (for the decoded registers and immediate values), and then a MUX selects the appropriate immediate-field contents (as a category).
But, as soon as you add new instructions whose layouts are non-standard, this has added *new* categories, and this is where a lot of the cost steps in.
What is the main arcane magic in the decoder that makes a RV jumbo prefix work, well, in a simplified sense:
   // Sign-extend the standard Imm12 field (istrWord[31:20]) to 32 bits.
   opImm12s = {
      istrWord[31] ? 20'hFFFFF : 20'h00000,
      istrWord[31:20] };
   ...
   if(opIsJumbo)
   begin
     // The jumbo prefix directly supplies the high 20 bits.
     opImm12s[31:12] = istrJBits[19:0];
     ...
   end
It isn't really all that much harder than this.
Granted, this doesn't mean there can't be any bugs lurking in the mix (well, more so in cases where the top-level decoder starts trying to glue parts of the two ISAs together).
Though, it is pros/cons between C and Verilog here (in some cases, the C emulator code ends up requiring more handling for edge cases, than the "category altering logic" one can use in Verilog). But, OTOH, in some ways, Verilog is also harder to debug, and more sensitive to edge cases causing LUT cost to jump up or timing constraints to explode (one big merit of C being, you can plug values together however and there are no timing constraints).
But, when they are presented with "Well, bits are tight, I can't just burn a whole 25-bit block on this one thing...", the response is "I know, 3-bit register fields!" or "Well, I will just move this field over here, and maybe stick a new register field over there!".
Me: "Grrr..."
Would help maybe if people were not being overly wasteful with encoding space, or making things much more complicated than they need to be.

<snip>
 
>> But, my overall goal still being:
>>    Try to make it not suck.
>> But, it still kinda sucks.
>>    And, people don't want to admit that it kinda sucks;
>>    Or, that going some directions will make things worse.
>
> On the other hand, I remain upbeat on the ISA I have created.
 
I was actually more complaining about RISC-V here...
But, RISC-V is the more popular ISA.
   But, the cleverness of GCC doesn't make it a win.
BGBCC still beats GCC in terms of performance when it has a few useful ISA extensions, because it is not being held down by a boat anchor (even if arguably BGBCC is a worse compiler that has nowhere near the optimizing capabilities of GCC).
And, even with BGBCC still doing a lot of things that are arguably less optimal if compared with GCC (because, my effort budget is finite).
But, XG2 currently still holds the speed crown, I just lack much I can do to make it meaningfully faster within major architectural constraints.
XG3 has some potential, but still has not matched XG2.
Theoretically, XG3 has access to pretty much all of XG2's toolbox (being more-or-less XG2 with the encoding shuffled around).
But, thus far, XG3 binaries are still coming out slightly bigger and slower (and the compiler is, on average, emitting them with around 30% RISC-V instructions).
Less obvious if some of this is features I hadn't mapped over yet (such as predicated instructions), or being negatively affected by the ABI:
Fewer ABI callee-save registers in a compiler that mostly uses callee-save registers;
Using 8 register arguments for functions vs the 16 I was using in XG2;
...
However, have noted that XG3 does appear to be faster than the original Baseline/XG1 ISA.
Where, to recap:
   XG1 (Baseline):
     16/32/64/96 bit encodings;
       16-bit ops can access R0..R15 with 4b registers;
         Only 2R or 2RI forms for 16-bit ops;
         16-bit ISA still fairly similar to SuperH.
     5-bit register fields by default;
       6-bit available for an ISA subset.
     Disp9u and Imm9u/n for most immediate form instructions;
     32 or 64 GPRs, Default 32.
     8 argument registers.
   XG2:
     32/64/96 bit encodings;
       All 16-bit encodings dropped.
     6-bit register fields (via a wonky encoding);
     Same basic instruction format as XG1,
       But, 3 new bits stored inverted in the HOB of instr words;
     Mostly Disp10s and Imm10u/n;
     64 GPRs native;
     16 argument registers.
   XG3:
     Basically repacked XG2;
       Can exist in same encoding space as RISC-V ops;
       Aims for ease of compatibility with RV64G.
     Encoding was made "aesthetically nicer"
       All the register bits are contiguous and non-inverted;
       Most immediate fields are also once again contiguous;
       ...
     Partly reworks branch instructions;
       Scale=4, usually relative to BasePC (like RV);
     Uses RV's register numbering space (and ABI);
       Eg: SP at R2 vs R15, ...
       (Partly carried over from XG2RV, which is now defunct).
     64 GPRs, but fudged into RV ABI rules;
       Can't rebalance ABI without breaking RV compatibility;
         Breaking RV compatibility defeating its point for existing.
     8 argument registers (because of RV ABI).
       Could in theory expand to 16, but would make issues.
     Despite being based on XG2,
       BGBCC treats XG3 as an extension to RISC-V.
Then, RV:
   16/32; 48/64/96 (Ext)
   Has 16-bit ops:
     Which are horribly dog-chewed,
       and only manage a handful of instructions.
     Many of the ops can only access X8..X15;
     With GCC, enabling RVC saves around 20% off the ".text" size.
   Imm12s and Disp12s for most ops;
   Lots of dog-chew in the encodings (particularly the Disp fields);
     JAL is basically confetti.
   ...
In its basic form, RV is the worst performing option here, but people actually care about RISC-V, so supporting it is value-added.
The BJX2 Core can technically boot using RISC-V code at this point, but isn't great if evaluated purely as a RISC-V core (achieving neither the fmax nor the DMIPS/MHz scores of some other cores).
Gets around 90k in Dhrystone, or around 1.02 DMIPS/MHz.
Though, seemingly part of this is due to having 40 cycle DIVW and REMW (and 76 cycle DIV/REM); where Dhrystone over-relies on these, and so gives a lower score if integer divide is slow.
Then again, the Dhrystone score is also not strongly affected by indexed load/store or jumbo prefixes, as seemingly it doesn't rely all that much on arrays, or on large integer or displacement values.
Ironically, there seems to be a lot of overlap between what is present or absent in RISC-V, and what one needs to get a high score in Dhrystone.
But, it is a very different situation if one is looking at the Doom engine or similar (where, admittedly, I was often more tuning stuff with Doom as a benchmark; it has good enough coverage and can usually point out if something is broken).
But, as-is:
   Dhrystone: Roughly break even;
   Doom, XG2 is around 30-40% faster than RV64G;
   Quake, XG2 is around 15-20% faster;
   GLQuake, Somewhere around 500%-1000% delta
     TKRA-GL is horridly slow with RV64G
     Likely lack of SIMD is a deal-breaker for this.
       Generic case mostly falls back to 4-float structs.
     Further testing with RV64 not likely worthwhile.
       More likely, until I can build the RV-side parts of GL as XG3;
       When XG3 has full SIMD support re-added.
         (BGBCC treats XG3 as an extended RISC-V).
Untested (with an RV build):
   Heretic: Probably similar to Doom (mostly similar code);
   Hexen: Basically Doom, but slower;
     I suspect more because it often has bigger and more complex scenes;
     And some features that I don't really know how they work.
   ROTT: Hassle...
     I still don't have ROTT fully debugged;
     "What if we took Wolf3D engine and made it the size of Quake3?"
      Depended on out-of-bounds array accesses behaving in certain ways;
       Mostly "fixed" by masking/checks/wrappers.
     Code often sensitive to integer promotion and overflow/wrap;
       Various past issues located to overflow-dependent logic;
       And edge cases where the underlying 64-bit was leaking through.
     ...
   Quake3: Probably similar situation to GLQuake.
     I don't really expect an RV64 build to perform well at 50MHz.
For Quake3, roughly need to crank the virtual clock-speed up to around 150 or 200 MHz before it performs well.
GLQuake would perform OK at 75 MHz, if this could be pulled off without compromising other stuff (L1 caches or increasing instruction latency).
A 32-bit CPU could run at a faster clock-speed, but would have a lot of other drawbacks, which is why I am still not doing so.
For both GLQuake and Quake3, a certain chunk of the time goes into the OpenGL implementation.
There is a rasterizer module, but generally the OpenGL implementation can't keep it fed. It might help if the rasterizer did transforms and was perspective-correct, though this is technically a harder problem than basic edge-walking with affine texturing.
Not really sure how PlayStation1 managed, but in any case, probably wasn't using OpenGL...
Had noted before though that, even if the OpenGL implementation doesn't actually draw anything, GLQuake was still slow. General process of walking the BSP, gathering up a list of draw-surfaces for the PVS, and then feeding them through "glBegin()"/"glEnd()", was slow...
There is the theoretical option of building up big vertex arrays and only rebuilding them when the camera moves into a different leaf (thus changing the PVS). But, this would be a more significant change to Quake.
My Minecraft-like BTMini3 engine had built big quad arrays and primarily drew the scene by drawing the vertex array all at once (which, also, avoids needing to essentially make an RPC call every time the 3D engine calls "glEnd();").

>> Seems like a mostly pointless uphill battle trying to convince anyone of
>> things that (at least to me) seem kinda obvious.
>
> Do not waste your time teaching pigs to put on lipstick. ...
Theoretically, people who are working on trying to improve performance should also see the obvious things, namely that the primary issues negatively affecting performance are:
   The lack of Register-Indexed Load/Store;
   Cases where immediate and displacement fields are not big enough;
   Lack of Load/Store Pair.
If you can fix a few 10%+ issues, this will save a whole lot more than focusing on 1% issues.
Better to go to the 1% issues *after* addressing the 10% issues.
If 20-30% of the active memory accesses are for arrays, and one needs to do SLLI+ADD+Ld/St, this sucks.
If your Imm12 fails, and you need to do:
   LUI+ADDI+Op
This also sucks.
If your Disp12 fails, and you do LUI+ADD+Ld/St, likewise.
They can argue that, with Zba, we can do:
   SHnADD+Ld/St
But, this is still worse than a single Ld/St.
Though, will admit that if one is accessing an array inside a struct, pretending that it is [Rb+Ri*Sc+Disp] and then cracking it into SHnADD+Ld/St, or LEA.x+MOV.x in BJX2, still works out as effective.
If these issues are addressed, there is around a 30% speedup, even with a worse compiler.
