Re: Misc: Ongoing status...

Subject : Re: Misc: Ongoing status...
From : cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups : comp.arch
Date : 01 Feb 2025, 02:56:16
Organization : A noiseless patient Spider
Message-ID : <vnjv02$3p9g3$1@dont-email.me>
References : 1 2 3 4
User-Agent : Mozilla Thunderbird
On 1/31/2025 1:30 PM, MitchAlsup1 wrote:
On Fri, 31 Jan 2025 6:50:24 +0000, BGB wrote:
 
On 1/30/2025 5:48 PM, MitchAlsup1 wrote:
On Thu, 30 Jan 2025 20:00:22 +0000, BGB wrote:
>
So, recent features added to my core ISA: None.
Reason: Not a whole lot that brings much benefit.
>
>
Have ended up recently more working on the RISC-V side of things,
because there are still gains to be made there (stuff is still more
buggy, less complete, and slower than XG2).
>
>
On the RISC-V side, did experiment with Branch-compare-Immediate
instructions, but unclear if I will carry them over:
   Adds a non-zero cost to the decoder;
     Cost primarily associated with dealing with a second immed.
   Effect on performance is very small (< 1%).
>
I find this a little odd--My 66000 has a lot of CMP #immed-BC
a) so I am sensitive as this is break-even wrt RISC-V
b) But perhaps the small gain is due to something about
.. how the pair runs down the pipe as opposed to how the
.. single runs down the pipe.
>
>
Issue I had seen is mostly, "How often does it come up?":
Seemingly, around 100-150 or so instructions between each occurrence on
average (excluding cases where the constant is zero; comparing with zero
being more common).
>
What does it save:
Typically 1 cycle that might otherwise be spent loading the value into a
register (if this instruction doesn't end up getting run in parallel
with another prior instruction).
>
>
In the BGBCC output, the main case where it comes up is in "for()"
loops (followed by the occasional if-statement), so one might expect
this would increase its probability of having more of an effect.
>
But, seemingly, not enough tight "for()" loops and similar in use for it
to have a more significant effect.
>
So, in the great "if()" ranking:
   if(x COND 0) ...   //first place
   if(x COND y) ...   //second place
   if(x COND imm) ... //third place
>
However, a construct like:
   for(i=0; i<10; i++)
     { ... }
Will emit two of them, so they are not *that* rare either.
 Since the compiler can see that the loop is always executed, the first/top
checking CMP-BC need not be emitted, leaving only 1.
 
Yes, in theory.
I had tried this before, but the logic was sometimes triggering on loops where this was incorrect (the pattern matching wasn't precise enough).
IIRC, I ended up disabling the logic, as the entry-side test doesn't usually have much impact on the overall running time of the loop.
It was similar to loop unrolling:
Turning "for(i=0; i<3; i++) { x }" into 3 copies of 'x' with a constant for 'i' isn't that hard, in itself. The harder problem is verifying the absence of any wacky control flow or modification of 'i' within the loop body.
Though, I guess in theory one could add some more generic AST walking "InferCheckForProhibitedAction()" type function, probably with a structure that describes some categories of prohibited actions (say, disallowing 'break', 'continue', 'goto', or any assignment to 'i').
In the past, I had dealt with some similar cases by generating specialized sets of AST-walking infer-whatever functions; but this strategy doesn't scale very well.
In my BS2 language, I had instead used 'static' to explicitly request loop unrolling. Say, "static for(i=0; i<3; i++) { x }", where in this case the compiler doesn't have to infer anything to unroll the loop.
Kinda similar reasons for why BGBCC mostly lacks function inlining.
Though, IIRC, I did add a limited form of inlining for functions matching the pattern:
   type foo(args)
     { return expr_clean; }
Where an expression is "clean" if it can be verified to not have any potential side effects (and no function calls or explicit memory accesses).
While, say, "foo->x" or "*ptr" by themselves are unlikely to have direct side-effects, they can have side effects if the address is invalid.
Say, "if(ptr && *ptr)", the conditional expression can't be considered clean because the RHS depends on ptr being non-NULL.
Whereas: "if((x>0) && (x<10))", the conditional expression could be considered as clean (say, in this case, allowing the && to be turned into a bitwise operation).
Though, in the BGBCC frontend stages, it adds &&& and !&&& virtual operators to reflect cases where && has been turned into a boolean bitwise operator (can't use & as this would cause semantic ambiguity).
In the backend, the final operation will use normal bitwise AND, since this distinction is no longer relevant by the time it reaches 3AC form.
Where, as I can note, it is often faster to turn these sorts of compound-conditionals into bitwise operations with a single branch at the end, rather than use multiple branch operations.
So, in this case, the && and || short-circuit behavior is limited to cases where:
  The expression is unclean;
  Or, the expression has exceeded a sensible size limit (in which case short-circuit semantics may be faster).
It is a minor annoyance that plain RV64 lacks some of the operations needed to do this well. I had added a few instructions via jumbo prefixes:
   SEQ, SEQI, SNE, SNEI, SGE, SGEI, STST, STSTI, SNTST, SNTSTI
These overload the SLT/SLTI base operations with an extended opcode.
Where:
   SLT  : Rn = Rs <  Rt
   SGE  : Rn = Rs >= Rt
   SEQ  : Rn = Rs == Rt
   SNE  : Rn = Rs != Rt
   STST : Rn = (Rs & Rt) != 0
   SNTST: Rn = (Rs & Rt) == 0
Where, say, to otherwise mimic SEQ and SNE in RV64, one needs, say:
   SEQ:
     SUB   Rs, Rt, R5
     SLTUI R5, 1, Rn
   SNE:
     SUB   Rs, Rt, R5
     SLTU  R5, 1, R5
     XOR   R5, 1, Rn
So, the jumbo-form encodings still come out ahead.
In BJX2, these exist as 32-bit encodings (CMPxx).

Still, a lot rarer in use than:
   val=ptr[idx];
Though...
>
Have noted though that simple constant for-loops are a minority, far
more often they are something like:
   for(i=0; i<n; i++)
     { ... }
Which doesn't use any.
>
Or:
   while(i--)
     { ... }
Which uses a compare with zero (in RV, can be encoded with the zero
 I should note:: I have a whole class of conditional branches that
include comparison to 0, {(signed), (unsigned), (float), (double)}
and all 6 arithmetic comparands and auxiliary comparisons for NaNs
and Infinities.
 
I generally only have them for 64-bit integer comparison.
Could almost leverage them for floating point operations (since relative comparison to zero still works the same with Binary64, or Binary32 if sign-extended), except that doing so violates IEEE rules regarding NaN and Inf.
Lazy option: could almost make a case for an "-ffpu-ignore-compare-nans" compiler option or similar, which tells the compiler to essentially pretend Inf and NaN don't exist.
Then again, I already basically have most of the logic in place to support floating-point branch and 3R floating compare immediate, would just need to define some encodings (probably via jumbo-prefixed instructions).

register; in BJX2 it has its own dedicated instruction due to the lack
of zero register; some of these were formally dropped in XG3 which does
have access to a zero register, and encoding an op using a ZR instead is
considered as preferable).
 I choose not to waste a register to hold zero. Once you have universal
constants it is unnecessary.
------------------
Well, it is a tradeoff:
   A zero register, wasting a register index;
   Dedicated zero-case instructions;
   Gluing an immediate onto an instruction
     (in both BJX2 and RV+Jx, these require a jumbo prefix);
   Burning 1 or 2 bits on every instruction to specify an immediate form.
BJX2 proper lacked an architectural zero register (an implicit internal ZR exists, but it may not be encoded directly). In some cases, this may require having instructions that exist merely because of the lack of a zero register.
RISC-V does have an explicit zero register;
XG3 gains a ZR mostly as a side effect of operating in the same register space as RISC-V.
Unlike RISC-V, using R0 as a base register is understood to mean PC.
But, because it has a zero register, some of the instructions that existed merely to compensate for the lack of one become essentially redundant.
In XG3, R32..R63 are usable as GPRs, but the RV ABI specifies them as, essentially 20 scratch / 12 callee save (in the F0..F31 mapping).
So, in the whole 64 GPR space, still only 24 callee save registers.
Not a great scenario for BGBCC.

Huawei had a "less bad" encoding, but they burnt basically the entire
User-1 block on it, so that isn't going to fly.
>
Generally, around 95% of the function-local branches can hit in a Disp9,
vs 98% for Disp12. So, better to drop to Disp9.
 DISP16 reaches farther...
 
But...
Disp16 is not going to fit into such a 32-bit encoding...
But, say, 16+6+5+3 = 30.
Would have burned the entire 32-bit encoding space on BccI ...
Usual answer in RV land is that if the Disp12 (or Disp9 in this scenario) fails, the condition is inverted and the following instruction is a JAL to the label.
Say:
   BNEI  Imm6, Rs, .L0
   JAL   Label, R0
   .L0:
Granted, given my 64-bit BccI encoding still only has a 12-bit displacement, it is not immune to the need for this trick.
In the no-immediate case, it is a Disp33s (so, no JAL needed, but one would need to load the value into a register). Granted, it drops to Disp23 if one tries to encode it with R32..R63.
In XG3's encoding scheme, a similar construct would give:
   Bcc  Imm17s, Rs, Disp10s
Or:
   Bcc  Rt, Rs, Disp33s
But, where Bcc can still encode R0..R63.
It is possible that a 96-bit encoding could be defined:
   Bcc Imm26s, Rs, Disp33  //RV+Jx
   Bcc Imm30s, Rs, Disp33  //XG3
But, debatable if it would matter enough to make sense to allow this case in the decoders (and would essentially require routing the contents of both jumbo prefixes into the Lane1 decoder).

------------------
 
>
I suggest a psychiatrist.
>
>
People are pointing to charts gathered by mining binaries and being
like: "X10 and X11 are the two most commonly used registers".
>
But, this is like pointing at x86 and being like:
"EAX and ECX are the top two registers, who needs such obscure registers
as ESI and EDI"?...
>
Quit listening to them, use your own judgement.
>
>
When I defined my own version of BccI (with a 64-bit encoding), how many
new instructions did I need to define in the 32-bit base ISA: Zero.
>
How many 64-bit encodings did My 66000 need:: zero.
{Hint the words following the instruction specifier have no internal
format}
>
>
I consider the combination of Jumbo-Prefix and Suffix instruction to be
a 64-bit instruction.
 I consider a multi-word instruction to have an instruction-specifier
as the first 32-bits, and everything that follows is an attached
constant.
 The only "prefixes" I have are CARRY and PREDication.
 
For both BJX2 and the RV_Jx scheme, they are handled as prefixes.
Granted, I understand a prefix as being fetched and decoded at the same time as the instruction it modifies.
Some people seem to imagine prefixes as executing independently and then setting up some sort of internal registers which carry state over to the following instruction.
Which is, not how it works in my case...
So, for example, an interrupt somehow happening between the prefix and the instruction it modifies... Just straight up can't happen.
Granted, I usually assume either 64 or 96 bit fetch, not multiple 32-bit fetches.

-----------------------
 
However, have noted that XG3 does appear to be faster than the original
Baseline/XG1 ISA.
>
>
Where, to recap:
   XG1 (Baseline):
     16/32/64/96 bit encodings;
       16-bit ops can access R0..R15 with 4b registers;
         Only 2R or 2RI forms for 16-bit ops;
         16-bit ISA still fairly similar to SuperH.
     5-bit register fields by default;
       6-bit available for an ISA subset.
     Disp9u and Imm9u/n for most immediate form instructions;
     32 or 64 GPRs, Default 32.
     8 argument registers.
   XG2:
     32/64/96 bit encodings;
       All 16-bit encodings dropped.
     6-bit register fields (via a wonky encoding);
     Same basic instruction format as XG1,
       But, 3 new bits stored inverted in the HOB of instr words;
     Mostly Disp10s and Imm10u/n;
     64 GPRs native;
     16 argument registers.
   XG3:
     Basically repacked XG2;
       Can exist in same encoding space as RISC-V ops;
       Aims for ease of compatibility with RV64G.
     Encoding was made "aesthetically nicer"
       All the register bits are contiguous and non-inverted;
       Most immediate fields are also once again contiguous;
       ...
     Partly reworks branch instructions;
       Scale=4, usually relative to BasePC (like RV);
     Uses RV's register numbering space (and ABI);
       Eg: SP at R2 vs R15, ...
       (Partly carried over from XG2RV, which is now defunct).
     64 GPRs, but fudged into RV ABI rules;
       Can't rebalance ABI without breaking RV compatibility;
          Breaking RV compatibility would defeat its reason for existing.
     8 argument registers (because of RV ABI).
        Could in theory expand to 16, but this would cause issues.
     Despite being based on XG2,
       BGBCC treats XG3 as an extension to RISC-V.
>
>
Then, RV:
   16/32; 48/64/96 (Ext)
   Has 16-bit ops:
     Which are horribly dog-chewed,
       and only manage a handful of instructions.
     Many of the ops can only access X8..X15;
     With GCC, enabling RVC saves around 20% off the ".text" size.
   Imm12s and Disp12s for most ops;
    Lots of dog-chew in the encodings (particularly Disp fields);
     JAL is basically confetti.
   ...
 My 66000
     32-bit instruction specifier
     if( inst[31..29] == 3b'001 )
         switch( inst[28..26] )
         { // groups with large constants
          case 3b'001 [Rbase+Rindex] memory reference instructions
          case 3b'010 2-operand calculation instructions
          case 3b'100 3-operand calculation instructions
          case 3b'101 1-operand instructions
         }
     else
         switch( inst[31..29] )
         {  // 1 word instructions
          case 3b'010 LOOP instruction
         case 3b'011 Branch instruction
          case 3b'100 LD disp16
          case 3b'101 ST disp16
          case 3b'110 integer imm16
          case 3b'111 logical imm16
         }
 Other than minor updates to the constant decoding patterns, this has been
stable since 2012.
 
OK.
As noted, I now have XG1/XG2/XG3... with RISC-V glued on.
This is getting a little messy.

In its basic form, RV is the worst performing option here, but people
actually care about RISC-V, so supporting it is value-added.
 Imagine that, an ISA that requires more instructions takes more cycles
!?!
 
Yeah...
Generally, fewer instructions, fewer cycles, assuming one doesn't have "high complexity" instructions (that take a lot of cycles to perform).
RV is annoying, as it often requires too many "simple instructions", but still ends up requiring the implementation to have some amount of big/slow instructions as well.
Where, an instruction doesn't save much if it is as slow as or slower than it would be to call a runtime support function that does the same thing in software.
Ironically though, the GCC and Clang people, and the RV people, are seemingly also averse to scenarios that involve using implicit runtime calls.
Granted, looking at it, I suspect things like implicit runtime calls (or call-threaded code) would be a potential "Achilles' heel" for GCC performance, as its register allocation strategy seems to prefer using scratch registers and then spilling them around function calls (rather than callee-save registers, which don't require a spill).
So, if one emits chunks of code that are basically end-to-end function calls, they may perform more poorly than they might have otherwise.
One might think, "Of course not, their register allocator is clever enough to switch to using the callee save registers in this case".
But, then I have made an observation:
My emulator, which consists of a whole lot of this code, performs noticeably worse if compiled with GCC than if compiled with MSVC (where MSVC also leans more towards callee-save registers, and the Win64 ABI also has a more "callee save favorable" balance vs the SysV/AMD64 ABI).
Like, GCC and scratch-dominant ABIs:
  Best case if the hot path is mostly inside leaf functions;
  Not so great if the hot path is in a deeply tangled nest of call-dominated code (worse still if this call-dominated code has large numbers of local variables in use).
But, if leaning the other way, the function calls don't necessarily mean needing to spill a bunch of stuff onto the stack and then reload it later (all of the relevant saving/restoring having mostly been consolidated into being one-off in the prologs and epilogs).
In this latter case though, this still leaves the relative "velocity" by which the program goes up and down the call stack.
Load/Store Pair mostly helps here. One could almost argue for a Load/Store Quad, except my core doesn't have enough register ports or a wide enough memory interface to make this possible (and Load/Store Triple would just be weird).

--------------
Seems like a mostly pointless uphill battle trying to convince anyone of
things that (at least to me) seem kinda obvious.
>
Do not waste your time teaching pigs to put on lipstick. ...
>
>
Theoretically, people who are working on trying to improve performance,
should also see obvious things, namely, that the primary issues
negatively affecting performance are:
   The lack of Register-Indexed Load/Store;
   Cases where immediate and displacement fields are not big enough;
   Lack of Load/Store Pair.
>
If you can fix a few 10%+ issues, this will save a whole lot more than
focusing on 1% issues.
>
Better to go to the 1% issues *after* addressing the 10% issues.
>
>
If 20-30% of the active memory accesses are for arrays, and one needs to
do SLLI+ADD+Ld/St, this sucks.
>
If your Imm12 fails, and you need to do:
   LUI+ADDI+Op
This also sucks.
>
If your Disp12 fails, and you do LUI+ADD+Ld/St, likewise.
>
They can argue, but with Zba, we can do:
   SHnADD+Ld/St
But, this is still worse than a single Ld/St.
 Imagine accessing an external array with 64-bit virtual address space::
RISC-V
       AUIPC   Rt,hi(GOT[#k])
       LDD     Rt,lo(GOT[#k])[Rt]
       SLL     Rs,Rindex,#3
       ADD     Rt,Rt,Rs
       LDD     Rt,0[Rt]
5 instruction words, 2 data words.
 My 66000
       LDD     Rt,[IP,,GOT[#k]]
       LDD     Rt,[Rt,Ri<<3]
3 instruction words, 0 data words.
------------------------------
 
As is, XG2 and XG3 could also encode these cases in two instructions.
   MOV.Q (PC, Disp33), Rt
   MOV.Q (Rt, Ri), Rt
...

If these issues are addressed, there is around a 30% speedup, even with
a worse compiler.
