On 1/30/2025 5:48 PM, MitchAlsup1 wrote:
> On Thu, 30 Jan 2025 20:00:22 +0000, BGB wrote:
So, recent features added to my core ISA: None.
Reason: Not a whole lot that brings much benefit.
>
>
Have ended up recently more working on the RISC-V side of things,
because there are still gains to be made there (stuff is still more
buggy, less complete, and slower than XG2).
>
>
On the RISC-V side, did experiment with Branch-compare-Immediate
instructions, but unclear if I will carry them over:
Adds a non-zero cost to the decoder;
Cost primarily associated with dealing with a second immed.
Effect on performance is very small (< 1%).
I find this a little odd--My 66000 has a lot of CMP #immed-BC:
a) so I am sensitive, as this is break-even wrt RISC-V;
b) but perhaps the small gain is due to something about how the
   pair runs down the pipe as opposed to how the single runs
   down the pipe.
>
Issue I had seen is mostly, "How often does it come up?":
Seemingly, around 100-150 or so instructions between each occurrence on
average (excluding cases where the constant is zero; comparing with zero
being more common).
>
What does it save:
Typically 1 cycle that might otherwise be spent loading the value into a
register (if this instruction doesn't end up getting run in parallel
with another prior instruction).
>
>
In the BGBCC output, it comes up primarily in "for()" loops (followed
by the occasional if-statement), so one might expect this to increase
its probability of having more of an effect.
>
But, seemingly, not enough tight "for()" loops and similar in use for it
to have a more significant effect.
>
So, in the great "if()" ranking:
if(x COND 0) ... //first place
if(x COND y) ... //second place
if(x COND imm) ... //third place
>
However, a construct like:
for(i=0; i<10; i++)
{ ... }
Will emit two of them, so they are not *that* rare either.
Still, a lot rarer in use than:
  val=ptr[idx];
Though...
>
Have noted, though, that simple constant for-loops are a minority;
far more often they are something like:
for(i=0; i<n; i++)
{ ... }
Which doesn't use any.
>
Or:
while(i--)
{ ... }
Which uses a compare with zero (in RV, can be encoded with the zero
register; in BJX2 it has its own dedicated instruction due to the lack
of a zero register; some of these were formally dropped in XG3, which
does have access to a zero register, where encoding the op using ZR
instead is considered preferable).
Huawei had a "less bad" encoding, but they burnt basically the entire
User-1 block on it, so that isn't going to fly.
>
Generally, around 95% of the function-local branches can hit in a Disp9,
vs 98% for Disp12. So, better to drop to Disp9.
Quit listening to them, use your own judgement.
>
I suggest a psychiatrist.
>
People are pointing to charts gathered by mining binaries and being
like: "X10 and X11 are the two most commonly used registers".
>
But, this is like pointing at x86 and being like:
"EAX and ECX are the top two registers, who needs such obscure registers
as ESI and EDI"?...
>
When I defined my own version of BccI (with a 64-bit encoding), how many
new instructions did I need to define in the 32-bit base ISA: Zero.
>
I consider a multi-word instruction to have an instruction-specifier.
How many 64-bit encodings did My 66000 need: zero.
{Hint: the words following the instruction specifier have no internal
format}
>
I consider the combination of Jumbo-Prefix and Suffix instruction to be
a 64-bit instruction.
However, have noted that XG3 does appear to be faster than the original
Baseline/XG1 ISA.
>
>
Where, to recap:
XG1 (Baseline):
16/32/64/96 bit encodings;
16-bit ops can access R0..R15 with 4b registers;
Only 2R or 2RI forms for 16-bit ops;
16-bit ISA still fairly similar to SuperH.
5-bit register fields by default;
6-bit available for an ISA subset.
Disp9u and Imm9u/n for most immediate form instructions;
32 or 64 GPRs, Default 32.
8 argument registers.
XG2:
32/64/96 bit encodings;
All 16-bit encodings dropped.
6-bit register fields (via a wonky encoding);
Same basic instruction format as XG1,
But, 3 new bits stored inverted in the HOB of instr words;
Mostly Disp10s and Imm10u/n;
64 GPRs native;
16 argument registers.
XG3:
Basically repacked XG2;
Can exist in same encoding space as RISC-V ops;
Aims for ease of compatibility with RV64G.
Encoding was made "aesthetically nicer"
All the register bits are contiguous and non-inverted;
Most immediate fields are also once again contiguous;
...
Partly reworks branch instructions;
Scale=4, usually relative to BasePC (like RV);
Uses RV's register numbering space (and ABI);
Eg: SP at R2 vs R15, ...
(Partly carried over from XG2RV, which is now defunct).
64 GPRs, but fudged into RV ABI rules;
Can't rebalance the ABI without breaking RV compatibility;
Breaking RV compatibility would defeat its reason for existing.
8 argument registers (because of the RV ABI).
Could in theory expand to 16, but this would cause issues.
Despite being based on XG2,
BGBCC treats XG3 as an extension to RISC-V.
>
>
Then, RV:
16/32; 48/64/96 (Ext)
Has 16-bit ops:
Which are horribly dog-chewed,
and only manage a handful of instructions.
Many of the ops can only access X8..X15;
With GCC, enabling RVC saves around 20% off the ".text" size.
Imm12s and Disp12s for most ops;
Lots of dog-chew in the encodings (particularly the Disp fields);
JAL is basically confetti.
...
In its basic form, RV is the worst-performing option here, but people
actually care about RISC-V, so supporting it is value-added.
Seems like a mostly pointless uphill battle trying to convince anyone of
things that (at least to me) seem kinda obvious.
>
Do not waste your time teaching pigs to put on lipstick. ...
>
Theoretically, people who are working on trying to improve performance
should also see the obvious things, namely that the primary issues
negatively affecting performance are:
The lack of Register-Indexed Load/Store;
Cases where immediate and displacement fields are not big enough;
Lack of Load/Store Pair.
>
If you can fix a few 10%+ issues, this will save a whole lot more than
focusing on 1% issues.
>
Better to go to the 1% issues *after* addressing the 10% issues.
>
>
If 20-30% of the active memory accesses are for arrays, and one needs
to do SLLI+ADD+Ld/St, this sucks.
>
If your Imm12 fails, and you need to do:
LUI+ADDI+Op
This also sucks.
>
If your Disp12 fails, and you do LUI+ADD+Ld/St, likewise.
>
They can argue, but with Zba, we can do:
SHnADD+Ld/St
But, this is still worse than a single Ld/St.
If these issues are addressed, there is around a 30% speedup, even with
a worse compiler.