Re: Misc: Ongoing status...

Subject: Re: Misc: Ongoing status...
From: cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups: comp.arch
Date: 01 Feb 2025, 23:42:39
Organization: A noiseless patient Spider
Message-ID: <vnm813$agrn$1@dont-email.me>
References: 1 2 3 4 5 6
User-Agent: Mozilla Thunderbird
On 1/31/2025 10:05 PM, MitchAlsup1 wrote:
On Sat, 1 Feb 2025 1:56:16 +0000, BGB wrote:
 
On 1/31/2025 1:30 PM, MitchAlsup1 wrote:
>
Generally, around 95% of the function-local branches can hit in a Disp9,
vs 98% for Disp12. So, better to drop to Disp9.
>
DISP16 reaches farther...
>
>
But...
>
Disp16 is not going to fit into such a 32-bit encoding...
 It fit in mine !
 
Specifically, a "Bcc Imm, Reg, Disp" style instruction...
It is not so hard on other types of instructions.
XG2 (and XG3) has BRA/BSR with 23 bits.
And, BT/BF can be synthesized from ?T and ?F.
It is 20 bits in XG1.
Traditional way it was done in BJX2:
   CMPxx  Imm10, Rn
   BT/BF  Disp20
Traditional way it is done in RV (with RV operand ordering):
   LI     Rt, Imm12
   Bcc    Disp12, Rs, Rt
( Note: my sometimes-inconsistent operand ordering is because RV mode in BGBCC uses the same ordering as BJX2, which differs from the traditional RV ordering; it is not always obvious which one I should use. )
Unlike RISC-V, there was no register field for BRA/BSR, but IMHO this just burns bits, as the register for JAL is hardly ever anything other than X0 or X1.
Theoretically, the X5 register could be used as a temporary link register for runtime support functions but, seemingly, nothing does this AFAICT (all of the support functions tending to use the normal C ABI).

But, say, 16+6+5+3 = 30.
Would have burned the entire 32-bit encoding space on BccI ...
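( As a rough sketch of the budget, reading the 16+6+5+3 as Disp16 + Imm6 + a 5-bit register + a 3-bit condition field (that split is my interpretation; only the 30-bit total comes from the line above):
   Disp16 : 16 bits
   Imm6   :  6 bits
   Rs     :  5 bits
   cond   :  3 bits
            -------
            30 bits, leaving only 2 bits of a 32-bit word for the major opcode. )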
 Which is why  one does not do CMP-BC in one instruction !
The best you can do only covers 60%-odd of the cases.
-------------------
Or, at least possibly not one 32-bit instruction.
Possibly "jury still out" on using a 64-bit encoding.
   It would likely still be a value-add alongside a 32-bit encoding.
   Could still mop up some of the cases a 32-bit encoding would miss.
But, as I see it, the 64-bit encoding is likely to be "more useful" on average vs a 32-bit encoding, for performance, if one has to choose one or the other.
But, yeah, for global stats, Imm6s has around a 52% hit-rate, and Disp9 a 90% hit rate for local branches.
Would combine to a 47% hit-rate solely based on global stats.
Imm7s/Disp8s would be 66%, and 80%, combining to 53%; so possibly might be slightly better.
Granted, yeah, hitting for 47% or 53% of something that is around 0.7% of the instructions, isn't really much of a win...
Still a little better than Imm5u though, at ~ 39%...
The 3-bit Rs1' would reduce register hit rate to around 18%.
   In BGBCC output, heavyweights being X8, X9, X10, X11;
     X12 to X15, rapid drop off.
The Disp12 has around a 97% hit rate for local branches (pretty good, but doesn't matter if the other hit rates suck).
These combine to around 7%.
And, 7% of 0.7%, is a total of 0.049% of the instructions; straight up not worth bothering...
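( Spelling out how these combine, assuming the per-field hit rates are roughly independent and can simply be multiplied:
   0.52 * 0.90        ~= 0.47   (Imm6s  with Disp9)
   0.66 * 0.80        ~= 0.53   (Imm7s  with Disp8s)
   0.39 * 0.18 * 0.97 ~= 0.068  (Imm5u, 3-bit Rs1', Disp12)
   0.068 * 0.007      ~= 0.00049, i.e. ~0.049% of the total instruction mix. )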
I, personally, would much rather have an instruction (if any at all) that hits 47% or 53% of the time, than one that hits 7% of the time.
An instruction that can only hit 7% of the cases that could theoretically apply to it, is clutter that would make the ISA worse.
Though, if one accounts for how the registers are used:
Most of the X10 and X11 traffic is in the first 2 function arguments and the return value;
BGBCC doesn't use these for variables (except in leaf functions), and instead allocates registers for variables starting at the higher-numbered registers, though leaf functions will use the argument registers as the arguments they represent.
Accounting for the percentage of leaf functions (16%), ..., this reduces the effective hit-rate for Rs1' down to 11%, and the hit-rate for that encoding to 4%. So, even more in the "useless clutter" territory.

>
In XG3's encoding scheme, a similar construct would give:
   Bcc  Imm17s, Rs, Disp10s
Or:
   Bcc  Rt, Rs, Disp33s
But, where Bcc can still encode R0..R63.
>
It is possible that a 96-bit encoding could be defined:
   Bcc Imm26s, Rs, Disp33  //RV+Jx
   Bcc Imm30s, Rs, Disp33  //XG3
 Having not found a function that takes ¼GB of space, I remain
comfortable with 28-bit branch displacement range. I also have
CALL instructions that reach 32-bit or 64-bit VAS.
------------------------------
In both ISAs, the Jumbo_Imm prefix behavior is fairly naive:
   It glues what bits it has onto the immediate field (sketched in C after this list)
     XG2/XG3: 24 bits from Prefix, 9 bits from instruction
       It falls back to XG1's Imm9/Disp9 encoding for this.
       Theoretically, this leaves 4 extra bits remaining.
       The prior XG2 WO/Sign-Ext bit was repurposed to encode Scale=1.
   I stuck with 33 bits as it is a nice magic number of bits:
     33 bits hits significantly more constants than 32/31/30;
     The gains for going from 33 to 35 or 36 are negligible.
       Need to go all the way to 48 or 64 to see much gain.
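A rough C sketch of the gluing (the exact bit placement and the helper name are my assumptions; only the 24+9=33 bit counts come from the above):
   /* Illustrative only (assumes <stdint.h>): form a sign-extended 33-bit
      immediate from a 24-bit jumbo-prefix payload glued above a 9-bit
      Imm9/Disp9 base field. */
   static int64_t glue_imm33(uint32_t pfx24, uint32_t imm9)
   {
      uint64_t raw = ((uint64_t)(pfx24 & 0xFFFFFF) << 9) | (imm9 & 0x1FF);
      return ((int64_t)(raw << 31)) >> 31;  /* sign-extend from bit 32 */
   }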
Contrary to popular belief, the statistical distribution of immediate values is not a simple bell curve; rather, there seem to be bumps:
   1-7 bits, mostly follows a bell-curve;
   8/9 bits sees a bump;
   16/17 also sees a big bump;
   32/33 also sees a big bump;
   48 bits is a small bump (likely mostly due to MMIO addresses).
   64 bits is the final big bump.
Sign/zero extended 34-46 bits is basically a no-man's land where there is crap-all in terms of constants.
A direct native 64-bit immediate path would add cost and not gain much over the 33-bit path (which breaks any 64-bit immediate values across 2 lanes).
I haven't had much reason to specially tweak the handling of branch displacements to be different than the handling of normal immediate values.
So, yes, granted, a 33-bit branch displacement is technically overkill.
The Jumbo_Op and J21O prefixes don't currently have any concept of extending two immediate fields, so the most obvious choice is to use 2 jumbo prefixes and have the decoder somehow understand that an Imm64/Disp64 encoding is N/A for the instruction, and that it should instead give two immediate values.
However, one prefix follows the pattern of extending the Disp (to Disp33), and the other extends Rt/Ro to 30 bits, which is how it ends up working.
While technically, it would make more sense to flip them and have, say:
   Bcc Imm33s, Rs, Disp30s
This would be a more convoluted decoding, and the additional MUX'ing needed to swap the immediate fields in decode won't come free (in the current implementation, the branch displacement must come via Lane 1; so it isn't something as simple as swapping via the Rt vs Ry register ports).
Or, unlike normal memory displacements or ALU ops, there is some extra "secret sauce" that goes into PC+Disp handling, so it doesn't quite follow the same rules.
Though, the immediate and displacement fields could more cheaply be swapped for a 96-bit Store-with-Immediate encoding; in that case it is less a matter of cost and more that it would be convoluted.
As for whether a 96-bit Branch-with-Immediate or Store-with-Immediate "actually makes sense"... Yeah...

Granted, I understand a prefix as being fetched and decoded at the same
time as the instruction it modifies.
 Instruction needs to be plural.
 
Not in BJX2 or RV+JX ...
The Jumbo prefix will only modify a single instruction; the prefix+instruction pair is treated as a larger unit for fetch/decode, with no visible effect on any other architectural state.
So, more like a REX or VEX prefix in x86-64.

Some people seem to imagine prefixes as executing independently and then
setting up some sort of internal registers which carry state over to the
following instruction.
 Instruction needs to be plural.
--------------
Ironically though, the GCC and Clang people, and RV people, are
seemingly also averse to scenarios that involve using implicit runtime
calls.
>
Granted, looking at it, I suspect things like implicit runtime calls (or
call-threaded code), would be a potential "Achilles's Heel" situation
for GCC performance, as its register allocation strategy seems to prefer
using scratch registers and then to spill them on function calls (rather
than callee-save registers which don't require a spill).
 I know of a senior compiler writer at CRAY who would argue that
callee-save registers are an anathema--and had a litany of reasons
thereto (now long forgotten by me).
 
IME, it depends more on the type of code.
   Leaf-dominant code, you want a lot of scratch/caller-save registers;
   Inner-function dominant code, you want lots of callee-save registers.
     And reasonably fast save/restore in prologs and epilogs (*1).
*1: Granted, some here had taken this as an argument for register windows; but while a register window could in theory be swapped quickly, if you are regularly needing to bulk-save/reload whatever is 4 levels up the call stack or so, this is gonna suck worse (both in hardware complexity and performance).
Though a case could be made for giving Machine/Supervisor/User their own sets of registers, even then I am not convinced this is a worthwhile tradeoff (even if bulk save/restore every time an interrupt occurs adds overhead).
Well, and with my current "optimization" of the ISR mechanism being to assume that TBR is always valid and then use this space to dump the registers, TBR needs to get set up fairly early in the kernel.
But, there is the interesting quirk that, at least with a single global VAS, the relative cost of context switches over normal interrupt handling is "basically free". The main reason to limit fully preemptive task scheduling is more that (in the absence of good synchronization primitives or any consistent use of them) rescheduling a task that was otherwise "busy with something" greatly increases the probability of memory or data structures being in an inconsistent state (and the whole OS exploding).
Rescheduling primarily on system-call operations seems to greatly reduce the probability of things being left in an inconsistent state (well, say, because code generally isn't going to use a syscall right in the middle of updating a linked list or allocation bitmap or similar).

So, if one emits chunks of code that are basically end-to-end function
calls, they may perform more poorly than they might have otherwise.
 These lower-level supervisory routines are the ones least capable of
using callee save registers in a way that saves cycles--often trading
register MOV instructions for LD instructions setting up arguments
and putting (ST) results where they can be used later.
I think it depends...
Leaf functions:
   Scratch registers win.
     More so if running time is dominated by leaf functions.
One level up, or sparse function calls:
   Can go either way;
   Scratch registers can be useful in any basic blocks that lack function calls.
Whereas, if performance is dominated by a piece of code that looks like, say:
   v0=dytf_int2fixnum(123);
   v1=dytf_int2fixnum(456);
   v2=dytf_mul(v0, v1);
   v3=dytf_int2fixnum(789);
   v4=dytf_add(v2, v3);
   v5=dytf_wrapsymbol("x");
   dytf_storeindex(obj, v5, v4);
   ...
With, say, N levels of call-graph in each called function, but with this sort of code still managing to dominate the total CPU ("Self%" time).
This seems to be a situation where callee-save registers are a big win for performance IME.
Similarly, this is a situation where GCC appears to do significantly worse, performance-wise, when compared with MSVC.
Where, seemingly, a difference is that GCC likes to use scratch registers almost exclusively, and then inline things whenever possible, but will then start spilling things to the stack if this can't be done.
MSVC, like BGBCC, seems to go more often for callee-save registers first in these scenarios (using scratch registers primarily in leaf functions), which leads to a lot less spill-and-reload.
I may be wrong on some of this, but these seem to be my observations here.
For JX2VM, some of the top functions for CPU usage tend to follow a form like:
   op=ops[0]; op->Run(ctx, op);
   op=ops[1]; op->Run(ctx, op);
   op=ops[2]; op->Run(ctx, op);
   op=ops[3]; op->Run(ctx, op);
   ...
Where, a fairly common op might take the form:
   ...
   {
      s64 val, base;
      val=ctx->regs[op->rn];
      base=ctx->regs[op->rm];
      if(op->rn==VM_REG_IMMB)
        val=op->immb;  /* Store-with-Immediate */
      VM_MemStoreQW(ctx, base+op->imm, val);
   }
VM_MemStoreQW: Generally a wrapper through a function-pointer/vtable, usually calls VM_MemStoreQW_DflI in this example (flexible to allow alternate semi-streamlined versions for Fast/NoMMU scenarios).
And, VM_MemStoreQW_DflI (somewhat simplified vs actual):
   {
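      /* (Implied parameters, judging from the caller above: ctx, addr, val.) */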
      VM_MemSpan *sp;
      s64 addr1;
      if(ctx->status)
        return; /* do nothing if an exception has been triggered. */
      /* handle if access crosses minimal-sized page boundary. */
      if(((addr+0)>>12)!=((addr+7)>>12))
      {
        /* May crack recursively down to bytes if needed.
           Load works similarly, loading each half and recombining them.
         */
        VM_MemStoreDW(ctx, addr+0, (s32)(val>> 0));
        VM_MemStoreDW(ctx, addr+4, (s32)(val>>32));
        return;
      }
      /* Translate address by the TLB, may raise TLB Miss exceptions. */
      addr1=VM_MemTranslateTLB(ctx, addr);
      if(ctx->status)
        return;
      /* Update virtual model of L1 and L2 caches.
         If virtual cache misses occur, add penalty cycles as needed.
         Generally does a bunch of stuff, not a leaf.
       */
      VM_MemUpdateCacheModelL1D(ctx, addr, addr1);
      /* lookup memory-map span for the given physical address. */
      sp=VM_MemGetSpanForAddr(ctx, addr1);
      if(!sp)
      {
        VM_ThrowFaultStatus(ctx, EXC_BADADDRESS);
        return;
      }
      /* call span handler to write value to span memory.
         Subtract a relative base and apply a mask,
         to allow wrapping address-modulo behaviors.
         For RAM: Actually a leaf function.
         For MMIO/etc: Typically not a leaf function.
           Will usually go down a few more levels.
       */
      sp->SetQW(ctx, sp, (addr1-(sp->relbase))&(sp->relmask), val);
   }
...
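As an aside, here is a rough guess at the shape of the structures the snippets above imply (using the same s64 shorthand); the field names just mirror what is used above, and the actual JX2VM definitions are presumably more involved:
   /* hypothetical reconstruction, only the fields used in the snippets */
   typedef struct VM_Context_s VM_Context;
   typedef struct VM_Opcode_s  VM_Opcode;
   typedef struct VM_MemSpan_s VM_MemSpan;
   struct VM_Opcode_s {
      int   rn, rm;           /* register-field indices                */
      s64   imm, immb;        /* displacement / extra immediate        */
      void (*Run)(VM_Context *ctx, VM_Opcode *op);  /* per-op handler  */
   };
   struct VM_MemSpan_s {
      s64   relbase, relmask; /* base/mask for address wrap/modulo     */
      void (*SetQW)(VM_Context *ctx, VM_MemSpan *sp, s64 addr, s64 val);
   };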
Where, apart from the opcode-dispatch functions, the Mem Load/Store, Cache Modeling, TLB Handling, ... functions make up the majority of the rest of the hot path.
For a simpler interpreter, the Mem Load/Store handling can be made a lot cheaper than this, but a lot of this complexity is needed to have a correctly functional virtual memory system and some semblance of being cycle-accurate.
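For example, a minimal sketch of what such a cheaper path could look like (flat guest RAM, no MMU, no cache model; ctx->ram and ctx->ram_size are assumed fields here, not JX2VM's actual names):
   /* hedged sketch, not the actual Fast/NoMMU path; assumes <string.h> */
   static void VM_MemStoreQW_Flat(VM_Context *ctx, s64 addr, s64 val)
   {
      if((addr >= 0) && ((addr + 8) <= (s64)ctx->ram_size))
         memcpy(ctx->ram + addr, &val, 8);  /* ctx->ram: byte pointer to guest RAM */
      else
         VM_ThrowFaultStatus(ctx, EXC_BADADDRESS);
   }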
Have noted that my VM gets notably better performance when compiled with MSVC than with GCC. When compiled with GCC, it kinda performs like dog crap in comparison.
Contrast, say, Dhrystone score where GCC wins over MSVC by a very significant margin (roughly 400%).
Or, basically, at least by my observations, GCC seems to do poorly in cases where the main parts of the hot path are dominated primarily by dense tangles of function calls (and this pattern appears to hold across multiple architectures, *).
*: At least within the limits of GCC vs MSVC on x86-64, or GCC vs BGBCC on RISC-V.
...
