Newsportal USENET - Re: "Mini" tags to reduce the number of op codes

On 4/9/2024 7:28 PM, MitchAlsup1 wrote:

BGB-Alt wrote:

On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
BGB wrote:
>
Seemingly:
16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code density;
32/64/96 bit instructions, with 64 GPRs, seems likely optimal for performance.
>
Where, 16 GPRs isn't really enough (lots of register spills), and 128 GPRs is wasteful (would likely need lots of monster functions with 250+ local variables to make effective use of this, *, which probably isn't going to happen).
>
16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part of GPRs AND you have good access to constants.
>

On the main ISA's I had tried to generate code for, 16 GPRs was kind of a pain as it resulted in fairly high spill rates.

Though, it would probably be less bad if the compiler was able to use all of the registers at the same time without stepping on itself (such as dealing with register allocation involving scratch registers while also not conflicting with the use of function arguments, ...).

My code generators had typically only used callee save registers for variables in basic blocks which ended in a function call (in my compiler design, both function calls and branches terminating the current basic-block).

On SH, the main way of getting constants (larger than 8 bits) was via PC-relative memory loads, which kinda sucked.

Also the blob of constants needed to be within 512 bytes of the load instruction, which was also kind of an evil mess for branch handling (and extra bad if one needed to spill the constants in the middle of a basic block and then branch over it).
Usually they were spilled between basic-blocks, with the basic-block needing to branch to the following basic-block in these cases.
Also 8-bit branch displacements are kinda lame, ...
And, if one wanted a 16-bit branch:
   MOV.W (PC, 4), R0 //load a 16-bit branch displacement
   BRA/F R0
   .L0:
   NOP // delay slot
   .WORD $(Label - .L0)
Also kinda bad...

This is slightly less bad on x86-64, since one can use memory operands with most instructions, and the CPU tends to deal fairly well with code that has lots of spill-and-fill. This along with instructions having access to 32-bit immediate values.
Yes, x86 and any architecture (IBM 360, S.E.L. , Interdata, ...) that have
LD-Ops act as if they have 4-6 more registers than they really have. x86
with 16 GPRs acts like a RISC with 20-24 GPRs as does 360. Does not really
take the place of universal constants, but goes a long way.

Yeah.

>
The vast majority of leaf functions use less than 16 GPRs, given one has
a SP not part of GPRs {including arguments and return values}. Once one starts placing things like memove(), memset(), sin(), cos(), exp(), log()
in the ISA, it goes up even more.
>

Yeah.

Things like memcpy/memmove/memset/etc, are function calls in cases when not directly transformed into register load/store sequences.
My 66000 does not convert them into LD-ST sequences, MM is a single inst-
ruction.

I have no high-level memory move/copy/set instructions.
Only loads/stores...
For small copies, can encode them inline, but past a certain size this becomes too bulky.
A copy loop makes more sense for bigger copies, but has a high overhead for small to medium copy.
So, there is a size range where doing it inline would be too bulky, but a loop caries an undesirable level of overhead.
Ended up doing these with "slides", which end up eating roughly several kB of code space, but was more compact than using larger inline copies.
Say (IIRC):
   128 bytes or less: Inline Ld/St sequence
   129 bytes to 512B: Slide
   Over 512B: Call "memcpy()" or similar.
The slide generally has entry points in multiples of 32 bytes, and operates in reverse order. So, if not a multiple of 32 bytes, the last bytes need to be handled externally prior to branching into the slide.
Though, this is only used for fixed-size copies (or "memcpy()" when value is constant).
Say:
__memcpy64_512_ua:
   MOV.Q (R5, 480), R20
   MOV.Q (R5, 488), R21
   MOV.Q (R5, 496), R22
   MOV.Q (R5, 504), R23
   MOV.Q R20, (R4, 480)
   MOV.Q R21, (R4, 488)
   MOV.Q R22, (R4, 496)
   MOV.Q R23, (R4, 504)
__memcpy64_480_ua:
   MOV.Q (R5, 448), R20
   MOV.Q (R5, 456), R21
   MOV.Q (R5, 464), R22
   MOV.Q (R5, 472), R23
   MOV.Q R20, (R4, 448)
   MOV.Q R21, (R4, 456)
   MOV.Q R22, (R4, 464)
   MOV.Q R23, (R4, 472)
...
__memcpy64_32_ua:
   MOV.Q (R5), R20
   MOV.Q (R5, 8), R21
   MOV.Q (R5, 16), R22
   MOV.Q (R5, 24), R23
   MOV.Q R20, (R4)
   MOV.Q R21, (R4, 8)
   MOV.Q R22, (R4, 16)
   MOV.Q R23, (R4, 24)
   RTS

Did end up with an intermediate "memcpy slide", which can handle medium size memcpy and memset style operations by branching into a slide.
MMs and MSs that do not cross page boundaries are ATOMIC. The entire system
sees only the before or only the after state and nothing in between. This
means one can start (queue up) a SATA disk access without obtaining a lock
to the device--simply because one can fill in all the data of a command in
a single instruction which smells ATOMIC to all interested 3rd parties.

My case, non-atomic, polling IO.
Code fragment:
while(ct<cte)
{
P_SPI_QDATA=0xFFFFFFFFFFFFFFFFULL;
P_SPI_CTRL=tkspi_ctl_status|SPICTRL_XMIT8X;
v=P_SPI_CTRL;
while(v&SPICTRL_BUSY)
v=P_SPI_CTRL;
*(u64 *)ct=P_SPI_QDATA;
ct+=8;
}
Where the MMIO interface allows sending/receiving 8 bytes at a time to avoid bogging down at around 500 K/s or so (with 8B transfers, could theoretically do 4 MB/s; though it is only ~ 1.5 MB/s with 12.5 MHz SPI).
Though, this is part of why I had ended up LZ compressing damn near everything (LZ4 or RP2 being faster than sending ~ 3x as much data over the SPI interface).
Hadn't generally used Huffman as the additional compression wasn't worth the fairly steep performance cost (with something like Deflate, it would barely be much faster than the bare SPI interface).
Did recently come up with a "pseudo entropic" coding that seems promising in some testing:
   Rank symbols by probability, sending the most common 128 symbols;
   Send the encoded symbols as table indices via bytes, say:
   00..78: Pair of symbol indices, 00..0A;
   7F: Escaped Byte
   80..FF: Symbol Index
Which, while it seems like this would likely fail to do much of anything, it "sorta works", and is much faster to unpack than Huffman.
Though, if the distribution is "too flat", one needs to be able to fall back to raw bytes.
Had experimentally written a compressor based around this scheme, and while not as fast as LZ4, it did give compression much closer to Deflate.
Where, IME, on my current main PC:
   LZMA: ~ 35 MB/s
   Bitwise range coder.
   Deflate: ~ 200 MB/s
   Huffman based, symbols limited to 15 bits.
   TKuLZ: ~ 350 MB/s
   Resembles a Deflate / LZ4 Hybrid.
   Huffman based, symbols limited to 12 bits.
   TKFLZH: ~ 500 MB/s
   Similar to a more elaborate version of TKuLZ.
   Huffman symbols limited to 13 bits.
   TKDELZ: ~ 700 MB/s
   Similar to the prior to, but:
   Splits symbols into separately-coded blocks;
   Uses an interleaved encoding scheme, decoding 4 symbols at a time.
   PSELZ: ~ 1.0 GB/s
   Uses separate symbol blocks, with the "pseudo entropic" encoding.
   RP2: ~ 1.8 GB/s
   Byte oriented
   LZ4: ~ 2.1 GB/s
Though, RP2 and LZ4 switch places on BJX2, where RP2 is both slightly faster and gives slightly better compression.
I suspect this is likely because of differences in the relative cost of byte loads and branch mispredicts.
Note that TKuLZ/TKFLZH/TKDELZ/PSELZ used a similar structure for encoding LZ matches:
   TAG
   (Raw Length)
   (Match Length)
   Match Distance
   (Literal Bytes)
Where, TAG has a structure like:
   (7:5): Raw Length (0..6, 7 = Separate length)
   (4:0): Match Length (3..33, 34 = Separate length)
Though, the former 3 were using a combination nybble-stream and bitstream.
Had considered a nybble stream for PSELZ, but ended up using bytes as bytes are faster.

As noted, on a 32 GPR machine, most leaf functions can fit entirely in scratch registers.
Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without getting
totally screwed.

OK.

On a 64 GPR machine, this percentage is slightly higher (but, not significantly, since there are few leaf functions remaining at this point).

If one had a 16 GPR machine with 6 usable scratch registers, it is a little harder though (as typically these need to cover both any variables used by the function, and any temporaries used, ...). There are a whole lot more leaf functions that exceed a limit of 6 than of 14.
The data back in the R2000-3000 days indicated that 32 GPRs has a 15%+
advantage over a 16 GPRs; while 84 had only a 3% advantage.

Probably true enough.

But, say, a 32 GPR machine could still do well here.

Note that there are reasons why I don't claim 64 GPRs as a large performance advantage:
On programs like Doom, the difference is small at best.

It mostly effects things like GLQuake in my case, mostly because TKRA-GL has a lot of functions with a large numbers of local variables (some exceeding 100 local variables).

Partly though this is due to code that is highly inlined and unrolled and uses lots of variables tending to perform better in my case (and tightly looping code, with lots of small functions, not so much...).

>
Where, function categories:
   Tiny Leaf:
     Everything fits in scratch registers, no stack frame, no calls.
   Leaf:
     No function calls (either explicit or implicit);
     Will have a stack frame.
   Non-Leaf:
     May call functions, has a stack frame.
>
You are forgetting about FP, GOT, TLS, and whatever resources are required
to do try-throw-catch stuff as demanded by the source language.
>

Yeah, possibly true.

In my case:
   There is no frame pointer, as BGBCC doesn't use one;
Can't do PASCAL and other ALOGO derived languages with block structure.

Nothing prevents having a frame pointer, just BGBCC doesn't do so as it doesn't really gain anything if one has fixed-size stack frames (and is another register to be saved/restored).
Granted, it would make stack-walking easier.
As is, one needs to use a similar strategy to how one does stack unwinding in the Win64 ABI (namely looking stuff up in a table, and then parsing the instruction sequence).

All stack-frames are fixed size, VLA's and alloca use the heap;
longjump() is at a serious disadvantage here. desctructors are sometimes hard to position on the stack.

Yeah... If you use longjmp, any VLA's or alloca's are gonna be leaked...
Nothing really mandates that longjmp + alloca not result in a memory leak though (or that alloca can't be implemented as a fancy wrapper over malloc).

   GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
   TLS, accessed via TBR.

Try/throw/catch:
   Mostly N/A for leaf functions.

Any function that can "throw", is in effect no longer a leaf function.
Implicitly, any function which uses "variant" or similar is also, no longer a leaf function.
You do realize that there is a set of #define-s that can implement try-throw-catch without requiring any subroutines ?!?

?...
Throw was implemented as a runtime call in my case.
Though, try/catch involves arcane metadata and compiler magic (basically borrowing a similar design to the WinCE and Win64 exception handling).
Though, could have in theory gone the SEH route and implemented it using hidden runtime calls and a linked list. There are pros/cons either way (and SEH would have an on-average lower overhead for C, since when it is not used, its cost would effectively drop to 0; unlike the cost of PE/COFF exception-handling tables, and needing to provide dummy exception catch/continue blobs in the off-chance exceptions were used and passed through C land).

Need for GBR save/restore effectively excludes a function from being tiny-leaf. This may happen, say, if a function accesses global variables and may be called as a function pointer.
------------------------------------------------------

One "TODO" here would be to merge constants with the same "actual" value into the same register. At present, they will be duplicated if the types are sufficiently different (such as integer 0 vs NULL).
In practice, the upper 48-bits of a extern variable is completely shared
whereas the lower 16-bits are unique.

For functions with dynamic assignment, immediate values are more likely to be used. If the code-generator were clever, potentially it could exclude assigning registers to constants which are only used by instructions which can encode them directly as an immediate. Currently, BGBCC is not that clever.
And then there are languages like PL/1 and FORTRAN where the compiler
has to figure out how big an intermediate array is, allocate it, perform
the math, and then deallocate it.

I don't expect BJX2 and FORTRAN to cross paths...

Or, say:
y=x+31; //31 only being used here, and fits easily in an Imm9.
Ideally, compiler could realize 31 does not need a register here.

Well, and another weakness is with temporaries that exist as function arguments:
If static assigned, the "target variable directly to argument register" optimization can't be used (it ends up needing to go into a callee-save register and then be MOV'ed into the argument register; otherwise the compiler breaks...).

Though, I guess possible could be that the compiler could try to partition temporaries that are used exclusively as function arguments into a different category from "normal" temporaries (or those whose values may cross a basic-block boundary), and then avoid statically-assigning them (and somehow not cause this to effectively break the full-static-assignment scheme in the process).
Brian's compiler finds the largest argument list and the largest return
value list and merges them into a single area on the stack used only
for passing arguments and results across the call interface. And the
<static> SP points at this area.

The issue isn't with stack space, this part is straightforward.
Rather, that in the IR stage, one has something like (pseduocode):
   t4 := t0 + 1;
   t5 := t1 + 2;
   t6 := func(t4, t5);
   t7 := t6 + 3;
Where, at the ASM level, one could do, say:
   ADD R8, 1, R4
   ADD R9, 2, R5
   BSR func
   ADD R2, 3, R10
But... Pulling this off without the compiler and/or compiled program exploding in the process, is easier said than done.
OTOH:
   ADD R8, 1, R11
   ADD R9, 2, R12
   MOV R11, R4
   MOV R12, R5
   BSR func
   MOV R2, R11
   ADD R11, 3, R10
Is, not quite as efficient, but despite some efforts, is closer to the current situation than to the one above. There are some cases where these optimizations can be performed, but only if the variables in question are not static-assigned, which forces the latter scenario.

Though, IIRC, I had also considered the possibility of a temporary "virtual assignment", allowing the argument value to be temporarily assigned to a function argument register, then going "poof" and disappearing when the function is called. Hadn't yet thought of a good way to add this logic to the register allocator though.

But, yeah, compiler stuff is really fiddly...
More orthogonality helps.

These parts of my compiler are a horrible mess, and rather brittle...
Partial reason there is no RISC-V support in BGBCC, is because the ABI is different, and the current design of the register allocator can't deal with a different ABI design.

Date	Sujet	#	Auteur
3 Apr 24	"Mini" tags to reduce the number of op codes	81	Stephen Fuld
3 Apr 24	Re: "Mini" tags to reduce the number of op codes	8	Anton Ertl
15 Apr 24	Re: "Mini" tags to reduce the number of op codes	6	MitchAlsup1
15 Apr 24	Re: "Mini" tags to reduce the number of op codes	5	Terje Mathisen
15 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	Terje Mathisen
15 Apr 24	Re: "Mini" tags to reduce the number of op codes	3	MitchAlsup1
16 Apr 24	Re: "Mini" tags to reduce the number of op codes	2	Terje Mathisen
16 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	MitchAlsup1
17 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	Stephen Fuld
3 Apr 24	Re: "Mini" tags to reduce the number of op codes	3	Thomas Koenig
17 Apr 24	Re: "Mini" tags to reduce the number of op codes	2	Stephen Fuld
17 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	BGB-Alt
3 Apr 24	Re: "Mini" tags to reduce the number of op codes	12	BGB-Alt
3 Apr 24	Re: "Mini" tags to reduce the number of op codes	9	MitchAlsup1
4 Apr 24	Re: "Mini" tags to reduce the number of op codes	7	Terje Mathisen
4 Apr 24	Re: "Mini" tags to reduce the number of op codes	3	Michael S
4 Apr 24	Re: "Mini" tags to reduce the number of op codes	2	Terje Mathisen
4 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	Michael S
5 Apr 24	Re: "Mini" tags to reduce the number of op codes	3	BGB-Alt
5 Apr 24	Re: "Mini" tags to reduce the number of op codes	2	MitchAlsup1
5 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	BGB
17 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	Stephen Fuld
3 Apr 24	Re: "Mini" tags to reduce the number of op codes	2	MitchAlsup1
4 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	BGB
5 Apr 24	Re: "Mini" tags to reduce the number of op codes	54	John Savard
5 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	BGB-Alt
5 Apr 24	Re: "Mini" tags to reduce the number of op codes	52	MitchAlsup1
7 Apr 24	Re: "Mini" tags to reduce the number of op codes	51	John Savard
7 Apr 24	Re: "Mini" tags to reduce the number of op codes	6	MitchAlsup1
8 Apr 24	Re: "Mini" tags to reduce the number of op codes	5	John Savard
8 Apr 24	Re: "Mini" tags to reduce the number of op codes	2	Thomas Koenig
17 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	John Savard
8 Apr 24	Re: "Mini" tags to reduce the number of op codes	2	MitchAlsup1
17 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	John Savard
7 Apr 24	Re: "Mini" tags to reduce the number of op codes	44	Thomas Koenig
7 Apr 24	Re: "Mini" tags to reduce the number of op codes	43	MitchAlsup1
8 Apr 24	Re: "Mini" tags to reduce the number of op codes	42	Thomas Koenig
8 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	Anton Ertl
9 Apr 24	Re: "Mini" tags to reduce the number of op codes	40	Thomas Koenig
9 Apr 24	Re: "Mini" tags to reduce the number of op codes	38	BGB
9 Apr 24	Re: "Mini" tags to reduce the number of op codes	37	MitchAlsup1
10 Apr 24	Re: "Mini" tags to reduce the number of op codes	36	BGB-Alt
10 Apr 24	Re: "Mini" tags to reduce the number of op codes	31	MitchAlsup1
10 Apr 24	Re: "Mini" tags to reduce the number of op codes	23	BGB
10 Apr 24	Re: "Mini" tags to reduce the number of op codes	22	MitchAlsup1
10 Apr 24	Re: "Mini" tags to reduce the number of op codes	3	BGB-Alt
10 Apr 24	Re: "Mini" tags to reduce the number of op codes	2	MitchAlsup1
11 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	BGB
10 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	BGB-Alt
11 Apr 24	Re: "Mini" tags to reduce the number of op codes	16	MitchAlsup1
11 Apr 24	Re: "Mini" tags to reduce the number of op codes	15	Michael S
11 Apr 24	Re: "Mini" tags to reduce the number of op codes	14	BGB
11 Apr 24	Re: "Mini" tags to reduce the number of op codes	13	MitchAlsup1
11 Apr 24	Re: "Mini" tags to reduce the number of op codes	9	BGB-Alt
12 Apr 24	Re: "Mini" tags to reduce the number of op codes	8	MitchAlsup1
12 Apr 24	Re: "Mini" tags to reduce the number of op codes	7	BGB
12 Apr 24	Re: "Mini" tags to reduce the number of op codes	6	MitchAlsup1
12 Apr 24	Re: "Mini" tags to reduce the number of op codes	5	BGB
13 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	MitchAlsup1
13 Apr 24	Re: "Mini" tags to reduce the number of op codes	3	MitchAlsup1
13 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	BGB
15 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	BGB-Alt
12 Apr 24	Re: "Mini" tags to reduce the number of op codes	3	Michael S
12 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	Michael S
15 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	MitchAlsup1
11 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	Terje Mathisen
11 Apr 24	Re: "Mini" tags to reduce the number of op codes	7	Paul A. Clayton
11 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	BGB
11 Apr 24	Re: "Mini" tags to reduce the number of op codes	2	BGB-Alt
12 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	MitchAlsup1
12 Apr 24	Re: "Mini" tags to reduce the number of op codes	2	MitchAlsup1
21 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	Paul A. Clayton
21 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	Paul A. Clayton
10 Apr 24	Re: "Mini" tags to reduce the number of op codes	4	Chris M. Thomasson
10 Apr 24	Re: "Mini" tags to reduce the number of op codes	3	BGB
10 Apr 24	Re: "Mini" tags to reduce the number of op codes	2	Chris M. Thomasson
10 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	BGB-Alt
13 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	Brian G. Lucas
15 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	MitchAlsup1
17 Apr 24	Re: "Mini" tags to reduce the number of op codes	2	Stephen Fuld
17 Apr 24	Re: "Mini" tags to reduce the number of op codes	1	MitchAlsup1