On 8/10/2024 5:18 AM, Anton Ertl wrote:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
One thing these instruction traces would frequently report is that integer
multiply and divide instructions were not so common, and so could be
omitted and emulated in software, with minimal impact on overall
performance. We saw this design decision taken in the early versions of
Sun’s SPARC for example, and also IBM’s ROMP as used in the RT PC.
Alpha and IA-64 have no integer division. IIRC IA-64 has no FP
division.
One interesting aspect of RISC-V is that they put multiplication and
division in the same extension (which is included in RV64G, i.e., the
General version of RISC-V).
Initially, BJX2 didn't have division, but now it does.
I will put the blame on trying to support a RISC-V decoder as well...
One can leave out both and the performance effect is fairly modest.
Some programs need faster integer divide, but the usual workaround is to use a lookup table and multiply by the reciprocal (often needed anyway, since on many machines with a divide instruction it was still slow).
There is an optimization which can reduce the common-case integer divide down to around 3 cycles, but if the code already uses lookup tables (to sidestep slow or absent integer divide), this doesn't save much (so it only really has a benefit if the code was written assuming a fast integer divide).
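As a rough illustration of the lookup-table workaround (a minimal C sketch of my own, not code from BGBCC or TestKern; the names and the 2^32 scaling are assumptions): precompute fixed-point reciprocals for small divisors, then divide with a multiply and a shift.

  #include <stdint.h>
  #include <stdio.h>

  /* Table of ceil(2^32 / d) for divisors 1..255 (entry 0 unused). */
  static uint64_t recip_tab[256];

  static void init_recip_tab(void)
  {
      int d;
      for (d = 1; d < 256; d++)
          recip_tab[d] = ((1ULL << 32) + d - 1) / d;
  }

  /* n / d via multiply-by-reciprocal; exact for n < 2^24, 1 <= d < 256. */
  static uint32_t div_by_table(uint32_t n, uint32_t d)
  {
      return (uint32_t)(((uint64_t)n * recip_tab[d]) >> 32);
  }

  int main(void)
  {
      init_recip_tab();
      printf("%u\n", div_by_table(1000000, 7));  /* prints 142857 */
      return 0;
  }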
So, seemingly, in practice the only thing that really notices or cares is Dhrystone (which significantly over-represents the value of integer divide).
As-is, leaving out hardware divide would save ~2K LUTs, this being mostly the cost of the Shift-Add unit.
But, this unit also implements another (arguably slightly more useful) feature: 64-bit multiply. It was also possible to route floating-point divide through this unit.
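For reference, a minimal C sketch (my own illustration, not the Verilog) of the sort of iterative shift-and-add loop such a unit performs, shown for a 64x64->64 multiply, one multiplier bit per step:

  #include <stdint.h>

  static uint64_t shift_add_mul(uint64_t a, uint64_t b)
  {
      uint64_t acc = 0;
      int i;
      for (i = 0; i < 64; i++) {
          if (b & 1)
              acc += a;   /* conditionally add the shifted multiplicand */
          a <<= 1;        /* shift multiplicand up by one */
          b >>= 1;        /* consume one multiplier bit */
      }
      return acc;
  }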
Theoretically, I could extend the Shift-Add unit to 128 bits, and potentially add:
128-bit integer multiply and divide;
Binary128 FMUL and FDIV.
For Binary128, FADD/FSUB and FCMP are cheaper than FMUL, so this approach could potentially make Binary128 support in hardware "viable".
But, debatable if worth the LUTs.
The BJX2 core is already expensive...
Though, a few big/expensive features being:
The FP-SIMD unit (supports 4x Binary32 with a 3-cycle latency);
The stuff needed for the LDOP / RISC-V 'A' extension (*);
The main FPU (Binary64);
...
A little over 1/4 of the LUT cost of the core goes into the L1 caches.
*: The LDOP extension adds x86-style Load-Op and Op-Store forms of the basic ALU instructions, since the RISC-V 'A' extension already requires paying most of the cost of doing so (the 'A' extension has slightly different behavior, but most of the difference is in the decoder).
The "cheaper option" would have been:
Don't bother with doing ALU ops against memory;
Don't bother with LL/SC or CAS;
Just add a SWAP/XCHG instruction, with non-caching variants.
Non-Caching XCHG is sufficient to implement a Spinlock/Mutex.
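For example (a minimal C11 sketch of my own, assuming only an atomic exchange is available), a spinlock needs nothing more:

  #include <stdatomic.h>

  typedef struct { atomic_int locked; } spinlock_t;

  static void spin_lock(spinlock_t *l)
  {
      /* swap in 1; if the old value was already 1, someone else holds it */
      while (atomic_exchange(&l->locked, 1) != 0)
          ;  /* spin (could back off or yield here) */
  }

  static void spin_unlock(spinlock_t *l)
  {
      atomic_store(&l->locked, 0);
  }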
I guess the selling point of atomic operations is that one only ever sees the state from before or after the operation, but I am not sold on their merits.
The design of the 'A' extension also seems to assume a memory subsystem where a core can 'reserve' a cache line in a single-writer sense. I just sort of "winged it", as my memory subsystem was designed around volatile/non-volatile access:
Volatile:
The L1 cache flushes the line (if needed);
Fetches;
Does operation;
Flushes line back to memory shortly afterwards.
Non-volatile:
Default, fetch line and keep it around;
Memory may become stale if not flushed.
So, the AQ/RL flags mostly just serve as hints for whether to use Volatile access.
As-is, the RV LL/SC instructions won't actually work as described.
Also FENCEI will just trap, ...
Did see some comments online where people were also saying that 'A' semantics and LL/SC can't be implemented on an AXI bus, but not really looked into this.
Outside the main CPU, there is the rasterizer module, which uses about as many LUTs as a small 32-bit CPU core (and roughly the same number of DSP48's as the main CPU core).
There is a feature that is "kinda expensive", namely the "LDTEX" instruction (Load Texture / Texel Load), which is less needed with the rasterizer module. It was mostly relevant to software-rasterization performance in TKRA-GL. But, annoyingly, it would become relevant again if I ever get an ARB-ASM or GLSL compiler implemented.
But, then I would also need to figure out how to approach GL's behavior towards Nearest vs Linear fetch in shaders. Don't necessarily want to handle it dynamically in software.
Though, I guess one option would be to have a multi-stage compiler:
First stage, compile to an IR, likely similar to a modified/extended form of ARB-ASM (the ARB-ASM would still require basic translation, mostly to map symbolic names to internal register numbers and likely to unpack some complex instructions into simpler ones);
As needed, JIT to machine code, with some variation based on (among other things) the parameters of the bound textures.
Later, it seems, the CPU designers realized that instruction traces were
not the final word on performance measurements, and started to include
hardware integer multiply and divide instructions.
When you invest more hardware to increase performance per cycle, at
one point the best return on investment is to have multiplication and
division instructions. What is interesting is that the multipliers were then soon fully pipelined; or, as Mitch Alsup reports, in cases where that was cheaper, two half-pipelined multipliers were used.
Apparently there are enough applications that require a huge number of
multiplications; my guess is that the NSA won't tell us what they are.
Multiply is probably 1 or 2 orders of magnitude more common than divide.
My rough ranking of instruction probabilities (descending probability, *):
Load/Store (Constant Displacement, ~30%);
Branch (~ 14% of ops);
ALU, ADD/SUB/AND/OR (~ 13%);
Load/Store (Register Indexed, ~10%);
Compare and Test (~ 6%);
Integer Shift (~ 4%);
Register Move (~ 3%);
Sign/Zero Extension (~ 3%);
ALU, XOR (~ 2%);
Multiply (~ 2%);
...
*: Crude estimate based on categorizing the dynamic execution probabilities (which are per-instruction rather than by category).
Meanwhile, DIV and friends are generally closer to 0.05% or so...
You can leave them out and hardly anyone will notice.
For the most part, something like RISC-V makes sense, except that omitting Indexed Load/Store is basically akin to shooting oneself in the foot (and does result in a significant increase in the number of Shift and ADD instructions used).
With RISC-V, one may see ~ 25% Load/Store followed by ~ 20% ADD and 15% Shift, ...
Some of this is because ADD and Shift end up over-represented due to their use in compound operations (indexed load/store and sign/zero extension).
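To illustrate (a rough sketch; the instruction sequence in the comment is from memory and may differ slightly by compiler): a plain 64-bit array access needs a shift+add+load sequence on RV64, versus a single register-indexed load where one exists.

  #include <stdint.h>

  uint64_t fetch(uint64_t *arr, long i)
  {
      return arr[i];
      /* RV64 (no indexed load), roughly:
             slli t0, a1, 3      ; scale index by 8
             add  t0, a0, t0     ; form the address
             ld   a0, 0(t0)      ; load
         With a scaled register-indexed load this is one instruction. */
  }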
I have some unusual instructions in the ~ 0.5% to 1% category:
MOVLD/MOVLLD and (more recently) MOVLW/MOVLLW.
These basically being special cases that perform, essentially:
MOVLD: Rn = (Rs<<32) | ((u32)Rt);
MOVLW: Rn = ((u32)(Rs<<16)) | ((u16)Rt);
Which in my case seemed to be fairly common (and might be more common if BGBCC had good pattern matching for them).
And, at 0.4%, things like PMORT (Morton Shuffle), but mostly because I potentially "over" use Morton-Shuffle in things like 3D rendering tasks (for its weirdness and limitations, it is very useful for texel fetch).
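For reference (a standard bit-interleave in C, my own sketch of what PMORT accelerates rather than its exact semantics): interleaving two 16-bit coordinates into a Z-order index, so that nearby (x, y) texels land near each other in memory.

  #include <stdint.h>

  /* spread the low 16 bits of v so they occupy the even bit positions */
  static uint32_t part1by1(uint32_t v)
  {
      v &= 0x0000FFFF;
      v = (v | (v << 8)) & 0x00FF00FF;
      v = (v | (v << 4)) & 0x0F0F0F0F;
      v = (v | (v << 2)) & 0x33333333;
      v = (v | (v << 1)) & 0x55555555;
      return v;
  }

  static uint32_t morton_index(uint32_t x, uint32_t y)
  {
      return part1by1(x) | (part1by1(y) << 1);
  }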
In my recent experiments with HDR, PLDCM8UH (the instruction used mostly for unpacking 4x FP8U to 4x Binary16) rises to 1.4%. This is still with an implementation of the renderer which primarily shoves HDR pixels through the LDR path (if it were "properly HDR", it would be higher).
Most of the usage of the instruction in this case is when converting the HDR framebuffer to LDR for display.
Performance of HDR is weak vs LDR (as it ends up spending ~ 20% of the CPU time in pixel conversion), but probably doesn't matter too much if one assumes HDR is niche.
Might be better if I had "better" SIMD conversion ops (which did not require manual biasing and range-clamping), but, alas...
I guess a case could be made for having PCVTH2UW and similar do their own range clamping:
< 1.0: Clamp to 0x0000;
1.0 .. 1.999: Copy mantissa bits;
>= 2.0: Clamp to 0xFFFF.
And, for PCVTH2SW:
< 2.0: Clamp to 0x8000;
2.0 .. 3.999: Copy mantissa bits (inverting high bit);
>= 4.0: Clamp to 0x7FFF.
...
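Roughly, for the unsigned case above (a C sketch of my reading of the proposal, not actual hardware; the 6-bit left shift is an assumption about how the 10 mantissa bits would fill 16 bits):

  #include <stdint.h>

  static uint16_t cvt_h2uw(uint16_t h)    /* h = raw Binary16 bit pattern */
  {
      uint16_t sign = h & 0x8000;
      uint16_t exp  = (h >> 10) & 0x1F;   /* biased exponent, bias = 15 */
      uint16_t man  = h & 0x03FF;         /* 10 mantissa bits */

      if (sign || exp < 15)  return 0x0000;   /* x < 1.0 (or negative) */
      if (exp >= 16)         return 0xFFFF;   /* x >= 2.0 (incl. Inf/NaN) */
      return (uint16_t)(man << 6);            /* 1.0 <= x < 2.0 */
  }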
Meanwhile, I am once again reminded of an annoying edge case bug in my Verilog implementation:
If a TLB Miss happens on an inter-ISA branch, it can leave the CPU core in an inconsistent state.
Seemingly, the captured PC is for the branch instruction which generated the TLB miss, but the captured CPU state is for the destination ISA. So, upon returning from the TLB Miss ISR, the branch instruction is decoded in the wrong ISA mode. This may be because the pipeline's snapshot of the SR state-bits for each pipeline stage only really covers the "status bits" but not the bits encoding the ISA mode and similar (which are assumed not to change from one instruction to the next; and they generally don't, except on inter-ISA branches).
Hacky "fix" in software is to do a load from the function pointer or destination whenever in a place where an inter-ISA branch is likely (to "pre-warm" the page in the TLB), but, this kinda sucks.
TODO: Fix this.
Well, and/or continue hacking around it in software, because inter-ISA branches are uncommon (this mostly affects the syscall mechanism, and DLL imports if the EXE and DLLs were compiled in different ISA modes).
Might also come up if I try to do thunk-free dynamic linking between RISC-V and BJX2 code, but this seems unlikely ATM (most cases are likely to continue to involve thunks, and the thunks will be aware of the inter-ISA branch).
Where, say, if one wants to call a BJX2 function pointer from within RISC-V mode, it is necessary to have a blob of wrapper code to paper over the ABI differences, and the only way this would likely go away is if each compiler could understand the other's ABI (say, BGBCC knowing the RV64 ABI, or GCC knowing the BJX2 ABI).
Could auto-generate thunks, but at present in the general case this will require type signatures and an understanding of any complex types.
Simple case only addresses 0-8 integer or pointer arguments, with an integer or pointer return (these can be handled without needing to know a function's type signature).
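Roughly, the signature-free case could look like this (my own sketch; it leans on register-argument ABIs ignoring surplus arguments, and a real thunk would also handle the ISA-mode switch and any ABI fixups):

  #include <stdint.h>

  typedef uint64_t (*generic_fn)(uint64_t, uint64_t, uint64_t, uint64_t,
                                 uint64_t, uint64_t, uint64_t, uint64_t);

  /* Forward up to 8 register-sized integer/pointer arguments and return
     a register-sized result, without knowing the real type signature. */
  static uint64_t call_via_thunk(generic_fn target, const uint64_t a[8])
  {
      return target(a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7]);
  }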
Most likely case is treating RV64 ELF/SO's and BJX2 EXE's/DLL's as two independent ecosystems, and discouraging trying to mix/match them (but, may detect/hack/auto-thunk things in "dlsym()").
Maybe also worth figuring out:
How to allow code on the RV64G side of things to use GLIBC (and then try to fake the Linux syscall mechanism to allow GLIBC to work).
...