On 2/17/2025 8:55 PM, MitchAlsup1 wrote:
On Tue, 18 Feb 2025 1:00:18 +0000, BGB wrote:
On 2/14/2025 3:52 PM, MitchAlsup1 wrote:
------------
It would take LESS total man-power world-wide and over-time to
simply make HW perform misaligned accesses.
>
>
I think the usual issue is that on low-end hardware, it is seen as
"better" to skip out on misaligned access in order to save some cost in
the L1 cache.
>
Though, I am not sure how this mixes with 16/32-bit ISAs: if one allows misaligned 32-bit instructions, and a misaligned 32-bit instruction can cross a cache-line boundary, then one still has to deal with essentially the same issues.
Strategy for low end processors::
a) detect misalignment in AGEN
b) when misaligned, AGEN takes 2 cycles for the two addresses
c) when misaligned, DC is accessed twice
d) When misaligned, LD align is performed twice to merge data
Possibly.
I had done it at basically full speed with sets of even- and odd-addressed cache lines, but some mechanism to crack the Load/Store into two parts internally could be a different strategy.
Though, the cracking might only need to be done if the misaligned access also crosses a line boundary.
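Roughly, in C-like terms (fetch_line() and the 16-byte line size here are made-up stand-ins for whatever the L1 D$ actually does; this is just a sketch of the crack-and-merge case):

  #include <stdint.h>
  #include <string.h>

  #define LINE_SIZE 16   /* assumed cache-line size, for illustration */

  /* Assume fetch_line() returns a pointer to the aligned line holding 'line_addr'. */
  extern const uint8_t *fetch_line(uint64_t line_addr);

  uint64_t load64_misaligned(uint64_t addr)
  {
      uint64_t lo_line = addr & ~(uint64_t)(LINE_SIZE - 1);
      uint64_t hi_line = (addr + 7) & ~(uint64_t)(LINE_SIZE - 1);
      uint8_t  buf[2 * LINE_SIZE];

      memcpy(buf, fetch_line(lo_line), LINE_SIZE);                    /* first access */
      if (hi_line != lo_line)                                         /* crosses a line? */
          memcpy(buf + LINE_SIZE, fetch_line(hi_line), LINE_SIZE);    /* second access */

      uint64_t val;
      memcpy(&val, buf + (addr - lo_line), 8);                        /* merge/align step */
      return val;
  }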
Another related thing I can note is internal store-forwarding within the
L1 D$ to avoid RAW and WAW penalties for multiple accesses to the same
cache line.
IMHO:: Low end processors should not be doing ST->LD forwarding.
Possibly true.
This feature adds a bit of cost, and is one of the things I ended up needing to turn off in attempts to boost the clock speed to 75MHz.
But, my existing core is currently a little too bulky to try pushing to 75MHz.
Using staggered stores in prologs and memcpy does significantly reduce the performance hit of disabling this forwarding (but there is still some hurt to the speed of LZ4 and RP2 decoding).
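For reference, the sort of staggering I mean, sketched in C (the 16-byte line and 16-byte-aligned dst are assumptions here; a real memcpy would be written differently). With 64-bit elements, dst[i] and dst[i+1] share a line, so the naive order does back-to-back stores into the same line; interleaving two lines avoids that when ST->LD forwarding is off:

  #include <stdint.h>
  #include <stddef.h>

  void copy_staggered(uint64_t *dst, const uint64_t *src, size_t n64)
  {
      size_t i;
      for (i = 0; i + 4 <= n64; i += 4) {
          dst[i + 0] = src[i + 0];   /* line A */
          dst[i + 2] = src[i + 2];   /* line B */
          dst[i + 1] = src[i + 1];   /* back to line A, not back-to-back */
          dst[i + 3] = src[i + 3];   /* back to line B */
      }
      for (; i < n64; i++)           /* leftover tail, plain order */
          dst[i] = src[i];
  }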
I am left half-thinking it might make sense to try doing something lighter. But, would need to decide on specifics.
A full soft-reboot is unlikely, but it might make sense to design a core around a subset of my current design.
One possibility could be to design a 2-wide core around a subset of XG3.
And, possibly try aiming for a 75MHz target.
May drop to 32/64 bit instructions and 64-bit fetch.
May not try for RV64G, as some things in RV64G add too much complexity and would likely make a 75MHz target harder.
Some things would be TBD, like whether to stay with full compare-and-branch, or drop back to cheaper compare-with-zero-and-branch. Would likely (once again) axe some things that needed to be added for RV64G support (but which remain debatable in terms of hardware cost; see 1 below).
1: Say, for example, 64-bit integer multiply and divide.
It is cheaper to build a 64-bit CPU that only provides a 32-bit multiplier (falling back to software for 64-bit multiply).
XG2 is also possible, but arguably, XG3 does have a cleaner encoding scheme. Currently, either can be decoded in terms of the other, but there is some amount of special-casing (and it might be cleaner to switch to XG3 as the native encoding scheme).
I guess another open question is whether there is a way to make my Binary64 FPU cheaper and with less timing impact. Not sure; it was already a bit of an exercise in corner cutting.
There is also an idle thought of trying to lengthen the pipeline enough to allow fully pipelined FPU ops. But, the issue is doing so cheaply (and without negatively affecting the cost of branch-predictor misses).
Say: PF IF ID RF E1 E2 E3 E4 E5 E6 WB
Would have steeper cost and increased branch latency.
Though, one could possibly only allow forwarding from certain stages, say: E2, E3, and E5
Whereas, if the result is in E1 or E4, it generates an interlock stall, and E6 stalls until WB completes (may or may not allow forwarding from WB). Though, possibly, there could be "pseudo-forwarding" from E4/E5/E6, where if an instruction completed in a prior stage, these stages can still forward the result, but no new results may "arrive" at these stages (dunno how much difference this would make for forwarding cost, could still be expensive to have this many EX stages).
Dropping EX1, as-is, mostly affects the performance of Reg-Reg and Imm-Reg MOV (pretty much everything else of note already has a 2-cycle latency), but these instructions are more sensitive to latency (so, a 2-cycle MOV is not ideal).
With 6 pipeline stages, this could be enough to allow pipelining a Binary64 FMUL or FADD, or a Binary32 FMAC.
But, it would mean a 13-cycle branch miss, ... And possibly also turn the CPU into a turd.
Another option could be to keep 3 primary EX stages, but have a mechanism for registers to be marked as "not yet available" and then allow longer-latency operations to finish at some later stage.
Some cores I had looked at had done this (for things like memory accesses, which were put into a FIFO), but this leaves the issue of how to best get results back into the register file (don't want to be handing out register-file write ports to function units, and there is a high probability of multiple FUs wanting to submit results at the same time, which would need to be dealt with).
Best option I can think of is that these FUs have a mechanism to hold 1 or 2 values, and a mechanism exists to MUX these over a shared write port, generating pipeline stalls if the port gets backlogged. But, this seems like it would suck.
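Something like the following, as a very rough C model (all the structures and names here are invented for illustration, not how any actual core does it): each FU holds up to 2 finished results, one result per cycle gets muxed onto the shared write port, and a stall is signaled if any FU's hold buffer is full and cannot drain.

  #include <stdint.h>
  #include <stdbool.h>

  #define NUM_FU   4
  #define HOLD_MAX 2

  typedef struct {
      uint8_t  reg;   /* destination register */
      uint64_t val;   /* result value */
  } held_result_t;

  typedef struct {
      held_result_t slot[HOLD_MAX];
      int           count;
  } fu_outq_t;

  static fu_outq_t fu_out[NUM_FU];

  /* One call per cycle: drive at most one pending result onto the shared
     write port.  Returns true if a pipeline stall is needed because some
     FU's hold buffer is full and could not drain this cycle. */
  bool writeback_arbitrate(uint8_t *wb_reg, uint64_t *wb_val, bool *wb_en)
  {
      bool stall = false;
      *wb_en = false;
      for (int i = 0; i < NUM_FU; i++) {
          if (fu_out[i].count > 0 && !*wb_en) {
              *wb_reg = fu_out[i].slot[0].reg;   /* drain oldest result */
              *wb_val = fu_out[i].slot[0].val;
              *wb_en  = true;
              fu_out[i].slot[0] = fu_out[i].slot[1];
              fu_out[i].count--;
          } else if (fu_out[i].count >= HOLD_MAX) {
              stall = true;                      /* backlogged -> stall */
          }
      }
      return stall;
  }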
Moving instructions along one stage at a time, and then having the final value appear on the pipeline (to be forwarded back to RF, or eventually reach WB), is cleaner and simpler.
Never mind the issue of needing to stall the pipeline whenever the L1 cache misses or similar.
...
But, I guess the more immediate question would be more of coming up with something that has a decent/fast ISA, can run at 75MHz, and fit more easily onto an XC7S50 or similar.
Though, the most conservative option is to keep a design similar to my existing core, just try to strip it down a fair bit.
---------------------
>
Say, it is less convoluted to do:
MOV.X R24, (SP, 0)
MOV.X R26, (SP, 16)
MOV.X R28, (SP, 32)
MOV.X R30, (SP, 48)
These still look like LDs to me.
My ASM notation is "OP Src, Dst".
Which is, granted, backwards from Intel and RV notation.
But, it evolved fairly directly out of the SuperH GAS notation.
Which is in a similar category as M68K and MSP430.
Some other variants had decorated things further:
% on registers
@ on memory references
# on immediates
...
I had dropped most of these sorts of decorator characters, instead using parentheses to indicate memory references.
Someone could I guess write:
MOV.X %R24, @(SP,0)
If they really wanted to...
Well, and/or make a case for x86 style notation:
MOV [SP+0], R24Q
...
-----------------
Then again, I have heard that apparently there are libraries that rely
on the global-rounding-mode behavior, but I have also heard of such
libraries having issues or non-determinism when mixed with other libraries that try to set a custom rounding mode, in cases where the modes disagree.
>
>
I prefer my strategy instead:
FADD/FSUB/FMUL:
Hard-wired Round-Nearest / RNE.
Does not modify FPU flags.
It takes Round Nearest Odd to perform Kahan-Babuška Summation.
That is:: comply with IEEE 754-2019
If an algorithm depends on accurate rounding, probably not a good idea with my existing FPU...
Maybe, if future implementations were done, they could do a more IEEE correct FPU, but I guess there is always a possible risk that software could end up depending on some of the wonky corner cutting.
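As an example of the sort of algorithm that depends on accurate rounding, the usual compensated (Kahan) summation loop; this is the plain form, not the round-to-nearest-odd Kahan-Babuška variant mentioned above:

  /* Each step recovers the low-order bits lost by the previous FADD;
     this only works as intended if each FADD is correctly rounded. */
  double kahan_sum(const double *x, int n)
  {
      double sum = 0.0, c = 0.0;     /* c accumulates the lost low bits */
      for (int i = 0; i < n; i++) {
          double y = x[i] - c;       /* apply the previous correction */
          double t = sum + y;        /* low bits of y may be lost here */
          c = (t - sum) - y;         /* recover what was lost */
          sum = t;
      }
      return sum;
  }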
In a few corner cases though, like the "FLDCH Imm16, Rn" instruction (Load Binary16 to Binary64), it is likely that the conversion will end up needing to be done in a way that gives bit-identical results.
Though, it is not entirely settled which exact variant would be better (regarding the specific handling of Exponents 0 and 31).
General case converters:
  Special-case Inf/NaN and Zero such that these values map as expected
    (this form is currently canonical);
  Cheaper Version (Immediate Path): Don't bother, they are treated as slight
    extensions of the normal range, at the expense that literal Inf/NaN or 0
    are N/E in this case.
The cheaper version is what one is currently liable to get with many FPU immediate instructions.
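For illustration, roughly what the two flavors look like in C for the Binary16 -> Binary64 case (field layouts are the standard IEEE ones; the function names are made up, and the "full" version here just flushes subnormals for brevity):

  #include <stdint.h>

  /* Full converter: special-cases Exp=0 (zero/subnormal) and Exp=31 (Inf/NaN). */
  uint64_t fp16_to_fp64_full(uint16_t h)
  {
      uint64_t sgn = (h >> 15) & 1;
      uint64_t exp = (h >> 10) & 0x1F;
      uint64_t man = h & 0x3FF;

      if (exp == 0x1F)                       /* Inf/NaN */
          return (sgn << 63) | (0x7FFULL << 52) | (man << 42);
      if (exp == 0)                          /* zero; subnormals flushed here,
                                                a full converter would normalize */
          return sgn << 63;
      return (sgn << 63) | ((exp + (1023 - 15)) << 52) | (man << 42);
  }

  /* Cheap converter (immediate path): no special cases, Exp 0 and 31 are
     just treated as extensions of the normal range, so literal Inf/NaN or
     0 cannot be expressed. */
  uint64_t fp16_to_fp64_cheap(uint16_t h)
  {
      uint64_t sgn = (h >> 15) & 1;
      uint64_t exp = (h >> 10) & 0x1F;
      uint64_t man = h & 0x3FF;
      return (sgn << 63) | ((exp + (1023 - 15)) << 52) | (man << 42);
  }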
I recently looked into the idea of adding special-case encodings to deal with scaling by 2/3 and 4/5 (likely hidden in the NaN range). These could be decoded at a "reasonable" cost, but statistically wouldn't increase the coverage of FPU immediate values by all that much.
Where: B.E5.M4 or 1.B.E4.M4:
  B=0: 2/3
  B=1: 4/5
Decoding as ((1.M*{(2/3)|(4/5)})*pow(2, E-7)), just expressed as bit-unpacking via a lookup table (all of these cases would fill the low-order bits with a repeating 4-bit pattern).
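In arithmetic form (illustrative only; the hardware version would be a small lookup table feeding the repeating low-order mantissa pattern rather than a real multiply):

  #include <math.h>

  /* B: 1 bit, E: 4 or 5 bits (bias 7 per the formula above), M: 4 bits. */
  double decode_frac_imm(int B, int E, int M)
  {
      double scale = B ? (4.0 / 5.0) : (2.0 / 3.0);
      return (1.0 + M / 16.0) * scale * pow(2.0, E - 7);
  }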
However, only a statistically small percentage of the FPU immediate values matched this pattern.
Would only increase hit-rate for Binary16 immediate from around 73% to around 77%, so not likely worth the added logic complexity.
Of those that miss Binary16, ~ 15% hit with Binary32 (with most of the misses here due to insufficient mantissa bits).
Someone elsewhere had mentioned BF16, but can note:
  Exponents miss ~ 2.6%
    (Had mentioned I was getting 0%, but this was a BGBCC bug;
     the end result is similar though.)
  Few of these cases would also hit with a 7-bit mantissa.
  And, 0.3% isn't worth it.
Could almost get more elaborate and use a format where the number of exponent and mantissa bits varies for more dynamic range, but likely also not worth it.
Based on current stats, I suspect likely the best-case format (in terms of hit-rate for FPU constants) might actually be S.E6.M9 (which could gain slightly more due to dynamic range than it loses due to shorter mantissa). But, still debatable (around 73 -> 75%).
FADDG/FSUBG/FMULG:
Dynamic Rounding;
May modify FPU flags.
>
Can note that RISC-V burns 3 bits for FPU instructions always encoding a
rounding mode (whereas in my ISA, encoding a rounding mode other than
RNE or DYN requires a 64-bit encoding).
Oh what fun, another RISC-V encoding mistake...
Yeah...
Usual RISC-V design practice:
Well, we have this 25-bit encoding block, why not burn *all* of it?...
But, ironically, there are still enough unused corner cases in the F/D block encodings that one can also fit an SSE-style FPU-SIMD ISA in there while not technically adding all that many new instructions.
Then 'P' and 'V' come along, and burn two more 25-bit blocks.
I had assigned encodings in a more one-off manner, but ended up with a few cases for the 32-bit encodings:
  Fixed RN/RNE rounding (ignores FPSCR):
    FADD/FSUB/FMUL/FDIV
  Dynamic Rounding Mode (uses FPSCR):
    FADDG/FSUBG/FMULG
  Approximate (Binary64 with truncated mantissa, ignores FPSCR):
    FADDA/FSUBA/FMULA
There are longer 64-bit encodings with explicit rounding modes and FPU immediate values.