On 2/14/2025 3:52 PM, MitchAlsup1 wrote:
> On Fri, 14 Feb 2025 21:14:11 +0000, BGB wrote:
>> On 2/13/2025 1:09 PM, Marcus wrote:
>> -------------
>>>
>>> The problem arises when the programmer *deliberately* does unaligned
>>> loads and stores in order to improve performance. Or rather, if the
>>> programmer knows that the hardware supports unaligned loads and stores,
>>> he/she can use that to write faster code in some special cases.
>>>
>>
>> Pretty much.
>>
>>
>> This is partly why I am in favor of potentially adding explicit keywords
>> for some of these cases, or to reiterate:
>>   __aligned:
>>     Inform compiler that a pointer is aligned.
>>     May use a faster version if appropriate,
>>       if a faster aligned-only variant of an instruction exists
>>       on an otherwise unaligned-safe target.
>>   __unaligned:
>>     Inform compiler that an access is unaligned.
>>     May use a runtime call or similar if necessary,
>>       on an aligned-only target.
>>     May do nothing on an unaligned-safe target.
>>   None: Do whatever is the default.
>>     Presumably, assume aligned by default,
>>       unless the target is known unaligned-safe.
>
> It would take LESS total man-power world-wide and over-time to
> simply make HW perform misaligned accesses.
I think the usual issue is that on low-end hardware, it is seen as "better" to skip out on misaligned access in order to save some cost in the L1 cache.
Though, I am not sure how this mixes with 16/32-bit ISAs: if one allows misaligned 32-bit instructions, and a misaligned 32-bit instruction may cross a cache-line boundary, one still has to deal with essentially the same issues.
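To make the __aligned/__unaligned idea above concrete, a rough sketch of the intended usage (no compiler implements these qualifiers, so they are defined away here to keep the sketch compilable; the helper names are purely illustrative):

  #include <stdint.h>
  #include <string.h>

  /* The keywords don't exist yet; define them away so this compiles. */
  #define __aligned
  #define __unaligned

  uint32_t load32_aligned(__aligned const uint32_t *p)
  {
      return *p;  /* compiler may pick a faster aligned-only load form */
  }

  uint32_t load32_unaligned(__unaligned const void *p)
  {
      uint32_t v;
      memcpy(&v, p, 4);  /* may lower to a runtime call on an
                            aligned-only target, or a plain load on
                            an unaligned-safe one */
      return v;
  }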
Another related thing I can note is internal store-forwarding within the L1 D$ to avoid RAW and WAW penalties for multiple accesses to the same cache line.
  More expensive option:
    Detect and forward the stored data back into the Load side, so that
    the Load or Store has an up-to-date view of the cache line;
  Cheaper option:
    Stall the pipeline until the prior store is able to complete and
    write its data back into the L1 cache arrays.
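As a toy model of the two options (all names and structure here are made up for illustration, this is not the actual cache logic):

  #include <stdbool.h>
  #include <stdint.h>

  typedef struct {
      bool     valid;
      uint64_t line_addr;   /* cache-line address of an in-flight store */
      uint64_t data;        /* its (simplified, whole-line) data */
  } PendingStore;

  /* Returns the value a load to 'line_addr' should see. */
  uint64_t dcache_read(uint64_t line_addr, uint64_t array_data,
                       PendingStore *st, bool has_forwarding)
  {
      if (st->valid && st->line_addr == line_addr) {
          if (has_forwarding)
              return st->data;   /* expensive: forward the in-flight store */

          /* cheap: stall until the store reaches the L1 arrays, then
             the load just reads the updated arrays */
          /* stall_until_writeback(st); */
          st->valid = false;
      }
      return array_data;
  }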
This partly affects the structure of prologs and memcpy:
  Simple case: Just store or copy in sequential order.
    May take a significant penalty if the cache does not forward stores.
  Stagger the store order to avoid WAW penalties:
    Reduces penalties (if accesses are properly aligned);
    More convoluted logic.
Say, it is less convoluted to do:
  MOV.X R24, (SP, 0)
  MOV.X R26, (SP, 16)
  MOV.X R28, (SP, 32)
  MOV.X R30, (SP, 48)
Than, say:
  MOV.X R24, (SP, 0)
  MOV.X R28, (SP, 32)
  MOV.X R26, (SP, 16)
  MOV.X R30, (SP, 48)
This would be a much bigger headache with 32-byte cache lines than with 16, though. If switching to 32B lines, it might also make sense to switch to half-line addressing.
But, yeah, I have recently gotten caught up in lots of bug hunting, which seems to be negatively affecting my mood.
Did eventually find a few bugs that were holding up XG3 in my Verilog core.
MOV.X was misbehaving, as XG3 had addressed R32..R63 in a different way than XG1 and XG2:
  XG1/XG2: Even numbers encode R0..R30, odd numbers encode R32..R62.
  XG3: Just uses plain register numbers from R0..R62.
The outer logic for dealing with register pairs wasn't aware of the difference, so was incorrectly decoding references to R32..R62 as R0..R30.
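Restating the two decode rules in C (the field width is my assumption):

  #include <assert.h>

  /* XG1/XG2: even field values name R0..R30 directly; odd field
     values select the R32..R62 bank. */
  static int pair_reg_xg1_xg2(int f)
  {
      return (f & ~1) | ((f & 1) << 5);
  }

  /* XG3: the field is just the register number, R0..R62. */
  static int pair_reg_xg3(int f)
  {
      return f;
  }

  int main(void)
  {
      assert(pair_reg_xg1_xg2( 2) ==  2);  /* even -> R2  */
      assert(pair_reg_xg1_xg2( 1) == 32);  /* odd  -> R32 */
      assert(pair_reg_xg1_xg2(31) == 62);  /* odd  -> R62 */
      assert(pair_reg_xg3(34)     == 34);  /* plain number */
      return 0;
  }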
The "CMPxx 3RI Imm6s" instructions were also decoding incorrectly when in XG3 mode when a jumbo-prefix was used and the immediate was negative, partly again due to an XG2/XG3 rules difference:
XG2 had switched to the EI bit for Sign, using WI to select unsigned comparison, XG3 continues using WI for sign in the presence of a Jumbo prefix. So was decoding as an unsigned compare rather than a signed compare.
Where, the unsigned case can't currently be encoded in XG3 with a full 33-bit immediate (but, could be encoded with a 17-bit immediate).
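Or, restated as code (my reading of the rules; only the jumbo-prefixed case is shown):

  #include <stdbool.h>

  /* Jumbo-prefixed CMPxx with a 33-bit immediate:
     XG2: WI selects an unsigned compare, EI carries the immediate sign.
     XG3: WI carries the immediate sign, so this form is signed-only. */
  static bool cmp33_is_unsigned(bool xg2_mode, bool wi_bit)
  {
      if (xg2_mode)
          return wi_bit;
      return false;  /* XG3: unsigned needs the 17-bit-imm form instead */
  }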
This crap has taken me several months to hunt down; I am not feeling very productive.
For RISC-V + Jumbo prefixes, there was another bug that had been a problem for a while.
But, it appears I might have figured out this one:
GBR was not previously allowed to be fetched via the Ru port (Lane 2).
However, the offending encoding:
ADDI Xd, X3, Imm33s
Needed the ability to fetch GBR from this port. This is because it goes through the ALU rather than the AGU, and the ALU also has a quirk: unlike most everything else, it puts the low half in Lane 2 and the high half in Lane 1 (most other instructions put the low half in Lane 1 and the high half in Lane 2). This was mostly because of the way the signal routing needed to work for ALUX: only the Lane 1 ALU can update the S/T bits, and this needs to be done from the high-half ALU for operations like CMPxx and ADC/SBB.
I hadn't thought that the issue might be in the ID2/RF stage.
I can note that GBR still isn't allowed from the 'Rv' port, but this likely shouldn't matter unless one wants to use (in XG3):
RSUB R3, Imm33s, Rn
Or: Rn = Imm33s - GBR;
But, this case is likely obscure enough that it might be better to "just leave it broken" to save some LUTs (or make it disallowed).
Sometimes, it does seem like I might be too dumb for a lot of this stuff.
Otherwise:
Have now added an option to widen superscalar fetch for XG3 and RISC-V to 3 instructions.
So, now things like:
ADD, ADD, ADD
Can use all 3 lanes...
However, I can note that shift ops are still not allowed in Lane 3, and the fetch isn't smart enough to shuffle instructions into valid lanes (this is theoretically possible, but would likely add too much cost).
Seemingly, it adds around 1k LUTs to the cost of the core to enable the logic to detect/handle 3-wide superscalar (vs 2-wide), which is a little steep (though timing does improve in this case).
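The lane restriction check is conceptually something like the following (a C simplification; the real logic is Verilog and checks many more cases):

  #include <stdbool.h>

  typedef struct { bool is_shift; /* ...other op-class flags... */ } OpInfo;

  /* Ops go to Lanes 1..3 in program order; no shuffling into other
     valid lane assignments is attempted. */
  static bool lane_ok(OpInfo op, int lane)
  {
      if (lane == 3 && op.is_shift)
          return false;  /* no shift unit in Lane 3 */
      return true;
  }

  static bool can_issue_3wide(OpInfo a, OpInfo b, OpInfo c)
  {
      return lane_ok(a, 1) && lane_ok(b, 2) && lane_ok(c, 3);
  }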
Also went and widened GBR to a full 64 bits (to match emulator behavior); however, the high bits still remain fused with FPSCR. So, using the high bits of GBR may affect FPU behavior (with dynamic rounding-mode ops), and dynamic rounding-mode ops may magically change bits in the high part of GBR (cough, GP/X3).
Previously, fetching GBR via the normal GPR ports would give a version with the high 16-bits zeroed.
But, arguably, a full-width 64 bits is probably "more correct" even with the glued-on FPSCR.
Implicitly, this makes things like the rounding mode effectively callee-saved rather than global (so, if the rounding mode is set in a called function, it will revert when that function returns).
Then again, I have heard that there are apparently libraries that rely on the global-rounding-mode behavior; but I have also heard of such libraries having issues or non-determinism when mixed with other libraries that try to set a custom rounding mode, when these modes disagree.
I prefer my strategy instead:
  FADD/FSUB/FMUL:
    Hard-wired Round-Nearest / RNE;
    Does not modify FPU flags.
  FADDG/FSUBG/FMULG:
    Dynamic rounding;
    May modify FPU flags.
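As a toy model of what the callee-saved behavior means in practice (names made up; in reality the save/restore falls out of GBR being callee-saved in the prolog/epilog):

  #include <assert.h>
  #include <stdint.h>

  typedef struct { uint64_t gbr_high; } ThreadState;

  static void callee_sets_mode(ThreadState *ts)
  {
      uint64_t saved = ts->gbr_high;  /* prolog: save callee-saved reg */
      ts->gbr_high = 1;               /* e.g., select round-to-zero    */
      /* ... FADDG/FSUBG/FMULG here would use the new mode ... */
      ts->gbr_high = saved;           /* epilog: restore, mode reverts */
  }

  int main(void)
  {
      ThreadState ts = { 0 };        /* RNE */
      callee_sets_mode(&ts);
      assert(ts.gbr_high == 0);      /* caller's mode is unchanged */
      return 0;
  }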
Can note that RISC-V burns 3 bits on FPU instructions by always encoding a rounding mode (whereas in my ISA, encoding a rounding mode other than RNE or DYN requires a 64-bit encoding).
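For reference, the 3-bit 'rm' field encodings from the RISC-V F extension (5 and 6 are reserved):

  enum rv_rm {
      RM_RNE = 0,  /* round to nearest, ties to even          */
      RM_RTZ = 1,  /* round towards zero                      */
      RM_RDN = 2,  /* round down, towards -inf                */
      RM_RUP = 3,  /* round up, towards +inf                  */
      RM_RMM = 4,  /* round to nearest, ties to max magnitude */
      RM_DYN = 7   /* use the 'frm' field in fcsr             */
  };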
Proper RV has some "user accessible CSRs" here, but these are not yet supported. How best to deal with CSRs is an open issue, as most don't have a 1:1 mapping to those in BJX2. In theory, a mechanism could be added mostly for dealing with CSRs (possibly as some weird appendage glued onto the part of the register file that deals with control registers).
...
Well, also, some amount of internal emotional conflicts.
Generally feeling kinda worthless recently.
Well, and some amount of ongoing social issues, but I don't really want to go into my thoughts here right now.
But, alas...