On 2/17/2025 11:07 PM, Robert Finch wrote:
On 2025-02-17 8:00 p.m., BGB wrote:
On 2/14/2025 3:52 PM, MitchAlsup1 wrote:
On Fri, 14 Feb 2025 21:14:11 +0000, BGB wrote:
>
On 2/13/2025 1:09 PM, Marcus wrote:
-------------
>
The problem arises when the programmer *deliberately* does unaligned
loads and stores in order to improve performance. Or rather, if the
programmer knows that the hardware supports unaligned loads and stores,
he/she can use that to write faster code in some special cases.
>
>
Pretty much.
>
>
This is partly why I am in favor of potentially adding explicit keywords for some of these cases; to reiterate:
__aligned:
Inform compiler that a pointer is aligned.
May use a faster version if appropriate.
If a faster aligned-only variant exists of an instruction.
On an otherwise unaligned-safe target.
__unaligned: Inform compiler that an access is unaligned.
May use a runtime call or similar if necessary,
on an aligned-only target.
May do nothing on an unaligned-safe target.
None: Do whatever is the default.
Presumably, assume aligned by default,
unless target is known unaligned-safe.
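In C, usage might look something like this (a sketch only; the keyword spelling is hypothetical, and it is defined to nothing here so the fragment compiles as ordinary C):

  #include <stdint.h>

  #define __unaligned  /* hypothetical qualifier; expands to nothing here */

  uint32_t get_u32(const void *p)
  {
      /* On an unaligned-safe target this can be a plain 32-bit load;
         an aligned-only target would instead emit a byte-wise sequence
         or a runtime call.  Without some such keyword, this cast is
         formally UB in C if p is misaligned. */
      return *(const __unaligned uint32_t *)p;
  }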
>
It would take LESS total man-power world-wide and over-time to
simply make HW perform misaligned accesses.
>
>
I think the usual issue is that on low-end hardware it is seen as "better" to omit misaligned access support in order to save some cost in the L1 cache.
>
I always include support for unaligned accesses, even on a 'low-end' CPU. I think it is not that expensive, and it sure makes some things a lot easier when handled in hardware. For Q+ it just runs two bus cycles if the data spans a cache line and pastes the results together as needed.
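In software terms, the line-spanning case comes down to something like this (a sketch of the two-access paste, not Q+'s actual logic; assumes 64-byte lines and host byte order):

  #include <stdint.h>
  #include <string.h>

  #define LINE_BYTES 64

  /* Misaligned 32-bit read that spans two cache lines: do two
     aligned line accesses, then paste the bytes together. */
  uint32_t span_read32(const uint8_t *line0, const uint8_t *line1,
                       unsigned offs)   /* 61..63: crosses the line */
  {
      unsigned n0 = LINE_BYTES - offs;  /* bytes taken from line 0  */
      uint8_t tmp[4];
      memcpy(tmp,      line0 + offs, n0);
      memcpy(tmp + n0, line1,        4 - n0);
      uint32_t v;
      memcpy(&v, tmp, 4);               /* host byte order          */
      return v;
  }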
I had gone aligned-only with some 32-bit cores in the past.
The whole CPU core fit into fewer LUTs than I currently spend on just the L1 D$...
Granted, some of these used a very minimal L1 cache design:
Only holds a single cache line.
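(A single-line cache is essentially just one tag compare; minimal sketch, names invented:)

  #include <stdint.h>
  #include <string.h>

  typedef struct {
      uint32_t tag;        /* base address of the held line */
      int      valid;
      uint8_t  data[16];   /* one 16-byte line              */
  } OneLineCache;

  /* Returns 1 on hit and copies 4 bytes; 0 means miss (the core
     would stall, refill the line, and retry). */
  int dc_read32(OneLineCache *c, uint32_t addr, uint32_t *out)
  {
      if (!c->valid || c->tag != (addr & ~15u))
          return 0;
      memcpy(out, &c->data[addr & 12u], 4);  /* aligned-only access */
      return 1;
  }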
The smallest cores I managed used a simplified SH-based design:
Fixed-length 16 bit instructions, with 16 registers;
Only (Reg) and (Reg, R0) addressing;
Aligned only;
No shift or multiply;
...
Where, say:
SH-4 -> BJX1-32 (Added features)
SH-4 -> B32V (Stripped down)
BJX1-32 -> BJX1-64A (64-bit, Modal Encoding)
B32V -> B64V (64-bit, Encoding Space Reorganizations)
B64V ~> BJX1-64C (No longer Modal)
Where, BJX1-64C was the end of this project (before I effectively did a soft-reboot).
Then transition phase:
B64V -> BtSR1 (Dropped to 32-bit, More Encoding Changes)
Significant reorganization.
Was trying to optimize for code density, to get closer to MSP430.
BtSR1 -> BJX2 (Back to 64-bit, re-adding features from BJX1-64C)
A few features added for BtSR1 were dropped again in BJX2.
The original form of BJX2 was still a primarily 16-bit ISA encoding, but by this point it had pretty much mutated beyond recognition (relatively few instructions were still in the same places they were in SH-4).
For example (original 16-bit space):
0zzz:
SH-4: Ld/St (Rm,R0); also 0R and 1R spaces, etc.
BJX2: Ld/St Only (Rm) and (Rm,R0)
1zzz:
SH-4: Store (Rn, Disp4)
BJX2: 2R ALU ops
2zzz:
SH-4: Store (@Rn, @-Rn), ALU ops
BJX2: Branch Ops (Disp8), etc
3zzz:
SH-4: ALU ops
BJX2: 0R and 1R ops
4zzz:
SH-4: 1R ops
BJX2: Ld/St (SP, Disp4); MOV-CR, LEA
5zzz:
SH-4: Load (Rm, Disp4)
BJX2: Load (Unsigned), ALU ops
6zzz:
SH-4: Load (@Rm+ and @Rm), ALU
BJX2: FPU ops, CMP-Imm4
7zzz:
SH-4: ADD Imm8, Rn
BJX2: (XGPR 32-bit Escape Block)
8zzz:
SH-4: Branch (Disp8)
BJX2: Ld/St (Rm, Disp3)
9zzz:
SH-4: Load (PC-Rel)
BJX2: (XGPR 32-bit Escape Block)
Azzz:
SH-4: BRA Disp12
BJX2: MOV Imm12u, R0
Bzzz:
SH-4: BSR Disp12
BJX2: MOV Imm12n, R0
Czzz:
SH-4: Some Imm8 ops
BJX2: ADD Imm8, Rn
Dzzz:
SH-4: Load (PC-Rel)
BJX2: MOV Imm8, Rn
Ezzz:
SH-4: MOV Imm8, Rn
BJX2: (32-bit Escape, Predicated Ops)
Fzzz:
SH-4: FPU Ops
BJX2: (32-bit Escape, Unconditional Ops)
For the 16-bit ops, SH-4 had more addressing modes than BJX2:
SH-4: @Reg, @Rm+, @-Rn, @(Reg,R0), @(Reg,Disp4) @(PC,Disp8)
BJX2: (Rm), (Rm,R0), (Rm,Disp3), (SP,Disp4)
Although it may seem like it, I didn't just completely start over on the layout; rather, it was sort of an "ant-hill reorganization".
Say, for example:
1zzz and 5zzz were merged into 8zzz, reducing Disp by 1 bit
2zzz and 3zzz were partly folded into 0zzz and 1zzz
8zzz's contents were moved to 2zzz
4zzz and part of 0zzz were merged into 3zzz
...
A few CRs are still in the same places, and SR still has a similar layout, I guess, ...
Early on, there was the idea that the 32-bit ops were prefix-modified versions of the 16-bit ops, but this symmetry broke fairly quickly and the 16- and 32-bit encoding spaces became independent of each other.
Though, the 32-bit F0 space still has some amount of similarity to the 16-bit space.
Later on I did some testing and performance comparisons, and realized that using 32-bit encodings primarily (or exclusively) gave significantly better performance than relying primarily or exclusively on 16-bit ops. At this point the ISA transitioned from a primarily 16-bit ISA (with 32-bit extension ops) to a primarily 32-bit ISA with a 16-bit encoding space. This transition didn't directly affect encodings, but did affect how the ISA developed from then on (more to the point, there was no longer an idea that the 16-bit ISA would need to be able to exist standalone; now it was the 32-bit ISA that needed to be able to exist standalone).
But, the newer forms of BJX2 (XG2 and XG3) have become almost unrecognizable compared with early BJX2 (as an ISA still primarily built around 16-bit encodings).
Except that XG2's instruction layout still carries vestiges of its origins as a prefix encoding. But, XG3 even makes this part disappear (by reorganizing the bits to more closely resemble RISC-V's layout).
Well, and there is:
ZnmX -> ZXnm
But:
F0nm_ZeoX
I prefer my strategy instead:
FADD/FSUB/FMUL:
Hard-wired Round-Nearest / RNE.
Does not modify FPU flags.
FADDG/FSUBG/FMULG:
Dynamic Rounding;
May modify FPU flags.
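In C terms the split is roughly the following (sketch via fenv.h; the G-suffixed ops act like the fesetround-sensitive case, the plain ops as if RNE were always in effect):

  #include <fenv.h>
  #include <stdio.h>
  #pragma STDC FENV_ACCESS ON

  int main(void)
  {
      volatile double a = 1.0, b = 0x1p-60;  /* sum is inexact */

      /* FADDG-style: honors the dynamic rounding mode. */
      fesetround(FE_UPWARD);   double up = a + b;  /* 1 + 1 ulp */
      fesetround(FE_DOWNWARD); double dn = a + b;  /* exactly 1 */
      printf("%d\n", up != dn);   /* prints 1: results differ   */

      /* FADD-style (hard-wired RNE) would give exactly 1.0 in
         both cases and leave the FPU flags alone. */
      return 0;
  }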
>
Can note that RISC-V burns 3 bits in FPU instructions by always encoding a rounding mode (whereas in my ISA, encoding a rounding mode other than RNE or DYN requires a 64-bit encoding).
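(For reference, the values those 3 bits encode in RISC-V:)

  /* RISC-V 'rm' field, bits 14:12 of most F/D instructions: */
  enum rv_rm {
      RM_RNE = 0,   /* nearest, ties to even              */
      RM_RTZ = 1,   /* toward zero                        */
      RM_RDN = 2,   /* down, toward -inf                  */
      RM_RUP = 3,   /* up, toward +inf                    */
      RM_RMM = 4,   /* nearest, ties to max magnitude     */
                    /* 5 and 6 are reserved               */
      RM_DYN = 7    /* dynamic: use the frm field in fcsr */
  };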
>
>
Q+ encodes the rounding mode the same way as RISC-V, as there are lots of bits available in the instruction. Burning bits on the rounding mode seems reasonable to me when bits are available.
Initially:
3 bits of entropy were eaten by the 16-bit space;
2 more bits were eaten by predication and WEX.
So, the initial ISA design for 32-bit ops had 5 fewer bits to work with than in RISC-V land.
XG2 reclaimed the 16-bit space, but used the bits to expand all the register fields to 6 bits.
Not many bits left to justify burning on a rounding mode.
And, my Imm/Disp fields were generally 3 bits smaller than RISC-V's.
Modified the PRED modifier in Q+ to take a predicate bit from one of three registers. Previously, an array of two-bit mask values encoded in the instruction indicated whether to 1) ignore the predicate bit, 2) execute if the predicate is true, or 3) execute if the predicate is false.
Since there were three reg specs available in the PRED modifier, it seemed to make more sense to specify three regs instead of one. So now it works: 1) as before, 2) execute if the bit in Ra is set, 3) execute if the bit in Rb is set, or 4) execute if the bit in Rc is set.
The same register may be specified for Ra, Rb, and Rc. Since sign inversion is available, the original operation may be mimicked by specifying Ra, ~Ra.
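A sketch of the selection logic as described (field names and widths invented for illustration):

  #include <stdint.h>
  #include <stdbool.h>

  /* 2-bit selector per predicated instruction: which register
     supplies the predicate bit (0 = unconditional). */
  bool pred_take(unsigned sel, unsigned bit,
                 uint64_t ra, uint64_t rb, uint64_t rc)
  {
      switch (sel) {
      case 0:  return true;               /* ignore predicate */
      case 1:  return (ra >> bit) & 1;    /* bit in Ra set    */
      case 2:  return (rb >> bit) & 1;    /* bit in Rb set    */
      default: return (rc >> bit) & 1;    /* bit in Rc set    */
      }
  }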
In BJX2, every 32-bit instruction encodes predication in 2 bits.
In XG3, the space that would have otherwise encoded WEX was instead left to RISC-V (to create a conglomerate ISA).
But, there is also the possibility to use XG3 by itself without any RISC-V parts in the mix.