On 2/18/2025 1:57 PM, Marcus wrote:
On 2025-02-18 at 20:25, Brett wrote:
BGB <cr88192@gmail.com> wrote:
[snip]
The smallest cores I had managed had used a simplified SH-based design:
Fixed-length 16 bit instructions, with 16 registers;
Only (Reg) and (Reg, R0) addressing;
Aligned only;
No shift or multiply;
>
You mean no variable shift, or no large shifts? You have to support divide
by 2, right?
>
Yes, LSL 1 can be implemented by ADD, but LSR/ASR 1 needs a dedicated
instruction, right?
Yeah, I meant no variable shift.
There were no generic SHAD/SHLD instructions.
I forgot to mention it, but along the way I had also dropped the auto-increment Load/Store instructions.
Say:
MOV.L @R4+, R8
Can be cheaply enough decomposed into:
MOV.L @R4, R8
ADD #4, R4
But, this freed up a lot of holes in the encoding space, and (in the absence of pressure to retain binary compatibility) I ended up reorganizing the encoding space in ways I found more pleasing.
Say, for example, what later became BtSR1 and BJX2 had the 0nmZ/0Znm block filled entirely with Load/Store ops.
0..3: Store, (Rm)
4..7: Store, (Rm,R0)
8..B: Load, (Rm)
C..F: Load, (Rm,R0)
Where, SuperH had had Load/Store ops scattered all over the place.
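As a rough sketch of why that layout is cheap to decode: with the 0nmZ form, the top two bits of Z give Load-vs-Store and the addressing mode directly. The function name is made up for illustration, and treating the low two bits of Z as a sub-op (possibly an operand-size selector) is an assumption; the list above does not say what they mean.

```python
def decode_0nmz(insn):
    """Classify a 16-bit word in the 0nmZ block (hypothetical helper)."""
    if (insn >> 12) & 0xF != 0x0:
        raise ValueError("not in the 0nmZ block")
    rn = (insn >> 8) & 0xF   # 'n' nibble
    rm = (insn >> 4) & 0xF   # 'm' nibble
    z = insn & 0xF           # 'Z' nibble selects the operation

    kind = "Load" if (z & 0x8) else "Store"       # 0..7 Store, 8..F Load
    mode = "(Rm,R0)" if (z & 0x4) else "(Rm)"     # 4..7 and C..F are indexed
    sub = z & 0x3   # assumption: meaning not given above (maybe operand size)
    return kind, mode, sub, rn, rm


if __name__ == "__main__":
    for word in (0x0120, 0x0125, 0x0349, 0x034C):
        print(hex(word), decode_0nmz(word))
```

Two bit tests cover the whole block, with no lookup table needed.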
Though, for BtSR1, moving the opcode bits and register fields was a mistake in retrospect, as it made it so that Rn was not always in the same place:
10nm ADD Rm, Rn
Cnii ADD Imm8, Rn
Whereas, 1nm0 vs Cnii would have been more consistent.
At the time, I didn't realize the relatively high cost of having register fields that can move around.
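To illustrate the moving-field problem, here is a small sketch of extracting Rn for just the two encodings above. The function names are hypothetical, and this is only a software model of what would be a mux in the decoder.

```python
def rn_moving(insn):
    """Rn for the layout as built: 10nm (ADD Rm, Rn) and Cnii (ADD Imm8, Rn)."""
    top = (insn >> 12) & 0xF
    if top == 0x1 and ((insn >> 8) & 0xF) == 0x0:
        return (insn >> 4) & 0xF    # 10nm: Rn sits in bits 7:4
    if top == 0xC:
        return (insn >> 8) & 0xF    # Cnii: Rn sits in bits 11:8
    raise ValueError("encoding not covered by this sketch")


def rn_fixed(insn):
    """Rn for the alternative layout: 1nm0 and Cnii both keep Rn in bits 11:8."""
    return (insn >> 8) & 0xF


if __name__ == "__main__":
    print(rn_moving(0x1053), rn_moving(0xC57F))  # both name R5, from different bits
    print(rn_fixed(0x1530), rn_fixed(0xC57F))    # both name R5, from the same bits
```

In the first case the field position depends on the opcode, so the register-port select has to wait on (part of) the opcode decode; in the second it does not.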
The BRA/BSR Disp12 Delay-Slot ops ended up replaced with Loading a 12-bit value into R0, as generally this was more useful (granted, Disp8 doesn't reach very far).
A better tradeoff in retrospect might have been a compromise, say:
Azzz: BRA Disp12 (but, no delay slot)
Bzzz: MOV Imm12u, Rn
Noting how large positive constants are a lot more common than large negative constants. And, "BRA Disp12" is still useful, whereas "BSR Disp12" was kind of useless as function calls were almost always outside a 4K window.
But, probably not going to do another SH style ISA as I don't have a strong use-case for a 32-bit microcontroller.
But, technically, a small SuperH style core can be made for less resource budget than what is needed for RV32I or RV32E (and is easier and cheaper to decode than the 'C' encoding).
One limitation of a 32-bit core is that it would have a harder time working with 64-bit data or a 48-bit address space. This meant that even if I could make them small, they were not great IO co-processors, and a usable small core, while significantly smaller than a full-featured BJX2 core, is still not "tiny" either.
I would still need a small 64-bit core to be able to effectively do DMA style IO tasks (such as framebuffer copying or managing IO to/from the SDcard).
But, harder to get a 64-bit core below around 1/4 the size of the BJX2 Core (even with an SH-like ISA design and minimal L1 cache).
Or, say:
1-wide 32-bit core: ~ 5 or 6 kLUT;
1-wide 64-bit core: ~ 8 or 9 kLUT.
With no MMU, FPU, and single-cache-line L1 caches.
Or, for an SH-like ISA, an L1 I$ miss for every 8 instructions, ...
IIRC, a past attempt at RV32I came out at around 7 kLUT.
Granted, I think other people have managed smaller cores.
But, I can note that, for the BJX2 core, ~ 1/4 of the LUT budget goes just into the L1 caches (also partly why they remain direct-mapped, ...).
IIRC the SuperH has some power-of-two shift instructions, e.g:
shlr Rn
shlr2 Rn
shlr4 Rn
shlr8 Rn
shlr16 Rn
It was 1,2,8,16.
They skipped 4, but adding 4 helps here.
Also, a 1-bit SHAR.
And, some SHLL2/8/16 cases.
Though, one could maybe go for, say:
1,2,3,4,8,12,16,20
Which allows all shifts between 1..24 to be encoded within 2 instructions. Though, 25..31 may require 3 instructions.
Or:
1,2,3,4,8,16,24
Where:
1..12: All reachable in 1/2 instructions (13..15 need 3, e.g. 13=8+4+1).
16..31: All reachable in at most 3 instructions.
Though, going too far, and one almost may as well just have a full shifter.
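A quick way to check sets like these is a small coin-change style search: for each shift amount, find the fewest fixed shifts (same direction, amounts simply add) that sum to it. The function name is made up, and this only models the instruction count, not encoding cost.

```python
def min_shift_ops(amounts, max_shift=31):
    """best[n] = fewest fixed-shift instructions whose amounts sum to n."""
    inf = float("inf")
    best = [0] + [inf] * max_shift
    for total in range(1, max_shift + 1):
        for a in amounts:
            # Try ending the sequence with a shift by 'a'.
            if a <= total and best[total - a] + 1 < best[total]:
                best[total] = best[total - a] + 1
    return best


if __name__ == "__main__":
    for amounts in ([1, 2, 3, 4, 8, 12, 16, 20],
                    [1, 2, 3, 4, 8, 16, 24],
                    [1, 4, 16]):
        best = min_shift_ops(amounts)
        print(amounts)
        print("  ", {n: best[n] for n in range(1, 32)})
```

The same function also covers sparser sets such as 1, 4, 16, where the worst case grows quickly.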
It takes up some encoding space and costs extra cycles/instructions to
do a full shift (e.g. 7=4+2+1), but I guess you can make relatively
cheap shift hardware that way? Maybe you can get away with even fewer
instructions (e.g. only 1, 4, 16)?
Pretty much.
But, they were 1R ops, reducing how much encoding space was needed.
It was a little annoying, but mostly worked OK for such an ISA.
If you wanted a variable shift, the usual strategy was to branch into a shift slide.
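For anyone unfamiliar with the term, a rough model of a shift slide, assuming the simplest form: a run of 31 back-to-back 1-bit shifts, entered at a computed offset so that exactly 'count' of them execute. The names here are hypothetical, and a real slide could just as well mix in the larger fixed shifts to shorten it.

```python
MASK32 = 0xFFFFFFFF
SLIDE_LEN = 31   # assumption: one 1-bit shift per slot, 31 slots


def shift_left_via_slide(value, count):
    """Model of a variable left shift done by branching into a shift slide."""
    count &= 31
    pc = SLIDE_LEN - count           # computed branch target within the slide
    while pc < SLIDE_LEN:            # fall through the remaining slots
        value = (value << 1) & MASK32    # each slot is a fixed 1-bit shift
        pc += 1
    return value


if __name__ == "__main__":
    print(hex(shift_left_via_slide(0x00000001, 5)))
    print(hex(shift_left_via_slide(0xDEADBEEF, 12)))
```

The cost is one cycle per bit shifted plus the branch, which is the trade made for not having a variable shifter.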