On 7/24/2024 3:37 PM, MitchAlsup1 wrote:
Just before Google Groups got spammed to death; I wrote::
--------------------------------------------------------
MitchAlsup
Nov 1, 2022, 5:53:02 PM
In a thread called "Arguments for a Sane Instruction Set Architecture"
Aug 7, 2017, 6:53:09 PM I wrote::
-----------------------------------------------------------------------
Looking back over my 40-odd year career in computer architecture,
I thought I would list out the typical errors I and others have
made with respect to architecting computers. This is going to be
a bit long, so bear with me:
When the Instruction Set architecture is Sane, there is support
for:
A) negating operands prior to an arithmetic calculation.
Not seen many doing this, and might not be workable in the general case.
Might make sense for FPU ops like FADD/FMUL.
Maybe 'ADD'. Though, "-(A+B)" is the only case that can't be expressed with traditional ADD/SUB/RSUB.
B) providing constants from the instruction stream;
..where constant can be an immediate, a displacement, or both.
Probably true.
My ISA allows for Immediate or Displacement to be extended, but doesn't currently allow (in the base ISA) any instructions that can encode both an immediate and displacement.
At present:
Baseline allows Imm33s/Disp33s via a 64-bit encoding;
There is optional support for Imm57s, which in XG2 is now extended to Imm64.
There are special cases that allow immediate encodings for many instructions that would otherwise lack an immediate encoding.
C) exact floating point arithmetics that get the Inexact flag
..correctly unmolested.
Dunno. I suspect the whole global FPU status/control register thing should probably be rethought somehow.
But, off-hand, don't know of a clearly better alternative.
D) exception and interrupt control transfer should take no more
..than 1 cache line read followed by 4 cache line reads to the
..same page in DRAM/L3/L2 that are dependent on the first cache
..line read. Control transfer back to the suspended thread should
..be no longer than the control transfer to the exception handler.
Likely expensive...
Granted, "glorified branch with some twiddling" is probably a little too far in the other direction. Interrupt and syscall overhead is fairly high when the handler needs to manually save and restore all the registers each time.
A fast, but more expensive, option would be to have multiple copies of the register file which is then bank-switched on an interrupt.
One possibility here could be, rather than being hard-wired to specific modes, there are 4 assignable register banks controlled by 2 status register bits.
Then, say:
0: User Task 1
1: User Task 2
2: Reserved for Kernel / Syscall Task;
3: Reserved for interrupts.
Possibly along with instructions to move between the banked registers and the currently active register file.
Though, the likely cost is that it would require putting the GPR register file in Block-RAM and possibly increasing pipeline length.
In an OS, the syscall and interrupt bank would likely be assigned statically, and the others could be assigned dynamically by the scheduler (though, as-is, would likely increase task-switch overhead vs the current mechanism).
This situation could potentially be "better" if there were 8 dynamic banks, with the scheduler potentially able to be clever and reuse banks if they haven't been evicted and the same process is run again (but could otherwise reassign them round-robin or similar).
Say, 8-bank configuration:
0..5: User Task 1..6
6: Reserved for Kernel / Syscall Task;
7: Reserved for interrupts.
And/or have 0/1 as reserved, and 2..7 for user tasks.
Though, either the 4 or 8 configuration would do well if there is 1 active task and a lot of sleeping tasks. The 8 configuration would mostly have a benefit if there were 2-4 active tasks and the rest are sleeping. The 4-bank configuration is at least "slightly less absurd".
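The scheduler-side reuse idea could be sketched like this (a hypothetical policy, nothing implemented; NBANKS, bank_owner, and assign_bank are all made-up names):

```c
/* Hypothetical round-robin bank assignment with reuse: if a task's
 * registers are still resident in some user bank, reuse that bank
 * (no save/restore needed); otherwise evict the next victim. */
#define NBANKS 6                 /* user banks 0..5 in the 8-bank layout */

static int bank_owner[NBANKS] = { -1, -1, -1, -1, -1, -1 };
static int next_victim = 0;

int assign_bank(int task_id) {
    for (int i = 0; i < NBANKS; i++)
        if (bank_owner[i] == task_id)
            return i;            /* still resident: reuse, no spill */
    int b = next_victim;         /* otherwise evict round-robin */
    next_victim = (next_victim + 1) % NBANKS;
    bank_owner[b] = task_id;     /* caller must spill/fill bank b */
    return b;
}
```

With 1 active task and many sleepers, every switch back to the active task hits the resident-bank fast path.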
Either could also be made to mimic the RISC-V per-mode register banking (if a hardware page walker were added, could potentially be made to mimic the RISC-V privileged spec; though I would also need to add banked versions of several of the CRs in this case).
Say:
LR0..LR3, GBR0..GBR3, TBR0..TBR3, SP0..SP3.
With the normal LR/GBR/TBR/SP pointing to whichever of the banks is currently active.
This bank change mechanism would likely be specific to interrupt handling and context switching though (it would not be valid to modify the active register bank in other contexts).
VBR(49:48): Repurpose to encoding register bank (currently MBZ). Setting these to something other than 00 could signal the use of a bank-swap mechanism.
For task-switching, would be handled by the usual SR<->EXSR bit copying;
May reserve SR(19:18) for these bits (but, this will use up nearly all of the remaining bits in the low 32 bits of SR; the high 32-bits of SR only able to encode global state).
Likely, would keep the banking invisible to much of the pipeline, but would need to be handled in the register-file logic. Would likely add an internal pseudo-register to access out-of-bank registers, likely passing the register number via the immediate field (may or may not overlap with the RISC-V CSR space).
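As a concrete model of the proposed bit assignments (the field positions are the ones proposed above; the helper names are illustrative):

```c
#include <stdint.h>

/* VBR(49:48): bank used on interrupt entry (00 = banking disabled);
 * SR(19:18): currently active bank. These positions are the proposal
 * in the text, not an implemented encoding. */
static inline unsigned vbr_irq_bank(uint64_t vbr) {
    return (unsigned)((vbr >> 48) & 3);
}
static inline unsigned sr_cur_bank(uint64_t sr) {
    return (unsigned)((sr >> 18) & 3);
}
static inline uint64_t sr_set_cur_bank(uint64_t sr, unsigned bank) {
    return (sr & ~(3ULL << 18)) | ((uint64_t)(bank & 3) << 18);
}
```

vbr_irq_bank() != 0 would then be the signal that the bank-swap mechanism is in use.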
Though, can note that as-is, in my case, in some programs, system call overhead is high enough that all this could be worth looking into (Say: Quake 3 manages to spend nearly 3% of the clock-cycle budget in the SYSCALL ISR; mostly saving/restoring registers).
Granted, it also seems to be using glBegin/glEnd, and could be worth looking into trying to get it to use vertex arrays (I am guessing it is probably looking for something in the GL_EXTENSIONS string, would need to look more into it).
It appears like it is doing most of its rendering via:
glBegin(GL_TRIANGLES);
...
glEnd();
E) Exception control transfer can transfer control directly to a
..user privilege thread without taking an excursion through the
..Operating System.
? Putting the scheduler in hardware?...
Could make sense for a microcontroller, but less so for a conventional OS as pretty much the only things handling interrupts are likely to be supervisor-mode drivers.
Doesn't really seem worth it to me.
F) upon arrival at an exception handler, no state needs to be saved,
..and the "cause" of the exception is immediately available to the
..Exception handler.
G) Atomicity over a multiplicity of instructions and over a
..multiplicity of memory locations--without losing the
..illusion of real atomicity.
Memory consistency is hard...
Then again, in recent days I have been trying to hunt down a memory consistency bug in my CPU core.
Found a stall-timing bug in the L1 D$, but there seems to be another type of bug that is only manifesting in the full simulation and FPGA builds (but mostly seems to act up when using virtual memory).
There is still some sensitivity to the timing of stall signals, so it seems likely that something in the core is stall-timing sensitive.
Meanwhile, despite the bug currently only manifesting in the full simulation, it is not affected by things that change the timing or message ordering in the L2 ring.
Don't really want to poke at it on the software side of things, as these sorts of bugs are prone to disappear and then reappear later if one changes anything on the software side of things.
Currently, there are also some invalid ringbus requests (at the time of fault) pointing in the direction of the L1 instruction cache. This is after fixing a bug with spurious requests being sent from the L1 data cache: the flush handling failed to generate a stall when flushing a cache line, so the cache never recorded that it had sent the request, and was then confused when a response came back to a request it did not remember having sent.
Ironically also, stalling when flushing cache lines also made things slightly faster.
Adding similar logic to the L1 I$, however, has not fixed whatever bug is happening here.
Seems like a fully cache-coherent model would be harder as there are more places things could go wrong (vs in a weak consistency model).
At least with a weak model, software knows that if it doesn't go through the rituals, the memory will be stale.
Though, might be better if my emulator could detect or emulate stale memory contents; but annoyingly this is harder to emulate in software than a cache-consistent model. The main point of trying to emulate this would likely be mostly to try to lint cases where the program accesses stale memory.
Even as-is, there is a significant performance penalty in the emulator for trying to emulate and keep track of memory access latency, and trying also to model and detect stale memory accesses may well push it beyond what can be emulated in real-time (this already happens if I build it with debug settings).
H) Elementary Transcendental functions are first class citizens of
..the instruction set, and at least faithfully accurate and perform
..at the same speeds as SQRT and DIV.
... Yeah...
In my case, they don't exist, and FDIV and FSQRT are basically boat anchors.
Well, I guess it could be possible to support them in the ISA if they were all boat anchors.
Say:
FSIN Rm, Rn
Raises a TRAPFPU exception, whereupon the exception handler decodes the instruction and performs the FSIN operation.
Though, this is undesirable: as-is, the exception handling has significant overhead (say, doing FDIV via a trap would take around 16x-20x longer than doing it via a Shift-Add unit).
I) The "system programming model" is inherently:
..1) Virtual Machine
..2) Hypervisor + Supervisor
..3) multiprocessor, multithreaded
If the system-mode architecture is low-level enough, the difference between normal OS functionality and emulation starts to break down.
Like, in both cases one has:
Software page table walking;
Needing to keep track of a virtual model of the TLB;
Software needing to care about and manually inject address translations into the TLB;
...
Then, it doesn't add too much more to add a nesting layer...
But, maybe a person could add helper instructions for hashed PC lookup and trampolining:
Take VM_PC and turn it into an index into a hash table;
Fetch trace from hash table;
See if trace VPC matches VM_PC;
If true, branch to trace's "Run()" handler;
Else, decode and execute trace.
Say, if we had a "PCHASH4K" instruction that did, say:
H=((VMPC>>1)^(VMPC>>13))&4095;
Etc...
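The lookup/dispatch steps above, with the PCHASH4K formula exactly as given, might look like this (Trace, trace_table, and run are illustrative names, not an existing implementation):

```c
#include <stdint.h>

/* Hypothetical PCHASH4K: hash a guest VM_PC down to a 4096-entry
 * trace-table index, using the formula from the text. */
static uint32_t pchash4k(uint64_t vmpc) {
    return (uint32_t)(((vmpc >> 1) ^ (vmpc >> 13)) & 4095);
}

/* Minimal trace-cache lookup/dispatch loop. */
typedef struct Trace {
    uint64_t vpc;            /* guest PC this trace was decoded from */
    void   (*run)(void);     /* native "Run()" handler for the trace */
} Trace;

static Trace trace_table[4096];

static void dispatch(uint64_t vm_pc) {
    Trace *t = &trace_table[pchash4k(vm_pc)];
    if (t->vpc == vm_pc && t->run) {
        t->run();            /* hit: branch to the trace's handler */
    } else {
        /* miss: decode instructions at vm_pc into a new trace,
         * install it in the table, then execute it. */
    }
}
```

A PCHASH4K instruction would collapse the shift/xor/mask into one op on the hot path of every dispatch.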
And/or run code natively, but have some mechanism to transform privileged instructions into a sort of special supervisor call (or a modified form of the existing trap mechanism).
Say:
Performs a branch relative to a VMTEVBR register or similar, with an offset depending on the specific trapped instruction encountered;
Instruction word is loaded into TEA;
...
Which could potentially save some overhead vs a general-purpose fault exception, if there were a way to pull off lightweight interrupts (such as having multiple register banks).
Could potentially also make sense as a way to try to run supervisor mode RISC-V code on my CPU core (potentially faking most of the privileged functionality in software, while otherwise running it mostly natively).
But, RISC-V performance is still "kinda meh" (*1).
Could maybe also get back around to trying to do x86 emulation on this, but had mostly put it off as I don't expect to be able to get usable performance. Nor is x86 really viable for AOT translation.
*1: Though, I guess there is a low activity group that is trying to address some of this. I guess Qualcomm had added a bunch of instructions, a fair number of them overlapping with functionality that exists in BJX2.
Though, also via a lot of 48-bit encodings.
The charter said they were going to look into the addressing-mode issue, but no instructions have been defined for this yet (I would have prioritized this; in my case it seems like the main "elephant in the room" regarding performance; but instead they seemingly prioritized "larger displacements for local conditional branches").
Then again, I guess "if()" being limited to around +/- 2K or so IIRC, is probably a bit of an annoyance (generally requires a double branch).
Well, and apparently also a 48 bit encoding for:
LI Xd, Imm32s
...
Which I guess is arguably more compact as:
MOV Imm33s, Rn
Currently requires a 64-bit encoding in BJX2.
Though, I slightly prefer the use of "jumbo prefixes" here, as "jumbo prefix makes immediate or displacement bigger" is preferable, as I see it, to "define a whole new unrelated encoding which is basically a previous instruction with a bigger immediate".
J) Simple applications can run with as little as 1 page of Memory
..Mapping overhead. An application like 'cat' can be run with
..a total allocated page count of 6: {MMU, Code, Data, BSS, Stack,
..and Register Files}
Hmm.
I guess one could make a case for a position-independent version of an "a.out" like format, focused on low-footprint binaries.
Then, have a standardized layout like, say (Base Relative):
0000..003F: Header
0040..1XXX: Code
2XXX..7XXX: Data/BSS
8XXX..FXXX: Stack
The format could then save potentially hundreds of bytes on headers vs the minimal-size PE (or multiple kB for a more conventional PE). Though, to be useful, would need a low-footprint C library.
Or, possibly could also do it with split executable and data sections (with a 64K limit for read-only sections, and another 64K limit for data+bss+stack).
Say, for example, all read-only data is PC-rel, and the data sections are accessed via the global pointer.
But, much past 64K, may as well just use PE...
The strategy I had used (also used in some Linux configurations) is to fold all of these small command-line tools into the shell.
In my case, they were built-in to the shell directly (so, invoking many of these commands will not spawn a process at all).
In some other forms (like BusyBox or ToyBox), it is an "omni-binary" that has all of the command names symlinked to it; and will behave as a given command depending on what name it was invoked as.
So, for example, "/bin/cat", "/bin/ls", ..., will all point at the same binary.
...
--------------------------------------------------------------------
<
I thought it might be fun to have a review of what came out of this::
<
At the time of that writing My 66000 ISA was still gestating in my
head--I was pretty much following the Mc 88000 Architecture in scope
and in format.
<
So; point by point::
<
A) negating operands prior to an arithmetic calculation.
1-operand instructions have sign control over result and of operand
2-operand instructions have sign control over both operands
3-operand instructions have sign control over two operands
So: check
>
B) providing constants from the instruction stream;
1-operand instructions have one <optional> immediate
2-operand instructions have one register and one <optional> immediate
3-operand instructions have two registers and one <optional> immediate
Loads have base register, index register and <optional> displacement
Stores have the same addressing, but the value being stored can be
....either from a register or from an immediate.
Many immediates have auto-expanding characteristics::
one can FADD Rd,Rs1,#3 to add 3.0D0 using a single 1-word
instruction. 32-bit immediates for (double) FP calculations are auto-
expanded to 64-bits in operand delivery.
Similarly, integer instructions have ±5-bit immediates, signed 16-bit
immediates, 32-bit immediates and 64-bit immediates.
Memory references have 16-bit, 32-bit, and 64-bit displacements.
When Rbase = R0 IP is inserted for easy access to data relative to the
code stream.
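The (float)->(double) auto-expansion is semantically just a widening conversion, which is exact for every float value; a sketch of what operand delivery does with the 32-bit pattern (helper name illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Model of auto-expanding FP immediates: the instruction word carries
 * a 32-bit float pattern, and operand delivery widens it to double.
 * The widening is exact, since every float is representable as a
 * double. */
static double expand_fp32_imm(uint32_t imm_bits) {
    float f;
    memcpy(&f, &imm_bits, sizeof f);   /* reinterpret the bit pattern */
    return (double)f;                  /* exact widening conversion */
}
```

So FADD Rd,Rs1,#3 can carry 3.0 as the 32-bit pattern 0x40400000 and still deliver exactly 3.0D0 to the adder.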
So, big Check
<
C) exact floating point arithmetics that get the Inexact flag
..correctly unmolested.
While CARRY provides access to these features (and the inexact bit
....gets set correctly), it is my current assessment that DBLE will be
....of greater use and utility than the exact FP arithmetics.
So, little check
D) exception and interrupt control transfer should take no more
..than 1 cache line read followed by 4 cache line reads to the
..same page in DRAM/L3/L2 that are dependent on the first cache
..line read.
While the above is TRUE, it is different than expected. Yes, a context
switch still takes 5 cache-line reads, and a context switch can transpire
from any thread under any GuestOS to any other thread under any other
GuestOS; all of this is "perpetrated" by a "fixed function unit" far
from the cores of the chip.
This fixed function unit combines the thread being scheduled, the
customer thread asking for service, and the <appropriate> HyperVisor
data, "assembled" into a single message that effects a context switch.
<
E) Exception control transfer can transfer control directly to a
..user privilege thread without taking an excursion through the
..Operating System.
This remains elusive--while it is technically possible to setup
"state" such that the above happens; it requires each such thread
run under its unique GuestOS. However, one can configure a rather
normal GuestOS so that the exception dispatcher transfers control
to a user level exception handler in 15-ish instructions.
So, medium check
-------------------------------------------------------------------
Update: My 66000 new interrupt architecture can now allow interrupts
or exceptions to be directed at Application privilege level. And this
is how Linux would deliver signal() to Applications.
In addition, VMexit()s need no diddling with interrupt or PCI control
registers.
So, memory is virtualized, devices are virtualized, device DMA is
virtualized, device interrupts are virtualized, and the relation-
ship between cores and interrupt is virtualized; not needing any
diddling when one traverses up and down the privilege levels. The
only overhead is <rather static> mapping tables.
-------------------------------------------------------------------
<
F) upon arrival at an exception handler, no state needs to be saved,
..and the "cause" of the exception is immediately available to the
..Exception handler.
The above is TRUE and also comes with the property that multiple
exceptions can be logged onto a handler without Interrupt or
Exception disablement.
No state needs to be saved: Check
No state needs to be loaded: Check
Pertinence arrives with control: Check
Control arrives on affinitized core: Check
--------
unCheck
--------
Control arrives at proper priority: Check
Control arrives with proper "privilege": Check
Hard Real Time supported: Maybe
---------------------------
Closer to check than maybe.
---------------------------
Moderate Real Time Supported: Check
No extraneous excursions through the OS: Check.
Overall: Big check
<
G) Atomicity over a multiplicity of instructions and over a
..multiplicity of memory locations--without losing the
..illusion of real atomicity.
Up to 8 cache lines participate in an ATOMIC event.
Multiple locations in each line may have state altered.
There is direct access to whether interference has transpired.
Software can use interference to drive down future interference.
Hardware can transfer control if ATOMICITY has been violated.
Essentially ANY atomic-primitive studied in academia or provided
by industry can be synthesized.
So, medium check
<
H) Elementary Transcendental functions are first class citizens of
..the instruction set, and at least faithfully accurate and perform
..at the same speeds as SQRT and DIV.
Transcendental functions operate at about the latency of FDIV
ln2, ln2P1, exp2, exp2M1                 14 cycles
ln, ln10, exp, exp10 <and cousins>       18 cycles
sin, sinpi, cos, cospi                   19 cycles {including Payne and
                                         Hanek argument reduction}
tan, atan                                19 or 38 cycles
power                                    35 cycles
23 Transcendental instructions are available in (float) and (double)
forms.
(float will be around 9 cycles)
So, reasonable check.
<
I) The "system programming model" is inherently:
..1) Virtual Machine
..2) Hypervisor + Supervisor
..3) multiprocessor, multithreaded
It is not only the above, but even moderately hard real time is built
in.
Interrupts are directed at threads not cores
------------------------------------------------------------------------
Turns out that Linux thinks interrupts are directed at cores and there
is essentially nothing anyone can do about that. My 66000 new system
model is much more Linus friendly at the cost of hard real time.
------------------------------------------------------------------------
Deferred Procedure Calls are single instruction events
--------
Check.
-------
Most handler->handler control transfers do not need an excursion through
the OS scheduler.
-----------------------------------------------------------------------
ISR schedules a softIRQ and then when it SVRs the softIRQ gains control
before what originally got interrupted, transitively, without having SW
traverse schedule queues.
-----------------------------------------------------------------------
Basically, if you have less than 1024 processes in a Linux system, the
lower level scheduler consumes no cycles on a second by second basis.
Context switch between threads under different hypervisors takes the same
10 cycles as a context switch between threads under the same GuestOS.
Conventional machines might take 1,000 cycles for a within GuestOS
context switch and 10,000 cycles on a between Guest OS context switch;
given 1,000 context switches per second, this accounts for a fraction
of a percent speed up.
So, moderate-big check
<
J) Simple applications can run with as little as 1 page of Memory
..Mapping overhead.
Achievable even when different areas {.text, .data, .bss, .stack, ...}
are separated by GB or even TB.
So, check
-------------------------------------------------------------------
That is all.