On 7/25/2024 5:07 PM, MitchAlsup1 wrote:
On Thu, 25 Jul 2024 20:09:06 +0000, BGB wrote:
On 7/24/2024 3:37 PM, MitchAlsup1 wrote:
Just before Google Groups got spammed to death; I wrote::
--------------------------------------------------------
MitchAlsup
Nov 1, 2022, 5:53:02 PM
>
In a thread called "Arguments for a Sane Instruction Set Architecture"
Aug 7, 2017, 6:53:09 PM I wrote::
-----------------------------------------------------------------------
Looking back over my 40-odd year career in computer architecture,
I thought I would list out the typical errors I and others have
made with respect to architecting computers. This is going to be
a bit long, so bear with me:
>
When the Instruction Set architecture is Sane, there is support
for:
A) negating operands prior to an arithmetic calculation.
>
Not seen many doing this, and might not be workable in the general case.
Might make sense for FPU ops like FADD/FMUL.
>
Maybe 'ADD'. Though, "-(A+B)" is the only case that can't be expressed
with traditional ADD/SUB/RSUB.
a) one does not need a SUB or NEG instruction as one has:
ADD Rd,R1,R2
ADD Rd,R1,-R2
ADD Rd,-R1,R2
ADD Rd,-R1,-R2
Which basically gets rid of the unary NEG instruction.
Possibly, but "RSUB Rm, 0, Rn" could also indirectly be used to encode "NEG Rm, Rn".
In terms of clock-cycles, as-is "NEG" is less than 0.01% of the total cycle budget, so eliminating it from being used likely wouldn't have any real noticeable effect on performance.
The main place NEG was being used in the past was for encoding "x>>y" via the "SHAD" instruction, but then I added a "SHAR" instruction which implicitly reverses the direction of the shift (and NEG is now only rarely used).
It does end up also used for encoding:
"ptr-=size;"
As, say:
NEG Rs, Rt
LEA.Q (Rb, Rt), Rd
But, this is also relatively infrequent.
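The SHAD/SHAR relationship described above can be sketched in C. This is a sketch of the semantics as I understand them (positive count shifts left, negative count shifts arithmetically right); the exact count-masking rules are an assumption here, and real hardware may differ in edge cases:

```c
#include <stdint.h>

/* SHAD-style variable shift: a positive count shifts left, a negative
   count shifts arithmetically right. The 6-bit count masking is an
   assumption; right-shift of a negative value relies on the usual
   arithmetic-shift behavior of mainstream compilers. */
static int64_t shad(int64_t x, int64_t n)
{
    if (n >= 0)
        return (int64_t)((uint64_t)x << (n & 63));
    return x >> ((-n) & 63);
}

/* SHAR-style: same operation with the direction implicitly reversed,
   so "x >> y" no longer needs a NEG on the shift amount first. */
static int64_t shar(int64_t x, int64_t n)
{
    return shad(x, -n);
}
```

This shows why SHAR removes most uses of NEG: the negation is folded into the shift itself.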
>
>
B) providing constants from the instruction stream;
..where the constant can be an immediate, a displacement, or both.
>
Probably true.
>
My ISA allows for Immediate or Displacement to be extended, but doesn't
currently allow (in the base ISA) any instructions that can encode both
an immediate and displacement.
ST #3.14159265358927,[IP,R3<<3,#0x123456789abcd]
Here we have 5 instruction words storing 2 words anywhere in memory in
one instruction and one decode cycle; we waste no registers with the
constants. Looks to be 7 instructions in RISC-V including 2 LDDs...
Assuming the displacement were limited to 33 bits:
MOV 3.14159265358927, R5 //Assuming this was valid ASM...
LEA.Q (PC, R3), R4
MOV.Q R5, (R4, 0x12345678)
Could in theory be reduced to 2 instructions via RiDisp, but only if the displacement is within 11 bits.
MOV 3.14159265358927, R5
MOV.Q R5, (PC, R3, 0x123)
The RiDisp extension is not generally enabled though; failing to cross a "makes enough of a difference to be worth the cost" metric in my testing.
Assuming one really needs the full 64 bit displacement:
MOV 3.14159265358927, R5
LEA.Q (PC, R3), R4
ADD 0x123456789ABCD, R4
MOV.Q R5, (R4)
A full 64-bit displacement could be encoded in XG2 Mode as-is, but at present isn't valid in the ISA rules.
It could be allowed to use an Imm57s encoding with a 48-bit displacement, assuming the extension for 48-bit load/store displacements was enabled. But, as-is, this doesn't make much difference and is bad for timing (not much point in 48-bit displacements when 0% of the observed displacements exceed 33 bits).
>
At present:
Baseline allows Imm33s/Disp33s via a 64-bit encoding;
There is optional support for Imm57s, which in XG2 is now extended to
Imm64.
>
There are special cases that allow immediate encodings for many
instructions that would otherwise lack an immediate encoding.
>
>
C) exact floating point arithmetics that get the Inexact flag
..correctly unmolested.
>
Dunno. I suspect the whole global FPU status/control register thing
should probably be rethought somehow.
>
But, off-hand, don't know of a clearly better alternative.
>
>
D) exception and interrupt control transfer should take no more
..than 1 cache line read followed by 4 cache line reads to the
..same page in DRAM/L3/L2 that are dependent on the first cache
..line read. Control transfer back to the suspended thread should
..be no longer than the control transfer to the exception handler.
>
Likely expensive...
Treat "thread state" and its register file as a write back cache.
But, how to pull this off?...
If it requires building the whole register file out of FF's, this would be worse than building it out of Block-RAM, at least on FPGA.
As-is, I need to build the CRs out of FF's, and this is already rather expensive.
The only other real option is to have something that loads or stores the registers at 2 or so registers per clock-cycle, but this is what is already being done in software.
A hardware state-machine whose sole purpose is to bulk copy registers to/from a block of BRAM or similar upon interrupt entry/return is possible, but kinda lame.
>
>
Granted, "glorified branch with some twiddling" is probably a little too
far in the other direction. Interrupt and syscall overhead is fairly
high when the handler needs to manually save and restore all the
registers each time.
>
>
A fast, but more expensive, option would be to have multiple copies of
the register file which is then bank-switched on an interrupt.
Under My 66000 a low end implementation can choose the write back cache
version, while the GBOoO implementation can choose the bank switcher.
In both cases, the same model is presented to executing SW.
OK.
I guess, a question is how much BRAM I would need for a bank-switcher...
Technically, need 6x3 copies of the registers internally (for 6R3W), and 2 BRAMs for 256x 64-bits.
...
Checking: this would eat around 1/3 of the Block-RAM on the XC7A100T, and would likely require reducing the size of the L2 Cache.
Though, the 4-bank and 8-bank case would likely use the same amount of Block-RAM (for the 256x 64-bit case, it would waste roughly half of each BRAM).
Based on resource use estimates, it looks like it might actually be better to try to build the thing by MUX'ing LUTRAM.
Or, increase the number of tag-bits for controlling the register file, say:
0000: Bank 0, Lane 1
0001: Bank 0, Lane 2
0010: Bank 0, Lane 3
0100: Bank 1, Lane 1
0101: Bank 1, Lane 2
0110: Bank 1, Lane 3
...
With a set of 4 backing arrays:
If the register is tagged as being in the current bank;
Use as normal;
Else:
Stall the pipeline
Attempt to store the register to the backing array;
Attempt to load the register from the other array.
With a mechanism to do this for each register port, only continuing once all ports are in a consistent state with the current register bank.
This could potentially use 22 internal copies of the register arrays, rather than 72. Where 18 of these are for the 6x3 register ports.
I guess, if I were to go the route of adding register banks, the first order of business would be to figure out some way to implement the register file in a way that is not horridly expensive...
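The per-port tag-check scheme above (spill the stale bank's value, fill from the current bank) can be modeled in software. This is a minimal sketch with assumed sizes (32 registers, 4 banks), not the actual hardware:

```c
#include <stdint.h>

enum { NREGS = 32, NBANKS = 4 };

static uint64_t fast[NREGS];            /* the one physical register array  */
static uint8_t  tag[NREGS];             /* which bank each entry belongs to */
static uint64_t backing[NBANKS][NREGS]; /* backing arrays, one per bank     */
static int      cur_bank;               /* currently selected register bank */

/* On access, if the entry holds another bank's value, "stall": spill
   it to its owning backing array, then fill from the current bank's
   backing array. Only registers actually touched get swapped. */
static uint64_t *reg(int r)
{
    if (tag[r] != cur_bank) {
        backing[tag[r]][r] = fast[r];
        fast[r] = backing[cur_bank][r];
        tag[r] = (uint8_t)cur_bank;
    }
    return &fast[r];
}
```

The lazy swap is the point: after a bank switch, only the registers the new context actually uses pay the spill/fill cost, which is what the stall-per-port mechanism buys over a bulk copy.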
>
One possibility here could be, rather than being hard-wired to specific
modes, there are 4 assignable register banks controlled by 2 status
register bits.
>
Then, say:
0: User Task 1
1: User Task 2
2: Reserved for Kernel / Syscall Task;
3: Reserved for interrupts.
>
Possibly along with instructions to move between the banked registers
and the currently active register file.
Just memory map everything into MMI/O space where you have access to
memorymove(to, from, count) capabilities and can move an entire
thread state in 1 instruction.
MMIO would require figuring out how to implement the register file in such a way that it is accessible both as registers and as MMIO, and is consistent in both cases.
A hacky instruction seems like it may well be easier to implement.
Say:
LDRBSQ Imm10, Rn //Load register from Banked Set (64-bit)
STRBSQ Imm10, Rn //Store register to Banked Set (64-bit)
LDRBSX Imm10, Xn //Load register from Banked Set (128-bit)
STRBSX Imm10, Xn //Store register to Banked Set (128-bit)
>
>
Though, likely cost would be that it would require putting the GPR
register file in Block-RAM and possibly needing to increase pipeline
length.
Just MMI/O
>
In an OS, the syscall and interrupt bank would likely be assigned
statically, and the others could be assigned dynamically by the
scheduler (though, as-is, this would likely increase task-switch overhead vs
the current mechanism).
For SYSCALL in particular, you want at least 6 of the caller's registers
to pass arguments to the service provider, and at least 1 register to
return the result.
I had put the SYSCALL arguments and similar in a shared memory array.
Though, yeah, the syscall handler may in some cases need to read/write the registers of other tasks, which would be potentially a little more complicated with banked registers (one would then need a wrapper that behaves differently based on whether the task is present in one of the register banks, vs having had all its registers spilled to memory).
>
This situation could potentially be "better" if there were 8 dynamic
banks, with the scheduler potentially able to be clever and reuse banks
if they haven't been evicted and the same process is run again (but
could otherwise reassign them round-robin or similar).
The Write Back Cache model works easier.
<snip>
>
Though, can note that as-is, in my case, in some programs, system call
overhead is high enough that all this could be worth looking into (Say:
Quake 3 manages to spend nearly 3% of the clock-cycle budget in the
SYSCALL ISR; mostly saving/restoring registers).
My SVC overhead is about 10 cycles.
VM exit overhead is also about 10 cycles.
When it is 1K cycles, it adds up.
E) Exception control transfer can transfer control directly to a
..user privilege thread without taking an excursion through the
..Operating System.
>
? Putting the scheduler in hardware?...
Policy remains in SW, the ability to manifest a SW choice fast is in HW.
>
Could make sense for a microcontroller, but less so for a conventional
OS as pretty much the only things handling interrupts are likely to be
supervisor-mode drivers.
Signal handlers.
Seems like most of these either originate from or would otherwise need to go through the OS.
>
F) upon arrival at an exception handler, no state needs to be saved,
..and the "cause" of the exception is immediately available to the
..Exception handler.
G) Atomicity over a multiplicity of instructions and over a
..multiplicity of memory locations--without losing the
..illusion of real atomicity.
>
Memory consistency is hard...
It is simply a fully pipelined version of LL/SC
>
H) Elementary Transcendental functions are first-class citizens of
..the instruction set, and at least faithfully accurate and perform
..at the same speeds as SQRT and DIV.
>
.... Yeah...
>
In my case, they don't exist, and FDIV and FSQRT are basically boat
anchors.
>
>
Well, I guess it could be possible to support them in the ISA if they
were all boat anchors.
>
Say:
FSIN Rm, Rn
Raises a TRAPFPU exception, whereupon the exception handler decodes the
instruction and performs the FSIN operation.
The trap is likely more cycles than FSIN().
Probably...
Though, can note that the C library I am using originally came with "sin()"/"cos()" implementations that did the absurd thing of calculating the powers and factorials live within the body of a loop, dividing them, and adding up the terms.
It was *slow*...
Then replaced it with the much faster option of using an unrolled Taylor-series expansion.
There is the faster-still option of using a lookup table plus spline interpolation, though this is a lot less accurate; still not as fast as a bare lookup table with no interpolation, however.
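A minimal sketch of the unrolled Taylor-series approach (my reconstruction, not the library's actual code): coefficients through x^11 evaluated in Horner form, with argument reduction omitted, so it is only accurate for small |x|:

```c
/* sin(x) via an unrolled Taylor series in Horner form:
     x - x^3/3! + x^5/5! - x^7/7! + x^9/9! - x^11/11!
   Every coefficient is a compile-time constant, so there is no loop
   and no live factorial computation. No range reduction is done here,
   so accuracy degrades once |x| gets much past pi/2. */
static double sin_taylor(double x)
{
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0 + x2 * (1.0 / 120.0
             + x2 * (-1.0 / 5040.0 + x2 * (1.0 / 362880.0
             + x2 * (-1.0 / 39916800.0))))));
}
```

The contrast with the loop version is the point: the whole evaluation is a handful of FMUL/FADD (or FMAC) operations with no division at all.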
>
I) The "system programming model" is inherently:
..1) Virtual Machine
..2) Hypervisor + Supervisor
..3) multiprocessor, multithreaded
>
If the system-mode architecture is low-level enough, the difference
between normal OS functionality and emulation starts to break down.
>
Like, in both cases one has:
Software page table walking;
How does one walk a nested page table when HV does not want OS to see
its mapping tables, and vice versa ??
Intercept loads into the guest's TLB, and as-needed, retranslate them to the actual physical addresses on the host side of things.
The guest OS's TLB is faked.
Needing to keep track of a virtual model of the TLB;
TLB is an association of host.PTE with guest.virtual-address.
You can't have host or guest perform the TLB update !!
Not on the real TLB at least...
You can have a fake TLB though, with fake physical addresses (and a fake physical memory map). All MMIO access would also be faked, etc.
Like in an emulator:
The TLB inside the emulator is unrelated to the page-table or TLB in the host system.
On a TLB miss (in the host), one can pull from the guest TLB, run it through another level of page translation, and then load it into the real TLB. Or, if it is missing in the guest TLB, initially do a dummy load and then forward the TLB miss to the guest.
One could also, in principle, trap the loads into the guest TLB and use them to update a host-level page table or similar.
Essentially, in a configuration with 3 page tables:
Guest page table (virtual);
Host VM physical to true physical (Logical, or when guest MMU is disabled);
Active guest-virtual to host physical (maintained on the host, driven by updates to guest TLB, essentially a double-translated page-table).
Though, one would want to avoid the host remembering "too much" (if something no longer exists in a guest's TLB, the host's mappings should no longer remember it either, as keeping stale entries may result in inconsistencies in the virtual memory system).
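The composed translation can be illustrated with a toy lookup; flat arrays stand in for real multi-level tables, and all names and sizes here are assumptions for illustration:

```c
#include <stdint.h>

/* Toy tables: page 0 is reserved as the "not mapped" sentinel. */
enum { NPAGES = 16, INVALID = 0 };

static uint8_t guest_pt[NPAGES]; /* guest-virtual page -> guest-physical page */
static uint8_t host_map[NPAGES]; /* guest-physical page -> host-physical page */

/* Compose the two translations; the result is what would be loaded
   into the real (host) TLB on a miss. INVALID signals that the miss
   must instead be forwarded to the guest's TLB-miss handler (or to
   the host's own VM-map handling). */
static int translate(int gvpn)
{
    int gppn;
    if (gvpn < 0 || gvpn >= NPAGES) return INVALID;
    gppn = guest_pt[gvpn];
    if (gppn == INVALID) return INVALID;
    return host_map[gppn];
}
```

The "double-translated page table" mentioned above is essentially a cache of this composition, invalidated whenever either underlying mapping changes.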
One other option is to have the host tell the guest that there is a hardware page walker, and then the host can do a nested page-table walk (whether or not the actual hardware supports a page-table walker).
As-is though, with a software-managed TLB, the virtual memory code needs to keep track of a model of the TLB, for the sake of knowing whether a given virtual memory page is currently loaded into the TLB.
In SH4, it would have been possible to access the TLB via MMIO to check for this, but in BJX2 currently there is no way to do this.
However, code can query the size and associativity of the TLB via CPUID, and then use this to model the TLB in software (though this disallows some optimizations to the HW TLB, like moving recently accessed pages to the front, as these would not be reflected in the software's model and would lead to a discrepancy).
In theory though, software could mimic this optimization if it has some way to know (or accurately guess) the relative access probabilities of various memory pages. One option here is to model an additional set of TLB ways (so the model essentially reflects an 8-way TLB); then, when a page is absent from the 4-way TLB but still present in the 8-way model, it can be marked as conflict-missed.
Then, if a page with a high conflict-miss rate gets knocked out of the TLB, the TLB-miss handler can automatically reload it into the TLB if the oldest/last entry has a relatively lower conflict-miss probability.
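The extra-ways idea can be sketched like so: model 8 ways per set in software while the hardware only has 4, and count a hit in ways 4..7 as a conflict miss. Set count and naming are assumptions here, matched to the 256-set model below:

```c
#include <stdint.h>

enum { NSETS = 256, HW_WAYS = 4, MODEL_WAYS = 8 };

static uint64_t model_vpn[NSETS][MODEL_WAYS]; /* 0 = empty slot */
static uint32_t conflict_miss[NSETS];         /* per-set conflict count */

/* Look up vpn in the 8-way software model. A hit in ways 4..7 means
   the real 4-way TLB would have missed purely due to way conflict,
   so it is counted as a conflict miss. Returns nonzero on model hit. */
static int model_lookup(uint64_t vpn)
{
    int set = (int)(vpn % NSETS), way;
    for (way = 0; way < MODEL_WAYS; way++) {
        if (model_vpn[set][way] == vpn) {
            if (way >= HW_WAYS)
                conflict_miss[set]++;
            return 1;
        }
    }
    return 0;
}

/* Insert at way 0, pushing the other ways back; the oldest falls off. */
static void model_insert(uint64_t vpn)
{
    int set = (int)(vpn % NSETS), way;
    for (way = MODEL_WAYS - 1; way > 0; way--)
        model_vpn[set][way] = model_vpn[set][way - 1];
    model_vpn[set][0] = vpn;
}
```

The conflict counts then feed the reload heuristic described above: a page with a high conflict-miss rate is a good candidate to pull back into the TLB proactively.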
Granted, I do have mixed feelings about the whole thing of needing to keep track of a running model of the hardware TLB in software. But, it seemed needed to be able to avoid some issues in the virtual memory code.
A faked guest TLB shouldn't be too much different though, it just needs to obey the same logical model as the hardware TLB. But, it could be allowed to be bigger, say 1024x 4-way, or maybe even 4096x 8-way (it will just need to report this fake size when CPUID is asked).
Logical model isn't that complicated at least, conceptually (16K page mode):
void ModelLdTlb(u64 ptel, u64 pteh)
{
    u64 *tlbl, *tlbh, *tpl, *tph;
    u64 vpn, i0l, i0h, i1l, i1h, i2l, i2h;
    int vix, vi4;

    tlbl = model_tlb_lo;
    tlbh = model_tlb_hi;
    vpn = pteh >> 14;   //get virtual page number (16K pages)
    vix = vpn & 255;    //TLB size (modulo indexed based on virtual address)
    vi4 = vix << 2;     //4-way
    tpl = tlbl + vi4;  tph = tlbh + vi4;

    //push ways back by one; the oldest (way 3) falls off the end
    i0l = tpl[0];  i0h = tph[0];
    i1l = tpl[1];  i1h = tph[1];
    i2l = tpl[2];  i2h = tph[2];
    tpl[0] = ptel;  tph[0] = pteh;
    tpl[1] = i0l;   tph[1] = i0h;
    tpl[2] = i1l;   tph[2] = i1h;
    tpl[3] = i2l;   tph[3] = i2h;
}
Actual logic is more complicated, but this should give the general idea.
Basically, loading a new entry pushes everything back by one position, and the oldest entry falls off the end.
In the "actual" model, it also keeps track of a reverse-lookup table, so for physical page numbers it is possible to check if they are represented by an entry that is still live in the TLB (and, if the page is being unloaded, this allows the page to be manually evicted from the TLB).
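The reverse-lookup side can be sketched similarly: a map from physical page number to TLB slot, so that unloading a physical page can find and evict its live entry directly. The layout here is my assumption, not the actual implementation:

```c
#include <stdint.h>

enum { NSLOTS = 1024, NPHYS = 4096 };

static uint64_t tlb_ppn[NSLOTS]; /* physical page held by each TLB slot, 0=none */
static int16_t  rev[NPHYS];      /* ppn -> slot index, -1 if not live in TLB */

static void rev_init(void)
{
    int i;
    for (i = 0; i < NPHYS; i++) rev[i] = -1;
}

/* Record that 'slot' now maps physical page 'ppn', unlinking whatever
   physical page the slot previously held. */
static void rev_load(int slot, uint64_t ppn)
{
    if (tlb_ppn[slot] != 0) rev[tlb_ppn[slot]] = -1;
    tlb_ppn[slot] = ppn;
    rev[ppn] = (int16_t)slot;
}

/* When a physical page is unloaded, evict its live TLB entry, if any.
   Returns 1 if an entry was evicted (caller would then also need to
   invalidate the hardware TLB entry). */
static int rev_evict(uint64_t ppn)
{
    int slot = rev[ppn];
    if (slot < 0) return 0;
    tlb_ppn[slot] = 0;
    rev[ppn] = -1;
    return 1;
}
```

This is what allows the "is this physical page still live in the TLB?" check to be O(1) rather than a scan of the whole model.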
As can be noted, the 96-bit address mode works in a similar way, just double-pumping the TLB loads (and effectively halving the TLB associativity).
But, yeah, there are various options here.
Letting the guest drive the actual hardware TLB, was not one of them...
>
J) Simple applications can run with as little as 1 page of Memory
..Mapping overhead. An application like 'cat' can be run with
..a total allocated page count of 6: {MMU, Code, Data, BSS, Stack,
..and Register Files}
>
Hmm.
>
>
I guess one could make a case for a position-independent version of an
"a.out" like format, focused on low-footprint binaries.
For the record, My 66000 code is PIC, including GOT, method calls, and
switch tables.
Yeah.
BJX2 code is also mostly position independent.
Traditional a.out, however, was not designed to be position independent...
Though, realistically, the main thing one needs is a mechanism to fixup/initialize any global variables that hold pointers.
Most likely mechanism would be a set of base-relocs, probably using 16-bits per reloc.
But, still needing base relocs would be another point in favor of just using a PE variant.
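Applying base relocs might look roughly like this. The format here is hypothetical: a flat list of 16-bit offsets from the image base, each naming a 64-bit pointer to adjust (a real format, like PE's, groups page-relative 12-bit offsets into per-page blocks so 16 bits per reloc still covers large images):

```c
#include <stdint.h>
#include <string.h>

/* Apply base relocs: each entry is a 16-bit offset (from the image
   base) of a 64-bit pointer field, which is adjusted by 'delta', the
   difference between the actual load address and the link-time base.
   This flat format only covers images under 64K; see the lead-in. */
static void apply_base_relocs(uint8_t *image, const uint16_t *relocs,
                              int nrelocs, int64_t delta)
{
    int i;
    uint64_t v;
    for (i = 0; i < nrelocs; i++) {
        memcpy(&v, image + relocs[i], 8);  /* may be unaligned */
        v += (uint64_t)delta;
        memcpy(image + relocs[i], &v, 8);
    }
}
```

This is the whole fixup mechanism a position-independent loader needs when code is already PC-relative: only global variables holding pointers get touched.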