On 2/2/2025 8:24 PM, EricP wrote:
EricP wrote:
BGB wrote:
On 2/2/2025 12:10 PM, Thomas Koenig wrote:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>
The OS must also be able to keep both pages in physical memory until
the access is complete, or there will be no progress. Should not be a
problem these days, but the 48 pages or so potentially needed by VAX
complicated the OS.
>
48 pages? What instruction would need that?
>
Hmm...
>
>
I ended up with a 4-way set-associative TLB, as this turned out to be needed to keep the CPU from getting stuck in a TLB-miss loop in the worst-case scenario:
An instruction fetch whose line-pair crosses a page boundary (and misses the L1 I$), for an instruction accessing a memory address whose line-pair also crosses a page boundary (and misses the L1 D$). That is up to 4 pages needing translation at the same time.
>
One can almost get away with 2-way, except that the CPU would almost inevitably hit and get stuck in an infinite TLB-miss loop (despite the seeming rarity, this happened roughly once every few seconds).
>
....
>
>
That is because you have a software managed TLB so all PTE's
referenced by an instruction must be resident in TLB for success.
If three PTE are required by an instruction and they map to
the same 2-way row and conflict evict then bzzzzt livelock loop.
>
So you need at least as many set assoc TLB ways as the worst case VA's
referenced by any instruction.
And this just accounts for the instruction that TLB-miss'ed.
If the TLB-miss handler code or data itself can possibly conflict
on the same TLB row then you have to add 2, 3 or 4 more ways for it.
The interrupt handlers always run with the MMU disabled.
As a result, the interrupt handlers themselves can never TLB miss.
But, any memory accesses into the virtual address space need to be emulated in software (via page-walks and a soft-TLB).
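The general idea of that handler-side emulation is roughly as follows (a minimal C sketch under the assumption of a conventional 3-level table with 16K pages and 8-byte PTEs; names like walk_page_table and softmmu_translate are made up here, this is not the actual TestKern code):

  #include <stdint.h>

  #define PAGE_SHIFT  14                       /* 16K pages */
  #define PAGE_MASK   ((1ULL << PAGE_SHIFT) - 1)
  #define PTE_VALID   1ULL
  #define IDX_BITS    11                       /* 2048 8-byte PTEs per 16K page */
  #define IDX_MASK    ((1ULL << IDX_BITS) - 1)

  extern uint64_t page_dir_base;               /* physical base of top-level table */

  /* Walk a conventional 3-level table using physical (MMU-off) loads.
     Returns 0 if the address is unmapped (caller raises the page fault). */
  static uint64_t walk_page_table(uint64_t va)
  {
      uint64_t pt = page_dir_base;
      for (int level = 2; level >= 0; level--) {
          uint64_t idx = (va >> (PAGE_SHIFT + level * IDX_BITS)) & IDX_MASK;
          uint64_t pte = ((volatile uint64_t *)(uintptr_t)pt)[idx];
          if (!(pte & PTE_VALID))
              return 0;
          pt = pte & ~PAGE_MASK;               /* next table, or final page base */
      }
      return pt;
  }

  /* Soft-TLB: cache recent translations so handler accesses into the
     virtual address space don't redo the whole walk every time. */
  #define SOFTTLB_SZ 64
  static struct { uint64_t vpn, pbase; } softtlb[SOFTTLB_SZ];

  uint64_t softmmu_translate(uint64_t va)
  {
      uint64_t vpn = va >> PAGE_SHIFT;
      int i = (int)(vpn % SOFTTLB_SZ);
      if (softtlb[i].pbase == 0 || softtlb[i].vpn != vpn) {
          softtlb[i].vpn   = vpn;
          softtlb[i].pbase = walk_page_table(va);
      }
      if (softtlb[i].pbase == 0)
          return 0;   /* unmapped: the real handler would raise a page fault */
      return softtlb[i].pbase | (va & PAGE_MASK);
  }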
If the interrupt handlers ran with the MMU enabled, the CPU would also need to be able to deal with recursive interrupts. At present this is not supported, and the design of the interrupt mechanism doesn't currently allow for it (other interrupts are effectively blocked until the handler finishes, and a "General Fault" occurring within an interrupt handler stalls the CPU core until an external RESET signal is asserted).
In most cases, the interrupt handlers are short lived, with more general long-lived operations (such as syscall handling) being performed via a context switch.
Currently, page-fault handling does occur within the TLB-miss interrupt; I had gone back and forth on whether to instead handle page faults similar to a system call, initiating a context switch to a dedicated page-fault handler task.
Isn't great, but basically works.
Also assumes FIFO or LRU reuse of ways in a row. If victim way is
random selected then you need extra ways to add some spare pad and
the odds in succeeding become statistical.
I ended up with a relatively naive TLB indexing scheme (sketched below):
Normal access is simply Mod-N;
May be XOR'ed with bits from the ASID for part of the VAS range.
This is because, yeah, hashing the address may lead to edge-case scenarios that exceed the capabilities of a 4-way TLB (would need 8-way to fully deal with this).
With Mod-N and Mod-N XOR ASID, it can be statically known that no two adjacent pages will map to the same index in the TLB.
Where, for Addr(47:32):
* 0001..3FFF: Mod-N (Global VAS)
* 4000..7FFF: Mod-N or Mod-N ^ ASID (Local, *1);
* 8000..BFFF: Mod-N (Kernel Space)
* C000..CFFF: Physical (NOMMU)
* D000..DFFF: Physical (NOMMU+NoCache)
* E000..EFFF: Reserved, probably PCIe stuff.
* F000..FFFF: MMIO Space
There are no MMU+NoCache ranges, though this may be specified via the PTEs.
*1: Not currently used by TestKern, but the thinking is that process-local memory could be allocated in this address range.
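For reference, the index selection works out to roughly the following (a minimal C sketch; the 256-set / 16K-page figures come from the numbers mentioned below, and the exact way the ASID bits are mixed in is illustrative rather than the real bit layout):

  #include <stdint.h>

  #define TLB_SETS      256                   /* 256x 4-way */
  #define TLB_IDX_MASK  (TLB_SETS - 1)
  #define PAGE_SHIFT    14                    /* 16K pages  */

  uint32_t tlb_index(uint64_t va, uint32_t asid)
  {
      /* Mod-N: adjacent pages always land in different sets. */
      uint32_t idx  = (uint32_t)(va >> PAGE_SHIFT) & TLB_IDX_MASK;
      uint32_t top4 = (uint32_t)(va >> 44) & 0xF;   /* Addr(47:44) */

      if (top4 >= 0x4 && top4 <= 0x7)               /* 4000..7FFF: local VAS */
          idx ^= (asid & TLB_IDX_MASK);             /* Mod-N ^ ASID          */

      /* The C000..FFFF ranges are physically addressed / MMIO and are
         not handled here. */
      return idx;
  }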
Possibly, the global parts of the page table could be shared across every process, whereas the top-level of the page-table and local areas would be local to each process.
Not sure how this would be mapped to the B-Tree page-tables, but for 48-bit addressing, conventional page-tables have a lower overhead. Conventional page tables don't scale well to a 96-bit sparse VAS, but the use of a 96-bit address space likely isn't worth the hassle at this point in time.
Decided to leave out the specifics of the 96-bit VAS wonk; for now I am not bothering with it, as it is too far into overkill relative to what I am doing here.
Well, and I had managed to get RV ELF binaries working in TestKern's existing 48-bit VAS by coercing GCC into building sort-of makeshift "static PIE" binaries.
Nevermind that getting stuff working with "actual glibc" is a harder problem (eg, it might be nicer if I could just pretend all this stuff was an RV64G Linux build, but that's not really gonna work if "ld-linux.so" just instantly explodes).
The other "sorta almost works" strategy being to have BGBCC fake GCC's interface enough that one can coerce GNU autoconf into using it as a cross compiler (had worked at least for some fairly trivial programs).
Doesn't get that far in a general sense though, as BGBCC doesn't really support C++, and even C code almost invariably contains "blatant GCCisms"...
With a HW table walker you can just let it evict and reload.
I have on-and-off considered a HW page walker a few times, but it is mostly inertia and cost concerns at this point.
The average time spent in the TLB Miss handler is low enough that it isn't too much of an issue.
Though, the 256x 4-way TLB (1024 total TLBEs) is apparently "abnormally large" for this class of processor.
But, this was because:
* 256x 4-way with 16K pages: TLB miss rate tends to be pretty low.
* 64x 4-way with 16K or 4K pages: TLB miss rate is drastically higher.
Main factor being that the TLB needs to be big enough to cover the main part of the working set to keep the rate low, and most of my test programs tend to have less than 16MB in the core working set.
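As a rough sanity check on the coverage (just arithmetic on the numbers above):
  256 sets x 4 ways = 1024 TLBEs; 1024 x 16K = 16MB mapped at once.
   64 sets x 4 ways =  256 TLBEs;  256 x 16K =  4MB, or 256 x 4K = 1MB.
So the larger configuration can cover a sub-16MB core working set outright, while the smaller ones can't, hence the much higher miss rates.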
But, can also note that currently 32MB is allocated towards virtual memory pages, so exceeding 32MB of working set would also lead to a sharp increase in page faults (with most of the kernel operating in physically mapped pages).
Generally, executable code uses direct-mapping (a part of the virtual address space is used, but it is directly assigned to pages without being backed by the page file).
But, this was mostly because (for semi-unknown reasons) trying to put ".text" sections into pagefile backed memory is prone to cause stuff to explode (*1).
*1: This behavior occurs in both the Verilog implementation and the emulator, and doesn't seem to care which ISA is used. It is most likely a software issue, as there is little reason the CPU (or emulator) should actually care about this (and the I$ is virtually tagged, ...). I remember testing in the past that it didn't matter whether the region was above or below the 4GB mark, ...
Generally, the ".data"/".bss" sections, stack, and heap, can be successfully put into pagefile backed memory. Which works, as these are the main "memory eating" parts.
But, I would still need to mostly eliminate the kernel's reliance on physically mapped and direct-mapped memory if I wanted to expand the pagefile-backed region to use a larger part of RAM.
Note that at present, TestKern low-4GB looks like:
00000000..0000BFFF: Boot ROM
0000C000..0000DFFF: Boot SRAM (Post Boot: interrupt stack)
00010000..000FFFFF: Specialized ROM areas
Mostly 64K ROM pages filled with all-zeroes or all NOPs or similar;
Used mostly for dummy pages for virtual memory handling.
01000000..3FFFFFFF: Reserved for RAM (typically wraps)
40000000..7FFFFFFF: Direct-Mapped virtual addresses.
80000000..FFFFFFFF: Mostly reserved.
Memory below the 512MB mark is generally set up as identity-mapped and supervisor only.
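In header form, this layout corresponds to roughly the following (a sketch; the identifiers here are made up for illustration, not actual TestKern names):

  #include <stdint.h>

  #define BOOT_ROM_BASE   0x00000000u  /* ..0000BFFF: Boot ROM                  */
  #define BOOT_SRAM_BASE  0x0000C000u  /* ..0000DFFF: post-boot interrupt stack */
  #define SPECIAL_ROM     0x00010000u  /* ..000FFFFF: dummy zero/NOP ROM pages  */
  #define RAM_BASE        0x01000000u  /* ..3FFFFFFF: RAM (typically wraps)     */
  #define DMAP_BASE       0x40000000u  /* ..7FFFFFFF: direct-mapped virtual     */
  #define RESV_BASE       0x80000000u  /* ..FFFFFFFF: mostly reserved           */

  /* Below the 512MB mark: identity-mapped, supervisor-only. */
  static inline int is_identity_mapped(uint64_t addr)
      { return addr < (512ull << 20); }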
Typically, virtual address ranges are allocated using an RNG (roughly as sketched below):
Generate a random virtual address;
Check if this space is available (by scanning the page table);
If not, try again.
...
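In rough pseudo-C, the allocation loop is basically (a sketch; rand64() and range_is_free() are hypothetical stand-ins, and the VAS window used here is just illustrative):

  #include <stddef.h>
  #include <stdint.h>

  #define PAGE_SHIFT 14

  uint64_t rand64(void);                           /* PRNG                  */
  int range_is_free(uint64_t va, size_t npages);   /* scans the page table  */

  uint64_t alloc_va_range(size_t npages)
  {
      const uint64_t lo = 0x000100000000ULL;       /* illustrative window:  */
      const uint64_t hi = 0x400000000000ULL;       /* the global VAS range  */
      for (;;) {
          uint64_t va = lo + rand64() % (hi - lo); /* random address        */
          va &= ~((1ULL << PAGE_SHIFT) - 1);       /* page-align it         */
          if (range_is_free(va, npages))           /* scan the page table   */
              return va;
          /* Collides with an existing mapping: just try again. */
      }
  }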