Subject : Re: Tonights Tradeoff - Background Execution Buffers
From : cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups : comp.arch
Date : 04 Oct 2024, 18:28:49
Organization : A noiseless patient Spider
Message-ID : <vdp8kk$a94i$1@dont-email.me>
User-Agent : Mozilla Thunderbird
On 10/3/2024 11:04 PM, Robert Finch wrote:
> Today I am wondering how many predicate registers are enough. Scanning web pages reveals a variety. The Itanium has 64 predicate registers, but they are rotated and used for modulo-scheduled loops. Rotating registers are Itanium's method of register renaming, so it needs more visible registers. In a classic superscalar design with a RAT where registers are renamed, it seems like 64 would be far too many. Cray had eight vector mask registers. I think the RISC-V Hwacha has 16, if I read the diagram correctly.
> I cannot see the compiler making use of very many predicate registers simultaneously. Since they are not used simultaneously, and register renaming is in effect, there should not be a great need for predicate registers.
> Suppose one wants predicated logic in a loop, with the predicates being set outside of the loop. It may be desirable to have several blocks of logic guarded by different predicates within the loop. It is likely desirable to have more than one predicate then.
> -> Reserved four bits in the instruction for predicates. Do not want to waste bits, though. Using a 64-bit instruction.
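For illustration, the multi-predicate loop case described above might look like this in C (names and flag bits here are made up; with two predicate registers, p0 and p1 would each be set once ahead of the loop, and each guarded statement predicated on one of them):

    #include <stddef.h>

    /* Two loop-invariant conditions, each guarding its own block.
       With two predicate registers, p0/p1 are computed once outside
       the loop; each guarded statement then predicates on one. */
    void filter(float *dst, const float *src, size_t n,
                int mode, float scale, float offset)
    {
        size_t i;
        int p0 = (mode & 1) != 0;   /* hypothetical "apply scale" flag  */
        int p1 = (mode & 2) != 0;   /* hypothetical "apply offset" flag */
        for (i = 0; i < n; i++) {
            float v = src[i];
            if (p0) v = v * scale;   /* block predicated on p0 */
            if (p1) v = v + offset;  /* block predicated on p1 */
            dst[i] = v;
        }
    }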
I was getting along OK with a single predicate bit flag.
Had considered supporting an alternate predicate bit, but it didn't seem to gain enough to be worthwhile. Similar for possibly supporting 8 predicate registers with dedicated logic ops.
I had originally designed a predicate bit-stack where operations would push/pop the bits in a way similar to the x87 register stack, but ended up not using it. A later idea would have used logic ops with effectively 3-bit register fields.
But, ironically, I mostly ended up using GPRs for the cases where conditional logic ops were needed, as both approaches end up needing roughly the same number of instructions (and there were already ways of moving values between the T bit and GPRs).
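Say, a compare into a GPR followed by a mask/merge, which comes out to roughly the same op count as a T-bit compare plus two predicated MOVs (a rough sketch, not actual BGBCC output):

    #include <stdint.h>

    /* Branchless select via a GPR mask: compare result into a
       register, expand to all-ones/all-zeros, then AND/OR merge. */
    static inline int64_t select_lt(int64_t a, int64_t b,
                                    int64_t x, int64_t y)
    {
        int64_t m = -(int64_t)(a < b);   /* all-ones if a<b, else 0 */
        return (x & m) | (y & ~m);       /* (a<b) ? x : y           */
    }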
Meanwhile, have been internally debating whether to try to glue predication onto my tweaked extension of RISC-V (have not done so yet; still debating it). With my recent jumbo-prefix extension, had defined a few bits for this, but as-is it would mean that any predicated op effectively needs a 64-bit encoding. Granted, the branches one would predicate typically only cover a few instructions.
BGBCC had limited predicated if/else arms to a single statement, with a limit on the types and number of operators in the expression being predicated (generally, past a certain number of operators it becomes more efficient to branch rather than predicate). But, with the scope of predication being this limited, it also limits the need for more than a single predicate bit.
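So, roughly the sort of thing that gets predicated vs branched (illustrative only; the exact cutoffs here are not BGBCC's actual rules):

    /* Illustration: small vs large if bodies. */
    void example(int x, int *py, int *pz, int scale)
    {
        int y = *py, z = *pz;

        /* Single statement, simple condition: candidate for
           predication (compare into T, then one predicated op). */
        if (x > 0)
            y += x;

        /* Larger body / more operators: emitted as a normal
           branch, since past some size branching wins. */
        if (x > 0) {
            y = (y + x) * scale;
            z = z - x;
        }

        *py = y; *pz = z;
    }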
In things like GL software rasterization, predication is useful for things like Z and alpha testing, which otherwise tend to eat a lot of cycles on poorly predicted branches.
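E.g., a span loop along these lines (simplified, not the actual rasterizer code), where the test-and-store body is a natural target for predication:

    #include <stdint.h>

    /* Per-pixel Z and alpha tests; as branches these are data-
       dependent and mispredict badly, so predicating the body
       (or masking the stores) tends to win. */
    void draw_span(uint32_t *fb, uint16_t *zbuf,
                   const uint32_t *col, const uint16_t *z,
                   int n, uint32_t aref)
    {
        int i;
        for (i = 0; i < n; i++) {
            if ((z[i] < zbuf[i]) && ((col[i] >> 24) >= aref)) {
                fb[i]   = col[i];   /* predicated/masked stores */
                zbuf[i] = z[i];
            }
        }
    }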
Though, ironically, had recently disabled branch hit/miss modeling in my emulator, mostly to make it easier for the emulator to keep up with real time. Things like modeling branch hit/miss and hit/miss in the cache hierarchy are not ideal for emulator performance (could try to model/detect stale memory accesses from the weak memory model, but this would make things slower than they already are; and by this point I might as well implement a full mockup of the memory subsystem in the emulator and perform memory accesses by shuffling cache lines around, ...).
As-is, the emulator spends more time in the memory-subsystem modeling and similar than it spends actually running instructions (but, then again, this part is in turn limited by counting clock cycles and throttling things to keep it from running faster than it would in the FPGA version, unless options are given to disable this).
If left to run at full speed with the cache modeling disabled, the interpreter can run at around 200-250 MIPS or so (getting much faster than this would require a JIT, but that part has atrophied and doesn't currently work; I don't really need it when trying to model a CPU that runs at 50 MHz...).
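The throttling itself is nothing fancy; something along these lines (a sketch, assuming the 50 MHz target works out to 20 ns per emulated cycle; not the emulator's actual code):

    #include <stdint.h>
    #include <time.h>

    /* Keep emulated cycles from getting ahead of wall-clock time
       for a 50 MHz target (20 ns per emulated cycle). */
    void throttle(uint64_t total_cycles, const struct timespec *start)
    {
        struct timespec now, ts;
        uint64_t elapsed_ns, target_ns;

        clock_gettime(CLOCK_MONOTONIC, &now);
        elapsed_ns = (uint64_t)(now.tv_sec - start->tv_sec) * 1000000000u
                   + (uint64_t)(now.tv_nsec - start->tv_nsec);
        target_ns = total_cycles * 20u;    /* 50 MHz -> 20 ns/cycle */

        if (target_ns > elapsed_ns) {      /* running too fast: sleep */
            uint64_t d = target_ns - elapsed_ns;
            ts.tv_sec  = (time_t)(d / 1000000000u);
            ts.tv_nsec = (long)(d % 1000000000u);
            nanosleep(&ts, NULL);
        }
    }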
Had noted before, when trying to run it on a RasPi, that my interpreter does still seem to be somewhat faster than the one in DOSBox (which is too slow in this case to really run Doom effectively; but x86 is difficult here due to things like nearly every instruction potentially touching EFLAGS, etc.).
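The usual interpreter trick for the EFLAGS problem is "lazy flags" (illustrative sketch; not claiming this is what DOSBox actually does): record the last ALU op's operands and result, and only compute a flag when something actually reads it.

    #include <stdint.h>

    /* Lazy EFLAGS: flags are derived on demand from the last ALU
       op, rather than updated eagerly on every instruction.  A
       real version would also record which operation it was. */
    static uint32_t lf_op1, lf_op2, lf_res;

    static inline uint32_t emu_add(uint32_t a, uint32_t b)
    {
        lf_op1 = a; lf_op2 = b;
        lf_res = a + b;               /* flags NOT computed here */
        return lf_res;
    }

    static inline int flag_zf(void) { return lf_res == 0; }
    static inline int flag_cf(void) { return lf_res < lf_op1; } /* ADD */
    static inline int flag_sf(void) { return (lf_res >> 31) & 1; }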
Faster emulation is also possible if one can leverage the underlying hardware's address translation, but this depends on a lot of OS-specific stuff (the OS isn't going to just give an application access to the underlying page tables or similar, ...). But, a few emulators do go this route.
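In the simpler form of this, one reserves guest RAM as a single big host mapping, letting the host MMU do the per-page work implicitly and reducing guest->host translation to base+offset (a sketch assuming POSIX mmap; the guest-side details are made up):

    #include <stdint.h>
    #include <sys/mman.h>

    #define GUEST_RAM_SIZE (64u << 20)   /* 64MB, arbitrary */

    static uint8_t *guest_base;

    /* One flat host mapping backs all of guest RAM; the host page
       tables then handle the per-page bookkeeping implicitly. */
    int guest_ram_init(void)
    {
        void *p = mmap(NULL, GUEST_RAM_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return -1;
        guest_base = (uint8_t *)p;
        return 0;
    }

    static inline uint8_t *guest_ptr(uint32_t gaddr)
    {
        return guest_base + gaddr;   /* no software page walk */
    }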
If doing an OS, one option here would be an API for "virtual nested page tables", where the application could request that part of its virtual address range be mapped through a logical page table controlled by the application, with logical page faults delivered back to the program via the "signal()" mechanism or similar (this would likely be handled independently of however the OS/target implements virtual-memory handling at the hardware level). There may need to be a way to signal the OS whenever the user-managed page table has been updated, though (well, and/or require a syscall for each PTE update, but this may be slower than simply allowing user code to update the table and then using a syscall to say "hey, the page table has been updated", letting the OS figure out what needs to be invalidated/updated/etc.).
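As a rough sketch of what such an API could look like (every name here is hypothetical; no OS currently provides these calls):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical user-managed nested-paging API (prototypes only). */

    /* Route [va_base, va_base+size) through a page table that the
       application itself maintains in 'table'. */
    int vnpt_attach(void *va_base, size_t size, uint64_t *table);

    /* Tell the OS the user-managed table was edited; the OS works
       out what to invalidate (the cheap alternative to one syscall
       per PTE update). */
    int vnpt_flush(void *va_base, size_t size);

    /* Logical page faults come back via the signal() mechanism,
       e.g. a handler installed with signal(SIGSEGV, ...): */
    void vnpt_fault_handler(int sig);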
...