On 2/19/2025 11:38 PM, BGB wrote:
On 2/19/2025 9:02 PM, MitchAlsup1 wrote:
On Wed, 19 Feb 2025 22:42:04 +0000, BGB wrote:
>
...
Maybe I should just go add the test case to the Boot ROM and fire up a 5th simulation. I will at least not have to wait until tomorrow to see the results on this one (eg, whether the debug-prints will happen to reveal enough clues to locate the decoding bug...).
...
Which promptly made quick work of finding that bug...
Or, such is the merit of not having to wait around a day for each debug cycle...
I guess, now I can restart the Doom run, and maybe know in a day or so if anything else is still broken in XG3 Mode...
Turns out the outer decoder was failing to enable some XG3 related logic for XG3 mode, and so was failing to correctly decode Jumbo prefixes that didn't match the encoding pattern for XG1; causing the instructions in question to be decoded without the jumbo prefix being applied...
Where, the main decoder sees XG3 as it looks after all the bits are shuffled back into XG2's bit ordering and similar (with tag bits for each instruction word to indicate whether they are a repacked XG3 instruction or a RISC-V instruction; since after repacking, there is no longer an unambiguous way to tell them apart).
The inner decoder's logic was effectively: if XG2 or XG3, treat both as XG2.
The outer decoder was fussier, only treating XG2 Mode as XG2 (and thus missing jumbo prefixes that didn't start with 0xFE ...).
Can't use this strategy for the RV+Jx debugging at the moment, as I don't yet know of an offending code sequence (in the case of the bug that was in XG3, I had identified the code sequence, the problem was that Doom had to get almost entirely started up before the code was triggered).
Or, would be nice if this stuff weren't running around 500x slower than real-time.
Still faster than the full-hardware simulation though, which is around 2000x slower than realtime.
Does make me half wonder if there could be any way to simulate Verilog faster. Say, hypothetically, if someone could have something partway between a general purpose CPU and an FPGA. Could then maybe be, hopefully, "less dead slow", but still have the ability to give debug messages via "$display()" statements.
Well, alternatively, a PC with a whole lot of cores, and the ability to spread the Verilog simulation across a large number of threads. Or, maybe run it on the GPU or something (though, it doesn't seem like a great fit).
My thinking here would be that one could have big buffers representing all of the FFs and similar, and a number of threads each responsible for some number of LEs. Each clock, the simulation advances to the next buffer in the sequence; each thread sees this, updates all the LEs that it manages, and flags that it has completed. Once all of the threads finish, the main thread runs any test-bench logic and advances to the next buffer, and so on. Each clock domain could then be given its own group of threads.
Alternatively, could have a work queue, but work queues have poor scalability regarding the number of worker threads. Though, potentially, an intermediate option could be that each thread has its own queue, and items from the main queue are initially distributed among the worker threads (at which point, the main queue no longer matters, apart from possible load balancing; say, if one worker is lagging behind, the main thread could asynchronously signal that it should give up items, which could then be handed to a faster thread).
But, maybe moot, would need something with a lot of cores to make this worthwhile, vs just using an 8/16-core machine and running multiple instances.
Granted, Verilator supports multi-threaded simulation, but I am not sure of the specifics of how it works (just that seemingly I don't have enough computer to both run N instances and use multithreaded simulation without unreasonably bogging down my PC; with 4-6 instances being a reasonable limit).
...
Also earlier, once again noted the "power" of LZ compressing the binaries:
Had started the loading for two different versions of Doom at the same time (in the two instances). Noted that the RV+Jx build was notably behind the XG3 instance; then remembered that this is because I had switched the RV+Jx build back to uncompressed PE/COFF.
This is with zero added latency for the SPI interface (which in this case can move 64 bits at a time), which for the partial simulation is operating at the equivalent of around 5MB/sec (or, the bottleneck is more how quickly it can send MMIO requests over the bus).
...