On 3/11/2025 7:33 PM, MitchAlsup1 wrote:
On Tue, 11 Mar 2025 21:20:22 +0000, BGB wrote:
On 3/11/2025 12:57 PM, MitchAlsup1 wrote:
--------------
My whole space is mapped by BAR registers as if they were on PCIe.
>
>
Not a thing yet.
>
But, PCIe may need to exist for Linux or similar.
>
But, it may still be an issue, as Linux can only use known hardware IDs, and it is a question which IDs it would know about (and whether any happen to map closely enough to my existing interfaces).
>
Otherwise, it would be necessary to write custom HW drivers, which would add a lot more pain to all of this.
There is already a driver in BOOT that reads the config headers for manufacturer and model, and uses those to look up an actual driver for that device.
I simply plan on having My 66000 BOOT up code indexed by Mfg:Dev.
OK.
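As an aside, the Mfg:Dev keyed lookup is basically the same pattern PCI drivers use. A minimal generic sketch (not My 66000 code; all names here are made up):

#include <stdint.h>
#include <stddef.h>

struct boot_driver {
    uint16_t vendor;                    /* manufacturer ID from the config header */
    uint16_t device;                    /* device/model ID */
    int    (*init)(uint64_t bar_base);  /* driver entry point */
};

/* table built into the BOOT image (hypothetical) */
extern const struct boot_driver boot_drivers[];
extern const size_t boot_driver_count;

const struct boot_driver *find_driver(uint16_t vendor, uint16_t device)
{
    for (size_t i = 0; i < boot_driver_count; i++)
        if (boot_drivers[i].vendor == vendor &&
            boot_drivers[i].device == device)
            return &boot_drivers[i];
    return NULL;                        /* unknown Mfg:Dev, no driver */
}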
--------------
Some read-only CSRs were mapped over to CPUID.
>
I don't even have a CPUID--if you want this you go to config space
and read the configuration lists and extended configuration lists.
>
>
Errm, so vendor/hardware IDs for each feature flag...
No, a manufacturer:device for every CPU-type on the die. Then all of
the core identification is found on the [extended] configuration
lists.
core kind
fetch width
decode width
execute width
retire width
cache sizes
TLB sizes
predictor stuff
..
In practice, I expect some later phase in BOOT will read all this out
and package it for user consumption (and likely another copy for supervisor consumption). Then it is accessed as fast as any cached
chunk of memory.
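Hypothetically, such a packaged per-core block might look something like the following (the field list just mirrors the items named above; the names and widths are invented):

#include <stdint.h>

struct core_info {
    uint32_t core_kind;
    uint8_t  fetch_width;
    uint8_t  decode_width;
    uint8_t  execute_width;
    uint8_t  retire_width;
    uint32_t l1i_kb, l1d_kb, l2_kb;       /* cache sizes */
    uint16_t itlb_entries, dtlb_entries;  /* TLB sizes */
    uint32_t predictor_flags;             /* predictor stuff */
};

/* BOOT fills in one entry per core and leaves the array somewhere
   cacheable for user (and supervisor) code to read directly. */
extern const struct core_info core_info_table[];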
As-is, a lot of this information is not available...
A lot of the feature flags are more things like whether or not the CPU supports Load/Store Pair or SIMD or similar.
CPUID indices 30 and 31 give the microsecond timer and HW-RNG, which are more relevant to user-land.
The timer running in virtual time or the one running in physical time ??
In the emulator, it is internally based on the virtual clock-cycle count (so, from its point of view, it seems accurate).
The emulator speed is then kept at roughly real-time by selectively running the VM depending on whether the number of emulated clock cycles has fallen behind the number of cycles that would have elapsed, given the external wall-clock time and the specified clock speed.
Granted, this part isn't super exact regarding how closely things remain in sync (but, on average, it works).
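Roughly, the throttling logic looks something like this (a simplified sketch, not the actual emulator code; names made up):

#include <stdint.h>

#define EMU_CLOCK_HZ 50000000ull       /* assumed emulated clock speed */

extern uint64_t wallclock_us(void);    /* host wall-clock time, microseconds */
extern void     vm_step(void);         /* run the VM for a small slice */
extern uint64_t vm_cycles;             /* emulated clock-cycle count */

void run_throttled(void)
{
    uint64_t start_us = wallclock_us();
    for (;;) {
        uint64_t elapsed_us = wallclock_us() - start_us;
        /* how many cycles "should" have happened by now (overflow ignored) */
        uint64_t target_cycles = (elapsed_us * EMU_CLOCK_HZ) / 1000000ull;
        if (vm_cycles < target_cycles)
            vm_step();                 /* behind real time: run the VM */
        /* else: ahead of real time, idle / poll host events instead */
    }
}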
There is also a separate "wallspeed" mode which basically runs the emulator as fast as it will go and uses the external wall-clock time instead (though with some internal trickery to estimate the microsecond values, with the constraint that a later reading can never be less than an earlier one).
Partly this is because "actual wall-clock time" isn't super fine-grained (roughly millisecond territory on Windows), and the time within the emulator isn't too closely tied to real time (it roughly extrapolates and guesses based on how quickly the emulator is running in this mode).
This mode also disables some of the modeling usually used to try to keep things cycle-accurate (such as modeling L1 and L2 cache hits and misses).
For the Verilog partial simulation, it is based on clock cycles. So, it will drive the main clock at 50MHz, but also drive a 1MHz clock.
For the full simulation (and FPGA), there is a module that turns the input clock into various lower-speed clocks, such as 1MHz (for the microsecond timer) and 2Hz (for the cursor and blinking text effect).
Note that the main PLL turns the 100MHz master clock into 100, 50 and 75MHz and similar.
Most other lower-speed soft clocks are based on fractional accumulation timers.
Well, except for things like the SPI clock, which is based on decrement-division, which gives a limited range of values on the fast end:
50, 25, 16.7, 12.5, ...
But, faster clocks via an accumulation timer give very jittery output (so, they are better suited to lower-speed clocks, such as driving audio output or RS232).
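For reference, rough C models of the two divider styles (illustrative only; the real versions are Verilog):

#include <stdint.h>

/* Fractional accumulation: add 'inc' every input clock, pulse on carry.
   Average frequency is exact (f_out = f_in * inc / 2^32), but the edges
   jitter by up to one input-clock period -- fine for audio or RS232
   rates, bad for fast clocks. */
static uint32_t acc;
int acc_div_tick(uint32_t inc)
{
    uint32_t prev = acc;
    acc += inc;
    return acc < prev;                 /* wrapped => output pulse */
}

/* Decrement-division: reload a down-counter, toggle on zero. Output is
   clean, but only an integer division of the base clock is reachable,
   hence the 50, 25, 16.7, 12.5, ... sequence quoted above on the fast end. */
static uint32_t cnt;
static int      spi_clk;
void dec_div_tick(uint32_t n)          /* n = divide/reload value */
{
    if (cnt == 0) { cnt = n; spi_clk ^= 1; }
    else          { cnt--; }
}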
32..63: Currently unused.
>
>
There is also a cycle counter (along vaguely similar lines to x86
RDTSC), but for many uses a microsecond counter is more useful (where
the timer-tick count updates at 1.0 MHz, and all cores would have the
same epoch).
>
On x86, trying to use RDTSC as a timer is rather annoying, as it may jump around and may run at a different rate depending on the current clock speed.
By placing the timers in MMI/O memory address space*, accesses from
different cores necessarily get different values--so the RTC can be
used to distinguish "who got there first".
MMI/O space is sequentially consistent across all cores in the system.
When the timer is driven by a 1MHz clock with a shared epoch across all cores, MMIO is unnecessary.
Granted, if one wants to time the relative ordering of events between CPU cores, a 1MHz timer may be too coarse, even with the cores only running at 50MHz.
------------
This scheme will not roll over for around 585k years (for a 64-bit microsecond timer), so "good enough".
So at 1GHz the roll-over time is ~585 years. Looks good enough to me.
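For reference, the roll-over arithmetic:

#include <stdio.h>

int main(void)
{
    double two64 = 18446744073709551616.0;              /* 2^64 */
    double us_per_year = 1e6 * 86400.0 * 365.25;
    printf("%.0f years\n", two64 / us_per_year);         /* ~585k years at 1 MHz */
    printf("%.0f years\n", two64 / (us_per_year * 1e3)); /* ~585 years at 1 GHz  */
    return 0;
}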
The microsecond timer is ideally independent of clock-speed.
Conceptually, this time would be in UTC, likely with time-zones handled
by adding another bias value.
What is UTC time when standing on the north or south poles ??
UTC doesn't care what the local timezone is.
Ideally, it shouldn't care about leap seconds either, but seemingly the world has decided that it would rather keep UTC in sync with the Earth's rotation than have a completely monotonic timescale.
Granted, the scope of a purely monotonic time would be limited by local variations due to relativistic effects; but for the most part, everyone on Earth can agree roughly how much time has passed.
Might start to matter in space colonization, say, when people on the Moon or Mars have their clocks running slightly faster than those on Earth due to the difference in the local gravity well.
Or, say, if people were out in the asteroid belt, their clocks running slightly faster than those in the inner solar system due to the gravity-well of the sun.
I guess it is a question then whether Earth time remains the standard, or whether people accept that clock drift is inevitable. Well, and/or maybe Earth can broadcast a signal that can be used to keep the various clocks in sync (say, a 100MHz carrier which regularly transmits the current time in microseconds). Then, by picking up this transmission, it can be used to calibrate clocks to Earth time, and measuring the exact frequency of the signal can be used to adjust for local timing skew. Granted, it would also be affected by Doppler shifts, possibly leaving a question as to how much redshift or blueshift affects the question of "what time it is".
Say, for example, the signal mostly sits at the carrier frequency, and then every 10us encodes the current time by QAM-modulating the signal (possibly at 25Mbps or so, using an encoding scheme similar to RS232).
Well, and/or people don't bother, and just live with a clock drift of a few microseconds per month or so...
Then again, the level of harm done by relativistic effects would still be smaller than the harm regularly done by the addition of leap seconds.
And, possibly, the merit of accurate time goes away once one is outside the range of where something like NTP could work (like, if one has a 20 minute ping time, maybe it doesn't matter if the clocks are a few seconds out of sync?...).
This can in turn be used to derive the output from "clock()" and
similar.
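For example, a minimal sketch of a clock()-style wrapper over a free-running microsecond counter (read_usec_timer() is a stand-in for however the counter is actually read; not the actual runtime code):

#include <time.h>
#include <stdint.h>

extern uint64_t read_usec_timer(void);   /* hypothetical 1 MHz counter read */

static uint64_t proc_start_us;

void clock_init(void)                    /* called at process startup */
{
    proc_start_us = read_usec_timer();
}

clock_t clock(void)
{
    uint64_t dt = read_usec_timer() - proc_start_us;
    /* scale elapsed microseconds to CLOCKS_PER_SEC ticks
       (overflow after a few hundred days ignored here) */
    return (clock_t)(dt * CLOCKS_PER_SEC / 1000000ull);
}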
>
>
Also, there are relatively few software timing tasks where we have much
reason to care about nanoseconds. For many tasks, milliseconds are
sufficient, but there are some things where microseconds matters.
We used to run a benchmark 1,000,000 times in order to get accurate information using a timer with 1-second resolution. We do not want to continue at that level.
OK.
>
Of which, all of the CPUID indices were also mapped into CSR space.
>
CPUID is soooooo pre-PCIe.
>
>
Dunno.
>
Mine is different from x86, in that it mostly functions like read-only
registers.
x86 uses a narrow bus that runs around the whole chip so it can access all sorts of stuff (only some of which is available to users), and some of these accesses take thousands of cycles.
OK.
Most of what my CPUID reports is entirely local to the module in question. The time is fed in externally (via a 1MHz clock).
The RNG is fed by entropy gathering, with a sort of "entropy path" that moves along through the ringbus and can deliver entropy to anything with a ringbus connection. This is fed into a sort of specialized structure of free-running LFSRs.
Externally, there is some logic (out near the L2 cache) which mostly serves the purpose of gathering entropy and injecting it back into the ringbus. Granted, how much "true entropy" exists here is subject to interpretation (one only gets true entropy insofar as the sources being probed are non-deterministic).
However, the CPU itself is in a much better place to scavenge entropy than the OS is.
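A toy model of the idea (not the actual hardware; polynomials and widths here are just examples): a few free-running LFSRs stepped every cycle, with scavenged entropy XORed in as it arrives off the bus.

#include <stdint.h>

static uint32_t lfsr[4] = { 0xACE1F00Du, 0xBEEF1234u, 0x13572468u, 0x87654321u };
static const uint32_t taps[4] = { 0x80200003u, 0xA3000000u, 0xC0000401u, 0x80000057u };

static uint32_t lfsr_step(uint32_t s, uint32_t tap)  /* Galois LFSR, right-shifting */
{
    uint32_t lsb = s & 1u;
    s >>= 1;
    if (lsb) s ^= tap;
    return s;
}

void rng_clock(void)                     /* advanced every clock cycle */
{
    for (int i = 0; i < 4; i++)
        lfsr[i] = lfsr_step(lfsr[i], taps[i]);
}

void rng_inject(uint32_t noise)          /* entropy arriving over the bus;
                                            a real design would also avoid
                                            letting a register stick at zero */
{
    lfsr[noise & 3u] ^= noise;
}

uint32_t rng_read(void)                  /* what the CPUID-style read returns */
{
    return lfsr[0] ^ lfsr[1] ^ lfsr[2] ^ lfsr[3];
}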
There were some delays added to the 1 MHz path, as otherwise timing was tight along this path, and if there are a few extra cycles of delay here, it does not matter.
RISC-V land seemingly exposes a microsecond timer via MMIO instead, but
this is much less useful as this means needing to use a syscall to fetch
the current time, which is slow.
Or a generous MMU handler that lets some trusted low-privilege-level processes have direct access.
Probably a bit much, unless it is by itself.
Ideally, one doesn't want userland to have access to anything outside of what is "safe".
Granted, can also note that on RasPi, userland can easily get access to the MMIO range for GPIO, which also exposes a lot of other "sensitive" parts of the chipset (like, the application could easily bypass the OS and start directly poking at the SDcard or similar).
But, OTOH, RasPi didn't provide any good/fast way to access the SDcard in a way that didn't punch a big hole in the system's protection scheme.
Doom manages to fetch the current time frequently enough that doing so
via a syscall has a visible effect on performance.
I had an old Timex a long time ago that I had to adjust the time
about 3 times a day to have any chance of accuracy. Solution--
quit wearing a watch.
Some of it was that the Doom port ends up needing to probe multiple times per frame for things like audio and MIDI updates.
Say, to keep the loop-buffer in sync, and for MIDI, which may need to maintain a 72Hz update rate for the music to sound OK, but with frame-rates that are generally nowhere near the needed 72Hz. If the MIDI is updated too slowly it might start sounding a bit "derpy" (I think in the DOS ports, things like MIDI updates would have been driven off a timer interrupt).
But, to some extent this means that, for every drawn BSP leaf or similar, one also needs to check whether the MIDI state needs to be updated, which (if done via a system call) will eat a lot of cycles (though, the actual rate of issuing new MIDI commands is far less than this).
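The polling pattern amounts to something like this (a sketch with made-up names; a cheap user-level timer read makes checking per BSP leaf viable, where a syscall per check would not be):

#include <stdint.h>

#define MIDI_TICK_US (1000000u / 72u)    /* ~72 Hz update rate */

extern uint64_t read_usec_timer(void);   /* hypothetical cheap timer read */
extern void     midi_update(void);       /* advance the sequencer one tick */

static uint64_t next_midi_us;

void maybe_update_midi(void)             /* called per drawn BSP leaf, etc. */
{
    uint64_t now = read_usec_timer();
    if (now >= next_midi_us) {
        midi_update();
        next_midi_us = now + MIDI_TICK_US;
    }
}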
>
My 66000 does not even have a 32-bit space to map into.
You can synthesize such a space by not using any of the
top 32 address bits in PTEs--but why ??
>
>
32-bit space is just the first 4GB of physical space.
But, as-is, there is pretty much nothing outside of the first 4GB.
>
>
The MMIO space actually in use is also still 28 bits.
You are not trying to access 1,000 AHCI disks on a single rack, either;
each disk supporting several hundred GuestOSs.
Granted.
For an FPGA implementation, the 28 bit MMIO space is plenty...
PC's got along pretty well for a while with 16-bit IO port addresses.
The VRAM maps 128K in MMIO space, but in retrospect probably should have
been more. When I designed it, I didn't figure there would be a need for more than 128K. The RAM-backed framebuffer can be bigger though, but not
too much bigger, as then screen refresh starts getting too glitchy (as
it competes with the CPU for the L2 cache, but is more timing
sensitive).
One would think the very minimum to be 32-bit color (8,8,8,8)
on an 8K display/monitor.
That framebuffer would need more RAM than exists on the main FPGA board I am using (and 8GB/sec of memory bandwidth just for the screen refresh).
It can basically barely manage enough bandwidth for 640x400 or 640x480 256-color.
But, when I designed the HW interface, I went with a 128K reservation originally imagining 16K or 32K for a color-cell display (and also partly because this is what VGA had reserved for itself, and it was seemingly sufficient for VGA).
Eventually ended up with 320x200 hi-color though. This uses up the whole region.
640x480 in 16-color mode exceeds this (but, apparently, VGA had dealt with this issue by using bit plane graphics).
But, can note that, with the RAM-backed framebuffer, going much over 256K requires too much bandwidth for the refresh; staying below 128K is better for image stability.
800x600 at 2bpp color-cell needs 120K, so this works.
1024x768 needs 197K, which works, but is pushing it.
Other options for 1024x768 (rough size check in the snippet after this list):
Monochrome: 98K
1.25bpp CC: 124K (possible, doesn't exist yet)
Text Mode : 50K (at 4 bytes per character cell)
(Has more colors and a larger glyph set vs CGA/VGA text mode).
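As a quick check on the figures above (sizes in decimal K, i.e. bytes/1000, matching the numbers quoted):

#include <stdio.h>

static double fb_k(int w, int h, double bpp)
{
    return w * h * bpp / 8.0 / 1000.0;   /* framebuffer size in decimal K */
}

int main(void)
{
    printf("320x200  @ 16 bpp  : %6.1fK\n", fb_k(320, 200, 16));    /* 128.0K */
    printf("800x600  @ 2 bpp   : %6.1fK\n", fb_k(800, 600, 2));     /* 120.0K */
    printf("1024x768 @ 2 bpp   : %6.1fK\n", fb_k(1024, 768, 2));    /* ~197K  */
    printf("1024x768 @ 1 bpp   : %6.1fK\n", fb_k(1024, 768, 1));    /* ~98K   */
    printf("1024x768 @ 1.25 bpp: %6.1fK\n", fb_k(1024, 768, 1.25)); /* ~123K  */
    return 0;
}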
Can note though that these higher-res modes need to be driven at a 50 MHz clock, vs 320x200 and 640x480 and similar using a 25 MHz clock; leading to non-standard timings. But, at least the LCD monitors I had tried had tolerated a lot of my non-standard timings (even if they would often misidentify the intended mode and then display the image in wonky ways).
Doing this timing properly while also supporting multiple modes would likely require a freely adjustable pixel clock (not really a thing here). The operating frequencies for pixel clocks are generally too fast to get good results from an accumulation timer (at least one of the LCD monitors I had tried looks like epic dog crap if the pixel clock isn't stable).
-----------
>
My interconnect bus is 1 cache line (512-bits) per cycle plus
address and command.
>
>
My bus is 128 bits, but MMIO operations are 64-bits.
>
Where, for MMIO, every access involves a whole round-trip over the bus
(unlike for RAM-like access, where things can be held in the L1 cache).
>
In theory, MMIO operations could be widened to allow 128-bit access, but
haven't done so. This would require widening the data path for MMIO
devices.
>
Can note that when the request goes onto the MMIO bus, data narrows to
64-bit and address narrows to 28 bits. Non-MMIO range requests (from the
ringbus) are not allowed onto the MMIO bus, and the MMIO bus will not
accept any new requests until the prior request has either finished or
timed out.
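In C-ish pseudo-form, the bridge behaves roughly like this (a toy model; the real logic is Verilog, and the timeout value here is made up):

#include <stdint.h>

enum { MMIO_IDLE, MMIO_BUSY };

#define MMIO_TIMEOUT_CYCLES 1024u        /* assumed timeout value */

extern int is_mmio_range(uint64_t addr); /* hypothetical 28-bit-range check */

static int      state = MMIO_IDLE;
static uint32_t timeout_ctr;

int mmio_try_start(uint64_t addr)        /* request arriving off the ringbus */
{
    if (!is_mmio_range(addr))
        return 0;                        /* non-MMIO requests never enter */
    if (state != MMIO_IDLE)
        return 0;                        /* prior request still in flight */
    state = MMIO_BUSY;
    timeout_ctr = MMIO_TIMEOUT_CYCLES;
    return 1;                            /* request accepted, now exclusive */
}

void mmio_clock(int device_done)         /* advanced every bus clock */
{
    if (state == MMIO_BUSY && (device_done || --timeout_ctr == 0))
        state = MMIO_IDLE;               /* finished or timed out */
}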
I see a big source of timing problems here.
We are approaching 128 cores on a die and more than 256 devices down the PCIe tree. Does allowing only 1 access from one core to one device
at a time make any sense ?? No, you specify virtual channels, accesses
down a PCIe segment remain ordered while on the Tree and serialize at
the device (function) itself.
On the main FPGA I am using, it is single core.
If the FPGA could afford more than 1 or maybe 2 cores, and more than a small number of relatively limited MMIO devices, maybe I would have reason to care about the MMIO bus interface becoming a system-level bottleneck.
Even the ring-bus design itself was more me trying to come up with a bus that was both moderately fast and cheap. Faster bus designs are possible, but are not cheaper...
And, the MMIO bus has a 28-bit address and 64-bit data because it doesn't really need more than this.
Well, and the devices that need more than this, generally sit on the ringbus.
Well, except the SPI device, but it is odd, being one of the higher-bandwidth devices still on the MMIO bus. And, along with the display module, it was one of the original motivations for widening the MMIO data path to 64 bits.
But, the SPI device mostly just has the working goal of being fast enough to not bottleneck access to the SDcard (not an issue at present, but if I ran the SDcard in UHS mode, it would be a bit more of an issue, as it would get potentially roughly 8x faster at the same clock speed).
Contrast with x86 getting mostly OK results for a while with 16-bit port numbers and 8|16 bit data access.
But, say, one is not particularly bandwidth-limited when accessing a serial UART or PS2 keyboard interface or similar...