On 2/19/2025 9:02 PM, MitchAlsup1 wrote:
On Wed, 19 Feb 2025 22:42:04 +0000, BGB wrote:
On 2/19/2025 11:31 AM, MitchAlsup1 wrote:
On Wed, 19 Feb 2025 16:35:41 +0000, Terje Mathisen wrote:
>
------------------
sign+ULP+Guard+sticky is all you ever need for any rounding mode
IEEE or beyond.
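As a rough illustration of that claim, here is a small Python sketch of the round-up decision for the usual rounding modes, using only the sign, the ULP (lowest kept) bit, the guard bit, and the sticky bit. The function name and the mode strings are made up for this example and are not taken from any particular FPU or library; it also does not model any overflow special-casing.

```python
def round_increment(mode, sign, lsb, guard, sticky):
    """Return 1 if the truncated mantissa should be incremented, else 0.

    sign   : 1 for a negative result, 0 for positive
    lsb    : lowest kept mantissa bit (the ULP bit)
    guard  : first discarded bit
    sticky : OR of all remaining discarded bits
    (All names here are illustrative assumptions.)
    """
    inexact = guard | sticky
    if mode == "nearest_even":
        # Round up when above halfway, or exactly halfway with an odd LSB.
        return guard & (sticky | lsb)
    if mode == "nearest_away":
        # Halfway or above always rounds away from zero.
        return guard
    if mode == "toward_zero":
        # Plain truncation.
        return 0
    if mode == "toward_pos":
        # Only positive inexact results grow in magnitude.
        return inexact & (1 - sign)
    if mode == "toward_neg":
        # Only negative inexact results grow in magnitude.
        return inexact & sign
    raise ValueError(mode)
```

For example, `round_increment("nearest_even", 0, 0, 1, 0)` is the exact-tie case with an even LSB and returns 0, while the same tie with `lsb=1` returns 1.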
>
That's what I believed all through the 2019 standards process and up to
a month or two ago:
>
In reality, the "NearestOrEven" rounding rule has an exception if/when
you need to round the largest possible fp number, with guard=1 and
sticky=0:
>
I.e. exactly halfway to the next possible value (which would be Inf)
>
In just this particular case, the OrEven part is skipped in favor of not
rounding up, so leaving a maximum/odd mantissa.
>
In the same case but sticky=1 we do round up to Inf.
>
This unfortunately means that the rounding circuit needs to be combined
with an exp+mant==0b111...111 input. :-(
>
You should rename that mode as "Round but stay finite"
>
>
So, does it overflow?...
Based on how IEEE 754 worked throughout its history::
If the calculation overflows without the need for rounding;
yes, it overflows. The only difference is that rounding all by
itself does not cause an overflow.
----------------
OK.
It almost is kinda sad in a way that IEEE-754 lacks the same sort of wonky overflow behaviors that we accept as standard in integer land.
Like, what if, say:
There were no Inf or NaN, and FPU just quietly overflowed and wrapped around (probably back down to near-zero range, but probably with the opposite sign).
Though, IIRC, this was sort of a thing (at least at one point) for Binary16 on ARM. Like, it wasn't until later that they switched to having Inf and NaN and similar.
Though, in my case, I discard Inf/NaN for FP8, but make it saturating (0x7F and 0xFF representing the maximum and minimum values). Then 0x00/0x80 are "usually" understood as 0. Sorta makes more sense for FP8 as they are small enough that it is practical to also deal with the entire mantissa in these cases.
But, it is a tradeoff: without Inf/NaN, 99968.0 can be expressed with Binary16, but with Inf/NaN in existence, this is out of range...
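The arithmetic behind that tradeoff can be checked with a few lines of Python. This is only a sketch under the usual Binary16 layout (5-bit exponent with bias 15, 10-bit fraction); the helper names are invented for the example.

```python
BIAS = 15        # Binary16 exponent bias
FRAC_BITS = 10   # Binary16 fraction width

def max_finite(top_exp_field):
    # Largest value with the given biased exponent field and an all-ones fraction.
    return 2.0 ** (top_exp_field - BIAS) * (2.0 - 2.0 ** -FRAC_BITS)

# With Inf/NaN, exponent field 31 is reserved, so 30 is the top usable field.
ieee_max = max_finite(30)
# Without Inf/NaN, exponent field 31 holds ordinary numbers.
no_inf_max = max_finite(31)

def representable_in_top_binade(x):
    # Values in [2**16, 2**17) are spaced 2**(16 - FRAC_BITS) = 64 apart.
    step = 2.0 ** (16 - FRAC_BITS)
    return 2.0 ** 16 <= x < 2.0 ** 17 and x % step == 0

print(ieee_max, no_inf_max, representable_in_top_binade(99968.0))
```

So the top binade roughly doubles the range (65504 versus 131008), and 99968.0 falls on the 64-wide grid inside it.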
>
Admittedly part of why I have such mixed feelings on full
compare-and-branch:
Pro: It can offer a performance advantage (in terms of per-clock);
Con: Branch is now beholden to the latency of a Subtract.
Con: it can't compare to a constant
Con: it can't compare floating point
I have experimented with encodings in my RV+Jx mode and XG3 that can allow for constants...
However, the performance delta is pretty small.
Meanwhile:
SLT X6, X18, 0x123
BNE .L0, X6, X0
Isn't that much different than:
LI X6, 0x123
BNE .L0, X18, X6
In my case, I also have BTST/BNTST cases; but these can be done with lower latency...
Though, the relative gain from BTST/BNTST is also fairly modest.
I wouldn't expect much savings from FPU compare here, as they tend not to be all that high on the clock-cycle rankings. Theoretically, they wouldn't cost too much to add (though, potentially, with fairly loose adherence to IEEE semantics).
Though, one could argue that IEEE rules for comparison are a bit too complicated, and one could simplify:
<, >, <=, >=: Behave as equivalent to a sign/magnitude integer comparison.
==, !=: NaN special case merely makes == false, and thus != is true by extension.
Does implicitly mean NaN>Inf is true, but, probably fine in practice...
Main useful case is using "if(!(x==x)) ..." to detect NaN.
Otherwise, which side of the branch one's comparison falls on in the case of a NaN depends mostly on the whims of the compiler and similar.
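A minimal Python model of those simplified rules, working directly on Binary64 bit patterns, might look like the following. The function names are invented here, and treating -0 and +0 as equal (as a sign/magnitude integer would) is an assumption of the sketch rather than something stated above.

```python
import struct

SIGN = 1 << 63
MAG_MASK = SIGN - 1
EXP_MASK = 0x7FF << 52
FRAC_MASK = (1 << 52) - 1

def bits(x):
    """Raw Binary64 bit pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def is_nan(b):
    # All-ones exponent with a nonzero fraction.
    return (b & EXP_MASK) == EXP_MASK and (b & FRAC_MASK) != 0

def sm_value(b):
    # Interpret the 64 bits as a sign/magnitude integer.
    # Assumption: -0 and +0 both map to 0.
    mag = b & MAG_MASK
    return -mag if b & SIGN else mag

# Ordered compares ignore NaN entirely and just compare the integers.
def simple_lt(a, b): return sm_value(a) < sm_value(b)
def simple_gt(a, b): return sm_value(a) > sm_value(b)
def simple_le(a, b): return sm_value(a) <= sm_value(b)
def simple_ge(a, b): return sm_value(a) >= sm_value(b)

# Equality keeps the one NaN special case: NaN == anything is false.
def simple_eq(a, b):
    return not (is_nan(a) or is_nan(b)) and sm_value(a) == sm_value(b)

def simple_ne(a, b):
    return not simple_eq(a, b)

QNAN = 0x7FF8000000000000   # a positive quiet NaN
INF = 0x7FF0000000000000    # +Inf
```

Under this model `simple_gt(QNAN, INF)` is true, and `simple_ne(QNAN, QNAN)` is also true, so the `x != x` NaN test still works.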
----------------
Where, detecting all zeroes is at least cheaper than a subtract. But,
detecting all zeroes still isn't free (for 64b, ~ 10 LUTs and 3 LUTs
delay).
1 gate 4-inputs inverted
2 gates 16-inputs true
3 gates 64-inputs inverted
I was thinking here:
6 bits in, 1 bit out;
But, 64 inputs can't be covered by 2 levels of 6-input LUTs (6*6 = 36), so it needs to be split up into 3 levels.
With ASIC logic, presumably one could just construct a 64-input OR gate. Though, maybe this would get fiddly as one would have to balance pull-up and pull-down strength against transistor leakage; so 2-levels of 8-input OR's might make more sense.
Though, on FPGA, one could combine it with AND logic: (A & B) != 0
Which could still be handled in 3 levels of LUT6 (albeit with 26 LUTs).
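For a quick sanity check on those LUT counts, a back-of-the-envelope estimate can be written in Python. This is only a lower bound for a plain reduction tree, assuming each k-input LUT merges up to k signals into one; it ignores carry chains and whatever the synthesis tool actually does, and the function name is made up for the example.

```python
def lut_estimate(n_inputs, k=6):
    """Lower bound on LUT count and depth for a k-input-LUT reduction tree."""
    # Each k-input LUT replaces up to k signals with 1, so it removes at
    # most k-1 signals; going from n signals to 1 needs ceil((n-1)/(k-1)).
    luts = -(-(n_inputs - 1) // (k - 1))
    # Depth: smallest number of levels d such that k**d >= n_inputs.
    levels, reach = 0, 1
    while reach < n_inputs:
        reach *= k
        levels += 1
    return luts, levels

print(lut_estimate(64))    # 64-bit zero detect
print(lut_estimate(128))   # (A & B) != 0 with two 64-bit inputs
```

The 128-input case comes out at 26 LUTs in 3 levels, and 36 inputs is the most that fits in 2 levels of LUT6.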
I am admittedly a little annoyed as Windows had recently rebooted my PC on its own for Windows Update, which now means I probably need to wait until tomorrow to see the results of a few Verilog tweaks (might have otherwise seen them today if not for the reboot).
Mostly because the crashes happen after Doom does its whole "[..... ]" thing, which doesn't exactly happen quickly.
It could almost make sense to set up a decently fast PC, probably running Linux and with minimal or no GPU, mostly just to run Verilator simulations.
While I have an old Xeon E5410 based rack server, this is not ideal:
Uses a lot of power;
Sounds like one is running a vacuum cleaner;
The E5410 (at 2.3 GHz) is slower than my main PC at running Verilator.
Would probably want enough RAM and CPU that it could run, say, 8 simulations at the same time.
Though, reminds me of seeing a lot of people complaining online about CPU fan noise from PCs and similar...
Would be funny to see how this type of person would respond to rack-server levels of fan noise.
...
Maybe I should just go add the test case to the Boot ROM and fire up a 5th simulation. I will at least not have to wait until tomorrow to see the results on this one (e.g., whether the debug-prints will happen to reveal enough clues to locate the decoding bug...).
The first parts of the Boot-ROM, and also of the TestKern shell/kernel, are mostly just a bunch of sanity test code to verify whether or not various parts of the ISA and similar are behaving as expected.
But, in this case, can't put it in the shell, since this is still built in XG1 mode, though I did add the ability to boot the Boot-ROM in RISC-V and XG3 Modes (via an ugly hack), partly for this sort of testing.
Technically, the kernel could be built in XG2 or XG3 mode, but this would add hassle (or, at least, beyond that already spent building Doom for multiple ISA modes).
Though, in part, the kernel running in XG1 mode mostly works as a "confirm I haven't broken XG1" test.
...