On 9/1/2024 1:34 AM, Terje Mathisen wrote:
MitchAlsup1 wrote:
On Fri, 30 Aug 2024 22:42:19 +0000, BGB wrote:
On 8/30/2024 1:11 PM, MitchAlsup1 wrote:
On Thu, 29 Aug 2024 19:07:29 +0000, BGB wrote:
Integer Overflow

Not usually a thing. Pretty much everything seems to treat integer
overflow as silently wrapping.

Ada wants these.

Bad Instruction encoding--OpCode exists but not as this instruction
uses it. Random code generation can use every instruction without
privilege.

Hit or miss.

Will usually fault on invalid instructions.

Must be 100% to guarantee upwards compatibility.

There is logic in place to reject privileged instructions in user-mode,
if the CPU is actually run in user-mode. Some of this is still TODO
(currently, TestKern is still running everything in Supervisor Mode).

Yes, it is a pain--but a pain that is absolutely worth it.

The alternative is to treat them as UB, so they may do one of:
Trap;
Do something else (like, if an instruction was added);
Do something wonky / unintended.

In practice, this seems to be more how it works.

Bad practice == not industrial quality.

Bad address--address exists but you are not allowed to touch it
with LD or ST instructions or to attempt to execute it.

If the MMU is enabled, it should fault on bad memory accesses.

In physical addressing mode, it does not trap.

YOU FAIL TO UNDERSTAND--there is an area in memory where the
preserved registers are stored--stored in a way that only 3
instructions can access--and the PTE is marked RWE=000.
This prevents damaging the contract between callee and caller.
Only 3 instructions can access these pages: ENTER, EXIT, and RET;
nothing else.

IIRC, there was a mechanism on the bus to deal with accesses to bad
physical addresses (returning all zeroes). Otherwise, trying to access
an invalid address would cause the CPU to deadlock.

It is NOT a BAD address--it is a good but inaccessible address
outside those 3 instructions.

As I understand it, you don't even get FMUL correctly rounded.
To get it properly rounded you have to compute the full 53*53
product.

AFAICT, this wasn't required by the 1985 spec...

You cannot get rounding correct unless you "compute as if to
infinite precision" and then follow the rules of rounding
(in all modes).

This rule is in fact really simple:
In all versions of the standard, from the very first up to the upcoming 2029 revision, the core operations (FADD/FSUB/FMUL/FDIV/FSQRT) MUST produce the correctly rounded result, according to whatever the current rounding mode is/was.
This does mean that you have to act as if you did the calculation to arbitrary/infinite precision, which really means "enough bits so that any following bits do not matter for the rounding result".
It was a revelation to me when I wrote my first FP emulation code and grokked how a single guard bit followed by a sticky bit is sufficient to do this for all rounding modes.
At that point I only needed to maintain enough intermediate bits to guarantee I would still have those rounding bits after normalization.
This doesn't mean that I could skip calculating all the bits of the full NxN->2N mantissa product, only that I didn't need to keep them all around after normalization.
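
As a minimal sketch of that guard+sticky rounding step (assuming Binary32-sized 24-bit mantissas so the full product fits in 64 bits; exponent tracking and specials are omitted, and this is not the poster's actual code):

#include <stdint.h>

/* Round a 24x24-bit mantissa product (up to 48 bits) down to 24 bits
   with round-to-nearest-even, using only a guard bit and a sticky bit. */
static uint32_t round_product_rne(uint64_t prod)
{
    if (!(prod >> 47))          /* normalize so bit 47 is the leading 1 */
        prod <<= 1;             /* (caller adjusts the exponent)        */

    uint32_t keep   = (uint32_t)(prod >> 24);         /* top 24 bits    */
    uint32_t guard  = (uint32_t)(prod >> 23) & 1;     /* first dropped  */
    uint32_t sticky = (prod & 0x7FFFFFu) != 0;        /* OR of the rest */

    if (guard && (sticky || (keep & 1)))  /* ties round to even */
        keep++;                 /* may carry to 2^24; renormalize if so */
    return keep;
}

The directed rounding modes only change the final condition (using the sign plus inexact = guard|sticky); guard+sticky still carries all the information needed.
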
OK.
When I initially looked over the 1985 spec, it seemed to only require that the intermediate result be wider than the destination format (I seemingly missed the point that it also requires computing as if to infinite precision).
Say, 54*54 => 68 bits, where 68 > 52; under this interpretation, that would have sufficed. Granted, this turns rounding into a probability game of whether the result is correct or off by 1 ulp: for example, if the kept bits end in guard=1 followed by zeroes, a discarded non-zero bit further down should break the tie upward, but truncation makes it look like an exact halfway case.
But I have since noticed that the standard did specify computing to infinite precision (even in that version), which my FPU does not do.
There was mention of some operations that I have generally not seen in the ISAs of real-world FPUs:
An FP remainder operator;
Converters to/from ASCII strings;
An FP->Int truncate operator with the result still in FP format;
Usually, one goes round-trip FP->Int->FP (see the sketch below);
...
Seems like pretty much everyone offloaded these tasks to the C library.
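
As a sketch of that round trip in C (my_trunc is a hypothetical helper; real library code is more careful about edge cases):

#include <stdint.h>
#include <math.h>

/* FP->Int truncate with the result kept in FP format: values with
   |x| >= 2^53 already have no fraction bits (and might not fit in
   int64_t), so pass them through; this also passes NaN/Inf. */
double my_trunc(double x)
{
    if (!(fabs(x) < 9007199254740992.0))   /* 2^53 */
        return x;
    return (double)(int64_t)x;             /* the FP->Int->FP round trip */
}
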
I had ended up with coverage of most of the rest, albeit still lacking a "trap on denormal" handler (this seemingly worked for MIPS and friends, *).
So, it seemed like it was getting pretty close to "could maybe pass the 1985 spec if one lawyers it...". Maybe not so much, it seems, unless I fix the FMUL issue (TBD whether that can be done without significantly increasing adder-chain latency).
It is possible I could also add a check to detect and trap multiplies where both values have non-zero low-order mantissa bits (allowing these to also be emulated in software).
So, I went and added a "trap as needed to emulate full IEEE semantics" flag to FPSCR. The idea is that enabling it causes a trap whenever the FPU detects that a result would likely not match the IEEE standard (when using FADDG/FSUBG/FMULG/..., generally if fenv_access is enabled).
Might make sense to have a compiler option to assume fenv_access is always enabled.
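
A guess at what that detection check could look like, as a C model (the 26-bit "low-order" cutoff and the field extraction are assumptions on my part, not the actual FPU design):

#include <stdint.h>

/* Trap filter sketch: only multiplies where BOTH mantissas have
   non-zero low-order bits can produce enough product bits to fall
   off the truncated multiplier, so only those trap to emulation. */
static int fmul_needs_trap(uint64_t mant_a, uint64_t mant_b)
{
    const uint64_t low_mask = (1ull << 26) - 1;   /* assumed cutoff */
    return ((mant_a & low_mask) != 0) &&
           ((mant_b & low_mask) != 0);
}
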
*: Though, from what I can gather, most of the N64 games and similar had operated with this disabled (giving DAZ/FTZ semantics), which apparently posed an annoyance for later emulators: things like moving platforms in games like SMB64 would slowly drift upwards or away from the origin if the map was left running long enough, due to SSE and similar tending to operate with denormals enabled.
But, I guess there was also the fun that emulated textures don't look quite the same either, as the N64 used an approximation of bilinear filtering that sampled only 3 points rather than the standard 4.
Though, in my rasterizer module, I also copied this trick (since it saves one block-texture decoder and allows cheaper interpolation logic). Well, and there is the recent extra wonk of shoving HDR (as E4.F4 FP8U) through this pathway (falling back to software rendering depending on the blending mode).
This leads to more wonk, like still using linear alpha blending, and essentially turning the color modulation into E3.F5 unit range, as multiplying an E3.F5 unit-range value by an E4.F4 FP8U value (and taking the high part of the result, as in a normal LDR multiply) gives approximately the desired result.
Lots of cheap hacks, but it allowed some semblance of HDR in a unit designed primarily for LDR...
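
For reference, a reconstruction of the general 3-sample scheme (not the N64's or this rasterizer's exact arithmetic): the fractional position picks one of the two triangles of the texel quad, and the two fractions act as barycentric-style weights:

/* 3-point "bilinear": blend the 3 texels of the triangle containing
   the sample point instead of all 4 corners of the quad. */
static float filter3(float t00, float t10, float t01, float t11,
                     float fx, float fy)   /* fx, fy in [0,1) */
{
    if (fx + fy < 1.0f)   /* upper-left triangle: t00, t10, t01 */
        return t00 + fx * (t10 - t00) + fy * (t01 - t00);
    else                  /* lower-right triangle: t11, t01, t10 */
        return t11 + (1.0f - fx) * (t01 - t11) + (1.0f - fy) * (t10 - t11);
}
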
For FMAC (with single rounding, which is the interesting one) you can of course get catastrophic cancellation, so you need all 2N mantissa bits of the multiplication plus the N bits of the addend; then you either need a normalizer wide enough to take in any possible alignment of the two parts, or you must have separate logic for each of the major cases.
Yeah, for the 2008 spec onward, would also need this...
It is possible to provide it as a library call, but granted this makes it slower.
There are FMAC instructions, but they are currently both slow and double-rounded (so, not so useful). Well, except for Binary16 and Binary32, which appear single-rounded mostly because they happen to be performed internally as Binary64 (but are still slow).
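
The single- vs. double-rounded difference is easy to see with plain C and math.h fma(); here the exact product 1 - 2^-54 is a round-to-even tie, which a separate FMUL rounds to 1.0 before the add ever sees it:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.0 + 0x1p-27, b = 1.0 - 0x1p-27, c = -1.0;
    printf("mul+add: %a\n", a * b + c);     /* 0x0p+0: product rounded first */
    printf("fma:     %a\n", fma(a, b, c));  /* -0x1p-54: single rounding */
    return 0;
}
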
I guess I might try looking into designing a "proper" Binary64 FMAC unit, though it seems like such a unit would be fairly expensive and have a fairly long latency.
Though, in that case, I might need a way to specify whether one wants slower-but-accurate operations or faster-but-less-accurate ones (with separate faster and slower Binary64 units). But multiple sets of Binary64 units would add resource cost (and the fast-path option of "use Binary64 format but at roughly Binary32 precision" is not likely to be usable as the general case for "double"; it worked for "float" though).
Though, the likely option would be to leave FMUL as-is; requesting the slow-but-accurate multiply would involve using FMULG (which also checks the rounding mode and updates the FPSCR flag bits, vs 'FMUL', which does not update FPSCR flags and is hard-wired to round-nearest).
Well, and there is also FMULA:
Binary64 format, with roughly Binary32 precision;
Output is basically the raw output of a Binary32 truncated multiply;
Only uses the high-order parts of the mantissa and skips the
low-order results, with no rounding or similar.
Still TBD.
Would also be annoying if I wanted to support the RISC-V 'V' extension:
Would need to supply 32x128 bit additional registers;
Would need to support 8-wide Binary16 SIMD;
Meaning, 4 additional low-precision FMUL+FADD units.
Also SIMD FMAC, which was not done mostly because I wasn't sure how to fit it into a 3-cycle latency. But it is a desirable operation for things like NN math and matrix multiply, provided it can be faster than doing it via separate ops (it is not so useful if slower than separate ops, as is presently the case).
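
For context, the kind of loop where this matters; a generic C sketch with illustrative names (not tied to any particular ISA), one fused multiply-add per element:

#include <math.h>

/* Matrix-multiply inner loop: the FMAC is the entire recurrence, so
   a slow FMAC (or splitting it into FMUL+FADD) directly gates speed. */
static void matmul_row(const float *a, const float *b, float *c,
                       int n, int k)
{
    for (int j = 0; j < n; j++) {
        float acc = 0.0f;
        for (int i = 0; i < k; i++)
            acc = fmaf(a[i], b[i * n + j], acc);  /* SIMD FMAC target */
        c[j] = acc;
    }
}
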
Terje