Subject : Re: Cost of handling misaligned access
From : cr88192 (at) *nospam* gmail.com (BGB)
Groups : comp.arch
Date : 19 Feb 2025, 02:36:10
Organisation : A noiseless patient Spider
Message-ID : <vp3cio$1v51c$1@dont-email.me>
References : 1 2 3 4 5 6 7 8 9 10 11
User-Agent : Mozilla Thunderbird
On 2/18/2025 5:31 PM, MitchAlsup1 wrote:
On Tue, 18 Feb 2025 22:34:48 +0000, BGB wrote:
Say, one could imagine an abstract model where Binary64 FADD works sort
of like:
sgnA=valA>>63;
sgnB=valB>>63;
expA=(valA>>52)&2047;
expB=(valB>>52)&2047;
fraA=(valA&((1ULL<<52)-1));
fraB=(valB&((1ULL<<52)-1));
if(expA!=0)fraA|=1ULL<<52;
if(expB!=0)fraB|=1ULL<<52;
fraA=fraA<<9; //9 sub-ULP bits
fraB=fraB<<9;
shrA=(expB>=expA)?(expB-expA):0;
shrB=(expA>=expB)?(expA-expB):0;
expA-expB
That wasn't the only typo here...
sgn2A=sgnA; exp2A=expA; fra2A=fraA>>shrA;
sgn2B=sgnB; exp2B=expB; fra2B=fraB>>shrB;
//logical clock-edge here.
fr1C_A=fra2A+fra2B;
fr1C_B=fra2A-fra2B;
fr1C_C=fra2B-fra2A;
if(sgn2A^sgn2B)
{
if(fr1C_C>>63)
{ sgn1C=sgn2A; fra1C=fr1C_B; }
else
{ sgn1C=sgn2B; fra1C=fr1C_C; }
}
else
{ sgn1C=sgn2A; fra1C=fr1C_A; }
exp1C=(exp2B>=exp2A)?exp2B:exp2A;
sgn2C=sgn1C; exp2C=exp1C; fra2C=fra1C;
//logical clock-edge here.
if(fra2C>>62)
{ exp3C=exp2C+1; fra3C=fra2C>>1; }
else
{ shl=clz64(fra2C)-2; exp3C=exp2C-shl; fra3C=fra2C<<shl; }
//logical clock-edge here.
if((exp3C>=2047) || (exp3C<=0))
{ sgnC=sgn2C; expC=(exp3C<=0)?0:2047; fraC=0; }
else
{
sgnC=sgn2C; expC=exp3C; fraC=(fra3C>>9)&((1ULL<<52)-1);
//if rounding is done, it goes here.
}
valC=(sgnC<<63)|(expC<<52)|fraC;
//final clock edge.
//result is now ready.
I had also messed up the construction of the final value, as it failed to mask off the hidden bit, ...
This sort of thing is sometimes difficult to type freehand without messing up at least something...
But, yeah, I guess the open question is whether there is a generally cheaper way to approach this sort of thing that still gives acceptable results.
It is possible to reduce the number of adders in the middle, but with other drawbacks:
Using sign bits and selective bitwise inversion allows using only a single adder, but can lead to an undesirable artifact, e.g.:
4.0 - 1.0 => 2.999999
1.0 - 2.0 => -0.999999
...
For many use-cases, this would be unacceptable.
At least, more so than the lack of rounding and NaN handling and similar...
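The one-ULP-low artifact can be seen directly in integer terms: a single-adder subtract that bitwise-inverts B but omits the +1 carry-in computes A - B - 1 rather than A - B. A minimal sketch (`sub_one_adder` is a hypothetical name, not from the post):

```c
#include <stdint.h>

/* Hypothetical single-adder subtract: bitwise-invert B but omit the +1
 * carry-in that true two's-complement subtraction needs, so the result
 * comes out one ULP low: fraA + ~fraB == fraA - fraB - 1 (mod 2^64). */
static uint64_t sub_one_adder(uint64_t fraA, uint64_t fraB) {
    return fraA + ~fraB;
}
```

For the aligned mantissas of 4.0 and 1.0 this yields the mantissa of 3.0 minus one ULP; with truncate-only rounding that low result is kept as-is, giving the 2.999999-style outputs above.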
Well, and there is still the issue of FMUL.
I guess one option could be to define its behavior in terms of 16*16=>32 bit widening multipliers. Though, one would still need to supply a few 8*8=>16 multipliers or similar, which are more of a problem. Defining it in terms of 17-bit multiplies with 3*3=>6 on the bottom better matches how one might do it on an FPGA (effectively multiplying two 53 bit values, and discarding the low results).
Or, the possibly cheaper alternative of allowing the low order bits of the mantissa to be ignored, possibly leaving:
S.E11.M47.Z5
But, giving an FMUL which could be described "exactly" using 6 unsigned 16 bit multipliers (but, "less good" than using the full 52 bits).
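A minimal sketch of that 6-multiplier idea, assuming a 48-bit mantissa (hidden bit included) split into three 16-bit limbs (`mul48_hi` is a hypothetical name): only the six partial products whose weight reaches the upper half are kept, and the three low-order products are discarded.

```c
#include <stdint.h>

/* Approximate high half of a 48x48-bit product using only six 16x16=>32
 * multipliers: keep the partial products whose limb-index sum is >= 2,
 * discard a0*b0, a0*b1, a1*b0. Returns roughly (a*b) >> 32; the result
 * never exceeds the exact value and falls at most ~2^17 below it. */
static uint64_t mul48_hi(uint64_t a, uint64_t b) {
    uint32_t a0 = a & 0xFFFF, a1 = (a >> 16) & 0xFFFF, a2 = (a >> 32) & 0xFFFF;
    uint32_t b0 = b & 0xFFFF, b1 = (b >> 16) & 0xFFFF, b2 = (b >> 32) & 0xFFFF;
    uint64_t hi  = (uint64_t)a2 * b2;                                       /* weight 2^64 */
    uint64_t mid = (uint64_t)a2 * b1 + (uint64_t)a1 * b2;                   /* weight 2^48 */
    uint64_t lo  = (uint64_t)a2 * b0 + (uint64_t)a1 * b1 + (uint64_t)a0 * b2; /* weight 2^32 */
    return (hi << 32) + (mid << 16) + lo;
}
```

Since the discarded terms are all non-negative, the error is one-sided (truncation-like), which fits the truncate-only rounding discussed above.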
Though, if bit exact, seemingly there would likely still be the issue that N-R fails to fully converge due to "spooky magic" that happens within sub-ULP space.
But, then again, I haven't tested if this behavior would differ with truncate-only rounding.
Well, similarly, one could define some cheaper format converters as being almost exclusively in terms of bit-repacking logic (no rounding or other complex logic).
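For instance, Binary32-to-Binary64 widening is pure field repacking: re-bias the exponent and left-align the mantissa, which is exact for normal inputs, so no rounding logic is needed. A minimal sketch that flushes subnormals to zero and ignores Inf/NaN (`f32_to_f64_bits` is a hypothetical name):

```c
#include <stdint.h>

/* Widen Binary32 to Binary64 by repacking fields: re-bias the exponent
 * (127 -> 1023) and left-align the 23-bit mantissa into the 52-bit field.
 * Exact for normal numbers; this sketch flushes subnormals to zero and
 * does not special-case Inf/NaN. */
static uint64_t f32_to_f64_bits(uint32_t a) {
    uint64_t sgn = a >> 31;
    uint64_t exp = (a >> 23) & 255;
    uint64_t fra = a & 0x7FFFFF;
    if (exp == 0)
        return sgn << 63;                 /* subnormal/zero -> signed zero */
    return (sgn << 63) | ((exp - 127 + 1023) << 52) | (fra << 29);
}
```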
Though, there is less of a "good" way to do cheap integer conversion, besides the trick I had used for SIMD:
Map the integer range to a value between 1.0 and 2.0 (unsigned) or 2.0 and 4.0 (signed), and use the mantissa bits to hold the integer value.
Well, say, to convert 'x' to a 16-bit value:
y=x*(1.0/32768)+3.0;
Then, extract the high 16 bits of mantissa, inverting the MSB.
...
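A Binary32 version of that trick as a sketch (`f32_to_i16_trick` is a hypothetical name), assuming an integer-valued x in [-32768, 32767], for which y is exactly representable: adding 3.0 lands y in [2.0, 4.0), where the high 16 mantissa bits equal 32768 + x, and XORing the MSB recovers two's complement.

```c
#include <stdint.h>
#include <string.h>

/* Convert an integer-valued float in [-32768, 32767] to int16 by mapping
 * it into [2.0, 4.0): y = 3 + x/32768 has mantissa fraction 0.5 + x/65536,
 * so the high 16 mantissa bits hold 32768 + x; XORing the MSB turns that
 * biased value back into two's complement. */
static int16_t f32_to_i16_trick(float x) {
    float y = x * (1.0f / 32768) + 3.0f;
    uint32_t bits;
    memcpy(&bits, &y, sizeof bits);                 /* type-pun via memcpy */
    uint16_t m = (uint16_t)((bits >> 7) & 0xFFFF);  /* high 16 of the 23 mantissa bits */
    return (int16_t)(m ^ 0x8000);                   /* invert MSB */
}
```

Because both steps are exact under these assumptions, the only hardware needed is the FMA-style multiply-add and a bit-field extract.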