Re: Arguments for a sane ISA 6-years later

Subject: Re: Arguments for a sane ISA 6-years later
From: cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups: comp.arch
Date: 30 Jul 2024, 22:23:07
Organization: A noiseless patient Spider
Message-ID : <v8bi3e$16ahe$1@dont-email.me>
References : 1 2 3 4 5 6 7 8
User-Agent : Mozilla Thunderbird
On 7/30/2024 4:44 AM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> Otherwise, stuff isn't going to fit into the FPGAs.
>>
>> Something like TSO is a lot of complexity for not much gain.
>
> Given that you are so constrained, the easiest corner to cut is to
> have only one core.  And then even sequential consistency is trivial
> to implement.

On the XC7A100T, this is what I am doing...
With the current feature-set, I don't have enough resource budget to go dual-core at present.
I can go dual-core on the XC7A200T though.
Granted, one could argue that maybe one should not do such an elaborate CPU. Say, a case could be made for just doing a RISC-V implementation.
There is an RV32GC implementation (dual-issue superscalar) that can run on the XC7A100T that, ironically, still takes most of the FPGA and can only run at ~ 25 or 33 MHz. Its IPC is pretty good, but it runs at a low clock-speed and is 32-bit.
The only real way to make small/fast cores, though, is to make them single-issue and limit the feature-set (only doing a basic integer ISA).
Some of the cases where consistency issues have come up for me have to do with RAM-backed hardware devices, like the rasterizer module. It has its own internal caches that need to be flushed; not flushing caches (between this module and the CPU) when trying to "transfer" control over things like the framebuffer or Z-buffer can result in obvious graphical issues (and texture corruption doesn't look good either).
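The hazard here can be modeled in a few lines of C: the module draws into its own cache, so the CPU keeps seeing stale framebuffer memory until the cache is explicitly written back. All names below are illustrative only, not the actual module interface:

```c
#include <stdint.h>
#include <string.h>

enum { FB_WORDS = 4 };
static uint32_t fb_mem[FB_WORDS];      /* RAM-backed framebuffer */
static uint32_t rast_cache[FB_WORDS];  /* rasterizer's internal cache */
static int rast_dirty;

/* The module draws into its cache only; RAM is not updated yet. */
static void rast_draw(uint32_t color)
{
    for (int i = 0; i < FB_WORDS; i++)
        rast_cache[i] = color;
    rast_dirty = 1;
}

/* Write back dirty lines before handing the buffer to the CPU. */
static void rast_flush(void)
{
    if (rast_dirty) {
        memcpy(fb_mem, rast_cache, sizeof fb_mem);
        rast_dirty = 0;
    }
}
```

Skipping the flush step before the handoff is exactly the "obvious graphical issues" case: the CPU reads `fb_mem` while the new pixels are still sitting in `rast_cache`.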
At present, the implementation is based mostly on drawing to a backing buffer which (at least once per frame, often more) needs to be reclaimed by the main CPU, so that its contents can be drawn to the screen or into the window buffer (in GUI mode).
Currently though, this module is relatively fast, but generally the CPU side of things isn't fast enough to keep it busy.
At present, the CPU still does transform. It is looking like, if one wants speed, one might also need a module that is able to do things like 3D transform (and/or figure out ways to try to make the front-end stages faster).
Ironically, despite its seeming levels of suck, I am apparently getting (technical) 3D performance stats on par with the original PlayStation, but a lot of the PS1 games had arguably comically low geometric complexity.
Sadly, AFAIK, no one has open-sourced any of the PS1 games (and Quake 1/2/3 don't have quite the same level of geometric minimalism).
It might have been nice if the front-end stages could have been done using fixed-point math, but OpenGL is generally built around floating point, and stuff doesn't work correctly unless one has more or less full-precision Binary32 in the transform stages.
Also using "glBegin()"/"glEnd()" and doing math per-vertex is not ideal for a CPU bound use-case.
It is generally better in this case to try to prebuild vertex arrays and use "glDrawArrays()" or "glDrawElements()" or similar. But this isn't really how the Quake engines work. If anything, Quake3 leans a little more in this direction, seemingly doing much of its rendering with GL_TRIANGLE_FAN and GL_TRIANGLE_STRIP.
Ironically, in contrast to Quake 1 which really liked using GL_POLYGON.
With the current implementation, the likely fastest case would be to use vertex arrays and GL_QUADS (with an occasional collinear vertex when one needs a triangle).
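The GL_QUADS trick can be sketched like this: a triangle ABC becomes a quad by inserting a fourth vertex collinear with one edge (here the midpoint of edge B-C), so the quad covers the same area. Helper names are made up for illustration:

```c
typedef struct { float x, y, z; } vec3;

/* Midpoint of a and b: collinear with the edge a-b by construction. */
static vec3 midpoint(vec3 a, vec3 b)
{
    vec3 m = { (a.x + b.x) * 0.5f, (a.y + b.y) * 0.5f, (a.z + b.z) * 0.5f };
    return m;
}

/* Encode triangle a-b-c as the quad a, b, mid(b,c), c; the inserted
   vertex lies on edge b-c, so the quad covers the same region. */
static void tri_as_quad(vec3 a, vec3 b, vec3 c, vec3 out[4])
{
    out[0] = a;
    out[1] = b;
    out[2] = midpoint(b, c);
    out[3] = c;
}
```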
Though, it seems like GL 1.x assumed costs were per-vertex rather than per-primitive (triangle or quad in this case).
Actually, I am almost left to wonder if an API design like Direct3D might have fared better here.
Granted, a case could be made for trying to make an implementation which does most of its front-end work in homogeneous coordinates (AKA: 4D XYZW space) rather than world-space (but this would require a non-trivial rewrite of the front-end stages). It could, however, somewhat reduce the number of times I need to send vertices through the transformation matrix.
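Working in homogeneous coordinates, each vertex costs one 4x4 matrix-vector product; a plain scalar reference (column-major layout as in OpenGL, function name illustrative) looks like:

```c
typedef struct { float v[4]; } vec4;

/* One vertex through a 4x4 matrix, column-major as in OpenGL:
   r[i] = sum over j of m[4*j + i] * p[j]. */
static vec4 xform(const float m[16], vec4 p)
{
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.v[i] = m[0*4 + i] * p.v[0] + m[1*4 + i] * p.v[1]
               + m[2*4 + i] * p.v[2] + m[3*4 + i] * p.v[3];
    return r;
}
```

Sixteen multiply-adds per vertex is why reducing the number of transform passes (or offloading them to a module) matters for a CPU-bound front end.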
Based on past experiments with software rasterized GL though, I had assumed that most of the time was going to be eaten up by the backend work (edge walking and span drawing); which is where I had put most of my optimization attention in TKRA-GL.
OTOH, there are possibly other uses for a rasterizer module, such as potentially using it for 2D rendering tasks (without otherwise sending everything through the OpenGL API or similar).
Though, its use-cases are partially limited in that it only supports squarish power-of-2 textures in Morton order (which are atypical in things like UI drawing, where images are often NPOT and in raster order).
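For reference, the Morton (Z-order) texel index just interleaves the bits of x and y; a straightforward software version (not the module's actual logic) is:

```c
#include <stdint.h>

/* Morton index for a power-of-2 texture: bit i of x goes to bit 2i,
   bit i of y goes to bit 2i+1. */
static uint32_t morton2(uint32_t x, uint32_t y)
{
    uint32_t idx = 0;
    for (int i = 0; i < 16; i++) {
        idx |= ((x >> i) & 1u) << (2 * i);
        idx |= ((y >> i) & 1u) << (2 * i + 1);
    }
    return idx;
}
```

This is also why raster-order NPOT images don't map onto the module directly: neighboring texels in a row are no longer adjacent in memory.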
Well, technically the texture images and buffers also need to be at a physical address and with a 16-byte-aligned base address, ..., but nevermind this part...

> Contrast, floating point and precise exceptions are a lot more relevant
> to software.
>
> John von Neumann (IIRC) argued against floating point, with similar
> arguments that are now used to defend weak ordering.

Floating point has a lot of obvious use-cases though (and is already in widespread use). It would be a hard sell to have a processor without any floating-point support.
Like, we can use fixed point where it makes sense to do so, but there are also a lot of cases where fixed-point doesn't really work for the problem.
Granted, there are also cases of people using floating point where maybe they shouldn't.
Though, there are cases where one could argue for precision-reduced floating point, say:
   S.E8.F16.Z7
   S.E11.F32.Z20
Where, Z bits are ignored and filled with 0's (and the other low-order bits are not necessarily accurate).
The argument being, the former is Binary32 but using logic similar to what one might use for a proper Binary16 unit, and the latter is Binary64 with logic similar to what one may use for a proper Binary32 unit.
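The Z-bit idea can be modeled in software by masking: an S.E8.F16.Z7 result is an ordinary Binary32 value whose 7 low-order mantissa bits are forced to zero. This is only a sketch of the format, not of the hardware's actual rounding behavior:

```c
#include <stdint.h>
#include <string.h>

/* Force the 7 low-order "Z" mantissa bits of a Binary32 value to zero,
   modeling an S.E8.F16.Z7 result. */
static float f32_clear_z7(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);  /* type-pun via memcpy (well-defined) */
    u &= ~0x7Fu;               /* clear the 7 low mantissa bits */
    memcpy(&x, &u, sizeof x);
    return x;
}
```

The S.E11.F32.Z20 case is the same idea on the Binary64 bit pattern with a 20-bit mask.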
FWIW: The latter is what one may get in my case if using the FADDA/FSUBA/FMULA instructions, with no guarantees though about format other than that it is "equivalent to or slightly better than Binary32".
The former existed for SIMD, but got largely displaced by proper Binary32 as I actually needed fast Binary32 SIMD (and the truncated case only makes sense if the hardware is natively doing Binary16 or similar).
Granted, in both cases, assuming that one is doing the internal math for FMUL using DSP48's (hard logic) or something similar.
Don't currently have any "native" floating point smaller than Binary16; though several 8-bit formats (including A-Law) are supported via converter ops.
Despite going and defining dedicated FP8 formats (E4.F3.S, E4.F4, S.E4.F3), I have more often ended up using A-Law (S.E3.F4), sometimes adding an exponent bias (generally because it has both a sign and better accuracy in these cases).
Generally can't use A-Law directly for NN's, because it seems one needs around 8 bits or so for the intermediate accumulator (mostly requiring Binary16 or similar). But, FP8 or biased A-Law would make a sensible format for weights and inputs/outputs.
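As a point of reference, a software decode of the S.E4.F3 format might look like the following. This assumes a bias of 7 and IEEE-style subnormals at e==0, and ignores any NaN/Inf special cases, so it is only a sketch, not the actual converter-op semantics:

```c
#include <math.h>
#include <stdint.h>

/* Decode a hypothetical S.E4.F3 minifloat (bias 7, subnormals at e==0,
   no NaN/Inf handling) to float. */
static float fp8_s_e4_f3_to_f32(uint8_t v)
{
    int s = (v >> 7) & 1;
    int e = (v >> 3) & 15;
    int f = v & 7;
    float mag = (e == 0)
        ? ldexpf((float)f, -9)             /* subnormal: (f/8) * 2^(1-7) */
        : ldexpf((float)(8 + f), e - 10);  /* normal: (1 + f/8) * 2^(e-7) */
    return s ? -mag : mag;
}
```

A-Law (S.E3.F4) would be the analogous decode with a 3-bit exponent and 4-bit mantissa, modulo the exponent-bias wrinkle mentioned above.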
I guess, if one wants, they could try to make a case for a SIMD op that does, say:
   4xFP16 = 4xFP16 + 4xAL*4xAL
Where an exponent-biasing step is applied before the final weighting function (with the results being converted back to A-Law). Well, one could in principle apply a Binary16 FMAC here.
Well, and/or with an op that multiplies two A-Law inputs and produces Binary16 outputs (or use FP8 and avoid the biasing wonk at the expense of accuracy). This would still allow using A-Law without needing quite so many converter ops.
Though, a case could be made for FP8S over A-Law in that the mantissa multiply can fit directly into LUT6s (unlike the 4-bit A-Law mantissa). But one would then need to decide between the S.E4.F3 and E4.F3.S formats (though, arguably, the latter was a mistake; its "merit" mostly went out the window once A-Law got added to the mix).
It also looks like NVIDIA and similar are using S.E4.F3; pretty much no one else is using the A-Law format for this.
TBD.
Well, there are also the bitwise NN ops, which seem promising (*); will need to figure out a good way to interface them efficiently with FP16 nets (they have different use-cases; as the bitwise NN ops can only work with 1 or 2 bit inputs/outputs, with 3-bit weights).
Likely option for interfacing them would be, say, ops that take a SIMD vector and collapse the sign-bits down into the low-order bits:
   Rn = { Rn[59:0], S4, S3, S2, S1 };
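In C terms, the op above would behave roughly like this, treating the 4xFP16 vector as a 64-bit value with lane sign bits at positions 15, 31, 47, and 63:

```c
#include <stdint.h>

/* Shift Rn left by 4 and pack the four lane sign bits of a 4xFP16
   vector into the low bits: Rn = {Rn[59:0], S4, S3, S2, S1}. */
static uint64_t collapse_signs(uint64_t rn, uint64_t vec)
{
    uint64_t s = 0;
    for (int i = 0; i < 4; i++)
        s |= ((vec >> (16 * i + 15)) & 1u) << i;  /* S1..S4 at bits 0..3 */
    return (rn << 4) | s;
}
```

Repeated over a vector of SIMD registers, this accumulates the sign bits into a packed bit-vector suitable as input for the bitwise NN ops.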
*: They seem to be pretty effective at things like OCR, and a small-scale OCR test is fast enough to run inside a Verilator simulation without being annoyingly slow.
...

> The other examples I gave are all examples where people have argued
> that simplifying hardware at the cost of more complex software was the
> way to go, and history proved them wrong.

For a lot of that time, there has also been Moore's Law; the situation may start to change once Moore's Law is entirely dead and the only way to improve performance further is by reducing complexity in low-priority areas.
Say, even if per-core performance were worse and the cores were harder to program, this may still win if one can fit more cores on a chip and run a similar amount of work with less power.
I decided to leave out a tangent about neural nets and trying to process input from a camera module (sadly, even dealing with the input from a single 640x480 MIPI CPI "potato camera" in real-time is a bit of an ask on a Spartan-7 or Artix-7 FPGA board...).

> - anton

