On 7/30/2024 3:56 PM, Chris M. Thomasson wrote:
> On 7/30/2024 1:23 PM, BGB wrote:
>> On 7/30/2024 4:44 AM, Anton Ertl wrote:
>>> BGB <cr88192@gmail.com> writes:
>>>> Otherwise, stuff isn't going to fit into the FPGAs.
>>>>
>>>> Something like TSO is a lot of complexity for not much gain.
>>>
>>> Given that you are so constrained, the easiest corner to cut is to
>>> have only one core. And then even sequential consistency is trivial
>>> to implement.
>>
>> On the XC7A100T, this is what I am doing...
>>
>> With the current feature-set, I don't have enough resource budget to
>> go dual-core at present.
>>
>> I can go dual-core on the XC7A200T though.
>>
>> Granted, one could argue that maybe one should not do such an
>> elaborate CPU. Say, a case could be made for just doing a RISC-V
>> implementation.
>>
>> There is an RV32GC implementation (dual-issue superscalar) that can
>> run on the XC7A100T that, ironically, still takes up most of the FPGA
>> and can only run at ~25 or 33 MHz. Its IPC is pretty good, but it
>> runs at a low clock speed and is 32-bit.
>>
>> The only real way to make small/fast cores, though, is to make them
>> single-issue and limit the feature-set (only doing a basic integer
>> ISA).
>> [...]
>
> Have you ever messed around with a Cell processor? Think of its vector
> processing units, or Synergistic Processing Elements (SPEs), IIRC.
> Also, IIRC it was not that easy to program for; buffered DMA wrt the
> SPEs, again IIRC. So, some games only used the "single" PPE unit.
> IIRC, they wanted more PPE units, but that was not realized...
I have no real first-hand experience programming for it; I was in my early 20s when the PlayStation 3 came out, and wasn't really messing with much of anything beyond normal desktop PCs at the time.
I had a few times considered trying to pair a bigger core (such as one running the BJX2 main profile) with smaller cores (running a smaller BJX2 profile), but couldn't really get the smaller core small enough while still being useful for what I wanted to do with it.
While a moderately smaller core is possible by using a single-issue integer-only design, this is rather limited...
And, sticking two more feature-limited cores on an FPGA isn't terribly useful.
Nor is going tri-core or quad-core with minimalist cores.
Say, one core of my current configuration is more useful than, say:
  Two cores that do basic Integer+FPU+TLB;
  Four cores that only do Integer.
Like, say, an RV64I or RV32IM quad-core would not necessarily be all that useful.
Trying to fit the BJX2 core on an XC7S50, I needed to drop to 2-wide in order to fit it in with the fast SIMD unit. It was a tradeoff between:
  3-wide, but 10-cycle SIMD ops;
  2-wide, with 3-cycle SIMD ops.
On the XC7S25 or XC7A35T, I have not really managed to fit much beyond simple integer cores. But these FPGAs are small enough that it is generally better to drop to 32-bit. So, for example, RV32IM is about what makes sense on an XC7S25 or XC7A35T.
Where the last number in the part name is loosely correlated with total LUT count: the XC7A100T has ~3x the LUTs of the XC7A35T. But the correlation is not exactly 1:1 between Artix and Spartan: for Spartan, the number is closer to the count in kLUTs, while Artix has slightly fewer LUTs relative to the part number; so the XC7S25 and XC7A35T are fairly comparable.
As for whether to use A-Law or FP8, if I add SIMD ops for 8-bit multiply widening to Binary16: currently FP8 seems to be ahead:
  More popular (NVIDIA is also using FP8);
  More dynamic range;
  Will be slightly cheaper to implement;
  ...
Also torn between the more expensive route:
  Trying for a 3-cycle MAC operation;
  Would likely glue it onto the low-precision SIMD unit.
Or the cheaper route:
  Trying for a 2-cycle PMUL;
  Likely via the CONV2 path.
Not likely worthwhile to put it in the 3-cycle MUL path:
  Would gain little performance-wise over converter ops;
  This path was mostly used for more complex converters:
    Index-Color Packing;
    Color-Cell Encode;
    ...
The operation logic is likely fast enough that it could be put in a 2-cycle path.
Though, trying to shove it onto the front-end of a SIMD FADD is likely pushing it.
Multiplier logic would likely be something like:

  tSgnA = valA[7];
  tSgnB = valB[7];
  // Rebias the 4-bit exponent into a 5-bit field:
  // { e[3], !e[3], e[2:0] } == e + 8
  tExpA = { valA[6], !valA[6], valA[5:3] };
  tExpB = { valB[6], !valB[6], valB[5:3] };
  tFraA = valA[2:0];
  tFraB = valB[2:0];
  tZeroA = (valA[6:0] == 7'h00);
  tZeroB = (valB[6:0] == 7'h00);

  tSgnC = tSgnA ^ tSgnB;
  tExpC0 = tExpA + tExpB + 0;  // no carry out of the fraction product
  tExpC1 = tExpA + tExpB + 1;  // fraction product carried (>= 2.0)
  tZeroC = tZeroA | tZeroB;

  // 8-bit product of the two 4-bit significands (1.fff * 1.fff),
  // done as a 64-entry lookup:
  case({tFraA, tFraB})
    6'b000_000: tFraC0 = 8'h40;  //  8 *  8
    6'b000_001: tFraC0 = 8'h48;  //  8 *  9
    ...
    6'b001_001: tFraC0 = 8'h51;  //  9 *  9
    ...
    6'b111_111: tFraC0 = 8'hE1;  // 15 * 15
  endcase

  // Normalize; the hidden bit falls off when tFraC[9:0] is taken below.
  if(tFraC0[7])
  begin
    tExpC = tExpC1;
    tFraC = { tFraC0[7:0], 3'h0 };
  end
  else
  begin
    tExpC = tExpC0;
    tFraC = { tFraC0[6:0], 4'h0 };
  end

  tValC = { tSgnC, tExpC, tFraC[9:0] };
  if(tZeroC)
    tValC = 16'h0000;
Which can most likely fit in a 2-cycle operation...
...