On 8/30/2024 7:11 PM, Paul A. Clayton wrote:
You are falling for the VLIW thought train trap...
On 8/28/24 11:36 PM, BGB wrote:
> On 8/28/2024 11:40 AM, MitchAlsup1 wrote:
[snip]
> My 1-wide machine does ENTER and EXIT at 4 registers per cycle.
>
Try doing 4 LDs or 4 STs per cycle on a 1-wide machine.
>
It likely isn't going to happen because a 1-wide machine isn't going
to have the needed register ports.
For an in-order implementation, banking could be used for saving
a contiguous range of registers with no bank conflicts.
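As a rough illustration of the banking idea, the sketch below assumes registers are interleaved across banks by register number modulo the bank count. The post does not specify a mapping, so this is an assumption, and the function names and the 4-bank default are hypothetical. Under that mapping, any contiguous run of up to `num_banks` registers lands in distinct banks, so a 4-register save can proceed without conflicts.

```python
def banks_touched(first_reg, count, num_banks=4):
    """Bank index hit by each register in a contiguous run (assumed mapping: reg % num_banks)."""
    return [(first_reg + i) % num_banks for i in range(count)]


def conflict_free(first_reg, count, num_banks=4):
    """True if no two registers in the run map to the same bank."""
    banks = banks_touched(first_reg, count, num_banks)
    return len(set(banks)) == len(banks)


# Every 4-register run in a 32-entry file is conflict-free with 4 banks,
# regardless of where the run starts.
print(all(conflict_free(r, 4) for r in range(32 - 3)))
# A 5-register run must reuse a bank.
print(conflict_free(0, 5))
```

The same check fails for non-contiguous register lists (for example R0 and R4 with 4 banks), which is why the contiguous-range restriction matters.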
>
Mitch Alsup chose to provide four read/write ports with the
typical use being three read, one write instructions. This not
only facilitates faster register save/restore for function calls
(and context switches/interrupts) but presents the opportunity of
limited dual issue ("CoIssue").
>
I was mostly doing dual-issue with a 4R2W design.
>
Initially, 6R3W won out mostly because 4R2W does not allow an indexed store
to run in parallel with another op, whereas 6R3W does. This scenario made
enough of a difference to seemingly justify the added cost of a 3-wide
design with a 3rd lane that goes mostly unused (and is mostly limited to
register MOVs and basic ALU ops and similar).
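One way to see why the indexed store is the deciding case is to count register read ports. The sketch below assumes an indexed store reads three registers (base, index, and the value to store) and that the op paired with it is a two-source ALU op. The per-op read counts and the pooled port budget are illustrative assumptions, not a description of the actual BJX2 pipeline.

```python
# Register reads per op (assumed for illustration only).
READS_PER_OP = {
    "alu_rr": 2,         # two-source ALU op
    "store_indexed": 3,  # base + index + value to store
}


def fits_read_ports(bundle, read_ports):
    """True if the ops in the bundle need no more reads than are available."""
    return sum(READS_PER_OP[op] for op in bundle) <= read_ports


bundle = ["store_indexed", "alu_rr"]
print(fits_read_ports(bundle, 4))  # 4R2W: 5 reads needed, prints False
print(fits_read_ports(bundle, 6))  # 6R3W: 5 reads needed, prints True
```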
>
>
But, then this leads to an annoyance:
As is, I will need to generate different code for 1W, 2W, and 3W
configurations;
It is starting to become tempting to generate code resembling that for
the 1W case (albeit still using the shuffling that would be used when
bundling), and then use superscalar (since, it turns out, it is not
quite as expensive as I had thought).
With superscalar, I wouldn't have the issue of 2W and 3W cores having
issues running code built for the other.
Such is the advantage of configurable register file ports.
Also, on both 2W and 3W configurations, I can have a 128-bit MOV.X
(load/store pair) instruction, so if one assumes 2-wide as the minimum,
this instruction can be safely assumed to exist.
VLIW trap again.
I can mostly ignore 1-wide scenarios (2R1W and 3R1W), as I have
ended up mostly deciding to relegate these to RISC-V.
Tisc..
By the time I have stripped down BJX2 enough to fit into a small FPGA,
it essentially has almost nothing to offer that RV wouldn't offer
already (and it makes more practical sense to use something like RV32IM
or similar).
An accurate but slight underestimate.
>
>
>
I am not sure how one would efficiently pull off a 4W write operation.
>
>
>
Can note that generally, the GPR part of the register file can be built
with LUTRAMs, which on Xilinx chips have the property:
1R1W, 5-bit addr, 3-bit data; comb read, clock-edge write.
1R1W, 6-bit addr, 2-bit data; comb read, clock-edge write.
>
>
This means the number of LUTRAMs needed for an N-read, M-write (NRMW)
register file with G registers can be calculated:
2R1W, 32, Cost=44
3R1W, 32, Cost=66
4R2W, 32, Cost=176
6R3W, 32, Cost=396
4R4W, 32, Cost=352
6R4W, 32, Cost=528
>
2R1W, 64, Cost=64
3R1W, 64, Cost=96
4R2W, 64, Cost=256
6R3W, 64, Cost=576
4R4W, 64, Cost=512
6R4W, 64, Cost=768
>
10R5W, 64, cost=1600.
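The figures in the list above are consistent with a simple replication model, sketched below as a cross-check. It assumes 64-bit registers, one 1R1W LUTRAM array per (read port, write port) pair, and the data widths quoted earlier (3 bits per LUTRAM at 32 entries, 2 bits at 64 entries). The formula is inferred from the numbers rather than stated in the post, and `lutram_cost` is a hypothetical helper name. It counts only the storage arrays, not any logic for selecting between write-port copies.

```python
import math


def lutram_cost(reads, writes, num_regs, reg_bits=64):
    """Estimate the LUTRAM count for a <reads>R<writes>W register file.

    Assumed model: reads * writes copies of a 1R1W array, each built from
    ceil(reg_bits / data_bits) LUTRAMs.
    """
    if num_regs <= 32:
        data_bits = 3   # 5-bit address, 3-bit data
    elif num_regs <= 64:
        data_bits = 2   # 6-bit address, 2-bit data
    else:
        raise ValueError("only up to 64 registers are modelled here")
    luts_per_copy = math.ceil(reg_bits / data_bits)
    return reads * writes * luts_per_copy


for num_regs in (32, 64):
    for reads, writes in [(2, 1), (3, 1), (4, 2), (6, 3), (4, 4), (6, 4)]:
        cost = lutram_cost(reads, writes, num_regs)
        print(f"{reads}R{writes}W, {num_regs}, Cost={cost}")
print(f"10R5W, 64, Cost={lutram_cost(10, 5, 64)}")
```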
>
I am not sure about ASIC.
Depends on who implemented the SRAM and RF technology.
For FPGA, pretty sure that bidirectional ports would gain little or
nothing over fixed-direction ports (since bidirectional IO is not a
thing, and the internal logic is almost entirely different between a
read and write port).
It is even easier when you have access to individual transistors