BGB wrote:
I am sure it can be done, as I have also seen a lot of papers with results in the hundreds of megahertz. It has got to be the manual placement and routing that helps. The routing in my design typically takes up about 80% of the delay. One can build circuits up out of individual primitive gates in Verilog (or(), and(), etc.), but for behavioral purposes I do not do that, instead relying on the tools to generate the best combinations of gates. It is a ton of work to do everything manually. I am happy to have things work at 40 MHz even though 200 MHz may be possible with 10x the work put into it. I am typically running behavioural code, and doing things mostly for my own edification. (I have got my memory controller working at 200 MHz, so it is possible.)
On 2/21/2025 1:51 PM, EricP wrote:
I don't think that it is copyright infringement to have a script or code
BGB wrote:
>>>
Can note that the latency of carry-select adders is a little weird:
16/32/64: Latency goes up steadily;
But, still less than linear;
128-bit: Only slightly more latency than 64-bit.
>
The best I could find in past testing was seemingly 16-bit chunks for normal adding. Where, 16-bits seemed to be around the break-even between the chained CARRY4's and the Carry-Select (CS being slower below 16 bits).
>
But, for a 64-bit adder, still basically need to give it a clock cycle to do its thing. Though, not like 32 is particularly fast either; hence part of the whole 2 cycle latency on ALU ops thing. Mostly has to do with ADD/SUB (and CMP, which is based on SUB).
>
>
Admittedly part of why I have such mixed feelings on full compare-and-branch:
Pro: It can offer a performance advantage (in terms of per-clock);
Con: Branch is now beholden to the latency of a Subtract.
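To illustrate why the branch ends up beholden to the subtract, here is a small behavioural model in Python (not HDL, and not the actual ALU discussed here) of a 64-bit compare done through the adder path as A + ~B + 1. The names sub_flags and branch_taken and the condition strings are made up for this sketch. Every condition below needs the full-width result or its carry-out, so none of them can resolve before the subtract has finished.

```python
MASK64 = (1 << 64) - 1

def sub_flags(a, b):
    """Compute a - b as a + ~b + 1 on 64 bits; return (result, carry, zero, negative, overflow)."""
    a &= MASK64
    b &= MASK64
    total = a + (~b & MASK64) + 1
    result = total & MASK64
    carry = total >> 64                 # 1 means "no borrow", i.e. a >= b unsigned
    zero = int(result == 0)
    negative = result >> 63             # sign bit of the result
    # Signed overflow: a and b differ in sign, and the result sign differs from a.
    overflow = ((a ^ b) & (a ^ result)) >> 63
    return result, carry, zero, negative, overflow

def branch_taken(cond, a, b):
    """Decide a compare-and-branch from the subtract flags (hypothetical condition names)."""
    _, c, z, n, v = sub_flags(a, b)
    if cond == "eq":
        return z == 1
    if cond == "ne":
        return z == 0
    if cond == "ltu":
        return c == 0
    if cond == "geu":
        return c == 1
    if cond == "lt":
        return n != v
    if cond == "ge":
        return n == v
    raise ValueError(cond)
```

A real design could special-case EQ/NE with a cheaper XOR-and-reduce path; the sketch only shows the plain "CMP is based on SUB" arrangement.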
IIRC your cpu clock speed is about 75 MHz (13.3 ns)
and you are saying it takes 2 clocks for a 64-bit ADD.
>
The 75MHz was mostly experimental, mostly I am running at 50MHz because it is easier (a whole lot of corners need to be cut for 75MHz, so often overall performance ended up being worse).
>
>
Via the main ALU, which also shares the logic for SUB and CMP and similar...
>
Generally, I give more or less a full cycle for the ADD to do its thing, with the result presented to the outside world on the second cycle, where it can go through the register forwarding chains and similar.
>
This gives it a 2 cycle latency.
>
Operations with a 1 cycle latency need to feed their output directly into the register forwarding logic.
>
>
In a pseudocode sense, something like:
tValB = IsSUB ? ~valB : valB;
tAddA0={ 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 0;
tAddA1={ 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 1;
tAddB0={ 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 0;
tAddB1={ 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 1;
tAddC0=...
...
tAddSbA = tCarryIn;
tAddSbB = tAddSbA ? tAddA1[16] : tAddA0[16];
tAddSbC = tAddSbB ? tAddB1[16] : tAddB0[16];
...
tAddRes = {
tAddSbD ? tAddD1[15:0] : tAddD0[15:0],
tAddSbC ? tAddC1[15:0] : tAddC0[15:0],
tAddSbB ? tAddB1[15:0] : tAddB0[15:0],
tAddSbA ? tAddA1[15:0] : tAddA0[15:0]
};
>
>
This works, but still need to ideally give it a full clock-cycle to do its work.
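For anyone wanting to sanity-check the selection logic in the pseudocode above, here is a behavioural model in Python. It is only a sketch: it says nothing about timing, the function name and the width parameter are invented for the example, and it assumes tCarryIn is 1 for SUB (two's complement) and 0 for ADD, which the pseudocode does not spell out.

```python
CHUNK = 16
CHUNK_MASK = (1 << CHUNK) - 1

def carry_select_add(val_a, val_b, is_sub=False, width=64):
    """Model of a carry-select adder built from 16-bit chunks; returns (result, carry_out)."""
    mask = (1 << width) - 1
    n_chunks = width // CHUNK
    val_a &= mask
    t_val_b = (~val_b if is_sub else val_b) & mask   # tValB = IsSUB ? ~valB : valB

    # Stage 1: every chunk computes both candidate sums (carry-in 0 and carry-in 1).
    # Each candidate is 17 bits wide; bit 16 is that chunk's carry-out.
    sums0, sums1 = [], []
    for i in range(n_chunks):
        a = (val_a >> (i * CHUNK)) & CHUNK_MASK
        b = (t_val_b >> (i * CHUNK)) & CHUNK_MASK
        sums0.append(a + b)
        sums1.append(a + b + 1)

    # Stage 2: walk the select chain (tAddSbA, tAddSbB, ...) and pick one candidate per chunk.
    carry = 1 if is_sub else 0                        # assumed value of tCarryIn
    result = 0
    for i in range(n_chunks):
        chosen = sums1[i] if carry else sums0[i]
        result |= (chosen & CHUNK_MASK) << (i * CHUNK)
        carry = chosen >> CHUNK
    return result, carry
```

In hardware all the stage 1 sums are computed in parallel, so only the short select chain in stage 2 is serial; the Python loop order is just for readability.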
>
>
>
Note that one has to be careful with logic coupling, as if too many things are tied together, one may get a "routing congestion" warning message, and generally timing fails in this case...
>
Also, an "inferring latch" warning is one of those "you really gotta go fix this" issues (it both generally indicates Verilog bugs and negatively affects timing).
>
I don't remember what Xilinx chip you are using, but this paper describes
how to do a 64-bit ADD at between 350 MHz (2.8 ns) and 400 MHz (2.5 ns)
on a Virtex-5:
>
A Fast Carry Chain Adder for Virtex-5 FPGAs, 2010
https://scholar.archive.org/work/tz6fy2zm4fcobc6k7khsbwskh4/access/wayback/http://ece.gmu.edu:80/coursewebpages/ECE/ECE645/S11/projects/project_1_resources/Adders_MELECON_2010.pdf
>
As for Virtex: I am not made of money...
>
Virtex tends to be absurdly expensive high-end FPGAs.
Even the older Virtex chips are still absurdly expensive.
>
>
Kintex is considered mid-range, but still too expensive, and mostly not usable in the free versions of Vivado (and there are no real viable FOSS alternatives to Vivado). When I tried looking at some of the "open source" tools for targeting Xilinx chips, they were doing the hacky thing of basically invoking Xilinx's tools in the background (which, if used to target a Kintex, is essentially piracy).
generator output drive a compiler or tool instead of your hands.
Where, a valid FOSS tool would need to be able to do everything and generate the bitstream itself.
>
Mostly I am using Spartan-7 and Artix-7.
Generally at the -1 speed grade (slowest, but cheapest).
These are mostly considered low-end and consumer-electronics oriented FPGAs by Xilinx.
>
The second paper was also on both Spartan-6 and says it has the same LUT architecture as Virtex-5 and -6. Their speed testing was done on Virtex-6 but the design should apply.
Anyway it was the concepts of how to optimize the carry that were important.
I would expect to have to write code to port the ideas.
<snip>
I have a QMTech board with an XC7A200T at -1, but generally, it seems to actually have a slightly harder time passing timing constraints than the XC7A100T in the Nexys A7 (possibly some sort of Vivado magic here).
>
and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
>
Fast and Area Efficient Adder for Wide Data in Recent Xilinx FPGAs, 2016
http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
>
Errm, skim, this doesn't really look like something you can pull off in normal Verilog.
>
Well that's what I'm trying to figure out, because it's not just this paper but a lot, like many hundreds, of papers I've read from commercial or academic sources that seem to be able to control the FPGA results to a fine degree.
>
Generally, one doesn't have control over how the components hook together; one can only influence what happens based on how they write their Verilog.
>
That paper mentions in section III
"In order to reduce uncontrollable routing delays in the comparisons,
everything was manually placed, according to the floorplan in Fig. 7."
Is that the key - manually place things adjacent and hope the
wire router does the right thing?
That sounds too flaky. You need to be able to reliably construct optimized
modules and then attach to them.
This can't just be left to the random luck of the wire router.
>
You can just write:
reg[63:0] tValA;
reg[63:0] tValB;
reg[63:0] tValC;
tValC=tValA+tValB;
>
>
But, then it spits out something with a chain of 16 CARRY4's, so there is a fairly high latency on the high order bits of the result.
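The difference in serial depth between the two approaches can be counted with a few lines of Python. This is a stage count only, under the assumption that each CARRY4 covers 4 bits and that the carry-select version uses 16-bit chunks as in the earlier pseudocode; it deliberately assigns no nanosecond figures, since those depend on placement and routing. The function names are invented for the sketch.

```python
def carry4_chain_length(width):
    """Number of CARRY4 primitives chained in a plain ripple adder (4 bits each)."""
    return -(-width // 4)   # ceiling division

def carry_select_depth(width, chunk=16):
    """Serial stages in a carry-select adder.

    Returns (carry4_hops, select_hops): the CARRY4 chain inside one chunk
    (all chunks run in parallel), then one chained select per further chunk.
    The final result mux adds one more level on top of select_hops.
    """
    chunks = -(-width // chunk)
    return carry4_chain_length(chunk), chunks - 1

if __name__ == "__main__":
    for w in (16, 32, 64, 128):
        print(w, carry4_chain_length(w), carry_select_depth(w))
```

For 64 bits this gives a 16-long CARRY4 chain for the plain adder versus 4 CARRY4 hops plus 3 chained selects for carry-select; going to 128 bits only adds 4 more selects, which fits the observation that 128-bit is only slightly slower than 64-bit.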
>
>
Generally, Vivado synthesis seems to mostly be happy (at 50 MHz), if the total logic path length stays under around 12 or so. Paths with 15 or more are often near the edge of failing timing.
>
At 75MHz, one has to battle with pretty much anything much over 8.
>
>
And, at 200MHz, you can have path lengths of 2 that are failing...
Like, it seemingly can't do much more than "FF -> LUT -> FF" at these speeds.
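As a back-of-the-envelope check on those numbers, the clock period can be divided by the quoted path length to get a rough budget per logic level. The sketch below does only that arithmetic; it ignores setup time, clock skew and clock-to-out, and the (frequency, levels) pairs are just the approximate figures from the text above, not measurements. Function names are made up.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz

def budget_per_level_ns(freq_mhz, levels):
    """Average time available per logic level (logic plus routing), ignoring FF overheads."""
    return period_ns(freq_mhz) / levels

if __name__ == "__main__":
    # Approximate path lengths quoted in the post for each clock speed.
    for freq, levels in [(50, 12), (75, 8), (200, 2)]:
        print(f"{freq} MHz: period {period_ns(freq):.2f} ns, "
              f"about {budget_per_level_ns(freq, levels):.2f} ns per level")
```

The 50 MHz and 75 MHz cases both work out to roughly 1.7 ns per level, and the 200 MHz case to 2.5 ns per level, which is at least consistent with routing delay dominating on short paths.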
There must be something else that these commercial and academic users
are able to do to reliably optimize their design.
Maybe it's a tool only available to big bucks customers.
This has me curious. I'm going to keep looking around.