On 2025-02-22 10:16 a.m., EricP wrote:
BGB wrote:
On 2/21/2025 1:51 PM, EricP wrote:
Can note that the latency of carry-select adders is a little weird:
16/32/64: Latency goes up steadily;
But, still less than linear;
128-bit: Only slightly more latency than 64-bit.
>
The best I could find in past testing was seemingly 16-bit chunks for normal adding, where 16 bits seemed to be around the break-even point between the chained CARRY4's and the carry-select (carry-select being slower below 16 bits).
>
But, for a 64-bit adder, one still basically needs to give it a clock-cycle to do its thing. Though it's not like 32-bit is particularly fast either; hence part of the whole 2-cycle latency on ALU ops. Mostly this has to do with ADD/SUB (and CMP, which is based on SUB).
>
>
Admittedly, this is part of why I have such mixed feelings on full compare-and-branch:
Pro: It can offer a performance advantage (in terms of per-clock);
Con: Branch is now beholden to the latency of a Subtract.
IIRC your cpu clock speed is about 75 MHz (13.3 ns)
and you are saying it takes 2 clocks for a 64-bit ADD.
>
The 75MHz was mostly experimental; generally I am running at 50MHz because it is easier (a whole lot of corners need to be cut for 75MHz, so overall performance often ended up being worse).
>
>
Via the main ALU, which also shares the logic for SUB and CMP and similar...
>
Generally, I give more or less a full cycle for the ADD to do its thing, with the result presented to the outside world on the second cycle, where it can go through the register forwarding chains and similar.
>
This gives it a 2 cycle latency.
>
Operations with a 1 cycle latency need to feed their output directly into the register forwarding logic.
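In rough, generic terms (this is not BGB's actual pipeline, just a sketch of the distinction, with invented signal names):

// Cycle 1: the adder gets essentially the whole cycle; result is registered.
reg [63:0] aluRes;
always @(posedge clk)
    aluRes <= valA + valB;

// Cycle 2: the registered result enters the forwarding muxes, so a
// dependent instruction sees it one cycle later (2-cycle latency).
// A 1-cycle op would have to drive the forwarding mux combinationally.
wire [63:0] fwdVal = fwdFromAlu ? aluRes : regfileVal;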
>
>
In a pseudocode sense, something like:
tValB = IsSUB ? ~valB : valB;
tCarryIn = IsSUB; // carry-in of 1 completes the ~B+1 needed for subtract
tAddA0={ 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 0;
tAddA1={ 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 1;
tAddB0={ 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 0;
tAddB1={ 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 1;
tAddC0=...
...
tAddSbA = tCarryIn;
tAddSbB = tAddSbA ? tAddA1[16] : tAddA0[16];
tAddSbC = tAddSbB ? tAddB1[16] : tAddB0[16];
...
tAddRes = {
tAddSbD ? tAddD1[15:0] : tAddD0[15:0],
tAddSbC ? tAddC1[15:0] : tAddC0[15:0],
tAddSbB ? tAddB1[15:0] : tAddB0[15:0],
tAddSbA ? tAddA1[15:0] : tAddA0[15:0]
};
>
>
This works, but ideally one still needs to give it a full clock-cycle to do its work.
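Filled out into a self-contained form, the above amounts to roughly the following (the module and signal names here are made up, and the SUB carry-in handling is the usual ~B+1 assumption rather than anything confirmed about BGB's core):

module csel_add64(
    input  [63:0] valA,
    input  [63:0] valB,
    input         isSub,     // 1 = subtract (A - B)
    output [63:0] result
);
    // Invert B for subtract; the +1 arrives via the carry into the low chunk.
    wire [63:0] tValB    = isSub ? ~valB : valB;
    wire        tCarryIn = isSub;

    // Each 16-bit chunk is summed twice: once assuming carry-in = 0,
    // once assuming carry-in = 1. Bit 16 of each is the chunk's carry-out.
    wire [16:0] a0 = {1'b0, valA[15:0] } + {1'b0, tValB[15:0] };
    wire [16:0] a1 = {1'b0, valA[15:0] } + {1'b0, tValB[15:0] } + 17'd1;
    wire [16:0] b0 = {1'b0, valA[31:16]} + {1'b0, tValB[31:16]};
    wire [16:0] b1 = {1'b0, valA[31:16]} + {1'b0, tValB[31:16]} + 17'd1;
    wire [16:0] c0 = {1'b0, valA[47:32]} + {1'b0, tValB[47:32]};
    wire [16:0] c1 = {1'b0, valA[47:32]} + {1'b0, tValB[47:32]} + 17'd1;
    wire [16:0] d0 = {1'b0, valA[63:48]} + {1'b0, tValB[63:48]};
    wire [16:0] d1 = {1'b0, valA[63:48]} + {1'b0, tValB[63:48]} + 17'd1;

    // Carry select: each stage picks its carry-out using the carry
    // chosen by the stage below, giving a short mux chain.
    wire sbA = tCarryIn;
    wire sbB = sbA ? a1[16] : a0[16];
    wire sbC = sbB ? b1[16] : b0[16];
    wire sbD = sbC ? c1[16] : c0[16];

    // Assemble the result from whichever precomputed sums match the carries.
    assign result = {
        sbD ? d1[15:0] : d0[15:0],
        sbC ? c1[15:0] : c0[15:0],
        sbB ? b1[15:0] : b0[15:0],
        sbA ? a1[15:0] : a0[15:0]
    };
endmodule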
>
>
>
Note that one has to be careful with logic coupling: if too many things are tied together, one may get a "routing congestion" warning message, and timing generally fails in this case...
>
Also, "inferring latch" warning is one of those "you really gotta go fix this" issues (both generally indicates Verilog bugs, and also negatively effects timing).
>
I don't remember what Xilinx chip you are using, but this paper describes
how to do a 64-bit ADD at between 350 MHz (2.8 ns) and 400 MHz (2.5 ns)
on a Virtex-5:
>
A Fast Carry Chain Adder for Virtex-5 FPGAs, 2010
https://scholar.archive.org/work/tz6fy2zm4fcobc6k7khsbwskh4/access/wayback/http://ece.gmu.edu:80/coursewebpages/ECE/ECE645/S11/projects/project_1_resources/Adders_MELECON_2010.pdf
>
As for Virtex: I am not made of money...
>
Virtex parts tend to be absurdly expensive, high-end FPGAs.
Even the older Virtex chips are still absurdly expensive.
>
>
Kintex is considered mid-range, but it is still too expensive, and it is mostly not usable in the free versions of Vivado (and there are no real viable FOSS alternatives to Vivado). When I tried looking at some of the "open source" tools for targeting Xilinx chips, they were doing the hacky thing of basically invoking Xilinx's tools in the background (which, if used to target a Kintex, is essentially piracy).
I don't think that it is copyright infringement to have a script or code
generator output drive a compiler or tool instead of your hands.
Possibly. It would be, however, to use it to sidestep Vivado's licensing to try to target a Kintex by using the tools in unorthodox ways...

Where, a valid FOSS tool would need to be able to do everything and generate the bitstream itself.
>
>
>
Mostly I am using Spartan-7 and Artix-7.
Generally at the -1 speed grade (slowest, but cheapest).
These are mostly considered low-end and consumer-electronics oriented FPGAs by Xilinx.
The second paper also covers Spartan-6, and says it has the same
LUT architecture as Virtex-5 and -6. Their speed testing was done on
Virtex-6, but the design should apply.
>
Anyway it was the concepts of how to optimize the carry that were important.
I would expect to have to write code to port the ideas.
>
You could invoke some of the LEs directly as primitives in Verilog, but then one has an ugly mess that will only work on a specific class of FPGA.
<snip>
I have a QMTech board with an XC7A200T at -1, but generally, it seems to actually have a slightly harder time passing timing constraints than the XC7A100T in the Nexys A7 (possibly some sort of Vivado magic here).
>
and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
>
Fast and Area Efficient Adder for Wide Data in Recent Xilinx FPGAs, 2016
http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
>
Errm, from a skim, this doesn't really look like something you can pull off in normal Verilog.
Well, that's what I'm trying to figure out, because it's not just this paper
but a lot, like many hundreds, of papers I've read from commercial or
academic sources that seem to be able to control the FPGA results
to a fine degree.
>
It is also possible to get higher speeds with smaller/simpler designs.

I am sure it can be done, as I have seen a lot of papers too with results in the hundreds of megahertz. It has got to be the manual placement and routing that helps; the routing in my design typically takes up about 80% of the delay. One can build circuits up out of individual primitive gates in Verilog (or(), and(), etc.), but for behavioural purposes I do not do that, instead relying on the tools to generate the best combinations of gates. It is a ton of work to do everything manually. I am happy to have things work at 40 MHz even though 200 MHz may be possible with 10x the work put into it. Typically I am running behavioural code, doing things mostly for my own edification. (I have got my memory controller working at 200 MHz, so it is possible.)

Generally, one doesn't have control over how the components hook together; one can only influence what happens based on how one writes their Verilog.
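As a toy illustration of that trade-off (the names are made up here), the same 1-bit full adder written structurally from Verilog gate primitives versus behaviourally, letting the tools choose:

// Structural: explicit gate primitives, structure chosen by hand.
module fa_structural(input a, input b, input cin, output sum, output cout);
    wire axb, g1, g2;
    xor (axb,  a,   b);    // propagate
    xor (sum,  axb, cin);
    and (g1,   a,   b);
    and (g2,   axb, cin);
    or  (cout, g1,  g2);
endmodule

// Behavioural: one line, the tools pick the gates/LUTs.
module fa_behavioural(input a, input b, input cin, output sum, output cout);
    assign {cout, sum} = a + b + cin;
endmodule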
That paper mentions in section III
"In order to reduce uncontrollable routing delays in the comparisons,
everything was manually placed, according to the floorplan in Fig. 7."
>
Is that the key - manually place things adjacent and hope the
wire router does the right thing?
>
That sounds too flaky. You need to be able to reliably construct optimized
modules and then attach to them.
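For what it's worth, the "manual" flow in those papers seems to come down to instantiating the slice primitives directly and attaching placement attributes, roughly along these lines (a sketch only; the CARRY4 port names are as I recall them from the 7-series libraries guide, the surrounding signals are not declared, and the RLOC coordinate is a made-up placeholder):

// One 4-bit segment of a carry chain, instantiated directly and given
// a relative location so consecutive segments stack in adjacent slices.
(* RLOC = "X0Y0" *)
CARRY4 carry_seg0 (
    .CO    (co[3:0]),    // per-bit carry outs
    .O     (sum[3:0]),   // per-bit sums (S xor carry-in)
    .CI    (1'b0),       // cascaded carry-in (from the previous segment)
    .CYINIT(cin),        // carry init, used only on the first segment
    .DI    (a[3:0]),     // "generate" data inputs
    .S     (p[3:0])      // "propagate" selects, e.g. p = a ^ b
);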
You can just write:

reg [63:0] tValA;
reg [63:0] tValB;
reg [63:0] tValC;

always @*
    tValC = tValA + tValB;
>
>
But, then it spits out something with a chain of 16 CARRY4's, so there is a fairly high latency on the high order bits of the result.
>
>
Generally, Vivado synthesis seems to mostly be happy (at 50 MHz) if the total logic path length stays under around 12 levels or so. Paths with 15 or more levels are often near the edge of failing timing.
>
At 75MHz, one has to battle with pretty much anything much over 8.
>
>
And, at 200MHz, you can have path lengths of 2 that are failing...
Like, it seemingly can't do much more than "FF -> LUT -> FF" at these speeds.
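At that point, about the only lever left in plain Verilog is pipelining: split the add across two cycles so each stage is just a short carry chain in front of a register. A rough sketch (invented names, not any particular core's datapath):

// Stage 1: add the low 32 bits; register the high operands alongside.
reg [32:0] loSum;               // low sum plus its carry-out in bit 32
reg [31:0] hiA, hiB;
always @(posedge clk) begin
    loSum <= {1'b0, valA[31:0]} + {1'b0, valB[31:0]};
    hiA   <= valA[63:32];
    hiB   <= valB[63:32];
end

// Stage 2: add the high 32 bits with the registered carry; 2-cycle latency.
reg [63:0] sumOut;
always @(posedge clk)
    sumOut <= {hiA + hiB + loSum[32], loSum[31:0]};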
This can't just be left to the random luck of the wire router.
There must be something else that these commercial and academic users
are able to do to reliably optimize their design.
Maybe it's a tool only available to big-bucks customers.
>
This has me curious. I'm going to keep looking around.
>
>
One thing that I have found that helps is to use smaller modules and tasks for repetitive code where possible. The tools seem to put together a faster design if everything is in smaller modules. I ponder it may have to do with making place and route easier.
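As a trivial made-up example, pulling a repeated piece of logic out into its own little module and instantiating it, rather than open-coding it several times inside one big always block:

// Small, self-contained unit; each instance can be placed and routed
// more or less independently of the big parent module.
module bswap16(input [15:0] din, output [15:0] dout);
    assign dout = {din[7:0], din[15:8]};
endmodule

// ... and in the larger module:
bswap16 swapLo(.din(busIn[15:0]),  .dout(busOut[15:0]));
bswap16 swapHi(.din(busIn[31:16]), .dout(busOut[31:16]));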