On 2025-02-22 10:16 a.m., EricP wrote:
BGB wrote:
On 2/21/2025 1:51 PM, EricP wrote:
Can note that the latency of carry-select adders is a little weird:
16/32/64: Latency goes up steadily;
But, still less than linear;
128-bit: Only slightly more latency than 64-bit.
>
The best I could find in past testing was seemingly 16-bit chunks for normal adding, where 16 bits seemed to be around the break-even point between the chained CARRY4's and the carry-select (carry-select being slower below 16 bits).
>
But, for a 64-bit adder, one still basically needs to give it a clock-cycle to do its thing. Though it's not like 32-bit is particularly fast either; hence part of the whole 2-cycle latency on ALU ops. Mostly this has to do with ADD/SUB (and CMP, which is based on SUB).
>
>
Admittedly, this is part of why I have such mixed feelings on full compare-and-branch:
Pro: It can offer a performance advantage (in terms of per-clock);
Con: Branch is now beholden to the latency of a Subtract.
IIRC your cpu clock speed is about 75 MHz (13.3 ns)
and you are saying it takes 2 clocks for a 64-bit ADD.
>
The 75MHz was mostly experimental; generally I am running at 50MHz because it is easier (a whole lot of corners need to be cut for 75MHz, so overall performance often ended up being worse).
>
>
Via the main ALU, which also shares the logic for SUB and CMP and similar...
>
Generally, I give more or less a full cycle for the ADD to do its thing, with the result presented to the outside world on the second cycle, where it can go through the register forwarding chains and similar.
>
This gives it a 2 cycle latency.
>
Operations with a 1 cycle latency need to feed their output directly into the register forwarding logic.
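In rough, generic terms (this is not BGB's actual pipeline, just a sketch of the distinction, with invented signal names):

// Cycle 1: the adder gets essentially the whole cycle; result is registered.
reg [63:0] aluRes;
always @(posedge clk)
    aluRes <= valA + valB;

// Cycle 2: the registered result enters the forwarding muxes, so a
// dependent instruction sees it one cycle later (2-cycle latency).
// A 1-cycle op would have to drive the forwarding mux combinationally.
wire [63:0] fwdVal = fwdFromAlu ? aluRes : regfileVal;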
>
>
In a pseudocode sense, something like:
tValB = IsSUB ? ~valB : valB;
tCarryIn = IsSUB; // carry-in of 1 completes the ~B+1 needed for subtract
tAddA0={ 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 0;
tAddA1={ 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 1;
tAddB0={ 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 0;
tAddB1={ 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 1;
tAddC0=...
...
tAddSbA = tCarryIn;
tAddSbB = tAddSbA ? tAddA1[16] : tAddA0[16];
tAddSbC = tAddSbB ? tAddB1[16] : tAddB0[16];
...
tAddRes = {
tAddSbD ? tAddD1[15:0] : tAddD0[15:0],
tAddSbC ? tAddC1[15:0] : tAddC0[15:0],
tAddSbB ? tAddB1[15:0] : tAddB0[15:0],
tAddSbA ? tAddA1[15:0] : tAddA0[15:0]
};
>
>
This works, but ideally one still needs to give it a full clock-cycle to do its work.
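Filled out into a self-contained form, the above amounts to roughly the following (the module and signal names here are made up, and the SUB carry-in handling is the usual ~B+1 assumption rather than anything confirmed about BGB's core):

module csel_add64(
    input  [63:0] valA,
    input  [63:0] valB,
    input         isSub,     // 1 = subtract (A - B)
    output [63:0] result
);
    // Invert B for subtract; the +1 arrives via the carry into the low chunk.
    wire [63:0] tValB    = isSub ? ~valB : valB;
    wire        tCarryIn = isSub;

    // Each 16-bit chunk is summed twice: once assuming carry-in = 0,
    // once assuming carry-in = 1. Bit 16 of each is the chunk's carry-out.
    wire [16:0] a0 = {1'b0, valA[15:0] } + {1'b0, tValB[15:0] };
    wire [16:0] a1 = {1'b0, valA[15:0] } + {1'b0, tValB[15:0] } + 17'd1;
    wire [16:0] b0 = {1'b0, valA[31:16]} + {1'b0, tValB[31:16]};
    wire [16:0] b1 = {1'b0, valA[31:16]} + {1'b0, tValB[31:16]} + 17'd1;
    wire [16:0] c0 = {1'b0, valA[47:32]} + {1'b0, tValB[47:32]};
    wire [16:0] c1 = {1'b0, valA[47:32]} + {1'b0, tValB[47:32]} + 17'd1;
    wire [16:0] d0 = {1'b0, valA[63:48]} + {1'b0, tValB[63:48]};
    wire [16:0] d1 = {1'b0, valA[63:48]} + {1'b0, tValB[63:48]} + 17'd1;

    // Carry select: each stage picks its carry-out using the carry
    // chosen by the stage below, giving a short mux chain.
    wire sbA = tCarryIn;
    wire sbB = sbA ? a1[16] : a0[16];
    wire sbC = sbB ? b1[16] : b0[16];
    wire sbD = sbC ? c1[16] : c0[16];

    // Assemble the result from whichever precomputed sums match the carries.
    assign result = {
        sbD ? d1[15:0] : d0[15:0],
        sbC ? c1[15:0] : c0[15:0],
        sbB ? b1[15:0] : b0[15:0],
        sbA ? a1[15:0] : a0[15:0]
    };
endmodule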
>
>
>
Note that one has to be careful with logic coupling: if too many things are tied together, one may get a "routing congestion" warning message, and timing generally fails in this case...
>
Also, "inferring latch" warning is one of those "you really gotta go fix this" issues (both generally indicates Verilog bugs, and also negatively effects timing).
>
I don't remember what Xilinx chip you are using, but this paper describes
how to do a 64-bit ADD at between 350 MHz (2.8 ns) and 400 MHz (2.5 ns)
on a Virtex-5:
>
A Fast Carry Chain Adder for Virtex-5 FPGAs, 2010
https://scholar.archive.org/work/tz6fy2zm4fcobc6k7khsbwskh4/access/wayback/http://ece.gmu.edu:80/coursewebpages/ECE/ECE645/S11/projects/project_1_resources/Adders_MELECON_2010.pdf
>
As for Virtex: I am not made of money...
>
Virtex parts tend to be absurdly expensive, high-end FPGAs.
Even the older Virtex chips are still absurdly expensive.
>
>
Kintex is considered mid-range, but it is still too expensive, and it is mostly not usable in the free versions of Vivado (and there are no real viable FOSS alternatives to Vivado). When I tried looking at some of the "open source" tools for targeting Xilinx chips, they were doing the hacky thing of basically invoking Xilinx's tools in the background (which, if used to target a Kintex, is essentially piracy).
I don't think that it is copyright infringement to have a script or code
generator output drive a compiler or tool instead of your hands.
Possibly. It would be, however, to use it to sidestep Vivado's licensing to try to target a Kintex by using the tools in unorthodox ways...

Where, a valid FOSS tool would need to be able to do everything and generate the bitstream itself.
>
>
>
Mostly I am using Spartan-7 and Artix-7.
Generally at the -1 speed grade (slowest, but cheapest).
These are mostly considered low-end and consumer-electronics oriented FPGAs by Xilinx.
The second paper also covers Spartan-6, and says it has the same
LUT architecture as Virtex-5 and -6. Their speed testing was done on
Virtex-6, but the design should apply.
>
Anyway it was the concepts of how to optimize the carry that were important.
I would expect to have to write code to port the ideas.
>
You could invoke some of the LEs directly as primitives in Verilog, but then one has an ugly mess that will only work on a specific class of FPGA.
<snip>
I have a QMTech board with an XC7A200T at -1, but generally, it seems to actually have a slightly harder time passing timing constraints than the XC7A100T in the Nexys A7 (possibly some sort of Vivado magic here).
>
and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
>
Fast and Area Efficient Adder for Wide Data in Recent Xilinx FPGAs, 2016
http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
>
Errm, from a skim, this doesn't really look like something you can pull off in normal Verilog.
Well, that's what I'm trying to figure out, because it's not just this paper
but a lot, like many hundreds, of papers I've read from commercial or
academic sources that seem to be able to control the FPGA results
to a fine degree.
>
It is also possible to get higher speeds with smaller/simpler designs.

I am sure it can be done, as I have seen a lot of papers too with results in the hundreds of megahertz. It has got to be the manual placement and routing that helps; the routing in my design typically takes up about 80% of the delay. One can build circuits up out of individual primitive gates in Verilog (or(), and(), etc.), but for behavioural purposes I do not do that, instead relying on the tools to generate the best combinations of gates. It is a ton of work to do everything manually. I am happy to have things work at 40 MHz even though 200 MHz may be possible with 10x the work put into it. Typically I am running behavioural code, doing things mostly for my own edification. (I have got my memory controller working at 200 MHz, so it is possible.)

Generally, one doesn't have control over how the components hook together; one can only influence what happens based on how one writes their Verilog.
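As a toy illustration of that trade-off (the names are made up here), the same 1-bit full adder written structurally from Verilog gate primitives versus behaviourally, letting the tools choose:

// Structural: explicit gate primitives, structure chosen by hand.
module fa_structural(input a, input b, input cin, output sum, output cout);
    wire axb, g1, g2;
    xor (axb,  a,   b);    // propagate
    xor (sum,  axb, cin);
    and (g1,   a,   b);
    and (g2,   axb, cin);
    or  (cout, g1,  g2);
endmodule

// Behavioural: one line, the tools pick the gates/LUTs.
module fa_behavioural(input a, input b, input cin, output sum, output cout);
    assign {cout, sum} = a + b + cin;
endmodule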
That paper mentions in section III
"In order to reduce uncontrollable routing delays in the comparisons,
everything was manually placed, according to the floorplan in Fig. 7."
>
Is that the key - manually place things adjacent and hope the
wire router does the right thing?
>
That sounds too flaky. You need to be able to reliably construct optimized
modules and then attach to them.
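For what it's worth, the "manual" flow in those papers seems to come down to instantiating the slice primitives directly and attaching placement attributes, roughly along these lines (a sketch only; the CARRY4 port names are as I recall them from the 7-series libraries guide, the surrounding signals are not declared, and the RLOC coordinate is a made-up placeholder):

// One 4-bit segment of a carry chain, instantiated directly and given
// a relative location so consecutive segments stack in adjacent slices.
(* RLOC = "X0Y0" *)
CARRY4 carry_seg0 (
    .CO    (co[3:0]),    // per-bit carry outs
    .O     (sum[3:0]),   // per-bit sums (S xor carry-in)
    .CI    (1'b0),       // cascaded carry-in (from the previous segment)
    .CYINIT(cin),        // carry init, used only on the first segment
    .DI    (a[3:0]),     // "generate" data inputs
    .S     (p[3:0])      // "propagate" selects, e.g. p = a ^ b
);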
You can just write:

reg [63:0] tValA;
reg [63:0] tValB;
reg [63:0] tValC;

always @*
    tValC = tValA + tValB;
>
>
But, then it spits out something with a chain of 16 CARRY4's, so there is a fairly high latency on the high order bits of the result.
>
>
Generally, Vivado synthesis seems to mostly be happy (at 50 MHz) if the total logic path length stays under around 12 levels or so. Paths with 15 or more levels are often near the edge of failing timing.
>
At 75MHz, one has to battle with pretty much anything much over 8.
>
>
And, at 200MHz, you can have path lengths of 2 that are failing...
Like, it seemingly can't do much more than "FF -> LUT -> FF" at these speeds.
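At that point, about the only lever left in plain Verilog is pipelining: split the add across two cycles so each stage is just a short carry chain in front of a register. A rough sketch (invented names, not any particular core's datapath):

// Stage 1: add the low 32 bits; register the high operands alongside.
reg [32:0] loSum;               // low sum plus its carry-out in bit 32
reg [31:0] hiA, hiB;
always @(posedge clk) begin
    loSum <= {1'b0, valA[31:0]} + {1'b0, valB[31:0]};
    hiA   <= valA[63:32];
    hiB   <= valB[63:32];
end

// Stage 2: add the high 32 bits with the registered carry; 2-cycle latency.
reg [63:0] sumOut;
always @(posedge clk)
    sumOut <= {hiA + hiB + loSum[32], loSum[31:0]};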
This can't just be left to the random luck of the wire router.
There must be something else that these commercial and academic users
are able to do to reliably optimize their design.
Maybe it's a tool only available to big-bucks customers.
>
This has me curious. I'm going to keep looking around.
>
>
One thing that I have found that helps is to use smaller modules and tasks for repetitive code where possible. The tools seem to put together a faster design if everything is in smaller modules. I ponder it may have to do with making place and route easier.
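As a trivial made-up example, pulling a repeated piece of logic out into its own little module and instantiating it, rather than open-coding it several times inside one big always block:

// Small, self-contained unit; each instance can be placed and routed
// more or less independently of the big parent module.
module bswap16(input [15:0] din, output [15:0] dout);
    assign dout = {din[7:0], din[15:8]};
endmodule

// ... and in the larger module:
bswap16 swapLo(.din(busIn[15:0]),  .dout(busOut[15:0]));
bswap16 swapHi(.din(busIn[31:16]), .dout(busOut[31:16]));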