Re: Cost of handling misaligned access

Subject: Re: Cost of handling misaligned access
From: cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups: comp.arch
Date: 23 Feb 2025, 00:37:53
Organization: A noiseless patient Spider
Message-ID: <vpdn4k$6130$1@dont-email.me>
References: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
User-Agent: Mozilla Thunderbird
On 2/22/2025 1:25 PM, Robert Finch wrote:
On 2025-02-22 10:16 a.m., EricP wrote:
BGB wrote:
On 2/21/2025 1:51 PM, EricP wrote:
BGB wrote:
>
Can note that the latency of carry-select adders is a little weird:
  16/32/64: Latency goes up steadily;
    But, still less than linear;
  128-bit: Only slightly more latency than 64-bit.
>
The best I could find in past testing was seemingly 16-bit chunks for normal adding. Where, 16 bits seemed to be around the break-even between the chained CARRY4's and the Carry-Select (CS being slower below 16 bits).
>
But, for a 64-bit adder, still basically need to give it a clock-cycle to do its thing. Though, not like 32 is particularly fast either; hence part of the whole 2 cycle latency on ALU ops thing. Mostly has to do with ADD/SUB (and CMP, which is based on SUB).
>
>
Admittedly part of why I have such mixed feelings on full compare-and-branch:
  Pro: It can offer a performance advantage (in terms of per-clock);
  Con: Branch is now beholden to the latency of a Subtract.
>
IIRC your cpu clock speed is about 75 MHz (13.3 ns)
and you are saying it takes 2 clocks for a 64-bit ADD.
>
>
The 75MHz was mostly experimental, mostly I am running at 50MHz because it is easier (a whole lot of corners need to be cut for 75MHz, so often overall performance ended up being worse).
>
>
Via the main ALU, which also shares the logic for SUB and CMP and similar...
>
Generally, I give more or less a full cycle for the ADD to do its thing, with the result presented to the outside world on the second cycle, where it can go through the register forwarding chains and similar.
>
This gives it a 2 cycle latency.
>
Operations with a 1 cycle latency need to feed their output directly into the register forwarding logic.
>
>
In a pseudocode sense, something like:
  tValB = IsSUB ? ~valB : valB;
  tAddA0={ 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 0;
  tAddA1={ 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 1;
  tAddB0={ 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 0;
  tAddB1={ 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 1;
  tAddC0=...
  ...
  tAddSbA = tCarryIn;
  tAddSbB = tAddSbA ? tAddA1[16] : tAddA0[16];
  tAddSbC = tAddSbB ? tAddB1[16] : tAddB0[16];
  ...
  tAddRes = {
     tAddSbD ? tAddD1[15:0] : tAddD0[15:0],
     tAddSbC ? tAddC1[15:0] : tAddC0[15:0],
     tAddSbB ? tAddB1[15:0] : tAddB0[15:0],
     tAddSbA ? tAddA1[15:0] : tAddA0[15:0]
  };
>
>
This works, but still need to ideally give it a full clock-cycle to do its work.
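
As a sanity check on the chunked scheme above, here is a small software model (Python rather than Verilog, so it says nothing about timing). The names `carry_select_add64` and `chunk_bits` are made up for illustration, and driving the carry-in from the subtract flag is an assumption, since the pseudocode does not show where tCarryIn comes from. Each chunk computes both candidate sums and the incoming carry picks one:

```python
MASK64 = (1 << 64) - 1

def carry_select_add64(val_a, val_b, is_sub=False, chunk_bits=16):
    """Software model of a 64-bit carry-select add/sub built from chunks."""
    chunk_mask = (1 << chunk_bits) - 1
    # SUB is modeled as A + ~B + 1.
    # Assumption: carry-in is 1 for subtract, 0 for add.
    t_val_b = (~val_b & MASK64) if is_sub else (val_b & MASK64)
    carry = 1 if is_sub else 0
    result = 0
    for shift in range(0, 64, chunk_bits):
        a = (val_a >> shift) & chunk_mask
        b = (t_val_b >> shift) & chunk_mask
        sum0 = a + b       # candidate for carry-in = 0
        sum1 = a + b + 1   # candidate for carry-in = 1
        # In hardware both candidates exist in parallel; only this
        # selection step is chained from chunk to chunk.
        picked = sum1 if carry else sum0
        result |= (picked & chunk_mask) << shift
        carry = picked >> chunk_bits   # carry-out, like bit [16] above
    return result & MASK64
```

For example, `carry_select_add64(5, 7, is_sub=True)` should match `(5 - 7) & MASK64`, same as a plain 64-bit subtract.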
>
>
>
Note that one has to be careful with logic coupling, as if too many things are tied together, one may get a "routing congestion" warning message, and generally timing fails in this case...
>
Also, the "inferring latch" warning is one of those "you really gotta go fix this" issues (it both generally indicates Verilog bugs, and also negatively affects timing).
>
>
I don't remember what Xilinx chip you are using but this paper describes
how to do a 64-bit ADD at between 350 MHz (2.8 ns) to 400 MHz (2.5 ns)
on a Virtex-5:
>
A Fast Carry Chain Adder for Virtex-5 FPGAs, 2010
https://scholar.archive.org/work/tz6fy2zm4fcobc6k7khsbwskh4/access/wayback/http://ece.gmu.edu:80/coursewebpages/ECE/ECE645/S11/projects/project_1_resources/Adders_MELECON_2010.pdf
>
>
As for Virtex: I am not made of money...
>
Virtex tends to be absurdly expensive high-end FPGAs.
  Even the older Virtex chips are still absurdly expensive.
>
>
Kintex is considered mid range, but still too expensive, and mostly not usable in the free versions of Vivado (and there are no real viable FOSS alternatives to Vivado). When I tried looking at some of the "open source" tools for targeting Xilinx chips, they were doing the hacky thing of basically invoking Xilinx's tools in the background (which, if used to target a Kintex, is essentially piracy).
>
I don't think that it is copyright infringement to have a script or code
generator output drive a compiler or tool instead of your hands.
It would be, however, to use it to sidestep Vivado's licensing to try to target a Kintex by using the tools in unorthodox ways...
As I see it, hacking the existing tools to sidestep licensing fees is essentially piracy.
Whereas, writing one's own tools is fair game, albeit provided a "clean room" strategy is used (or, basically, one party reverse engineers and documents the bitstream format, and some other party writes the tools based on that documentation). In this case, Xilinx would only be entitled to profits from the FPGA itself.
But, that said, I also have the opinion that when a user buys a piece of hardware, they are entitled to ownership over said hardware. OEM restrictions on the use of said hardware (outside of copyright on any software running on said hardware) are invalid as far as I am concerned.
Similarly selling devices as "loss leaders" with the intent to regain profits via advertising or selling services is also not really defensible (and any losses due to customers circumventing the hardware, are the fault of the seller, not of the customer).
Well, even as much as companies selling things like cellphones and game-consoles would try to disagree (seeing selling the hardware at a loss to make it up in licensed game sales or similar as a business strategy).
Though, this does still leave things like firmware as a gray area. But, realistically, since the firmware is coupled to the hardware, then "sale" would also imply the right of the users to treat any dealings with the firmware as if it were part of the hardware (with the main exception if the user separates the firmware from the hardware, in which case copyright would apply; say, if ripping the ROMs and uploading them to the internet).
This would not apply to Vivado though, which is more solidly in the "software" camp (and the bare FPGA is pretty solidly in the "hardware" camp).

>
Where, a valid FOSS tool would need to be able to do everything and generate the bitstream itself.
>
>
>
Mostly I am using Spartan-7 and Artix-7.
  Generally at the -1 speed grade (slowest, but cheapest).
>
The second paper was also on Spartan-6 and says it has the same
LUT architecture as Virtex-5 and -6. Their speed testing was done on
Virtex-6 but the design should apply.
>
Anyway it was the concepts of how to optimize the carry that were important.
I would expect to have to write code to port the ideas.
>
Possibly.

These are mostly considered low-end and consumer-electronics oriented FPGAs by Xilinx.
>
<snip>
>
I have a QMTech board with an XC7A200T at -1, but generally, it seems to actually have a slightly harder time passing timing constraints than the XC7A100T in the Nexys A7 (possibly some sort of Vivado magic here).
>
>
and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
>
Fast and Area Efficient Adder for Wide Data in Recent Xilinx FPGAs, 2016
http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
>
>
Errm, skim, this doesn't really look like something you can pull off in normal Verilog.
>
Well that's what I'm trying to figure out because it's not just this paper
but a lot, like many hundreds, of papers I've read from commercial or
academic sources that seem to be able to control the FPGA results
to a fine degree.
>
You could invoke some of the LE's directly as primitives in Verilog, but then one has an ugly mess that will only work on a specific class of FPGA.
Generally though, one has access in terms of said primitives, rather than control over the logic block.
Vs, say, code that will work with Verilator, Vivado, and Quartus, without needing to be entirely rewritten for each.
Though, that said, my design might still need some reworking to be "effective" with Quartus or Altera hardware; or to use the available hardware.
Say, rather than like on a Spartan or Artix (pure FPGA), the Cyclone FPGA's tend to include ARM hard processors, with the FPGA and ARM cores able to communicate over a bus. The FPGA part of the DE10 apparently has its own RAM chip, but it is SDRAM (rather than DDR2 or DDR3 like in a lot of the Xilinx based boards).
Well, apart from some low-end boards which use QSPI SRAMs (though, having looked, a lot of these RAMs are DRAM internally, but the RAM module has its own RAM refresh logic).

Generally, one doesn't have control over how the components hook together; one can only influence what happens based on how they write their Verilog.
>
That paper mentions in section III
"In order to reduce uncontrollable routing delays in the comparisons,
everything was manually placed, according to the floorplan in Fig. 7."
>
Is that the key - manually place things adjacent and hope the
wire router does the right thing?
>
That sounds too flaky. You need to be able to reliably construct optimized
modules and then attach to them.
>
You can just write:
  reg[63:0] tValA;
  reg[63:0] tValB;
  reg[63:0] tValC;
  tValC=tValA+tValB;
>
>
But, then it spits out something with a chain of 16 CARRY4's, so there is a fairly high latency on the high order bits of the result.
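
To put rough numbers on the chain lengths being compared, a small counting sketch follows. It only counts stages in series, not nanoseconds, and the function names are hypothetical. It assumes 4 bits per CARRY4 and the 16-bit chunk layout from the pseudocode earlier in the post, with one select mux per chunk boundary plus the final result mux:

```python
def ripple_stages(width_bits):
    """CARRY4 blocks in series for a plain ripple chain (4 bits each)."""
    return width_bits // 4

def carry_select_stages(width_bits, chunk_bits=16):
    """Return (CARRY4 blocks in series per chunk, select muxes in series)."""
    chunks = width_bits // chunk_bits
    carry4_in_series = chunk_bits // 4
    # One carry-select mux per chunk boundary, plus the final result mux.
    muxes_in_series = (chunks - 1) + 1
    return carry4_in_series, muxes_in_series

print(ripple_stages(64))         # plain 64-bit add
print(carry_select_stages(64))   # 16-bit chunks, 64-bit
print(carry_select_stages(128))  # 16-bit chunks, 128-bit
```

In this model the CARRY4 depth per chunk stays fixed as the adder gets wider and only the select chain grows. How that maps to real delay depends on how synthesis packs the muxes into LUTs and on routing, which this does not model.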
>
>
Generally, Vivado synthesis seems to mostly be happy (at 50 MHz), if the total logic path length stays under around 12 or so. Paths with 15 or more are often near the edge of failing timing.
>
At 75MHz, one has to battle with pretty much anything much over 8.
>
>
And, at 200MHz, you have path lengths of 2 that are failing...
Like, it seemingly can't do much more than "FF -> LUT -> FF" at these speeds.
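
The rough budget implied by those figures can be worked out directly. This sketch treats "path length" as the number of logic levels on a path (an assumption about what is meant) and simply divides the clock period by that count, so each result lumps logic and net delay together. The function names are made up for illustration:

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz

def ns_per_level(freq_mhz, levels):
    """Average time available per logic level, routing included."""
    return period_ns(freq_mhz) / levels

# (frequency in MHz, approximate path length where timing gets tight)
for freq_mhz, levels in [(50, 12), (75, 8), (200, 2)]:
    print(f"{freq_mhz} MHz: {period_ns(freq_mhz):.2f} ns period, "
          f"{ns_per_level(freq_mhz, levels):.2f} ns per level")
```

That works out to roughly 1.7 ns per level at both 50 and 75 MHz, and 2.5 ns per level at 200 MHz.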
>
This can't just be left to the random luck of the wire router.
There must be something else that these commercial and academic users
are able to do to reliably optimize their design.
Maybe it's a tool only available to big bucks customers.
>
This has me curious. I'm going to keep looking around.
>
>
I am sure it can be done as I have seen a lot of papers too with results in the hundreds of megahertz. It has got to be the manual placement and routing that helps. The routing in my design typically takes up about 80% of the delay. One can build circuits up out of individual primitive gates in Verilog (or(), and(), etc) but for behavioral purposes I do not do that, instead relying on the tools to generate the best combinations of gates. It is a ton of work to do everything manually. I am happy to have things work at 40 MHz even though 200 MHz may be possible with 10x the work put into it. Typically running behavioural code. Doing things mostly for my own edification. (I have got my memory controller working at 200 MHz, so it is possible.)
One thing that I have found that helps is to use smaller modules and tasks for repetitive code where possible. The tools seem to put together a faster design if everything is smaller modules. I ponder it may have to do with making place and route easier.
 
It is also possible to get higher speeds with smaller/simple designs.
But, yeah, also I can note in Vivado, that the timing does tend to be dominated more by "net delay" rather than "logic delay".
This is why my thoughts for a possible 75 MHz focused core would be to drop down to 2-wide superscalar. It is more a question of what could be done to try to leverage the higher clock-speed to an advantage (and not lose too much performance in other areas).
In most past attempts, any sacrifices made have tended to hurt performance more than any gains (along with the usual issue of my 2-wide and 3-wide ISA variants tending to not be fully binary compatible).
XG3 has a possible way to sidestep this, as code compiled for 2-wide with XG3 should still be able to more or less leverage the speed of a 3-wide core, subject to register dependencies between instructions.
The main differences being any compromises made for 2-wide; say, if I disallowed 96-bit encodings to reduce the required fetch width. Along with the probable loss of any 128-bit 3-input instructions (there are very few of these). Most core parts of the ISA will still work on a 2-wide configuration.
Though, could still keep 96-bit encodings, as my existing 2-wide decoder still uses 96-bit fetch, just it drops down to 2-wide after Decode.
This (and keeping full Compare-and-Branch) would still allow XG3 to be mostly binary compatible between 2-wide and 3-wide configurations, just "slightly less cheap".
I am left once again knocking some dust off of the 2-wide configuration, as things had started breaking in that profile due to lack of maintenance (and previously was mostly revived in my effort to fit the BJX2 core into the XC7S50...).
...
