Re: Cost of handling misaligned access

Subject: Re: Cost of handling misaligned access
From: cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups: comp.arch
Date: 24 Feb 2025, 02:37:47
Organization: A noiseless patient Spider
Message-ID: <vpgihh$odpm$1@dont-email.me>
User-Agent: Mozilla Thunderbird

On 2/23/2025 4:08 PM, Michael S wrote:
> On Sun, 23 Feb 2025 11:13:53 -0500
> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>
>> BGB wrote:
>>> On 2/22/2025 1:25 PM, Robert Finch wrote:
>>>> On 2025-02-22 10:16 a.m., EricP wrote:
>>>>> BGB wrote:
>>>>>> On 2/21/2025 1:51 PM, EricP wrote:
>>>>>>>
>>>>>>> and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
>>>>>>>
>>>>>>> Fast and Area Efficient Adder for Wide Data in Recent Xilinx
>>>>>>> FPGAs, 2016
>>>>>>> http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
>>>>>>
>>>>>> Errm, skim, this doesn't really look like something you can pull
>>>>>> off in normal Verilog.
>>>>>
>>>>> Well that's what I'm trying to figure out because it's not just
>>>>> this paper
>>>>> but a lot, like many hundreds, of papers I've read from
>>>>> commercial or academic sources that seem to be able to control the
>>>>> FPGA results to a fine degree.
>>>
>>> You could invoke some of the LE's directly as primitives in
>>> Verilog, but then one has an ugly mess that will only work on a
>>> specific class of FPGA.
>>>
>>> Generally though, one has access in terms of said primitives,
>>> rather than control over the logic block.
>>>
>>> Vs, say, code that will work with Verilator, Vivado, and Quartus,
>>> without needing to be entirely rewritten for each.
>>>
>>> Though, that said, my design might still need some reworking to be
>>> "effective" with Quartus or Altera hardware; or to use the
>>> available hardware.
>>
>> Ok but this "portability" appears to be costing you dearly.
>>
>>> Say, rather than like on a Spartan or Artix (pure FPGA), the
>>> Cyclone FPGA's tend to include ARM hard processors, with the FPGA
>>> and ARM cores able to communicate over a bus. The FPGA part of the
>>> DE10 apparently has its own RAM chip, but it is SDRAM (rather than
>>> DDR2 or DDR3 like in a lot of the Xilinx based boards).
>>>
>>> Well, apart from some low-end boards which use QSPI SRAMs (though,
>>> having looked, a lot of these RAMs are DRAM internally, but the RAM
>>> module has its own RAM refresh logic).
>>>
>>>>> This can't just be left to the random luck of the wire router.
>>>>> There must be something else that these commercial and academic
>>>>> users are able to do to reliably optimize their design.
>>>>> Maybe it's a tool only available to big bucks customers.
>>>>>
>>>>> This has me curious. I'm going to keep looking around.
>>>>
>>>> I am sure it can be done as I have seen a lot of papers too with
>>>> results in the hundreds of megahertz. It has got to be the manual
>>>> placement and routing that helps. The routing in my design
>>>> typically takes up about 80% of the delay. One can build circuits
>>>> up out of individual primitive gates in Verilog (or(), and(), etc)
>>>> but for behavioral purposes I do not do that, instead relying on
>>>> the tools to generate the best combinations of gates. It is a ton
>>>> of work to do everything manually. I am happy to have things work
>>>> at 40 MHz even though 200 MHz may be possible with 10x the work
>>>> put into it. Typically running behavioural code. Doing things
>>>> mostly for my own edification. (I have got my memory controller
>>>> working at 200 MHz, so it is possible).
>>>> One thing that I have found that helps is to use smaller modules
>>>> and tasks for repetitive code where possible. The tools seem to
>>>> put together a faster design if everything is smaller modules. I
>>>> ponder it may have to do with making place and route easier.
>>>
>>> It is also possible to get higher speeds with smaller/simple
>>> designs.
>>>
>>> But, yeah, also I can note in Vivado, that the timing does tend to
>>> be dominated more by "net delay" rather than "logic delay".
>>>
>>> This is why my thoughts for a possible 75 MHz focused core would be
>>> to drop down to 2-wide superscalar. It is more a question of what
>>> could be done to try to leverage the higher clock-speed to an
>>> advantage (and not lose too much performance in other areas).
>>
>> You are missing my point. You are trying to work around a problem with
>> low level module design by rearranging high level architecture
>> components.
>>
>> It sounds like your ALU stage is taking about 20 ns to do an ADD
>> and that is having consequences that ripple through the design,
>> like taking an extra clock for result forwarding,
>> which causes performance issues when considering Compare And Branch,
>> and would cause a stall with back-to-back operations.
>>
>> This goes back to module optimization where you said:
>>
>> BGB wrote:
>>> On 2/21/2025 1:51 PM, EricP wrote:
>>>>
>>>> and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
>>>>
>>>> Fast and Area Efficient Adder for Wide Data in Recent Xilinx
>>>> FPGAs, 2016
>>>> http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
>>>
>>> Errm, skim, this doesn't really look like something you can pull
>>> off in normal Verilog.
>>>
>>> Generally, one doesn't have control over how the components hook
>>> together; one can only influence what happens based on how they
>>> write their Verilog.
>>>
>>> You can just write:
>>>    reg[63:0] tValA;
>>>    reg[63:0] tValB;
>>>    reg[63:0] tValC;
>>>    tValC=tValA+tValB;
>>>
>>> But, then it spits out something with a chain of 16 CARRY4's, so
>>> there is a fairly high latency on the high order bits of the
>>> result.
>>
>> It looks to me that Vivado intends that after you get your basic
>> design working, this module optimization is *exactly* what one is
>> supposed to do.
>>
>> In this case the prototype design establishes that you need multiple
>> 64-bit adders and the generic ones synthesis spits out are slow.
>> So you isolate that module off, use Verilog to drive the basic LE
>> selections, then iterate doing relative LE placement specifiers,
>> route the module, and when you get the fastest 64-bit adder you can
>> then lock down the netlist and save the module design.
>>
>> Now you have a plug-in 64-bit adder module that runs at (I don't know
>> the speed difference between Virtex and your Spartan-7 so wild guess)
>> oh, say, 4 ns, to use multiple places... fetch, decode, alu, agu.
>>
>> Then plug that into your ALU, add in SUB, AND, OR, XOR, functions,
>> isolate that module, optimize placement, route, lock down netlist,
>> and now you have a 5 ns plug-in ALU module.
>>
>> Doing this you build up your own IP library of optimized hardware
>> modules.
>>
>> As more and more modules are optimized the system synthesis gets
>> faster because much of the fine grain work and routing is already
>> done.
>
> It sounds like your 1st hand FPGA design experience is VERY outdated.
 
Something is going on; it seems very different from my experience, in any case...
Apart from brief fiddling with Yosys, the FPGA tools are mostly higher-level and GUI-driven.
The process seems to be like:
   Synthesis:
     Vivado takes Verilog and does stuff with it.
     User doesn't see what it generates, only an overview of it in a GUI;
     Can look at resource usage and similar.
   Implementation:
     Vivado goes and generates a Bitstream.
     This stage is where it detects if timing fails.
   Hardware Manager:
     Access device over USB+FTDI or JTAG;
     Can load bitstream onto the device or reboot the FPGA, ...
It is possible to get at the constraints file and edit it manually, but a lot of the rest is hidden away and seemingly regenerated ex-nihilo each time synthesis is run.
With Digilent devices:
   Can generally also copy the bitstream onto an SDcard;
   Or, write the bitstream into an on-board Flash ROM.
I am not entirely sure how the FPGA boot-up process works. It seems like something is present which allows the FPGA to load the bitstream from a FAT32 volume, but this behavior seems absent on the QMTech board.
Looking up some stuff, it seems the process may be something like:
   FPGA starts;
   It tries to load a bootstrap from a QSPI Flash on specific pins;
     Reading a bitstream directly from the start of the ROM?
   The Bootstrap then does a more complex boot:
     Scans the QSPI and SDcard for ".bit" files to load;
       In a FAT32 volume this time.
     Reads bitstream into FPGA;
       FPGA reboots and does the rest itself.
Reading some stuff, it appears it is possible to reprogram the initial QSPI Flash, but doing so (on a board like the Nexys A7) might (implicitly) break the board's ability to boot a bitstream off the SDcard.
But this may be an option for the QMTech board, which apparently does have a QSPI Flash connected up to the FPGA, but generally proceeds to do crap-all until one configures it over JTAG (I am guessing the initial ROM contents are empty, or it tries and fails to boot in some other way).
For 5 and 6 series devices, Xilinx seems to use a different set of tools known as ISE, but I haven't messed around with it.
Also, not going to micro-optimize roughly 200K lines of Verilog...
   How does it fit into the FPGA? FPGA magic I guess...
Then again, things like lookup tables may be fairly bulky in Verilog, but compact down a fair bit as LUTs.
So, say, for example, the decoder in my case:
   Big lookup for all the possible instructions;
   Logic to unpack the various instruction forms.
Possibly counter-intuitive:
The big scary mess of instruction lookup is not the main thing that sways cost;
Rather, it is mostly the logic for instruction forms and immediate fields.
This is partly why for XG3 there are comparatively few distinct layouts (and fewer than RISC-V), since I started to realize that it is more the "variability" that is the cost factor.
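As a rough illustration of the point that a lookup can be bulky as source yet compact as LUTs, here is a minimal sketch. The module name, signal names, widths, and table contents are all made up for illustration and are not taken from the actual decoder. Each output bit depends on only 6 input bits, so on a device with 6-input LUTs (such as the 7-series parts) it can in principle map to roughly one LUT per output bit, however many rows the `case` has:

```verilog
// Illustration only: names, widths, and table contents are hypothetical.
module OpClassLookup(
    input  wire [5:0] opBits,   // 6 opcode bits feeding the table
    output reg  [1:0] opClass   // 2-bit class code per table entry
);
    always @* begin
        case (opBits)
            6'h00, 6'h01: opClass = 2'b01;
            6'h02, 6'h03: opClass = 2'b10;
            6'h10, 6'h11: opClass = 2'b11;
            // ...a real table would list many more rows here...
            default:      opClass = 2'b00;
        endcase
    end
endmodule
```

The source grows with the number of rows, but the logic cost is bounded by the number of input and output bits, which is consistent with the table itself not being the main cost.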
If I were designing a new decoder now, I might also handle the register outputs via MUX'ing (similar to what I ended up doing for decoding immediate values).
Say, for each of the "Rs,Rt,Rp,Rn" decoder output ports, a 2-bit field is used to select a register:
   00: Fixed (Default=ZZR)
   01: Rn
   10: Rm
   11: Ro
Or, maybe for the fixed ports, another few bits are used for another MUX:
   000: ZZR
   001: LR
   010: SP
   011: GBR
   100: DLR (R0, XG1/XG2)
   101: DHR (R1, XG1/XG2)
   ...
Though, I would need another register selector to decode the XG1 or XG2 F8 block, so a different strategy may be used:
   Mux1:
     00: Rn2 (F8 layout, XG1/XG2)
     01: Rn
     10: Rm
     11: Ro
   Mux2:
     000: ZZR
     ...
   Mux (top):
     0: Dynamic Register (Mux1)
     1: Fixed Register (Mux2)
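A minimal sketch of how one such output port might look with the Mux1/Mux2/top arrangement above. This is an assumption-laden illustration, not the actual decoder: the module and signal names are invented, the 7-bit register-ID width is assumed, and the ID values for ZZR, LR, SP, and GBR are placeholders (only DLR=R0 and DHR=R1 come from the list above):

```verilog
// Hypothetical sketch of one decoder register-output port (e.g. Rs).
// Names, ID width, and most ID values are placeholders for illustration.
module DecRegPortMux(
    input  wire [6:0] idRn2,   // register field from the F8 layout (XG1/XG2)
    input  wire [6:0] idRn,
    input  wire [6:0] idRm,
    input  wire [6:0] idRo,
    input  wire [1:0] selDyn,  // Mux1 select: 00=Rn2, 01=Rn, 10=Rm, 11=Ro
    input  wire [2:0] selFix,  // Mux2 select: 000=ZZR, 001=LR, ...
    input  wire       selTop,  // 0: dynamic (Mux1), 1: fixed (Mux2)
    output reg  [6:0] idOut
);
    localparam [6:0] REG_DLR = 7'h00;  // R0, per the list above
    localparam [6:0] REG_DHR = 7'h01;  // R1, per the list above
    localparam [6:0] REG_ZZR = 7'h7F;  // placeholder value
    localparam [6:0] REG_LR  = 7'h7E;  // placeholder value
    localparam [6:0] REG_SP  = 7'h7D;  // placeholder value
    localparam [6:0] REG_GBR = 7'h7C;  // placeholder value

    reg [6:0] tDyn;
    reg [6:0] tFix;

    always @* begin
        // Mux1: pick one of the register fields from the instruction word
        case (selDyn)
            2'b00:   tDyn = idRn2;
            2'b01:   tDyn = idRn;
            2'b10:   tDyn = idRm;
            default: tDyn = idRo;
        endcase
        // Mux2: pick one of the fixed registers
        case (selFix)
            3'b000:  tFix = REG_ZZR;
            3'b001:  tFix = REG_LR;
            3'b010:  tFix = REG_SP;
            3'b011:  tFix = REG_GBR;
            3'b100:  tFix = REG_DLR;
            3'b101:  tFix = REG_DHR;
            default: tFix = REG_ZZR;  // unused codes fall back to ZZR
        endcase
        // Top mux: dynamic vs fixed
        idOut = selTop ? tFix : tDyn;
    end
endmodule
```

With this shape, the big instruction lookup only needs to emit the 6 select bits per port, and the per-port logic is the same few muxes regardless of how many instruction forms exist.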
But, for sake of decoder efficiency, it could make sense to do a decoder that is natively XG3 only and then handle XG1 and XG2 via repacking (basically, the reverse of my current decoding strategy).
Though, handling the 16-bit ops via repacking, while theoretically possible, would be a big ask in terms of latency.
A possible hack could be to do a partial bit-repack, essentially dumping the 16-bit ops in a semi-repacked form into some part of the XG3 decoding space (rather than use a fully separate 16-bit decoder).
...
Though, in other news:
Did partly revive my 2-wide configuration;
Though, it crashes immediately after loading the kernel image, as the 2-wide configuration isn't fully binary compatible.
Then switched things out so that it will also load the 2-wide build of the kernel. Closer: the kernel boots, but still more fiddling is needed, as it crashes in sanity tests (it tries to sanity test SIMD and crashes, but currently I have the SIMD unit and some other stuff disabled, so this stands to reason).
At present, the 2-wide configuration is using 78% of the XC7S50, albeit without the VGA or PS2 logic enabled. Performance is notably worse than with the full 3-wide core, though (I suspect mostly due to cutting various "make stuff faster" settings).
However, it does have 2.4ns of slack ATM, so could maybe be twiddled (the XC7S50 is more resource-limited than timing-limited, though).
At present, this configuration:
   2-wide XG1/XG2 (64 GPRs);
     2-wide mostly limited to ALU|ALU.
     Does not support ALU|MEM.
   1-wide RISC-V (scalar only);
   Binary64 FPU;
   64x4w TLB;
   ...
Has a resource cost of ~ 23k LUT for the CPU core.
   ~ 34% of the LUT cost is going into the L1 cache;
   ~ 13% into the FPU;
   ...
...
