Newsportal USENET - Re: Cost of handling misaligned access

On 2/23/2025 4:08 PM, Michael S wrote:

On Sun, 23 Feb 2025 11:13:53 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:

BGB wrote:
On 2/22/2025 1:25 PM, Robert Finch wrote:
On 2025-02-22 10:16 a.m., EricP wrote:
BGB wrote:
On 2/21/2025 1:51 PM, EricP wrote:
>
and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
>
Fast and Area Efficient Adder for Wide Data in Recent Xilinx
FPGAs, 2016
http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf

>
Errm, skim, this doesn't really look like something you can pull
off in normal Verilog.
>
Well that's what I'm trying to figure out because it's not just
this paper
but a lot, like many hundreds, of papers I've read from
commercial or academic sources that seem to be able to control the
FPGA results to a fine degree.

>
You could invoke some of the LE's directly as primitives in
Verilog, but then one has an ugly mess that will only work on a
specific class of FPGA.
>
Generally though, one has access in terms of said primitives,
rather than control over the logic block.
>
>
Vs, say, code that will work with Verilator, Vivado, and Quartus,
without needing to be entirely rewritten for each.
>
>
Though, that said, my design might still need some reworking to be
"effective" with Quartus or Altera hardware; or to use the
available hardware.
>
Ok but this "portability" appears to be costing you dearly.
>
Say, rather than like on a Spartan or Artix (pure FPGA), the
Cyclone FPGA's tend to include ARM hard processors, with the FPGA
and ARM cores able to communicate over a bus. The FPGA part of the
DE10 apparently has its own RAM chip, but it is SDRAM (rather than
DDR2 or DDR3 like in a lot of the Xilinx based boards).
>
Well, apart from some low-end boards which use QSPI SRAMs (though,
having looked, a lot of these RAMs are DRAM internally, but the RAM
module has its own RAM refresh logic).
>

>
This can't just be left to the random luck of the wire router.
There must be something else that these commercial and academic
users are able to do to reliably optimize their design.
Maybe it's a tool only available to big bucks customers.
>
This has me curious. I'm going to keep looking around.
>

I am sure it can be done as I have seen a lot of papers too with
results in the hundreds of megahertz. It has got to be the manual
placement and routing that helps. The routing in my design
typically takes up about 80% of the delay. One can build circuits
up out of individual primitive gates in Verilog (or(), and(), etc)
but for behavioral purposes I do not do that, instead relying on
the tools to generate the best combinations of gates. It is a ton
of work to do everything manually. I am happy to have things work
at 40 MHz even though 200 MHz may be possible with 10x the work
put into it. Typically running behavioural code. Doing things
mostly for my own edification. ( I have got my memory controller
working at 200 MHz, so it is possible).
One thing that I have found that helps is to use smaller modules
and tasks for repetitive code where possible. The tools seem to
put together a faster design if everything is smaller modules. I
ponder it may have to do with making place and route easier.

>
It is also possible to get higher speeds with smaller/simple
designs.
>
But, yeah, also I can note in Vivado, that the timing does tend to
be dominated more by "net delay" rather than "logic delay".
>
>
>
This is why my thoughts for a possible 75 MHz focused core would be
to drop down to 2-wide superscalar. It is more a question of what
could be done to try to leverage the higher clock-speed to an
advantage (and not lose too much performance in other areas).
>
You are missing my point. You are trying to work around a problem with
low level module design by rearranging high level architecture
components.
>
It sounds like your ALU stage is taking about 20 ns to do an ADD
and that is having consequences that ripple through the design,
like taking an extra clock for result forwarding,
which causes performance issues when considering Compare And Branch,
and would cause a stall with back-to-back operations.
>
This goes back to module optimization where you said:
>
BGB wrote:
On 2/21/2025 1:51 PM, EricP wrote:

and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
>
Fast and Area Efficient Adder for Wide Data in Recent Xilinx
FPGAs, 2016
http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
>
Errm, skim, this doesn't really look like something you can pull
off in normal Verilog.
>
Generally, one doesn't have control over how the components hook
together; one can only influence what happens through how one
writes the Verilog.
>
You can just write:
   reg[63:0] tValA;
   reg[63:0] tValB;
   reg[63:0] tValC;
   always @*
      tValC = tValA + tValB;   // synthesis infers the carry chain
>
>
But, then it spits out something with a chain of 16 CARRY4's, so
there is a fairly high latency on the high order bits of the
result.
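To make the latency point concrete, here is a minimal behavioural sketch in Python (not HDL, and not claimed to be the technique used in the cited paper). It contrasts a block-by-block ripple add, where the carry has to pass through 16 four-bit stages in sequence, with a simple carry-select split, where the upper half is computed for both possible carry-in values and one is picked at the end. The function names and the 32/32 split are assumptions made for illustration only.

```python
MASK64 = (1 << 64) - 1

def ripple_add(a, b, width=64, block=4):
    """Add in 4-bit blocks, passing the carry along serially (CARRY4-like)."""
    bmask = (1 << block) - 1
    carry, result, stages = 0, 0, 0
    for i in range(0, width, block):
        s = ((a >> i) & bmask) + ((b >> i) & bmask) + carry
        result |= (s & bmask) << i
        carry = s >> block
        stages += 1          # each block waits for the previous carry
    return result, carry, stages

def carry_select_add(a, b, width=64, half=32):
    """Compute the upper half for carry-in 0 and 1, then select one."""
    lo_mask = (1 << half) - 1
    hi_mask = (1 << (width - half)) - 1
    lo = (a & lo_mask) + (b & lo_mask)
    carry_mid = lo >> half
    a_hi = (a >> half) & hi_mask
    b_hi = (b >> half) & hi_mask
    hi0 = a_hi + b_hi        # speculative result, carry-in = 0
    hi1 = a_hi + b_hi + 1    # speculative result, carry-in = 1
    hi = hi1 if carry_mid else hi0
    result = ((hi & hi_mask) << half) | (lo & lo_mask)
    return result, hi >> (width - half)
```

In hardware the two upper-half sums would be evaluated in parallel with the lower half, so the serial carry path is roughly halved at the cost of a duplicated upper adder and a mux. The Python model only checks that the two forms agree; it says nothing about actual timing on a given FPGA.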
>
It looks to me that Vivado intends that after you get your basic
design working, this module optimization is *exactly* what one is
supposed to do.
>
In this case the prototype design establishes that you need multiple
64-bit adders and the generic ones synthesis spits out are slow.
So you isolate that module off, use Verilog to drive the basic LE
selections, then iterate doing relative LE placement specifiers,
route the module, and when you get the fastest 64-bit adder you can
then lock down the netlist and save the module design.
>
Now you have a plug-in 64-bit adder module that runs at (I don't know
the speed difference between Virtex and your Spartan-7 so wild guess)
oh, say, 4 ns, to use in multiple places... fetch, decode, alu, agu.
>
Then plug that into your ALU, add in SUB, AND, OR, XOR, functions,
isolate that module, optimize placement, route, lock down netlist,
and now you have a 5 ns plug-in ALU module.
>
Doing this you build up your own IP library of optimized hardware
modules.
>
As more and more modules are optimized the system synthesis gets
faster because much of the fine grain work and routing is already
done.
>
It sounds like your 1st hand FPGA design experience is VERY outdated.

Something is going on, it seems very different from my experience, in any case...
Apart from brief fiddling with Yosys, the FPGA tools I have used are mostly higher-level and GUI driven.
The process seems to be like:
   Synthesis:
      Vivado takes Verilog and does stuff with it.
      User doesn't see what it generates, only an overview of it in a GUI;
      Can look at resource usage and similar.
   Implementation:
      Vivado goes and generates a Bitstream.
      This stage is where it detects if timing fails.
   Hardware Manager:
      Access device over USB+FTDI or JTAG;
      Can load bitstream onto the device or reboot the FPGA, ...
It is possible to get at the constraints file and edit it manually, but a lot of the rest is hidden away and seemingly regenerated ex-nihilo each time synthesis is run.
With Digilent devices:
   Can generally also copy the bitstream onto an SDcard;
   Or, write the bitstream into an on-board Flash ROM.
I am not entirely sure how the FPGA boot-up process works. Something seems to be present that allows the FPGA to load the bitstream from a FAT32 volume, but this behavior seems absent on the QMTech board.
Looking up some stuff, it seems the process may be something like:
   FPGA starts;
   It tries to load a bootstrap from a QSPI Flash on specific pins;
      Reading a bitstream directly from the start of the ROM?
   The Bootstrap then does a more complex boot:
      Scans the QSPI and SDcard for ".bit" files to load;
         In a FAT32 volume this time.
      Reads bitstream into FPGA;
   FPGA reboots and does the rest itself.
Reading some stuff, it appears it is possible to reprogram the initial QSPI Flash, but doing so (on a board like the Nexys A7) might (implicitly) break the board's ability to boot a bitstream off the SDcard.
But this may be an option for the QMTech board, which apparently does have a QSPI Flash connected up to the FPGA, but generally proceeds to do crap-all until one configures it over JTAG (I am guessing that maybe the initial ROM contents are empty, or that it tries and fails to boot in some other way).
For 5 and 6 series devices, Xilinx seems to use a different set of tools known as ISE, but I haven't messed around with it.
Also, not going to micro-optimize roughly 200K lines of Verilog...
   How does it fit into the FPGA? FPGA magic I guess...
Then again, things like lookup tables may be fairly bulky in Verilog, but compact down a fair bit as LUTs.
So, say, for example, the decoder in my case:
   Big lookup for all the possible instructions;
   Logic to unpack the various instruction forms.
Possibly counter-intuitive:
The big scary mess of instruction lookup is not the main thing that sways cost;
Rather, it is mostly the logic for instruction forms and immediate fields.
This is partly why for XG3 there are comparably few distinct layouts (and, fewer than RISC-V), since I started to realize that it is more "variability" that is the cost factor.
If I were designing a new decoder now, I might also handle the register outputs via MUX'ing (similar to what I ended up doing for decoding immediate values).
Say, for each of the "Rs,Rt,Rp,Rn" decoder output ports, a 2-bit field is used to select a register:
   00: Fixed (Default=ZZR)
   01: Rn
   10: Rm
   11: Ro
Or, maybe for the fixed ports, another few bits are used for another MUX:
   000: ZZR
   001: LR
   010: SP
   011: GBR
   100: DLR (R0, XG1/XG2)
   101: DHR (R1, XG1/XG2)
   ...
Though, I would need another register selector to decode the XG1 or XG2 F8 block, so a different strategy may be used:
   Mux1:
      00: Rn2 (F8 layout, XG1/XG2)
      01: Rn
      10: Rm
      11: Ro
   Mux2:
      000: ZZR
      ...
   Mux (top):
      0: Dynamic Register (Mux1)
      1: Fixed Register (Mux2)
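As a rough behavioural sketch of the two-level selection just described (Python rather than Verilog, purely to show the select logic): the top bit picks between the dynamic mux and the fixed mux. The function name `select_port`, the `fields` dictionary, and the idea that Mux2 reuses the ZZR/LR/SP/GBR/DLR/DHR codes from the earlier list are all assumptions for illustration; the entries elided with "..." are left undefined here.

```python
# Mux1: dynamic register fields taken from the instruction word.
MUX1_FIELDS = ("Rn2", "Rn", "Rm", "Ro")   # select codes 00, 01, 10, 11

# Mux2: fixed registers. ASSUMPTION: same codes as the earlier fixed-port
# list; codes not listed there are intentionally left unassigned.
MUX2_FIXED = {
    0b000: "ZZR",
    0b001: "LR",
    0b010: "SP",
    0b011: "GBR",
    0b100: "DLR",
    0b101: "DHR",
}

def select_port(top, mux1, mux2, fields):
    """Return the register chosen for one decoder output port.

    top    : 0 = dynamic register (Mux1), 1 = fixed register (Mux2)
    mux1   : 2-bit select into the instruction's register fields
    mux2   : 3-bit select into the fixed-register table
    fields : hypothetical dict of register IDs already unpacked from
             the instruction word, keyed by "Rn2", "Rn", "Rm", "Ro"
    """
    if top == 0:
        return fields[MUX1_FIELDS[mux1 & 0b11]]
    if mux2 not in MUX2_FIXED:
        raise ValueError("fixed-register code not defined in this sketch")
    return MUX2_FIXED[mux2]
```

For example, `select_port(0, 0b10, 0, fields)` yields whatever register ID was unpacked as Rm, while `select_port(1, 0, 0b010, fields)` yields SP regardless of the instruction's register fields. The point is that the big instruction lookup only has to emit a few select bits per port rather than a full register number.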
But, for sake of decoder efficiency, it could make sense to do a decoder that is natively XG3 only and then handle XG1 and XG2 via repacking (basically, the reverse of my current decoding strategy).
Though, handling the 16-bit ops via repacking, while theoretically possible, would be a big ask in terms of latency.
A possible hack could be to do a partial bit-repack, essentially dumping the 16-bit ops in a semi-repacked form into some part of the XG3 decoding space (rather than use a fully separate 16-bit decoder).
...
Though, in other news:
Did partly revive my 2-wide configuration;
Though, it crashes immediately after loading the kernel image, as the 2-wide configuration isn't fully binary compatible.
Then switched things out so that it will also load the 2-wide build of the kernel. Closer, kernel boots, but still more fiddling needed, crashes in sanity tests (tries to sanity test SIMD and crashes, but currently I have the SIMD unit and some other stuff disabled, so this stands to reason).
At present, the 2-wide configuration is using 78% of the XC7S50, albeit without the VGA or PS2 logic enabled. Performance is notably worse than with the full 3-wide core, though (I suspect mostly due to cutting various "make stuff faster" settings).
However, it does have 2.4ns of slack ATM, so could maybe be twiddled (XC7S50 is more resource-limited than timing limited though).
At present, this configuration:
   2-wide XG1/XG2 (64 GPRs);
      2-wide mostly limited to ALU|ALU.
      Does not support ALU|MEM.
   1-wide RISC-V (scalar only);
   Binary64 FPU;
   64x4w TLB;
   ...
Has a resource cost of ~ 23k LUT for the CPU core.
   ~ 34% of the LUT cost is going into the L1 cache;
   ~ 13% into the FPU;
   ...
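For a sense of scale, the percentages above can be turned into approximate LUT counts. This is only back-of-envelope arithmetic on the rounded figures quoted here (the "~" values), so the results are rough; the helper name `lut_share` is made up for the example.

```python
CPU_CORE_LUTS = 23_000   # "~23k LUT" for the CPU core, as quoted above

def lut_share(total_luts, percent):
    """Approximate LUT count for a given percentage of the core."""
    return round(total_luts * percent / 100)

l1_cache_luts = lut_share(CPU_CORE_LUTS, 34)   # L1 cache share
fpu_luts = lut_share(CPU_CORE_LUTS, 13)        # FPU share
other_luts = CPU_CORE_LUTS - l1_cache_luts - fpu_luts

print(l1_cache_luts, fpu_luts, other_luts)
```

So, roughly 7.8k LUTs for the L1 cache, 3.0k for the FPU, and about 12.2k for everything else in the core.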
...
