On Sun, 23 Feb 2025 11:13:53 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
BGB wrote:
On 2/22/2025 1:25 PM, Robert Finch wrote:
On 2025-02-22 10:16 a.m., EricP wrote:
BGB wrote:
On 2/21/2025 1:51 PM, EricP wrote:
>
Something is going on, it seems very different from my experience, in any case...
It sounds like your 1st hand FPGA design experience is VERY outdated.
>
and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
>
Fast and Area Efficient Adder for Wide Data in Recent Xilinx
FPGAs, 2016
http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
Errm, skim, this doesn't really look like something you can pull
off in normal Verilog.
Well that's what I'm trying to figure out because it's not just
this paper but a lot, like many hundreds, of papers I've read from
commercial or academic sources that seem to be able to control the
FPGA results to a fine degree.
You could invoke some of the LE's directly as primitives in
Verilog, but then one has an ugly mess that will only work on a
specific class of FPGA.
>
Generally though, one has access in terms of said primitives,
rather than control over the logic block.
>
>
Vs, say, code that will work with Verilator, Vivado, and Quartus,
without needing to be entirely rewritten for each.
>
>
Though, that said, my design might still need some reworking to be
"effective" with Quartus or Altera hardware; or to use the
available hardware.
Ok but this "portability" appears to be costing you dearly.
>
Say, rather than like on a Spartan or Artix (pure FPGA), the
Cyclone FPGAs tend to include ARM hard processors, with the FPGA
and ARM cores able to communicate over a bus. The FPGA part of the
DE10 apparently has its own RAM chip, but it is SDRAM (rather than
DDR2 or DDR3 like in a lot of the Xilinx based boards).
>
Well, apart from some low-end boards which use QSPI SRAMs (though,
having looked, a lot of these RAMs are DRAM internally, but the RAM
module has its own RAM refresh logic).
>
This can't just be left to the random luck of the wire router.
There must be something else that these commercial and academic
users are able to do to reliably optimize their design.
Maybe it's a tool only available to big bucks customers.
>
This has me curious. I'm going to keep looking around.
>
I am sure it can be done as I have seen a lot of papers too with
results in the hundreds of megahertz. It has got to be the manual
placement and routing that helps. The routing in my design
typically takes up about 80% of the delay. One can build circuits
up out of individual primitive gates in Verilog (or(), and(), etc)
but for behavioral purposes I do not do that, instead relying on
the tools to generate the best combinations of gates. It is a ton
of work to do everything manually. I am happy to have things work
at 40 MHz even though 200 MHz may be possible with 10x the work
put into it. Typically running behavioural code. Doing things
mostly for my own edification. (I have got my memory controller
working at 200 MHz, so it is possible).
One thing that I have found that helps is to use smaller modules
and tasks for repetitive code where possible. The tools seem to
put together a faster design if everything is in smaller modules. I
suspect it may have to do with making place and route easier.
It is also possible to get higher speeds with smaller/simpler
designs.
>
But, yeah, also I can note in Vivado, that the timing does tend to
be dominated more by "net delay" rather than "logic delay".
>
>
>
This is why my thoughts for a possible 75 MHz focused core would be
to drop down to 2-wide superscalar. It is more a question of what
could be done to try to leverage the higher clock-speed to an
advantage (and not lose too much performance in other areas).
You are missing my point. You are trying to work around a problem with
low level module design by rearranging high level architecture
components.
>
It sounds like your ALU stage is taking about 20 ns to do an ADD
and that is having consequences that ripple through the design,
like taking an extra clock for result forwarding,
which causes performance issues when considering Compare And Branch,
and would cause a stall with back-to-back operations.
>
This goes back to module optimization where you said:
>
BGB wrote:
On 2/21/2025 1:51 PM, EricP wrote:
and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
>
Fast and Area Efficient Adder for Wide Data in Recent Xilinx
FPGAs, 2016
http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
Errm, skim, this doesn't really look like something you can pull
off in normal Verilog.
>
Generally, one doesn't have control over how the components hook
together; one can only influence what happens based on how they
write their Verilog.
>
You can just write:
    reg[63:0] tValA;
    reg[63:0] tValB;
    reg[63:0] tValC;
    // combinational 64-bit add; the tool picks the carry structure
    always @*
        tValC = tValA + tValB;
>
>
But, then it spits out something with a chain of 16 CARRY4's, so
there is a fairly high latency on the high order bits of the
result.
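For reference, one portable way to shorten that carry chain without touching primitives is a carry-select split. What follows is only a minimal sketch, not something from the paper or from anyone's actual core: the module name `add64_csel` and the 32/32 split are assumptions, and nothing here has been timed. The low half adds normally, the high half is computed for both possible carry-ins in parallel, and the real carry selects the result, so each chain is at most 8 CARRY4's long plus a mux.

```verilog
// Hypothetical 64-bit carry-select adder (illustrative names, untimed).
// Plain Verilog-2001, so it should be accepted by Verilator, Vivado and Quartus.
module add64_csel(
    input  wire [63:0] a,
    input  wire [63:0] b,
    output wire [63:0] sum
);
    // Low 32 bits; bit 32 of 'lo' is the carry out of the low half.
    wire [32:0] lo  = {1'b0, a[31:0]} + {1'b0, b[31:0]};

    // High 32 bits, computed twice in parallel.
    wire [31:0] hi0 = a[63:32] + b[63:32];          // as if carry-in were 0
    wire [31:0] hi1 = a[63:32] + b[63:32] + 32'd1;  // as if carry-in were 1

    // The real carry from the low half picks the matching high half.
    assign sum = {lo[32] ? hi1 : hi0, lo[31:0]};
endmodule
```

This trades roughly 1.5x the adder area for a shorter worst-case chain. Whether synthesis keeps the structure as written, and how much it actually gains on a given part, would have to be checked in the timing report.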
It looks to me that Vivado intends that after you get your basic
design working, this module optimization is *exactly* what one is
supposed to do.
>
In this case the prototype design establishes that you need multiple
64-bit adders and the generic ones synthesis spits out are slow.
So you isolate that module off, use Verilog to drive the basic LE
selections, then iterate doing relative LE placement specifiers,
route the module, and when you get the fastest 64-bit adder you can
then lock down the netlist and save the module design.
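The "use Verilog to drive the basic LE selections" step could look roughly like the sketch below. It is only an illustration under stated assumptions: the module name `add64_carry4` is made up, it targets Xilinx 7-series parts where the `CARRY4` primitive exists, and it will not build in Quartus or in Verilator without a simulation model for `CARRY4`. It shows only the primitive instantiation; the relative placement constraints and the netlist lock-down are tool-specific steps that are not shown here.

```verilog
// Hypothetical isolated adder module built from explicit CARRY4 primitives.
// DONT_TOUCH asks Vivado not to optimize across or inside this module.
(* DONT_TOUCH = "yes" *)
module add64_carry4(
    input  wire [63:0] a,
    input  wire [63:0] b,
    output wire [63:0] sum
);
    wire [63:0] p = a ^ b;   // propagate bits (one LUT each), feed the S inputs
    wire [16:0] c;           // c[i] = carry into 4-bit block i
    assign c[0] = 1'b0;      // no carry-in for a plain ADD

    genvar i;
    generate
        for (i = 0; i < 16; i = i + 1) begin : g_cy
            wire [3:0] co;   // per-bit carry outs of this block
            CARRY4 u_cy (
                .CO     (co),
                .O      (sum[4*i +: 4]),  // sum bits = S xor carry
                .CI     (c[i]),           // carry from the previous block
                .CYINIT (1'b0),
                .DI     (a[4*i +: 4]),    // carry value when a bit does not propagate
                .S      (p[4*i +: 4])
            );
            assign c[i+1] = co[3];
        end
    endgenerate
endmodule
```

As written this is still one chain of 16 CARRY4's, so by itself it should not be faster than what synthesis infers from `+`. The point is only that once the primitives are explicit, the module becomes something that can be restructured and placed by hand, at the cost of the portability discussed earlier in the thread.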
>
Now you have a plug-in 64-bit adder module that runs at (I don't know
the speed difference between Virtex and your Spartan-7 so wild guess)
oh, say, 4 ns, to use in multiple places... fetch, decode, alu, agu.
>
Then plug that into your ALU, add in SUB, AND, OR, XOR functions,
isolate that module, optimize placement, route, lock down netlist,
and now you have a 5 ns plug-in ALU module.
>
Doing this you build up your own IP library of optimized hardware
modules.
>
As more and more modules are optimized the system synthesis gets
faster because much of the fine grain work and routing is already
done.
>