mitchalsup@aol.com (MitchAlsup1) writes:
>> There are also some savings in reduced I-cache usage (possibly
>> leading to higher I-cache hit rate), reduced memory I-fetch memory
>> bandwidth required, etc, though these may be modest at best.
>
> The point is that the cost of not getting allocated into a register
> is vastly lower--the count of instructions remains 1 while the
> latency increases. That increase in latency does not hurt those
> use once/seldom variables.

Latency is not the issue in modern high-performance AMD64 cores, which
have zero-cycle store-to-load forwarding
<http://www.complang.tuwien.ac.at/anton/memdep/>.
And yet, putting variables in registers gives a significant speedup:
On a Rocket Lake, numbers are times in seconds:

  sieve  bubble  matrix    fib    fft
  0.075   0.070   0.036  0.049  0.017  TOS in reg, RP in reg, IP in reg
  0.100   0.149   0.054  0.106  0.037  TOS in mem, RP in mem, IP write-through to mem

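To make the size of the difference easier to read off, the ratios between the two rows can be computed directly. This is only a small arithmetic sketch over the numbers in the table; the names `IN_REGISTERS`, `IN_MEMORY` and `slowdown_factors` are made up for illustration.

```python
# Times in seconds, copied from the table above (Rocket Lake).
BENCHMARKS = ["sieve", "bubble", "matrix", "fib", "fft"]
IN_REGISTERS = [0.075, 0.070, 0.036, 0.049, 0.017]  # TOS, RP, IP in registers
IN_MEMORY = [0.100, 0.149, 0.054, 0.106, 0.037]     # TOS, RP in memory, IP written through


def slowdown_factors(fast, slow):
    """Return slow/fast per benchmark: how many times longer the slow run takes."""
    return [s / f for f, s in zip(fast, slow)]


if __name__ == "__main__":
    for name, factor in zip(BENCHMARKS, slowdown_factors(IN_REGISTERS, IN_MEMORY)):
        print(f"{name:8s} {factor:.2f}x")
```

The factors range from about 1.3 (sieve) to about 2.2 (fft). The two configurations do not differ only in register use (the first row still has static superinstructions enabled), so the ratios are an upper-ish indication and not a clean measurement of the register effect alone.
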
For the first row of the table, I used gforth-fast and tried to disable all
optimizations except those that keep certain variables in registers:
gforth-fast --ss-states=1 --ss-number=31 --opt-ip-updates=0 onebench.fs
I could not reduce the static superinstructions below 31 and still get
a result; I will have to investigate why, but that probably does not
make that much of a difference for several of these benchmarks.
For the second row of the table I used gforth, an engine that keeps the top of
stack in memory, the return-stack pointer in memory, stores IP to
memory after every change, and does not use static superinstructions,
all for better identifying where an error happened.
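What "top of stack in memory" versus "top of stack in a register" means for a primitive like `+` can be shown with a toy model that just counts stack-memory accesses. This is a hypothetical sketch, not Gforth's actual engine code; `CountingStack` and the helper functions are invented for the illustration, and a Python local variable stands in for the register.

```python
class CountingStack:
    """Data stack in 'memory' that counts every load and store."""

    def __init__(self, items):
        self.cells = list(items)  # initial contents are not counted
        self.loads = 0
        self.stores = 0

    def pop(self):
        self.loads += 1
        return self.cells.pop()

    def push(self, value):
        self.stores += 1
        self.cells.append(value)


def run_adds_in_memory(values):
    """Sum values with '+' when every stack item, including the top, is in memory."""
    stack = CountingStack(values)
    for _ in range(len(values) - 1):
        b = stack.pop()    # load top of stack
        a = stack.pop()    # load second item
        stack.push(a + b)  # store the result back
    return stack.cells[-1], stack.loads, stack.stores


def run_adds_in_register(values):
    """Same computation with the top of stack cached in a local (the 'register')."""
    stack = CountingStack(values[:-1])
    tos = values[-1]
    for _ in range(len(values) - 1):
        tos = stack.pop() + tos  # one load; the result stays in the register
    return tos, stack.loads, stack.stores


if __name__ == "__main__":
    print(run_adds_in_memory([1, 2, 3, 4]))    # (result, loads, stores)
    print(run_adds_in_register([1, 2, 3, 4]))
```

The model only counts accesses, it says nothing about time: with zero-cycle store-to-load forwarding the extra loads and stores need not add latency, but they are still executed.
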
> In the examples cited, the lack of register allocation triples
> the instruction count due to lack of LD-OP and LD-OP-ST. The
> register count I stated is how many registers would a
> non-LD-OP machine need to break even on the instruction count.

What makes you think that instruction count is particularly relevant?
Yes, you may save some decoding resources if you use LD-OP-ST on an
architecture that supports it, but you first had to invest in a more
complex decoder. And in the OoO engine the difference may be gone (at
least on Intel CPUs).