On 4/23/2024 1:22 AM, Anton Ertl wrote:
> Lawrence D'Oliveiro <ldo@nz.invalid> writes:
>> On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:
>>> CRAY machines stayed "in style" as long as memory latency remained
>>> smaller than the length of a vector (64 cycles) and fell out of favor
>>> when the cores got fast enough that memory could no longer keep up.
> Mitch Alsup repeatedly makes this claim without giving any
> justification. Your question may shed some light on that.
>> So why would conventional short vectors work better, then? Surely the
>> latency discrepancy would be even worse for them.
> Thinking about it, they probably don't work better. They just don't
> work worse, so why spend area on 4096-bit vector registers like the
> Cray-1 did when 128-512-bit vector registers work just as well? Plus,
> they have 200 or so of these registers, so 4096-bit registers would be
> really expensive. How many vector registers does the Cray-1 (and its
> successors) have?
Yeah.
Or if you can already saturate the RAM bandwidth with 128-bit vectors, why go wider?...
Or, one may find that even the difference between 64 and 128 bit vectors goes away once one's working set exceeds 1/3 to 1/2 the size of the L1 cache.
Meanwhile, it remains an issue that wider vectors are more expensive.
Though, it is unclear whether even 64-bit is a clear win over 32-bit in terms of performance. Arguably, many uses of 64-bit could have been served by a primarily 32-bit machine that allows paired registers for things like memory addressing and similar.
Though, OTOH, 64/128 allows unifying GPRs, FPU, and SIMD, into a single register space.
Also, 32/64/128 bit splitting/pairing isn't really workable as it would end up needing 8 or 12 register read ports (so, would be more expensive than the "use 64-bit registers and effectively waste half the register for 32-bit operations" option).
Well, unless the number of register ports remain constant (with 64-bit ports), and the 32-bit registers are effectively faked (by splitting the registers in half, and merging halves on write-back). But, there is little obvious advantage to this over the "just waste half the register" option (and it would be more expensive than just wasting half the register).
> On modern machines, OoO machinery bridges the latency gap between the
> L2 cache, maybe even the L3 cache, and the core for data-parallel code.
> For the latency gap to main memory there are the hardware prefetchers,
> and they use the L1 or L2 cache as an intermediate buffer, while the
> Cray-1 and its follow-ons use vector registers.
On my current PC, while OoO does hide latency, one is hard-pressed to exceed roughly 4 GB/sec of memory bandwidth per core, though the overall system memory bandwidth seems to be higher.

Say, for an 8C/16T part (memcpy):
  Each core has a local peak of ~4 GB/s;
  The system total seems to be ~12-16 GB/s;
  Seemingly ~6-8 GB/s per group of 4 cores;
  Peak memcpy bandwidth (L1-local) being in the area of 24 GB/s.

Latency is hidden fairly well, granted, but that doesn't make as big of a difference if the task is bandwidth-limited.
> So what's the benefit of using vector/SIMD instructions at all rather
> than doing it with scalar code? A SIMD instruction that replaces n
> scalar instructions consumes fewer resources for instruction fetching,
> decoding, register renaming, administering the instruction in the OoO
> engine, and in retiring the instruction.
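As a concrete (hypothetical) illustration of that overhead argument, using the GCC/Clang generic vector extension rather than any particular ISA: one 128-bit add stands in for four scalar adds, so it is one instruction, not four, through fetch, decode, rename, and retire:

```c
#include <assert.h>

/* A 128-bit vector of 4 floats, via the GCC/Clang vector extension
 * (compiles down to SSE/NEON/etc. as the target allows). */
typedef float v4sf __attribute__((vector_size(16)));

/* Four scalar adds: four instructions' worth of fetch/decode/
 * rename/schedule/retire resources (absent auto-vectorization). */
void add4_scalar(float *dst, const float *a, const float *b)
{
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];
}

/* One vector add: the same arithmetic, one instruction through
 * the front end and the OoO bookkeeping. */
v4sf add4_vector(v4sf a, v4sf b)
{
    return a + b;
}
```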
In my case, with my custom CPU core:
The elements are packaged in a way to make them easier to work with, for either parallel or pipeline execution.
For the low-precision unit, it can work on all 4 at the same time, if 4 are available. This unit does Binary16 or (optionally) Binary32.
In the main FPU, the SIMD packaging allows the FPU to pipeline the operations despite the FPU having too high a latency to be pipelined normally.
The advantage of SIMD would be reduced if the pipeline were long enough to handle Binary64 values directly, but 6 EX stages would be asking a bit much (further increasing either pipeline length or width has a significant impact on the cost of the register-forwarding logic).
Arguably, one could have a separate (and longer) pipeline for FPU, but this would add complexity with a shared register space.
> So why not use SIMD instructions with longer vector registers? The
> progression from 128-bit SSE through AVX-256 to AVX-512 by Intel
> suggests that this is happening, but with every doubling the cost in
> area doubles while the returns diminish thanks to Amdahl's law.
> So at some point you stop. Intel introduced AVX-512 for Larrabee (a
> special-purpose machine), and now is backpedaling, with desktop, laptop
> and small-server CPUs (even though only the Golden/Raptor Cove cores
> are enabled on the small-server CPUs) only supporting AVX, and with
> AVX10 only guaranteeing 256-bit vector registers; so maybe 512-bit
> vector registers are already too costly for the benefit they give in
> general-purpose computing.
Yeah.
As I see it, 128-bit SIMD seems to be the local optimum; both 256 and 512 end up having more drawbacks than merits.
> Back to old-style vector processors. There have been machines that
> supported longer vector registers and AFAIK also memory-to-memory
> machines. The question is why have they not been the answer of the
> vector-processor community to the problem of covering the latency? Or
> maybe they have? AFAIK the NEC SX has been available in some form even in
> recent years, maybe still.
Going outside of Load/Store adds a lot of hair for comparably little benefit.
Like, technically, I could go Load-Op / Op-Store for a subset of operations, as I ended up with the logic to support it. But, it doesn't seem like it would bring enough benefit to really be worth it (would not improve code-density as they require 64-bit encodings in my case, and in most cases seem unlikely to bring a performance advantage either; and given some limitations of the WEXifier, using them might actually make performance worse by interfering with shuffle-and-bundle).
The main merit they would have is if the CPU were register-pressure limited, but in my case, with 64 GPRs, this isn't really the case either.
> Anyway, after thinking about this, the reason behind Mitch Alsup's
> statement is that in a
>   doall(load process store)
> computation (like what SIMD is good at), the loads precede the
> corresponding processing by the load latency (i.e., memory latency on
> the Cray machines). If your OoO capabilities are limited (and I think
> they are on the Cray machines), you cannot start the second iteration
> of the doall loop before the processing step of the first iteration
> has finished with the register. You can do a bit of software
> pipelining and software register renaming by transforming this into
>   load1 doall(load2 process1 store1 load1 process2 store2)
> but at some point you run out of vector registers.
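That transformation can be sketched in scalar C (a hypothetical illustration; the "load latency" here is just the dependence structure, not actual cycle counts):

```c
#include <assert.h>

/* Plain form: each iteration is load -> process -> store, so the
 * multiply waits out the full load latency every iteration. */
void scale_plain(long *a, int n, long k)
{
    for (int i = 0; i < n; i++) {
        long v = a[i];   /* load */
        v = v * k;       /* process */
        a[i] = v;        /* store */
    }
}

/* Software-pipelined form, as in
 *   load1 doall(load2 process1 store1 load1 process2 store2):
 * the next load is issued before the current value is processed,
 * hiding (some of) the load latency behind the multiply, at the
 * cost of an extra register copy. */
void scale_pipelined(long *a, int n, long k)
{
    if (n == 0)
        return;
    long v0 = a[0];                 /* load1 (prologue) */
    int i;
    for (i = 0; i + 1 < n; i++) {
        long v1 = a[i + 1];         /* load for iteration i+1 */
        a[i] = v0 * k;              /* process + store iteration i */
        v0 = v1;                    /* software register renaming */
    }
    a[i] = v0 * k;                  /* epilogue */
}
```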
> One thing that comes to mind is tracking individual parts of the
> vector registers, which allows starting the next iteration as soon
> as the first part of the vector register no longer has any readers.
> However, it's probably not that far off in complexity from tracking
> shorter vector registers in an OoO engine. And if you support
> exceptions (the Crays probably don't), this becomes messy, while with
> short vector registers it's easier to implement the (ISA)
> architecture.
As can be noted, SIMD is easy to implement.
Main obvious drawback is the potential for combinatorial explosions of instructions. One needs to keep a fairly careful watch over this.
Like, if one is faced with an NxN or NxM grid of possibilities, the naive strategy is "I will define an instruction for every possibility in the grid", but this is bad. It is more reasonable to devise a minimal set of instructions that allows any operation in the grid to be done within a reasonable number of instructions.
But, then again, I can also note that I axed things like packed-byte operations and saturating arithmetic, which are pretty much de facto standard in packed-integer SIMD.
Likewise, a lot of the gaps are filled in with specialized converter and helper ops. Even here, some conversion chains will require multiple instructions.
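For reference, the saturating behavior in question clamps to the representable range instead of wrapping; a scalar sketch of the signed 16-bit case (hypothetical, roughly what per-lane ops like x86 PADDSW do):

```c
#include <assert.h>
#include <stdint.h>

/* Saturating signed 16-bit add: clamp to [INT16_MIN, INT16_MAX]
 * instead of wrapping modulo 2^16. Doing the sum in 32 bits makes
 * the overflow check trivial. */
int16_t add_sat_i16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX)
        return INT16_MAX;
    if (s < INT16_MIN)
        return INT16_MIN;
    return (int16_t)s;
}
```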
Well, and if there is no practical difference between a scalar and SIMD version of an instruction, one may as well just use the SIMD version for scalar.
...
> - anton