Re: Short Vectors Versus Long Vectors

Subject: Re: Short Vectors Versus Long Vectors
From: cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups: comp.arch
Date: 23 Apr 2024, 22:23:33
Organization: A noiseless patient Spider
Message-ID: <v098so$1rp16$1@dont-email.me>
References: 1 2 3 4
User-Agent: Mozilla Thunderbird

On 4/23/2024 1:22 AM, Anton Ertl wrote:
> Lawrence D'Oliveiro <ldo@nz.invalid> writes:
>> On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:
>>>
>>> CRAY machines stayed "in style" as long as memory latency remained
>>> smaller than the length of a vector (64 cycles) and fell out of favor
>>> when the cores got fast enough that memory could no longer keep up.
>
> Mitch Alsup repeatedly makes this claim without giving any
> justification.  Your question may shed some light on that.
>
>> So why would conventional short vectors work better, then? Surely the
>> latency discrepancy would be even worse for them.
>
> Thinking about it, they probably don't work better.  They just don't
> work worse, so why spend area on 4096-bit vector registers like the
> Cray-1 did when 128-512-bit vector registers work just as well?  Plus,
> they have 200 or so of these registers, so 4096-bit registers would be
> really expensive.  How many vector registers does the Cray-1 (and its
> successors) have?

Yeah.
Or if you can already saturate the RAM bandwidth with 128-bit vectors, why go wider?...
Or, one may find that even the difference between 64 and 128 bit vectors goes away once one's working set exceeds 1/3 to 1/2 the size of the L1 cache.
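As a rough illustration of the "already saturated" point, here is a toy roofline-style model. It is only a sketch: the 1 GHz clock, the 16 GB/s bandwidth figure, and the one-vector-op-per-cycle assumption are all made up for the example, not measurements.

```python
def kernel_time_s(n_bytes, vector_bits, clock_hz, mem_bw_bytes_per_s):
    """Toy estimate: run time is the larger of compute time and memory time.

    Assumes (hypothetically) one vector operation per cycle, so the core can
    consume vector_bits / 8 bytes per cycle.
    """
    compute_s = n_bytes / ((vector_bits // 8) * clock_hz)
    memory_s = n_bytes / mem_bw_bytes_per_s
    return max(compute_s, memory_s)


if __name__ == "__main__":
    # Hypothetical machine: 1 GHz core, 16 GB/s sustainable memory bandwidth.
    for bits in (32, 64, 128, 256, 512):
        t = kernel_time_s(1 << 30, bits, 1e9, 16e9)
        print(f"{bits:4d}-bit vectors: {t * 1e3:7.1f} ms")
```

With these assumed numbers the estimate stops improving at 128 bits, because from that width on the memory term dominates.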
Meanwhile, it remains an issue that wider vectors are more expensive.
Though, it is unclear whether even 64-bit is a clear win over 32-bit in terms of performance. Arguably, many uses of 64-bit could have been served by a primarily 32-bit machine that allows paired registers for things like memory addressing and similar.
Though, OTOH, 64/128 allows unifying GPRs, FPU, and SIMD into a single register space.
Also, 32/64/128 bit splitting/pairing isn't really workable as it would end up needing 8 or 12 register read ports (so, would be more expensive than the "use 64-bit registers and effectively waste half the register for 32-bit operations" option).
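One plausible way to arrive at figures like "8 or 12" (this is a reading of the numbers, not something the post spells out): if the physical registers were 32 bits wide, a 128-bit operand would span four of them, so an instruction with two or three source operands would need four read ports per operand. The function name below is hypothetical.

```python
def read_ports_needed(operand_bits, physical_reg_bits, source_operands):
    """Read ports needed to fetch all source operands in a single cycle.

    An operand wider than a physical register is assembled from several
    registers, each needing its own read port (ceiling division).
    """
    regs_per_operand = -(-operand_bits // physical_reg_bits)
    return regs_per_operand * source_operands


if __name__ == "__main__":
    for sources in (2, 3):
        ports_32 = read_ports_needed(128, 32, sources)
        ports_64 = read_ports_needed(128, 64, sources)
        print(f"{sources} sources: {ports_32} ports (32-bit regs), "
              f"{ports_64} ports (64-bit regs)")
```

This counts a single instruction; a core that issues several instructions per cycle would multiply the port count further.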
Well, unless the number of register ports remains constant (with 64-bit ports), and the 32-bit registers are effectively faked (by splitting the registers in half, and merging halves on write-back). But, there is little obvious advantage to this over the "just waste half the register" option, and it would be more expensive.
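A minimal sketch of what "faking" 32-bit registers on top of 64-bit ones could look like, modeled on plain integers. The helper names are hypothetical. The point to notice is that writing one half turns write-back into a read-modify-write of the full register, which is where the extra cost comes from.

```python
MASK32 = 0xFFFF_FFFF


def read_half(reg64, high):
    """Extract the low or high 32-bit half of a 64-bit register value."""
    return (reg64 >> 32) & MASK32 if high else reg64 & MASK32


def write_half(reg64, high, value32):
    """Merge a 32-bit result into one half, preserving the other half."""
    value32 &= MASK32
    if high:
        return (reg64 & MASK32) | (value32 << 32)
    return (reg64 & (MASK32 << 32)) | value32


if __name__ == "__main__":
    r = 0x1111_1111_2222_2222
    r = write_half(r, high=True, value32=0xAAAA_AAAA)
    print(hex(r))                      # high half replaced, low half kept
    print(hex(read_half(r, high=False)))
```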

> On modern machines OoO machinery bridges the latency gap between the
> L2 cache, maybe even the L3 cache and the core for data-parallel code.
> For the latency gap to main memory there are the hardware prefetchers,
> and they use the L1 or L2 cache as intermediate buffer, while the
> Cray-1 and followons use vector registers.

On my current PC, while it hides latency, one is hard-pressed to exceed roughly 4GB/sec of memory bandwidth (per core), though the overall system memory bandwidth seems to be higher.
   Say: 8C/16T (memcpy)
     Each core has a local peak of ~ 4GB/s;
     System seems to be ~ 12-16 GB/s
       Seemingly ~ 6-8 GB/s per group of 4 cores.
Peak memcpy bandwidth (L1 local) being in the area of 24 GB/s.
Latency is hidden fairly well, granted, but doesn't make as big of a difference if the task is bandwidth limited.
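The figures above are roughly consistent with a simple two-level limit: each core tops out at about 4 GB/s, and each group of 4 cores at about 6-8 GB/s. The sketch below encodes that guess; the limit structure and the "fill one group before the next" policy are assumptions for illustration, not something measured.

```python
def aggregate_bandwidth_gbs(active_cores, per_core_gbs=4.0,
                            per_group_gbs=8.0, cores_per_group=4):
    """Estimate total copy bandwidth when cores are filled group by group."""
    total = 0.0
    remaining = active_cores
    while remaining > 0:
        in_group = min(remaining, cores_per_group)
        # A group delivers the sum of its cores, capped by the group limit.
        total += min(in_group * per_core_gbs, per_group_gbs)
        remaining -= in_group
    return total


if __name__ == "__main__":
    for cores in (1, 2, 4, 8):
        low = aggregate_bandwidth_gbs(cores, per_group_gbs=6.0)
        high = aggregate_bandwidth_gbs(cores, per_group_gbs=8.0)
        print(f"{cores} core(s): {low:.0f}-{high:.0f} GB/s")
```

Under these assumptions, 8 cores in two groups land at 12-16 GB/s, in line with the system-wide estimate quoted above.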

> So what's the benefit of using vector/SIMD instructions at all rather
> than doing it with scalar code?  A SIMD instruction that replaces n
> scalar instructions consumes fewer resources for instruction fetching,
> decoding, register renaming, administering the instruction in the OoO
> engine, and in retiring the instruction.
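To put a number on "replaces n scalar instructions": the count of dynamic instructions the front end has to fetch, decode, rename, track, and retire shrinks with the lane count. A small sketch, assuming 32-bit elements and one operation per element (both assumptions for the example):

```python
def instructions_needed(n_elements, lanes):
    """Dynamic instructions needed to apply one operation to every element."""
    return -(-n_elements // lanes)  # ceiling division


if __name__ == "__main__":
    n = 1024  # hypothetical array of 32-bit elements
    for name, lanes in (("scalar", 1), ("128-bit", 4),
                        ("256-bit", 8), ("512-bit", 16)):
        print(f"{name:>8}: {instructions_needed(n, lanes)} instructions")
```

This only counts instructions; it says nothing about whether the data can be delivered fast enough to keep them busy.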
 
In my case, with my custom CPU core:
The elements are packaged in a way to make them easier to work with, for either parallel or pipeline execution.
For the low-precision unit, it can work on all 4 at the same time, if 4 are available. This unit does Binary16 or (optionally) Binary32.
In the main FPU, the SIMD packaging allows the FPU to pipeline the operations despite the FPU having too high of a latency to be pipelined normally.
The advantage of SIMD would be reduced if the pipeline were long enough to handle Binary64 values directly, but 6 EX stages would be asking a bit much (further increasing either pipeline length or width having a significant impact on the cost of the register-forwarding logic).
Arguably, one could have a separate (and longer) pipeline for FPU, but this would add complexity with a shared register space.
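One way to read the pipelining argument is as a cycle count. Suppose the FPU has a latency of several cycles and a scalar operation occupies it for the whole latency, whereas the elements of one packed operation can enter on consecutive cycles. The 6-cycle latency and 4 lanes below are hypothetical values, and the model is a simplification rather than a description of the actual core.

```python
def scalar_cycles(n_ops, latency):
    """Non-pipelined unit: each op occupies the unit for its full latency."""
    return n_ops * latency


def packed_cycles(n_ops, latency, lanes):
    """Elements of one packed op enter the unit on consecutive cycles."""
    packed_ops = -(-n_ops // lanes)  # ceiling division
    return packed_ops * (latency + lanes - 1)


if __name__ == "__main__":
    latency, lanes = 6, 4  # assumed values for illustration
    for n in (4, 16, 64):
        print(f"{n} ops: scalar {scalar_cycles(n, latency)} cycles, "
              f"packed {packed_cycles(n, latency, lanes)} cycles")
```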

> So why not use SIMD instructions with longer vector registers?  The
> progression from 128-bit SSE through AVX-256 to AVX-512 by Intel
> suggests that this is happening, but with every doubling the cost in
> area doubles but the returns are diminishing thanks to Amdahl's law.
> So at some point you stop.  Intel introduced AVX-512 for Larrabee (a
> special-purpose machine), and now is backpedaling with desktop, laptop
> and small-server CPUs (even though only the Golden/Raptor Cove cores
> are enabled on the small-server CPUs) only supporting AVX, and with
> AVX10 only guaranteeing 256-bit vector registers, so maybe 512-bit
> vector registers are already too costly for the benefit they give in
> general-purpose computing.
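The diminishing-returns argument can be made concrete with Amdahl's law. In the sketch below, `vector_fraction` is the share of run time that benefits from wider vectors and `width_factor` is the width relative to the baseline; the 0.8 used in the example is an arbitrary assumption.

```python
def amdahl_speedup(vector_fraction, width_factor):
    """Overall speedup when only vector_fraction of the time scales with width."""
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / width_factor)


if __name__ == "__main__":
    p = 0.8  # assumed vectorizable fraction
    previous = 1.0
    for width in (1, 2, 4, 8, 16):
        s = amdahl_speedup(p, width)
        print(f"width x{width:<2}: speedup {s:.2f} "
              f"(x{s / previous:.2f} over the previous step)")
        previous = s
```

Each doubling of width (and, per the quoted text, of area) buys a smaller relative improvement, and the speedup can never exceed 1 / (1 - p).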
 
Yeah.
As I see it, in general, 128-bit SIMD seems to be the local optimum.
Both 256 and 512 end up having more drawbacks than merits.

> Back to old-style vector processors.  There have been machines that
> supported longer vector registers and AFAIK also memory-to-memory
> machines.  The question is why have they not been the answer of the
> vector-processor community to the problem of covering the latency?  Or
> maybe they have?  AFAIK NEC SX has been available in some form even in
> recent years, maybe still.

Going outside of Load/Store adds a lot of hair for comparatively little benefit.
Like, technically, I could go Load-Op / Op-Store for a subset of operations, as I ended up with the logic to support it. But, it doesn't seem like it would bring enough benefit to really be worth it (would not improve code-density as they require 64-bit encodings in my case, and in most cases seem unlikely to bring a performance advantage either; and given some limitations of the WEXifier, using them might actually make performance worse by interfering with shuffle-and-bundle).
The main merit they would have is if the CPU were register-pressure limited, but in my case, with 64 GPRs, this isn't really the case either.

> Anyway, after thinking about this, the reason behind Mitch Alsup's
> statement is that in a
>
>   doall(load process store)
>
> computation (like what SIMD is good at), the loads precede the
> corresponding processing by the load latency (i.e., memory latency on
> the Cray machines).  If your OoO capabilities are limited (and I think
> they are on the Cray machines), you cannot start the second iteration
> of the doall loop before the processing step of the first iteration
> has finished with the register.  You can do a bit of software
> pipelining and software register renaming by transforming this into
>
>   load1 doall(load2 process1 store1 load1 process2 store2)
>
> but at some point you run out of vector registers.
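The transformation above can be written out as ordinary code to check that it computes the same thing while issuing each load one step ahead of its use. This is only a sketch: `VL`, the `process` stand-in, and the two "registers" `r1` and `r2` are hypothetical, and since Python runs sequentially it shows the ordering and the register demand, not any speedup.

```python
VL = 4  # hypothetical vector length (elements per vector register)


def load(mem, i):
    return mem[i * VL:(i + 1) * VL]


def process(v):
    return [2 * x + 1 for x in v]  # stand-in for the real computation


def store(out, i, v):
    out[i * VL:(i + 1) * VL] = v


def doall_naive(mem, n_strips):
    """doall(load process store): one vector register, no overlap."""
    out = [0] * (n_strips * VL)
    for i in range(n_strips):
        store(out, i, process(load(mem, i)))
    return out


def doall_pipelined(mem, n_strips):
    """load1 doall(load2 process1 store1 load1 process2 store2).

    Assumes n_strips is even and at least 2 to keep the sketch short.
    """
    out = [0] * (n_strips * VL)
    r1 = load(mem, 0)                   # prologue: load1
    for i in range(0, n_strips, 2):
        r2 = load(mem, i + 1)           # load2 issued before process1
        store(out, i, process(r1))      # process1, store1
        if i + 2 < n_strips:
            r1 = load(mem, i + 2)       # load1 for the next trip
        store(out, i + 1, process(r2))  # process2, store2
    return out


if __name__ == "__main__":
    mem = list(range(16))
    print(doall_pipelined(mem, 4) == doall_naive(mem, 4))
```

Covering a longer load latency this way means more loads in flight, hence more live vector registers, which is the "run out of vector registers" limit.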
>
> One thing that comes to mind is tracking individual parts of the
> vector registers, which allows starting the next iteration as soon
> as the first part of the vector register no longer has any readers.
> However, it's probably not that far off in complexity from tracking
> shorter vector registers in an OoO engine.  And if you support
> exceptions (the Crays probably don't), this becomes messy, while with
> short vector registers it's easier to implement the (ISA)
> architecture.
 
As can be noted, SIMD is easy to implement.
Main obvious drawback is the potential for combinatorial explosions of instructions. One needs to keep a fairly careful watch over this.
Like, if one is faced with an NxN or NxM grid of possibilities, the naive strategy is to be like "I will define an instruction for every possibility in the grid.", but this is bad. It is more reasonable to devise a minimal set of instructions that will allow the operation to be done within a reasonable number of instructions.
But, then again, I can also note that I axed things like packed-byte operations and saturating arithmetic, which are pretty much de-facto in packed-integer SIMD.
Likewise, a lot of the gaps are filled in with specialized converter and helper ops. Even here, some conversion chains will require multiple instructions.
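To illustrate the grid problem with format conversions: a dedicated instruction for every ordered pair of N formats grows quadratically, while converting only to and from one "hub" format grows linearly at the cost of an occasional two-step chain. The hub scheme and the format names below are hypothetical illustrations, not a description of the actual ISA.

```python
def direct_converter_count(n_formats):
    """One dedicated instruction for every ordered (source, destination) pair."""
    return n_formats * (n_formats - 1)


def hub_converter_count(n_formats):
    """Converters only to and from a single hub format."""
    return 2 * (n_formats - 1)


def conversion_chain(src, dst, hub):
    """Steps, as (from, to) pairs, needed when only hub converters exist."""
    if src == dst:
        return []
    if src == hub or dst == hub:
        return [(src, dst)]
    return [(src, hub), (hub, dst)]


if __name__ == "__main__":
    n = 6  # hypothetical number of packed formats
    print(direct_converter_count(n), "direct vs",
          hub_converter_count(n), "hub-based converters")
    print(conversion_chain("fp16", "int32", hub="fp32"))
```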
Well, and if there is no practical difference between a scalar and SIMD version of an instruction, one may as well just use the SIMD version for scalar.
...

> - anton
