Subject: Re: What integer C type to use
From: mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Newsgroups: comp.arch
Date: 13 Mar 2024, 20:24:15
Organization: Rocksolid Light
Message-ID: <3186c85221c1baff1a23a46449dadd19@www.novabbs.org>
User-Agent: Rocksolid Light
Stefan Monnier wrote:
> OTOH, I am trying to discuss a vague notion of "Cray-style vectors". My
> intentions are to see what was applicable in more recent times and
> which ideas are not totally obsolete for a future.
> Another way to look at the difference between SSE-style vectors (which
> I'd call "short vectors") at the ISA level is the fact that SSE-style
> vector instructions are designed under the assumption that the latency
> of a vector instruction will be more or less the same as that of
> a non-vector instruction (i.e. you have enough ALUs to do all the
> operations at the same time),
In the early-to-mid 1980s there was a class of processor assist engines,
using the term Array-Processor, that performed a Cray-like vector operation
in the same latency as a scalar operation. CRAY streamed single operands
through FUs; array processors took entire <but shorter> vectors through
lanes of calculation.
I would call these "medium vector" to distinguish from (short vector)
SIMD and (long vector) CRAY {or just vector without qualifier}. So we
have::
SIMD <short> vector
ARRAY <medium> vector
CRAY <long> vector
CDC <memory> vector
> whereas Cray-style vector instructions
> (which we could call "long vectors") are designed under the assumption
> that the latency will be somewhat proportional to the length of the
> vector because the core of the CPU will only access a chunk of the
> vector at a time.
> So, short vectors have a fairly free hand at shuffling data across their
> vector (e.g. bitmatrix transpose), and they can be
> implemented/scheduled/dispatched just like any other instruction, but
> the vector length tends to be severely limited and exposed all over
> the place.
Consuming OpCode space like nobody's business.
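
To make the contrast concrete, a minimal C sketch: the SSE-style loop bakes
a 4-lane width into its intrinsics and opcodes, while the strip-mined loop
only asks how many elements the hardware will take this trip. Here
get_vector_length() is a hypothetical stand-in for a set-vector-length
instruction, not any real ISA's API.

#include <immintrin.h>   /* SSE intrinsics */
#include <stddef.h>

/* Short-vector (SSE) style: the 4-lane width is hard-wired into the code.
   Going to 8 or 16 lanes means new opcodes, new intrinsics, a new loop. */
void add_sse(float *restrict c, const float *restrict a,
             const float *restrict b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)               /* scalar tail */
        c[i] = a[i] + b[i];
}

/* Hypothetical stand-in for a set-vector-length instruction; 64 is just
   an assumed maximum VL, the loop below does not care what it returns. */
static size_t get_vector_length(size_t remaining)
{
    return remaining < 64 ? remaining : 64;
}

/* Long-vector style: the same code runs unchanged whether the machine's
   maximum VL is 64, 256, or anything else. */
void add_long_vector(float *restrict c, const float *restrict a,
                     const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ) {
        size_t vl = get_vector_length(n - i);
        for (size_t j = 0; j < vl; j++)  /* one "vector instruction" worth */
            c[i + j] = a[i + j] + b[i + j];
        i += vl;
    }
}
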
> In contrast long vectors usually depend on specialized implementations
> (e.g. chaining) to get good performance, but their length is
> easier/cheaper to change.
The only limitation is when one masks out beats of the vector from
calculation or memory referencing. This is what kept CRAY at 64-element
vectors. It was also the Achilles' heel of CRAY--once memory gets more
than 64 beats away, the length of the vector can no longer absorb the
latency to memory. NEC did not have this problem.
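
A back-of-the-envelope sketch of that point, under the simplifying
assumption that only one vector memory operation is outstanding at a time
and picking 100 beats of memory latency purely for illustration: a chained
functional unit is busy for VL beats out of every (latency + VL) beats, so a
64-element vector leaves it under 40% busy, while a longer NEC-style vector
(256 elements here) keeps it around 70% busy.

#include <stdio.h>

/* Rough model, an assumption for illustration only: with one vector
   memory operation outstanding at a time, a chained functional unit is
   busy for VL beats out of every (latency + VL) beats. */
static double utilization(double vl, double latency_beats)
{
    return vl / (latency_beats + vl);
}

int main(void)
{
    double latency = 100.0;      /* assumed memory latency, in beats */
    printf("VL =  64: %2.0f%% busy\n", 100.0 * utilization(64.0,  latency));
    printf("VL = 256: %2.0f%% busy\n", 100.0 * utilization(256.0, latency));
    return 0;
}
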
> AFAICT long vectors made sense when we could build machines with
> a memory bandwidth that was higher and ALUs were more expensive.
> Nowadays we tend to have the opposite.
While BW is important (very), it is latency that is crucial. Latency
to memory, measured in beats, must be smaller than the vector length.
> Also, the massive number of transistors we spend nowadays on OoO means
> that a good OoO CPU can dispatch individual non-vector instructions to
> ALUs just as well as the Cray did with its vectors with chaining.
Not "just as well" but "within spitting distance of"
> Stefan