Subject : Re: What integer C type to use
From : already5chosen (at) *nospam* yahoo.com (Michael S)
Newsgroups : comp.arch
Date : 12. Mar 2024, 18:49:18
Organisation : A noiseless patient Spider
Message-ID : <20240312194918.00002cde@yahoo.com>
User-Agent : Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
On Tue, 12 Mar 2024 17:18:36 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
Michael S wrote:
On Tue, 12 Mar 2024 11:14:47 +0100
David Brown <david.brown@hesbynett.no> wrote:
You use the word vector where you mean SIMD.
Yes, I was using the word somewhat interchangeably, as I was
talking in general terms. Perhaps I should have been more
precise. I know this thread talked about "Cray style vectors",
but I thought this branch had diverged - I don't know anywhere
near enough about the details of Cray machines to talk much about
them.
Even for Cray/NEC-style vectors, the same throughput for different
precision is not a universal property. Cray's and NEC's vector
processors happen to be designed like that, but one can easily
imagine vector processors of similar style that have 2 or even 3
times higher throughput for SP vs DP.
I have personally never encountered such machines, but I would be
surprised if none was ever built and sold by one or another of the
usual suspects (maybe Fujitsu?) back in the days when designers liked
Cray's style.
While theoretically possible, they did not do this because both halves
of a 2×SP would not necessarily arrive from memory simultaneously.
{Consider a gather load: you need a vector of addresses 2× as long
for pairs of SP going into a single vector register element.}
Doctor, it hurts when I do this!
So, what prevents you from simply not providing gather with resolution
below 64 bits?
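
To make the gather argument concrete, here is a minimal C sketch. It is a
scalar software model, not any real machine's ISA; the names `sp_pair`,
`gather_sp_pairs_32`, `gather_sp_pairs_64` and the vector length `VL` are
hypothetical. With 32-bit gather resolution each 64-bit register element
needs two independent indices; with 64-bit resolution one index fetches
both halves together.

```c
#include <assert.h>
#include <stddef.h>

#define VL 4  /* hypothetical vector length, in 64-bit elements */

/* One 64-bit vector register element holding two packed SP values. */
typedef struct { float lo, hi; } sp_pair;

/* 32-bit resolution: every SP half has its own index, so the index
   vector is 2*VL long and the two halves are separate memory accesses. */
static void gather_sp_pairs_32(sp_pair dst[VL], const float *mem,
                               const size_t idx[2 * VL])
{
    assert(dst != NULL && mem != NULL && idx != NULL);
    for (size_t i = 0; i < VL; i++) {
        dst[i].lo = mem[idx[2 * i]];
        dst[i].hi = mem[idx[2 * i + 1]];
    }
}

/* 64-bit resolution: one index per element, counted in 64-bit units,
   so both halves come from the same access. */
static void gather_sp_pairs_64(sp_pair dst[VL], const float *mem,
                               const size_t idx[VL])
{
    assert(dst != NULL && mem != NULL && idx != NULL);
    for (size_t i = 0; i < VL; i++) {
        dst[i].lo = mem[2 * idx[i]];
        dst[i].hi = mem[2 * idx[i] + 1];
    }
}
```

In this model the second variant needs half as many addresses and never
has to wait for two unrelated accesses to fill one element.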
Which, of course, leaves the question of what property makes vector
processor Cray-style. Just having ALU/FPU several times narrower
than VR is, IMHO, not enough to be considered Cray-style.
That property is that the length of the vector register is chosen to
absorb the latency to memory. SIMD is too short to have this property.
I don't like this definition at all.
For starters, what is "memory"? Does the L1D cache count, or only L2 and
higher?
Then, what is "absorb"? Is the whole vector register file part of the
absorbent, or should the latency be covered by a single register? Is OoO
machinery part of the absorbent? Is HW threading part of the absorbent?
And for any of your possible answers I have my "Why?".
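
For reference, the narrowest reading of the quoted definition (a single
register, no OoO, no HW threading, and "memory" meaning whatever level the
latency figure is measured to) reduces to simple arithmetic. The sketch
below is only that reading; the function name `min_vl_to_absorb` and all
of its parameters are hypothetical.

```c
#include <assert.h>

/* Smallest vector length (in elements) for which a single vector
   instruction keeps a unit with `lanes` parallel lanes busy for at
   least `mem_latency_cycles` cycles. Assumes the unit retires `lanes`
   elements per cycle and ignores startup overhead. */
static unsigned min_vl_to_absorb(unsigned lanes, unsigned mem_latency_cycles)
{
    assert(lanes > 0);
    return lanes * mem_latency_cycles;
}
```

Under these assumptions a one-lane unit facing a 64-cycle latency needs
64-element registers, while a 16-lane unit facing 50 cycles would need
800 elements. Every other answer to the questions above changes the
formula.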
In my book, the critical distinction is that at least one size of
partial (chopped) non-load-store vector operation has higher
throughput (and hopefully, but not necessarily, lower latency) than
the full vector operation of the same type.
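
As a rough illustration of this criterion, the sketch below models the
cycles one vector operation occupies the unit as ceil(vl / lanes). The
functions `op_cycles` and `partial_is_faster` and any numbers plugged
into them are assumptions for illustration, not figures for a real
machine.

```c
#include <assert.h>

/* Cycles a vector unit with `lanes` parallel lanes is occupied by an
   operation on `vl` elements: ceil(vl / lanes). Startup latency is
   ignored. */
static unsigned op_cycles(unsigned vl, unsigned lanes)
{
    assert(lanes > 0);
    return (vl + lanes - 1) / lanes;
}

/* Nonzero if a chopped operation on `partial_vl` elements occupies the
   unit for fewer cycles than the full operation on `full_vl` elements,
   i.e. partial operations have higher throughput. */
static int partial_is_faster(unsigned full_vl, unsigned partial_vl,
                             unsigned lanes)
{
    return op_cycles(partial_vl, lanes) < op_cycles(full_vl, lanes);
}
```

With 64-element registers and a 16-lane datapath, an operation chopped
to 16 elements takes 1 cycle instead of 4, so the test passes. With a
datapath as wide as the register (64 lanes), both take 1 cycle and the
test fails, which is the usual SIMD situation.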