Subject : Re: Is Intel exceptionally unsuccessful as an architecture designer?
From : paaronclayton (at) *nospam* gmail.com (Paul A. Clayton)
Groups : comp.arch
Date : 20. Sep 2024, 11:52:07
Organisation : A noiseless patient Spider
Message-ID : <vcjk4q$132af$1@dont-email.me>
User-Agent : Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.0
On 9/20/24 12:05 AM, Lawrence D'Oliveiro wrote:
> On Fri, 20 Sep 2024 00:58:44 +0000, MitchAlsup1 wrote:
>> Hint:: They can context switch every instruction.
> How does that help?
Multithreading provides memory-level parallelism, which turns a
latency problem into a bandwidth problem.
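As a back-of-the-envelope illustration (the numbers here are only
assumptions, not taken from any particular GPU), Little's law gives
the amount of data that must be in flight to cover the latency:

  data in flight ~ bandwidth x latency
  e.g. 500 GB/s x 400 ns ~ 200 KB
                         ~ 1,600 outstanding 128-byte accesses

A few thousand hardware threads, each with a single pending load,
supply that concurrency without any one thread having to hide the
latency by itself.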
>> So if an instruction does not complete in its cycle, they switch
>> to a different set of threads;
> That will need to do its own memory accesses. But the memory
> interface is still busy trying to complete the access for the
> previous thread.
This is _bandwidth_. From what I understand, GPUs also typically
have memory controllers optimized for throughput rather than
latency, with larger queue depth.
A memory channel is not busy with a single access for the entire
latency. Each DRAM chip/rank (bundle of chips) has multiple banks
that can be busy somewhat independently. With enough channels,
ranks, and banks (and sufficiently distributed utilization), high
bandwidth can be achieved.
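To make the interleaving concrete, here is a minimal sketch with a
completely made-up address mapping (real controllers hash and
swizzle these bits, and every product differs): 16 channels x
2 ranks x 16 banks with 128-byte access granularity, so consecutive
128-byte blocks rotate across channels and then banks.

#include <stdio.h>
#include <stdint.h>

typedef struct { unsigned channel, rank, bank; uint64_t row_col; } dram_addr;

/* Hypothetical physical-address decode for the made-up mapping above. */
static dram_addr decode(uint64_t pa) {
    dram_addr d;
    uint64_t block = pa >> 7;          /* 128-byte block number          */
    d.channel = block & 0xF;           /* low bits pick the channel      */
    d.bank    = (block >> 4) & 0xF;    /* next bits pick the bank        */
    d.rank    = (block >> 8) & 0x1;    /* then the rank                  */
    d.row_col = block >> 9;            /* rest addresses within the bank */
    return d;
}

int main(void) {
    /* A unit-stride sweep lands on a new channel every 128 bytes
       (and on a new bank every 2 KiB), keeping many banks busy.    */
    for (uint64_t pa = 0; pa < 8 * 128; pa += 128) {
        dram_addr d = decode(pa);
        printf("PA %5llu -> ch %2u bank %2u rank %u\n",
               (unsigned long long)pa, d.channel, d.bank, d.rank);
    }
    return 0;
}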
Many GPUs also have 128-byte cache lines/access granularity, which
helps with bandwidth. This size matches 32 SIMD lanes each making a
contiguous (unit-stride), aligned 4-byte access.
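A sketch of how that granularity lines up with the SIMD width
(CUDA-flavored, purely illustrative): a warp of 32 threads reading
consecutive floats maps onto exactly one such 128-byte line.

// Hypothetical kernel: each of the 32 threads in a warp loads one
// consecutive 4-byte float, so the warp's loads coalesce into a
// single aligned 128-byte access on typical GPUs.
__global__ void scale(const float *in, float *out, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unit-stride index
    if (i < n)
        out[i] = a * in[i];   // 32 lanes x 4 bytes = one 128-byte line
}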
>> Also note: a single instruction causes 32-128 threads to make
>> 1 step of forward progress.
> How many memory accesses does it take to complete that one step?
While GPUs support scatter/gather with high throughput (again
benefiting from latency tolerance), throughput is best with
unit-stride accesses. With a unit-stride stream, the 32 4-byte
accesses of one SIMD instruction coalesce into a single 128-byte
memory access.
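For contrast, a gather (again only a sketch): the same warp reading
through an index array can touch up to 32 different 128-byte lines,
so throughput depends on how the indices happen to cluster.

// Hypothetical gather kernel: the idx[] read itself is unit-stride
// and coalesces, but in[idx[i]] may scatter across up to 32 separate
// 128-byte lines per warp in the worst case.
__global__ void gather(const float *in, const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];
}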
Control divergence can also prevent all SIMD lanes from making
progress at the same time. I think more recent GPU designs are
reducing this penalty by allowing constrained superscalar execution,
with another instruction using otherwise unused lanes.
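For a concrete picture of the cost (again CUDA-flavored and purely
illustrative): when lanes of a warp take different paths, the
hardware runs the paths one after the other with part of the warp
masked off.

// Hypothetical divergent kernel: even lanes take one path and odd
// lanes the other, so the two branches execute serially and each
// leaves half of the 32 lanes idle.
__global__ void diverge(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x & 1) == 0)
        out[i] = in[i] * 2.0f;   // even lanes only; odd lanes masked
    else
        out[i] = in[i] + 1.0f;   // odd lanes only; even lanes masked
}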
Control divergence does not really apply to things like dense
large matrix multiplication.