Subject: Re: Computer architects leaving Intel...
From: already5chosen (at) *nospam* yahoo.com (Michael S)
Newsgroups: comp.arch
Date: 11. Sep 2024, 13:35:00
Organization: A noiseless patient Spider
Message-ID: <20240911153500.00005010@yahoo.com>
References: 1 2 3 4 5 6 7 8
User-Agent: Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
On Wed, 11 Sep 2024 13:07:33 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
> I do believe though that in reality it could be faster to use the
> branchy version, and let the branch predictors do their job instead
> of having to wait to evaluate all three terms:
>
> bool is_overlap(char *src, char *dst, size_t len)
> {
>     if (src < dst) {
>         return (src+len > dst);
>     }
>     return (dst+len > src);
> }
>
> Terje
I think that under the assumptions that overlaps are very rare and that
we have a wide OoO CPU, a one-branch solution would be faster than
multiple branches.
Assuming Windows x64 calling conventions (dst==RCX, src==RDX, len==R8)
and using the algorithm that I posted during the night:
  lea  rax, [rcx+r8]  ; rax = dst+len
  lea  r9,  [rdx+r8]  ; r9  = src+len
  cmp  rdx, rax
  setb al             ; al  = src < dst+len
  cmp  rcx, r9
  setb r9b            ; r9b = dst < src+len
  cmp  al, r9b        ; equal only when both conditions hold
  je   handle_overlap
  ; there is no overlap
The important observation here is that for as long as the branch
predictor correctly predicts that the branch is not taken, none of the
preceding calculations are on the critical latency path. So the fact
that there are 7 instructions before the branch, which together have a
latency of ~4 clocks, does not matter.
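For anyone who prefers C to asm, here is a minimal sketch of the same one-branch test. The function name is_overlap_1br and the conversion to uintptr_t are my own assumptions for illustration (flat address space, no wraparound of ptr+len); a compiler is free to emit something different from the sequence above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: evaluates both comparisons unconditionally, so the
   caller needs only one branch on the result.
   Assumes a flat address space and that ptr + len does not wrap. */
static bool is_overlap_1br(const char *src, const char *dst, size_t len)
{
    uintptr_t s = (uintptr_t)src;
    uintptr_t d = (uintptr_t)dst;
    bool a = s < d + len;   /* corresponds to: setb al  */
    bool b = d < s + len;   /* corresponds to: setb r9b */
    /* For len > 0 at least one of a, b is true, so a == b means both are
       true, i.e. the ranges overlap. With len == 0 and src == dst both
       are false, so that degenerate case is also reported as overlap. */
    return a == b;
}

/* Small self-check on offsets within one buffer. */
static void is_overlap_1br_selfcheck(void)
{
    static char buf[32];
    assert(is_overlap_1br(buf, buf + 4, 8));   /* dst starts inside src */
    assert(!is_overlap_1br(buf, buf + 8, 8));  /* ranges just touch     */
}
```

The caller would then write if (is_overlap_1br(src, dst, len)) { ... } and that if is the only branch the predictor has to get right.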
On the other hand, in your branchy variant the second branch is easy
to predict, but the first branch, if (src < dst), is not necessarily easy.