Re: auto predicating branches

Liste des GroupesRevenir à c arch 
Sujet : Re: auto predicating branches
De : anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Groupes : comp.arch
Date : 22. Apr 2025, 18:31:03
Autres entêtes
Organisation : Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID : <2025Apr22.193103@mips.complang.tuwien.ac.at>
References : 1 2 3 4 5 6 7 8 9 10 11 12
User-Agent : xrn 10.11
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote:
Compilers certainly can perform if-conversion, and there have been
papers about that for at least 30 years, but compilers do not know
when if-conversion is profitable.  E.g., "branchless" (if-converted)
binary search can be faster than a branching one when the searched
array is cached at a close-enough level, but slower when most of the
accesses miss the close caches
<https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.
>
I'm missing something because the first answer by BeeOnRope looks wrong.
Unfortunately they don't show both pieces of code and the asm,
but a branching binary search to me looks just as load data dependent as it
recalculates middle each iteration and the array index is array[middle].

Here's the branchless version:

while (n > 1) {
  int middle = n/2;
  base += (needle < *base[middle]) ? 0 : middle;
  n -= middle;
}

The branching version would then be:

while (n > 1) {
  int middle = n/2;
  if (needle >= *base[middle])
     base +=  middle;
  n -= middle;
}

In the branching version we have the following loop recurrences:

1) middle = n/2; n -= middle
2) base +=  middle;

Recurrence 1 costs a few cycles per iteration, ideally 2.  Recurrence
2 costs even less.  There is no load in the loop recurrences.  The
load is used only for verifying the branch prediction.  And in
particular, the load does not data-depend on any previous load.

If branch prediction is always correct, the branching version builds
an instruction queue that is a long chain of scaled indexed loads
where the load index is data dependent on the prior iteration load value.

That's certainly not the case here.

Even if we assume that the branch predictor misses half of the time
(which is realistic), that means that the next load and the load of
all followig correctly predicted ifs just have to wait for the load of
the mispredicted branch plus the misprediction penalty.  Which means
that on average two loads will happen in parallel, while the
branchless version only performs dependent loads.  If the loads cost
more than a branch misprediction (because of cache misses), the
branching version is faster.

One can now improve the performance by using branchless for the first
few levels (which are cached), and adding prefetching and a mixture of
branchless and branchy for the rest, but for understanding the
principle the simple variants are best.

[*] I want to see the asm because Intel's CMOV always executes the
operand operation, then tosses the result if the predicate is false.

That's the usual thing for conditional execution/predication.  But
here 0 has no dependencies and middle has only cheap dependencies, the
only expensive part is the load for the condition that turns into a
data dependency in the branchless version.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Date Sujet#  Auteur
4 Oct 24 * Re: Tonights Tradeoff - Background Execution Buffers78Robert Finch
4 Oct 24 +* Re: Tonights Tradeoff - Background Execution Buffers75Anton Ertl
4 Oct 24 i`* Re: Tonights Tradeoff - Background Execution Buffers74Robert Finch
5 Oct 24 i `* Re: Tonights Tradeoff - Background Execution Buffers73Anton Ertl
9 Oct 24 i  `* Re: Tonights Tradeoff - Background Execution Buffers72Robert Finch
9 Oct 24 i   +* Re: Tonights Tradeoff - Background Execution Buffers3MitchAlsup1
9 Oct 24 i   i+- Re: Tonights Tradeoff - Background Execution Buffers1Robert Finch
12 Oct 24 i   i`- Re: Tonights Tradeoff - Background Execution Buffers1BGB
12 Oct 24 i   +* Re: Tonights Tradeoff - Carry and Overflow67Robert Finch
12 Oct 24 i   i`* Re: Tonights Tradeoff - Carry and Overflow66MitchAlsup1
12 Oct 24 i   i `* Re: Tonights Tradeoff - Carry and Overflow65BGB
12 Oct 24 i   i  `* Re: Tonights Tradeoff - Carry and Overflow64Robert Finch
13 Oct 24 i   i   +* Re: Tonights Tradeoff - Carry and Overflow3MitchAlsup1
13 Oct 24 i   i   i`* Re: Tonights Tradeoff - ATOM2Robert Finch
13 Oct 24 i   i   i `- Re: Tonights Tradeoff - ATOM1MitchAlsup1
13 Oct 24 i   i   +- Re: Tonights Tradeoff - Carry and Overflow1BGB
31 Oct 24 i   i   `* Page fetching cache controller59Robert Finch
31 Oct 24 i   i    +- Re: Page fetching cache controller1MitchAlsup1
6 Nov 24 i   i    `* Re: Q+ Fibonacci57Robert Finch
17 Apr 25 i   i     `* Re: register sets56Robert Finch
17 Apr 25 i   i      +* Re: register sets53Stephen Fuld
17 Apr 25 i   i      i+- Re: register sets1Robert Finch
17 Apr 25 i   i      i+* Re: register sets46MitchAlsup1
18 Apr 25 i   i      ii`* Re: register sets45Robert Finch
18 Apr 25 i   i      ii `* Re: register sets44MitchAlsup1
20 Apr 25 i   i      ii  `* Re: register sets43Robert Finch
21 Apr 25 i   i      ii   `* Re: auto predicating branches42Robert Finch
21 Apr 25 i   i      ii    `* Re: auto predicating branches41Anton Ertl
21 Apr 25 i   i      ii     +- Is an instruction on the critical path? (was: auto predicating branches)1Anton Ertl
21 Apr 25 i   i      ii     `* Re: auto predicating branches39MitchAlsup1
22 Apr 25 i   i      ii      `* Re: auto predicating branches38Anton Ertl
22 Apr 25 i   i      ii       +- Re: auto predicating branches1MitchAlsup1
22 Apr 25 i   i      ii       `* Re: auto predicating branches36Anton Ertl
22 Apr 25 i   i      ii        `* Re: auto predicating branches35MitchAlsup1
23 Apr 25 i   i      ii         +* Re: auto predicating branches3Stefan Monnier
23 Apr 25 i   i      ii         i`* Re: auto predicating branches2Anton Ertl
25 Apr 25 i   i      ii         i `- Re: auto predicating branches1MitchAlsup1
23 Apr 25 i   i      ii         `* Re: auto predicating branches31Anton Ertl
23 Apr 25 i   i      ii          `* Re: auto predicating branches30MitchAlsup1
24 Apr 25 i   i      ii           `* Re: asynch register rename29Robert Finch
27 Apr 25 i   i      ii            `* Re: fractional PCs28Robert Finch
27 Apr 25 i   i      ii             `* Re: fractional PCs27MitchAlsup1
28 Apr 25 i   i      ii              `* Re: fractional PCs26Robert Finch
28 Apr 25 i   i      ii               +* Re: fractional PCs15MitchAlsup1
29 Apr 25 i   i      ii               i`* Re: fractional PCs14Robert Finch
5 May 25 i   i      ii               i `* Re: control co-processor13Robert Finch
5 May 25 i   i      ii               i  `* Re: control co-processor12Al Kossow
5 May 25 i   i      ii               i   `* Re: control co-processor11Stefan Monnier
6 May 25 i   i      ii               i    +* Re: control co-processor3MitchAlsup1
7 May 25 i   i      ii               i    i+- Re: control co-processor1MitchAlsup1
15 Jul 25 i   i      ii               i    i`- Re: control co-processor1MitchAlsup1
7 May 25 i   i      ii               i    `* Scan chains (was: control co-processor)7Stefan Monnier
7 May 25 i   i      ii               i     +* Re: Scan chains (was: control co-processor)2Al Kossow
7 May 25 i   i      ii               i     i`- Re: Scan chains1Stefan Monnier
7 May 25 i   i      ii               i     +* Re: Scan chains3MitchAlsup1
7 May 25 i   i      ii               i     i`* Re: Scan chains2Stefan Monnier
8 May 25 i   i      ii               i     i `- Re: Scan chains1MitchAlsup1
15 Jul 25 i   i      ii               i     `- Re: Scan chains1MitchAlsup1
29 Apr 25 i   i      ii               `* Re: fractional PCs10Robert Finch
29 Apr 25 i   i      ii                `* Re: fractional PCs9MitchAlsup1
30 Apr 25 i   i      ii                 `* Re: fractional PCs8Robert Finch
30 Apr 25 i   i      ii                  +* Re: fractional PCs6Thomas Koenig
1 May 25 i   i      ii                  i+- Re: fractional PCs1Robert Finch
2 May 25 i   i      ii                  i`* Re: fractional PCs4moi
2 May 25 i   i      ii                  i +* Re: millicode, extracode, fractional PCs2John Levine
2 May 25 i   i      ii                  i i`- Re: millicode, extracode, fractional PCs1moi
2 May 25 i   i      ii                  i `- Re: fractional PCs1moi
30 Apr 25 i   i      ii                  `- Re: fractional PCs1MitchAlsup1
15 Jul 25 i   i      i`* Re: register sets5John Savard
15 Jul 25 i   i      i `* Re: register sets4MitchAlsup1
19 Jul 25 i   i      i  `* Re: register sets3Robert Finch
19 Jul 25 i   i      i   `* Re: register sets2Anton Ertl
19 Jul 25 i   i      i    `- Re: register sets1MitchAlsup1
15 Jul 25 i   i      `* Re: register sets2John Savard
15 Jul 25 i   i       `- Re: register sets1MitchAlsup1
13 Oct 24 i   `- Re: Tonights Tradeoff - Background Execution Buffers1Anton Ertl
4 Oct 24 +- Re: Tonights Tradeoff - Background Execution Buffers1BGB
6 Oct 24 `- Re: Tonights Tradeoff - Background Execution Buffers1MitchAlsup1

Haut de la page

Les messages affichés proviennent d'usenet.

NewsPortal