Subject: Re: auto predicating branches
From: mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Newsgroups: comp.arch
Date: 22 Apr 2025, 23:32:40
Organization: Rocksolid Light
Message-ID: <f5e5bf81ac2c7e2066d2a181c5a70baf@www.novabbs.org>
References: 1 2 3 4 5 6 7 8 9 10 11 12 13
User-Agent: Rocksolid Light
On Tue, 22 Apr 2025 17:31:03 +0000, Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote:
Compilers certainly can perform if-conversion, and there have been
papers about that for at least 30 years, but compilers do not know
when if-conversion is profitable. E.g., "branchless" (if-converted)
binary search can be faster than a branching one when the searched
array is cached at a close-enough level, but slower when most of the
accesses miss the close caches
<https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.
>
I'm missing something, because the first answer by BeeOnRope looks wrong.
Unfortunately they don't show both pieces of code and the asm, but a
branching binary search looks just as load-data-dependent to me, as it
recalculates middle each iteration and the array index is array[middle].
>
Here's the branchless version:
>
while (n > 1) {
    int middle = n/2;
    base += (needle < base[middle]) ? 0 : middle;
    n -= middle;
}
loop:
    SRA    Rm,Rn,#1      // Rm = middle = n/2 ; Rn in
    LDD    Rl,[Rb,Rm<<3] // Rl = base[middle] ; Rb in
    CMP    R7,Rk,Rl      // Rk holds needle
    CMOVLT R7,#0,Rm      // Rm = 0 if needle < base[middle]
    ADD    Rb,Rb,Rm      // base += middle or 0 ; Rb out (element scaling elided)
    ADD    Rn,Rn,-Rm     // n -= middle ; Rn out
    BR     loop          // loop-exit test elided
>
The branching version would then be:
>
while (n > 1) {
    int middle = n/2;
    if (needle >= base[middle])
        base += middle;
    n -= middle;
}
loop:
    SRA    Rm,Rn,#1      // Rm = middle = n/2 ; Rn in
    LDD    Rl,[Rb,Rm<<3] // Rl = base[middle] ; Rb in
    CMP    R7,Rk,Rl      // Rk holds needle
    PGE    R7,T          // predicate next inst on needle >= base[middle]
    ADD    Rb,Rb,Rm      // base += middle ; Rb conditionally out (element scaling elided)
    ADD    Rn,Rn,-Rm     // n -= middle ; Rn out
    BR     loop          // loop-exit test elided
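For concreteness, here is a complete, compilable rendering of the two C loops above (a sketch: the element type int64_t is my choice to match the 8-byte <<3 scaling of the LDD, and the function names are mine, not from the thread):

```c
#include <stddef.h>
#include <stdint.h>

/* Branchless variant: the ?: is expected to compile to a conditional
 * select, so each iteration's 'base' depends on the previous load. */
static const int64_t *search_branchless(const int64_t *base, size_t n,
                                        int64_t needle)
{
    while (n > 1) {
        size_t middle = n / 2;
        base += (needle < base[middle]) ? 0 : middle;
        n -= middle;
    }
    return base;  /* candidate slot; caller checks *base == needle */
}

/* Branching variant: the load only feeds the branch condition, so with
 * correct prediction the loads of successive iterations can overlap. */
static const int64_t *search_branchy(const int64_t *base, size_t n,
                                     int64_t needle)
{
    while (n > 1) {
        size_t middle = n / 2;
        if (needle >= base[middle])
            base += middle;
        n -= middle;
    }
    return base;
}
```

Both return a pointer to the largest element <= needle (or the first element); only the control-flow encoding of the comparison differs.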
The ADD of Rn can be freely positioned in the loop, but the recurrence
remains at 2 cycles.
A 7-wide (or wider) machine will be able to insert a whole iteration
of the loop per cycle. The loop has a 2-cycle recurrence, so the
execution window will fill up rapidly and retirement settles at one
iteration per 2 cycles, being recurrence-bound. This agrees with the
second paragraph below.
In the branching version we have the following loop recurrences:
>
1) middle = n/2; n -= middle
2) base += middle;
>
Recurrence 1 costs a few cycles per iteration, ideally 2. Recurrence
2 costs even less. There is no load in the loop recurrences. The
load is used only for verifying the branch prediction. And in
particular, the load does not data-depend on any previous load.
A shifter with a latency of 2 would really hurt this loop.
If branch prediction is always correct, the branching version builds
an instruction queue that is a long chain of scaled indexed loads
where the load index is data dependent on the prior iteration load value.
>
That's certainly not the case here.
>
Even if we assume that the branch predictor misses half of the time
(which is realistic), that means that the next load and the load of
all following correctly predicted ifs just have to wait for the load of
the mispredicted branch plus the misprediction penalty. Which means
that on average two loads will happen in parallel, while the
branchless version only performs dependent loads. If the loads cost
more than a branch misprediction (because of cache misses), the
branching version is faster.
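Anton's arithmetic can be sketched with a toy latency model (the numbers and the 0.5 misprediction rate below are illustrative assumptions, not measurements): the branchless search serializes one load per level, while the branchy one pays the full load latency plus the flush only on mispredicted levels.

```c
/* Toy per-search latency model (illustrative numbers only):
 *   depth      - number of binary-search levels
 *   load       - latency of one array load, in cycles
 *   mispredict - branch misprediction penalty, in cycles
 *   miss_rate  - fraction of levels whose branch is mispredicted */
static double t_branchless(int depth, double load)
{
    return depth * load;  /* one dependent load per level */
}

static double t_branchy(int depth, double load,
                        double mispredict, double miss_rate)
{
    /* correctly predicted levels overlap with earlier work; only
     * mispredicted levels pay the load latency plus the flush */
    return depth * miss_rate * (load + mispredict);
}
```

Under this model the branchy version wins exactly when the load costs more than the misprediction penalty, matching the paragraph above.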
I do not see 2 LDDs being performed in parallel unless the execution
width is at least 14 wide. In any event, the loop recurrence restricts
overall retirement to 0.5 LDDs per cycle--it is the recurrence that
feeds the iterations (i.e., retirement). The water hose is fundamentally
restricted by the recurrence.
One can now improve the performance by using branchless for the first
few levels (which are cached), and adding prefetching and a mixture of
branchless and branchy for the rest, but for understanding the
principle the simple variants are best.
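The prefetching part of that can be sketched as follows (my code, not Anton's; __builtin_prefetch is a GCC/Clang extension): prefetch both candidate midpoints of the next level, so the next load's latency overlaps the current one whichever way the search descends.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchless search that prefetches both possible next midpoints. */
static const int64_t *search_prefetching(const int64_t *base, size_t n,
                                         int64_t needle)
{
    while (n > 1) {
        size_t middle = n / 2;
        size_t next   = (n - middle) / 2;  /* middle of the next level */
        __builtin_prefetch(&base[next]);          /* if we stay in the left half  */
        __builtin_prefetch(&base[middle + next]); /* if we move to the right half */
        base += (needle < base[middle]) ? 0 : middle;
        n -= middle;
    }
    return base;  /* candidate slot; caller checks for equality */
}
```

This doubles the memory traffic but hides one level of load latency per iteration when the accesses miss the close caches.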
Predicating adds latency to the delivery of Rb's recurrence, in that
the consumer has to be ready for "didn't happen and now you are free"
along with "here is the data you are looking for", complicating the
instruction queue "a little".
[*] I want to see the asm because Intel's CMOV always executes the
operand operation, then tosses the result if the predicate is false.
Use a less-stupid ISA.
I took a look at extracting the < or >= result and using it as a mask
(0 or middle), but this takes 1 more instruction and does nothing about
the fundamental loop limiter.
That's the usual thing for conditional execution/predication. But
here 0 has no dependencies and middle has only cheap dependencies; the
only expensive part is the load for the condition, which turns into a
data dependency in the branchless version.
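The mask extraction Mitch describes can be written directly in C (a sketch; whether a given compiler keeps it branch-free is not guaranteed):

```c
#include <stddef.h>
#include <stdint.h>

/* Mask form: turn the comparison into all-ones or zero,
 * then AND it with middle instead of using ?: or a branch. */
static const int64_t *search_masked(const int64_t *base, size_t n,
                                    int64_t needle)
{
    while (n > 1) {
        size_t middle = n / 2;
        /* mask = all-ones if needle >= base[middle], else zero */
        size_t mask = (size_t)0 - (size_t)(needle >= base[middle]);
        base += middle & mask;  /* adds middle or 0, branch-free */
        n -= middle;
    }
    return base;
}
```

As noted above, this spends an extra instruction forming the mask and leaves the load-to-load recurrence untouched.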
>
- anton