Re: auto predicating branches

Subject: Re: auto predicating branches
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Date: 22 Apr 2025, 06:10:10
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Apr22.071010@mips.complang.tuwien.ac.at>
User-Agent: xrn 10.11
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 21 Apr 2025 6:05:32 +0000, Anton Ertl wrote:
>
Robert Finch <robfi680@gmail.com> writes:
Having branches automatically convert into
predicates when they branch forward a short distance <7 instructions.
>
If-conversion in hardware is a good idea, if done well, because it
involves issues that tend to be unknown to compilers:
>
I had little trouble teaching Brian how to put if-conversion into the
compiler with my PRED instructions. Alleviating HW from having to bother
other than being able to execute PREDicated clauses.

Compilers certainly can perform if-conversion, and there have been
papers about that for at least 30 years, but compilers do not know
when if-conversion is profitable.  E.g., "branchless" (if-converted)
binary search can be faster than a branching one when the searched
array is cached at a close-enough level, but slower when most of the
accesses miss the close caches
<https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.
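
For reference, the two variants being compared can be sketched in C as
below.  This is a minimal sketch, not the code from the linked
question: the names lower_bound_branchy and lower_bound_branchless are
made up for illustration, and whether a compiler actually emits a
conditional move (rather than a branch) for the second variant depends
on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* Branching variant: every iteration ends in a conditional branch.
   When the branch is predicted correctly, the load of the next a[mid]
   can start before the current load has returned. */
size_t lower_bound_branchy(const int *a, size_t n, int key)
{
    assert(a != NULL || n == 0);
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;                    /* index of first element >= key */
}

/* Branchless (if-converted) variant: the trip count depends only on n,
   and the comparison result feeds a select.  The address of each load
   depends on the result of the previous load. */
size_t lower_bound_branchless(const int *a, size_t n, int key)
{
    assert(a != NULL || n == 0);
    const int *base = a;
    size_t len = n;
    while (len > 1) {
        size_t half = len / 2;
        base += (base[half - 1] < key) ? half : 0;   /* select, no if */
        len -= half;
    }
    return (size_t)(base - a) + (len == 1 && base[0] < key);
}
```

Both functions return the same index for a sorted array; they differ
only in whether the comparison becomes a control dependency or a data
dependency.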

* How predictable is the condition?  If the condition is very well
  predictable, if-conversion is not a good idea, because it turns the
  control dependency (which does not cost latency when the prediction
  is correct) into a data dependency.  Moreover, in this case the
  if-conversion increases the resource consumption.  Compilers are not
  good at predicting the predictability AFAIK.
>
Rather than base the choice on the predictability of the condition,
it is based on whether FETCH will pass the join-point before the
condition resolves. On an 8-wide machine this might be "THE next
cycle".

On a CPU with out-of-order execution, the instruction fetcher often
runs many dozens of instructions ahead of the functional units which
produce the conditions, so your criterion could cover pretty big IFs.
And, given that you want the compiler to do it, the compiler would
have to know about that.  Ok, what decision will you take in what
case, and why?

* Is the condition available before or after the original data
  dependencies?  And if afterwards, by how many cycles?  If it is
  afterwards and the branch prediction would be correct, the
  if-conversion means that the result of the instruction is available
  later, which may reduce IPC.
>
Generally, it only adds latency--if the execution window is not stalled
at either end this does not harm IPC.

If the additional latency is on the critical path and execution is
dependency-limited, this reduces IPC.  And yes, this will result in
the buffers (especially the schedulers) filling up and stalling the
front end.

                                OTOH, if the branch prediction would
  be incorrect, the recovery also depends on when the condition
  becomes available,
>
There is no "recovery" from PREDication, just one clause getting
nullified.

I apparently wrote that in a way that is easy to misunderstand.  Here's
another attempt: When comparing the branching variant to the predicated
(if-converted) variant, if the branching variant would be
mispredicted, it is always at a disadvantage wrt. latency compared to
the predicated variant, because the branching variant restarts from
the instruction fetch when the condition becomes available, while the
predicated variant is already fetched and decoded and waits in a
scheduler for the condition.

Note that this also holds in the binary-search case linked above, but
in the branchy version the benefit comes from the correct predictions
and the lack of data dependencies between the loads: In
those cases the cache-missing load does not depend on the previous
cache-missing load (unlike the branchless version), resulting in an
overall shorter latency.
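
The latency comparison above can be put into a toy model.  This is
only a sketch under stated assumptions: the struct if_timing, its
fields, and the three functions are hypothetical, all times are in
cycles, and effects such as resource consumption or wrong-path
execution are ignored.

```c
#include <assert.h>

/* Hypothetical timing parameters for one guarded operation (cycles). */
struct if_timing {
    int cond_ready;  /* when the condition has been computed            */
    int data_ready;  /* when the other inputs of the guarded op are ready */
    int op_latency;  /* latency of the guarded operation itself         */
    int refetch;     /* front-end restart time after a misprediction    */
};

static int max_int(int a, int b) { return a > b ? a : b; }

/* Correctly predicted branch: the control dependency costs no latency. */
int latency_branch_correct(struct if_timing t)
{
    return t.data_ready + t.op_latency;
}

/* Mispredicted branch: fetch restarts once the condition is known. */
int latency_branch_mispredicted(struct if_timing t)
{
    assert(t.refetch >= 0);
    return max_int(t.cond_ready + t.refetch, t.data_ready) + t.op_latency;
}

/* Predicated: already fetched and decoded, waits for the condition. */
int latency_predicated(struct if_timing t)
{
    return max_int(t.cond_ready, t.data_ready) + t.op_latency;
}
```

In this model the predicated variant is never slower than the
mispredicted branch and never faster than the correctly predicted
one, so the outcome depends on how often the prediction is correct and
on how late the condition arrives relative to the data.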

                     and the total latency is higher in the case of no
  if-conversion.  The compiler may do an ok job at predicting whether
  a condition is available before or after the original data
  dependencies (I don't know a paper that evaluates that), but without
  knowing about the prediction accuracy of a specific condition that
  does not help much.
>
So the hardware should take predictability of a condition and the
availability of the condition into consideration for if-conversion.
>
My argument is that this is a SW decision (in the compiler) not a
HW decision (other than providing the PREDs).

That's a position, not an argument.  Do you have an argument for your
position?

Since PREDs are not
predicted (unless you think they are predicted BOTH ways) they do
not diminish the performance of the branch predictors.

Nor increase it.  But it sounds like you think that the compiler
should choose predication when the condition is not particularly
predictable.  How should the compiler know that?

The compiler chooses PRED because FETCH reaches the join-point prior
to the branch resolving. PRED is almost always faster--and when
it has both then-clause and else-clause, it always saves a branch
instruction (jumping over the else-clause).

It appears that you have an underpowered front end in mind.  Even in
the wide machines of today and without predication, the front end
normally does not have a problem fetching at least as many
instructions as the rest of the machine can handle, as long as the
predictions are correct.  Even if the instruction fetcher cannot fetch
the full width in one cycle due to having more taken branches than the
instruction fetcher can handle or some other hiccup, this is usually
made up by delivering more instructions in other cycles than the rest
of the CPU can handle.  E.g., Skymont
<https://old.chipsandcheese.com/2024/06/15/intel-details-skymont/>
fetches from three sequential streams, 3*32 bytes/cycle, and decodes
into 3*3 uops/cycle and stores them in 3 32-entry uop queues; the
renamer consumes up to 8 uops/cycle from these queues, so as long as the
predictions are correct and the average fetching and decoding is >8
uops/cycle, the renamer will rarely see fewer than 8 uops available,
even if there is the occasional cycle where the taken branches are so
dense that the instruction fetcher cannot deliver enough relevant
bytes to the instruction decoder.
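
The buffering argument can be illustrated with a small simulation.
This is a rough sketch, not a model of the real Skymont front end: it
lumps the three 32-entry queues into one 96-entry queue, lets uops
decoded in a cycle be renamed in the same cycle, and simply discards
whatever does not fit into a full queue.  The function name
starved_cycles and the input array of per-cycle decode counts are
made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum {
    QUEUE_CAPACITY = 3 * 32,  /* three 32-entry uop queues, lumped together */
    RENAME_WIDTH   = 8        /* uops the renamer consumes per cycle */
};

/* decoded[c] is the number of uops the decoders deliver in cycle c.
   Returns the number of cycles in which the renamer finds fewer than
   RENAME_WIDTH uops available. */
size_t starved_cycles(const int *decoded, size_t ncycles)
{
    int queued = 0;
    size_t starved = 0;
    for (size_t c = 0; c < ncycles; c++) {
        assert(decoded[c] >= 0);
        queued += decoded[c];
        if (queued > QUEUE_CAPACITY)
            queued = QUEUE_CAPACITY;   /* simplification: surplus dropped */
        int taken = queued < RENAME_WIDTH ? queued : RENAME_WIDTH;
        if (taken < RENAME_WIDTH)
            starved++;
        queued -= taken;
    }
    return starved;
}
```

With eight cycles of 9 decoded uops followed by one cycle of only 2,
the slack built up in the queue covers the short cycle and the renamer
is never starved; with a steady 8 per cycle there is no slack and the
same hiccup costs a partially filled rename cycle.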

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
