Re: auto predicating branches

Subject : Re: auto predicating branches
From : anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups : comp.arch
Date : 22 Apr 2025, 06:10:10
Organization : Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID : <2025Apr22.071010@mips.complang.tuwien.ac.at>
User-Agent : xrn 10.11
mitchalsup@aol.com (MitchAlsup1) writes:
> On Mon, 21 Apr 2025 6:05:32 +0000, Anton Ertl wrote:
>
>> Robert Finch <robfi680@gmail.com> writes:
>>> Having branches automatically convert into
>>> predicates when they branch forward a short distance <7 instructions.
>>
>> If-conversion in hardware is a good idea, if done well, because it
>> involves issues that tend to be unknown to compilers:
>
> I had little trouble teaching Brian how to put if-conversion into the
> compiler with my PRED instructions, relieving HW of having to bother
> with anything other than executing PREDicated clauses.

Compilers certainly can perform if-conversion, and there have been
papers about that for at least 30 years, but compilers do not know
when if-conversion is profitable.  E.g., "branchless" (if-converted)
binary search can be faster than a branching one when the searched
array is cached at a close-enough level, but slower when most of the
accesses miss the close caches
<https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.
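For readers who have not seen the two variants side by side, here is a minimal C sketch.  It is not taken from the linked question; the function names are made up for this illustration, and whether the second variant really compiles to a conditional move (rather than a branch) depends on the compiler and its options, so treat "branchless" as an intent, not a guarantee.

```c
#include <stddef.h>

/* Branching variant: index of the first element >= key in sorted a[0..n).
   The loop direction is decided by a conditional branch, so the hardware
   can predict it and start the next load speculatively. */
size_t lower_bound_branchy(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* If-converted variant: same result, but the next load address is
   computed from the loaded value (a data dependency), which a compiler
   may implement with a conditional move or a multiply by 0/1. */
size_t lower_bound_branchless(const int *a, size_t n, int key)
{
    const int *base = a;
    size_t len = n;
    if (n == 0)                 /* one branch outside the loop */
        return 0;
    while (len > 1) {
        size_t half = len / 2;
        /* advance base by half if the probed element is < key, else by 0 */
        base += (base[half - 1] < key) ? half : 0;
        len -= half;
    }
    return (size_t)(base - a) + (size_t)(base[0] < key);
}
```

Both functions return the same index for every key; only the kind of dependency between successive loads differs.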

>> * How predictable is the condition?  If the condition is highly
>>   predictable, if-conversion is not a good idea, because it turns the
>>   control dependency (which does not cost latency when the prediction
>>   is correct) into a data dependency.  Moreover, in this case the
>>   if-conversion increases the resource consumption.  Compilers are not
>>   good at predicting the predictability AFAIK.
>
> Rather than base the choice on the predictability of the condition,
> it is based on whether FETCH will pass the join-point before the
> condition resolves. On an 8-wide machine this might be "THE next
> cycle".

On a CPU with out-of-order execution, the instruction fetcher often
runs many dozens of instructions ahead of the functional units which
produce the conditions, so your criterion could cover pretty big IFs.
And, given that you want the compiler to do it, the compiler would
have to know about that.  OK, which decision would you make in which
case, and why?

>> * Is the condition available before or after the original data
>>   dependencies?  And if afterwards, by how many cycles?  If it is
>>   afterwards and the branch prediction would be correct, the
>>   if-conversion means that the result of the instruction is available
>>   later, which may reduce IPC.
>
> Generally, it only adds latency--if the execution window is not stalled
> at either end this does not harm IPC.

If the additional latency is on the critical path and execution is
dependency-limited, this reduces IPC.  And yes, this will result in
the buffers (especially the schedulers) filling up and stalling the
front end.

>>                                 OTOH, if the branch prediction would
>>   be incorrect, the recovery also depends on when the condition
>>   becomes available,
>
> There is no "recovery" from PREDication, just one clause getting
> nullified.

I apparently wrote that in a way that is easy to misunderstand.  Here's another
attempt: When comparing the branching variant to the predicated
(if-converted) variant, if the branching variant would be
mispredicted, it is always at a disadvantage wrt. latency compared to
the predicated variant, because the branching variant restarts from
the instruction fetch when the condition becomes available, while the
predicated variant is already fetched and decoded and waits in a
scheduler for the condition.
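The comparison can be put into a toy timing model.  This is only a sketch under simplifying assumptions: all times are cycles relative to some common start, and the parameter names (t_data for when the original data operands are ready, t_cond for when the condition is ready, refill for the front-end restart latency, exec for the latency of the guarded instruction) are invented for the illustration.

```c
/* Toy timing model; every name and parameter here is an assumption made
   for illustration, not a description of any particular CPU. */

static long max_l(long a, long b) { return a > b ? a : b; }

/* Branching variant, prediction correct: the guarded instruction waits
   only for its original data operands. */
long branch_correct(long t_data, long exec)
{
    return t_data + exec;
}

/* Branching variant, mispredicted: fetch restarts when the condition
   resolves, so nothing on the correct path starts before t_cond + refill. */
long branch_mispredicted(long t_data, long t_cond, long refill, long exec)
{
    return max_l(t_data, t_cond + refill) + exec;
}

/* Predicated variant: already fetched and decoded, it waits in the
   scheduler for both the data operands and the condition. */
long predicated(long t_data, long t_cond, long exec)
{
    return max_l(t_data, t_cond) + exec;
}

/* Expected result time of the branching variant for a given prediction
   accuracy p_correct in [0, 1]. */
double branch_expected(double p_correct, long t_data, long t_cond,
                       long refill, long exec)
{
    return p_correct * (double)branch_correct(t_data, exec)
         + (1.0 - p_correct)
           * (double)branch_mispredicted(t_data, t_cond, refill, exec);
}
```

With t_data=2, t_cond=10, refill=15 and exec=1, the predicated result arrives at cycle 11, the correctly predicted branch at 3 and the mispredicted one at 26.  The expected branching time is then 4.15 at 95% accuracy and 14.5 at 50% accuracy, so which variant wins depends on both the accuracy and the timing of the condition.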

Note that in the binary-search case linked-to above, that's also the
case, but in the branchy version the benefit comes from the correct
predictions and the lack of data-dependencies between the loads: In
those cases the cache-missing load does not depend on the previous
cache-missing load (unlike the branchless version), resulting in an
overall shorter latency.
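A very crude model of that effect, as a sketch only: assume every probe of the search costs the same load latency, the branchless version serializes all probes, and the branchy version pays the load latency plus a front-end restart only at each misprediction (plus one final load), with correctly predicted probes overlapping.  The function and parameter names are hypothetical, and the model ignores wrong-path loads, memory-level-parallelism limits and everything else.

```c
/* Crude latency model for a binary search of `depth` probes.
   Assumptions (for illustration only): every probe has latency
   `load_latency` cycles; a misprediction costs an extra `refill` cycles. */

/* Branchless: each load address depends on the previous load's value,
   so the probes form one serial dependency chain. */
long branchless_cycles(int depth, long load_latency)
{
    return (long)depth * load_latency;
}

/* Branchy: only mispredicted probes expose their load latency (plus the
   restart); the last run of correctly predicted probes overlaps and
   finishes one load latency later. */
long branchy_cycles(int mispredicts, long load_latency, long refill)
{
    return (long)mispredicts * (load_latency + refill) + load_latency;
}
```

For 20 probes with 10 mispredictions and a 15-cycle restart, this gives 4000 vs. 2350 cycles at a 200-cycle load latency (branchy ahead), but 80 vs. 194 cycles at a 4-cycle load latency (branchless ahead), which is the direction of the effect described above.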

>>                      and the total latency is higher in the case of no
>>   if-conversion.  The compiler may do an ok job at predicting whether
>>   a condition is available before or after the original data
>>   dependencies (I don't know a paper that evaluates that), but without
>>   knowing about the prediction accuracy of a specific condition that
>>   does not help much.
>>
>> So the hardware should take predictability of a condition and the
>> availability of the condition into consideration for if-conversion.
>
> My argument is that this is a SW decision (in the compiler) not a
> HW decision (other than providing the PREDs).

That's a position, not an argument.  Do you have an argument for your
position?

> Since PREDs are not
> predicted (unless you think they are predicted BOTH ways) they do
> not diminish the performance of the branch predictors.

Nor increase it.  But it sounds like you think that the compiler
should choose predication when the condition is not particularly
predictable.  How should the compiler know that?

> The compiler chooses PRED because FETCH reaches the join-point prior
> to the branch resolving. PRED is almost always faster--and when
> it has both then-clause and else-clause, it always saves a branch
> instruction (jumping over the else-clause).

It appears that you have an underpowered front end in mind.  Even in
the wide machines of today and without predication, the front end
normally does not have a problem fetching at least as many
instructions as the rest of the machine can handle, as long as the
predictions are correct.  Even if the instruction fetcher cannot fetch
the full width in one cycle due to having more taken branches than the
instruction fetcher can handle or some other hiccup, this is usually
made up by delivering more instructions in other cycles than the rest
of the CPU can handle.  E.g., Skymont
<https://old.chipsandcheese.com/2024/06/15/intel-details-skymont/>
fetches from three sequential streams at 3*32 bytes/cycle, decodes
into 3*3 uops/cycle, and stores them in three 32-entry uop queues; the
renamer consumes up to 8 uops per cycle from these queues.  So as long
as the predictions are correct and fetching and decoding average >8
uops/cycle, the renamer will rarely see fewer than 8 uops available,
even if there is the occasional cycle where the taken branches are so
dense that the instruction fetcher cannot deliver enough relevant
bytes to the instruction decoder.
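The buffering argument can be illustrated with a toy simulation.  The widths (9 uops decoded, 8 renamed per cycle) and the total queue capacity of 96 entries come from the Skymont numbers above; everything else is an assumption for the sketch, in particular the single merged queue and the regular "hiccup" pattern in which every hiccup_period-th cycle delivers no uops at all.

```c
/* Toy model of a decoupled front end.  DECODE_WIDTH and RENAME_WIDTH
   follow the Skymont figures quoted in the text; the single merged queue
   and the periodic zero-delivery "hiccup" are illustrative assumptions. */
enum { DECODE_WIDTH = 9, RENAME_WIDTH = 8 };

/* Returns the number of cycles in which the renamer received fewer than
   RENAME_WIDTH uops.  hiccup_period must be > 0. */
int starved_cycles(int cycles, int hiccup_period, int queue_capacity)
{
    int queue = 0, starved = 0;
    for (int c = 0; c < cycles; c++) {
        /* the last cycle of every period delivers nothing */
        int hiccup = (c % hiccup_period) == hiccup_period - 1;
        int decoded = hiccup ? 0 : DECODE_WIDTH;
        int room = queue_capacity - queue;
        if (decoded > room)         /* back-pressure: the queue is full */
            decoded = room;
        queue += decoded;

        int renamed = queue < RENAME_WIDTH ? queue : RENAME_WIDTH;
        queue -= renamed;
        if (renamed < RENAME_WIDTH)
            starved++;
    }
    return starved;
}
```

In this model, starved_cycles(1000, 10, 96) is 0: the average supply of 8.1 uops/cycle exceeds 8 and the queue absorbs the hiccups.  With a hiccup every 5th cycle (7.2 uops/cycle on average) the renamer starves in 200 of 1000 cycles, and with almost no buffering (capacity 9) even the 1-in-10 hiccup pattern starves it in 100 cycles.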

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
