Re: Interpreters and indirect-branch prediction

Liste des GroupesRevenir à c arch 
Sujet : Re: Interpreters and indirect-branch prediction
De : cr88192 (at) *nospam* gmail.com (BGB)
Groupes : comp.arch
Date : 14. Nov 2024, 03:33:19
Autres entêtes
Organisation : A noiseless patient Spider
Message-ID : <vh3nhr$2gegf$1@dont-email.me>
References : 1 2 3 4 5 6 7 8 9 10 11 12 13
User-Agent : Mozilla Thunderbird
On 11/13/2024 2:49 PM, BGB wrote:
On 11/13/2024 1:36 PM, MitchAlsup1 wrote:
On Wed, 13 Nov 2024 8:20:27 +0000, Anton Ertl wrote:
>
mitchalsup@aol.com (MitchAlsup1) writes:
For something like a bytecode interpreter, the prediction accuracy
of the jump predictor is going to be epically bad.
>
And with "jump predictor", you probably mean the indirect-branch
predictor.
>
On hardware of up to around 2005 where the indirect branches were at
best predicted by BTBs with 2-bit counters, a "bytecode interpreter"
with one shared indirect branch (as generated by the usual
for...{switch...{...}} idiom in C) results in misprediction rates of
80% [ertl&gregg03jilp].
>
That certainly qualifies as bad.
>
 Can note that in my branch predictors, I ended up with a special case for "predict the opposite of last time" to try to deal with branches which alternated between being taken and not taken (which could get greater than a 50% mispredict with the more traditional approaches, like the 2-bit state model; the transitory states would allow it to adapt to these cases).
 A more elaborate state machine could potentially predict simple RLE patterns, but designing the state-transition graph would be harder.
 So, say:
   000: Weakly Not Taken
   001: Strong Not Taken
   010: Weakly Transitory Not Taken
   011: Strong Transitory Not Taken
   100: Weakly Taken
   101: Strong Taken
   110: Weakly Transitory Taken
   111: Strong Transitory Taken
 Where the transitory states always flip-flop.
 Had done better than a 2-bit state machine (with no transitory states). Adding additional "strength" levels did not help with accuracy in past attempts (actually seemed to make it slightly worse).
   Could maybe augment this with a 2-bit periodic field:
   00: Period 0, 0000, 1111, 1010, 0101
   01: Period 1, 001001, 010010, 100100, -, 110110, 101101, 011011
   1z: Period 2, 00010001, ..., 11101110, ...
 Though, with the longer periods always assuming a "weak" state, here the pattern also needs to be able to express the phase of the pattern.
  Best state transition diagram is less obvious, correct prediction likely jumps to the next stage of the predicted pattern, whereas a mispredict jumps to a different pattern.
  One other possibility being to use 5 or 6 bits of state and throw it at a genetic algorithm and see what the genetic algorithm comes up with (probably using some variants of the existing state machine as an initial seed guess).
 Say, ranking each possible state machine based on how well it can predict a series of simple repetitive bit patterns of various periods.
 Tempting to consider using a "majority of 8" mutator, as in this case one may want to limit mutation rate (and "majority of 3" might not be stable enough, majority of 5 lacks a cheap way to evaluate, but "majority of 8" allows representing each bit as a byte and using a lookup table). Well, or maybe "majority of 7".
  May try messing with it if anyone is interested.
 ...
 
Well, went and messed around with trying to get a genetic algorithm to solve this...
Example of the sorts of results I am getting:
                 6'b00000_0: tPreExCntB=6'b00001_1;
                 6'b00000_1: tPreExCntB=6'b00110_0;
                 6'b00001_0: tPreExCntB=6'b00001_0;
                 6'b00001_1: tPreExCntB=6'b00100_0;
                 6'b00010_0: tPreExCntB=6'b00000_0;
                 6'b00010_1: tPreExCntB=6'b00111_0;
                 6'b00011_0: tPreExCntB=6'b00010_0;
                 6'b00011_1: tPreExCntB=6'b00111_1;
                 6'b00100_0: tPreExCntB=6'b00011_0;
                 6'b00100_1: tPreExCntB=6'b00100_1;
                 6'b00101_0: tPreExCntB=6'b11100_0;
                 6'b00101_1: tPreExCntB=6'b01100_1;
                 6'b00110_0: tPreExCntB=6'b01011_0;
                 6'b00110_1: tPreExCntB=6'b00100_1;
                 6'b00111_0: tPreExCntB=6'b00010_1;
                 6'b00111_1: tPreExCntB=6'b01110_1;
                 6'b01000_0: tPreExCntB=6'b11111_1;
                 6'b01000_1: tPreExCntB=6'b11100_0;
                 6'b01001_0: tPreExCntB=6'b11101_1;
                 6'b01001_1: tPreExCntB=6'b11010_1;
                 6'b01010_0: tPreExCntB=6'b10011_1;
                 6'b01010_1: tPreExCntB=6'b00000_0;
                 6'b01011_0: tPreExCntB=6'b00000_1;
                 6'b01011_1: tPreExCntB=6'b10101_1;
                 6'b01100_0: tPreExCntB=6'b11111_0;
                 6'b01100_1: tPreExCntB=6'b00001_0;
                 6'b01101_0: tPreExCntB=6'b00000_0;
                 6'b01101_1: tPreExCntB=6'b11100_1;
                 6'b01110_0: tPreExCntB=6'b10011_0;
                 6'b01110_1: tPreExCntB=6'b10110_0;
                 6'b01111_0: tPreExCntB=6'b00000_1;
                 6'b01111_1: tPreExCntB=6'b00110_1;
                 6'b10000_0: tPreExCntB=6'b10001_0;
                 6'b10000_1: tPreExCntB=6'b00111_1;
                 6'b10001_0: tPreExCntB=6'b01001_0;
                 6'b10001_1: tPreExCntB=6'b11000_1;
                 6'b10010_0: tPreExCntB=6'b01001_0;
                 6'b10010_1: tPreExCntB=6'b00111_1;
                 6'b10011_0: tPreExCntB=6'b00100_0;
                 6'b10011_1: tPreExCntB=6'b10111_1;
                 6'b10100_0: tPreExCntB=6'b00100_0;
                 6'b10100_1: tPreExCntB=6'b00100_1;
                 6'b10101_0: tPreExCntB=6'b00010_1;
                 6'b10101_1: tPreExCntB=6'b01101_1;
                 6'b10110_0: tPreExCntB=6'b00011_1;
                 6'b10110_1: tPreExCntB=6'b01001_1;
                 6'b10111_0: tPreExCntB=6'b00011_1;
                 6'b10111_1: tPreExCntB=6'b01010_0;
                 6'b11000_0: tPreExCntB=6'b00111_0;
                 6'b11000_1: tPreExCntB=6'b00110_1;
                 6'b11001_0: tPreExCntB=6'b01111_1;
                 6'b11001_1: tPreExCntB=6'b00000_1;
                 6'b11010_0: tPreExCntB=6'b00111_1;
                 6'b11010_1: tPreExCntB=6'b10111_0;
                 6'b11011_0: tPreExCntB=6'b11010_0;
                 6'b11011_1: tPreExCntB=6'b10110_1;
                 6'b11100_0: tPreExCntB=6'b10000_0;
                 6'b11100_1: tPreExCntB=6'b01001_0;
                 6'b11101_0: tPreExCntB=6'b00110_1;
                 6'b11101_1: tPreExCntB=6'b10101_1;
                 6'b11110_0: tPreExCntB=6'b00011_0;
                 6'b11110_1: tPreExCntB=6'b00000_0;
                 6'b11111_0: tPreExCntB=6'b00111_1;
                 6'b11111_1: tPreExCntB=6'b01011_1;
The logic is admittedly difficult to decipher from the bit pattern.
Seems it is able to achieve 100% accuracy at 1..3 bit repeating patterns, but is unable to fully learn 4..7 bit patterns.
Granted, longer patterns would need more working state to be able to "remember" the position in the pattern.
In this case, there is 5 bits of state, with the final bit as the input/output bit (input is what state we are in, and the branch direction; the output is the next state, and the predicted branch direction for the next branch).
General ranking process was to put it in a random initial state, expose it to 16-48 bits from a randomly selected pattern, followed by 16-48 bits of the target pattern. After this, it is used to predict each bit of the pattern, counting up the number of mispredicted bits.
Accuracy seems to be higher than with 3 or 4 bits of state (which can only learn 1/2 bit patterns).
Here, 6 bits doesn't seem to have much advantage over 5 bits, but the drawback of not fitting into a LUT6.
It seems 7 bits could learn 4 and 5 bit patterns, but likely isn't worth the cost.
Not sure how well it would do at actual branch prediction. Since, as noted, it was trained on simple repeating patterns rather than actual branch data.

  
                        With one a non-shared indirect branch per VM
instruction implementation (as in typical threaded-code
implementations), we measured 50%-61% mispredictions for BTB with
2-bit counters in simulation [ertl&gregg03jilp] and similar results on
actual CPUs.
>
50% misprediction rate qualifies as bad.
>
We simulated 2-level history-based indirect branch predictors
[ertl&gregg03jilp], and they typically have less than 10%
mispredictions.
>
10% misprediction rate qualifies as "not so bad".
 

Date Sujet#  Auteur
23 Oct 24 * Reverse engineering of Intel branch predictors34Thomas Koenig
23 Oct 24 +* Re: Reverse engineering of Intel branch predictors24MitchAlsup1
28 Oct 24 i`* Re: Reverse engineering of Intel branch predictors23Stefan Monnier
5 Nov 24 i `* Re: Reverse engineering of Intel branch predictors22MitchAlsup1
11 Nov 24 i  +* Re: Reverse engineering of Intel branch predictors2Thomas Koenig
11 Nov 24 i  i`- Re: Reverse engineering of Intel branch predictors1MitchAlsup1
11 Nov 24 i  `* Re: Reverse engineering of Intel branch predictors19Stefan Monnier
11 Nov 24 i   `* Re: Reverse engineering of Intel branch predictors18MitchAlsup1
12 Nov 24 i    `* Re: Reverse engineering of Intel branch predictors17Stefan Monnier
12 Nov 24 i     `* Re: Reverse engineering of Intel branch predictors16MitchAlsup1
12 Nov 24 i      +* Re: Reverse engineering of Intel branch predictors13Stefan Monnier
12 Nov 24 i      i`* Re: Reverse engineering of Intel branch predictors12MitchAlsup1
13 Nov 24 i      i +* Re: Reverse engineering of Intel branch predictors7Stefan Monnier
13 Nov 24 i      i i`* Re: Reverse engineering of Intel branch predictors6Terje Mathisen
13 Nov 24 i      i i `* Re: Reverse engineering of Intel branch predictors5Stefan Monnier
13 Nov 24 i      i i  `* Re: Reverse engineering of Intel branch predictors4Thomas Koenig
13 Nov 24 i      i i   +* Re: Reverse engineering of Intel branch predictors2Stefan Monnier
14 Nov 24 i      i i   i`- Re: Reverse engineering of Intel branch predictors1Thomas Koenig
14 Nov 24 i      i i   `- Interpreters and indirect-branch prediction (was: Reverse ...)1Anton Ertl
13 Nov 24 i      i `* Interpreters and indirect-branch prediction4Anton Ertl
13 Nov 24 i      i  `* Re: Interpreters and indirect-branch prediction3MitchAlsup1
13 Nov 24 i      i   `* Re: Interpreters and indirect-branch prediction2BGB
14 Nov 24 i      i    `- Re: Interpreters and indirect-branch prediction1BGB
13 Nov 24 i      `* Re: Reverse engineering of Intel branch predictors2Brett
13 Nov 24 i       `- Re: Reverse engineering of Intel branch predictors1MitchAlsup1
1 Nov 24 `* Re: Reverse engineering of Intel branch predictors9Waldek Hebisch
1 Nov 24  +- Re: Reverse engineering of Intel branch predictors1MitchAlsup1
5 Nov 24  `* Re: Reverse engineering of Intel branch predictors7Stefan Monnier
5 Nov 24   +- Re: Reverse engineering of Intel branch predictors1MitchAlsup1
8 Nov 24   `* Re: Reverse engineering of Intel branch predictors5Waldek Hebisch
8 Nov 24    +* Re: Reverse engineering of Intel branch predictors3MitchAlsup1
10 Nov 24    i`* Re: Reverse engineering of Intel branch predictors2Waldek Hebisch
10 Nov 24    i `- Re: Reverse engineering of Intel branch predictors1MitchAlsup1
11 Nov 24    `- Re: Reverse engineering of Intel branch predictors1Stefan Monnier

Haut de la page

Les messages affichés proviennent d'usenet.

NewsPortal