On Tue, 25 Mar 2025 17:54:42 +0000, Dan Cross wrote:
In article <0343529d63f68c76b5e0d227ef6e39dd@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
--------------------------
>
Say you have a critical section (of your description) that was
preceded by an if-statement that was predicted to enter CS, but
1,000 cycles later the branch is resolved to not enter the CS.
{{The branch depended on an MMI/O location down the PCIe tree,
which is why it took so long to resolve.}}
>
HW backs the core up so that the RF and all memory locations
contain their original values at the point of the mispredicted
branch.
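The backing-up step can be pictured with a deliberately tiny model. The sketch below assumes a core whose only state is a PC and a register file, and a hypothetical `ToyCore` class that snapshots both when it predicts a branch. It illustrates checkpoint-and-restore only; it is not how any particular core implements recovery, and memory is left out here.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCore:
    """Hypothetical core model: a PC, a register file, and a stack of
    checkpoints taken at each predicted branch."""
    pc: int = 0
    regs: dict = field(default_factory=dict)
    checkpoints: list = field(default_factory=list)

    def predict_branch(self) -> None:
        # Snapshot PC and RF at the point of the predicted branch.
        self.checkpoints.append((self.pc, dict(self.regs)))

    def resolve_branch(self, mispredicted: bool) -> None:
        # This toy model only resolves the youngest outstanding branch.
        pc, regs = self.checkpoints.pop()
        if mispredicted:
            # Back the core up: the wrong-path work simply vanishes.
            self.pc, self.regs = pc, regs


core = ToyCore(pc=0x100, regs={"r1": 1})
core.predict_branch()                    # predicted: enter the CS
core.regs["r1"] = 42                     # speculative work inside the CS
core.pc = 0x180
core.resolve_branch(mispredicted=True)   # 1,000 cycles later: do not enter
print(core.pc == 0x100, core.regs)       # True {'r1': 1}
```

From the thread's point of view, nothing between `predict_branch` and `resolve_branch` ever happened.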
>
Can the thread running on the core even tell whether you entered the CS or not?
>
No, your thread cannot.
>
Again, how is that relevant? The branch predictor can only
handle speculative execution of so many instructions.
Around 300 at present.
Know of any critical sections that long?
Attempts to perform more instructions than that will stall the pipeline.
Obviously.
A critical section as defined by software may contain more than
that number of instructions.
Sure, it's possible.
But if DECODE cannot push the entire critical section into the
execution window, then BY DEFINITION, it MUST be able to make
the entire critical section appear as if you never started on
it !!!
IP gets backed up as if you never entered CS
RF gets backed up as if you never entered CS
LOCK variable never got written to SNOOPable cache where others can see it
Cache memory gets put back into the "you never got here" state too.
Thus, it is as if it never happened: both to the core and to all
interested 3rd parties.
Nobody saw the Lock variable change value.
And
Nobody saw any STs from the CS change anything in memory.
Thus, it is as if the CS never happened: it was all speculative, and
it got put right when the CS was not allowed to finish.
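A minimal sketch of why nobody outside the core sees the LOCK variable or the STs from the CS change, assuming a hypothetical `ToySpeculativeMemory` that holds stores issued under an unresolved branch in a private `pending` map and exposes only `committed` values to snoopers. Real hardware tracks this with store queues and cache-line state, not a dictionary.

```python
class ToySpeculativeMemory:
    """Hypothetical model: `committed` is the SNOOPable state other cores can
    observe; stores made under an unresolved branch wait in `pending`."""

    def __init__(self, committed: dict):
        self.committed = dict(committed)
        self.pending = {}  # addr -> value, invisible to 3rd parties

    def store(self, addr, value) -> None:
        self.pending[addr] = value

    def load(self, addr):
        # The local core sees its own speculative stores first.
        return self.pending.get(addr, self.committed.get(addr, 0))

    def snoop(self, addr):
        # An interested 3rd party only ever sees committed values.
        return self.committed.get(addr, 0)

    def resolve(self, branch_correct: bool) -> None:
        if branch_correct:
            self.committed.update(self.pending)  # CS becomes visible
        self.pending.clear()                     # otherwise it never happened


mem = ToySpeculativeMemory({"LOCK": 0, "x": 7})
mem.store("LOCK", 1)    # speculatively acquire the lock
mem.store("x", 99)      # speculative ST inside the CS
print(mem.load("LOCK"), mem.snoop("LOCK"))   # 1 0
mem.resolve(branch_correct=False)
print(mem.snoop("LOCK"), mem.snoop("x"))     # 0 7
```

The core itself believes it holds the lock (`load` returns 1), while every snooper still sees 0; on a mispredict the pending state is dropped and both views agree again.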
BUT-----------------------------
Code running in the CS does believe that the LOCK variable was updated
AND that STs from the CS have been performed by the exit of the CS
------------------------------------------------------------
You see, SW sees the von Neumann model: 1 instruction is fetched
and executed in its entirety before proceeding to the next
instruction. You see individual instructions, as if single stepping.
HW, on the other hand, sees at least 3 points in time:
a) instruction has not arrived to be executed
b) instruction is executing
c) instruction has completed executing
And the time span from a) to c) can be hundreds of cycles
while there are 300 (or more) instructions contending for
execution resources.
In addition, certain instructions may have more than 1 b) unit
of timing; for example, a LD-OP-ST has a LD execution, an OP
execution, and a ST execution.
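To contrast the two views, here is a hypothetical sketch of the points in time HW tracks per instruction. The `PHASES` table and the names `Point` and `timeline` are illustrative assumptions; a LD-OP-ST is given three b) phases as in the paragraph above, while single-stepping SW only ever observes the c) state.

```python
from enum import Enum


class Point(Enum):
    WAITING = "a"    # has not arrived to be executed
    EXECUTING = "b"  # is executing (one entry per execution phase)
    COMPLETED = "c"  # has completed executing


# Hypothetical phase table: a plain ADD has one b) phase,
# a LD-OP-ST has three (LD, OP, ST).
PHASES = {"ADD": ["OP"], "LD-OP-ST": ["LD", "OP", "ST"]}


def timeline(kind: str) -> list:
    """Points in time HW tracks for one instruction, in order."""
    points = [(Point.WAITING, None)]
    points += [(Point.EXECUTING, phase) for phase in PHASES[kind]]
    points.append((Point.COMPLETED, None))
    return points


for kind in PHASES:
    print(kind, [p.value + (f":{ph}" if ph else "") for p, ph in timeline(kind)])
# ADD ['a', 'b:OP', 'c']
# LD-OP-ST ['a', 'b:LD', 'b:OP', 'b:ST', 'c']
```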
But look at the execution of a ST in the shadow of a branch.
We want the ST to be "performed" as soon as possible, so that a
subsequent LD can obtain the ST-data even before we are allowed
to write the ST-data into the cache. So, your notion of what can be
backed up (or not) does not appear to account for all the funny
business HW goes through to make the core smell fast.
After the LD receives the ST-data, the branch may resolve as a
bad prediction: the ST is not allowed to modify the cache, and the
LD should never have seen data that was never to be written. So HW
cleans up, making it all SMELL like it never happened.
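Store-to-load forwarding in the shadow of a branch can be sketched as follows, assuming a hypothetical `ToyStoreQueue` in which each ST is tagged with the single unresolved branch it depends on. That tagging is a simplification; real STs can sit behind several branches, and real queues also handle partial overlaps and memory ordering.

```python
class ToyStoreQueue:
    """Hypothetical store queue: a ST is 'performed' for younger LDs on this
    core (store-to-load forwarding) before it may write the cache."""

    def __init__(self, cache: dict):
        self.cache = cache   # coherent state, visible to other cores
        self.queue = []      # (addr, value, branch_tag), oldest first

    def st(self, addr, value, branch_tag) -> None:
        self.queue.append((addr, value, branch_tag))

    def ld(self, addr):
        # Forward from the youngest matching ST; otherwise read the cache.
        for a, v, _ in reversed(self.queue):
            if a == addr:
                return v
        return self.cache.get(addr, 0)

    def squash(self, branch_tag) -> None:
        # Mispredicted branch: its STs never reach the cache. (In a real core
        # the wrong-path LD that consumed forwarded data is discarded too.)
        self.queue = [e for e in self.queue if e[2] != branch_tag]

    def retire(self, branch_tag) -> None:
        # Correct prediction: drain this branch's STs into the cache in order.
        for a, v, t in self.queue:
            if t == branch_tag:
                self.cache[a] = v
        self.queue = [e for e in self.queue if e[2] != branch_tag]


cache = {"x": 1}
sq = ToyStoreQueue(cache)
sq.st("x", 5, branch_tag="b0")   # ST in the shadow of branch b0
print(sq.ld("x"))                # 5: LD sees ST-data before the cache does
sq.squash("b0")                  # b0 resolves as a bad prediction
print(sq.ld("x"), cache["x"])    # 1 1
```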
-----------------------------------------------------------
Interested 3rd parties are another problem to be solved.
In the midst of a CS, a SNOOP comes along and touches a memory
location you are using in the CS.
a) You can make the CS appear never to have happened
b) You can delay the SNOOP response
c) You can wait for the branch to resolve, then choose
   between a) and b)
d) Same as c), but the coherence protocol has a NaK in it.
(c) is only an option for more advanced cache coherence protocols.
(d) allows a CS in its manifestation part (STs) to have priority
(sic.) over interfering SNOOP requests, a soft guarantee of
forward progress.
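The four options can be read as a decision procedure. The function below is one hypothetical policy, not a description of any shipping coherence protocol; the flags `protocol_can_wait` and `protocol_has_nak` are assumptions standing in for "a more advanced protocol" and "a protocol with a NaK".

```python
from enum import Enum


class SnoopAction(Enum):
    ABORT_CS = "a"     # make the CS appear never to have happened
    DELAY = "b"        # delay the SNOOP response
    WAIT_BRANCH = "c"  # wait for the branch, then pick a) or b)
    NAK = "d"          # refuse the SNOOP; the requester must retry


def choose_snoop_action(branch_resolved: bool, predicted_correctly: bool,
                        in_store_phase: bool, protocol_can_wait: bool,
                        protocol_has_nak: bool) -> SnoopAction:
    """One hypothetical policy for a SNOOP that hits a line used by a CS."""
    if protocol_has_nak and in_store_phase:
        # d) CS is already manifesting its STs: give it priority.
        return SnoopAction.NAK
    if not branch_resolved:
        # c) needs a protocol that tolerates a deferred answer;
        # otherwise fall back to a) and discard the speculative CS.
        return SnoopAction.WAIT_BRANCH if protocol_can_wait else SnoopAction.ABORT_CS
    if not predicted_correctly:
        # a) wrong-path CS: discard it; the snooper sees the old values.
        return SnoopAction.ABORT_CS
    # b) correct-path CS: hold the response until the CS exits.
    return SnoopAction.DELAY
```

Putting the NaK test first is what gives a CS in its ST phase priority over the interfering SNOOP in this sketch.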
A critical section, as defined by
software, may touch more cache lines than your atomic event
proposal can handle.
Sure, but then you would only be using my ATOMICs for locking/
unlocking, not for the work going on.
Thus, for your architecture to be useful,
you must provide some mechanism for handling those
eventualities.
As stated about 15 responses ago, just use them as ATOMIC primitive
generators.
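Using ATOMICs only as primitive generators for locking and unlocking might look like this sketch, which assumes a hypothetical compare-and-swap `cas` on a toy memory dict. Python here only models the semantics in a single thread; the indivisibility would come from the hardware ATOMIC event, and the CS body in between would use ordinary LDs and STs.

```python
def cas(mem: dict, addr, expected, new) -> bool:
    """Hypothetical compare-and-swap on a toy memory dict. Real hardware would
    perform the read, compare, and write as one indivisible ATOMIC event;
    this single-threaded model only shows the semantics."""
    if mem.get(addr, 0) == expected:
        mem[addr] = new
        return True
    return False


def try_lock(mem: dict, lock_addr, owner: int) -> bool:
    # Lock is free when it holds 0; claim it by writing a nonzero owner id.
    return cas(mem, lock_addr, 0, owner)


def unlock(mem: dict, lock_addr, owner: int) -> bool:
    # Only the current owner can release the lock.
    return cas(mem, lock_addr, owner, 0)


mem = {}
print(try_lock(mem, "LOCK", owner=1))  # True
print(try_lock(mem, "LOCK", owner=2))  # False: already held
print(unlock(mem, "LOCK", owner=1))    # True
```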
Your stuff sounds fine for getting into the critsec; but it
doesn't help you once you're in it, and it seems like it may
hurt you if some higher-priority thing comes along and yanks
your lock out from under you.
I conceded that point at least 2 days ago.
- Dan C.