Newsportal USENET - Re: Arguments for a sane ISA 6-years later

On 7/29/2024 12:25 AM, BGB wrote:

On 7/28/2024 10:32 PM, Chris M. Thomasson wrote:
On 7/26/2024 10:00 AM, Anton Ertl wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 7/25/2024 1:09 PM, BGB wrote:
At least with a weak model, software knows that if it doesn't go through
the rituals, the memory will be stale.
>
There is no guarantee of staleness, only a lack of stronger ordering
guarantees.
>
The weak model is ideal for me. I know how to program for it
>
And the fact that this model is so hard to use that few others know
how to program for it makes it ideal for you.
>
and it's more efficient
>
That depends on the hardware.
>
Yes, the Alpha 21164 with its imprecise exceptions was "more
efficient" than other hardware for a while, then the Pentium Pro came
along and gave us precise exceptions and more efficiency. And
eventually the Alpha people learned the trick, too, and 21264 provided
precise exceptions (although they did not admit this) and more
efficiency.
>
Similarly, I expect that hardware that is designed for good TSO or
sequential consistency performance will run faster on code written for
this model than code written for weakly consistent hardware will run
on that hardware. That's because software written for weakly
consistent hardware often has to insert barriers or atomic operations
just in case, and these operations are slow on hardware optimized for
weak consistency.
>
By contrast, one can design hardware for strong ordering such that the
slowness occurs only in those cases when actual (not potential)
communication between the cores happens, i.e., much less frequently.
>
and sometimes use cases do not care if they encounter "stale" data.
>
Great. Unless these "sometimes" cases are more often than the cases
where you perform some atomic operation or barrier because of
potential, but not actual communication between cores, the weak model
is still slower than a well-implemented strong model.
>
A strong model? You mean I don't have to use any memory barriers at all? Tell that to SPARC in RMO mode... How strong? Even the x86 requires a membar when a store followed by a load to another location shall be respected wrt order. Store-Load. #StoreLoad over on SPARC. ;^)
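
For reference, the store-followed-by-load case is the classic store-buffering pattern. A minimal C++ sketch, not taken from anyone's post: the names x, y, side0 and side1 are made up for illustration, and the seq_cst fence stands in for #StoreLoad (an MFENCE or a locked instruction would be the usual x86 equivalent).

```cpp
#include <atomic>

// Hypothetical shared locations, both initially 0.
std::atomic<int> x{0}, y{0};

// Meant to run on one core: store to x, then load a *different* location y.
int side0() {
    x.store(1, std::memory_order_relaxed);
    // Keeps the store above from being reordered after the load below.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return y.load(std::memory_order_relaxed);
}

// Meant to run on another core: the mirror image.
int side1() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return x.load(std::memory_order_relaxed);
}
```

With both fences in place, side0() and side1() cannot both return 0 when run concurrently. Drop the fences and that outcome is allowed, including on x86, since the store can still sit in the store buffer when the load executes.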
>
If you can force everything to be #StoreLoad (*) and make it faster than a handcrafted algo on a very weak memory system, well, hats off! I thought it was easier for a HW guy to implement weak consistency? At the cost of the increased complexity wrt programming the sucker! ;^)
>
Programming for a weak model isn't that hard...
Well, unless the program is built around a "naive lock free" strategy (where the threads manipulate members in a data-structure or similar and assume that the other threads will see the updates in a more-or-less consistent way).

Lock/wait-free algorithms are very nice. Yes, they can be fairly hard, but they can be done for sure: stable and working in 100% correct order. The good ones are hard to beat using all-locking logic. Try to beat RCU using a read/write lock? I have some interesting algorithms that work like a charm.

Though, one does have the issue that one can't just use cheap spinlocks.

One note... Spinlocks work in a very weak memory model for sure. You just need the right memory barrier logic... For instance, SPARC in RMO mode wrt locking a spinlock and/or mutex requires a #LoadStore | #LoadLoad membar _after_ the atomic logic that actually locks it occurs. It also requires a release membar #LoadStore | #StoreStore _before_ the atomic logic that unlocks it takes place. Take note that #StoreLoad is _not_ required for a spinlock or a mutex in this context...
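
A rough sketch of that barrier placement in portable C++, assuming the usual approximation that an acquire operation covers #LoadLoad | #LoadStore after the locking atomic and a release operation covers #LoadStore | #StoreStore before the unlocking store. The class name SpinLock is made up for illustration.

```cpp
#include <atomic>
#include <thread>

class SpinLock {
    std::atomic<bool> locked_{false};

public:
    void lock() {
        // Acquire: loads and stores in the critical section cannot move
        // above the exchange that takes the lock.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();  // be polite while spinning
        }
    }

    bool try_lock() {
        // Succeeds only if the previous value was false (unlocked).
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    void unlock() {
        // Release: loads and stores in the critical section cannot move
        // below the store that drops the lock.
        locked_.store(false, std::memory_order_release);
    }
};
```

Neither path asks for store-to-load ordering, which matches the point that #StoreLoad is not needed here.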
However... There is "special" mutex logic that actually requires a #StoreLoad! Peterson's algorithm for example. Iirc, it needs a #StoreLoad because it depends on a store followed by a load to another location to hold true. This is a bit different than other locking algorithms...
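
A minimal two-thread sketch of Peterson's algorithm in C++, under the conservative assumption that every access uses the default seq_cst ordering. That is the simplest way to get the store-then-load ordering the algorithm depends on; hand-tuned weaker orderings are possible but easy to get wrong. The class name PetersonLock is made up for illustration.

```cpp
#include <atomic>
#include <thread>

class PetersonLock {
    std::atomic<bool> flag_[2];  // flag_[i]: thread i wants the lock
    std::atomic<int> turn_;      // which thread has to wait on a tie

public:
    PetersonLock() : turn_(0) {
        flag_[0].store(false);
        flag_[1].store(false);
    }

    // me must be 0 or 1.
    void lock(int me) {
        const int other = 1 - me;
        flag_[me].store(true);  // seq_cst store: announce interest
        turn_.store(other);     // seq_cst store: give way on a tie
        // Loads of *other* locations right after our own stores. This is
        // the store -> load dependency that needs #StoreLoad ordering.
        while (flag_[other].load() && turn_.load() == other) {
            std::this_thread::yield();
        }
    }

    void unlock(int me) {
        flag_[me].store(false);
    }
};
```

Usage would be one thread calling lock(0)/unlock(0) and the other lock(1)/unlock(1) around the shared data.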
Then there are more "exotic" methods such as so-called asymmetric mutexes. They can have fast paths and slow paths, so to speak. It's almost getting into the realm of RCU here... A fast path can be memory barrier free. The slow path can make things consistent with the use of so-called "remote" memory barriers. It's funny that Windows seems to have one:
https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers
;^)
The slow path is meant to not be frequently used, hence the term asymmetric. On par with read/write logic... :^)
Should have some more time to respond to the rest of your post tonight or tomorrow. I am a bit busy right now.

>
(*) Not just #StoreLoad for full consistency, you would need :
>
MEMBAR #StoreLoad | #LoadStore | #StoreStore | #LoadLoad
>
right?
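
To make the bookkeeping concrete, here is a small sketch that treats the four MEMBAR flavors as a bitmask. The bit values are illustrative only and the names (Membar, kAcquire, kRelease, kFull) are made up for this example; only the #LoadLoad / #StoreLoad / #LoadStore / #StoreStore labels come from the thread.

```cpp
#include <cstdint>

// Illustrative bit assignments; only the names mirror the MEMBAR labels.
enum Membar : std::uint8_t {
    LoadLoad   = 1 << 0,
    StoreLoad  = 1 << 1,
    LoadStore  = 1 << 2,
    StoreStore = 1 << 3,
};

// Barrier after the atomic that takes a lock.
constexpr std::uint8_t kAcquire = LoadLoad | LoadStore;
// Barrier before the atomic that releases a lock.
constexpr std::uint8_t kRelease = LoadStore | StoreStore;
// Everything ordered against everything.
constexpr std::uint8_t kFull = LoadLoad | StoreLoad | LoadStore | StoreStore;

// True if the mask orders an earlier store before a later load.
constexpr bool orders_store_before_load(std::uint8_t mask) {
    return (mask & StoreLoad) != 0;
}

static_assert((kAcquire | kRelease | StoreLoad) == kFull,
              "acquire + release + StoreLoad covers all four orderings");
```

The static_assert spells out the point: acquire and release combined still leave out #StoreLoad, so a full barrier needs all four bits.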
FWIW:
I did figure out a way to "affordably" implement register banking.
So, first attempt, naive strategy:
   Simply increased register array sizes from 64 to 256;
   Result (on XC7A100T):
     LUT cost went from 90% to 120%;
     LUTRAM cost went from 22% to 70%.
     Clearly, this wasn't going to work...
Note that, when anything goes over 100%, the graph turns red, and the "Implementation" stage is going to fail (and that last 20% isn't going to just disappear...).
So, ended up needing a more complex strategy:
   Expand register tags to 4 bits, also encoding the current bank;
   Add a banked register array, 256x 64-bits;
   If a register port isn't in the correct bank:
     Set parameters to signal that it needs to be swapped out;
     Stall the pipeline.
   Store the old register value to the array,
     while also fetching the register value from the array.
   If a swap-operation has fetched a register value from the array,
     store it into the main register array
       Currently by overriding the Lane 1 write port.
       TBD: May move to the Lane 3 port.
   Still need to work out the specifics of the mechanism for moving to/from the banked registers (may not be encoded directly with the existing register numbering; so will likely need to be indirect).
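
The swap-on-mismatch idea can be sketched as a small software model. This is only an illustration under stated assumptions: 64 architectural registers, 4 banks (so 256 backing slots), a per-register tag recording which bank currently occupies the main array, and a counter standing in for pipeline stalls. The names BankedRegFile, ensure_resident and stalls are hypothetical and not taken from the actual core.

```cpp
#include <array>
#include <cstdint>

constexpr int kRegs  = 64;  // assumed size of the main register array
constexpr int kBanks = 4;   // assumed bank count (256 / 64)

struct BankedRegFile {
    std::array<std::uint64_t, kRegs> main{};             // fast register array
    std::array<std::uint8_t, kRegs> tag{};               // bank held in main[r]
    std::array<std::uint64_t, kRegs * kBanks> banked{};  // 256 x 64-bit backing array
    int current_bank = 0;
    int stalls = 0;  // number of swaps; each would stall the pipeline

    // Make main[r] hold the value that belongs to the current bank.
    void ensure_resident(int r) {
        if (tag[r] == current_bank) return;  // already in the right bank
        ++stalls;
        // Fetch the wanted value while storing the old one; the two
        // slots differ because the banks differ.
        std::uint64_t incoming = banked[current_bank * kRegs + r];
        banked[tag[r] * kRegs + r] = main[r];
        main[r] = incoming;
        tag[r] = static_cast<std::uint8_t>(current_bank);
    }

    std::uint64_t read(int r) {
        ensure_resident(r);
        return main[r];
    }

    void write(int r, std::uint64_t v) {
        ensure_resident(r);
        main[r] = v;
    }
};
```

In this model only registers that are actually touched after a bank change pay the swap cost, which is why interrupts that use few registers would come out ahead.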
Also there is currently likely to be wonk, and this will make interrupts and system calls faster at the expense of likely making task-switches slightly slower.
Basically, rather than being able to load/store the registers directly, one will now need to load/store them and also MOV them via temporary registers (adding some extra clock cycles).
Also, a to-be-addressed issue:
To actually use the mechanism, would need different logic for the interrupt handlers (and context switching).
As designed though, the mechanism itself is backwards compatible with my existing behavior (unless actually told to use the bank swapping, everything will behave as it did before).
   But, I am still having doubts as to whether or not this makes sense.
I had thought RISC-V's Privileged spec had defined per-mode bank switching, but I couldn't find any mention of this when I went back to look at it.
It appears instead that they were actually using a similar "save and restore everything on each interrupt" strategy as what BJX2 had been using thus far.
Most obvious difference was that apparently they allowed interrupts to be layered across modes:
   User Mode interrupts go to Supervisor Mode;
   Supervisor mode interrupts go to Machine Mode.
Contrast to the BJX2 core which only has an equivalent of Machine Mode interrupts (and treats User and Supervisor Mode as basically the same, differing mostly in that Supervisor mode has access to privileged instructions).
...
