Re: arm ldxr/stxr vs cas

Subject: Re: arm ldxr/stxr vs cas
From: jseigh_es00 (at) *nospam* xemaps.com (jseigh)
Newsgroups: comp.arch
Date: 11 Sep 2024, 14:22:11
Organization: A noiseless patient Spider
Message-ID: <vbs5i5$3kege$1@dont-email.me>
References: 1 2 3 4 5 6 7 8
User-Agent: Mozilla Thunderbird
On 9/11/24 00:15, Paul A. Clayton wrote:
> On 9/9/24 3:14 AM, Terje Mathisen wrote:
>> jseigh wrote:
>>> I'm not so sure about making the memory lock granularity same as
>>> cache line size but that's an implementation decision I guess.
>>
>> Just make sure you never have multiple locks residing inside the
>> same cache line!
>
> Never?
>
> I suspect at least theoretically conditions could exist where
> having more than one lock within a cache line would be beneficial.
>
> If lock B is always acquired after lock A, then sharing a cache
> line might (I think) improve performance. One would lose
> prefetched capacity for the data protected by lock A and lock B.
> This assumes simple locks (e.g., not readers-writer locks).
>
> It seems to me that the pingpong problem may be less important
> than spatial locality depending on the contention for the cache
> line and the cache hierarchy locality of the contention
> (pingponging from a shared level of cache would be less
> expensive).
I've been performance testing on a 4-core, 8-hw-thread cpu and
generally it's faster if I run on 4 hw threads, 1 per core,
than on all 8 hw threads.  But if I run on 2 hw threads, both
on the same core, it's 2x faster than 2 hw threads on 2 cores,
1 per core.  Probably because they're running out of core-local
L1/L2 cache rather than the shared L3 cache.
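
For what it's worth, below is a sketch of one way to place threads
for that kind of test on Linux with pthread_setaffinity_np().  The
cpu numbers are assumptions for illustration; the actual hw-thread
sibling numbering is machine-specific (see
/sys/devices/system/cpu/cpu*/topology/thread_siblings_list).

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void pin_to_cpu(pthread_t t, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t, sizeof(set), &set);
}

/* e.g., if cpus 0 and 4 are the two hw threads of core 0:
 *   pin_to_cpu(t1, 0); pin_to_cpu(t2, 4);  // 2 threads, same core
 *   pin_to_cpu(t1, 0); pin_to_cpu(t2, 1);  // 2 threads, 1 per core
 */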
> If work behind highly active locks is preferentially or forcefully
> localized, pingponging would be less of a problem, it seems.
> Instead of an arbitrary core acquiring a lock's cache line and
> doing some work, the core could send a message to the natural
> owner of the cache line to do the work.
>
> If communication between cores was low latency and simple messages
> used little bandwidth, one might also conceive of having a lock
> manager that tracks the lock state and sends a granted or not-
> granted message back. This assumes that the memory location of the
> lock itself is separate from the data guarded by the lock.
>
> Being able to grab a snapshot of some data briefly without
> requiring (longer-term) ownership change might be useful even
> beyond lock probing (where a conventional MESI would change the
> M-state cache to S forcing a request for ownership when the lock
> is released). I recall some paper proposed expiring cache line
> ownership to reduce coherence overhead.
>
> Within a multiple-cache-line atomic operation/memory transaction,
> I _think_ if the write set is owned, the read set could be grabbed
> as such snapshots. I.e., I think any remote write to the read set
> could be "after" the atomic/transaction commits. (Such might be
> too difficult to get right while still providing any benefit.)
Well, there's HTM (hardware transactional memory), but apparently
that's hard to get right, and it's limited in snapshot size.

In software there's a large body of lock-free data structures
using various deferred reclamation schemes: RCU, hazard pointers,
etc.  RCU has zero read-side cost; hazard pointers are almost as
cheap, about 3x the cost of a pipelined load.  Hazard pointers got
a lot faster when the need for a store/load memory barrier was
eliminated.
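
For anyone who hasn't seen it, here's a minimal single-slot sketch
of the classic hazard pointer acquire protocol in C++11 atomics
(names are illustrative).  The seq_cst store is the store/load
ordering referred to above; the faster schemes replace it with an
asymmetric fence so the reader side pays almost nothing.

#include <atomic>

struct Node { int payload; };

std::atomic<Node*> shared_node{nullptr};  // pointer being protected
std::atomic<Node*> hazard_slot{nullptr};  // this thread's hazard slot

Node* hp_acquire()
{
    Node* p;
    do {
        p = shared_node.load(std::memory_order_acquire);
        // Publish the hazard.  seq_cst orders this store before
        // the re-check load below (the store/load barrier).
        hazard_slot.store(p, std::memory_order_seq_cst);
        // Re-check: if shared_node changed, the reclaimer may not
        // have seen our hazard, so retry.
    } while (shared_node.load(std::memory_order_acquire) != p);
    return p;  // safe to dereference until hazard_slot is cleared
}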
These lock-free data structures can be quite huge, e.g.
a lock-free map with millions of entries used to cache a Redis
database with billions of entries.  Certainly bigger than
you can fit into hw cache.
> (Weird side-thought: I wonder if a conservative filter might be
> useful for locking, particularly for writer locks. On the one
> hand, such would increase the pingpong in the filter when writer
> locks are set/cleared; on the other hand, reader locks could use
> a remote increment within the filter check atomic to avoid slight
> cache pollution.)
There are reader/writer bakery-style spin locks.  Bakery-style
spin locks are more cache-friendly than regular spin locks.
An interlocked operation, fetch_and_add, is only done once,
to get the wait ticket, and then the code just spins testing
the next value, which should be pretty efficient on a strongly
coherent cache.
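
To make that concrete, a minimal (plain, non-rw) ticket spin lock
sketch in C++11 atomics; the fetch_add is the single interlocked
operation and the spin is on ordinary loads:

#include <atomic>

class ticket_lock {
    std::atomic<unsigned> next_ticket{0};
    std::atomic<unsigned> now_serving{0};
public:
    void lock()
    {
        // the only interlocked op: take a ticket
        unsigned me = next_ticket.fetch_add(1, std::memory_order_relaxed);
        // then spin with plain loads until our number comes up
        while (now_serving.load(std::memory_order_acquire) != me)
            ;  // a pause/yield could go here
    }
    void unlock()
    {
        // only the lock holder writes now_serving
        now_serving.store(now_serving.load(std::memory_order_relaxed) + 1,
                          std::memory_order_release);
    }
};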
An early description of rw spin locks:
https://groups.google.com/g/comp.programming/c/tHkE6R4joe0/m/1NyR2OkkHJ4J
Joe Seigh
