Subject: Re: portable proxy collector test...
From: chris.m.thomasson.1 (at) *nospam* gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Date: 09 Dec 2024, 21:47:39
Organization: A noiseless patient Spider
Message-ID: <vj7l1b$im6l$1@dont-email.me>
References: 1 2 3 4 5 6
User-Agent : Mozilla Thunderbird
On 12/9/2024 4:28 AM, jseigh wrote:
> On 12/8/24 18:31, Chris M. Thomasson wrote:
>> On 12/6/2024 2:43 PM, jseigh wrote:
>>> On 12/6/24 16:12, Chris M. Thomasson wrote:
>>>> On 12/6/2024 11:55 AM, Brett wrote:
>>>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>>>> I am wondering if anybody can try to compile and run this C++11 code of
>>>>>> mine for a portable word-based proxy collector, a sort of poor man's RCU,
>>>>>> on an ARM-based system? I don't have access to one. I am interested in
>>>>>> the resulting output.
>>>>>
>>>>> https://godbolt.org
>>>>>
>>>>>> https://pastebin.com/raw/CYZ78gVj
>>>>>> (raw text link, no ads... :^)
>>>>>> [...]
>>>>
>>>> It seems that all of the atomics are LDREX/STREX wrt fetch_add/sub, even with relaxed memory order. Are LDREX/STREX similar to the LOCK prefix on x86/64?
>>>>
>>>> https://godbolt.org/z/EPGYWve71
>>>>
>>>> There are loops for this in the asm code. Adding a loop in there can change things from wait-free to lock-free. Humm...
>>>
>>> Which compiler did you choose? armv8? Try ARM64.
>>>
>>> The newer ARMs have new atomics:
>>> CAS and atomic fetch ops.
>>>
>>> LDREX/STREX are the older load/store-reserved instructions.
>>> On the newer stuff it's ldxr/stxr, not ldrex/strex.
>>
>> Well, for an ARM64 gcc 14.2.0 relaxed fetch_add I get __aarch64_ldadd8_acq_rel in the asm.
>>
>> https://godbolt.org/z/YzPdM8j33
>>
>> An acq_rel barrier for a relaxed membar? Well, that makes me go grrrrrr!
>>
>> It has to be akin to the LOCK prefix over on x86. I want it relaxed, damn it! ;^)
>
> Apart from the memory ordering, if you are using atomic_fetch_add you
> are going to get an interlocked instruction, which is probably
> overkill and has more overhead than you want. Atomic ops
> assume that other cpus might be doing atomic rmw ops on the same
> variable, which is not the case for userspace rcu. You want
> an atomic relaxed load and an atomic relaxed store of the
> incremented value. It will be faster.

I can keep the debug statistics on a per-thread basis. Instead of using a global counter, each thread gets its own counter, and those are all summed up at the end of the program to get the real counts. I forget what that is called. Split counters? It's a well-known technique. These statistics are only there to give me a feel for what is going on; they can be completely removed for a release build, so to speak.
Fwiw my proxy wrt this particular test needs fetch_add for the way it acquires and releases a collector object:
______________________
collector& acquire()
{
    // increment the master count _and_ obtain current collector.
    std::uint32_t current =
        m_current.fetch_add(ct_ref_inc, std::memory_order_acquire);

    // decode the collector index.
    return m_collectors[current & ct_proxy_mask];
}

void release(collector& c)
{
    // decrement the collector.
    std::uint32_t count =
        c.m_count.fetch_sub(ct_ref_inc, std::memory_order_release);

    // check for the completion of the quiescence process.
    if ((count & ct_ref_mask) == ct_ref_complete)
    {
        // odd reference count and drop-to-zero condition detected!
        g_debug_release_collect.fetch_add(1, std::memory_order_relaxed);
        prv_quiesce_complete(c);
    }
}
______________________
Damn! I need to find more time to work on this. ;^o