On 12/3/24 04:01, Anton Ertl wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 11/17/2024 7:17 AM, Anton Ertl wrote:
jseigh <jseigh_es00@xemaps.com> writes:
Or maybe disable reordering or optimization altogether
for those target architectures.
>
So you want to throw out the baby with the bathwater.
>
No, keep the weak order systems and not throw them out wrt a system that
is 100% seq_cst? Perhaps? What am I missing here?
Disabling optimization altogether costs a lot; e.g., look at
<http://www.complang.tuwien.ac.at/anton/bentley.pdf>: if you compare
the lines for clang-3.5 -O0 with clang-3.5 -O3, you see a factor >2.5
for the tsp9 program. For gcc-5.2.0 the difference is even bigger.
That's why jseigh and people like him (I have read that suggestion
several times before) love to suggest disabling optimization
altogether. It's a straw man that does not even need beating up. Of
course they usually don't show results for the supposed benefits of
the particular "optimization" they advocate (or the drawbacks of
disabling it), and jseigh follows this pattern nicely.
That wasn't a serious suggestion.
The compiler is allowed to reorder code as long as it knows the
reordering can't be observed or detected. If there are places in the
code where it can't prove that, it won't optimize across them, more
or less.
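Roughly, in C++ terms (a minimal sketch with made-up names, not from
any real code):

    #include <atomic>

    int data;
    int ready;                       // plain int: no constraint on the compiler

    void publish_plain(int v)
    {
        data = v;
        ready = 1;                   // compiler may move this store earlier, since
    }                                // a single thread can't tell the difference

    std::atomic<int> ready_flag{0};  // atomic: tells the compiler (and hardware)

    void publish_atomic(int v)
    {
        data = v;
        ready_flag.store(1, std::memory_order_release);  // the store to data
    }                                                    // can't be reordered past this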
If you are writing code with concurrent shared data access then you
need to let the compiler know. One way is with locks. Another way,
for lock-free data structures, is with memory barriers. Even if you
had seq_cst hardware you would still need to tell the compiler, so
seq_cst hardware doesn't buy you any less effort from a programming
point of view.
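Continuing the sketch above, the two ways of telling the compiler
look something like this (illustrative only; the locked variant
assumes the writer takes the same lock):

    #include <mutex>

    std::mutex m;

    int read_with_lock()
    {
        std::lock_guard<std::mutex> g(m);  // lock/unlock act as acquire/release,
        return data;                       // so the read stays in the critical section
    }

    int read_lock_free()
    {
        while (ready_flag.load(std::memory_order_acquire) == 0)
            ;                              // spin until the release store is visible
        return data;                       // acquire pairs with the writer's release
    }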
If you are arguing that lock-free programming with memory barriers is
hard, so let's use locks for everything (disregarding that locks have
acquire/release semantics that the compiler has to be aware of and
that programmers aren't always aware of), you might want to consider
the following performance timings on some stuff I've been playing
with.
unsafe 53.344 nsecs ( 0.000) 54.547 nsecs ( 0.000)*
smr 53.828 nsecs ( 0.484) 55.485 nsecs ( 0.939)
smrlite 53.094 nsecs ( 0.000) 54.329 nsecs ( 0.000)
arc 306.674 nsecs ( 253.330) 313.931 nsecs ( 259.384)
rwlock 730.012 nsecs ( 676.668) 830.340 nsecs ( 775.793)
mutex 2,881.690 nsecs ( 2,828.346) 3,305.382 nsecs ( 3,250.835)
smr is smrproxy, something like user space rcu. smrlite is smr w/o
the thread_local access, so I have an idea how much that adds to the
overhead. arc is arcproxy, lock-free reference count based deferred
reclamation. rwlock and mutex are what their names would suggest.
unsafe is no synchronization, to get a base timing on the reader
loop body.
2nd col is per loop read lock/unlock average cpu time
3rd col is with unsafe time subtracted out
4th col is average elapsed time
5th col is with unsafe time subtracted out.
cpu time doesn't measure lock wait time so elapsed time
gives some indication of that.
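The reader loop being timed is essentially of this shape (a rough
sketch, not the actual harness; proxy_read_lock/unlock stand in for
whichever mechanism is under test):

    #include <atomic>

    struct Node { std::atomic<Node*> next{nullptr}; int value{0}; };
    std::atomic<Node*> head{nullptr};

    void proxy_read_lock()   { /* smr/arc/rwlock/mutex enter, or nothing for unsafe */ }
    void proxy_read_unlock() { /* matching exit */ }

    long reader_loop(long iterations)
    {
        long sum = 0;
        for (long i = 0; i < iterations; i++) {
            proxy_read_lock();                        // the cost being measured
            for (Node* p = head.load(std::memory_order_acquire); p != nullptr;
                 p = p->next.load(std::memory_order_acquire))
                sum += p->value;                      // walk the shared structure
            proxy_read_unlock();
        }
        return sum;
    }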
smrproxy is the version that doesn't need the seq_cst memory barrier,
so it is pretty fast (you are welcome).
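To illustrate the difference (generic epoch-style sketch, not the
actual smrproxy code): a symmetric scheme pays for a full fence on
every read-side entry, while an asymmetric one keeps the reader cheap
and makes the writer pay instead:

    #include <atomic>

    std::atomic<unsigned>              writer_epoch{0};
    thread_local std::atomic<unsigned> my_epoch{0};   // per-reader slot

    void read_enter_classic()
    {
        my_epoch.store(writer_epoch.load(std::memory_order_relaxed),
                       std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full fence every entry
    }

    void read_enter_asymmetric()
    {
        my_epoch.store(writer_epoch.load(std::memory_order_relaxed),
                       std::memory_order_relaxed);
        std::atomic_signal_fence(std::memory_order_seq_cst);   // compiler barrier only;
        // the writer compensates with its own heavyweight step
        // (e.g. membarrier() on Linux, or just waiting long enough)
    }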
arc, rwlock, and mutex use interlocked instructions which cause cache
thrashing. mutex will not scale well with the number of threads on
top of that. rwlock depends on how much write locking is going on.
With few write updates, it will look more like arc.
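In caricature, the read-side difference looks like this (hypothetical
names; the real proxies are more involved):

    #include <atomic>

    std::atomic<long> refcount{1};    // shared by all readers: one hot cache line

    void arc_reader_enter() { refcount.fetch_add(1, std::memory_order_acquire); } // interlocked RMW,
    void arc_reader_exit()  { refcount.fetch_sub(1, std::memory_order_release); } // line bounces between cores

    thread_local int reader_slot;     // an smr-style reader touches only
                                      // per-thread state on its fast path,
    void smr_like_reader_enter() { reader_slot = 1; }   // so there is nothing to thrash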
Timings are for 8 reader threads, 1 writer thread on a
4 core/8 hw thread machine.
There are going to be applications where that 2 to 3+ orders of
magnitude difference in overhead is going to matter a lot.
Joe Seigh