On 04/29/2024 08:19 AM, Stefan Ram wrote:
paavo512 <paavo@osa.pri.ee> wrote or quoted:
|Anyway, multithreading performance is a non-issue for Python
|insofar as the Python interpreter runs in a single-threaded
|regime anyway, under the global GIL. They are planning to get
|rid of the GIL, but this work is still in development AFAIK.
|I'm sure it will take years to stabilize the whole Python zoo
|without the GIL.
>
The GIL only prevents multiple Python statements from being
interpreted simultaneously; while a thread is waiting on input
(like sockets), the GIL is released, so such I/O-bound work can
be spread across multiple cores.
>
With asyncio, however, you can easily arrange for a single
thread to "wait in parallel" on thousands of sockets, and there
are fewer opportunities for errors than with multithreading.
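(For illustration, a minimal sketch of the idea: an asyncio echo
server in which one coroutine per connection "waits in parallel"
in a single thread; the host, port, and one-line protocol are
illustrative.)

    import asyncio

    async def handle(reader, writer):
        # One coroutine per connection; thousands of these can
        # wait concurrently in a single thread.
        data = await reader.readline()
        writer.write(data)
        await writer.drain()
        writer.close()
        await writer.wait_closed()

    async def main():
        server = await asyncio.start_server(handle, "127.0.0.1", 8888)
        async with server:
            await server.serve_forever()

    asyncio.run(main())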
>
Additionally, there are libraries like numpy that use true
multithreading internally to distribute computational tasks
across multiple cores. By using such libraries, you can take
advantage of that. (Not to mention the AI libraries that have their
work done in highly parallel fashion by graphics cards.)
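(Likewise a minimal sketch: the matrix product below runs in
compiled BLAS code, which may use several cores internally and
does not need the GIL while it computes; the sizes are
arbitrary.)

    import numpy as np

    a = np.random.rand(2000, 2000)
    b = np.random.rand(2000, 2000)
    # The heavy lifting happens outside the interpreter, in
    # compiled code that can be internally multithreaded.
    c = a @ b
    print(c.sum())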
>
If you want real threads, you can sometimes drop down to Cython.
>
Other languages like JavaScript seem to have an advantage there
because they have no GIL, but with JavaScript, for example,
that's because it always runs in a single thread overall anyway.
And in the languages that do have threads without a GIL, you
quickly realize that writing correct non-trivial programs with
parallel processing is error-prone.
>
Often in Python you can use "ThreadPoolExecutor" to start
multiple threads. If the GIL then becomes a problem (which is
not the case if you're waiting on I/O), you can easily swap it
out for "ProcessPoolExecutor": Then processes are used instead
of threads, and there is no GIL for those.
>
If four cores are available, by dividing up compute-intensive
tasks using "ProcessPoolExecutor", you can expect a speedup
factor of roughly two to four.
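(A minimal sketch of the swap, assuming a stand-in compute-bound
task; changing ProcessPoolExecutor back to ThreadPoolExecutor
leaves the rest of the code unchanged.)

    from concurrent.futures import ProcessPoolExecutor

    def work(n):
        # Stand-in CPU-bound task.
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        # Four worker processes, no GIL between them.
        with ProcessPoolExecutor(max_workers=4) as pool:
            results = list(pool.map(work, [10**6] * 4))
        print(results)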
>
With the Celery library, tasks can be distributed across multiple
processes that can also run on different computers. See, for
example, "Parallel Programming with Python" by Jan Palach.
>
It sort of seems there are two approaches to
the parallel and the asynchronous:
"divide-and-conquer" and "information-cooperation".
The linear speedup of the embarrassingly parallel
in divide-and-conquer, or single-instruction-multiple-data,
is a pretty great thing.
Notions like map-reduce, where the count of values
per key is about the same and thus the aggregates
(summaries, digests, aggregate and analytic functions)
can be computed by horizontal scaling (more boxes
with the same resources), are another usual
divide-and-conquer approach.
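(A toy sketch of the shape of it, with in-memory stand-ins for
the data shards: each worker "maps" its own shard to a partial
summary, and the partials "reduce" into one aggregate.)

    from collections import Counter
    from multiprocessing import Pool

    def count_words(chunk):
        # Map: summarize one shard independently.
        return Counter(chunk.split())

    if __name__ == "__main__":
        chunks = ["a b a", "b c", "a c c"]  # stand-in shards
        with Pool(3) as pool:
            partials = pool.map(count_words, chunks)
        # Reduce: partial aggregates compose into one summary.
        print(sum(partials, Counter()))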
Once upon a time there was this great idea called
"Aglets" or "mobile agents" or "mobile code". This
is basically that a functional program is distributed
to nodes, runs on the facilities of the nodes with
some resources, then returns to the "aglet-hive"
results that can be composed. This is also usually
called the "agent" or "instrumentation"
on the box. (The box: a process model, its processes,
their threads, their "inter-thread calls", their "inter-process
calls", their network, a node, a box. Aglets are the
little caps or tape at the ends of shoe-laces, here
with regards to notions like "aggregate functions"
and "analytic functions".)
The cooperation is basically any notion of a callback.
The callback is one of the fundamental notions of
flow-of-control, and about the most elementary
notion of the functional paradigm in otherwise
the procedural or the imperative paradigm.
Otherwise, when threads fork, divide, and then
join, the join itself is a callback.
So, first learning the idea of a callback is like,
"you mean I need to provide a different entry
point for this code to return and then where
I'm at is exiting forever as if in a shell process
model exec'ing another process and resulting
that this process becomes that one", and it's
like "yeah, you just give it a callback address
and what results is that's where it goes".
It's functional.
(Functional/event-driven, procedural/imperative.)
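(A toy sketch of the idea in plain Python: the caller hands over
an entry point, the work forks onto another thread, and control
"returns" by calling back; the names are illustrative.)

    import threading

    def fetch_async(value, callback):
        # Fork the work onto another thread, then "return" by
        # jumping to the supplied entry point with the result.
        def worker():
            result = value * 2  # stand-in computation
            callback(result)
        threading.Thread(target=worker).start()

    def on_done(result):
        # The continuation: control flows here rather than back
        # to the original call site.
        print("got", result)

    fetch_async(21, on_done)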
Some people learn functional first, and others
procedural first. It's hard to say how people
think, in their mental models of the things,
which are pretty much always flow-machines.
The chip, or the old planar integrated circuit
of the usual standard logic, is systolic,
its flow driven by the systolic clock, which is
why most people have a flow-model of code.
So, there's callbacks, and then there's funnels
and distributors, say, then as with regards to
something like a "Clos network", any kind of
usual model of data-flow, it's a flow-machine.
Funnels/sprinklers: Venturi effect.
In flow machines, there's basically something
like the "Ford-Fulkerson algorithm", a classic
method that formalizes flow and computes the
maximum flow through a network.
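(A compact sketch of the BFS variant, Edmonds-Karp, over an
adjacency matrix of capacities; the little four-node network at
the end is made up for illustration.)

    from collections import deque

    def max_flow(capacity, source, sink):
        # Repeatedly find an augmenting path by BFS and push
        # the bottleneck flow along it until none remains.
        n = len(capacity)
        total = 0
        while True:
            parent = [-1] * n
            parent[source] = source
            queue = deque([source])
            while queue and parent[sink] == -1:
                u = queue.popleft()
                for v in range(n):
                    if parent[v] == -1 and capacity[u][v] > 0:
                        parent[v] = u
                        queue.append(v)
            if parent[sink] == -1:
                return total  # no augmenting path left
            bottleneck = float("inf")
            v = sink
            while v != source:
                u = parent[v]
                bottleneck = min(bottleneck, capacity[u][v])
                v = u
            v = sink
            while v != source:
                u = parent[v]
                capacity[u][v] -= bottleneck
                capacity[v][u] += bottleneck
                v = u
            total += bottleneck

    caps = [[0, 3, 2, 0],
            [0, 0, 1, 2],
            [0, 0, 0, 2],
            [0, 0, 0, 0]]
    print(max_flow(caps, 0, 3))  # 4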
Threads fork, and they also join.
The only hardware threads are independent cores
and their semantics of memory barriers,
according to their clocks. The rest is organization
of context and routine and state and stack,
and for the general purpose usually pre-emptive,
or, "time-sharing".
Or, you know, "nodes".
It's a time-sharing system.
So, there's processes and a process model,
there's the inter-process, then there's threading
models, and the re-entrant and the shared and
the mutex, according to ordering and serial
guarantees or "delivery"; it's message-passing
of course, vis-a-vis "the core" or memory,
a monad or a purely functional state;
it's a distributed system of nodes.
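(A toy sketch of message-passing between processes, using the
standard library: each process owns its own state and the only
sharing is over queues.)

    from multiprocessing import Process, Queue

    def worker(inbox, outbox):
        # No shared memory: receive a message, send a reply.
        msg = inbox.get()
        outbox.put(msg * 2)

    if __name__ == "__main__":
        inbox, outbox = Queue(), Queue()
        p = Process(target=worker, args=(inbox, outbox))
        p.start()
        inbox.put(21)
        print(outbox.get())  # 42
        p.join()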
Once there was an initiative called "Parallel C",
language and compiler extensions to support
language constructs embodying the notions of
the parallel.
Somebody came up with the pi calculus, process
calculi, communicating sequential processes
and the like. I've heard of Dijkstra's law yet I
forget it, and any entry point is a "goto".
In clusters, there's a usual notion of message-passing,
often organized about the process model among nodes
of the cluster. There's MPI and the old Sun Grid Engine.
Once there was the Wolfpack cluster. The clusters are
often mounted in a rack together, about things
like InfiniBand networking and NUMA memory.
"HPC" they call it, though that includes both
the horizontally scalable clusters and also
computers of the super-scalar word variety.
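(A minimal sketch of the message-passing style with mpi4py,
which wraps MPI for Python; run under something like
"mpiexec -n 2 python script.py".)

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        # The root rank passes a message to a peer, which may
        # live on another node of the cluster.
        comm.send({"work": 42}, dest=1, tag=0)
    elif rank == 1:
        msg = comm.recv(source=0, tag=0)
        print("rank 1 got", msg)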
The process model is the usual old
control-plane way to organize processes with
shared resources and separate quotas, and
to support independent process spaces,
making for fork as spawn, while
pipes and message-slots are what make
for join. Then there are threads, OS threads,
and in some cases fibers, lightweight threads
scheduled in user space, all in terms of
processes and OS threads.
Some usual virtual machines or runtimes, like
ye olde Java, have threads and synchronization,
with barriers and monitors and mutexes and whatnot,
built about system calls, in the process model.
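(The same shapes exist in Python's standard library; here a toy
monitor, where a lock guards the shared state and every method
takes it first.)

    import threading

    class Counter:
        # A little monitor: all access to the shared value goes
        # through the lock.
        def __init__(self):
            self._lock = threading.Lock()
            self._value = 0

        def increment(self):
            with self._lock:
                self._value += 1

        def value(self):
            with self._lock:
                return self._value

    c = Counter()
    threads = [threading.Thread(target=c.increment)
               for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(c.value())  # 8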
The interpreted runtimes are usually "single-threaded",
though built about the notions of event loops and responsiveness.
That follows from "a Win32 app with the message pump", then
mostly these days from "a JavaScript binding for the script
element of HTML with an HTTP user-agent, for
UI Events according to old W3C, now WHATWG, and maybe
it's ECMAScript and with modules or something, and then
also there's new-fangled Web Workers, which are threads for
ECMAScript or JavaScript, which are about the same".
Writing algorithms in event loops is a sort of
exercise in frustration, in a sense. Yet, once
recursion is figured out as having to build state
explicitly instead of just filling the stack, it's a thing.
(A single-threaded thing.)
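(A toy sketch of that move: recursion on a tree carried as an
explicit stack of pending work, so no call stack fills up; the
dict-shaped tree is illustrative.)

    def depth(node):
        # The "stack" is explicit state, event-loop friendly.
        pending = [(node, 1)]
        best = 0
        while pending:
            current, d = pending.pop()
            best = max(best, d)
            for child in current.get("children", []):
                pending.append((child, d + 1))
        return best

    tree = {"children": [{"children": [{}]}, {}]}
    print(depth(tree))  # 3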
These days most distributed algorithms are sort
of advised for "horizontal scaling" and "eventual
consistency" with often "opportunistic locks" in
a world of "Murphy guarantees the un-ordered".
Then transactions of the more critical sort are
often "boxes with huge RAM rollback segments"
or as with regards to "matching and reconciliation",
after the fact.
Then of course, obligatory about C++, or C/C++:
it's about OS threads and "guarantees".
Here my approach is "re-routines". "Re-routines:
it's a co-routine, though instead of suspending
it just quits, and instead of switching it just
builds its own monad for the recursion, and
instead of having callbacks, it's always callbacks,
and instead of having futures everywhere,
it's futures everywhere. Try, try again."
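(One reading of that, as a toy sketch and only an assumption
about the intent: sub-results are memoized futures, an
unfinished step makes the routine quit instead of suspend, and
the driver just tries the whole routine again from the top.)

    import concurrent.futures as cf

    pool = cf.ThreadPoolExecutor()
    memo = {}

    class NotYet(Exception):
        pass

    def step(key, fn, *args):
        # Memoized sub-call: a finished result replays at once
        # on the re-run; an unfinished one quits the routine.
        if key not in memo:
            memo[key] = pool.submit(fn, *args)
        fut = memo[key]
        if not fut.done():
            raise NotYet
        return fut.result()

    def routine():
        # Always re-entered from the top; completed steps come
        # straight out of the memo (the "monad").
        a = step("a", lambda: 6)
        b = step("b", lambda x: x * 7, a)
        return b

    def drive():
        while True:  # try, try again
            try:
                return routine()
            except NotYet:
                cf.wait(list(memo.values()))

    print(drive())  # 42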
In the process model in the runtime though,
it's mostly about what services the bus DMA.
"Systems", programming.
Boxes is nodes, ....