On 7/6/2025 9:09 AM, Theo wrote:
> The issue is one of memory capacity and bandwidth. Many applications have a
> large (GB) dataset that doesn't partition nicely between multiple nodes.
> You can of course put GDDR or HBM on an FPGA, but it's the same problem -
> only a few devices must be shared by numerous cores. Ultimately, memory
> throughput beats latency hands down, especially for large datasets. This
> was not such a problem in the Transputer's day, which is why that
> architecture made sense.
Exactly. Partitioning an application "by task" only makes sense if the tasks
are orthogonal -- AND can have their own dedicated resources. The memory
interface determines performance. In software, "communication" drives
performance and complexity -- the more things have to share, the poorer
the design.
[The exception: SIMD.]
So, you need to think about the medium used for the interconnect fabric.
Shared memory has pitfalls, too -- precisely because it offers such high
bandwidth, the protection mechanisms -- in software or hardware -- tend
to want to be lightweight (e.g., you wouldn't want a monitor per datum).
Remember, the mantra is to partition to MINIMIZE sharing.
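To make that concrete, a minimal sketch (the names and the pthread framing
are mine, purely illustrative): each worker owns its slice of the data
outright, and the only shared state is one small mailbox -- rather than a
lock on every datum.

/* Sketch: partitioned ownership instead of "a monitor per datum".        */
/* Each worker touches only its own slice; sharing is confined to one     */
/* coarse mailbox, used exactly once per worker.                          */
#include <pthread.h>
#include <stdio.h>

#define SLICE 1000

typedef struct {
    double data[SLICE];   /* owned exclusively by one worker -- no lock */
    double result;
} partition_t;

static pthread_mutex_t mbox_lock = PTHREAD_MUTEX_INITIALIZER;
static double mbox_total = 0.0;   /* the ONLY shared datum */

static void *worker(void *arg)
{
    partition_t *p = arg;
    double sum = 0.0;
    for (int i = 0; i < SLICE; i++) {   /* all work is on private data */
        p->data[i] = i * 0.5;
        sum += p->data[i];
    }
    p->result = sum;

    pthread_mutex_lock(&mbox_lock);     /* share once, at the end */
    mbox_total += sum;
    pthread_mutex_unlock(&mbox_lock);
    return NULL;
}

int main(void)
{
    partition_t parts[4];
    pthread_t tid[4];

    for (int i = 0; i < 4; i++)
        pthread_create(&tid[i], NULL, worker, &parts[i]);
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);

    printf("total = %f\n", mbox_total);
    return 0;
}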
If the "sharers" need a fat pipe between them, then the medium must
directly (or indirectly) support that.
If the "sharers" require a short pipe, then that becomes an issue
(which need not be mutually exclusive with the pipe's width).
So, you have to partition the application to ensure things can
communicate "fast enough" and "soon enough" -- and still avoid
the folly of "one shared address space".
This can lead to suboptimal hardware implementations. E.g., I have a CPU
per camera instead of a "camera CPU" handling ALL cameras. So, 20 copies
of the same hardware and software, all doing essentially the same thing;
but I can just as easily have *40* copies (whereas a "camera CPU" would
eventually be taxed, computationally). Will the scale of your
application change over time? How will that affect your partitioning?
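Back-of-envelope, with made-up numbers (the resolution, frame rate, and
per-node bandwidth below are assumptions purely for illustration): each
stream is trivial for a node that owns its camera, but the aggregate
quickly exceeds what a single "camera CPU" can move.

/* Sketch: does one "camera CPU" have the memory bandwidth for N cameras? */
/* All figures are invented, purely for illustration.                     */
#include <stdio.h>

int main(void)
{
    const double width = 640, height = 480, bytes_px = 2, fps = 30;
    const double per_cam = width * height * bytes_px * fps;  /* bytes/s  */
    const double node_bw = 200e6;          /* one node's usable bytes/s  */

    for (int n = 1; n <= 40; n *= 2) {
        double need = n * per_cam;
        printf("%2d cameras: need %6.1f MB/s -> %s on one node\n",
               n, need / 1e6, need <= node_bw ? "fits" : "does NOT fit");
    }
    return 0;
}

With these (invented) figures, a single node tops out around ten cameras,
while a CPU-per-camera arrangement never sees more than one stream's worth.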
And, once you have an abundance of computational ability, you then need
to address WHAT gets done WHERE and how that decision will change, over
time. E.g., if the application isn't currently using a camera, then how
can the resources PHYSICALLY set aside for that camera be used to achieve
some other goal? (Ditto any other physical I/Os.) Will your rejiggering
of "virtual" resource allocation still "fit" the above communication criteria?
The appeal (from a complexity, reliability, maintainability point of view)
of a well-partitioned system comes at a real cost, given practical constraints.
Think of all the MIPS wasted by the CPU in your thermostat that the
CPU in your refrigerator could supply! And how many are wasted there
that could be used by your television/STB? Ah, but SO much
easier to design something that is JUST a thermostat or JUST a
refrigerator... than to design something that can be all of the above!