On Mon, 3/24/2025 1:48 PM, Chris wrote:
Paul <nospam@needed.invalid> wrote:
On Mon, 3/24/2025 2:21 AM, rbowman wrote:
On Mon, 24 Mar 2025 00:36:17 -0000 (UTC), Chris wrote:
>
Paul <nospam@needed.invalid> wrote:
On Sun, 3/23/2025 7:17 AM, Joel wrote:
>
It's clear why Microsoft would use x86 emulation with ARM, countless
reasons, but who cares about their Copilot bullshit, put Linux for ARM
on that mother fucker.
>
Some day, you'll be able to run an AI locally.
>
You can. Have a look at Ollama. Totally local and open source. Works
well too!
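As an aside, a minimal sketch of what "totally local" looks like in
practice, assuming a stock Ollama install listening on its default
port 11434 and a model already pulled; the model name is only an
example.

# Query a local Ollama server over its HTTP API (default port 11434).
# Assumes "ollama pull llama3" has already been run; adjust the model
# name to whatever you have locally.
import json
import urllib.request

payload = json.dumps({
    "model": "llama3",     # any locally pulled model
    "prompt": "Explain memory bandwidth in one sentence.",
    "stream": False,       # one JSON reply instead of a token stream
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])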
>
Training and inference are two different things. Other than toy datasets I
doubt much training will happen locally.
>
Realistically, I think it's going to be quite a while,
if ever, before we can put together a decent box for inference.
Not quite sure what exactly you mean by "inference", the latest Mac Studio
that can go up to 512GB RAM is certainly heading in the right direction for
local LLM training.
There are two elements:
Memory bandwidth <=== needs terabytes/sec
Cores
The problem is, the PCIe Rev5 path to system memory is too slow.
The cores are on the video card (something like 26,000 of them,
while the NPU in a CPU has far fewer). A video card has roughly
20x the core performance of an NPU.
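Some back-of-the-envelope arithmetic (the model size and bandwidth
figures below are assumptions for illustration, not benchmarks):
every generated token has to stream essentially the whole weight set
out of memory, so bandwidth sets the ceiling long before core count
does.

# Why bandwidth, not cores, is the ceiling for local LLM inference.
# All figures are illustrative assumptions, not measurements.
params = 70e9                  # assume a 70B-parameter model
bytes_per_param = 2            # fp16/bf16 weights
model_bytes = params * bytes_per_param        # ~140 GB of weights

bandwidth_gb_s = {
    "dual-channel DDR5 desktop":  80,
    "Apple M-series Ultra":      800,
    "HBM on a datacenter GPU":  3000,
}

for name, bw in bandwidth_gb_s.items():
    # each generated token roughly re-streams the full weight set once
    tokens_per_s = (bw * 1e9) / model_bytes
    print(f"{name:28s} ~{tokens_per_s:5.1f} tokens/s")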
Right now there is a race on at several memory companies to build
stacked HBM3 or thereabouts. Traditionally, those stacks are placed
on the same substrate, right next to the cores. Will they continue
to do it that way? It is restrictive to try to jam all the memory
right next to the cores. The alternative is a serial interconnect,
which can run at around 100 Gbit/sec per serial interface, and there
are a lot of those links jumping to the next chip. The technology
for this was first proved out on FPGA chips, at 56 Gbit/sec and
112 Gbit/sec. That's a way of doing comms between FPGA chips, to
build larger arrays of chips.
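Purely illustrative arithmetic on what a bundle of those serial
links adds up to (the lane counts here are invented just to show
the sums, not a real product spec):

# Aggregate raw bandwidth of a bundle of SerDes links.
# Lane counts are assumptions chosen only to illustrate the scaling.
for lane_gbit in (56, 112):          # per-lane rates proved on FPGAs
    for lanes in (16, 32, 64):
        total_gb_s = lane_gbit * lanes / 8   # Gbit/s -> GB/s, raw rate
        # real links lose a little of this to encoding overhead
        print(f"{lanes:3d} lanes @ {lane_gbit:3d} Gbit/s "
              f"= {total_gb_s:6.0f} GB/s raw")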
Cerebras is an example of how to do it (links below). Each "computer" is made of an entire wafer
of logic gates, and is water cooled from the back side. I have no idea how
the memory gets connected to this thing. It has a limited amount of local
memory, which is still a lot bigger than what video cards have for their
internal memory.
https://en.wikipedia.org/wiki/Cerebras
https://www.cerebras.ai/press-release/cerebras-announces-third-generation-wafer-scale-engine

4 trillion transistors
900,000 AI cores
125 petaflops of peak AI performance
44GB on-chip SRAM
5nm TSMC process
External memory: 1.5TB, 12TB, or 1.2PB
Trains AI models up to 24 trillion parameters
Cluster size of up to 2048 CS-3 systems # That is 2048 systems at 26 kW electricity each = 53 MW
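To put that 24-trillion-parameter line in context, a rough sizing
sketch. The bytes-per-parameter figures are the usual mixed-precision
rules of thumb, not Cerebras's published numbers.

# Rough sizing of a 24-trillion-parameter model, to show why the
# external memory options run to petabytes. Bytes/param figures are
# common mixed-precision rules of thumb, not vendor numbers.
params = 24e12

weights_fp16_tb = params * 2 / 1e12    # inference: fp16 weights only
train_state_tb  = params * 16 / 1e12   # training: weights + grads +
                                       # optimizer state (~16 B/param)

print(f"fp16 weights alone : {weights_fp16_tb:6.0f} TB")   # ~48 TB
print(f"full training state: {train_state_tb:6.0f} TB")    # ~384 TB
print("largest external memory option quoted: 1.2 PB = 1200 TB")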
"Ordinary" chips are one inch on edge. They don't yield well
if made larger than that. That company has perfected cooling
an entire wafer, without the wafer cracking. And their design
intent has changed from "general computing" or "supercomputing"
to "AI". That's what the latest wafer represents, is a shift
to entering the AI market.
26 kW for a single wafer system; that's 1.3x the entire incomer on your house :-)
So really, the question there is not "core prowess"; it's the memory
interconnect method that matters most. An "ordinary" interconnect
simply will not do, and will ruin the performance of the thing.
A full-sized system ("skynet") would cost $2 billion to $4 billion or so.
Cheap, really. And 53 MW is peanuts.
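Dividing that out, using the figures quoted above as rough
assumptions rather than anything official:

# Per-system cost and continuous power for a 2048-node cluster,
# using the numbers quoted above (assumptions, not vendor pricing).
systems = 2048
for total_usd in (2e9, 4e9):
    per_system_m = total_usd / systems / 1e6
    print(f"${total_usd/1e9:.0f}B cluster -> ~${per_system_m:.1f}M per CS-3")

total_mw = systems * 26 / 1000         # 26 kW each, as quoted above
print(f"Continuous draw: ~{total_mw:.0f} MW")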
Paul