In article <2025May3.081100@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
cross@spitfire.i.gajendra.net (Dan Cross) writes:
In article <2025May2.073450@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
I think it's the same thing as Greenspun's tenth rule: First you find
that a classical DMA engine is too limiting, then you find that an A53
is too limiting, and eventually you find that it would be practical to
run the ISA of the main cores. In particular, it allows you to use
the toolchain of the main cores for developing them,
>
These are issues solvable with the software architecture and
build system for the host OS.
>
Certainly, one can work around many bad decisions, and in reality one
has to work around some bad decisions, but the issue here is not
whether "the issues are solvable", but which decision leads to better
or worse consequences.
I don't know that either would be "better" or "worse" under any
objective criteria. They would simply be different.
The important characteristic is
that the software coupling makes architectural sense, and that
simply does not require using the same ISA across IPs.
>
IP? Internet Protocol?
When we discuss hardware designs at this level, reusable
components that go into the system are often referred to as "IP
cores" or just "IPs". For example, a UART might be an IP.
Think of them as building blocks that go into, say, a SoC.
Software Coupling sounds to me like a concept
from Constantine out of my Software engineering class.
I have no idea who or what that is, but it seems unrelated.
I guess you
did not mean either, but it's unclear what you mean.
It's a very common term in this context.
https://en.wikipedia.org/wiki/Semiconductor_intellectual_property_core

In any case, I have made arguments why it would make sense to use the
same ISA as for the OS for programming the cores that replace DMA
engines. I will discuss your counterarguments below, but the most
important one to me seems to be that these cores would cost more than
with a different ISA. There is something to that, but when the
application ISA is cheap to implement (e.g., RV64GC), that cost is
small; it may be more an argument for also selecting the
cheap-to-implement ISA for the OS/application cores.
Ok.
Indeed, consider AMD's Zen CPUs; the PSP/ASP/whatever it's
called these days is an ARM core while the big CPUs are x86.
I'm pretty sure there's an Xtensa DSP in there to do DRAM
timing and PCIe link training.
>
The PSPs are not programmable by the OS or application programmers, so
using the same ISA would not benefit the OS or application
programmers.
Its firmware ships in BIOS images. You can, in fact, interact
with it from the OS. The only thing that keeps it from being
programmable by the OS is signing keys.
By contrast, the idea for the DMA replacement engines is
that they are programmable by the OS and maybe the application
programmers, and that changes whether the same ISA is beneficial.
>
What is "ASP/whatever"?
The PSP, or "AMD Platform Security Processor", has many names.
AMD says that "PSP" is the "legacy name", and that the new name
is ASP, for "AMD Secure Processor", and that it provides
"runtime security services"; for example, the PSP implements a
TPM in firmware, and exposes a random number generator that x86
can access via the `RDRAND` instruction.
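As a rough sketch (generic x86, nothing AMD-specific; it assumes
the CPU advertises RDRAND support in CPUID), pulling a value from
that generator in C looks something like this:

  /* Minimal sketch: read the hardware RNG via the RDRAND
   * intrinsic (build with e.g. gcc -mrdrnd).  Real code should
   * check CPUID for RDRAND first and retry a few times on
   * failure, since RDRAND can transiently run dry. */
  #include <immintrin.h>
  #include <stdio.h>

  int main(void)
  {
      unsigned long long r;
      if (_rdrand64_step(&r))          /* returns 1 on success */
          printf("rdrand: %016llx\n", r);
      else
          fprintf(stderr, "rdrand not ready; retry\n");
      return 0;
  }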
Similarly with the ME on Intel.
>
Last I read about it, ME uses a core developed by Intel with IA-32 or
AMD64; but in any case, the ME is not programmable by OS or
application programmers, either.
I was under the impression that it started out as an ARM core,
but I may be mistaken.
In any case, where do you think its firmware comes from?
A BMC might be running on whatever.
>
Again, a BMC is not programmable by OS or application programmers.
The people working on OpenBMC disagree.
We increasingly see ARM
based SBCs that have small RISC-V microcontroller-class cores
embedded in the SoC for exactly this sort of thing.
>
That's interesting; it points to RISC-V being cheaper to implement
than ARM. As for "that sort of thing", they are all not programmable
by OS or application programmers, so see above.
No, the entire point is to provide an off-load for things that
are real-time. They are absolutely meant to be "programmable by
OS or application programmers", which is exactly the sort of
scenario that Mitch's proposed cores would be used for.
Is a GPU programmable? Yes. Does it use the same ISA as the
general purpose compute core? No.
Our hardware RoT
>
?
Root of Trust.
The problem is when such service cores are hidden (as they are
in the case of the PSP, SMU, MPIO, and similar components, to
use AMD as the example) and treated like black boxes by
software. It's really cool that I can configure the IO crossbar
in useful ways tailored to specific configurations, but it's much
less cool that I have to do what amounts to an RPC over the SMN
to some totally undocumented entity somewhere in the SoC to do
it. Bluntly, as an OS person, I do not want random bits of code
running anywhere on my machine that I am not at least aware of
(yes, this includes firmware blobs on devices).
>
Well, one goes with the other. If you design the hardware for being
programmed by the OS programmers, you use the same ISA for all the
cores that the OS programmers program,
That's a categorical statement that is not well supported. That
may be what is _usually_ done. It is not what _has_ to be done,
or even what _should_ be done.
You may feel that this is the way things should be done, but the
arguments you've presented so far are not persuasive.
whereas if you design the
hardware as programmed by "firmware" programmers, you use a
cheap-to-implement ISA and design the whole thing such that it is
opaque to OS programmers and only offers some certain capabilities to
OS programmers.
There is little fundamental difference between "firmware" and
the "OS". I would further argue that this model of walling off
bits of the system programmed with "firmware" from the OS is a
dated way of thinking about systems that is actively harmful.
See Roscoe's OSDI'21 keynote, here:
https://www.usenix.org/conference/osdi21/presentation/fri-keynote

Insisting that we use the congealed model we currently use
because that's how it is done is circular reasoning.
And that's not just limited to ISAs. A very successful example is the
way that flash memory is usually exposed to OSs: as a block device
like a plain old hard disk, and all the idiosyncrasies of flash are
hidden in the device behind a flash translation layer that is
implemented by a microcontroller on the device.
You're conflating a hardware interface with firmware.
What's "SMN"?
The "System Management Network." This is the thing that AMD
uses inside the SoC to talk between the different components
that make up the system (that is, between the different IPs in
the SoC). SMN is really a network of AXI buses, but it's how
one can, say, read and write registers on various components.
If you look at, for example,
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/55803-ppr-family-17h-model-31h-b0-processors.pdf
and look at the entry for the SMU registers, you'll see
that they have an "aliasSMN" entry in the instance table; those
can be decoded to a 32-bit number. That is the SMN address of
that register. For example, `SMU::THM::THM_TCON_CUR_TMP` is the
thermal register maintained by the SMU that encodes the current
temperature (in normalized units that are scaled from e.g.
degrees C, to accommodate different operating temperature ranges
between different physical parts). Anyway, if one were to
decode the address in the instance table, one would see that
that register is at SMN address 0x0005_9800. One accesses SMN
via an address/data pair of registers on a special BDF (0/0/0)
in PCI config space. If you write that address to offset 0x60
for 0/0/0, and then read from offset 0x64 on 0/0/0, you'll get
the contents of that register. You can use either port IO or
ECAM for such accesses.
Similarly, consider `PCS::DXIO::PCS_GOPX16_PCS_STATUS1`, which
is a register with multiple instances for each XGMI PCS (before
you ask, "PCS" is "Physical Coding Sublayer" and xGMI is the
socket-to-socket [external] Global Memory Interface). That is,
these are the SerDes (Serializer/Deserializer) for communicating
between sockets. Anyway, the SMN address that corresponds to
PCS 21, serdes aggregator 1, is 0x12ff_0050.
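To make the access pattern concrete, here is a minimal sketch of
the index/data dance over port IO; it assumes Linux with iopl(3)
privileges and a Family 17h-style SoC, with the 0x60/0x64 offsets
and the example SMN addresses taken from the PPR above (ECAM
works the same way, just mmap'ed instead of going through
0xCF8/0xCFC):

  /* Read an SMN register through the index/data pair at config
   * offsets 0x60/0x64 on PCI device 0/0/0.  Illustrative only:
   * the kernel's own drivers use these registers too, so an
   * unsynchronized userland poke can race with them. */
  #include <stdint.h>
  #include <sys/io.h>     /* outl/inl; call iopl(3) first */

  static uint32_t pci_cfg_read32(uint8_t off)
  {
      outl(0x80000000u | off, 0xCF8);  /* bus 0, dev 0, fn 0 */
      return inl(0xCFC);
  }

  static void pci_cfg_write32(uint8_t off, uint32_t val)
  {
      outl(0x80000000u | off, 0xCF8);
      outl(val, 0xCFC);
  }

  uint32_t smn_read32(uint32_t smn_addr)
  {
      pci_cfg_write32(0x60, smn_addr); /* SMN index */
      return pci_cfg_read32(0x64);     /* SMN data  */
  }

  /* e.g. smn_read32(0x00059800) for SMU::THM::THM_TCON_CUR_TMP,
   * or smn_read32(0x12ff0050) for the PCS status register above. */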
and you can also
use the facilities of the main cores (e.g., debugging features that
may be absent from the I/O cores) during development.
>
This is interesting, but we've found it more useful going the
other way around. We do most of our debugging via the SP.
Since the SP is also responsible for system initialization and
holding x86 in reset until we're ready for it to start
running, it's the obvious nexus for debugging the system
holistically.
>
Sure, for debugging on the core-dump level that's useful. I was
thinking about watchpoint and breakpoint registers and performance
counters that one may not want to implement on the DMA-replacement
core, but that is implemented on the OS/application cores.
I assumed you were talking about remote hardware debugging
interfaces. You seem to be talking about just running a
debugger or profiler on the IO offload core. That's a much
simpler use case.
Marking the binaries that should be able to run on the IO service
processors with some flag, and letting the component of the OS that
assigns processes to cores heed this flag is not rocket science.
>
I agree, that's easy. And yet, mistakes will be made, and there
will be tension between wanting to dedicate those CPUs to IO
services and wanting to use them for GP programs: I can easily
imagine a paper where someone modifies a scheduler to move IO
bound programs to those cores. Using a different ISA obviates
most of that, and provides an (admittedly modest) security benefit.
>
If there really is such tension, that indicates that such cores would
be useful for general-purpose use. That makes the case for using the
same ISA even stronger.
Incorrect. It makes it weaker: the whole point is to have
coprocessor cores that are dedicated to IO processing that are
not used for GP compute. As Mitch said, they're already far
away from DRAM; using them for compute is going to suck. They
are there to offload IO processing from the big cores; don't
make it easier to abuse their existence.
As for "mistakes will be made", that also goes the other way: With a
separate toolchain for the DMA-replacement ISA, there is lots of
opportunity for mistakes.
I meant runtime mistakes. You can't run x86 code on them if
they're not an x86 core.
As for "security benefit", where is that supposed to come from?d
You can't run x86 code on them if they're not an x86 core.
What
attack scenario do you have in mind where that "security benefit"
could materialize?
Someone figures out how to exploit a flaw in the OS whereby some
user thread can execute on an IO coprocessor core, and they
figure out you can speculate on IO transactions, allowing them
to exfiltrate data directly from the IO source.
But, if the OS _cannot_ schedule a user process there, because
it's running an entirely different ISA, then that cannot happen.
And if I already have to modify or configure the OS to
accommodate the existence of these things in the first place,
then accommodating an ISA difference really isn't that much
extra work. The critical observation is that a typical SMP view
of the world no longer makes sense for the system architecture,
and trying to shoehorn that model onto the hardware reality is
just going to cause frustration.
>
The shared-memory multiprocessing view of the world is very
successful, while distributed-memory computers are limited to
supercomputing and other areas where hardware cost still dominates
over software cost (i.e., where the software crisis has not happened
yet); as an example of the lack of success of the distributed-memory
paradigm, take the PlayStation 3; programmers found it too hard to
work with, so they did not use the hardware well, and eventually Sony
decided to go for an SMP machine for the PlayStation 4 and 5.
The SoCs you are talking about are already, literally,
"distributed memory computers". See above about the SMN.
OTOH, one can say that the way many peripherals work on
general-purpose computers is more along the lines of
distributed-memory; but that's probably due to the relative hardware
and software costs for that peripheral. Sure, the performance
characteristics are non-uniform (NUMA) in many cases, but 1) caches
tend to smooth over that, and 2) most of the code is not
performance-critical, so it just needs to run, which is easier to
achieve with SMP and harder with distributed memory.
>
Sure, people have argued for advantages of other models for decades,
like you do now, but SMP has usually won.
Bluntly, you're making a lot of assumptions and drawing
conclusions from those assumptions.
On the other hand, you buy a motherboard with said ASIC core,
and you can boot the MB without putting a big chip in the
socket--but you may have to deal with scant DRAM since the
big centralized chip contains the memory controller.
>
A neat hack for bragging rights, but not terribly practical?
>
Very practical for updating the firmware of the board to support the
big chip you want to put in the socket (called "BIOS FlashBack" in
connection with AMD big chips).
>
"BIOS", as loaded from the EFS by the ABL on the PSP on EPYC
class chips, is usually stored in a QSPI flash on the main
board (though starting with Turin you _can_ boot via eSPI).
Strictly speaking, you don't _need_ an x86 core to rewrite that.
On our machines, we do that from the SP, but we don't use AGESA
or UEFI: all of the platform enablement stuff done in PEI and
DXE we do directly in the host OS.
>
EFS? ABL? QSPI? eSPI? PEI? DXE?
Umm, those are the basic components of the "BIOS" and
surrounding stack as implemented on AMD systems with AGESA and
UEFI. If you are unaware of what these mean, perhaps you should
spend a little bit of time reading up on how the things you are
frankly making a lot of assumptions about actually work.
In this case, I'm happy to explain a bit, but, frankly, your
response makes it painfully obvious that you really need to
do your own homework here.
* EFS: Embedded File System. This is the filesystem-like format
that AMD uses for the data stored in flash that is loaded by
the PSP.
* ABL: AGESA Boot Loader. This is a software component that
runs on the PSP that reads and interprets the "BIOS" image
in the EFS on flash and loads the x86 code that runs from the
reset vector into DRAM.
* QSPI: Quad SPI. This is the physical interface used to access
the flash that holds the EFS. It is lined out from the socket
and thus the CPU so that the PSP can access it. Other things
can also access it via a series of muxes; for example, on OCP
boards like Ruby it's accessible across the DC-SCM connector
to the BMC so that the BMC can update flash.
* eSPI: enhanced Serial Peripheral Interface. See the Intel
spec. Supported in Genoa, and now in Turin, it's possible to
boot an AMD EPYC CPU over eSPI. eSPI is lined out from the
package.
* PEI: The "Pre-EFI Initialization" phase of UEFI (Unified
Extensible Firmware Interface -- the "modern" BIOS). This is
the phase where most of the platform enablement stuff is done;
for example, the PCIe buses are initialized and links are
trained; see:
https://github.com/openSIL/openSIL/blob/main/xUSL/Mpio/Common/MpioInitFlow.c#L508
* DXE: The "Driver Execution Environment" phase of UEFI, where
individual _devices_ are found and initialized.
https://uefi.org/specs/PI/1.9/V1_Overview.html

Anyway, what you do in your special setup does not detract from the
fact that being able to flash the firmware without having a working
main core has turned out to be so useful that out of 218 AM5
motherboards offered in Austria <https://geizhals.at/?cat=mbam5>, 203
have that feature.
Sure. It's useful. You just don't need to have an x86 core to
do it.
Also, on AMD machines, again considering EPYC, it's up to system
software running on x86 to direct either the SMU or MPIO to
configure DXIO and the rest of the fabric before PCIe link
training even begins (releasing PCIe from PERST is done by
either the SMU or MPIO, depending on the specific
microarchitecture). Where are these cores, again? If they're
close to the devices, are they in the root complex or on the far
side of a bridge? Can they even talk to the rest of the board?
>
The core that does the flashing obviously is on the board, not on the
CPU package (which may be absent). I do not know where on the board
it is.
I was referring to Mitch's proposed co-processor cores. The
point was, that if they're on the distant end of an IO bus that
isn't even configured, and not somehow otherwise connected to
the flash part that holds the BIOS, then they're not going to
help you flash the BIOS without the socket being populated so
that you've got something that can set up that IO bus so that
those cores can connect to anything useful. You seem to be
assuming that they're just going to start, in the absence of
the main package, but again, that's a big assumption.
Typically only one USB port can be used for that, so that may
indicate that a special path may be used for that without initializing
all the USB ports and the other hardware that's necessary for that; I
think that some USB ports are directly connected to the CPU package,
so those would not work anyway.
Like I said, you could have an electromechanical interlock that
lets the IO coprocessors boot independently and talk directly to
the flash mux if the socket is not populated. The interface by
which you get the flash image is immaterial at that point. But
it's not at all clear to me that Mitch had anything like that in
mind.
In a case where we did not have that
feature, and the board did not support the CPU, we had to buy another
CPU to update the firmware
<https://www.complang.tuwien.ac.at/anton/asus-p10s-c4l.html>. That's
especially relevant for AM4 boards, because the support chips make it
hard to use more than 16MB Flash for firmware, but the firmware for
all supported big chips does not fit into 16MB. However, as the case
mentioned above shows, it's also relevant for Intel boards.
>
You shouldn't need to boot the host operating system to do that,
though I get that on most consumer-grade machines you'll do it via
something that interfaces with AGESA or UEFI.
>
In the bad old days you had to boot into DOS and run a DOS program for
flashing the BIOS. Or worse, Windows; not very useful if you don't
have Windows installed on the computer (DOS at least could be booted
from a floppy disk). My last few experiences in that direction were
firmware flashing as a "BIOS" feature, and the flashback feature
(which has its own problems, because communication with the user is
limited).
>
Most server-grade
machines will have a BMC that can do this independently of the
main CPU,
>
And just in another posting you wrote "but not terribly practical?".
The board I mentioned above where we had to buy a separate CPU for
flashing mentioned a BMC on the feature list, but when we looked in
the manual, we found that the BMC is not delivered with the board, but
has to be bought separately. There was also no mention that one can
use the BMC for flashing the BIOS.
Sounds like a problem with the vendor.
and I should be clear that I'm discounting use cases
for consumer grade boards, where I suspect something like this
is less interesting than on server hardware.
>
What makes you think so? And what do you mean by "something like
this"?
"Something like this" meaning a dedicated IO coprocessor on the
far side of the root complex for offloading IO handling.
If you can't see why that might have more applications in the
data center than on the desktop, I don't know what to tell you.
Maybe there are consumer use cases I'm not aware of.
1) "BIOS flashback" is a mostly-standard feature in AM5 (i.e.,
consumer-grade) boards.
Of course.
2) DMA has been a standard feature in various forms on consumer
hardware since the first IBM PC in 1981, and replacing the DMA engines
with cores running a general-purpose ISA accessible to OS designers
will not be limited to servers;
I don't think that was the suggestion.
if hardware designers and OS
developers put development time into that, there is no reason for
limiting that effort to servers. The existence of the LPE-Cores on
Meteor Lake (not a server chip) and the in-order ARM cores on various
smartphone SOCs, the existence of P-Cores and E-Cores on Intel
consumer-grade CPUs, while the server versions of these CPUs have the
E-Cores disabled, and the uniformity of cores on the dedicated server
CPUs indicate that non-uniform cores seem to be hard to sell in
server space.
The systems you just mentioned were designed for minimizing
power consumption, something that's very useful in the consumer
space (e.g., for battery operated applications, like phones and
laptops) and less useful in the data center space. Dedicated
coprocessors that offload things like IO do have a long history
in the mainframe world, but that hasn't filtered down to the
server space, in part because it's not well-supported by
software.
- Dan C.