On 2/3/2025 1:41 PM, Thomas Koenig wrote:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
That is fine for code that is being actively maintained and where backward
data structure compatibility is not required (like those inside a kernel).
However for x86 there were a few billion lines of legacy code that likely
assumed 2-byte alignment, or followed the advice to align fp64 to 32 bits,
and a C language that mandates structs be laid out in memory exactly as
specified (no automatic struct optimization). Also I seem to recall some
amount of squawking about SIMD when it required naturally aligned buffers.
As SIMD no longer requires alignment, presumably code no longer does so.
Looking at Intel's optimization manual, they state in
"15.6 DATA ALIGNMENT FOR INTEL® AVX"
"Assembly/Compiler Coding Rule 65. (H impact, M generality) Align
data to 32-byte boundary when possible. Prefer store alignment
over load alignment."
and further down, about AVX-512,
"18.23.1 Align Data to 64 Bytes"
"Aligning data to vector length is recommended. For best results,
when using Intel AVX-512 instructions, align data to 64 bytes.
When doing a 64-byte Intel AVX-512 unaligned load/store, every
load/store is a cache-line split, since the cache-line is 64
bytes. This is double the cache line split rate of Intel AVX2
code that uses 32-byte registers. A high cache-line split rate in
memory-intensive code can cause poor performance."
This sounds reasonable, and good advice if you want to go
down SIMD lane.
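FWIW, a minimal C11 sketch of what that advice amounts to in practice (my own example, assuming aligned_alloc is available and the size is kept a multiple of the alignment):

  #include <stdlib.h>

  int main(void) {
      /* request a 64-byte aligned buffer so full-width AVX-512
         loads/stores never straddle a cache line; C11 aligned_alloc
         wants the size to be a multiple of the alignment */
      float *buf = aligned_alloc(64, 1024 * sizeof(float));
      if (!buf) return 1;
      /* ... fill and process buf with aligned vector ops ... */
      free(buf);
      return 0;
  }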
This is, ironically, a place where SIMD via ganged registers has an advantage over SIMD via large monolithic registers.
With ganged registers, one can load/store them piecewise as needed, and use unaligned loads/stores (while the larger forms can still actively require natural alignment).
Though, granted, large monolithic registers are a more popular option vs ganged registers.
And, with monolithic registers, you can make the registers larger without either effectively halving the number of longer registers, or needing to double the number of shorter registers.
But, this comes at the cost that the high-order bits of the registers are essentially wasted for code operating on narrower vectors.
Say, if one has:
64x 64-bit vectors (group of 1);
32x 128-bit vectors (group of 2);
16x 256-bit vectors (group of 4);
8x 512-bit vectors (group of 8).
If one wanted a 1024-bit vector, there is a choice to make (see the sketch below):
Live with only 4 such vectors;
Expand the size of the register file to 128x 64-bit registers;
Live with asymmetric wonk, where parts of the register space are only accessible at larger sizes.
...
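A throwaway C sketch of that arithmetic, just to make the halving explicit (my own toy illustration, not tied to any specific ISA):

  #include <stdio.h>

  int main(void) {
      /* assume a pool of 64x 64-bit base registers; a W-bit vector is
         a gang of W/64 adjacent registers, so every doubling of the
         width halves the number of distinct vectors */
      int base_regs = 64;
      for (int width = 64; width <= 1024; width *= 2) {
          int group = width / 64;
          int count = base_regs / group;
          printf("%4d-bit: group of %2d, %2d vectors\n",
                 width, group, count);
      }
      return 0;
  }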
Though, with monolithic registers, each doubling of the register size also effectively mandates either a whole new set of instructions to deal with the larger size, or some other way to encode or specify the size (or, "who knows, it is whatever it is, software can figure it out"...).
This is less true of ganged registers.
Say, if the CPU supported it, one could add:
PADDX4.F //256-bit Binary32 ADD
PSUBX4.F //256-bit Binary32 SUB
PMULX4.F //256-bit Binary32 MUL
...
While leaving everything else the same as before.
The addition of wider load/store operations would be optional:
Don't have 256-bit Ld/St? Use 128-bit Ld/St.
Need fully unaligned access? Use 64-bit Ld/St's.
...
And also making it easy for narrower implementations to simply crack the instructions into 128-bit vector operations internally (which may actually be implemented as two 64-bit vector ops running in parallel).
But, say, the pipeline could be designed internally around 64-bit vector ops, with a 4-wide machine able to do 256-bit vector operations mostly by supporting a 64-bit vector operation on each lane.
And, you can more easily "pretend" in the compiler to have whichever vector size you want. Code asks for 256-bit vectors but the target only has 128-bit? Just fake it using 128-bit ops.
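This is roughly the trick a compiler can pull on x86 as well; sketched here with SSE intrinsics (my own example, assuming a target with 128-bit vectors being asked to provide a 256-bit float add):

  #include <xmmintrin.h>

  /* "256-bit" float add faked as two 128-bit SSE adds; unaligned
     loads/stores so the caller need not care about alignment */
  static void add_8f(float *dst, const float *a, const float *b) {
      __m128 lo = _mm_add_ps(_mm_loadu_ps(a),     _mm_loadu_ps(b));
      __m128 hi = _mm_add_ps(_mm_loadu_ps(a + 4), _mm_loadu_ps(b + 4));
      _mm_storeu_ps(dst,     lo);
      _mm_storeu_ps(dst + 4, hi);
  }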
But, granted, most ISAs aren't doing SIMD this way.
...
Also in going from 32 to 64 bits, data structures that contain pointers
now could find those 8-byte pointers aligned on 4-byte boundaries.
This is mandated by the relevant ABI, and ABIs usually mandate
alignment on natural boundaries.
While the Linux kernel may not use many misaligned values,
I'd guess there is a lot of application code that does.
Unless it is generating external binary data (a _very_ bad idea,
XDR was developed for a reason), there is no big reason to use
unaligned data, unless somebody is playing fast and loose
with C pointer types, and that is a bad idea anyway.
Often needed for speed, though.
Alternatively, a compiler could use it to implement something like
memcpy or memmove when it knows that unaligned accesses are safe.
Basically required unless you want them to be slow.
The aligned-only versions will almost invariably be slower, potentially significantly slower.
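For reference, the usual portable idiom here (nothing exotic; just the pattern that compilers turn into a single unaligned load on targets where that is legal and cheap):

  #include <stdint.h>
  #include <string.h>

  /* read a 64-bit value from a possibly misaligned address without
     UB; compiles to one plain load where unaligned access is cheap,
     and to a byte-wise sequence on strict-alignment targets */
  static inline uint64_t load_u64(const void *p) {
      uint64_t v;
      memcpy(&v, p, sizeof v);
      return v;
  }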
But it would be really interesting to have access to a system
where unaligned accesses trap, in order to find (and fix) ABI
issues and some undefined behavior on the C side.
It may make sense to add some form of categorical separations:
Pointers that may be unaligned;
Pointers that must be aligned.
Trapping on unaligned being a reasonable option for the latter case.
Really needs to be per-pointer or per-access though, and not a global flag (which makes it kind of useless).
Some compilers have __aligned and __unaligned keywords.
Something like "[[aligned]]" and "[[unaligned]]" could also make sense, with the default likely depending on type and implementation...
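For what already exists: one can approximate the per-pointer idea in GCC/Clang today with a packed wrapper struct, something like the sketch below (the "[[aligned]]"/"[[unaligned]]" spellings above remain hypothetical):

  #include <stdint.h>

  /* the packed struct has alignment 1, so accesses through it are
     allowed to be misaligned and the compiler emits unaligned-safe
     code (or byte loads on strict-alignment targets) */
  typedef struct { uint32_t v; } __attribute__((packed)) u32_una;

  static inline uint32_t read_u32_una(const void *p) {
      return ((const u32_una *)p)->v;
  }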