Subject: Re: Cost of handling misaligned access
From: cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups: comp.arch
Date: 03. Feb 2025, 09:10:09
Organization: A noiseless patient Spider
Message-ID: <vnptl6$15pgm$1@dont-email.me>
References: 1 2 3
User-Agent: Mozilla Thunderbird
On 2/3/2025 12:55 AM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> On 2/2/2025 10:45 AM, EricP wrote:
>>> Digging deeper with performance counters reveals executing each unaligned
>>> load instruction results in ~505 executed instructions. P550 almost
>>> certainly doesn’t have hardware support for unaligned accesses.
>>> Rather, it’s likely raising a fault and letting an operating system
>>> handler emulate it in software."
>>
>> An emulation fault, or something similarly nasty...
>>
>> At that point, even turning any potentially unaligned load or store into
>> a runtime call is likely to be a lot cheaper.
>
> There are lots of potentially unaligned loads and stores. There are
> very few actually unaligned loads and stores: On Linux-Alpha every
> unaligned access is logged by default, and the number of
> unaligned-access entries in the logs of our machines was relatively
> small (on average a few per day). So trapping actual unaligned
> accesses was faster than replacing potential unaligned accesses with
> code sequences that synthesize the unaligned access from aligned
> accesses.

Don't treat every C pointer as potentially unaligned.
Rather, have something like an explicit "__unaligned" keyword or similar, and then use the runtime call only for those pointers.
But, yeah, this is assuming one can't just have hardware that handles unaligned pointers natively.
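
A minimal sketch of that idea, assuming plain standard C with no "__unaligned" support in the compiler: the pointer is marked as possibly misaligned by its type (the made-up `una_u32` wrapper), and every access goes through a small helper (`una_load_u32` / `una_store_u32`, also made-up names) that only touches single bytes. Little-endian byte order is assumed here purely for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for an "__unaligned uint32_t": a type whose
   alignment is 1, so a pointer to it may point anywhere. */
typedef struct { unsigned char b[4]; } una_u32;

/* Runtime helper: assemble the value from single-byte loads, which can
   never be misaligned. Little-endian layout is an assumption of this sketch. */
static uint32_t una_load_u32(const una_u32 *p)
{
    return (uint32_t)p->b[0]
         | ((uint32_t)p->b[1] << 8)
         | ((uint32_t)p->b[2] << 16)
         | ((uint32_t)p->b[3] << 24);
}

/* Matching store helper: split the value into single-byte stores. */
static void una_store_u32(una_u32 *p, uint32_t v)
{
    p->b[0] = (unsigned char)(v);
    p->b[1] = (unsigned char)(v >> 8);
    p->b[2] = (unsigned char)(v >> 16);
    p->b[3] = (unsigned char)(v >> 24);
}
```

Ordinary pointers keep using plain loads and stores, so only the code that opts in pays for the helper.
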

> Of course, if the cost of unaligned accesses is that high, you will
> avoid them in cases like block copies where cheap unaligned accesses
> would otherwise be beneficial.

Yeah.
Though "memcpy()" is usually a "simple to fix up" scenario.
A harder case is LZ decompression, where byte-for-byte copying is slow, but both the source and destination will typically be at pretty much arbitrary alignment for each copy operation (with the vast majority of LZ matches being only a small number of bytes).
Granted, on most traditional systems, LZ compression is used infrequently (IOW: not something someone is just throwing around all over the place in an attempt to make IO speeds faster).
But, apparently, the older mentality was more like "decompression is slow".
And not so much "our media devices are slow, but a sufficiently fast LZ compressor can make them faster" (and then throwing LZ at a whole bunch of IO related use cases...).
But, yeah, I predict that if one tries to run an LZ decoder that was written to assume unaligned pointers are fine on a CPU that does trap-and-emulate on misaligned accesses, it is going to be very slow.
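
To make that concrete, here is a rough sketch of the kind of match-copy loop such a decoder might use. It is not taken from any particular LZ implementation, and `lz_copy_match` and its parameters are made up for illustration. On targets that allow it, the 8-byte `memcpy` calls typically compile down to single unaligned loads and stores, which is exactly the access pattern that would fault on nearly every match on a trap-and-emulate CPU.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy an LZ match of `len` bytes from `dist` bytes back in the output.
   `dst` is the current write position; the caller guarantees dist >= 1
   and that `dist` bytes of history exist before `dst`.
   (Hypothetical helper, not from any real decoder.) */
static unsigned char *lz_copy_match(unsigned char *dst, size_t dist, size_t len)
{
    const unsigned char *src = dst - dist;

    /* Fast path: with dist >= 8 an 8-byte chunk never reads bytes that
       have not been written yet, so whole chunks can be moved at
       arbitrary alignment. */
    if (dist >= 8) {
        while (len >= 8) {
            uint64_t chunk;
            memcpy(&chunk, src, 8);   /* possibly misaligned load  */
            memcpy(dst, &chunk, 8);   /* possibly misaligned store */
            src += 8; dst += 8; len -= 8;
        }
    }

    /* Tail and short-distance matches: byte-for-byte, which also gives
       the pattern-repeating behaviour that overlapping matches rely on. */
    while (len--)
        *dst++ = *src++;

    return dst;
}
```

With most matches being only a few bytes long, the byte loop is always safe but slow, and the chunked path is fast only where misaligned accesses are cheap.
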

> - anton