Subject: Re: Cost of handling misaligned access
From: chris.m.thomasson.1 (at) *nospam* gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Date: 02 Feb 2025, 23:44:13
Organization: A noiseless patient Spider
Message-ID: <vnosfu$t4ra$1@dont-email.me>
References: 1 2
User-Agent: Mozilla Thunderbird
On 2/2/2025 10:51 AM, MitchAlsup1 wrote:
On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:
As you can see in the article below, the cost of NOT handling misaligned
accesses in hardware is quite high in cpu clocks.
>
To my eye, the incremental cost of adding hardware support for
misaligned accesses to the AGU and cache data path should be quite low.
The alignment shifter is basically the same: assuming a 64-byte cache
line, LD still has to shift any of the 64 bytes into position 0, and
reverse for ST.
A handful of gates to detect misalignedness and recognize the line and
page crossing misalignments.
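
The detection described above can be sketched in C as a software model. This is only an illustration, not anyone's actual hardware: the names (classify, access_kind) are made up here, and the 64-byte line and 4 kB page sizes are assumptions taken from the discussion.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 64u    /* assumed cache line size, as in the post */
#define PAGE_BYTES 4096u  /* assumed 4 kB page size */

/* Hypothetical classification of a memory access. */
enum access_kind {
    ALIGNED,             /* naturally aligned */
    MISALIGNED_IN_LINE,  /* misaligned, but inside one cache line */
    LINE_CROSSING,       /* straddles two lines in the same page */
    PAGE_CROSSING        /* straddles two pages (and so two lines) */
};

/* size is the access width in bytes: a power of two, at most one line. */
static enum access_kind classify(uint64_t addr, unsigned size)
{
    assert(size != 0 && (size & (size - 1)) == 0 && size <= LINE_BYTES);

    uint64_t last = addr + size - 1;  /* address of the final byte */

    if (addr / PAGE_BYTES != last / PAGE_BYTES)
        return PAGE_CROSSING;
    if (addr / LINE_BYTES != last / LINE_BYTES)
        return LINE_CROSSING;
    if (addr & (size - 1))            /* low address bits set => misaligned */
        return MISALIGNED_IN_LINE;
    return ALIGNED;
}
```

In hardware each of these tests is a comparison on a few low address bits, which is presumably what "a handful of gates" refers to.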
The alignment shifters are twice as big.
Now, while I accept these costs, I accept that others may not. I accept
these costs because of the performance issues when I don't.
The incremental cost is in a sequencer in the AGU for handling cache
line and possibly virtual page straddles, and a small byte shifter to
left shift the high order bytes. The AGU sequencer needs to know whether
the access straddles a page boundary. If not, it increments the 6-bit
line number within the 4 kB physical frame. If so, it increments the
virtual page number, does the TLB lookup again, and accesses the first
line of the new page.
(Slightly more if multiple page sizes are supported, but same idea.)
For a load, the AGU merges the low and high fragments and forwards the
result.
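
A rough C model of the sequencer step and the load merge just described. Everything here is a sketch under assumptions: next_line, merge_load and struct second_access are hypothetical names, the line and page sizes are the ones from the text, and the merge assumes little-endian byte order and an access of at most 8 bytes.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES     64u
#define PAGE_BYTES     4096u
#define LINES_PER_PAGE (PAGE_BYTES / LINE_BYTES)  /* 64, hence a 6-bit index */

/* Where the second half of a straddling access goes (hypothetical type). */
struct second_access {
    int      needs_tlb_lookup;  /* 1 if the access moved into a new page */
    uint64_t vpn;               /* virtual page number of the second line */
    unsigned line_in_page;      /* 6-bit line index within that page */
};

/* Sequencer step: given the virtual address of the first byte of a
 * line-straddling access, find the line holding the remaining bytes. */
static struct second_access next_line(uint64_t vaddr)
{
    struct second_access s;
    unsigned line = (unsigned)((vaddr % PAGE_BYTES) / LINE_BYTES);

    s.vpn = vaddr / PAGE_BYTES;
    if (line + 1 < LINES_PER_PAGE) {
        /* Same page: bump the line index, reuse the translation. */
        s.needs_tlb_lookup = 0;
        s.line_in_page = line + 1;
    } else {
        /* Page straddle: next virtual page, first line, new TLB lookup. */
        s.needs_tlb_lookup = 1;
        s.vpn += 1;
        s.line_in_page = 0;
    }
    return s;
}

/* Load merge: the low fragment comes from the tail of line `lo`, the high
 * fragment from the head of line `hi`. offset is the position of the first
 * byte within `lo`. */
static uint64_t merge_load(const uint8_t lo[LINE_BYTES],
                           const uint8_t hi[LINE_BYTES],
                           unsigned offset, unsigned size)
{
    assert(size >= 1 && size <= 8 && offset < LINE_BYTES);

    uint64_t value = 0;
    for (unsigned i = 0; i < size; i++) {
        unsigned pos = offset + i;
        uint8_t byte = (pos < LINE_BYTES) ? lo[pos] : hi[pos - LINE_BYTES];
        value |= (uint64_t)byte << (8 * i);  /* byte i lands in bits 8i..8i+7 */
    }
    return value;
}
```

The loop stands in for what would be two shifts and an OR in the data path. The page-straddle branch is the only one that has to go back to the TLB.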
>
I don't think there are line straddle consequences for coherence because
there are no ordering guarantees for misaligned accesses.
Generally stated as:: Misaligned accesses cannot be considered ATOMIC.
Try it on an x86/x64. Straddle an L2 cache line and use it with a LOCK'ed RMW. It should assert the bus lock.
The hardware cost appears trivial, especially within an OoO core.
So there doesn't appear to be any reason to not handle this.
Am I missing something?
>
https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-microarchitecture/
>
[about half way down]
>
"Before accessing cache, load addresses have to be checked against
older stores (and vice versa) to ensure proper ordering. If there is a
dependency, P550 can only do fast store forwarding if the load and store
addresses match exactly and both accesses are naturally aligned.
Any unaligned access, dependent or not, confuses P550 for hundreds of
cycles. Worse, the unaligned loads and stores don’t proceed in parallel.
An unaligned load takes 1062 cycles, an unaligned store takes
741 cycles, and the two together take over 1800 cycles.
>
This terrible unaligned access behavior is atypical even for low power
cores. Arm’s Cortex A75 only takes 15 cycles in the worst case of
dependent accesses that are both misaligned.
>
Digging deeper with performance counters reveals executing each
unaligned load instruction results in ~505 executed instructions. P550
almost certainly doesn’t have hardware support for unaligned accesses.
Rather, it’s likely raising a fault and letting an operating system
handler emulate it in software."
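
For a sense of what software emulation involves, here is a minimal C sketch of just the data movement: an unaligned 64-bit access rebuilt from byte accesses, which are always aligned. The function names are made up and little-endian order is assumed. This is not the actual P550 firmware or kernel handler, which would also have to take the trap, save registers, decode the faulting instruction, and write the result back before returning.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulate an unaligned little-endian 64-bit load with eight byte loads. */
static uint64_t emulate_load_u64(const uint8_t *p)
{
    assert(p != NULL);

    uint64_t v = 0;
    for (size_t i = 0; i < 8; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Emulate an unaligned little-endian 64-bit store with eight byte stores. */
static void emulate_store_u64(uint8_t *p, uint64_t v)
{
    assert(p != NULL);

    for (size_t i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}
```

The copy itself is only a few dozen instructions, so if the ~505 figure is right, most of it is likely trap entry, decode, and exit overhead rather than the byte shuffling.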