As you can see in the article below, the cost of NOT handling misaligned
accesses in hardware is quite high in CPU clocks.
To my eye, the incremental cost of adding hardware support for misaligned
accesses to the AGU and cache data path should be quite low. The alignment shifter
is basically the same: assuming a 64-byte cache line, LD still has to
shift any of the 64 bytes into position 0, and reverse for ST.
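
As a rough software model of that alignment shifter (a sketch only, not
anyone's actual data path): the function name line_extract, the
little-endian byte order and the 8-byte maximum access size are my
assumptions for illustration. The point is that the aligned and misaligned
cases go through the same shift, as long as the access stays within one line.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 64u  /* cache line size assumed in the post */

/* Hypothetical model of the load alignment shifter: pull `size` bytes
 * (1..8) starting at byte `offset` of a cache line and right-justify
 * them at byte position 0, little-endian. The access must not cross
 * the end of the line; a straddle needs a second line access. */
static uint64_t line_extract(const uint8_t line[LINE_BYTES],
                             unsigned offset, unsigned size)
{
    assert(size >= 1 && size <= 8);
    assert(offset + size <= LINE_BYTES);

    uint64_t value = 0;
    for (unsigned i = 0; i < size; i++)
        value |= (uint64_t)line[offset + i] << (8 * i);
    return value;
}
```

Note that nothing in the loop cares whether offset is a multiple of size,
which is the sense in which the shifter is "basically the same".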
The incremental cost is in a sequencer in the AGU for handling cache
line and possibly virtual page straddles, and a small byte shifter to
left shift the high order bytes. The AGU sequencer needs to know whether the
access straddles a page boundary. If not, it increments the 6-bit physical
line number within the 4 kB physical frame. If so, it increments the
virtual page number, does the TLB lookup again, and accesses the first line
of the new page.
(Slightly more if multiple page sizes are supported, but same idea.)
For a load, the AGU merges the low and high fragments and forwards the result.
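
The sequencer decision and the fragment merge can be sketched in C as
below. This is a minimal model under stated assumptions, not a description
of any real AGU: the names straddle_t, classify and merge_fragments are
hypothetical, and I assume 64-byte lines, a single 4 kB page size,
little-endian byte order and accesses of at most 8 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64u    /* assumed cache line size */
#define PAGE_BYTES 4096u  /* assumed single page size */

/* Hypothetical result of the AGU straddle check. */
typedef struct {
    bool     straddles_line; /* access spills into the next cache line */
    bool     straddles_page; /* that next line is in the next page */
    unsigned low_bytes;      /* bytes taken from the first line */
    unsigned high_bytes;     /* bytes taken from the second line */
    uint64_t next_line_va;   /* virtual address of the second line */
} straddle_t;

static straddle_t classify(uint64_t va, unsigned size)
{
    assert(size >= 1 && size <= 8);

    straddle_t s = {0};
    unsigned off = (unsigned)(va % LINE_BYTES);

    s.straddles_line = off + size > LINE_BYTES;
    s.low_bytes  = s.straddles_line ? LINE_BYTES - off : size;
    s.high_bytes = size - s.low_bytes;

    if (s.straddles_line) {
        s.next_line_va = (va - off) + LINE_BYTES;
        /* If the next line starts a page, the line number wrapped:
         * a new TLB lookup is needed instead of a simple increment. */
        s.straddles_page = (s.next_line_va % PAGE_BYTES) == 0;
    }
    return s;
}

/* Merge for a load: the high fragment is shifted left past the bytes
 * that came from the first line. */
static uint64_t merge_fragments(uint64_t low, uint64_t high,
                                unsigned low_bytes)
{
    if (low_bytes >= 8)
        return low;  /* no high fragment; also avoids a 64-bit shift */
    uint64_t mask = (1ull << (8 * low_bytes)) - 1;
    return (low & mask) | (high << (8 * low_bytes));
}
```

For example, a 4-byte load at 0x1FFD takes 3 bytes from the last line of
one page and 1 byte from the first line of the next, so it is the case
that needs the second TLB lookup.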
I don't think there are line straddle consequences for coherence because
there are no ordering guarantees for misaligned accesses.
The hardware cost appears trivial, especially within an OoO core. For an OoO core, any cost difference in the L1 cache here is likely to be negligible.
So there doesn't appear to be any reason to not handle this.
Am I missing something?
https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-microarchitecture/
[about half way down]
"Before accessing cache, load addresses have to be checked against
older stores (and vice versa) to ensure proper ordering. If there is a
dependency, P550 can only do fast store forwarding if the load and store
addresses match exactly and both accesses are naturally aligned.
Any unaligned access, dependent or not, confuses P550 for hundreds of
cycles. Worse, the unaligned loads and stores don’t proceed in parallel.
An unaligned load takes 1062 cycles, an unaligned store takes
741 cycles, and the two together take over 1800 cycles.
This terrible unaligned access behavior is atypical even for low power
cores. Arm’s Cortex A75 only takes 15 cycles in the worst case of
dependent accesses that are both misaligned.
Digging deeper with performance counters reveals executing each unaligned
load instruction results in ~505 executed instructions. P550 almost
certainly doesn’t have hardware support for unaligned accesses.
Rather, it’s likely raising a fault and letting an operating system
handler emulate it in software."
The messages displayed come from Usenet.