Subject : Re: Cost of handling misaligned access
From : mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Newsgroups : comp.arch
Date : 02. Feb 2025, 19:55:01
Organisation : Rocksolid Light
Message-ID : <1adf8aa36637c78f79fb711dca7a0572@www.novabbs.org>
References : 1
User-Agent : Rocksolid Light
On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:
> https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-microarchitecture/
>
> [about half way down]
>
> "Before accessing cache, load addresses have to be checked against
> older stores (and vice versa) to ensure proper ordering. If there is a
> dependency, P550 can only do fast store forwarding if the load and store
> addresses match exactly and both accesses are naturally aligned.
> Any unaligned access, dependent or not, confuses P550 for hundreds of
> cycles. Worse, the unaligned loads and stores don’t proceed in parallel.
> An unaligned load takes 1062 cycles, an unaligned store takes
> 741 cycles, and the two together take over 1800 cycles.
>
> This terrible unaligned access behavior is atypical even for low power
> cores. Arm’s Cortex A75 only takes 15 cycles in the worst case of
> dependent accesses that are both misaligned.
>
> Digging deeper with performance counters reveals executing each
> unaligned load instruction results in ~505 executed instructions.
> P550 almost certainly doesn’t have hardware support for unaligned
> accesses. Rather, it’s likely raising a fault and letting an operating
> system handler emulate it in software."
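
For readers unfamiliar with what "emulate it in software" involves, here
is a minimal sketch of the data-movement core of such a handler: the
misaligned load is rebuilt from single-byte loads, which are always
aligned. The function name load_unaligned_le and the flat bytes object
standing in for memory are hypothetical, chosen for illustration; this
is not the actual handler code, and it assumes little-endian byte order.

```python
def load_unaligned_le(memory: bytes, addr: int, size: int) -> int:
    """Assemble a little-endian value of `size` bytes from byte loads.

    Hypothetical helper: `memory` models RAM as a flat byte array.
    """
    value = 0
    for i in range(size):
        # Each single-byte access is naturally aligned by definition.
        value |= memory[addr + i] << (8 * i)
    return value


mem = bytes(range(16))               # byte at address n holds the value n
word = load_unaligned_le(mem, 3, 4)  # 4-byte load at misaligned address 3
print(hex(word))
```

A real trap handler would also have to save state, decode the faulting
instruction, and write the result back to the destination register,
which is plausibly where most of the ~505 instructions go.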
1800 cycles divided by 505 instructions is about 3.6 cycles per
instruction, or about 0.28 instructions per cycle, compared to an extra
cycle or two when HW does it all by itself.
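
The arithmetic can be checked with a few lines, using only the figures
quoted from the article. The variable names are mine. One caveat: the
article gives ~505 instructions per unaligned load, while 1800 cycles is
the load and store together, so the load-only ratio is included as an
alternative reading and not as a correction of the article.

```python
load_cycles = 1062            # unaligned load, from the article
store_cycles = 741            # unaligned store, from the article
both_cycles = 1800            # "over 1800", so treat as a lower bound
instructions_per_load = 505   # "~505 executed instructions" per load

# Ratio used in the post: combined cycles over per-load instructions.
cpi_both = both_cycles / instructions_per_load
ipc_both = instructions_per_load / both_cycles

# Alternative reading: load cycles over per-load instructions.
cpi_load = load_cycles / instructions_per_load

print(f"combined:  {cpi_both:.2f} CPI, {ipc_both:.2f} IPC")
print(f"load only: {cpi_load:.2f} CPI")
```

Either way the emulation path runs at a few cycles per instruction
across several hundred instructions.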