On 2/2/2025 10:45 AM, EricP wrote:
All of which vanish when the HW does misaligned accesses.
>
As you can see in the article below, the cost of NOT handling misaligned
accesses in hardware is quite high in CPU clocks.
>
To my eye, the incremental cost of adding hardware support for
misaligned accesses to the AGU and cache data path should be quite low.
The alignment shifter is basically the same: assuming a 64-byte cache
line, LD still has to shift any of the 64 bytes into position 0, and
the reverse for ST.
>
The incremental cost is in a sequencer in the AGU for handling cache
line and possibly virtual page straddles, and a small byte shifter to
left shift the high order bytes. The AGU sequencer needs to know whether
the access straddles a page boundary. If not, it increments the 6-bit
physical line number within the 4 kB physical frame. If so, it increments
the virtual page number, repeats the TLB lookup, and accesses the first
line of the new page.
(Slightly more if multiple page sizes are supported, but same idea.)
For a load, the AGU merges the low and high fragments and forwards the
result.
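>
As a rough software model of the decision the sequencer makes, here is a
minimal C sketch. It assumes 64-byte lines, 4 kB pages and an access of
1 to 64 bytes; the names classify_access and straddle_t are made up for
illustration and do not come from any real design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { LINE_BYTES = 64, PAGE_BYTES = 4096 };  /* assumed sizes */

typedef struct {
    bool crosses_line;   /* access spans two cache lines */
    bool crosses_page;   /* second line needs a fresh TLB lookup */
    unsigned lo_bytes;   /* bytes taken from the first line */
    unsigned hi_bytes;   /* bytes taken from the second line (0 if none) */
} straddle_t;

/* Hypothetical helper: classify one access by its first and last byte. */
static straddle_t classify_access(uint64_t vaddr, unsigned size)
{
    assert(size >= 1 && size <= LINE_BYTES);
    uint64_t last = vaddr + size - 1;
    straddle_t s;
    s.crosses_line = (vaddr / LINE_BYTES) != (last / LINE_BYTES);
    s.crosses_page = (vaddr / PAGE_BYTES) != (last / PAGE_BYTES);
    s.lo_bytes = s.crosses_line
        ? LINE_BYTES - (unsigned)(vaddr % LINE_BYTES)
        : size;
    s.hi_bytes = size - s.lo_bytes;
    return s;
}
```

An 8-byte access at 0x103C splits 4+4 across two lines in the same page,
while one at 0x1FFD splits 3+5 and also crosses into the next page.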
>
I don't think there are line straddle consequences for coherence because
there are no ordering guarantees for misaligned accesses.
>
IMO, the main costs of unaligned access in hardware:
  The cache may need two banks of cache lines;
    let's call them "even" and "odd";
    an access crossing a line boundary may need both an even and an
    odd line;
  Slightly more expensive extract and insert logic.
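>
The reason two banks are enough: an access no longer than one line can
only touch line N and line N+1, and those always differ in the low bit
of the line index. A minimal C sketch, assuming 64-byte lines (bank_of
and banks_touched are illustrative names only):

```c
#include <stdint.h>

enum { LINE_SHIFT = 6 };  /* assumed: 64-byte cache lines */

/* 0 = "even" bank, 1 = "odd" bank, chosen by the low bit of the line index. */
static unsigned bank_of(uint64_t addr)
{
    return (unsigned)((addr >> LINE_SHIFT) & 1);
}

/* Bit 0 set: even bank is accessed. Bit 1 set: odd bank is accessed. */
static unsigned banks_touched(uint64_t addr, unsigned size)
{
    return (1u << bank_of(addr)) | (1u << bank_of(addr + size - 1));
}
```

An access contained in one line sets a single bit; a line-crossing access
sets both, so the two halves can be read from different banks in parallel.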
>
The main costs of not having unaligned access in hardware:
  Code either faults or performs like dog crap;
  Some pieces of code need convoluted workarounds;
  Some algorithms have no choice other than to perform like crap.
>
>
Even if most of the code doesn't need unaligned access, the parts that
do need it really need it to perform well.
>
Well, at least excluding wonk in the ISA, say:
  A load/store pair that discards the low-order bits;
  An extract/insert instruction that operates on a register pair using
  the low-order bits of the pointer.
>
In effect, something vaguely akin (AFAIK) to what existed on the DEC
Alpha.
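>
A rough C model of that style of access, loosely following the Alpha
LDQ_U / EXTQL / EXTQH sequence as I understand it. Memory is modeled as a
little-endian byte array, and the helper names (ld_q_u, ext_lo, ext_hi,
load64_alpha_style) are invented for this sketch:

```c
#include <stddef.h>
#include <stdint.h>

/* Load 8 bytes, ignoring the low 3 address bits (little-endian model). */
static uint64_t ld_q_u(const uint8_t *mem, size_t addr)
{
    size_t base = addr & ~(size_t)7;
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | mem[base + i];
    return v;
}

/* Extract the low fragment: shift right by the byte offset in the pointer. */
static uint64_t ext_lo(uint64_t q, size_t addr)
{
    return q >> ((addr & 7) * 8);
}

/* Extract the high fragment: shift left by (64 - offset*8) mod 64. */
static uint64_t ext_hi(uint64_t q, size_t addr)
{
    return q << ((64 - (addr & 7) * 8) & 63);
}

uint64_t load64_alpha_style(const uint8_t *mem, size_t addr)
{
    uint64_t lo = ld_q_u(mem, addr);
    uint64_t hi = ld_q_u(mem, addr + 7);  /* same word when addr is aligned */
    return ext_lo(lo, addr) | ext_hi(hi, addr);
}
```

Using addr+7 rather than addr+8 for the second load means the aligned case
loads the same word twice and ORs it with itself, so no branch is needed.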
>
The hardware cost appears trivial, especially within an OoO core.
So there doesn't appear to be any reason to not handle this.
Am I missing something?
>
For an OoO core, any cost difference in the L1 cache here is likely to
be negligible.
>
>
For anything much bigger than a small microcontroller, I would assume
the core should be designed to handle unaligned access effectively.
>
https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-microarchitecture/
>
[about half way down]
>
"Before accessing cache, load addresses have to be checked against
older stores (and vice versa) to ensure proper ordering. If there is a
dependency, P550 can only do fast store forwarding if the load and store
addresses match exactly and both accesses are naturally aligned.
Any unaligned access, dependent or not, confuses P550 for hundreds of
cycles. Worse, the unaligned loads and stores don’t proceed in parallel.
An unaligned load takes 1062 cycles, an unaligned store takes
741 cycles, and the two together take over 1800 cycles.
>
This terrible unaligned access behavior is atypical even for low power
cores. Arm’s Cortex A75 only takes 15 cycles in the worst case of
dependent accesses that are both misaligned.
>
Digging deeper with performance counters reveals executing each
unaligned load instruction results in ~505 executed instructions. P550
almost certainly doesn’t have hardware support for unaligned accesses.
Rather, it’s likely raising a fault and letting an operating system
handler emulate it in software."
>
An emulation fault, or something similarly nasty...
>
>
At that point, even turning any potentially unaligned load or store into
a runtime call is likely to be a lot cheaper.
>
Say:
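__mem_ld_unaligned:
  ANDI X15, X10, 7         # X15 = byte offset within the 8-byte word
  BEQ  X15, X0, .aligned   # offset 0: use the plain aligned load
  SUB  X14, X10, X15       # X14 = address rounded down to 8 bytes
  LD   X12, 0(X14)         # low 64-bit word
  LD   X13, 8(X14)         # high 64-bit word
  SLLI X14, X15, 3         # X14 = offset in bits
  LI   X17, 64
  SUB  X16, X17, X14       # X16 = 64 - offset in bits
  SRL  X12, X12, X14       # shift the low fragment down
  SLL  X13, X13, X16       # shift the high fragment up
  OR   X10, X12, X13       # merge the two fragments
  RET
.aligned:
  LD   X10, 0(X10)
  RET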
>
The aligned case is split out because a shift by 64 simply returns its
input (since (64&63)==0): the high word would be ORed in unshifted and
the result would be wrong.
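>
The same routine as a C sketch, for checking the shift logic. It assumes
64-bit loads and a little-endian byte-array model of memory;
load_aligned64 and mem_ld_unaligned are names made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Aligned 64-bit little-endian load from a byte-array model of memory. */
static uint64_t load_aligned64(const uint8_t *mem, size_t addr)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | mem[addr + i];
    return v;
}

uint64_t mem_ld_unaligned(const uint8_t *mem, size_t addr)
{
    size_t off = addr & 7;
    if (off == 0)                        /* shifting by 64 would be wrong */
        return load_aligned64(mem, addr);

    size_t base = addr - off;            /* round down to 8 bytes */
    uint64_t lo = load_aligned64(mem, base);
    uint64_t hi = load_aligned64(mem, base + 8);
    unsigned sh = (unsigned)off * 8;     /* 8..56 bits */
    return (lo >> sh) | (hi << (64 - sh));
}
```

In C the special case is needed for a second reason as well: shifting a
64-bit value by 64 is undefined behavior, not just a masked shift.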
>
>
Though not supported by GCC or similar, dedicated __aligned and
__unaligned keywords could help here, to specify which pointers are
aligned (no function call), unaligned (needs function call) and default
(probably aligned).
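>
Lacking such keywords, one way to express the same split in portable C is
through the function used for the access. A minimal sketch, with
hypothetical helper names load_u64_aligned and load_u64_unaligned:

```c
#include <stdint.h>
#include <string.h>

/* Pointer is known to be 8-byte aligned: a plain load. */
static inline uint64_t load_u64_aligned(const uint64_t *p)
{
    return *p;
}

/* Pointer may have any alignment: memcpy is well defined here, and the
   compiler is free to lower it to whatever the target handles best. */
static inline uint64_t load_u64_unaligned(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

On targets with cheap hardware unaligned loads, compilers typically turn
the memcpy into a single load; on others it becomes byte loads or a call,
which is roughly the aligned / unaligned distinction described above.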
....