On Sat, 21 Dec 2024 23:22:35 +0000, Jonathan Thornburg wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
FORTRAN COMMON blocks require misaligned accesses to double precision
data.
R E Q U I R E in that it is neither optional nor wise to emulate with
exceptions. It is just barely tolerable using LD/ST Left/Right instructions out of the compiler.
>
I, personally, went through enough PAIN with misalignment that over
time my mood swung from "aligned only" to "completely misaligned"::
a) because there is no performant* SW workaround
b) it is SO easy to fix in HW.
c) once fixed in HW, any SW burden is so small as to be barely measurable.
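As a point of reference, the byte-at-a-time workaround that (a) alludes to looks something like the following C sketch (the function name is mine, not from the post); eight loads plus seven shift/OR pairs stand in for what misalignment-capable hardware does in one access:

```c
#include <stddef.h>
#include <stdint.h>

/* Software workaround for a misaligned 64-bit load on an aligned-only
 * ISA: assemble the value one byte at a time (little-endian order).
 * 8 loads + 7 shifts + 7 ORs, versus a single load in hardware. */
static uint64_t load_u64_unaligned(const void *p)
{
    const uint8_t *b = (const uint8_t *)p;
    uint64_t v = 0;
    for (size_t i = 0; i < 8; i++)
        v |= (uint64_t)b[i] << (8 * i);
    return v;
}
```

(Modern compilers often recognize this pattern, or an 8-byte `memcpy`, and emit a single load on targets where misaligned loads are legal; on aligned-only targets the byte soup remains.)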
>
I'm not so sure (b) is true. Some cases are moderately easy to handle
in hardware (e.g., misaligned loads that stay within a single L1 D-cache
line), but some cases are harder (e.g., misaligned writes that cross L1
D-cache line boundaries) and might need a microcode trap (awkward if the
design wasn't otherwise using microcode). And some cases are even harder
While there is no concept of Millicode or Microcode::
There are several sequencing components::
a) determining if the access is misaligned:: This takes 8 gates and
2 gates of delay from an adder already comprising 2,000 gates. The
misaligned assertion comes 5-6 gates BEFORE the higher-order 32 bits
come out of the adder:: I consider this part ignorable.
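In C terms, that misaligned assertion is just a mask of the low address bits (a sketch of mine; assumes power-of-two access sizes):

```c
#include <stdbool.h>
#include <stdint.h>

/* An access of power-of-two 'size' bytes is misaligned iff any of the
 * low log2(size) bits of the address are set.  Only the adder's low
 * bits are needed, which is why the assertion can be ready well before
 * the high-order bits of the address come out of the adder. */
static bool is_misaligned(uint64_t addr, unsigned size)
{
    return (addr & (uint64_t)(size - 1)) != 0;
}
```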
b) accessing the cache optimally in the presence of misaligned accesses.
b.1) if the access does not cross a cache port boundary, then all the
problems are confined to the alignment of the data.
b.2) if the access crosses a port boundary but not a line boundary,
access 2 successive ports, and allow Aligner to sort out the problem.
b.3) if the access crosses a line boundary but not a page boundary,
access 2 successive ports incrementing the line address of the second
port.
b.4) if the access crosses a page boundary, you are going to have to
access the cache twice, once for the first page, once for the second.
So, only page crossing REQUIRES 2 accesses; and 99% (made-up number)
of misaligned accesses are performed in a single cycle. {{Try that
with some kind of SW workaround}}
{{Oh, and BTW; this is a good place to check that the access rights
to both pages are compatible with the rights in both PTEs.}}
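The four sequencing cases can be sketched as a classification on the access's first and last byte addresses. The port width (16 B), line size (64 B), and page size (4 KiB) below are illustrative assumptions of mine, not numbers from the post:

```c
#include <stdint.h>

enum seq_path { SAME_PORT, PORT_CROSS, LINE_CROSS, PAGE_CROSS };

/* Classify an access of 'size' bytes at 'addr' into cases b.1-b.4.
 * Assumed geometry: 16-byte cache ports, 64-byte lines, 4 KiB pages. */
static enum seq_path classify(uint64_t addr, unsigned size)
{
    uint64_t last = addr + size - 1;      /* address of the final byte */
    if ((addr >> 12) != (last >> 12))     /* b.4: two cache accesses   */
        return PAGE_CROSS;
    if ((addr >> 6) != (last >> 6))       /* b.3: 2nd port at line+1   */
        return LINE_CROSS;
    if ((addr >> 4) != (last >> 4))       /* b.2: 2 ports + Aligner    */
        return PORT_CROSS;
    return SAME_PORT;                     /* b.1: data alignment only  */
}
```

Only the PAGE_CROSS case forces a second trip through the cache; the other three resolve in a single pass.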
So,
AGEN adder is 8 gates bigger out of 2,000 total gates
Cache port control logic is 2× as big out of 90 gates
Cache staging flip-flops in stage ALIGN is 2× as big
LD Aligner is bigger ~1.75×
Tag, TLB, DATA RAMs are exactly the same size, and are ~9× larger
than all the other cache pipeline logic area
{{And you add 25-odd gates in the Miss Buffers}}
x86 has been doing this for 3 decades. It is well-worn logic at this point.
It was at AMD where I saw how easy this was for HW to simply "make the
problem disappear" compared to all the ways SW uses to work around "not
being able to access misaligned data". Once you have done it once, you
have the logic and test vectors to ensure you don't shoot yourself in
the foot.
Any competent programmer will ALIGN his data to the extent possible;
there is no reason to penalize {Compiler, assembler, linker, ld.so,...}
just because you want to take 5 days out of design.
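And the SW side of keeping data aligned really is nearly free: in C, natural alignment is what the toolchain already provides, and over-alignment is a single keyword (a minimal C11 sketch; the names are mine):

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* The compiler pads 'rec' so that 'value' sits on its natural 8-byte
 * boundary; no programmer effort required. */
struct rec {
    char   tag;
    double value;
};

/* Asking for more than natural alignment (e.g. a cache-line boundary)
 * is one C11 keyword. */
static alignas(64) uint8_t buffer[256];
```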
So, My design:
Aligned data is always best, Misaligned data comes at very low cost.
SW overhead = 0
Your design:
Aligned data works just fine, Misaligned data is a complete nightmare
throughout the entire SW stack, and causes large uncertainty in result
delivery time. SW overhead = significant.
How many days of SW development are required to make up for the 5 days
of HW design that simply eradicate the problem?
You would not buy a car without anti-lock brakes--even though you will
only use the feature once or twice in your ownership of the vehicle!?!
Why would you buy a CPU that is not similar?
(e.g., misaligned writes crossing L1 D-cache line boundaries where the
two lines are owned by different CPUs in a cache-coherent
multiprocessor)
and might need a millicode trap. And some cases may require going all
the way up to the OS (e.g., misaligned writes that cross
virtual-memory-page boundaries where one page is ok but the other is
non-resident).
Millicode is so DEC ALPHA. Fixing the problem in HW does not require
anything but the 5 sequences I illustrated above--this amount of
sequencing is invisible in the cache pipeline as a whole.
So, allowing this in the architecture has several costs:
* extra hardware implementation effort to make sure the "hardware" cases
don't cost an extra gate delay or two on some critical path
AMD had done all of this by 1997. {don't know about when Intel had it
licked}
But, yes, if you have a balls-to-the-wall pipeline (R2000) adding a
gate of delay would degrade performance by ~5%. This has only been
shown to be an issue when the cache pipeline is 2 stages and one is
trying to get:: Forward->AGEN->RAMS->ALIGN->resultbus in 2 cycles.
MIPS had to use direct mapped caches to meet this timing, and had to
sample SRAM chips on its own test head to measure if the SRAMs had
pin timings appropriate to R2000 timings.
Once you have set-associativity or allow for 3 cycles {note current
Intel cores are using 5 cycles.} your argument fails.
While your argument might succeed in 2µ-through-90nm, wires have become
so slow that in many cases adding a gate of pure delay does not slow
anything down because the cache pipeline has been engineered O F F
the/any critical path. So, while RISC-V persists with the 2-cycle cache
pipeline, the big boys have migrated to longer pipelines and build
execution windows to absorb the added latencies.
* extra complexity and debugging time in hardware and in system software
(think about writing and *debugging* and *verifying* microcode/millicode
trap handlers for all those messy write-crossing-cache/page-boundary
cases, especially their interactions with multiprocessor cache
coherency)
There is N O M I L L I C O D E. There is a sequencer that can take
1 of 5 paths over the AGEN-CACHE-ALIGN stages of the pipeline. SW has
to do nothing to enable this, or to overcome poor/bad use of ISA.
* this extra effort means a longer design time and/or greater design
cost, and hence (so long as the state-of-the-art of competing systems
is still steadily improving with time) that means a net lower
price/performance relative to competing systems
So does IEEE 754 floating point!! It is significantly more logic
intensive than IBM or CRAY or Univac floating point. Yet, currently,
some larger cores contain 4-8 of these floating point units (8-16 if
you separate FMUL/FDIV from FADD/FSUB/FCMP).
And, because of the traps
There are N O T R A P S, no exceptions (other than expected), no
interrupt dependencies, no mispredict repair dependencies, no
coherence dependencies, ...
and their overheads (which will likely differ significantly across
different implementations of the same architecture, e.g., different
multiprocessor cache-coherency protocols), any code that actually
*uses* unaligned accesses -- especially unaligned writes -- isn't
performance-portable unless the actual dynamic frequency of unaligned
operations is very low.
UnTrue.
So yes, allowing unaligned access does help "dusty deck" Fortran code...
but it comes at a significant cost.
Less than 0.1% is a significant cost?!?!?