On 10/12/24 2:37 PM, MitchAlsup1 wrote:
> On Sat, 12 Oct 2024 18:17:18 +0000, Brett wrote:
> [snip]
>> Worst case the source and dest are in cache, and the count is 150 cycles
>> away in memory. So hundreds of chars could be copied until the value is
>> loaded and that count value could be say 5.
>
> The instruction cannot start until the count is known. You don't start
> an FMAC until all 3 operands are ready, either.
This is not _strictly_ true. Some ARM implementations start an
FMADD before the addend is available when it is known that it
will be available in time. This allows dependent accumulation
with a latency equal to the ADD part.
One might even be able to start the shift that aligns the addend
with the product early, since the shift amount depends only on the
exponents and is easy to calculate for normal FP values.
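
To make the dependent-accumulation shape concrete, here is a minimal
C sketch (names mine, not taken from any particular implementation);
the loop recurrence runs only through the addend, which is exactly
what the late-addend trick exploits:

  #include <math.h>
  #include <stddef.h>

  /* Dot product with a single accumulator: each fma() depends on the
     previous one only through 'acc' (the addend), not through the
     product operands.  On a core that starts the multiply before the
     addend arrives and injects the addend late, the recurrence
     latency of this loop approaches the ADD latency rather than the
     full FMA latency. */
  double dot(const double *a, const double *b, size_t n)
  {
      double acc = 0.0;
      for (size_t i = 0; i < n; i++)
          acc = fma(a[i], b[i], acc);  /* 'acc' is the late operand */
      return acc;
  }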
In many microarchitectures, an operation will be scheduled to
execute when an L1 cache hit would be expected to make an operand
available. I.e., the instruction "starts" before the operand is
actually available.
With branch prediction, a branch instruction is "started" before
the condition has been evaluated. Your statement implies that
My 66000 MM implementations will not do such prediction.
In the case of a memory copy, performing rollback of
misspeculation is potentially much easier than in the general case
of a loop with store operations.
Memory copy also facilitates deeper speculation. The source data
can be preserved in memory more readily than arbitrary sequences
of register contents. If both source and destination start points
are known, destination reads can be translated into source reads
within a speculation domain. (The source could also be prefetched
before the destination is known.)
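
A toy C model of that destination-to-source redirection (my own
naming, not a description of any real pipeline) might look like the
following; hardware would perform the equivalent range check per
speculative load:

  #include <stdint.h>
  #include <stddef.h>

  /* While a copy of 'len' bytes from 'src' to 'dst' is still
     speculative, a read of destination address 'addr' can be
     satisfied from the unmodified source image instead, since the
     two ranges will hold the same bytes once the copy commits. */
  static inline uintptr_t redirect_read(uintptr_t addr,
                                        uintptr_t dst, uintptr_t src,
                                        size_t len)
  {
      if (addr - dst < len)           /* addr is inside the dst range */
          return src + (addr - dst);  /* service it from the source   */
      return addr;                    /* otherwise, read as usual     */
  }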
It does seem that My 66000's MM does not completely eliminate the
potential for faster special-case software, even if every
implementation is perfect. Software might know that the tail of a
cache block that is not overwritten is dead data. This can avoid a
read for ownership of the last destination block: software could do
a cache block zero for the last block and then copy the data over
it. This special case might apply when appending to a buffer.
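
As a sketch of that special case, assume a dcbz-style primitive, here
called cache_block_zero(), that allocates and zeroes one destination
line without reading it; the helper name and the 64-byte line size
are assumptions for illustration, not features of My 66000:

  #include <string.h>
  #include <stdint.h>
  #include <stddef.h>

  #define LINE 64  /* assumed cache-block size */

  /* Hypothetical dcbz-like primitive: establish one destination
     block in the cache, zeroed, without a read for ownership. */
  extern void cache_block_zero(void *block);

  /* Append 'len' bytes at 'dst' when the bytes past dst+len in the
     final cache block are known to be dead data. */
  void append_no_rfo(void *dst, const void *src, size_t len)
  {
      uintptr_t end  = (uintptr_t)dst + len;
      uintptr_t last = end & ~(uintptr_t)(LINE - 1);  /* final block */

      /* Only safe if the copy itself covers the head of that block,
         so zeroing it clobbers nothing that must be preserved. */
      if ((end & (LINE - 1)) && (uintptr_t)dst <= last)
          cache_block_zero((void *)last);

      memcpy(dst, src, len);  /* copy the real data over the zeros */
  }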
I do not know that adding an MM instruction variant to handle that
special case would be worthwhile.
I am skeptical that all implementations of MM would be perfect,
i.e., perform at least as well as software that controlled the
hardware more specifically, had the ISA provided such control.
E.g., ISA support for byte-masked stores might not only allow
non-contiguous stores (such as updating more than one field in a
structure while leaving the intervening fields unchanged) but might
also perform better than a general MM if the source happened to be
replicated in a register.
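
AVX-512's byte-masked stores are one existing example of the sort of
ISA support meant here; the structure layout and values below are
invented purely for illustration:

  #include <immintrin.h>
  #include <stdint.h>

  struct rec {            /* 64-byte layout, made up for the example */
      uint8_t flags[8];   /* bytes  0..7  : set to 0xFF              */
      uint8_t keep[16];   /* bytes  8..23 : must stay untouched      */
      uint8_t tags[8];    /* bytes 24..31 : set to 0xFF              */
      uint8_t rest[32];   /* bytes 32..63 : must stay untouched      */
  };

  /* One byte-masked store writes both 'flags' and 'tags' from a
     value replicated in a register, leaving the intervening and
     trailing bytes unwritten.  Requires AVX-512BW. */
  void set_flags_and_tags(struct rec *r)
  {
      __m512i  fill = _mm512_set1_epi8((char)0xFF);  /* replicated source     */
      __mmask64 m   = 0x00000000FF0000FFULL;         /* bytes 0..7 and 24..31 */
      _mm512_mask_storeu_epi8(r, m, fill);
  }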
"Hard cases make bad law" may be generalized to special cases make
bad (general) interfaces. Clean interfaces that can be implemented
almost optimally have advantages over complicated interfaces that
can theoretically handle more cases optimally **if one uses the
proper (highly specific) incantation!!!**