On 4/9/24 8:28 PM, MitchAlsup1 wrote:
> BGB-Alt wrote:
> [snip]
>> Things like memcpy/memmove/memset/etc, are function calls in cases when not directly transformed into register load/store sequences.
>
> My 66000 does not convert them into LD-ST sequences, MM is a single instruction.

As noted, in my case, the whole thing of Ld/St sequences, and memcpy slides, mostly applies to constant cases.

I wonder if it would be useful to have an immediate count form of
memory move. Copying fixed-size structures would be able to use an
immediate. Aside from not having to load an immediate for such
cases, there might be microarchitectural benefits to using a
constant. Since fixed-sized copies would likely be limited to
smaller regions (with the possible exception of 8 MiB page copies)
and the overhead of loading a constant for large sizes would be
tiny, only providing a 16-bit immediate form might be reasonable.
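
To make the trade-off above concrete, here is a minimal sketch in C of how a compiler might choose between an immediate-count and a register-count memory move. The names `mm_form`, `select_mm_form`, and `mm_imm16_field` are hypothetical and not part of My 66000 or any other ISA definition; the only thing taken from the discussion is the idea of a 16-bit immediate count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical forms of a memory-move instruction; these names are
   illustrative and not taken from any real ISA definition. */
typedef enum {
    MM_FORM_IMM16, /* byte count encoded as a 16-bit immediate */
    MM_FORM_REG    /* byte count supplied in a register */
} mm_form;

/* Pick the form a compiler might emit for a copy of `count` bytes.
   `count_is_constant` says whether the size is known at compile time. */
mm_form select_mm_form(size_t count, bool count_is_constant)
{
    if (count_is_constant && count <= UINT16_MAX)
        return MM_FORM_IMM16;
    /* Variable or large counts fall back to the register form; for large
       copies the cost of loading the constant is small by comparison. */
    return MM_FORM_REG;
}

/* Produce the immediate field for the IMM16 form. Only valid when
   select_mm_form() returned MM_FORM_IMM16. */
uint16_t mm_imm16_field(size_t count)
{
    assert(count <= UINT16_MAX);
    return (uint16_t)count;
}
```

Under these assumptions a fixed-size structure copy such as `select_mm_form(sizeof(struct foo), true)` would take the immediate form, while a copy whose length is only known at run time would take the register form.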
>> Did end up with an intermediate "memcpy slide", which can handle medium size memcpy and memset style operations by branching into a slide.
>
> MMs and MSs that do not cross page boundaries are ATOMIC. The entire system
> sees only the before or only the after state and nothing in between.

All seems a bit complicated to me.

I still feel that this atomicity should somehow be included with
ESM just because they feel related, but the benefit seems likely
to be extremely small. How often would software want to copy
multiple regions atomically or combine region copying with
ordinary ESM atomicity?? There *might* be some use for an atomic
region copy and an updating of a separate data structure (moving a
structure and updating one or a very few pointers??). For
structures three cache lines in size where only one region
occupies four cache lines, ordinary ESM could be used.
My feeling based on "relatedness" is not a strong basis for such
an architectural design choice.
(Simple page masking would allow false conflicts when smaller
memory moves are used. If there is a separate pair of range
registers that is checked for coherence of memory moves, this
issue would only apply for multiple memory moves _and_ all eight
of the buffer entries could be used for smaller accesses.)
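
The difference between page masking and a range check can be sketched in C as below. This is only an illustration under stated assumptions: `mem_range`, `crosses_page`, `page_mask_conflict`, and `range_conflict` are hypothetical names, a range register is modelled as a base plus a length, and the page size is passed in as a parameter rather than taken from any particular machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A byte range [base, base + len). Hypothetical model of a range register. */
typedef struct {
    uint64_t base;
    uint64_t len;
} mem_range;

/* True if the range touches more than one page.
   page_size must be a nonzero power of two. */
bool crosses_page(mem_range r, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    if (r.len == 0)
        return false;
    uint64_t first = r.base & ~(page_size - 1);
    uint64_t last = (r.base + r.len - 1) & ~(page_size - 1);
    return first != last;
}

/* Page-masked check: any access to the page holding the move is treated
   as a conflict. Assumes the move itself does not cross a page. */
bool page_mask_conflict(mem_range move, uint64_t addr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    uint64_t mask = ~(page_size - 1);
    return (move.base & mask) == (addr & mask);
}

/* Range check: a conflict only if addr falls inside the moved bytes. */
bool range_conflict(mem_range move, uint64_t addr)
{
    return addr >= move.base && addr - move.base < move.len;
}
```

With a 64-byte move at 0x1000 and 4096-byte pages, an unrelated access at 0x1800 is a conflict for `page_mask_conflict` but not for `range_conflict`, which is the false conflict described above.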
[snip]

>> As noted, on a 32 GPR machine, most leaf functions can fit entirely in scratch registers.
>
> Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without getting totally screwed.

In my case, yeah, there are two semi-separate register spaces here:

I wonder how many instructions would have to have access to such a
set of "special registers" and if a larger number of extra
registers would be useful. (One of the issues — in my opinion —
with PowerPC's link register and count register was that they
could not be directly loaded from or stored to memory [or loaded
with a constant from the instruction stream]. For counted loops,
loading the count register from the instruction stream would
presumably have allowed early branch determination even for deep
pipelines and small loop counts.) SP, FP, GOT, and TLS hold
"stable values", which might facilitate some microarchitectural
optimizations compared to more frequently modified register names.
(I am intrigued by the possibility of small contexts for some multithreaded workloads, similar to how some GPUs allow variable context sizes.)