On 2/17/2025 11:07 PM, Robert Finch wrote:
On 2025-02-17 8:00 p.m., BGB wrote:
On 2/14/2025 3:52 PM, MitchAlsup1 wrote:
On Fri, 14 Feb 2025 21:14:11 +0000, BGB wrote:
On 2/13/2025 1:09 PM, Marcus wrote:
-------------
The problem arises when the programmer *deliberately* does unaligned
loads and stores in order to improve performance. Or rather, if the
programmer knows that the hardware supports unaligned loads and
stores,
he/she can use that to write faster code in some special cases.
Pretty much.
This is partly why I am in favor of potentially adding explicit
keywords
for some of these cases, or to reiterate:
__aligned:
  Inform compiler that a pointer is aligned.
  May use a faster version if appropriate,
    if a faster aligned-only variant of an instruction exists
    on an otherwise unaligned-safe target.
__unaligned:
  Inform compiler that an access is unaligned.
  May use a runtime call or similar if necessary
    on an aligned-only target.
  May do nothing on an unaligned-safe target.
None:
  Do whatever is the default.
  Presumably, assume aligned by default,
    unless the target is known unaligned-safe.
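A rough C sketch of how the proposed qualifiers might be used (the
spellings come from the list above; neither is standard C, so the
sketch stubs them out, and the lowerings in the comments are
illustrative, not any particular compiler's behavior):

  #include <stdint.h>

  /* The proposed qualifiers are not standard C; stub them out so
     the sketch compiles as-is. */
  #define __aligned
  #define __unaligned

  uint32_t sum_words(const __aligned uint32_t *p, int n)
  {
      /* Pointer promised aligned: the compiler may select a faster
         aligned-only load form, where the target has one. */
      uint32_t s = 0;
      for (int i = 0; i < n; i++)
          s += p[i];
      return s;
  }

  uint32_t read_word(const void *buf)
  {
      /* Access marked unaligned: on an aligned-only target this may
         lower to a byte-wise helper or runtime call; on an
         unaligned-safe target it may compile to a plain load. */
      return *(const __unaligned uint32_t *)buf;
  }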
It would take LESS total man-power world-wide and over-time to
simply make HW perform misaligned accesses.
I think the usual issue is that on low-end hardware, it is seen as
"better" to skip out on misaligned access in order to save some cost
in the L1 cache.

I always include support for unaligned accesses even with a 'low-end'
CPU. I think it is not that expensive and sure makes some things a lot
easier when handled in hardware. For Q+ it just runs two bus cycles if
the data spans a cache line and pastes results together as needed.
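A minimal C model of that two-cycle load (little-endian assumed; the
names are mine, and a real core splits at cache-line rather than
8-byte boundaries):

  #include <stdint.h>
  #include <string.h>

  /* One aligned "bus cycle": fetch the aligned 8 bytes holding addr. */
  static uint64_t fetch_aligned64(const uint8_t *mem, uint64_t addr)
  {
      uint64_t v;
      memcpy(&v, mem + (addr & ~7ULL), 8);
      return v;
  }

  uint64_t fetch_misaligned64(const uint8_t *mem, uint64_t addr)
  {
      unsigned shift = (unsigned)(addr & 7) * 8;
      uint64_t lo = fetch_aligned64(mem, addr);        /* first cycle  */
      if (shift == 0)
          return lo;                                   /* aligned case */
      uint64_t hi = fetch_aligned64(mem, addr + 8);    /* second cycle */
      return (lo >> shift) | (hi << (64 - shift));     /* paste halves */
  }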
I had gone aligned-only with some 32-bit cores in the past.
The whole CPU core fit into fewer LUTs than I currently spend on just
the L1 D$...
Granted, some of these used a very minimal L1 cache design:
Only holds a single cache line.
The smallest cores I managed to build used a simplified SH-based design:
Fixed-length 16 bit instructions, with 16 registers;
Only (Reg) and (Reg, R0) addressing;
Aligned only;
No shift or multiply;
Where, say:
SH-4 -> BJX1-32 (Added features)
SH-4 -> B32V (Stripped down)
BJX1-32 -> BJX1-64A (64-bit, Modal Encoding)
B32V -> B64V (64-bit, Encoding Space Reorganizations)
B64V ~> BJX1-64C (No longer Modal)
Where, BJX1-64C was the end of this project (before I effectively did a
soft-reboot).
Then transition phase:
B64V -> BtSR1 (Dropped to 32-bit, More Encoding Changes)
Significant reorganization.
Was trying to get optimize for code density closer to MSP430.
BtSR1 -> BJX2 (Back to 64-bit, re-adding features from BJX1-64C)
A few features added for BtSR1 were dropped again in BJX2.
The original form of BJX2 was still a primarily 16-bit ISA encoding, but
at this point pretty much mutated beyond recognition (and relatively few
instructions were still in the same places that they were in SH-4).
For example (original 16-bit space):
0zzz:
SH-4: Ld/St (Rm,R0); also 0R and 1R spaces, etc.
BJX2: Ld/St Only (Rm) and (Rm,R0)
1zzz:
SH-4: Store (Rn, Disp4)
BJX2: 2R ALU ops
2zzz:
SH-4: Store (@Rn, @-Rn), ALU ops
BJX2: Branch Ops (Disp8), etc
3zzz:
SH-4: ALU ops
BJX2: 0R and 1R ops
4zzz:
SH-4: 1R ops
BJX2: Ld/St (SP, Disp4); MOV-CR, LEA
5zzz:
SH-4: Load (Rm, Disp4)
BJX2: Load (Unsigned), ALU ops
6zzz:
SH-4: Load (@Rm+ and @Rm), ALU
BJX2: FPU ops, CMP-Imm4
7zzz:
SH-4: ADD Imm8, Rn
BJX2: (XGPR 32-bit Escape Block)
8zzz:
SH-4: Branch (Disp8)
BJX2: Ld/St (Rm, Disp3)
9zzz:
SH-4: Load (PC-Rel)
BJX2: (XGPR 32-bit Escape Block)
Azzz:
SH-4: BRA Disp12
BJX2: MOV Imm12u, R0
Bzzz:
SH-4: BSR Disp12
BJX2: MOV Imm12n, R0
Czzz:
SH-4: Some Imm8 ops
BJX2: ADD Imm8, Rn
Dzzz:
SH-4: Load (PC-Rel)
BJX2: MOV Imm8, Rn
Ezzz:
SH-4: MOV Imm8, Rn
BJX2: (32-bit Escape, Predicated Ops)
Fzzz:
SH-4: FPU Ops
BJX2: (32-bit Escape, Unconditional Ops)
For the 16-bit ops, SH-4 had more addressing modes than BJX2:
SH-4: @Reg, @Rm+, @-Rn, @(Reg,R0), @(Reg,Disp4) @(PC,Disp8)
BJX2: (Rm), (Rm,R0), (Rm,Disp3), (SP,Disp4)
Although it may seem like it, I didn't just completely start over on the
layout, but rather it was sort of an "ant-hill reorganization".
Say, for example:
1zzz and 5zzz were merged into 8zzz, reducing Disp by 1 bit
2zzz and 3zzz were partly folded into 0zzz and 1zzz
8zzz's contents were moved to 2zzz
4zzz and part of 0zzz were merged into 3zzz
...
A few CR's are still in the same places and SR still has a similar
layout I guess, ...
Early on, there was the idea that the 32-bit ops were prefix-modified
versions of the 16-bit ops, but early on this symmetry broke and the 16
and 32-bit encoding spaces became independent of each other.
Though, the 32-bit F0 space still has some amount of similarity to the
16-bit space.
Later on I did some testing and performance comparisons, and realized
that using 32-bit encodings primarily (or exclusively) gave
significantly better performance than relying primarily or exclusively
on 16-bit ops. And at this point the ISA transitioned from a primarily
16-bit ISA (with 32-bit extension ops) to a primarily 32-bit ISA with a
16-bit encoding space. This transition didn't directly affect encodings,
but did affect how the ISA developed from then on (more so,
there was no longer an idea that the 16-bit ISA would need to be able to
exist standalone; but now the 32-bit ISA did need to be able to exist
standalone).
But, now newer forms of BJX2 (XG2 and XG3) have become almost
unrecognizable from early BJX2 (as an ISA still primarily built around
16-bit encodings).
Except that XG2's instruction layout still carries vestiges of its
origins as a prefix encoding. But, XG3 even makes this part disappear
(by reorganizing the bits to more closely resemble RISC-V's layout).
Well, and there is:
ZnmX -> ZXnm
But:
F0nm_ZeoX
I prefer my strategy instead:
  FADD/FSUB/FMUL:
    Hard-wired Round-Nearest / RNE.
    Does not modify FPU flags.
  FADDG/FSUBG/FMULG:
    Dynamic Rounding;
    May modify FPU flags.
Can note that RISC-V burns 3 bits for FPU instructions always encoding
a rounding mode (whereas in my ISA, encoding a rounding mode other
than RNE or DYN requires a 64-bit encoding).

Q+ encodes rounding mode the same way as RISC-V, as there are lots of
bits available in the instruction. Burning bits on the rounding mode
seems reasonable to me when bits are available.
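For reference, the 3-bit rm field that RISC-V reserves in each FP
instruction encodes (values per the RISC-V spec):

  enum rv_rounding_mode {
      RM_RNE = 0,  /* round to nearest, ties to even          */
      RM_RTZ = 1,  /* round toward zero                       */
      RM_RDN = 2,  /* round down, toward -infinity            */
      RM_RUP = 3,  /* round up, toward +infinity              */
      RM_RMM = 4,  /* round to nearest, ties to max magnitude */
      /* 5 and 6 are reserved */
      RM_DYN = 7   /* use the dynamic mode held in fcsr       */
  };

The FADD/FADDG split above covers the two common cases (RNE and DYN)
in the opcode itself, at the cost of a longer encoding for the rarer
static modes.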
Initially:
3 bits of entropy were eaten by the 16-bit space;
2 more bits were eaten by predication and WEX.
So, the initial ISA design for 32-bit ops had five fewer bits than in
RISC-V land.
XG2 reclaimed the 16-bit space, but used the bits to expand all the
register fields to 6 bits.
Not many bits left to justify burning on a rounding mode.
And, my Imm/Disp fields were generally 3 bits smaller than RV's.
Modified the PRED modifier in Q+ to take a predicate bit from one of
three registers. Previously, an array of two-bit mask values encoded
in the instruction indicated to 1) ignore the predicate bit, 2)
execute if the predicate is true, or 3) execute if the predicate is
false.
Since there were three reg specs available in the PRED modifier, it
seemed to make more sense to specify three regs instead of one. So now
it works: 1) as before, 2) execute if the bit in Ra is set, 3) execute
if the bit in Rb is set, or 4) execute if the bit in Rc is set.
The same register may be specified for Ra, Rb, and Rc. Since sign
inversion is available, the original operation may be mimicked by
specifying Ra, ~Ra.
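A hedged C model of how I read those semantics (the mode/invert
fields and names are guesses for illustration, not the actual Q+
encoding):

  #include <stdint.h>

  /* mode: 0 = ignore the predicate and always execute (as before);
           1..3 = take the predicate bit from Ra, Rb, or Rc. */
  int pred_executes(int mode, int invert,
                    uint64_t ra, uint64_t rb, uint64_t rc,
                    unsigned bit)
  {
      uint64_t src;
      switch (mode) {
      case 0:  return 1;
      case 1:  src = ra; break;
      case 2:  src = rb; break;
      default: src = rc; break;
      }
      int p = (int)((src >> bit) & 1);
      return invert ? !p : p;  /* inversion gives execute-if-clear */
  }

Specifying the same register with and without inversion (Ra, ~Ra)
then recovers the original true/false pair from a single predicate
register.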
In BJX2, all 32-bit instructions encode predication in a 2-bit field.
In XG3, the space that would have otherwise encoded WEX was instead left
to RISC-V (to create a conglomerate ISA).
But, there is also the possibility to use XG3 by itself without any
RISC-V parts in the mix.