On 1/30/2025 5:48 PM, MitchAlsup1 wrote:
> On Thu, 30 Jan 2025 20:00:22 +0000, BGB wrote:
So, recent features added to my core ISA: None.
Reason: Not a whole lot that brings much benefit.
>
>
Have ended up recently more working on the RISC-V side of things,
because there are still gains to be made there (stuff is still more
buggy, less complete, and slower than XG2).
>
>
On the RISC-V side, did experiment with Branch-compare-Immediate
instructions, but unclear if I will carry them over:
Adds a non-zero cost to the decoder;
Cost primarily associated with dealing with a second immed.
Effect on performance is very small (< 1%).
I find this a little odd--My 66000 has a lot of CMP #immed-BC:
a) so I am sensitive, as this is break-even wrt RISC-V;
b) but perhaps the small gain is due to something about how the
   pair runs down the pipe as opposed to how the single runs
   down the pipe.
>
Issue I had seen is mostly, "How often does it come up?":
Seemingly, around 100-150 or so instructions between each occurrence on
average (excluding cases where the constant is zero; comparing with zero
being more common).
>
What does it save:
Typically 1 cycle that might otherwise be spent loading the value into a
register (if this instruction doesn't end up getting run in parallel
with another prior instruction).
>
>
In the BGBCC output, it comes up primarily in "for()" loops (followed
by the occasional if-statement), so one might expect this to increase
its probability of having more of an effect.
>
But, seemingly, not enough tight "for()" loops and similar in use for it
to have a more significant effect.
>
So, in the great "if()" ranking:
if(x COND 0) ... //first place
if(x COND y) ... //second place
if(x COND imm) ... //third place
>
However, a construct like:
for(i=0; i<10; i++)
{ ... }
Will emit two of them, so they are not *that* rare either.
Still, a lot rarer in use than:
  val=ptr[idx];
Though...
>
Have noted, though, that simple constant for-loops are a minority;
far more often they are something like:
for(i=0; i<n; i++)
{ ... }
Which doesn't use any.
>
Or:
while(i--)
{ ... }
Which uses a compare with zero (in RV, can be encoded with the zero
register; in BJX2 it has its own dedicated instruction due to the lack
of a zero register; some of these were formally dropped in XG3, which
does have access to a zero register, where encoding the op using ZR
instead is considered preferable).
Huawei had a "less bad" encoding, but they burnt basically the entire
User-1 block on it, so that isn't going to fly.
>
Generally, around 95% of the function-local branches can hit in a Disp9,
vs 98% for Disp12. So, better to drop to Disp9.
Quit listening to them, use your own judgement.
>
I suggest a psychiatrist.
>
People are pointing to charts gathered by mining binaries and being
like: "X10 and X11 are the two most commonly used registers".
>
But, this is like pointing at x86 and being like:
"EAX and ECX are the top two registers, who needs such obscure registers
as ESI and EDI"?...
>
When I defined my own version of BccI (with a 64-bit encoding), how many
new instructions did I need to define in the 32-bit base ISA: Zero.
>
I consider a multi-word instruction to have an instruction-specifier.
How many 64-bit encodings did My 66000 need: zero.
{Hint: the words following the instruction specifier have no internal
format}
>
I consider the combination of Jumbo-Prefix and Suffix instruction to be
a 64-bit instruction.
However, have noted that XG3 does appear to be faster than the original
Baseline/XG1 ISA.
>
>
Where, to recap:
XG1 (Baseline):
16/32/64/96 bit encodings;
16-bit ops can access R0..R15 with 4b registers;
Only 2R or 2RI forms for 16-bit ops;
16-bit ISA still fairly similar to SuperH.
5-bit register fields by default;
6-bit available for an ISA subset.
Disp9u and Imm9u/n for most immediate form instructions;
32 or 64 GPRs, Default 32.
8 argument registers.
XG2:
32/64/96 bit encodings;
All 16-bit encodings dropped.
6-bit register fields (via a wonky encoding);
Same basic instruction format as XG1,
But, 3 new bits stored inverted in the HOB of instr words;
Mostly Disp10s and Imm10u/n;
64 GPRs native;
16 argument registers.
XG3:
Basically repacked XG2;
Can exist in same encoding space as RISC-V ops;
Aims for ease of compatibility with RV64G.
Encoding was made "aesthetically nicer"
All the register bits are contiguous and non-inverted;
Most immediate fields are also once again contiguous;
...
Partly reworks branch instructions;
Scale=4, usually relative to BasePC (like RV);
Uses RV's register numbering space (and ABI);
Eg: SP at R2 vs R15, ...
(Partly carried over from XG2RV, which is now defunct).
64 GPRs, but fudged into RV ABI rules;
Can't rebalance the ABI without breaking RV compatibility;
Breaking RV compatibility would defeat its reason for existing.
8 argument registers (because of the RV ABI).
Could in theory expand to 16, but this would cause issues.
Despite being based on XG2,
BGBCC treats XG3 as an extension to RISC-V.
>
>
Then, RV:
16/32; 48/64/96 (Ext)
Has 16-bit ops:
Which are horribly dog-chewed,
and only manage a handful of instructions.
Many of the ops can only access X8..X15;
With GCC, enabling RVC saves around 20% off the ".text" size.
Imm12s and Disp12s for most ops;
Lots of dog-chew in the encodings (particularly the Disp fields);
JAL is basically confetti.
...
In its basic form, RV is the worst-performing option here, but people
actually care about RISC-V, so supporting it is value-added.
Seems like a mostly pointless uphill battle trying to convince anyone of
things that (at least to me) seem kinda obvious.
>
Do not waste your time teaching pigs to put on lipstick. ...
>
Theoretically, people who are working on trying to improve performance
should also see the obvious things, namely that the primary issues
negatively affecting performance are:
The lack of Register-Indexed Load/Store;
Cases where immediate and displacement fields are not big enough;
Lack of Load/Store Pair.
>
If you can fix a few 10%+ issues, this will save a whole lot more than
focusing on 1% issues.
>
Better to go to the 1% issues *after* addressing the 10% issues.
>
>
If 20-30% of the active memory accesses are for arrays, and one needs
to do SLLI+ADD+Ld/St, this sucks.
>
If your Imm12 fails, and you need to do:
LUI+ADDI+Op
This also sucks.
>
If your Disp12 fails, and you do LUI+ADD+Ld/St, likewise.
>
They can argue, but with Zba, we can do:
SHnADD+Ld/St
But, this is still worse than a single Ld/St.
If these issues are addressed, there is around a 30% speedup, even with
a worse compiler.