BGB wrote:
> Pretty much, this is the problem.
>
> On 4/20/2024 5:03 PM, MitchAlsup1 wrote:
>> Compilers are notoriously unable to outguess a good branch predictor.
>
> Errm, assuming the compiler is capable of things like general-case
> inlining and loop-unrolling. I was thinking of simpler things, like
> shuffling operators between independent (sub)expressions to limit
> the number of register-register dependencies.
>
> Like, in-order superscalar isn't going to do crap if nearly every
> instruction depends on every preceding instruction. Even pipelining
> can't help much with this.

Pipelining CREATED this (back-to-back dependencies). No amount of
pipelining can eradicate RAW data dependencies.
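To make the dependency-chain point concrete, here is a minimal C sketch
(illustrative only, not from either poster's toolchain). The first loop
is one serial RAW chain; the second shuffles the work into four
independent accumulators that a 2- or 3-wide in-order machine can
overlap:

    #include <stddef.h>

    /* Serial form: each add depends on the previous one (back-to-back RAW). */
    double sum_chain(const double *v, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += v[i];                  // s depends on s: no ILP to find
        return s;
    }

    /* Shuffled form: four independent chains, merged at the end. */
    double sum_split(const double *v, size_t n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += v[i];                 // these four adds are mutually
            s1 += v[i + 1];             // independent, so superscalar
            s2 += v[i + 2];             // issue (or a deep pipeline)
            s3 += v[i + 3];             // can overlap them
        }
        for (; i < n; i++)
            s0 += v[i];                 // leftover elements
        return (s0 + s1) + (s2 + s3);
    }

(The split form reassociates floating-point addition, so a compiler
will only do this on its own for integer data or under something like
-ffast-math.)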
> Possibly true.
>
> The compiler can shuffle the instructions into an order that limits
> the number of register dependencies and better fits the pipeline.
> But, then, most of the "hard parts" are already done (so it doesn't
> take much more for the compiler to flag which instructions can run
> in parallel).

Compiler scheduling works for exactly 1 pipeline implementation and
is suboptimal for all others.

> But... it also makes the CPU too big and expensive to fit into most
> consumer/hobbyist grade FPGAs.
>
> Meanwhile, a naive superscalar may miss cases that could be run in
> parallel if it evaluates the rules "coarsely" (say, judging what is
> or is not safe to run in parallel from general groupings of opcodes
> rather than the rules of specific opcodes; or, say, a false-positive
> register alias when part of the Imm field of a 3RI instruction is
> interpreted as a register ID, ...).
>
> Granted, seemingly even a naive approach is able to get around 20%
> ILP out of "GCC -O3" output for RV64G... But the GCC output doesn't
> seem to be quite as weak as some people are claiming, either.
>
> (And explicit compile-time scheduling ties the code to a specific
> pipeline structure, and becomes effectively moot with OoO CPU
> designs.)
OoO exists, in a practical sense, to abstract the pipeline out of the
compiler; or conversely, to allow multiple implementations to run the
same compiled code optimally on each implementation.

> Granted, but OoO isn't cheap.

But it does get the job done.

> I aimed for Scalar and LIW. So, a case could be made that a "general
> use" ISA be designed without the use of explicit bundling. In my
> case, using the bundle flags also requires the code to use an
> instruction to signal to the CPU what configuration of pipeline it
> expects to run on, with the CPU able to fall back to scalar (or
> superscalar) execution if it does not match.

Sounds like a bridge too far for your 8-wide GBOoO machine.

> For the sake of possible fancier OoO stuff, I upheld a basic
> requirement for the instruction stream: the semantics of the
> instructions as executed in bundled order need to be equivalent to
> the semantics of the instructions as executed in sequential order.
> In this case, an OoO CPU can entirely ignore the bundle hints and
> treat "WEXMD" as effectively a NOP. (A checker for this rule is
> sketched below, after the lane list.)
>
> This would have broken down for WEX-5W and WEX-6W (where enforcing a
> parallel==sequential constraint effectively becomes unworkable,
> and/or renders the wider pipeline effectively moot), but these
> designs are likely dead anyways. And, with 3-wide, the
> parallel==sequential order constraint remains in effect.
>
> For the most part, thus far nearly everything has ended up as
> "Mode 2", namely:
>   3 lanes:
>   Lane 1 does everything;
>   Lane 2 does Basic ALU ops, Shift, Convert (CONV), ...;
>   Lane 3 only does Basic ALU ops and a few CONV ops and similar
>     (Lane 3 originally also did Shift, dropped to reduce cost);
>   Mem ops may eat Lane 3, ...
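A minimal C sketch of the parallel==sequential check described above
(the DecodedOp fields are invented for illustration, not BJX2's actual
decoder state). A bundle is equivalent to sequential execution when no
op reads, or rewrites, a register written by an earlier op in the same
bundle:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t reads;    // bitmask of source registers
        uint64_t writes;   // bitmask of destination registers
    } DecodedOp;

    bool bundle_is_sequential_equivalent(const DecodedOp *ops, int n) {
        uint64_t written = 0;
        for (int i = 0; i < n; i++) {
            if (ops[i].reads & written)    // RAW inside the bundle:
                return false;              //   parallel would read stale data
            if (ops[i].writes & written)   // WAW inside the bundle:
                return false;              //   final value would be ambiguous
            written |= ops[i].writes;
        }
        return true;                       // WAR is fine: parallel reads see
    }                                      // the old value, as sequence does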
Try 6 lanes:
  1,2,3: Memory ops + integer ADD and Shifts
  4:     FADD ops + integer ADD and FMisc
  5:     FMAC ops + integer ADD
  6:     CMP-BR ops + integer ADD
> As can be noted, my thing is more a "LIW" rather than a "true VLIW".

Mine is neither LIW nor VLIW, but it definitely is LBIO through GBOoO.
> Possibly true.
>
> So, MEM/BRA/CMP/... all end up in Lane 1, with Lanes 2/3 effectively
> used as fold-over for most of the ALU ops, turning Lane 1 mostly
> into a wall of Load and Store instructions. Where, say:
>
>   Mode 0 (Default):
>     Only scalar code is allowed; the CPU may use superscalar (if
>     available).
>   Mode 1:
>     2 lanes:
>     Lane 1 does everything;
>     Lane 2 does ALU, Shift, and CONV;
>     Mem ops take up both lanes (effectively scalar for Load/Store);
>     Later defined such that 128-bit MOV.X is allowed on a Mode 1
>     core.

Modeless.
> Had defined wider modes, and ones that allow dual-lane I/O and FPU
> instructions, but these haven't seen use (too expensive to support
> in hardware).
>
> Had ended up with an ambiguous "extension" to the Mode 2 rules:
> allowing an FPU instruction to be executed from Lane 2 if there is
> not an FPU instruction in Lane 1, or allowing certain FPU
> instructions to co-issue if they effectively combine into a
> corresponding SIMD op.
>
> In my current configurations, there is only a single memory access
> port.

This should imply that your 3-wide pipeline is running at 90%-95%
memory/cache saturation.
> If you mean that execution is mostly running end-to-end memory
> operations, yeah, this is basically true.
>
> Comparably, RV code seems to end up running a lot of non-memory ops
> in Lane 1, whereas BJX2 mostly runs lots of memory ops, with Lane 2
> handling most of the ALU ops and similar (and Lane 3, occasionally).

One of the things that I notice with My 66000 is that when you get all
the constants you ever need at the calculation OpCodes, you end up with
FEWER instructions that "go random places", such as instructions that
<well> paste constants together. This leaves you with a data-dependent
string of calculations with occasional memory references. That is::
universal constants get rid of the easy-to-pipeline extra instructions,
leaving the meat of the algorithm exposed.
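To illustrate what "pasting constants together" looks like, here is a
small C sketch (a hypothetical 16-bit-immediate ISA is assumed, in the
spirit of MOVZ/MOVK or LUI/ADDI sequences; this is not My 66000 code):

    #include <stdint.h>

    /* On an ISA with only small immediates, a 64-bit constant is built
       from pieces -- four easy-to-pipeline filler instructions that do
       not belong to the algorithm itself. */
    uint64_t materialize_constant(void) {
        uint64_t k;
        k  = (uint64_t)0x1234 << 48;   // piece 1  (e.g. MOVZ)
        k |= (uint64_t)0x5678 << 32;   // piece 2  (e.g. MOVK)
        k |= (uint64_t)0x9ABC << 16;   // piece 3
        k |= (uint64_t)0xDEF0;         // piece 4
        return k;
    }

With a universal-constant encoding, the four "pieces" vanish: the full
64-bit value rides along inside the instruction that consumes it, and
only the data-dependent computation remains.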
> I have an optional MOV.C instruction, but would need to restructure
> the prolog-generation code to make use of it in this case.
If you design around the notion of a 3R1W register file, FMAC and INSERT
fall out of the encoding easily. Done right, one can switch it into a 4R
or 4W register file for ENTER and EXIT--lessening the overhead of call/ret.
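A rough sketch of the encoding argument in C, with invented field
positions (not My 66000's actual layout): four 5-bit register fields
fit in one 32-bit word with 12 opcode bits to spare, which is why a
3R1W op such as FMAC (rd = rs1*rs2 + rs3) falls out naturally:

    #include <stdint.h>

    typedef struct {
        unsigned opcode;            // 12 bits
        unsigned rd;                // 1 write port
        unsigned rs1, rs2, rs3;     // 3 read ports
    } Inst3R1W;

    /* 12 + 5 + 5 + 5 + 5 = 32 bits: decode is a few shifts and masks. */
    Inst3R1W decode(uint32_t w) {
        Inst3R1W i;
        i.opcode =  w >> 20;
        i.rd     = (w >> 15) & 31;
        i.rs1    = (w >> 10) & 31;
        i.rs2    = (w >>  5) & 31;
        i.rs3    =  w        & 31;
        return i;
    }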
> Possibly. It looks like some savings could be possible in terms of
> prologs and epilogs. As-is, these are generally like:
>
>   MOV   LR, R18
>   MOV   GBR, R19
>   ADD   -192, SP
>   MOV.X R18, (SP, 176)  //save GBR and LR
>   MOV.X ...             //save registers

Why not an instruction that saves LR and GBR without wasting
instructions to place them side by side prior to saving them ??
> Correction:
>
>   WEXMD 2               //specify that we want 3-wide execution here
>
>   //Reload GBR, *1
>   MOV.Q (GBR, 0), R18
>   MOV   0, R0           //special reloc here
>   MOV.Q (GBR, R0), R18
>   MOV   R18, GBR
It is gorp like that that led me to do it in HW with ENTER and EXIT:
save registers to the stack, set up FP if desired, allocate stack on
SP, and decide whether EXIT also does RET or just reloads the file.
This would require 2 free registers if done in pure SW, along with
several MOVs...

> Possibly.
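Going by the description above, a rough C model of what ENTER and EXIT
each fold into a single instruction (the Core struct, operand names,
and ordering here are assumptions for the sketch, not My 66000's
documented semantics):

    #include <stdint.h>

    typedef struct {
        uint64_t r[32];
        uint64_t sp, fp, pc, lr;
        uint64_t *mem;     // flat view of stack memory, word-indexed
    } Core;

    /* ENTER: save LR and a run of callee-saved registers, optionally
       establish FP, then allocate the local frame on SP. */
    void enter(Core *c, int first, int count, uint64_t framewords) {
        c->mem[--c->sp] = c->lr;                // save return address
        for (int i = 0; i < count; i++)
            c->mem[--c->sp] = c->r[first + i];  // save callee-saves
        c->fp = c->sp;                          // set up frame pointer
        c->sp -= framewords;                    // allocate locals
    }

    /* EXIT: undo ENTER, optionally folding in the RET as well. */
    void exit_ret(Core *c, int first, int count, uint64_t framewords) {
        c->sp += framewords;                    // free locals
        for (int i = count - 1; i >= 0; i--)
            c->r[first + i] = c->mem[c->sp++];  // reload the file
        c->lr = c->mem[c->sp++];                // reload return address
        c->pc = c->lr;                          // ...and return
    }

Done in pure software, the same work needs scratch registers and a
string of MOVs, which is exactly the overhead being discussed.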
> If I were doing a more conventional ABI, I would likely use
> (PC, Disp33s) for accessing global variables.
>
>   //Generate Stack Canary, *2
>   MOV   0x5149, R18     //magic number (randomly generated)
>   VSKG  R18, R18        //Magic (combines input with SP and magic numbers)
>   MOV.Q R18, (SP, 144)
>
>   ...
>   function-specific stuff
>   ...
>
>   MOV   0x5149, R18
>   MOV.Q (SP, 144), R19
>   VSKC  R18, R19        //Validate canary
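In C terms, that canary ritual does roughly the following (mix() merely
stands in for whatever VSKG/VSKC actually compute; the real
instructions fold in SP and internal magic numbers that this sketch
only imitates):

    #include <stdint.h>
    #include <stdlib.h>

    /* Stand-in for the VSKG/VSKC combining function. */
    static uint64_t mix(uint64_t magic, uint64_t sp) {
        return (magic * 0x9E3779B97F4A7C15u) ^ (sp << 17) ^ (sp >> 9);
    }

    void example(void) {
        uint64_t sp = (uintptr_t)__builtin_frame_address(0);  // GCC/Clang
        volatile uint64_t slot = mix(0x5149, sp);  // VSKG + store to frame

        /* ... function body, local arrays, etc. ... */

        if (mix(0x5149, sp) != slot)   // VSKC: regenerate and compare
            abort();                   // mismatch -> fault
    }

An overflowing local array has to walk across the canary slot before it
reaches the register save area, so a clobbered slot is caught before
the function returns through saved state.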
> *1: This part ties into the ABI, and mostly exists so that each PE
> image can get GBR reloaded back to its own ".data"/".bss" sections
> (with multiple program instances in a single address space). But it
> does mean that pretty much every non-leaf function ends up needing
> to go through this ritual.

Universal displacements make GBR unnecessary, as a memory reference can
be accompanied by a 16-bit, 32-bit, or 64-bit displacement. Yes, you
can read GOT[#i] directly without a pointer to it.

> I am not so sure that they could solve the "map multiple instances
> of the same binary into a single address space" issue, which is sort
> of the whole thing for why GBR is being used.

Universal constants solve the underlying issue.
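The issue BGB is pointing at, as a minimal C sketch (names invented for
illustration): with several instances of one binary sharing an address
space, globals cannot live at link-time-fixed addresses, so each
instance carries its own data-section base, which is the role GBR plays
in the BJX2 ABI:

    #include <stdint.h>

    typedef struct {
        int counter;       // this instance's ".data"/".bss" image
    } InstanceData;

    /* The code is shared; only the base pointer differs per instance.
       Every global access indexes off it -- MOV.Q (GBR, disp) in
       BJX2 terms. */
    int bump(InstanceData *gbr) {
        return ++gbr->counter;
    }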
> Possibly. I am using a conventional linear stack.
>
> *2: Pretty much any function that has local arrays or similar gets
> one; it serves to protect the register save area. If the magic
> number can't regenerate a matching canary at the end of the
> function, then a fault is generated.

My 66000 can place the callee-save registers in a place where user
code cannot access them with LDs or modify them with STs. So malicious
code cannot damage the contract between ABI and core.
> I guess it could make sense to add a compiler stat for this...
>
> The cost of some of this starts to add up. In isolation, not much;
> but if all this happens, say, 500 or 1000 times or more in a
> program, it can add up.

Was thinking about that last night. H&P "book" statistics say that
call/ret represents 2% of instructions executed. But if you add up the
prologue and epilogue instructions, you find 8% of instructions are
related to calling and returning--taking the problem from (at 2%)
ignorable to (at 8%) a big-ticket item demanding something be done.
(That is, if CALL and RET alone are about 1% each, three saves plus
three restores per call add roughly 6 points more.)

And that 8% represents saving/restoring only 3 registers via the stack
plus the associated SP arithmetic. So, it can easily go higher.