On 4/21/2024 1:57 PM, MitchAlsup1 wrote:
> BGB wrote:
>
One of the things that I notice with My 66000 is that when you get all the constants you ever need at the calculation OpCodes, you end up with FEWER instructions that "go random places", such as instructions that, well, paste constants together. This leaves you with a data-dependent string of calculations with occasional memory references. That is: universal constants get rid of the easy-to-pipeline extra instructions, leaving the meat of the algorithm exposed.
Possibly true.
RISC-V tends to have a lot of extra instructions due to lack of big constants and lack of indexed addressing.
You forgot the "everyone and his brother" design of the ISA.
And, BJX2 has a lot of frivolous register-register MOV instructions.
I empower you to get rid of them....
If you design around the notion of a 3R1W register file, FMAC and INSERT
fall out of the encoding easily. Done right, one can switch it into a 4R
or 4W register file for ENTER and EXIT--lessening the overhead of call/ret.
>
Possibly.
It looks like some savings could be possible in terms of prologs and epilogs.
As-is, these are generally like:
MOV LR, R18
MOV GBR, R19
ADD -192, SP
MOV.X R18, (SP, 176) //save GBR and LR
MOV.X ... //save registers
Why not an instruction that saves LR and GBR without wasting instructions to place them side by side prior to saving them?
I have an optional MOV.C instruction, but would need to restructure the code for generating the prologs to make use of them in this case.
Say:
MOV.C GBR, (SP, 184)
MOV.C LR, (SP, 176)
Though, MOV.C is considered optional.
There is a "MOV.C Lite" option, which saves some cost by only allowing it for certain CRs (mostly LR and GBR), which also sort of overlaps with (and is needed by) RISC-V mode, because these registers are in GPR land for RV.
But, in any case, current compiler output shuffles them to R18 and R19 before saving them.
WEXMD 2 //specify that we want 3-wide execution here
//Reload GBR, *1
MOV.Q (GBR, 0), R18
MOV 0, R0 //special reloc here
MOV.Q (GBR, R0), R18
MOV R18, GBR
Correction:
>> MOV.Q (R18, R0), R18
It is gorp like that that led me to do it in HW with ENTER and EXIT.
Save registers to the stack, set up FP if desired, allocate stack on SP, and decide if EXIT also does RET or just reloads the file. This would require 2 free registers if done in pure SW, along with several MOVs...
Possibly.
No time like the present...
The partial reason it loads into R0 and uses R0 as an index, was that I defined this mechanism before jumbo prefixes existed, and hadn't updated it to allow for jumbo prefixes.
Well, and if I used a direct displacement for GBR (which, along with PC, is always BYTE Scale), this would have created a hard limit of 64 DLLs per process-space (I defined it as Disp24, which allows a more reasonable hard upper limit of 2M DLLs per process-space).
In my case, restricting myself to 32-bit IP relative addressing, GOT can
Granted, nowhere near even the limit of 64 as of yet. But, I had noted that Windows programs would often easily exceed this limit, with even a fairly simple program pulling in a fairly large number of random DLLs, so in any case, a larger limit was needed.
Due to the way linkages work in My 66000, each DLL gets its own GOT.
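The 64 versus 2M limits mentioned above are consistent with simple displacement arithmetic. A rough sketch follows; the 8-byte entry size and the 9-bit width of the direct displacement are assumptions inferred from the numbers, not stated in the post:

```python
# Assumptions (not from the post): each per-image table entry is 8 bytes,
# and the "direct displacement" case is a 9-bit unsigned byte offset.
ENTRY_BYTES = 8


def max_images(disp_bits: int, entry_bytes: int = ENTRY_BYTES) -> int:
    """Number of table entries reachable with an unsigned byte displacement."""
    return (1 << disp_bits) // entry_bytes


small_limit = max_images(9)    # direct displacement case
large_limit = max_images(24)   # Disp24 case
print(small_limit, large_limit)
```

Under those assumptions the two cases work out to 64 and 2097152 (2M) entries, matching the limits quoted.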
One potential optimization here is that the main EXE will always be 0 in the process, so this sequence could be reduced to, potentially:
MOV.Q (GBR, 0), R18
MOV.C (R18, 0), GBR
Early on, I did not have the constraint that main EXE was always 0, and had initially assumed it would be treated equivalently to a DLL.
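As a toy model of the GBR reload (using the corrected sequence), something like the following may help. The memory layout is an assumption inferred from the listing: slot 0 of each image's data section holds a pointer to a per-process table, and the table maps a per-image byte offset to that image's data-section base. The names and addresses are made up for illustration:

```python
# Hypothetical addresses for illustration only.
TABLE = 0x1000      # per-process table of data-section bases
EXE_DATA = 0x2000   # data section of the main EXE (index 0)
DLL_DATA = 0x3000   # data section of one DLL (index 1)

# Sparse "memory": address -> 64-bit value.
memory = {
    TABLE + 0: EXE_DATA,   # entry 0 is always the main EXE
    TABLE + 8: DLL_DATA,   # entry 1 is the DLL
    EXE_DATA + 0: TABLE,   # slot 0 of each data section points at the table
    DLL_DATA + 0: TABLE,
}


def reload_gbr(mem: dict, gbr: int, image_offset: int) -> int:
    """Return the new GBR for the image whose table offset is image_offset."""
    table = mem[gbr + 0]               # MOV.Q (GBR, 0), R18
    return mem[table + image_offset]   # MOV.Q (R18, R0), R18 ; MOV R18, GBR


def reload_gbr_main_exe(mem: dict, gbr: int) -> int:
    """Shortened form for the main EXE, whose offset is known to be 0."""
    return mem[mem[gbr + 0] + 0]
```

A DLL function entered with the caller's GBR (say `EXE_DATA`) would call `reload_gbr(memory, EXE_DATA, 8)` and end up with `DLL_DATA`; the main EXE can skip the index load because its offset is fixed at 0.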
//Generate Stack Canary, *2
MOV 0x5149, R18 //magic number (randomly generated)
VSKG R18, R18 //Magic (combines input with SP and magic numbers)
MOV.Q R18, (SP, 144)
...
function-specific stuff
...
MOV 0x5149, R18
MOV.Q (SP, 144), R19
VSKC R18, R19 //Validate canary
...
*1: This part ties into the ABI, and mostly exists so that each PE image can get GBR reloaded back to its own ".data"/".bss" sections (with
Universal displacements make GBR unnecessary as a memory reference can be accompanied with a 16-bit, 32-bit, or 64-bit displacement. Yes, you can read GOT[#i] directly without a pointer to it.
If I were doing a more conventional ABI, I would likely use (PC, Disp33s) for accessing global variables.
Even those 128GB away ??
Problem is:
Not a problem when each PE has a different set of mapping tables (at least
What if one wants multiple logical instances of a given PE image in a single address space?
PC REL breaks in this case, unless you load N copies of each PE image, which is a waste of memory (well, or use COW mappings, mandating the use of an MMU).
ELF FDPIC had used a different strategy, but then effectively turned each function call into something like (in SH):
Which I do with::
MOV R14, R2 //R14=GOT
MOV disp, R0 //offset into GOT
ADD R0, R2 //adjust by offset
//R2=function pointer
MOV.L (R2, 0), R1 //function address
MOV.L (R2, 4), R3 //GOT
JSR R1
In the callee:
... save registers ...
MOV R3, R14 //put GOT into a callee-save register
...
In the BJX2 ABI, I had rolled this part into the callee, reasoning that handling it in the callee (per-function) was less overhead than handling it in the caller (per function call).
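The FDPIC-style idea above (a function pointer is really a pair of code address and data/GOT pointer, so one copy of the code can serve several instances) can be sketched roughly as follows. This is only a conceptual model; the class and function names are made up for illustration and do not correspond to any real ABI structure:

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Descriptor:
    """Hypothetical function descriptor: shared code plus per-instance data."""
    code: Callable[[Dict[str, int]], int]  # like the word at (R2, 0)
    got: Dict[str, int]                    # like the word at (R2, 4)


def bump(got: Dict[str, int]) -> int:
    """Shared code: it only touches globals through the data pointer it gets."""
    got["counter"] += 1
    return got["counter"]


def call(desc: Descriptor) -> int:
    """Caller side: load code and GOT from the descriptor, then jump."""
    return desc.code(desc.got)


# Two logical instances of the same image: one copy of the code, two data sets.
instance_a = Descriptor(bump, {"counter": 0})
instance_b = Descriptor(bump, {"counter": 100})
```

Calling `call(instance_a)` and `call(instance_b)` runs the same `bump` code against different data, which is the property plain PC-relative addressing does not give without duplicating the image.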
Though, on the RISC-V side, it has the relative advantage of compiling for absolute addressing, albeit still loses in terms of performance.
Compiling and linking to absolute addresses works "really well" when one needs to place different sections in different memory every time the
I don't imagine an FDPIC version of RISC-V would win here, but this is only assuming there exists some way to get GCC to output FDPIC binaries (most I could find, was people debating whether to add FDPIC support for RISC-V).
PIC or PIE would also sort of work, but these still don't really allow for multiple program instances in a single address space.
Once you share the code and some of the data, the overhead of using different
multiple program instances in a single address space). But, does mean that pretty much every non-leaf function ends up needing to go through this ritual.
Universal constant solves the underlying issue.
I am not so sure that they could solve the "map multiple instances of the same binary into a single address space" issue, which is sort of the whole thing for why GBR is being used.
Otherwise, I would have been using PC-REL...
*2: Pretty much any function that has local arrays or similar, serves to protect register save area. If the magic number can't regenerate a matching canary at the end of the function, then a fault is generated.
My 66000 can place the callee save registers in a place where user cannot access them with LDs or modify them with STs. So malicious code cannot damage the contract between ABI and core.
Possibly. I am using a conventional linear stack.
Downside: There is a need either for bounds checking or canaries. Canaries are the cheaper option in this case.
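For reference, the canary scheme amounts to roughly the following. This is only a sketch: the mixing function and the secret are hypothetical stand-ins, since the post only says that VSKG combines its input with SP and magic numbers, and a failed check is modeled here as returning False rather than raising a fault:

```python
MASK64 = (1 << 64) - 1
SECRET = 0x9E3779B97F4A7C15  # hypothetical per-process secret value


def make_canary(magic: int, sp: int, secret: int = SECRET) -> int:
    """Hypothetical stand-in for VSKG: mix the magic number, SP, and a secret."""
    x = (magic ^ sp ^ secret) & MASK64
    x = (x * 0xBF58476D1CE4E5B9) & MASK64  # odd multiplier, arbitrary choice
    return x ^ (x >> 31)


def check_canary(magic: int, sp: int, stored: int, secret: int = SECRET) -> bool:
    """Hypothetical stand-in for VSKC: regenerate and compare."""
    return make_canary(magic, sp, secret) == stored


# Prolog: store the canary between the local arrays and the register save area.
sp = 0x7FFF0000
frame = {144: make_canary(0x5149, sp)}

# Epilog: the check passes unless something overwrote the slot.
intact = check_canary(0x5149, sp, frame[144])
frame[144] ^= 0xFF  # simulate an overflow from a local array
clobbered_ok = check_canary(0x5149, sp, frame[144])
```

Here `intact` is True and `clobbered_ok` is False; in the real thing the second case would trigger the fault.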
The cost of some of this starts to add up.
In isolation, not much, but if all this happens, say, 500 or 1000 times or more in a program, this can add up.
Was thinking about that last night. H&P "book" statistics say that call/ret represents 2% of instructions executed. But if you add up the prologue and epilogue instructions you find 8% of instructions are related to calling and returning--taking the problem from (at 2%) ignorable to (at 8%) a big ticket item demanding something be done.
8% represents saving/restoring only 3 registers via the stack and associated SP arithmetic. So, it can easily go higher.
I guess it could make sense to add a compiler stat for this...
The save/restore can get folded off, but generally only done for functions with a larger number of registers being saved/restored (and does not cover secondary things like GBR reload or stack canary stuff, which appears to possibly be a significant chunk of space).
Goes and adds a stat for averages:
Prolog: 8% (avg= 24 bytes)
Epilog: 4% (avg= 12 bytes)
Body : 88% (avg=260 bytes)
With 959 functions counted (excluding empty functions/prototypes).
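The percentages look consistent with the per-function averages; a quick check, assuming the shares are simply each average divided by the total average function size (the variable names are just for illustration):

```python
# Average bytes per function, taken from the stats above.
avg_bytes = {"prolog": 24, "epilog": 12, "body": 260}

total = sum(avg_bytes.values())
share = {name: round(100 * size / total) for name, size in avg_bytes.items()}
overhead = 100 * (avg_bytes["prolog"] + avg_bytes["epilog"]) / total

print(total, share, round(overhead, 1))
```

That gives an average function size of 296 bytes, with prolog plus epilog at roughly 12% of the code bytes.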
....