Re: Constant Stack Canaries

Subject : Re: Constant Stack Canaries
From : cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups : comp.arch
Date : 2 Apr 2025, 06:43:39
Organization : A noiseless patient Spider
Message-ID : <vsiit3$12k13$1@dont-email.me>
User-Agent : Mozilla Thunderbird
On 4/1/2025 6:21 PM, MitchAlsup1 wrote:
On Tue, 1 Apr 2025 19:34:10 +0000, BGB wrote:
 
On 3/31/2025 3:52 PM, MitchAlsup1 wrote:
On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
---------------------
PC-Rel not being used as PC-Rel doesn't allow for multiple process
instances of a given loaded binary within a shared address space.
>
As long as the relative distance is the same, it does.
>
>
Can't happen within a shared address space.
>
Say, if you load a single copy of a binary at 0x24680000.
Process A and B can't use the same mapping in the same address space,
with PC-rel globals, as then they would each see the other's globals.
 Say I load a copy of the binary text at 0x24680000 and its data at
0x35900000 for a distance of 0x11280000 into the address space of
a process.
 Then I load another copy at 0x44680000 and its data at 0x55900000
into the address space of a different process.
 PC-rel addressing works in both cases--because the distance (-rel)
remains the same,
 and the MMU can translate the code to the same physical, and map
each area of data individually.
 Different virtual addresses, same code physical address, different
data virtual and physical addresses.
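
The arithmetic in this example can be checked with a short sketch. The addresses are the ones quoted above; the helper name `pc_rel_distance` and the dictionary layout are made up for illustration only.

```python
# Text/data placements from the example above (virtual addresses).
PROCESS_A = {"text": 0x24680000, "data": 0x35900000}
PROCESS_B = {"text": 0x44680000, "data": 0x55900000}

def pc_rel_distance(mapping):
    """Distance from the start of text to the start of data (illustrative helper)."""
    return mapping["data"] - mapping["text"]

dist_a = pc_rel_distance(PROCESS_A)
dist_b = pc_rel_distance(PROCESS_B)

# Both mappings keep the same text-to-data distance, so the same
# PC-relative displacement reaches the data in either process.
print(hex(dist_a), hex(dist_b))  # 0x11280000 0x11280000
```

Note that this only covers the case of separate address spaces; it says nothing about two instances living inside one shared address space.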
 
You can't do a duplicate mapping at another address, as this both wastes
VAS, and also any Abs64 base-relocs or similar would differ.
 A 64-bit VAS is a wasteable address space, whereas a 48-bit VAS is not.
 
OK.
PE/COFF had defined Abs64 relocs, but I am using a 48-bit VAS.
It would not have made sense to define separate Abs48 relocs, but much of the time we can just assume the high-order bits (HOBs) are zero.
Well, except for function pointers, where the base-reloc handling detects pointers into ".text" and does some special secret-sauce magic regarding the HOBs to make sure they are correctly tagged.
Binaries are not generally fully PIE though, but are instead base-relocated (more like EXE/DLL handling in Windows). Though, most things within the core proper are either PC-rel or GBR rel, and there are usually a relatively small number of base-relocations.
Things like DLL calls are essentially absolute addressed though. Where, mapping instances at different virtual addresses would be messy for things like DLL handling (in the absence of a GOT or similar).

You also can't CoW the data/bss sections, as this is no longer a shared
address space.
 You are trying to "get at" something here, but I can't see it (yet).
 
Shared address space assumes all processes have the same page tables and shared address mappings and TLB contents (though, ACL checking can be different, as the ACL/KRR stuff is not based on having separate contents in the page tables or TLB, *).
By definition, CoW can't be used in this constraint.
But, multiple VASes add new problems (both hassles and potential performance effects, so better here to delay this if possible).
*: A smaller 4-entry full-assoc cache is used for ACL checks, so it is more of a "what access does the current task have to this particular ACL" check. But, admittedly, some of this part is still TODO regarding making use of it in the OS.

>
So, alternative is to use GBR to access globals, with the data/bss
sections allocated independently of the binary.
>
This way, multiple processes can share the same mapping at the same
address for any executable code and constant data, with only the data
sections needing to be allocated.
>
>
Does mean though that one needs to save/restore the global pointer, and
there is a ritual for reloading it.
>
EXE's generally assume they are index 0, so:
   MOV.Q (GBR, 0), Rt
   MOV.Q (Rt, 0), GBR
Or, in RV terms:
   LD    X6, 0(X3)
   LD    X3, Disp33(X6)
Or, RV64G:
   LD    X6, 0(X3)
   LUI   X5, DispHi
   ADD   X5, X5, X6
   LD    X3, DispLo(X5)
>
>
For DLL's, the index is fixed up with a base-reloc (for each loaded
DLL), so basically the same idea. Typically a Disp33 is used here to
allow for a potentially large/unknown number of loaded DLL's. Thus far,
a global numbering scheme is used.
>
Where, (GBR+0) gives the address of a table of global pointers for every
loaded binary (can be assumed read-only from userland).
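
A toy model of the reload ritual described above, as a rough sketch rather than the actual loader logic. The class and function names (`DataSection`, `reload_gbr`, `make_process`) and the image names are hypothetical, and indexing the table directly by image number is an assumption based on the description of a global numbering scheme.

```python
# Toy model of the GBR reload ritual; all names here are illustrative.
class DataSection:
    """One per-process instance of a binary's data/bss."""
    def __init__(self, name, gbr_table):
        self.name = name
        # Models the word at (GBR+0): the table of global pointers
        # for every loaded binary in this process.
        self.gbr_table = gbr_table

def reload_gbr(current_gbr, image_index):
    """Models: MOV.Q (GBR, 0), Rt  then  MOV.Q (Rt, index), GBR."""
    table = current_gbr.gbr_table   # first load: fetch the table pointer
    return table[image_index]       # second load: fetch this image's GBR

def make_process(pid, image_names):
    """Allocate fresh data sections for each image; code is shared elsewhere."""
    table = []
    for name in image_names:
        table.append(DataSection(f"{name}@pid{pid}", table))
    return table

proc_a = make_process(1, ["main.exe", "libfoo.dll"])
proc_b = make_process(2, ["main.exe", "libfoo.dll"])

# A function in libfoo.dll (index 1) entered while the EXE's GBR is live:
gbr = reload_gbr(proc_a[0], 1)
print(gbr.name)  # libfoo.dll@pid1
```

The same code sequence, run in the second process, lands on that process's own data section, which is the point of keeping the data independent of the shared code mapping.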
>
>
Generally, this is needed if:
   Function may be called from outside of the current binary and:
     Accesses global variables;
     And/or, calls local functions.
 I just use 32-bit or 64-bit displacement constants. Does not matter
how control arrived at this subroutine, it accesses its data as the
linker resolved addresses--without wasting a register.
 
GBR or GP is specially designated as a global pointer though.
Not so starved for registers that it would make sense to reclaim it as a GPR.
But, yeah, do need to care how control can arrive at a given function.

>
Though, still generally lower average-case overhead than the strategy
typically used by FDPIC, which would handle this reload process on the
caller side...
   SD    X3, Disp(SP)
   LD    X3, 8(X18)
   LD    X6, 0(X18)
   JALR  X1, 0(X6)
   LD    X3, Disp(SP)
 This is just:
      CALX    [IP,,#GOT[funct_num]-.]
 In the 32-bit linking mode this is a 2 word instruction, in the 64-bit
linking mode it is a 3 word instruction.
----------------
OK.
Neither BJX nor RISC-V have special instructions to deal with FDPIC call semantics.

>
Though, execl() effectively replaces the current process.
>
IMHO, a "CreateProcess()" style abstraction makes more sense than
fork+exec.
 You are 40 years late on that.
 
I am just doing it the Windows (or Cygwin) way...
Most POSIX-style programs still work, but with a slightly higher risk of "stuff may catastrophically explode" (say, if one tries to use "fork()" to fork off copies of the parent process, and then returns from the call frame that called "fork()").
Fork could be made to clone the global variables, though avoiding tangled addresses could be an issue (could maybe be done by relying on debuginfo or similar, to walk the globals and then redirect any pointers from the old data/bss into the new one; kinda SOL for anything on the heap though).
Better may just be to be like "yeah, fork() doesn't really work, don't use it...".
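
For what it is worth, the "clone the globals and redirect pointers" idea could look roughly like the sketch below. This is only an illustration under strong assumptions: the data section is modeled as a list of 64-bit words, and `pointer_slots` stands in for whatever debuginfo would say about which globals hold pointers. The function name is hypothetical.

```python
WORD_BYTES = 8  # 64-bit words

def clone_data_section(words, pointer_slots, old_base, new_base):
    """Copy a data/bss image and rebase pointers that aimed into the old copy.

    words         -- list of 64-bit values making up the old data/bss
    pointer_slots -- indices assumed (e.g. from debuginfo) to hold pointers
    """
    size = len(words) * WORD_BYTES
    clone = list(words)
    for i in pointer_slots:
        p = clone[i]
        # Only pointers into the old data/bss get redirected; anything
        # pointing elsewhere (heap, code) is left alone.
        if old_base <= p < old_base + size:
            clone[i] = p - old_base + new_base
    return clone

old = [0x10000008, 42, 0x90000000]
new = clone_data_section(old, pointer_slots=[0, 2],
                         old_base=0x10000000, new_base=0x20000000)
print([hex(w) for w in new])  # ['0x20000008', '0x2a', '0x90000000']
```

The untouched third word shows the limitation mentioned above: a heap object that itself points back into the old data/bss would not be found by walking the globals.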

---------------
>
But, invariably, someone will want "compressed" instructions with a
subset of the registers, and one can't just have these only having
access to argument registers.
>
Brian had little trouble using My 66000 ABI which does have contiguous
register groupings.
>
>
But, My66000 also isn't like, "Hey, how about 16-bit ops with 3 or 4 bit
register numbers".
>
Not sure the thinking behind the RV ABI.
 If RISC-V removed its 16-bit instructions, there is room in its ISA
to put my entire ISA along with all the non-compressed RISC-V
instructions.
 
Yeah, errm, how do you think XG3 came about?...
I just sort of dropped the C instructions and shoved nearly the entirety of XG2 into that space.
There would still have been half the encoding space left, if predication were disallowed.
But, say, RV64G + XG3 (sans predication) + 2/3 of the 'C' extension, would be a bit picky...
Granted, did need to shuffle the bits for the ISAs to be encoding-compatible; and went a little further than the bare minimum to avoid dog chew (gluing them together with entirely mismatched encodings and disjoint register numbering would have been possible; but I wanted at least some semblance of encoding consistency between them).

---------------
>
Prolog needs a call, but epilog can just be a branch, since no need to
return back into the function that is returning.
>
Yes, but this means My 66000 executes 3 fewer transfers of control
per subroutine than you do. And taken branches add latency.
>
>
Granted.
>
Each predicted branch adds 2 cycles.
 So, you lose 6 cycles on just under ½ of all subroutine calls,
while also executing 2-5 instructions manipulating your global
pointer.
 
Possibly, but I don't think it is quite that bad on average...
Would need to run some stats and do some math to try to figure out the percentages and relative impact from each of these.
But, even with all this, and using stack canaries (which add around 6 or so instructions when applicable), it is still outperforming GCC's RV64G output (along with smaller binaries).
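
A back-of-the-envelope version of the estimate being argued over, using only the figures quoted above (3 extra transfers of control, 2 cycles per predicted branch). The fraction of calls affected is left as a parameter, since the real percentage would need measured stats; the function name is made up.

```python
CYCLES_PER_PREDICTED_BRANCH = 2  # figure quoted above
EXTRA_TRANSFERS_PER_CALL = 3     # figure quoted above

def expected_extra_cycles(fraction_of_calls_affected):
    """Average extra cycles per subroutine call from the added branches."""
    per_affected_call = EXTRA_TRANSFERS_PER_CALL * CYCLES_PER_PREDICTED_BRANCH
    return fraction_of_calls_affected * per_affected_call

# "Just under half" of calls gives an upper bound of about 3 cycles per call.
upper_bound = expected_extra_cycles(0.5)
print(upper_bound)  # 3.0
```

This ignores the global-pointer reload instructions and any overlap with other latency, so it is only a starting point for the stats mentioned above.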

>
Needs to have a lower limit though, as it is not worth it to use a
call/branch to save/restore 3 or 4 registers...
>
But, say, 20 registers, it is more worthwhile.
>
ENTER saves as few as 1 or as many as 32 and remains that 1 single
instruction. Same for EXIT, and EXIT also performs the RET when LDing
R0.
>
>
Granted.
>
My strategy isn't perfect:
   Non-zero branching overheads, when the feature is used;
   Per-function load/store slides in prolog/epilog, when not used.
>
Then, the heuristic mostly becomes one of when it is better to use the
inline strategy (load/store slide), or to fold them off and use
calls/branches.
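
A minimal sketch of such a heuristic. The cutoff of 8 registers is a placeholder, not a value from this thread; the only data points given above are that 3 or 4 registers is not worth folding and 20 is.

```python
def choose_save_strategy(num_saved_regs, fold_threshold=8):
    """Pick how a function saves/restores callee-saved registers.

    fold_threshold is a hypothetical tuning knob: below it, emit the
    load/store slide inline; at or above it, call a shared prolog helper
    and branch to a shared epilog helper.
    """
    if num_saved_regs >= fold_threshold:
        return "folded"
    return "inline"

print(choose_save_strategy(4))   # inline
print(choose_save_strategy(20))  # folded
```

In practice the threshold would be tuned against the branch overhead and code-size savings on the target.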
 My solution gets rid of the dilemma:
a) the call code is always smaller
b) the call code never takes more cycles
 In addition, there is a straightforward way to elide the STs of ENTER
when the memory unit is still executing the previous EXIT.
 
OK.
I was trying to keep the CPU implementation from being too complicated.
In my case though, there is an advantage over plain RV64G:
   I have a Load/Store Pair, so need fewer Load/Store operations.
Though, my RV+Jx experiment does also have this...
Though there were also variants defined for RV32 but not for RV64 (because apparently there was indecision about encodings, and some arguments from the "opcode fusion" camp that 64-bit RV processors could fuse groups of LD or SD instructions...).
Decided to leave out complaining about "opcode fusion" distractions (from actually addressing ISA issues) and the seeming over-reliance on SpecInt and CoreMark to drive ISA design choices...
Granted, one might say the same about Doom, but at least I am treating Doom more as a representation of a workload, and not the end-goal arbiter of what is added or dropped.

Does technically also work for RISC-V though (though seemingly GCC
always uses inline save/restore, but also the RV ABI has fewer
registers).
