Newsportal USENET - Re: Constant Stack Canaries

On 4/1/2025 6:21 PM, MitchAlsup1 wrote:

On Tue, 1 Apr 2025 19:34:10 +0000, BGB wrote:

On 3/31/2025 3:52 PM, MitchAlsup1 wrote:
On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
---------------------
PC-rel is not being used, as PC-rel doesn't allow for multiple process
instances of a given loaded binary within a shared address space.
>
As long as the relative distance is the same, it does.
>
>
Can't happen within a shared address space.
>
Say, if you load a single copy of a binary at 0x24680000.
Process A and B can't use the same mapping in the same address space,
with PC-rel globals, as then they would each see the other's globals.
Say I load a copy of the binary text at 0x24680000 and its data at
0x35900000 for a distance of 0x11280000 into the address space of
a process.
Then I load another copy at 0x44680000 and its data at 0x55900000
into the address space of a different process.
PC-rel addressing works in both cases--because the distance (-rel)
remains the same,
and the MMU can translate the code to the same physical, and map
each area of data individually.
Different virtual addresses, same code physical address, different
data virtual and physical addresses.
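
The arithmetic in this example can be checked with a short sketch. The addresses are the ones given above; the names `Mapping` and `pc_rel_target`, and the instruction and variable offsets, are illustrative assumptions rather than part of any real loader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    text_base: int  # virtual address where the binary's text is mapped
    data_base: int  # virtual address where its data/bss is mapped

    @property
    def distance(self) -> int:
        # Text-to-data distance; this is what PC-rel displacements encode.
        return self.data_base - self.text_base


def pc_rel_target(pc: int, disp: int) -> int:
    """Effective address of a PC-relative access."""
    return pc + disp


proc_a = Mapping(text_base=0x24680000, data_base=0x35900000)
proc_b = Mapping(text_base=0x44680000, data_base=0x55900000)

# Hypothetical offsets: an instruction 0x100 bytes into text that
# accesses a global 0x40 bytes into data.
insn_off, var_off = 0x100, 0x40
disp = proc_a.distance + var_off - insn_off  # fixed once, at link time

addr_a = pc_rel_target(proc_a.text_base + insn_off, disp)
addr_b = pc_rel_target(proc_b.text_base + insn_off, disp)
print(hex(proc_a.distance), hex(addr_a), hex(addr_b))
```

The same displacement lands in each process's own data because each process has its own text address. With a single text mapping in a shared address space there is only one PC, so the same displacement would reach the same data for every process, which is the case BGB is describing.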

You can't do a duplicate mapping at another address, as this both wastes
VAS, and also any Abs64 base-relocs or similar would differ.
A 64-bit VAS is a wasteable address space, whereas a 48-bit VAS is not.

OK.
PE/COFF had defined Abs64 relocs, but I am using a 48-bit VAS.
Would not have made sense to define separate Abs48 relocs, but much of the time, we can just assume the HOBs are zero.
Well, except for function pointers, where the base-reloc handling detects pointers into ".text" and does some special secret-sauce magic regarding the HOBs to make sure they are correctly tagged.
Binaries are not generally fully PIE though, but are instead base-relocated (more like EXE/DLL handling in Windows). Though, most things within the core proper are either PC-rel or GBR rel, and there are usually a relatively small number of base-relocations.
Things like DLL calls are essentially absolute addressed though. Where, mapping instances at different virtual addresses would be messy for things like DLL handling (in the absence of a GOT or similar).

You also can't CoW the data/bss sections, as this is no longer a shared
address space.
You are trying to "get at" something here, but I can't see it (yet).

Shared address space assumes all processes have the same page tables and shared address mappings and TLB contents (though, ACL checking can be different, as the ACL/KRR stuff is not based on having separate contents in the page tables or TLB, *).
By definition, CoW can't be used in this constraint.
But, multiple VAS's adds new problems (both hassles and potential performance effects, so better here to delay this if possible).
*: A smaller 4-entry full-assoc cache is used for ACL checks, so it is more of a "what access does the current task have to this particular ACL" check. But, admittedly, some of this part is still TODO regarding making use of it in the OS.

>
So, alternative is to use GBR to access globals, with the data/bss
sections allocated independently of the binary.
>
This way, multiple processes can share the same mapping at the same
address for any executable code and constant data, with only the data
sections needing to be allocated.
>
>
Does mean though that one needs to save/restore the global pointer, and
there is a ritual for reloading it.
>
EXE's generally assume they are index 0, so:
   MOV.Q (GBR, 0), Rt
   MOV.Q (Rt, 0), GBR
Or, in RV terms:
   LD    X6, 0(X3)
   LD    X3, Disp33(X6)
Or, RV64G:
   LD    X6, 0(X3)
   LUI   X5, DispHi
   ADD   X5, X5, X6
   LD    X3, DispLo(X5)
>
>
For DLL's, the index is fixed up with a base-reloc (for each loaded
DLL), so basically the same idea. Typically a Disp33 is used here to
allow for a potentially large/unknown number of loaded DLL's. Thus far,
a global numbering scheme is used.
>
Where, (GBR+0) gives the address of a table of global pointers for every
loaded binary (can be assumed read-only from userland).
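
A minimal model of the reload ritual described above, assuming memory is a plain dictionary from address to 64-bit value. The addresses, the `reload_gbr` name and the two-binary layout are made up for illustration; only the two-load shape (table pointer from GBR+0, then the entry for this binary's index) comes from the post.

```python
WORD = 8  # bytes per 64-bit pointer


def reload_gbr(mem: dict, gbr: int, index: int) -> int:
    """Return the global pointer for the binary with the given index."""
    table = mem[gbr + 0]               # MOV.Q (GBR, 0), Rt
    return mem[table + index * WORD]   # MOV.Q (Rt, Disp), GBR


# Hypothetical layout: one table per process, one data section per binary.
TABLE = 0x1000
EXE_DATA = 0x8000  # index 0: the EXE
DLL_DATA = 0x9000  # index 1: a DLL, index fixed up by a base-reloc

mem = {
    TABLE + 0 * WORD: EXE_DATA,
    TABLE + 1 * WORD: DLL_DATA,
    EXE_DATA + 0: TABLE,  # slot 0 of every data section points at the table
    DLL_DATA + 0: TABLE,
}

gbr = EXE_DATA
gbr = reload_gbr(mem, gbr, 1)  # entering an exported DLL function
in_dll = gbr
gbr = reload_gbr(mem, gbr, 0)  # entering an EXE function again
back_in_exe = gbr
```

Because every data section holds the table pointer in its first slot, the reload works no matter which binary's GBR is live on entry.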
>
>
Generally, this is needed if:
   Function may be called from outside of the current binary and:
     Accesses global variables;
     And/or, calls local functions.
I just use 32-bit or 64-bit displacement constants. Does not matter
how control arrived at this subroutine, it accesses its data as the
linker resolved addresses--without wasting a register.

GBR or GP is specially designated as a global pointer though.
Not so starved for registers that it would make sense to reclaim it as a GPR.
But, yeah, do need to care how control can arrive at a given function.

>
Though, still generally lower average-case overhead than the strategy
typically used by FDPIC, which would handle this reload process on the
caller side...
   SD    X3, Disp(SP)
   LD    X3, 8(X18)
   LD    X6, 0(X18)
   JALR X1, 0(X6)
   LD    X3, Disp(SP)
This is just:
    CALX    [IP,,#GOT[funct_num]-.]
In the 32-bit linking mode this is a 2 word instruction, in the 64-bit
linking mode it is a 3 word instruction.
----------------

OK.
Neither BJX nor RISC-V have special instructions to deal with FDPIC call semantics.
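
For reference, the five-instruction caller-side sequence quoted above can be modelled as below. This is a sketch only: `FuncDesc`, `call_via_descriptor` and the register dictionary are hypothetical names, and the descriptor layout (entry address at offset 0, global pointer at offset 8) is read off the `LD` offsets in the quoted code.

```python
from typing import Callable, NamedTuple


class FuncDesc(NamedTuple):
    entry: Callable[[dict], int]  # word at offset 0: code address
    gp: int                       # word at offset 8: callee's global pointer


def call_via_descriptor(regs: dict, desc: FuncDesc) -> int:
    saved_gp = regs["gp"]    # SD   X3, Disp(SP)
    regs["gp"] = desc.gp     # LD   X3, 8(X18)
    target = desc.entry      # LD   X6, 0(X18)
    result = target(regs)    # JALR X1, 0(X6)
    regs["gp"] = saved_gp    # LD   X3, Disp(SP)
    return result


def callee(regs: dict) -> int:
    # Reports the global pointer it was entered with.
    return regs["gp"]


regs = {"gp": 0x8000}
seen = call_via_descriptor(regs, FuncDesc(entry=callee, gp=0x9000))
```

The caller pays the save, swap and restore on every indirect call, whereas the callee-side reload is only paid by functions that are externally reachable and touch globals or call local functions.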

>
Though, execl() effectively replaces the current process.
>
IMHO, a "CreateProcess()" style abstraction makes more sense than
fork+exec.
You are 40 years late on that.

I am just doing it the Windows (or Cygwin) way...
Most POSIX style programs still work, but with a slightly higher risk of "stuff may catastrophically explode" (say, if one tries to use "fork()" to fold off copies of the parent process, and then returns from the call-frame that called "fork()").
Fork could be made to clone the global variables, though avoiding tangled addresses could be an issue (could maybe be done by relying on debuginfo or similar, to walk the globals and then redirect any pointers from the old data/bss into the new one; kinda SOL for anything on the heap though).
Better may just be to be like "yeah, fork() doesn't really work, don't use it...".
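
A rough sketch of the "clone the globals and redirect pointers" idea, assuming the data/bss image is a list of 64-bit words and that something like debuginfo supplies the indices of the words that hold pointers. `clone_globals` and all the values below are hypothetical.

```python
WORD = 8  # bytes per 64-bit word


def clone_globals(data, old_base, new_base, pointer_slots):
    """Copy a data/bss image, rebasing pointers that target the old image.

    data:          list of 64-bit words making up the old image
    pointer_slots: word indices known to hold pointers
    """
    size = len(data) * WORD
    clone = list(data)
    for i in pointer_slots:
        p = clone[i]
        if old_base <= p < old_base + size:
            clone[i] = p - old_base + new_base
    return clone


old_base, new_base = 0x8000, 0xA000
data = [
    42,          # plain integer
    0x8010,      # pointer to another global (slot 1)
    0x8010,      # integer that only looks like a pointer
    0x7FFF0000,  # pointer into the heap (slot 3)
]
clone = clone_globals(data, old_base, new_base, pointer_slots=[1, 3])
```

Only slot 1 is rewritten. The look-alike integer is left alone because it is not listed as a pointer, and the heap pointer is copied verbatim; any heap object that points back into the old data/bss is never visited, which is the part that stays unsolved.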

---------------
>
But, invariably, someone will want "compressed" instructions with a
subset of the registers, and one can't just have these only having
access to argument registers.
>
Brian had little trouble using My 66000 ABI which does have contiguous
register groupings.
>
>
But, My 66000 also isn't like, "Hey, how about 16-bit ops with 3 or 4 bit
register numbers".
>
Not sure the thinking behind the RV ABI.
If RISC-V removed its 16-bit instructions, there is room in its ISA
to put my entire ISA along with all the non-compressed RISC-V
instructions.

Yeah, errm, how do you think XG3 came about?...
I just sort of dropped the C instructions and shoved nearly the entirety of XG2 into that space.
There would still have been half the encoding space left, if predication were disallowed.
But, say, RV64G + XG3 (sans predication) + 2/3 of the 'C' extension, would be a bit picky...
Granted, did need to shuffle the bits for the ISAs to be encoding-compatible; and went a little further than the bare minimum to avoid dog chew (gluing them together with entirely mismatched encodings and disjoint register numbering would have been possible; but I wanted at least some semblance of encoding consistency between them).

---------------
>
Prolog needs a call, but epilog can just be a branch, since no need to
return back into the function that is returning.
>
Yes, but this means My 66000 executes 3 fewer transfers of control
per subroutine than you do. And taken branches add latency.
>
>
Granted.
>
Each predicted branch adds 2 cycles.
So, you lose 6 cycles on just under ½ of all subroutine calls,
while also executing 2-5 instructions manipulating your global
pointer.

Possibly, but I don't think it is quite that bad on average...
Would need to run some stats and do some math to try to figure out the percentages and relative impact from each of these.
But, even with all this, and using stack canaries (which add around 6 or so instructions when applicable), it is still outperforming GCC's RV64G output (along with smaller binaries).
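
As a starting point for that math, the figures quoted above (3 extra transfers of control, 2 cycles per predicted branch, just under half of calls affected) combine as follows. The function name is hypothetical, and 0.5 is used as an upper bound for "just under ½"; none of this is measured data.

```python
def expected_extra_cycles(extra_transfers, cycles_per_branch, fraction_of_calls):
    """Average extra cycles per call from additional taken branches."""
    return extra_transfers * cycles_per_branch * fraction_of_calls


per_affected_call = expected_extra_cycles(3, 2, 1.0)  # calls that use the folded prolog/epilog
averaged = expected_extra_cycles(3, 2, 0.5)           # averaged over all calls, upper bound
print(per_affected_call, averaged)
```

The global-pointer instructions would add to this, but the fraction of functions that need the reload is not given in the thread, so it is left out here.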

>
Needs to have a lower limit though, as it is not worth it to use a
call/branch to save/restore 3 or 4 registers...
>
But, say, 20 registers, it is more worthwhile.
>
ENTER saves as few as 1 or as many as 32 and remains that 1 single
instruction. Same for EXIT, and EXIT also performs the RET when LDing
R0.
>
>
Granted.
>
My strategy isn't perfect:
Non-zero branching overheads, when the feature is used;
Per-function load/store slides in prolog/epilog, when not used.
>
Then, the heuristic mostly becomes one of when it is better to use the
inline strategy (load/store slide), or to fold them off and use
calls/branches.
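
A minimal sketch of such a heuristic, assuming the only input is the number of callee-saved registers a function needs to preserve. The cutoff of 8 is an invented placeholder; the post only says that 3 or 4 registers is too few and that 20 is worthwhile.

```python
FOLD_THRESHOLD = 8  # hypothetical cutoff, somewhere between "3 or 4" and "20"


def prolog_strategy(num_saved_regs: int, threshold: int = FOLD_THRESHOLD) -> str:
    """Pick between an inline load/store slide and a folded-off routine."""
    return "folded" if num_saved_regs >= threshold else "inline"


def per_function_instructions(num_saved_regs: int, strategy: str) -> int:
    """Instructions emitted in the function itself for save plus restore."""
    if strategy == "folded":
        return 2  # one call in the prolog, one branch in the epilog
    return 2 * num_saved_regs  # one store and one load per register


for n in (3, 4, 20):
    s = prolog_strategy(n)
    print(n, s, per_function_instructions(n, s))
```

The folded case trades per-function code size for the extra transfers of control, so a real heuristic would likely also weigh how hot the function is.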
My solution gets rid of the dilemma:
a) the call code is always smaller
b) the call code never takes more cycles
In addition, there is a straightforward way to elide the STs of ENTER
when the memory unit is still executing the previous EXIT.

OK.
I was trying to keep the CPU implementation from being too complicated.
In my case though, there is an advantage over plain RV64G:
I have a Load/Store Pair, so need fewer Load/Store operations.
Though, my RV+Jx experiment does also have this...
Though there were also variants defined for RV32 but not for RV64 (because apparently there was indecision about encodings, and some arguments from the "opcode fusion" camp that 64-bit RV processors could fuse groups of LD or SD instructions...).
Decided to leave out complaining about "opcode fusion" as a distraction (from actually addressing ISA issues) and the seeming over-reliance on SpecInt and CoreMark to drive ISA design choices...
Granted, one might say the same about Doom, but at least I am treating Doom more as a representation of a workload, and not the end-goal arbiter of what is added or dropped.

Does technically also work for RISC-V though (though seemingly GCC
always uses inline save/restore, but also the RV ABI has fewer
registers).
