Re: Constant Stack Canaries

Subject : Re: Constant Stack Canaries
From : cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups : comp.arch
Date : 02 Apr 2025, 05:04:52
Organization : A noiseless patient Spider
Message-ID : <vsid3u$sput$1@dont-email.me>
References : 1 2 3 4 5 6 7 8 9 10 11
User-Agent : Mozilla Thunderbird
On 4/1/2025 5:19 PM, Robert Finch wrote:
On 2025-04-01 5:21 p.m., BGB wrote:
On 3/31/2025 11:58 PM, Robert Finch wrote:
On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
>
On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
-------------
Another option being if it could be a feature of a Load/Store Multiple.
>
Say, LDM/STM:
   6b Hi (Upper bound of register to save)
   6b Lo (Lower bound of registers to save)
   1b LR (Flag to save Link Register)
   1b GP (Flag to save Global Pointer)
   1b SK (Flag to generate a canary)
>
Q+3 uses a bitmap of register selection with four more bits selecting overlapping groups. It can work with up to 17 registers.
>
>
OK.
>
If I did LDM/STM style ops, not sure which strategy I would take.
>
The possibility of using a 96-bit encoding with an Imm64 holding a bitmask of all the registers makes some sense...
>
>
>
ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
are implicit.
>
Likely (STM):
   Pushes LR first (if bit set);
   Pushes GP second (if bit set);
   Pushes registers in range (if Hi>=Lo);
   Pushes stack canary (if bit set).
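The proposed push order can be modeled in a few lines (the field names and the high-to-low ordering within the register range are assumptions; the post doesn't pin down the within-range order):

```python
def stm_push_order(hi, lo, save_lr, save_gp, gen_canary):
    """Model the proposed STM semantics: LR, then GP, then the register
    range, then the stack canary (hypothetical; within-range order
    assumed high-to-low)."""
    pushed = []
    if save_lr:
        pushed.append("LR")
    if save_gp:
        pushed.append("GP")
    if hi >= lo:                      # Hi < Lo encodes an empty range
        pushed += [f"R{r}" for r in range(hi, lo - 1, -1)]
    if gen_canary:
        pushed.append("CANARY")
    return pushed

# e.g. save LR, R24..R26, and a canary:
# stm_push_order(26, 24, True, False, True)
#   -> ['LR', 'R26', 'R25', 'R24', 'CANARY']
```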
>
EXIT uses its 3rd flag used when doing longjump() and THROW()
so as to pop the call-stack but not actually RET from the stack
walker.
>
>
OK.
>
I guess one could debate whether an LDM could treat the Load-LR as "Load
LR" or "Load address and Branch", and/or have separate flags (Load LR vs
Load PC, with Load PC meaning to branch).
>
>
Other ABIs may not have as much reason to save/restore the Global
Pointer all the time. But, in my case, it is being used as the primary
way of accessing globals, and each binary image has its own address
range here.
>
I use constants to access globals.
These come in 32-bit and 64-bit flavors.
>
PC-Rel not being used as PC-Rel doesn't allow for multiple process
instances of a given loaded binary within a shared address space.
>
As long as the relative distance is the same, it does.
>
Vs, say, for PIE ELF binaries where it is needed to load a new copy for
each process instance because of this (well, excluding an FDPIC style
ABI, but seemingly still no one seems to have bothered adding FDPIC
support in GCC or friends for RV64 based targets, ...).
>
Well, granted, because Linux and similar tend to load every new process
into its own address space and/or use CoW.
>
CoW and execl()
>
--------------
Other ISAs use a flag bit for each register, but this is less viable
with an ISA with a larger number of registers, well, unless one uses a
64 or 96 bit LDM/STM encoding (possible). Merit though would be not
needing multiple LDM's / STM's to deal with a discontinuous register
range.
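For the flag-bit-per-register variant, decoding the save list from a 64-bit mask is trivial and handles discontinuous ranges for free (register naming is assumed):

```python
def regs_from_mask(mask):
    """Expand a 64-bit register mask into the list of registers
    to save; set bits need not be contiguous."""
    return [f"R{i}" for i in range(64) if (mask >> i) & 1]

# discontinuous set R4, R5, R30:
# regs_from_mask((1 << 4) | (1 << 5) | (1 << 30))  -> ['R4', 'R5', 'R30']
```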
>
To quote Trevor Smith:: "Why would anyone want to do that" ??
>
>
Discontinuous register ranges:
Because pretty much no ABI's put all of the callee save registers in a
contiguous range.
>
Granted, I guess if someone were designing an ISA and ABI clean, they
could make all of the argument registers and callee save registers
contiguous.
>
Say:
   R0..R3: Special
   R4..R15: Scratch
   R16..R31: Argument
   R32..R63: Callee Save
....
>
But, invariably, someone will want "compressed" instructions with a
subset of the registers, and one can't just have these only having
access to argument registers.
>
Brian had little trouble using My 66000 ABI which does have contiguous
register groupings.
>
Well, also excluding the possibility where the LDM/STM is essentially
just a function call (say, if more than a certain number of registers
are to be saved/restored, the compiler generates a call to a
save/restore sequence, which is itself generated as-needed). Granted,
this is basically the strategy used by BGBCC. If multiple functions
happen to save/restore the same combination of registers, they get to
reuse the prior function's save/restore sequence (generally folded off
to before the function in question).
>
Calling a subroutine to perform epilogues adds to the number of
branches a program executes. Having an instruction like EXIT means
that when you know you need to exit, you EXIT; you don't branch to the
exit point. Saving instructions.
>
>
Prolog needs a call, but epilog can just be a branch, since no need to
return back into the function that is returning.
>
Yes, but this means My 66000 executes 3 fewer transfers of control
per subroutine than you do. And taken branches add latency.
>
Needs to have a lower limit though, as it is not worth it to use a
call/branch to save/restore 3 or 4 registers...
>
But, say, 20 registers, it is more worthwhile.
>
ENTER saves as few as 1 or as many as 32 and remains that 1 single
instruction. Same for EXIT, and EXIT also performs the RET when
loading R0.
>
>
Granted, the folding strategy can still do canary values, but doing so
in the reused portions would limit the range of unique canary values
(well, unless the canary magic is XOR'ed with SP or something...).
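The XOR-with-SP idea can be sketched as follows (the constant and the 64-bit width are hypothetical); each frame then checks a value unique to its own stack position even though the magic constant is shared by all the folded save/restore sequences:

```python
CANARY_MAGIC = 0x5A5AC0DEDEADBEEF   # hypothetical per-boot constant

def push_canary(sp):
    # the value stored in the frame is tied to this frame's SP
    return (CANARY_MAGIC ^ sp) & 0xFFFFFFFFFFFFFFFF

def canary_ok(sp, stored):
    # epilog check: recompute from the current SP and compare
    return stored == push_canary(sp)
```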
>
Canary values are in addition to ENTER and EXIT not part of them
IMHO.
>
In Q+3 there are push and pop multiple instructions. I did not want to add load and store multiple on top of that. They work great for ISRs, but not so great for task switching code. I have the instructions pushing or popping up to 17 registers in a group. Groups of registers overlap by eight. The instructions can handle all 96 registers in the machine. ENTER and EXIT are also present.
>
It is looking like the context switch code for the OS will take about 3000 clock cycles to run. Not wanting to disable interrupts for that long, I put a spinlock on the system’s task control block array.

But I think I have run into an issue. It is the timer ISR that switches tasks. Since it is an ISR, it pushes a subset of registers that it uses and restores them at exit. But when exiting and switching tasks it spinlocks on the task control block array. I am not sure this is a good thing, as the timer IRQ is fairly high priority. If something else locked the TCB array it would deadlock.

I guess the context switching could be deferred until the app requests some other operating system function. But then the issue is what if the app gets stuck in an infinite loop, not calling the OS? I suppose I could make an OS heartbeat function call a requirement of apps. If the app does not do a heartbeat within a reasonable time, it could be terminated.
>
Q+3 progresses rapidly. A lot of the stuff in earlier versions was removed. The pared down version is a 32-bit machine. Expecting some headaches because of the use of condition registers and branch registers.
>
>
OK.
>
Ironically, I seem to have comparably low task-switch cost...
However, each system call is essentially 2 task switches, and it is still slow enough to negatively affect performance if they happen at all frequently.
>
System calls for Q+ are slightly faster (but not much) than task switches. I just have the system saving state on the stack. I don't bother saving the FP registers or some of the other system registers that the OS controls. So, it is a little bit shorter than the task switch code.
 The only thing that can do a task switch in the system is the time-slicer.
 
In my case, task switch happens by capturing and restoring all of the registers (of which there are 64 main registers, and a few CR's).
No separate FPU or vector registers (the BJX GPR space, and RISC-V X+F spaces, being mostly equivalent).
The interrupt handlers only have access to physical addresses, and will block all other interrupts when running. So, there is a need to get quickly from the user-program task to the syscall handler task, and then back again once done (though maybe not immediately, as it may instead send the results back to the caller task, and then transfer control to a different task).
Timer interrupt can do scheduling, but mostly avoids doing so unless there is no other option (TestKern being mostly lacking in mutexes, which makes timer-driven preemptive multitasking a bit risky). However, usually programs will use system calls often enough that it is possible to schedule tasks this way, and generally a system call will not be made inside of a critical section.

So, say, one needs to try to minimize the number of unnecessary system calls (say, don't implement "fputs()" by sending 1 byte at a time, ...).
>
>
>
>
>
Unlike on a modern PC, one generally needs to care more about efficiency.
>
Hence, all the fiddling with low bit-depth graphics formats, and things like my recent fiddling with 2-bit ADPCM audio.
>
And, online, one is (if anything) more likely to find people complaining about how old/obsolescent ADPCM is (and/or arguing that people should store all their sound effects as Ogg/Vorbis or similar; ...).
>
>
I'm not much one for music, although I play tunes occasionally. I'm a little hard of hearing.
Not so much for music here, but more for storing sound-effects.
I can note I seem to have a form of reverse-slope hearing impairment...
Not a new thing, I have either always been this way, or it has happened very slowly.
I can seemingly hear most stuff OK though.
   Except, IRL, I can't hear tuning forks.
     Nor car engines.
       I don't hear the engines.
       I do hear the tires rolling on the ground.
     Nor refrigerators (mostly).
       I sometimes hear the relays when they start/stop,
         or a crackling sound from the radiator coil.
   Using phones sucks hard, can't hear crap...
   Not terribly musically inclined.
     But, instruments don't sound much different from "noise" sounds.
My ability to hear low-frequencies is a bit weird:
   Square or triangle waves, I hear these well;
   Sine waves, weakly, but I hear them in headphones if volume is high.
     If the volume isn't very high, sine waves become silent.
     Seemingly, these are harder to hear IRL.
I seem most sensitive to frequencies between around 2 to 8 kHz. Upper end of hearing seems to be around 17 kHz (lower end around 1 kHz for pure sine waves). The "absolute" lower limit seems to be around 8 Hz, but more because at this point a square wave turns from a "tone" into a series of discrete pops, 8-20 Hz being sort of a meta-range between being tonal and discrete pops.
Have noted that in YouTube videos where someone is messing with a CRT TV, I can still sometimes hear the squeal, particularly if the camera is close to the TV. Not seen a CRT IRL in a while though; no obvious sound from a VGA CRT monitor (but, then again, I am using it ATM on an old rack server, which sounds kinda like a vacuum cleaner, so that might be masking any noise it makes).
Have noted that I still understand speech fine with a 2-8 kHz bandpass (with steep fall-off). I don't understand speech at all with a 2kHz low-pass. So, whichever parts I use for intelligibility seem to be between 2 and 8kHz. Had noted if I split it into 2-4 or 4-8 kHz bands, either works, though individually each has a notably worse quality than combined 2-8 kHz.
The 1-2 kHz range can be heard, but doesn't seem to contain much as far as intelligibility goes, but its presence or absence does seem to alter vowel sounds slightly.
A 1-8 kHz bandpass sounds mostly natural to me. Though, cats seem to respond unfavorably to band-passed audio (if cats are neutral to the original, but tense up and dig in their claws if I play band-passed audio, it seems they hear a difference).
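The band-limiting experiments above can be reproduced with a brick-wall bandpass; a naive DFT-based sketch (the O(n^2) DFT is only for illustration, and a real experiment would use a proper filter design):

```python
import cmath

def dft(x, inverse=False):
    """Naive DFT/IDFT (O(n^2)), sufficient for short illustrative signals."""
    n = len(x)
    s = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(s * 2j * cmath.pi * k * t / n)
               for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def bandpass(samples, fs, lo_hz, hi_hz):
    """Zero every DFT bin outside [lo_hz, hi_hz] (brick-wall bandpass)."""
    n = len(samples)
    spec = dft(samples)
    for k in range(n):
        f = min(k, n - k) * fs / n          # bin frequency (folded)
        if not (lo_hz <= f <= hi_hz):
            spec[k] = 0
    return [v.real for v in dft(spec, inverse=True)]
```

For example, a 2-8 kHz bandpass on a 16 kHz-sampled signal removes a 1 kHz component while leaving a 4 kHz component intact.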
Although, I was using music as test-cases mostly as they can give a better idea of the relative audio quality than a short sound effect.
But, for things that are going to be embedded into an EXE or DLL, generally these are ideally kept at a few kB or less.
For long form audio, there is more reason to care about audio quality, but for something like a notification ding, not as much. Do preferably want it to not sound like "broken crap" though. And, if any speech is present, ideally it needs to be intelligible.
In terms of being small and "not sounding like crap":
   ADPCM:
     Works well enough, but can't go below 2 bits per sample.
   Delta-Sigma:
     1 bit per sample, but sounds horrid much under 64 kHz.
MP3 and Vorbis work well at 96 to 128 kbps, but:
   Are complex and expensive formats to decode;
   Don't give acceptable results much below around 40 kbps.
At lower bitrates, the artifacts from MP3 and Vorbis can become rather obnoxious (lots of squealing and whistling and sounds like broken glass being shaken in a steel can).
I actually much prefer the sound of ADPCM for low bitrates. Muffled and gritty is still preferable to "rattling a steel can full of broken glass" (simple loss of quality rather than the addition of other more obnoxious artifacts).
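As a concrete reference point, a toy 2-bit ADPCM along these lines can be sketched as follows. The step adaptation rule and the greedy encoder are assumptions (the post's actual encoder brute-forces 3 samples ahead rather than choosing greedily):

```python
def _adapt(step, big):
    # grow the step on large codes, shrink it on small ones (assumed rule)
    return min(16384, step * 3 // 2) if big else max(4, step * 2 // 3)

def adpcm2_decode(codes, step=64):
    """Decode 2-bit codes: bit1 = sign, bit0 = small/large magnitude."""
    pred, out = 0, []
    for c in codes:
        sign = -1 if c & 2 else 1
        big = c & 1
        delta = step if big else step // 2
        pred = max(-32768, min(32767, pred + sign * delta))
        step = _adapt(step, big)
        out.append(pred)
    return out

def adpcm2_encode(samples, step=64):
    """Greedy encoder: pick the code whose decoded value lands closest."""
    pred, codes = 0, []
    for s in samples:
        best = min(range(4), key=lambda c: abs(
            s - max(-32768, min(32767,
                pred + (-1 if c & 2 else 1) * (step if c & 1 else step // 2)))))
        sign = -1 if best & 2 else 1
        big = best & 1
        pred = max(-32768, min(32767, pred + sign * (step if big else step // 2)))
        step = _adapt(step, big)
        codes.append(best)
    return codes
```

The "gritty" character comes from the predictor oscillating around the target by roughly one step; the adaptation keeps that step small in steady passages.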
From what I gather, the telephone network used 8kHz as a standard sampling rate, with one of several formats:
   u-Law, in the US
   A-Law, in Europe
   4-bit ADPCM, for lower-priority long-distance links;
     When not using u-Law or A-Law.
   2-bit ADPCM, for "overflow" links (*).
*: Apparently, if there were too many long distance calls over a given long-distance link, they would drop to a 2-bit ADPCM (running at 16 kbps).
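For reference, G.711 A-law companding (in its continuous form, with A = 87.6) maps a linear sample in [-1, 1] onto a roughly logarithmic scale before 8-bit quantization; a sketch:

```python
import math

A = 87.6  # standard A-law compression parameter

def alaw_compress(x):
    """Continuous A-law: linear near zero, logarithmic above 1/A."""
    ax = abs(x)
    if ax < 1 / A:
        y = A * ax / (1 + math.log(A))
    else:
        y = (1 + math.log(A * ax)) / (1 + math.log(A))
    return math.copysign(y, x)

def alaw_quantize(x):
    """8-bit companded sample (sign + 7-bit magnitude)."""
    return round(alaw_compress(x) * 127)
```

The companding boosts small amplitudes, which is why 8-bit A-law keeps quiet speech audible where 8-bit linear PCM would bury it in quantization noise.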
I was testing with 16kHz 2-bit ADPCM, as while both 16kHz 2-bit and 8kHz 4-bit ADPCM are both 32 kbps, the 16kHz sounds better to me (and intelligibility is higher).
Though, if spoken language is not used, it makes sense to drop to 8kHz.
   Using 8kHz as standard is weak as intelligibility is a lot worse.
But, I guess the thinking was "minimum where you can still 'mostly' hear what they are saying...".
Even if 8kHz was standard on the telephone network, I can't easily understand what anyone is saying over the phone (speech is often very muffled and there is often a loud/obnoxious hiss).
Actually, weirdly, actual phone quality is somehow *worse* than my experiments with low bitrate ADPCM. Like, the low-bit depth ADPCM mostly just sounds "gritty" (without any obvious hiss). Like, the phone adds extra levels of badness beyond just any compression issues (probably also crappy microphones and speakers, etc, as well).
Using headphones with a phone is "slightly" better, but there is often still a rather loud/annoying hiss, even when the sound is coming from an artificial source.
Poor quality ADPCM, by itself, does not have this particular issue (actually, it almost seems as if the ADPCM somehow "enhances" the audio and compensates slightly for the low sample rate, making details easier to hear compared with "cleaner" PCM audio versions).
For sound-effects, could drop to 4kHz, but there is fairly significant distortion. Like, if you have a notification ding, it doesn't really sound like a bell anymore.
So, say (ADPCM modes):
   16kHz 4-bit: Mostly Good, but needs 64 kbps.
   16kHz 2-bit: Slightly muffled, gritty, 32 kbps.
   8kHz 4-bit: More obvious muffling (but not gritty);
   8kHz 2-bit: Muffled and gritty (16 kbps);
   4kHz 4-bit: Serious muffle / distortion, 16 kbps.
   4kHz 2-bit: Muffle + distortion + grit, 8 kbps.
Possible merit of 4kHz 2-bit is that it allows putting a bell sound effect in around 500 bytes. Downside is that it is no longer particularly recognizable as a bell (and goes more from "ding" to "plong").
At 4 kHz, speech is basically almost entirely unintelligible, but one can still hear that speech is present (its "shape" can still be heard, but words are not recognizable; sort of like the muffling when people are talking in a different room, where one can still hear that they are saying "something").
At 2 kHz; it is barely recognizable as being speech (it sounds almost more like wind). Percussive sounds are still recognizable though (so, music is turned into "howling wind with drums").
Early 90s games (such as Doom) mostly used 11 kHz as standard.
   IMHO, 16kHz is a better quality/space tradeoff.
   Where, 22/32/44 can sound better, but may not be worth the overhead.
   Sample rates above 44 kHz are overkill though.
I, personally, can't hear the difference between 44 and 48 kHz audio.
   I suspect anything 48kHz and beyond is likely needless overkill.

>
Then again, I did note that I may need to find some other "quality metric" for audio, as RMSE isn't really working...
>
At least going by RMSE, the "best" option would be to use 8-bit PCM and then downsample it.
>
Say, 4kHz 8-bit PCM has a lower RMSE score than 2-bit ADPCM, but subjectively the 2-bit ADPCM sounds significantly better.
>
Say: for 16kHz, and a test file (using a song here):
   PCM8, 16kHz     : 121 (128 kbps)
   A-Law, 16kHz    : 284 (128 kbps)
   IMA 4bit, 16kHz : 617 (64 kbps)
   IMA 2bit, 16kHz : 1692 (32 kbps, *)
   ADLQ 2bit, 16kHz: 2000 (32 kbps)
   PCM8, 4kHz      : 242  (32 kbps)
>
However, 4kHz PCM8 sounds terrible vs either 2-bit IMA or ADLQ.
   Basically sounds muffled, speech is unintelligible.
   But, it would be the "best" option if going solely by RMSE.
>
Also A-Law sounds better than PCM8 (at the same sample rate).
   Even with the higher RMSE score.
>
Seems like it could be possible to do RMSE on A-Law samples as a metric, but if anything this is just kicking the can down the road slightly.
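The "RMSE on A-Law samples" idea amounts to companding both signals before differencing, so an equal-sized linear error costs more in quiet passages than in loud ones, a bit closer to how level errors are perceived. A self-contained sketch:

```python
import math

def _alaw(x, A=87.6):
    # continuous G.711 A-law compression of a sample in [-1, 1]
    ax = abs(x)
    y = (A * ax / (1 + math.log(A)) if ax < 1 / A
         else (1 + math.log(A * ax)) / (1 + math.log(A)))
    return math.copysign(y, x)

def rmse(ref, test):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref))

def alaw_rmse(ref, test):
    """RMSE computed in the companded (A-law) domain."""
    return rmse([_alaw(v) for v in ref], [_alaw(v) for v in test])
```

Because the A-law curve is steep near zero, the same 0.005 error is penalized far more on a quiet signal than on a loud one, which plain RMSE treats identically.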
>
Granted, A-Law sounds better than 4-bit IMA, and 4-bit IMA sounds better than the 2-bit ADPCM's at least...
>
>
*: Previously it was worse, around 4500, but the RMSE score dropped after switching it to using a similar encoder strategy to ADLQ, namely doing a brute-force search over the next 3 samples to find the values that best approximate the target samples.
>
Though, which is "better", or whether or not even lower RMSE "improves" quality here, is debatable (the PCM8 numbers clearly throw using RMSE as a quality metric into question for this case).
>
Ideally I would want some metric that better reflects hearing perception and is computationally cheap.
>
...
>
  
