On 9/12/2024 4:52 PM, MitchAlsup1 wrote:
> On Thu, 12 Sep 2024 21:14:18 +0000, BGB wrote:
>
>> This is because in some cases, the performance overhead of copying the
>> last (sz&31) bytes is significant, say:
>>   rsz=cte-ct;
>>   if(rsz)
>>   {
>>     if(rsz&16)
>>     {
>>       v0=((u64 *)cs)[0]; v1=((u64 *)cs)[1];
>>       ((u64 *)ct)[0]=v0; ((u64 *)ct)[1]=v1;
>>       cs+=16; ct+=16;
>>     }
>>     if(rsz&8)
>>     {
>>       v0=((u64 *)cs)[0];
>>       ((u64 *)ct)[0]=v0;
>>       cs+=8; ct+=8;
>>     }
>>     if(rsz&4)
>>     {
>>       v0=((u32 *)cs)[0];
>>       ((u32 *)ct)[0]=v0;
>>       cs+=4; ct+=4;
>>     }
>>     if(rsz&2)
>>     {
>>       v0=((u16 *)cs)[0];
>>       ((u16 *)ct)[0]=v0;
>>       cs+=2; ct+=2;
>>     }
>>     if(rsz&1)
>>     {
>>       v0=((byte *)cs)[0];
>>       ((byte *)ct)[0]=v0;
>>       cs++; ct++;
>>     }
>>   }
>
>> For small copies with awkward sizes, this tailing part can cost more
>> than the whole rest of the copy.
>
> A fine rendition of why this should be in HW as an instruction.
I guess it could potentially be converted into a bit-select, but the harder issue is efficiently generating the selection masks.
Say:
MOV.Q (R4, 0), R16
MOV.Q (R4, 8), R17
MOV.Q (R4, 16), R18
MOV.Q (R4, 24), R19
MOV.Q (R5, 0), R20
MOV.Q (R5, 8), R21
MOV.Q (R5, 16), R22
MOV.Q (R5, 24), R23
BITSEL R20, R36, R16 //R16=(R20&R36)|(R16&(~R36))
BITSEL R21, R37, R17
BITSEL R22, R38, R18
BITSEL R23, R39, R19
MOV.Q R16, (R4, 0)
MOV.Q R17, (R4, 8)
MOV.Q R18, (R4, 16)
MOV.Q R19, (R4, 24)
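For reference, a rough C equivalent of the sequence above. This is only a sketch: `bitsel` and `tail_copy_bitsel` are names made up here, and the masks are assumed to hold 0xFF in every byte lane that should be taken from the source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit-select: bits of a where mask is 1, bits of b where mask is 0.
   Mirrors the BITSEL comment: Rn=(Rm&mask)|(Rn&(~mask)). */
static uint64_t bitsel(uint64_t a, uint64_t mask, uint64_t b)
{
    return (a & mask) | (b & ~mask);
}

/* Merge selected bytes of a 32-byte source block into a 32-byte
   destination block. Hypothetical helper, not from the original post. */
static void tail_copy_bitsel(unsigned char *dst, const unsigned char *src,
                             const uint64_t mask[4])
{
    uint64_t d[4], s[4];
    int i;

    assert(dst != NULL && src != NULL);

    memcpy(d, dst, 32);   /* load old destination contents */
    memcpy(s, src, 32);   /* load source contents */
    for (i = 0; i < 4; i++)
        d[i] = bitsel(s[i], mask[i], d[i]);
    memcpy(dst, d, 32);   /* store the merged block back */
}
```

Note that this reads and rewrites a full 32 bytes at the destination, and reads a full 32 bytes from the source, so both blocks need to be accessible for all 32 bytes even when only a few bytes actually change.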
Maybe, to generate the masks:
AND R7, 31, R6
MOV 1, R36 | SHAD R6, 3, R7
SHLD.Q R36, R7, R36
ADD R36, -1, R36
MOV R36, R37 | MOV R36, R38
MOV R36, R39
CMPGT 7, R6
MOV?T -1, R36
CMPGT 15, R6
MOV?T -1, R37
CMPGT 23, R6
MOV?T -1, R38
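In C terms, the masks this is aiming for would be roughly as follows. This is a sketch under assumptions: `tail_masks` is a made-up name, n is the tail length (sz&31), and lane order is little-endian, so mask[0] covers bytes 0..7. Each word below the partial one is all ones, the partial word gets its low 8*(n&7) bits set, and the words above it are zero.

```c
#include <assert.h>
#include <stdint.h>

/* Build four 64-bit byte-lane masks covering the low n bytes (n in 0..31)
   of a 32-byte block. Hypothetical helper, little-endian lane order. */
static void tail_masks(unsigned n, uint64_t mask[4])
{
    unsigned i;

    assert(n < 32);

    for (i = 0; i < 4; i++) {
        unsigned lo = 8u * i;               /* first byte covered by word i */

        if (n >= lo + 8)
            mask[i] = ~(uint64_t)0;         /* whole word is copied */
        else if (n <= lo)
            mask[i] = 0;                    /* nothing in this word is copied */
        else                                /* partial word: 1..7 low bytes */
            mask[i] = ((uint64_t)1 << (8 * (n - lo))) - 1;
    }
}
```

The partial-word case only ever shifts by 8..56 bits, so the shift never reaches the full register width.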
Or (using Morton shuffling to double bits):
AND R7, 31, R6
MOV 1, R7
SHLD.Q R7, R6, R7
ADD R7, -1, R16
MOVLLD R16, R16, R36 | SHLD R16, -8, R17
MOVLLD R17, R17, R37 | SHLD R17, -8, R18
MOVLLD R18, R18, R38 | SHLD R18, -8, R19
MOVLLD R19, R19, R39 | PMORT.Q R36, R36
MOVLLD R36, R36, R36 | PMORT.Q R37, R37
MOVLLD R37, R37, R37 | PMORT.Q R38, R38
MOVLLD R38, R38, R38 | PMORT.Q R39, R39
MOVLLD R39, R39, R39 | PMORT.Q R36, R36
MOVLLD R36, R36, R36 | PMORT.Q R37, R37
MOVLLD R37, R37, R37 | PMORT.Q R38, R38
MOVLLD R38, R38, R38 | PMORT.Q R39, R39
MOVLLD R39, R39, R39 | PMORT.Q R36, R36
PMORT.Q R37, R37
PMORT.Q R38, R38
PMORT.Q R39, R39
(But, this seems worse...).
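The idea behind the Morton variant, sketched in C: start from a one-bit-per-byte mask, then double every bit three times so that each bit ends up filling a whole byte. This assumes that Morton-shuffling a value with itself duplicates each bit into two adjacent positions; the function names are made up for illustration, and the bit-doubling is done here with ordinary shifts and masks rather than a dedicated instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Double every bit of a 32-bit value: bit i lands in bits 2i and 2i+1.
   Same result as Morton-interleaving x with itself. */
static uint64_t double_bits(uint32_t x32)
{
    uint64_t x = x32;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FFULL;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FULL;
    x = (x | (x << 2))  & 0x3333333333333333ULL;
    x = (x | (x << 1))  & 0x5555555555555555ULL;
    return x | (x << 1);
}

/* Expand 8 bits into 8 bytes: three doublings give 1 -> 2 -> 4 -> 8 copies. */
static uint64_t bits_to_bytes(uint8_t b)
{
    uint64_t v = b;
    int r;
    for (r = 0; r < 3; r++)
        v = double_bits((uint32_t)v);   /* only the low 32 bits carry data */
    return v;
}

/* Byte-lane masks for the low n bytes (n in 0..31), little-endian order. */
static void tail_masks_morton(unsigned n, uint64_t mask[4])
{
    uint32_t bits;
    unsigned i;

    assert(n < 32);

    bits = ((uint32_t)1 << n) - 1;      /* one bit per byte to copy */
    for (i = 0; i < 4; i++)
        mask[i] = bits_to_bytes((uint8_t)(bits >> (8 * i)));
}
```

This needs no compares or predication, but costs three doubling steps per mask word, which lines up with it looking worse than the compare-based version.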
Meanwhile, still trying to figure out why the virtual memory system has decided to crap itself...
It appears as if the link register may get corrupted on TLB Miss for some reason (and is otherwise giving behavior very similar to an unresolved bug previously seen in the VL core; but this time is also happening in the emulator...).
It appears as if TLB misses are in effect leading to impossible control-flow (function returns going to places the LR would have previously landed, but which are not valid for the function in question; almost like it is landing on stale link-register values). But, the stack otherwise seems to be intact (as checked by re-enabling global stack canary checks...).
ISR handler prolog/epilog looks correct, not sure what could be messing up LR (well, and clock-edge timing issues are not a thing for an emulator written in C).
But, it is a little bit of a mystery at the moment...
...