Re: "Mini" tags to reduce the number of op codes

Subject : Re: "Mini" tags to reduce the number of op codes
From : cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups : comp.arch
Date : 11. Apr 2024, 04:14:33
Organization : A noiseless patient Spider
Message-ID : <uv7kit$1fc2u$1@dont-email.me>
References : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
User-Agent : Mozilla Thunderbird
On 4/10/2024 4:19 PM, MitchAlsup1 wrote:
BGB-Alt wrote:
 
On 4/10/2024 12:12 PM, MitchAlsup1 wrote:
BGB wrote:
>
On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
BGB-Alt wrote:
>
>
Also the blob of constants needed to be within 512 bytes of the load instruction, which was also kind of an evil mess for branch handling (and extra bad if one needed to spill the constants in the middle of a basic block and then branch over it).
>
In My 66000 case, the constant is the word following the instruction.
Easy to find, easy to access, no register pollution, no DCache pollution.
>
 
Yeah.
 
This was why some of the first things I did when I started extending SH-4 were:
Adding mechanisms to build constants inline;
Adding Load/Store ops with a displacement (albeit with encodings borrowed from SH-2A);
Adding 3R and 3RI encodings (originally Imm8 for 3RI).
 My suggestion is that:: "Now that you have screwed around for a while,
Why not take that experience and do a new ISA without any of those
mistakes in it" ??
 
There was a reboot; it became BJX2.
This, of course, has developed some of its own hair...
Where BJX1 was a modified SuperH, BJX2 was a redesigned ISA that was "mostly backwards compatible" at the ASM level.
Granted, possibly I could have gone further, such as no longer having the stack pointer in R15, but alas...
Though, in some areas, SH had features that I had dropped as well, such as auto-increment addressing and delay slots.

Did have a mess when I later extended the ISA to 32 GPRs, as (like with BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.
 
Usually they were spilled between basic-blocks, with the basic-block needing to branch to the following basic-block in these cases.
>
Also 8-bit branch displacements are kinda lame, ...
>
Why do that to yourself ??
>
 
I didn't design SuperH, Hitachi did...
But you did not fix them en masse, and you complain about them
at least once a week. There comes a time when it takes less time
and less courage to do that big switch and clean up all that mess.
 
For the most part, BJX2 is using 20-bit branches for 32-bit ops.
The exceptions are the Compare-and-Branch and Compare-Zero-and-Branch ops, mostly because there wasn't enough encoding space to give them larger displacements:
BREQ.Q  Rn, Disp11s
BREQ.Q  Rm, Rn, Disp8s
There are Disp32s variants available, just that these involve using a Jumbo prefix.

 
But, with BJX1, I had added Disp16 branches.
 
With BJX2, they were replaced with 20-bit branches. These have the merit of being able to branch anywhere within a Doom- or Quake-sized binary.
 
And, if one wanted a 16-bit branch:
   MOV.W (PC, 4), R0  //load a 16-bit branch displacement
   BRA/F R0
   .L0:
   NOP    // delay slot
   .WORD $(Label - .L0)
>
Also kinda bad...
>
Can you say Yech !!
>
 
Yeah.
This sort of stuff created strong incentive for ISA redesign...
 Maybe consider now as the appropriate time to start.
 
The above was for SuperH; this sort of thing is N/A for BJX2.
In this case, BJX2 can pull it off in a single instruction.
Nonetheless, even with all this crap, the SuperH was still seen as sufficient for the Sega 32X/Saturn/Dreamcast (and the Naomi and Hikaru arcade machine boards, ...).
Though, it seems Sega later jumped ship from SuperH to low-end x86 PC motherboards in later arcade machines.

Granted, had I started with RISC-V instead of SuperH, it is probable that BJX2 wouldn't exist.
 
Though, at the time, the original thinking was that SuperH having smaller instructions meant it would have better code density than RV32I or similar. Turns out not really, as the penalty of the 16-bit ops was needing almost twice as many on average.
 My 66000 only requires 70% the instruction count of RISC-V,
Yours could too ................
 
At this point, I suspect the main issue for me not (entirely) beating RV64G is mostly compiler issues...
So, the ".text" section is still around 10% bigger, with some amount of this being spent on Jumbo prefixes, and the rest due to cases where code generation falls short.

Things like memcpy/memmove/memset/etc. are function calls when not directly transformed into register load/store sequences.
>
My 66000 does not convert them into LD-ST sequences; MM is a single instruction.
>
>
I have no high-level memory move/copy/set instructions.
Only loads/stores...
>
You have the power to fix it.........
>
 
But, at what cost...
 You would not have to spend hours a week defending the indefensible !!
 
I had generally avoided anything that would have required microcode or shoving state-machines into the pipeline or similar.
 Things as simple as IDIV and FDIV require sequencers.
But LDM, STM, MM require sequencers simpler than IDIV and FDIV !!
 
Not so much in my case.
IDIV and FDIV:
   Feed inputs into Shift-Add unit;
   Stall pipeline for a predefined number of clock cycles;
   Grab result out of the other end (at which point, pipeline resumes).
In this case, the FDIV was based on noting that if one lets the Shift-Add unit run for longer, it moves from doing an integer divide to doing a fractional divide. So I could make it perform an FDIV merely by feeding the mantissas into it (as two big integers) and doubling the latency, then glue on some extra logic to figure out the exponents and pack/unpack Binary64, and, done.
Not really the same thing at all...
Apart from it tending to get stomped every time one does an integer divide, one could possibly also use it as an RNG, as it basically churns over whatever random bits flow into it from the pipeline.
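
Roughly, in C terms (a minimal sketch of the shift-subtract idea, not the actual hardware; the function name and fixed-point framing are just for illustration):

  #include <stdint.h>

  /* Restoring shift-subtract divide: the same loop that produces an integer
     quotient keeps producing fractional quotient bits if run for more
     iterations, which is all an FDIV on the mantissas needs.
     num is treated as an int_bits-wide value. */
  uint64_t shift_sub_divide(uint64_t num, uint64_t den,
                            int int_bits, int frac_bits)
  {
      uint64_t rem = 0, quo = 0;
      for (int i = 0; i < int_bits + frac_bits; i++) {
          /* shift in the next numerator bit (zeros once past the integer part) */
          rem = (rem << 1) |
                ((i < int_bits) ? ((num >> (int_bits - 1 - i)) & 1) : 0);
          quo <<= 1;
          if (rem >= den) {      /* trial subtract; keep it if it fits */
              rem -= den;
              quo |= 1;
          }
      }
      /* frac_bits == 0: plain integer divide.
         frac_bits  > 0: fixed-point quotient, (num/den) << frac_bits,
         e.g. feed two 53-bit mantissas and ask for ~54 fraction bits. */
      return quo;
  }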

Things like Load/Store-Multiple or
 If you like polluted ICaches..............
 
For small copies, can encode them inline, but past a certain size this becomes too bulky.
>
A copy loop makes more sense for bigger copies, but has a high overhead for small to medium copy.
>
>
So, there is a size range where doing it inline would be too bulky, but a loop carries an undesirable level of overhead.
>
All the more reason to put it (a highly useful unit of work) into an
instruction.
>
 
This is an area where "slides" work well; the main cost is mostly the bulk that the slide adds to the binary (albeit, it is one-off).
 Consider that the predictor getting into the slide the first time
always mispredicts !!
 
Possibly.
But, note that the paths headed into the slide are things like structure assignment and "memcpy()" where the size is constant. So, in these cases, the compiler already knows where it is branching.
So, say:
   memcpy(dst, src, 512);
Gets compiled as, effectively:
   MOV dst, R4
   MOV src, R5
   BSR __memcpy64_512_ua

Which is why it is a 512B memcpy slide vs, say, a 4kB memcpy slide...
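
For illustration, a rough C analogy of what the slide looks like (the real thing is hand-written assembly with a labeled entry point every 32 bytes; the name copy_slide_128 is made up, the Duff's-device-style switch stands in for the compiler branching straight to the right entry point, and it is shown at 128 bytes rather than 512 to keep it short):

  #include <stdint.h>
  #include <stddef.h>

  /* Hypothetical sketch: one long unrolled copy, entered part-way through
     and falling through to the end, copying from the high address downward.
     len must already be a multiple of 32 here; the tail is handled before
     branching in. */
  static void copy_slide_128(uint64_t *dst, const uint64_t *src, size_t len)
  {
      switch (len >> 5) {               /* which 32-byte entry point */
      case 4: dst[15]=src[15]; dst[14]=src[14]; dst[13]=src[13]; dst[12]=src[12];
          /* fall through */
      case 3: dst[11]=src[11]; dst[10]=src[10]; dst[ 9]=src[ 9]; dst[ 8]=src[ 8];
          /* fall through */
      case 2: dst[ 7]=src[ 7]; dst[ 6]=src[ 6]; dst[ 5]=src[ 5]; dst[ 4]=src[ 4];
          /* fall through */
      case 1: dst[ 3]=src[ 3]; dst[ 2]=src[ 2]; dst[ 1]=src[ 1]; dst[ 0]=src[ 0];
      }
  }

A constant 128-byte copy branches to the outermost entry point; smaller multiples of 32 enter further in and just fall through to the end.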
 What if you only wanted to copy 63 bytes ?? Your DW slide fails miserably,
yet a HW sequencer only has to avoid asserting a single byte write enable
once.
 
Two strategies:
Compiler pads it to 64 bytes (typical for struct copy, where structs can always be padded up to their natural alignment);
It emits the code for copying the last N bytes (modulo 32) and then branches into the slide (typical for memcpy).
For variable memcpy, there is an extension:
   _memcpyf(void *dst, void *src, size_t len);
Which is basically the "I don't care if it copies a little extra" version (say, where it may pad the copy up to a multiple of 16 bytes).
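
A minimal sketch of that contract (reference behavior only; _memcpyf_ref is a made-up name, and the real version is the slide/loop machinery rather than a call to memcpy):

  #include <stddef.h>
  #include <string.h>

  /* _memcpyf: like memcpy, except the copy may be rounded up to the next
     multiple of 16 bytes, so both dst and src need that much slack. */
  void *_memcpyf_ref(void *dst, const void *src, size_t len)
  {
      size_t padded = (len + 15) & ~(size_t)15;   /* round up to 16 */
      return memcpy(dst, src, padded);
  }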

For looping memcpy, it makes sense to copy 64 or 128 bytes per loop iteration or so to try to limit looping overhead.
 On low end machines, you want to operate at cache port width,
On high end machines, you want to operate at cache line widths per port.
This is essentially impossible using slides.....here, the same code is
not optimal across a line of implementations.
 
Possible.
As is, it uses 64-bit load/store for unaligned copy, and 128-bit for aligned copy (support for unaligned "MOV.X" is still an optional feature).
It mostly doesn't bother trying to sort this out for the slide, as for the size ranges dealt with by the slide, trying to separate aligned from unaligned at runtime will end up costing about as much as it saves.
Though, for larger copies, it makes more sense to figure it out.
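
As a sketch of the looping case (the name copy_loop_64b is made up; 64-bit words, 64 bytes per iteration to amortize the loop overhead, with the 128-bit aligned path left out):

  #include <stdint.h>
  #include <stddef.h>

  /* Illustrative only: low-to-high copy, 8 x 64-bit = 64 bytes per iteration,
     assuming len is a multiple of 8 and any sub-8-byte tail is handled
     elsewhere. */
  static void copy_loop_64b(uint64_t *dst, const uint64_t *src, size_t len)
  {
      size_t i, n = len >> 3;
      for (i = 0; i + 8 <= n; i += 8) {
          dst[i+0]=src[i+0]; dst[i+1]=src[i+1]; dst[i+2]=src[i+2]; dst[i+3]=src[i+3];
          dst[i+4]=src[i+4]; dst[i+5]=src[i+5]; dst[i+6]=src[i+6]; dst[i+7]=src[i+7];
      }
      for (; i < n; i++)                /* leftover 8-byte words */
          dst[i] = src[i];
  }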

Though, leveraging the memcpy slide for the interior part of the copy could be possible in theory as well.
 What do you do when the SATA drive wants to write a whole page ??
 
?...
Presumably there aren't going to be that many pages being paged out mid-memcpy.

For LZ memcpy, it is typically smaller, as LZ copies tend to be a lot shorter (a big part of LZ decoder performance mostly being in fine-tuning the logic for the match copies).
 
Though, this is part of why my runtime library had added "_memlzcpy(dst, src, len)" and "_memlzcpyf(dst, src, len)" functions, which can consolidate this rather than needing to do it one-off for each LZ decoder (as I see it, it is a similar issue to not wanting code to endlessly re-roll stuff for functions like memcpy or malloc/free, *).
 
*: Though, nevermind that the standard C interface for malloc is annoyingly minimal, and ends up requiring most non-trivial programs to roll their own memory management.
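
Back to the LZ case, a minimal sketch of what a match copy has to deal with, and why it is not just memcpy (lz_match_copy is a made-up name; since the fast path may overshoot by up to 7 bytes, it is closer to the _memlzcpyf flavor):

  #include <stdint.h>
  #include <stddef.h>
  #include <string.h>

  /* LZ match copy: dst is the current output position, src points back into
     already-decoded output in the same buffer (dst > src), and the two may
     overlap when the match distance is shorter than the match length. */
  static void lz_match_copy(uint8_t *dst, const uint8_t *src, size_t len)
  {
      if ((size_t)(dst - src) >= 8) {
          /* far enough apart: 8 bytes at a time, may overshoot the end */
          for (size_t i = 0; i < len; i += 8)
              memcpy(dst + i, src + i, 8);
      } else {
          /* overlapping match: replay bytes one at a time */
          for (size_t i = 0; i < len; i++)
              dst[i] = src[i];
      }
  }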
 
Ended up doing these with "slides", which end up eating several kB of code space, but this was more compact than using larger inline copies.
>
>
Say (IIRC):
   128 bytes or less: Inline Ld/St sequence
   129 bytes to 512B: Slide
   Over 512B: Call "memcpy()" or similar.
>
Versus::
     1-infinity: use MM instruction.
>
 
Yeah, but it makes the CPU logic more expensive.
 By what, 37-gates ??
 
I will assume it is probably a bit more than this, given there is not currently any sort of mechanism that does anything similar.
Would need to add some sort of "inject synthesized instructions into the pipeline" mechanism; my guess is this would probably be at least a few kLUT. Well, unless it is put in ROM, but this would have no real advantage over "just do it in software".
FWIW:
I had originally intended to put a page-table walker in ROM and then pretend like it had a hardware page-walker, but we all know how this turned out.
Though, part of this was because it was competing against arguably more useful uses of ROM space, like the FAT driver, PE/COFF and ELF loaders, and the boot-time sanity checks (eg: verify early on that I hadn't broken fundamental parts of the CPU).

The slide generally has entry points in multiples of 32 bytes, and operates in reverse order. So, if not a multiple of 32 bytes, the last bytes need to be handled externally prior to branching into the slide.
>
Does this remain sequentially consistent ??
>
 
Within a thread, it is fine.
 What if a SATA drive is reading while you are writing !!
That is, DMA is no different than multi-threaded applications--except
DMA cannot perform locks.
 
Currently there is no DMA, only polling IO.
Also no SATA interface, nor PCIE, nor ...
IO to an SDcard is basically probing the MMIO interface and spinning in a loop until it is done. The most elaborate part of this interface is that there was a mechanism added to allow sending/receiving 8 bytes at a time over SPI.
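
In rough C terms (everything here is hypothetical for illustration; the register names, addresses, and bit layout are made up, not the actual MMIO map):

  #include <stdint.h>

  /* Hypothetical polled SPI transfer: write 8 bytes, kick the controller,
     spin on a busy bit, read 8 bytes back.  No DMA, no interrupts. */
  #define SPI_DATA  (*(volatile uint64_t *)0xF0001000u)   /* made-up address */
  #define SPI_CTRL  (*(volatile uint32_t *)0xF0001008u)   /* made-up address */
  #define SPI_BUSY  1u

  static uint64_t spi_xfer8(uint64_t out)
  {
      SPI_DATA = out;                 /* 8 bytes to shift out */
      SPI_CTRL = SPI_BUSY;            /* start the transfer */
      while (SPI_CTRL & SPI_BUSY)     /* poll until done */
          ;
      return SPI_DATA;                /* 8 bytes shifted in */
  }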

Main wonk is that it does start copying from the high address first.
Presumably interrupts or similar won't be messing with application memory mid-memcpy.
 The only things wanting high-low access patterns are dumping stuff to the stack. The fact you CAN get away with it most of the time is no excuse.
 
AFAIK, there is no particular requirement for which direction "memcpy()" goes.
And, high to low was more effective for the copy slide.

The looping memcpy's generally work from low to high addresses though.
 As does all string processing.
Granted.
The string handling functions are their own piles of fun...
