Subject : Re: Split instruction and immediate stream
From : cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups : comp.arch
Date : 24 Mar 2025, 10:00:14
Organization : A noiseless patient Spider
Message-ID : <vrr6vl$hpgr$1@dont-email.me>
References : 1 2 3 4 5 6 7 8
User-Agent : Mozilla Thunderbird
On 3/23/2025 8:44 AM, Thomas Koenig wrote:
> Robert Finch <robfi680@gmail.com> wrote:
>> In the latest test project, the LB650 similar to a PowerPC, large
>> constants are encoded at the end of the cache line. So, there is a
>> similar issue of code running into the constant area.
>
> What is your motivation for this?
> If you have an instruction including constant(s) which no longer
> fits your cache line (say, 8 bytes left and 12 bytes needed)
> it does not matter where you put the constants and where you
> put the instructions - it will not fit, and you have to start
> a new cache line.
> I am not seeing an advantage over what Power 10 does, which is
> just to add a NOP at the end if things don't fit on a cacheline.
I ended up with a vaguely related issue in XG1, where a 96-bit encoding at a certain offset would not work within a 128-bit fetch with 64-bit alignment.
The workaround was to insert a 16-bit NOP in the odd case where this scenario occurred.
This issue does not occur with XG2, XG3, or RV+Jx. In XG2 or XG3 modes, 32-bit alignment is required, at which point it is not possible for a 96-bit fetch to span 3 QWORDs.
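The alignment argument can be checked with a bit of arithmetic. This is only a sketch: it assumes the fetch window is two consecutive QWORDs starting at the QWORD that holds the first byte of the instruction, and the function names are made up for illustration (not anything from the actual core or BGBCC).

```c
#include <stdint.h>

/* Number of 64-bit QWORDs touched by an instruction that starts at
   byte address `addr` and is `len_bytes` long. */
static unsigned qwords_spanned(uint64_t addr, unsigned len_bytes)
{
    uint64_t first = addr / 8;
    uint64_t last  = (addr + len_bytes - 1) / 8;
    return (unsigned)(last - first + 1);
}

/* Returns 1 if the instruction fits in a 128-bit fetch (two QWORDs,
   64-bit aligned; assumed fetch model), else 0. */
static int fits_in_fetch(uint64_t addr, unsigned len_bytes)
{
    return qwords_spanned(addr, len_bytes) <= 2;
}
```

With 16-bit alignment the possible start offsets within a QWORD are 0, 2, 4, and 6; a 12-byte (96-bit) encoding fits for the first three, but at offset 6 it covers bytes 6..17 and touches three QWORDs. With 32-bit alignment only offsets 0 and 4 remain, and both fit.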
At present it doesn't occur in RV+Jx either, both because BGBCC doesn't yet support the 'C' extension (in any form that works), and also because support for 96-bit encodings was made non-default (*1).
*1: Thus far, all a 96-bit encoding can really express is a 64-bit constant load, and 64-bit constant loads aren't common enough by themselves to justify the added issues of dealing with 96-bit cases (instead, 64b constant loads can be dealt with by using two 64-bit instructions).
But, yeah, can note:
Verilog style bit-manipulation has seen some use in BGBCC, and has the merit that in some cases it can generate faster code than traditional C style bit manipulation.
For example, for repacking RGB555 to a 10-bit format for a table lookup:
v=(_UBitInt(10)) { rgb[14:12], rgb[ 9: 6], rgb[ 4: 2] };
v=lut[v];
This turned out to be notably faster (with the BITMOV instruction) than:
cr=(rgb>>12)&7;
cg=(rgb>> 6)&15;
cb=(rgb>> 2)&7;
v=(cr<<7)|(cg<<3)|cb;
v=lut[v];
Mostly in relation to RGB555 -> Indexed-color conversion.
Granted, it is still slower than dedicated RGB conversion operations would have been, but it is a lot more generic.
Though, it is looking like the dedicated palette-conversion instruction I added before might in fact be too limiting (since it effectively only works with a particular palette), and it may be more reasonable to drop it and switch to a lookup table and a "slightly less niche" instruction for repacking RGB555 into a 9 or 10 bit format to feed through a palette conversion lookup.
A 15-bit lookup table is slower due to L1 misses (whereas a 512B or 1K table has an easier time staying in the L1 cache). I had also noted that RGB343 seemingly has better accuracy at indexed color lookup than RGB333, while being cheaper than RGB444 (4K lookup).
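For reference, a 1K RGB343 table of this sort could be built roughly as follows. This is a sketch under assumptions, not the actual code: the PalEntry type, the function names, the bucket-midpoint expansion to 8 bits, and the squared-distance nearest-color match are all illustrative choices, and the repack is just the plain shift-and-mask version from above.

```c
#include <stdint.h>
#include <limits.h>

enum { LUT_SIZE = 1 << 10 };  /* 3+4+3 index bits, one byte per entry = 1K */

typedef struct { uint8_t r, g, b; } PalEntry;  /* hypothetical palette entry */

/* Repack RGB555 into the 10-bit RGB343 index (same shifts as above). */
static unsigned rgb555_to_rgb343(unsigned rgb)
{
    unsigned cr = (rgb >> 12) & 7;   /* rgb[14:12] */
    unsigned cg = (rgb >>  6) & 15;  /* rgb[ 9: 6] */
    unsigned cb = (rgb >>  2) & 7;   /* rgb[ 4: 2] */
    return (cr << 7) | (cg << 3) | cb;
}

/* Fill lut[] with the nearest palette index for each RGB343 value.
   Each channel is expanded to 8 bits at the midpoint of its bucket;
   "nearest" is plain squared RGB distance (an assumption). */
static void build_rgb343_lut(uint8_t lut[LUT_SIZE],
                             const PalEntry *pal, int npal)
{
    for (int idx = 0; idx < LUT_SIZE; idx++) {
        int r = (((idx >> 7) & 7)  << 5) | 0x10;
        int g = (((idx >> 3) & 15) << 4) | 0x08;
        int b = (( idx       & 7)  << 5) | 0x10;
        long best = LONG_MAX;
        int best_i = 0;
        for (int i = 0; i < npal; i++) {
            long dr = r - pal[i].r, dg = g - pal[i].g, db = b - pal[i].b;
            long d = dr * dr + dg * dg + db * db;
            if (d < best) { best = d; best_i = i; }
        }
        lut[idx] = (uint8_t)best_i;
    }
}
```

The per-pixel conversion is then just lut[rgb555_to_rgb343(rgb)], and the whole table is 1024 bytes, versus 32K for a direct 15-bit table.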
The most likely option at the moment is an instruction to repack, say:
rrrrrgggggbbbbb
Into, say:
grbgrbgrbgrbgrb
Which could at least have multiple use-cases (more so if an instruction exists to also switch it back into the usual RGB555 ordering).
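In plain C, the repack and its inverse would look something like the following. This is a sketch only: the function names are made up, and it assumes the patterns above are written MSB first, so that red sits in bits 14:10 of the input and the top bit of the output is the top green bit.

```c
/* rrrrrgggggbbbbb -> grbgrbgrbgrbgrb (both written MSB first).
   Bit k of each channel lands at: g -> 3k+2, r -> 3k+1, b -> 3k. */
static unsigned rgb555_interleave(unsigned rgb)
{
    unsigned out = 0;
    for (int k = 0; k < 5; k++) {
        unsigned r = (rgb >> (10 + k)) & 1;
        unsigned g = (rgb >> ( 5 + k)) & 1;
        unsigned b = (rgb >>       k ) & 1;
        out |= (g << (3 * k + 2)) | (r << (3 * k + 1)) | (b << (3 * k));
    }
    return out;
}

/* Inverse: grbgrbgrbgrbgrb -> rrrrrgggggbbbbb. */
static unsigned rgb555_deinterleave(unsigned v)
{
    unsigned rgb = 0;
    for (int k = 0; k < 5; k++) {
        unsigned g = (v >> (3 * k + 2)) & 1;
        unsigned r = (v >> (3 * k + 1)) & 1;
        unsigned b = (v >> (3 * k    )) & 1;
        rgb |= (r << (10 + k)) | (g << (5 + k)) | (b << k);
    }
    return rgb;
}
```

One property of this ordering, if the bit layout is as assumed: a plain right shift keeps the high bits of every channel, so (v >> 6) holds the top 3 bits each of R, G, and B (9 bits), and (v >> 5) holds 4 bits of green plus 3 each of red and blue (10 bits), which is the same information as the RGB343 index, just in interleaved bit order.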
I decided to leave out the specifics of the tweaks being considered for the design of my 256-color system palette, and the stuff about color-cells (and the possibility of adding a 1.25 bpp color cell mode).
Can note though that a color-cell mode with 8x8x1 cells and 2x 8bpp endpoints isn't great for image fidelity (well, that and the difficulty of making the color-cell encoder fast enough that screen refresh can happen at a reasonable framerate).
...