On 9/10/2024 5:47 AM, David Brown wrote:
On 09/09/2024 21:25, Brett wrote:
David Brown <david.brown@hesbynett.no> wrote:
Of course the fine line between "smart code" and "smart-arse code" is
somewhat subjective!
>
It also varies over time, and depends on the needs of the code.
Sometimes it makes sense to prioritise efficiency over readability - but
that is rare, and has been getting steadily rarer over the decades as
processors have been getting faster (disproportionately so for
inefficient code) and compilers have been getting better.
>
Often you get the most efficient results by writing code clearly and
simply so that the compiler can understand it better and generate good
object code. This is particularly true if you want the same source to be used
on different targets or different variants of a target - few people can
track the instruction scheduling and timings on multiple processors
better than a good compiler. (And the few people who /can/ do that
spend their time chatting in comp.arch instead of writing code...) When
you do hand-made micro-optimisations, these can work against the
compiler and give poorer results overall.
>
I know of no example where hand-optimized code does worse on a newer CPU.
A newer CPU with a bigger OoO engine will effectively unroll your code and
schedule it even better.
I would agree with you there. For the same object code, newer CPUs (with the same ISA) are typically faster for a variety of reasons. There may be the odd regression, but it is hard to market a newer CPU if it is slower than the older ones!
However, my point was that "hand-optimised" source code can lead to poorer results on newer /compilers/ compared to simpler source code. If you've googled for "bit twiddling hacks" for cool tricks, or written something like "(x << 4) + (x << 2) + x" instead of "x * 21", then the results will be slower with a modern compiler and modern CPU, even though the "hand-optimised" version might have been faster two decades ago. You can expect the modern tool to convert the multiplication into shifts and adds if that is more efficient on the target, or leave it as a multiplication if that is best on the target. But you can't expect the compiler to turn the shifts and adds into a multiplication. (Sometimes it can, but you can't expect it to.)
At least on MSVC, it may turn "(x<<3)+y" and similar into LEA instructions, but will generally also do the same for a multiply, given the right types and constant values (2, 3, 4, 5, 8, 9).
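As a small illustration of the point (function names are just for this sketch), both of these compute x*21, and it is the plain multiply that a modern compiler can lower to whichever sequence is fastest on the target:

  #include <stdint.h>

  /* Both compute x*21; a modern compiler will typically lower the plain
     multiply to LEA/shift-add chains or a single multiply, whichever is
     faster on the target.  The hand-shifted form ties its hands. */
  uint32_t mul21_plain(uint32_t x)  { return x * 21u; }
  uint32_t mul21_shifts(uint32_t x) { return (x << 4) + (x << 2) + x; }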
Blending with 3/8 and 5/8 weights is useful for color-cell blending, as:
  The constants are cheap to calculate (shifts and adds);
  They are approximately 1/3 and 2/3.
While 5/16 and 11/16 are closer, they are more expensive to calculate (a per-channel sketch follows below).
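A minimal per-channel sketch of such a blend, assuming 8-bit inputs (the helper name is mine):

  #include <stdint.h>

  /* (3*a + 5*b) / 8 using only shifts and adds:
     3*a = (a<<1)+a, 5*b = (b<<2)+b. */
  static inline uint8_t blend_3_5(uint8_t a, uint8_t b)
  {
      return (uint8_t)(((a << 1) + a + (b << 2) + b) >> 3);
  }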
But, on any semi-modern system, trying to avoid an integer multiply with a non-trivial amount of shifting and adding isn't really worth it in the general case.
It can still be worthwhile for integer constants in logic synthesis, where shifts and adds are cheaper to synthesize than a multiplier.
There are also color-transforms that can use shifts to good effect:
RCT: Y=(2*G+R+B)>>2; U=B-G; V=R-G;
YCoCg (one variant, *1):
V=R-B; t=(R+B)>>1; U=G-t; Y=(G+t)>>1;
*1: There are a number of variants; this one did best in my testing (and is still reversible). Though, it didn't really win vs RCT (the claims of "superiority" of YCoCg over RCT don't really seem to agree with my own testing in these areas).
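For reference, a minimal sketch of the RCT pair above (function names are mine; assumes arithmetic right shift on signed ints, which mainstream compilers provide):

  /* Forward RCT, as given above. */
  static inline void rct_fwd(int r, int g, int b, int *y, int *u, int *v)
  {
      *y = (2*g + r + b) >> 2;
      *u = b - g;
      *v = r - g;
  }

  /* Inverse: Y = G + ((U+V)>>2), so G = Y - ((U+V)>>2); R and B follow.
     This is what keeps the transform fully reversible despite the >>2. */
  static inline void rct_inv(int y, int u, int v, int *r, int *g, int *b)
  {
      int g0 = y - ((u + v) >> 2);
      *g = g0;
      *r = v + g0;
      *b = u + g0;
  }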
One potentially more physiologically-accurate model might be:
V=R-G; t=(R+G)>>1; U=B-t; Y=(B+t)>>1;
But, this would over-represent B in the Y channel, whereas (for both perceptual accuracy and compression) one wants G as the dominant component of Y (as they tend to be highly correlated).
Another option might be:
U=B-G; t=(B+G)>>1; V=R-t; Y=(R+t)>>1;
Which compromises between the physiological model and compression.
Here, the amount of physiological accuracy matters mainly for lossy compression, and in how noticeable color artifacts are likely to be.
In the case of lossy compression, the chromatic artifacts of YCoCg tend to be much more obvious than those of RCT or YUV (even if the RMSE score is better).
I had a color model that worked reasonably well:
Y=(4*G+3*R+B)>>3; U=B-Y; V=R-Y;
But, I couldn't devise a fully reversible version of the transform (though I had thought I had done so in the past), so it would only be usable as a lossy model.
As a lossy model, it can be made into a near drop-in replacement for YCbCr if modified to:
Y=(4*G+3*R+B)>>3; U=(B-Y)/2+128; V=(R-Y)/2+128;
With relatively little chromatic distortion (and it is cheaper to calculate).
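A sketch of that lossy form, assuming 8-bit RGB in and 8-bit Y/U/V out (names are mine):

  #include <stdint.h>

  /* Y weighted toward G, chroma halved and biased to 128, per the
     formulas above. */
  static inline void ygrb_fwd(uint8_t r, uint8_t g, uint8_t b,
                              uint8_t *y, uint8_t *u, uint8_t *v)
  {
      int y0 = (4*g + 3*r + b) >> 3;
      *y = (uint8_t)y0;
      *u = (uint8_t)(((b - y0) / 2) + 128);
      *v = (uint8_t)(((r - y0) / 2) + 128);
  }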
Can note that in JPEG-like image compressors, the performance of the colorspace transform tends to be a big factor in the overall performance (more so if using a cheaper block transform, such as Block-Haar or WHT, rather than the IDCT).
Though, for lossless (and near-lossless) modes, entropy coding and block VLC have a bigger effect. In proper JPEG, the Huffman decoding is also fairly slow: you either need an escape case that falls back to a linear search, which is slow, or you suffer huge numbers of cache misses from needing ~128K for each Huffman lookup table (~512K worth of Huffman tables in total). In this case, it can be cheaper to use Rice coding.
Also, unlike Huffman, Rice coding and AdRice are easier to accelerate in hardware.
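For illustration, a minimal MSB-first Rice decoder sketch (the toy bit reader is my own scaffolding, not from any particular codec):

  #include <stddef.h>
  #include <stdint.h>

  typedef struct { const uint8_t *buf; size_t pos; } bitreader_t;

  /* Read one bit, MSB-first. */
  static inline int bits_get1(bitreader_t *br)
  {
      int bit = (br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1;
      br->pos++;
      return bit;
  }

  /* Rice decode with parameter k: unary quotient (run of 1 bits, terminated
     by a 0), then a k-bit remainder.  No big lookup tables needed, which is
     the point vs static Huffman. */
  static inline unsigned rice_decode(bitreader_t *br, int k)
  {
      unsigned q = 0, r = 0;
      while (bits_get1(br))
          q++;
      for (int i = 0; i < k; i++)
          r = (r << 1) | (unsigned)bits_get1(br);
      return (q << k) | r;
  }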
One could "almost" make a case for trying to special-case YUV->RGB conversion in the 4:2:0 case with special CPU ops, say:
PCNVYUV2RGB5 Rm, Rn //YUV approximation
PCNVRCT2RGB5 Rm, Rn //RCT
Where Rm takes the form:
YYYY-YYYY-UUUU-VVVV
and converts 2 pixels (with 16-bit components) into 2x RGB555 (with built-in clamping).
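As a rough scalar guess at what the RCT variant could do for one pixel (the scaling and bit layout here are my own assumptions, not a spec):

  #include <stdint.h>

  static inline int clamp5(int v) { return v < 0 ? 0 : (v > 31 ? 31 : v); }

  /* One pixel of a PCNVRCT2RGB5-style op: inverse RCT, clamp each channel
     to 5 bits, pack as RGB555. */
  static inline uint16_t rct_to_rgb555(int y, int u, int v)
  {
      int g = y - ((u + v) >> 2);
      int r = v + g;
      int b = u + g;
      return (uint16_t)((clamp5(r) << 10) | (clamp5(g) << 5) | clamp5(b));
  }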
If one were to go the ASM route, it might also make sense to have special helper ops for the Haar transform (the core of both Block-Haar and the WHT; also relevant to YCoCg).
PINVHAAR Rm, Rn
Rm (input, as 4x Int16):
(A+B)/2, (C+D)/2; A-B, C-D
Rn (output, as 4x Int16):
Reconstructed A,B,C,D.
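In scalar form, one pair of the inverse step might look like this (my names; assumes avg = (A+B)>>1, i.e. floored, and arithmetic right shift):

  #include <stdint.h>

  /* Reconstruct A,B from avg=(A+B)>>1 and dif=A-B; the bit lost by the >>1
     is recovered from the low bit of dif, so the pair is reversible. */
  static inline void inv_haar_pair(int16_t avg, int16_t dif,
                                   int16_t *a, int16_t *b)
  {
      int a0 = avg + ((dif + 1) >> 1);
      *a = (int16_t)a0;
      *b = (int16_t)(a0 - dif);
  }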
Though, this would be less relevant to JPEG proper (which, in my case, would in theory be better served by the existing PMULSH.W instruction for the IDCT).
It would make a lot more sense for my recent "UPIC" compressor though (which manages to be competitive with T.81 JPEG in Q/bpp, also allows lossless encoding and an alpha channel, and could in theory be easier to optimize the decoder for on my ISA).
I did not initially expect it to compare well to T.81 in terms of Q/bpp, given that (as noted) it was using AdRice and Block-Haar.
But one may still be hard pressed to get acceptable speeds from this class of image compressor (though they can give both better quality and better Q/bpp vs color-cell based designs).
>
It's older, lesser CPUs where your hand-optimized code might fail hard, and
I know of few examples of that. None actually.
>
This is especially the case
when code is moved around with inlining, constant propagation,
unrolling, link-time optimisation, etc.
>
Long ago, it was a different matter - then compilers needed more help to
get good results. And compilers are far from perfect - there are still
times when "smart" code or assembly-like C is needed (such as when
taking advantage of some vector and SIMD facilities).
>