On 5/8/2025 6:13 AM, Janis Papanagnou wrote:
> On 08.05.2025 05:30, BGB wrote:
> [...]
>>
>> Though, even for the Latin alphabet, once one goes much outside of ASCII
>> and Latin-1, it gets messy.
>
> I noticed that in several places you were referring to Latin-1. Since
> decades that has been replaced by the Latin-9 (ISO 8859-15) character
> set[*] for practical reasons ('€' sign, for example).
>
> Why is your focus still on the old Latin-1 (ISO 8859-1) character set?
>
> Janis, just curious
>
> [*] Unless Unicode and its encodings are used.
In Unicode, U+00A0..U+00FF correspond directly to the printable upper half of Latin-1 (the rest of the Latin-1 Supplement block, U+0080..U+009F, being the C1 control codes).

There are further Latin blocks in Unicode, but their contents are more haphazard, so any rules defined for them tend to operate one character at a time rather than moving whole blocks of characters (as can be done with ASCII and the Latin-1 range).
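As a rough sketch of what "moving whole blocks" means here (illustration only, not from any particular codebase): lower-casing in the ASCII and Latin-1 ranges can be done by offsetting whole runs of code points, where the later blocks mostly need per-character tables:

#include <stdint.h>

/* Lowercase a code point, handling only the ASCII and Latin-1 ranges,
 * where entire blocks can be shifted by 0x20.  The later Latin blocks
 * generally need per-character lookup tables instead. */
uint32_t latin1_tolower(uint32_t c)
{
    if (c >= 'A' && c <= 'Z')
        return c + 0x20;               /* ASCII uppercase block */
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
        return c + 0x20;               /* A-grave..Thorn, skipping the multiply sign */
    return c;
}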
CP-1252, which is the dominant surviving extended-ASCII character set still in use, is based on Latin-1, with the Latin-9 additions (and a number of other characters) shoved into the places where the C1 control codes previously went.

Say, the euro sign (U+20AC) at 0x80, capital Y with diaeresis (U+0178) at 0x9F, ...
Apparently, in some online stats, only 0.02% of webpages use 8859-15 (vs 1.1% for 8859-1).
In my project, as noted, the Unicode mapping was tweaked so that 0080..009F are understood as the 1252 mappings, effectively leaving the C1 control codes unrepresentable; but the C1 control codes are pretty much unused in practice anyway.
And, of the C0 control codes, only a subset of them can be considered "actually" used:
\0, \a, \b, \t, \n, \r, \e (known used, also have escape notations)
\v, \f (have C escapes, pretty much never encountered though).
In text files, it is usually reduced to:
\t, \r, \n
In this case, it means that the conversion between UTF-8 and 1252 is fairly straightforward.

1252 -> UTF-8: simply remap anything in 80..FF into a 2-byte encoding (or 3 bytes for a few of the 80..9F mappings, like U+20AC).
UTF-8 -> 1252: remap 0000..00FF to bytes;
  Potentially detect/reject if characters outside the range are used;
  Map some canonical Unicode characters into the 1252 range where possible.
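Roughly, a minimal sketch of the byte<->codepoint side of this (illustration only; the UTF-8 encode/decode of the resulting code point is handled separately, and the function names are made up):

#include <stdint.h>

/* CP-1252 bytes 0x80..0x9F -> Unicode (standard 1252 table);
 * the few unassigned slots fall back to the raw C1 values. */
static const uint16_t cp1252_hi[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

/* 1252 byte -> code point: 00..7F and A0..FF map directly,
 * 80..9F go through the table above. */
uint32_t cp1252_to_ucs(uint8_t b)
{
    if (b >= 0x80 && b <= 0x9F)
        return cp1252_hi[b - 0x80];
    return b;
}

/* code point -> 1252 byte: direct for most of 00..FF, table search
 * for the handful of higher characters, -1 (reject) otherwise. */
int ucs_to_cp1252(uint32_t c)
{
    int i;
    if (c <= 0xFF && !(c >= 0x80 && c <= 0x9F))
        return (int)c;
    for (i = 0; i < 32; i++)
        if (cp1252_hi[i] == c)
            return 0x80 + i;
    return -1;   /* not representable in 1252 */
}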
Can further note, for 00..FF:
  Can also be represented in 6x8 font cells;
    the experimental GUI uses 6x8 for the console,
    so an 80x25 console needs 480x200 pixels.
  This is in addition to the 8x8 cells;
    80..FF needed some twiddling in some cases to fit in 8x8.
  Would have been easier with 8x12 cells or similar, but...
The 2-digit hexadecimal can be represented effectively in 8x8, but not so well in 6x8, as one generally needs 4x5 pixels for each hex digit. At 6x8, one has to leave out the spacing pixels, so the digits collide, negatively affecting readability.
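Say, as a sketch of the packing (hexfont4x5 here is a hypothetical table of 4x5 digit glyphs, each assumed 3 pixels wide with bit 0 left clear as a spacing column):

#include <stdint.h>

/* Hypothetical 4x5 hex-digit glyphs: 5 rows per digit, low 4 bits of
 * each byte used, bit 0 left clear as the spacing column. */
extern const uint8_t hexfont4x5[16][5];

/* Compose a byte value as two 4x5 hex digits inside an 8x8 cell.
 * Each output byte is one row of the cell, MSB = leftmost pixel. */
void draw_hex_byte_8x8(uint8_t val, uint8_t cell[8])
{
    int y;
    for (y = 0; y < 8; y++)
        cell[y] = 0;
    for (y = 0; y < 5; y++) {
        uint8_t hi = hexfont4x5[(val >> 4) & 15][y];   /* left digit  */
        uint8_t lo = hexfont4x5[ val       & 15][y];   /* right digit */
        cell[y + 1] = (uint8_t)(((hi & 15) << 4) | (lo & 15));
    }
}

At 6 pixels wide there is no room for the two spacing columns, so the digits end up touching.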
It is "mostly" possible to represent the ASCII range in 3x5 pixels (padded to 4x6 or 4x8), though some characters need to get "creative" and legibility is poor.
So, say, can't effectively do an 80x25 console at 320x200 pixels (and still have passable legibility), but 40x25 and 52x25 are possible.
For variable-size text rendering, I was mostly using SDFs (signed distance fields).

Most of the Unicode range can be covered by having converted Unifont into SDF cells (via an offline tool), but most of Unifont does not render effectively at 8x8 or similar.
For best results at smaller sizes, for the 00..FF range, mostly still using an SDF derived from my 8x8 font (which was generally a bit more "robust" at the typical text sizes).
Had experimented with geometric "TrueType" style outline rendering (1), but rasterizing the outlines directly at small glyph sizes did not work effectively. Got the best results by first drawing the glyphs at an "impractically large" size (say, 64x64 pixels), then using this to generate an SDF image (usually represented as 16x16 pixels), then using the SDF to generate the other text sizes.
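For reference, a brute-force sketch of that large-bitmap-to-SDF step (simplified here to a single-channel distance rather than the 4+4 bit X/Y form mentioned below; fine for an offline tool, far too slow for runtime use):

#include <math.h>
#include <stdint.h>

/* Build a 16x16 single-channel SDF from a 64x64 1-bit glyph image.
 * Output: 128 = on the edge, >128 = inside, <128 = outside.
 * Brute force: for each output sample, search for the nearest source
 * pixel of the opposite state. */
void glyph_to_sdf(const uint8_t src[64][64], uint8_t dst[16][16])
{
    int ox, oy, x, y;
    for (oy = 0; oy < 16; oy++)
    for (ox = 0; ox < 16; ox++) {
        int cx = ox * 4 + 2, cy = oy * 4 + 2;   /* sample center in src */
        int inside = src[cy][cx];
        double best = 64.0;
        for (y = 0; y < 64; y++)
        for (x = 0; x < 64; x++) {
            if (src[y][x] != inside) {
                double d = sqrt((x - cx) * (double)(x - cx) +
                                (y - cy) * (double)(y - cy));
                if (d < best)
                    best = d;
            }
        }
        /* signed distance, scaled into a byte centered on 128 */
        double v = 128.0 + (inside ? best : -best) * 8.0;
        if (v < 0.0)   v = 0.0;
        if (v > 255.0) v = 255.0;
        dst[oy][ox] = (uint8_t)v;
    }
}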
In this case, bitmap glyphs are used for the actual rendering, but the SDF can be used to generate various size bitmap glyphs. Various stages of caching are used here.
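The sampling direction is the cheap part; a correspondingly simplified sketch (single-channel again; bilinear sample, then threshold; a real version would also want some anti-aliasing around the edge):

#include <stdint.h>

/* Render a w*h 1-bit glyph from a 16x16 single-channel SDF
 * (128 = edge, >128 = inside), via bilinear sampling + threshold. */
void sdf_to_bitmap(const uint8_t sdf[16][16], uint8_t *out, int w, int h)
{
    int x, y;
    for (y = 0; y < h; y++)
    for (x = 0; x < w; x++) {
        /* map the output pixel center into SDF space */
        double fx = (x + 0.5) * 16.0 / w - 0.5;
        double fy = (y + 0.5) * 16.0 / h - 0.5;
        if (fx < 0) fx = 0;
        if (fx > 15) fx = 15;
        if (fy < 0) fy = 0;
        if (fy > 15) fy = 15;
        int ix = (int)fx, iy = (int)fy;
        if (ix > 14) ix = 14;
        if (iy > 14) iy = 14;
        double ax = fx - ix, ay = fy - iy;
        double v =
            sdf[iy    ][ix    ] * (1 - ax) * (1 - ay) +
            sdf[iy    ][ix + 1] *      ax  * (1 - ay) +
            sdf[iy + 1][ix    ] * (1 - ax) *      ay  +
            sdf[iy + 1][ix + 1] *      ax  *      ay;
        out[y * w + x] = (v >= 128.0) ? 1 : 0;
    }
}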
Rendering large text using SDFs is liable to look wonky, but large text rendering is rare.
1: Although TrueType style font rendering originated in the 1980s, not sure how it would have been practical with 1980s level technology (say, machines with 1MHz CPUs and kB of RAM).
The strategies I had found were either computationally expensive or required first drawing the glyph at a large size (and then down-sampling in some way to get to the target size). Bitmap fonts would presumably have been the more practical option.
One downside of SDFs is that they are comparatively bulky in terms of memory use, generally requiring around 8 bits per pixel (4b X, 4b Y). So, representing the entire Unicode BMP in uncompressed SDF form would need roughly 16MB. For the font, the images are generally stored in compressed (2) form, with each "plane" of 16x16 glyphs (a 256x256-pixel image) being decompressed as needed.
Then say, one has a cache of several planes (each needing 64K), noting that typically text rendering doesn't chaotically jump between planes.
Though, can note it isn't "that" much worse than using a binary conversion of Unifont (in 1bpp form), which, even at 1bpp, is still around 1MB.
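A minimal sketch of the plane-cache idea (decompress_sdf_plane standing in for whatever decompressor is actually used):

#include <stdint.h>

#define N_PLANES 8   /* decompressed planes kept around, 64K each */

/* hypothetical: decompress the 64K SDF image covering glyphs
 * (plane<<8)..(plane<<8)+255 into dst. */
extern void decompress_sdf_plane(int plane, uint8_t *dst);

static struct {
    int     plane1;             /* plane number + 1; 0 = slot empty */
    uint8_t data[65536];        /* 256 glyphs, 16x16 cells at 8bpp  */
} sdf_cache[N_PLANES];

/* Fetch the 16x16 SDF cell for a code point, decompressing its plane
 * on demand.  Simple round-robin eviction; good enough given that
 * successive glyphs rarely hop between planes. */
const uint8_t *get_sdf_cell(uint32_t cp)
{
    static int next;
    int plane = (int)(cp >> 8), i;

    for (i = 0; i < N_PLANES; i++)
        if (sdf_cache[i].plane1 == plane + 1)
            break;
    if (i == N_PLANES) {
        i = next;
        next = (next + 1) % N_PLANES;
        decompress_sdf_plane(plane, sdf_cache[i].data);
        sdf_cache[i].plane1 = plane + 1;
    }
    return sdf_cache[i].data + (cp & 255) * 256;
}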
2: Decided to keep it shorter. Thus far, images are 256x256 and naively compressed with a byte-oriented (no entropy coder) LZ77 variant. More effective would be something with a pixel predictor and an entropy coder (say, like PNG), but PNG decoding is too expensive. A budget option might be to simply delta each byte against the previous byte, and then use an AdRice+STF+LZ77 style compressor (arguably "not as good" in terms of compression, but with lower overheads than something like Deflate).
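The delta-filter part of that budget option is about as simple as it sounds (sketch; the AdRice+STF+LZ77 stage itself is not shown):

#include <stddef.h>
#include <stdint.h>

/* Replace each byte with its difference from the previous byte, so
 * smooth SDF gradients turn into runs of small values that a simple
 * LZ + entropy stage handles better.  Applied before compression. */
void delta_filter(uint8_t *buf, size_t len)
{
    size_t i;
    uint8_t prev = 0;
    for (i = 0; i < len; i++) {
        uint8_t cur = buf[i];
        buf[i] = (uint8_t)(cur - prev);
        prev = cur;
    }
}

/* Inverse filter, applied after decompression. */
void delta_unfilter(uint8_t *buf, size_t len)
{
    size_t i;
    uint8_t prev = 0;
    for (i = 0; i < len; i++) {
        prev = (uint8_t)(prev + buf[i]);
        buf[i] = prev;
    }
}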
[...]