Subject: Re: Unicode in strings
From: monnier (at) *nospam* iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Date: 04 Jun 2024, 21:03:39
Organization: A noiseless patient Spider
Message-ID: <jwvcyowmr0r.fsf-monnier+comp.arch@gnu.org>
User-Agent: Gnus/5.13 (Gnus v5.13)
> For text editors, this is one of the few cases where it makes sense to
> use 32- or 64-bit characters (say, combining the 'character' with some
> additional metadata such as formatting).
Even 64 bits is very tight for encoding all the information in an emoji.
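To put a rough number on that, here is a small sketch that counts what one common emoji ZWJ sequence (the four-person family emoji) needs if stored as a single fixed-width "character". The choice of example and the 21-bits-per-code-point accounting are my own assumptions for illustration, not something from the thread.

```python
# Family emoji: man + ZWJ + woman + ZWJ + girl + ZWJ + boy.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

# One Python str element is one Unicode code point.
code_points = [ord(c) for c in family]

# 21 bits are enough for any code point (max is U+10FFFF < 2**21).
BITS_PER_CODE_POINT = 21
bits_needed = len(code_points) * BITS_PER_CODE_POINT

utf8_bytes = len(family.encode("utf-8"))

print(f"code points: {len(code_points)}")
print(f"bits if packed at 21 bits each: {bits_needed}")
print(f"bytes in UTF-8: {utf8_bytes}")
```

Even before adding any formatting metadata, this single user-visible glyph does not fit in a 64-bit cell.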
> Though, one thing that makes sense for text editors is to keep only the
> lines currently being edited fully unpacked, while the others remain in a
> more compact form (such as UTF-8) and are unpacked as they come into view
> (say, treating the editor window as a 32-entry modulo cache or similar).
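For readers unfamiliar with the idea, a "modulo cache" of unpacked lines could look roughly like the sketch below: line n lives in slot n % 32, and a line is decoded from UTF-8 only when its slot holds a different line. The names `LineCache`, `get`, and `decodes` are hypothetical, and write-back of edited lines is left out entirely.

```python
class LineCache:
    """Direct-mapped cache of unpacked lines: line n maps to slot n % size."""

    def __init__(self, packed_lines, size=32):
        self.packed = packed_lines      # list of bytes objects, each UTF-8
        self.size = size
        self.tags = [None] * size       # which line number each slot holds
        self.slots = [None] * size      # unpacked form: list of code points
        self.decodes = 0                # counts misses, for illustration

    def get(self, n):
        slot = n % self.size
        if self.tags[slot] != n:
            # Miss: unpack the UTF-8 line into one integer per code point.
            text = self.packed[n].decode("utf-8")
            self.slots[slot] = [ord(c) for c in text]
            self.tags[slot] = n
            self.decodes += 1
        return self.slots[slot]


lines = [f"line {i} \u00e9".encode("utf-8") for i in range(100)]
cache = LineCache(lines)
cache.get(3)
cache.get(3)    # hit: same slot, same tag, no second decode
cache.get(35)   # 35 % 32 == 3, so this evicts line 3
cache.get(3)    # miss again
print(cache.decodes)
```

A window of up to 32 consecutive lines never conflicts with itself under this mapping, which is presumably the appeal of sizing the cache to the window.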
You need to care about "character boundaries" rarely enough that such
encoding/decoding is probably not worthwhile (especially if you
consider the case of multi-MB lines).
It's easy enough to move through UTF-8 itself.
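As an illustration of how little is needed to step through UTF-8 directly, here is a minimal sketch. It relies on the fact that continuation bytes always have the bit pattern 10xxxxxx. The function names are hypothetical, the input is assumed to be valid UTF-8, and the index is assumed to start on a code point boundary. Note that this moves by code points, not by user-perceived characters (grapheme clusters).

```python
def is_continuation(b):
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (b & 0xC0) == 0x80


def next_char(buf, i):
    """Byte index of the code point after the one starting at i."""
    i += 1
    while i < len(buf) and is_continuation(buf[i]):
        i += 1
    return i


def prev_char(buf, i):
    """Byte index of the code point before the one starting at i (i > 0)."""
    i -= 1
    while i > 0 and is_continuation(buf[i]):
        i -= 1
    return i


# 1-, 2-, 3- and 4-byte sequences: a, e-acute, euro sign, grinning face.
buf = "a\u00e9\u20ac\U0001F600".encode("utf-8")

forward = [0]
while forward[-1] < len(buf):
    forward.append(next_char(buf, forward[-1]))
print(forward)
```

Moving backward needs at most three extra byte inspections per step, so no side index or unpacked copy is required for cursor motion.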
> Not entirely sure how other text editors manage things here; I haven't
> really looked into it.
There are several different options.
Emacs uses a gap buffer, which is a fairly primitive approach that in
theory has poor worst-case behavior but works surprisingly well in
practice (especially given the speed at which current CPUs can copy/move
large chunks of memory).
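For anyone who has not seen one, a gap buffer can be sketched in a few lines. This is a toy version under my own assumptions (a Python list of characters, a hypothetical `GapBuffer` class); it is not how Emacs implements it, and a real implementation would move the gap with a single block copy rather than an element-by-element loop.

```python
class GapBuffer:
    """Text stored in one array with a movable gap of free space."""

    def __init__(self, text="", gap=16):
        self.buf = list(text) + [None] * gap
        self.gap_start = len(text)       # first free cell
        self.gap_end = len(self.buf)     # first used cell after the gap

    def move_gap(self, pos):
        """Shift characters across the gap until it starts at pos."""
        while self.gap_start > pos:      # move gap left
            self.gap_start -= 1
            self.gap_end -= 1
            self.buf[self.gap_end] = self.buf[self.gap_start]
        while self.gap_start < pos:      # move gap right
            self.buf[self.gap_start] = self.buf[self.gap_end]
            self.gap_start += 1
            self.gap_end += 1

    def _grow(self):
        """Open a new gap at the current position when the old one is full."""
        extra = max(16, len(self.buf))
        self.buf[self.gap_end:self.gap_end] = [None] * extra
        self.gap_end += extra

    def insert(self, pos, s):
        self.move_gap(pos)
        for ch in s:
            if self.gap_start == self.gap_end:
                self._grow()
            self.buf[self.gap_start] = ch
            self.gap_start += 1

    def delete(self, pos, n):
        """Delete n characters after pos by widening the gap over them."""
        self.move_gap(pos)
        self.gap_end = min(self.gap_end + n, len(self.buf))

    def text(self):
        return "".join(self.buf[:self.gap_start] + self.buf[self.gap_end:])


gb = GapBuffer("hello world")
gb.insert(5, ",")
gb.insert(0, ">> ")
gb.delete(0, 3)
print(gb.text())
```

Insertions and deletions at the gap are cheap; the poor worst case is a jump to a distant position, which costs a copy proportional to the distance moved.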
Others use structures like ropes.
https://coredumped.dev/2023/08/09/text-showdown-gap-buffers-vs-ropes/

        Stefan