On 5/1/2025 4:13 AM, David Brown wrote:
> On 01/05/2025 01:56, Lawrence D'Oliveiro wrote:
>> On Wed, 30 Apr 2025 12:38:53 -0000 (UTC), Muttley wrote:
>>>
>>> Its certainly not a scheme I'd use, but I've also seen Makefile and
>>> makefile in the same package build directory in the past.
>>
>> The GNU “make” command, specified without a filename, looks for
>> “GNUmakefile”, then “Makefile”, then “makefile”. The man page
>> <https://manpages.debian.org/make(1)> says:
>>
>>     We recommend Makefile because it appears prominently near the
>>     beginning of a directory listing, right near other important files
>>     such as README.
>>
>> But is this still true for most people? I think the default sort
>> settings these days no longer put all-caps names at the top.
>
> I can't speak for "most people", but since my project directories
> rarely have more than about a dozen files and directories (like "src"
> and "build") in the top directory, it could be called zzzz and still
> be near the top!
Wandering in a bit late, but I can note how this works for my project (or rather, its makeshift-OS part):
Nominal filename format: UTF-8.
IIRC, my experimental (Unix style) filesystem could use one of several encodings:
ASCII
UTF-8
CP1252 (Latin-1 with the C1 control codes replaced by printable characters)
The merit of 1252 here is that it can take fewer bytes, and statistically it is the single-byte encoding most likely to cover any non-ASCII characters encountered (most are Latin-1). So the scheme was: use 1252 if everything fits into its character range, and use UTF-8 if it doesn't. It is possible to rely on disambiguation by never using 1252 when the bytes could be confused for UTF-8. Most of the time, 1252 (if any non-ASCII chars are used) results in sequences that are invalid as UTF-8, so no ambiguity results (if not valid UTF-8, assume 1252). The pure-ASCII case can be ignored, as ASCII is encoded identically in both schemes.
Part of the rationale is that the directory entries in this case were fixed-size (like FAT, albeit with longer names), and a compact encoding could make the difference between using a single directory entry and needing a more complex LFN-style scheme. In this case the default name field is 48 bytes, and it is rare for a filename not to fit into 48 bytes.
No other codepages were supported here (so, anything not Latin-1 or similar will need to use UTF-8 regardless).
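The "if not valid UTF-8, assume 1252" rule above amounts to a UTF-8 well-formedness check on the stored bytes. A minimal sketch (the function and enum names are invented for illustration, not from the actual project):

```c
#include <stddef.h>

/* Returns 1 if s[0..len) is well-formed UTF-8 (plain ASCII included). */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t n;                              /* continuation byte count */
        if (c < 0x80)      { i++; continue; }  /* ASCII                   */
        else if (c < 0xC2) return 0;           /* stray cont. / overlong  */
        else if (c < 0xE0) n = 1;              /* 2-byte sequence         */
        else if (c < 0xF0) n = 2;              /* 3-byte sequence         */
        else if (c < 0xF5) n = 3;              /* 4-byte sequence         */
        else               return 0;           /* beyond U+10FFFF         */
        if (i + n >= len) return 0;            /* truncated sequence      */
        for (size_t k = 1; k <= n; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;
        /* reject overlong 3/4-byte forms and UTF-16 surrogates */
        if (c == 0xE0 && s[i + 1] < 0xA0) return 0;
        if (c == 0xED && s[i + 1] > 0x9F) return 0;
        if (c == 0xF0 && s[i + 1] < 0x90) return 0;
        if (c == 0xF4 && s[i + 1] > 0x8F) return 0;
        i += n + 1;
    }
    return 1;
}

/* Encoding assumed for a stored name under the rule above. */
enum name_enc { ENC_UTF8, ENC_CP1252 };

static enum name_enc classify_name(const unsigned char *s, size_t len)
{
    return is_valid_utf8(s, len) ? ENC_UTF8 : ENC_CP1252;
}
```

Pure-ASCII names classify as UTF-8 here, which is harmless since the bytes are identical under both encodings.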
Another semi-filesystem is in use with similar rules, except with 32 byte filenames.
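Such a fixed-size entry might look roughly like the following; every field name and the overall layout here are invented for illustration, with only the 48-byte inline name field reflecting the description above:

```c
#include <stdint.h>

/* Hypothetical fixed-size directory entry in the spirit described:
 * everything inline, FAT-like, with a 48-byte name field.  A name that
 * does not fit in 48 bytes would need an LFN-style scheme instead. */
struct dirent_fixed {
    uint64_t first_block;   /* FAT-style chain start (invented field) */
    uint32_t size;          /* file size in bytes                     */
    uint16_t mode;          /* type/permission bits                   */
    uint16_t name_enc;      /* ASCII / UTF-8 / CP1252 marker          */
    char     name[48];      /* NUL-padded name in the marked encoding */
};
```

With this layout each entry is a tidy 64 bytes, so entries pack evenly into sectors.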
FAT32, as noted, is:
8.3 names, CP1252, with bits to encode an upper- or lower-case base and extension;
LFNs, with up to 255 characters of UCS-2.
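The case bits mentioned are presumably the usual reserved-byte flags introduced by Windows NT (and honored by Linux's vfat driver): bit 3 lowercases the 8-character base, bit 4 the 3-character extension. A sketch of applying them when rendering an 8.3 name (helper name invented):

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define CASE_LOWER_BASE 0x08  /* reserved-byte bit 3: lowercase the base      */
#define CASE_LOWER_EXT  0x10  /* reserved-byte bit 4: lowercase the extension */

/* Render an on-disk 8.3 name (space-padded base[8] and ext[3]) into
 * "name.ext" form, honoring the case bits.  out must hold 13 bytes. */
static void fat_render_name(const char base[8], const char ext[3],
                            unsigned char nt_res, char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < 8 && base[i] != ' '; i++)
        out[n++] = (nt_res & CASE_LOWER_BASE)
                 ? (char)tolower((unsigned char)base[i]) : base[i];
    if (ext[0] != ' ') {
        out[n++] = '.';
        for (size_t i = 0; i < 3 && ext[i] != ' '; i++)
            out[n++] = (nt_res & CASE_LOWER_EXT)
                     ? (char)tolower((unsigned char)ext[i]) : ext[i];
    }
    out[n] = '\0';
}
```

This only covers all-upper or all-lower base/extension; genuinely mixed-case names still need an LFN entry.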
...
At higher levels, APIs generally assume normalization to UTF-8.
Though, with a few non-standard tweaks: 0080..009F are assumed to be the printable characters from 1252, and not the C1 control characters;
In console settings, the Arabic alphabet was replaced with 2-digit hex numbers (00..FF), as:
I felt a need for dense 2-digit hex numbers in the console;
I ideally needed a spot low in the mapping;
The Arabic characters don't render legibly in 8x8 pixel character cells (*1).
*1: I might reconsider if someone can make a case that this alphabet could in fact be represented in a recognizable form in 8x8 pixel character cells.
This mostly doesn't apply outside the console. For application use, the standard character assignments would be assumed.
As for collating:
Nominal order is raw unsigned bytes (based on the UTF-8 encoding);
This will put uppercase before lowercase.
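The nominal collation above needs no special machinery in C: `strcmp` is specified to compare elements as `unsigned char`, so it already gives exactly the raw-byte order of the UTF-8 encoding, with uppercase ASCII (0x41..0x5A) sorting ahead of lowercase (0x61..0x7A):

```c
#include <string.h>

/* Raw unsigned-byte ordering over the UTF-8 bytes of two names.
 * strcmp() compares as unsigned char per the C standard, so this
 * wrapper is exactly the collation described above. */
static int name_cmp(const char *a, const char *b)
{
    return strcmp(a, b);
}
```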
I debated whether, and in what style, to normalize UTF-8 strings.
Full Unicode normalization was too complicated;
Fully non-normalized encoding could also pose issues.
In this context, if it takes much over a few hundred lines of code and around 1K of tables, it was too expensive.
Normalization rules ended up being a compromise:
Only the Latin and Extended Latin combining characters are handled.
Or, roughly, Latin-1 and Latin-2.
Pretty much everything else is passed through as-is.
Precomposed characters are first decomposed, and then any base+combining pairs are recomposed. Filenames exist in the composed form, as this uses fewer bytes.
So, for example, the filesystem layer does not normalize emoji; it has no reason to know what an emoji is.
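The compromise can be sketched as a small composition table; the three entries below are purely illustrative (the real tables would cover roughly the Latin-1 and Latin-2 accented letters), and the names are invented:

```c
#include <stddef.h>
#include <stdint.h>

/* Tiny illustrative composition table.  Pairs not in the table pass
 * through unchanged, so the layer never needs to know about emoji,
 * Hangul, or anything else outside the Latin ranges. */
struct compose_entry { uint32_t base, comb, composed; };

static const struct compose_entry compose_tab[] = {
    { 'e', 0x0301, 0x00E9 },  /* e + COMBINING ACUTE ACCENT -> e-acute */
    { 'a', 0x0300, 0x00E0 },  /* a + COMBINING GRAVE ACCENT -> a-grave */
    { 'n', 0x0303, 0x00F1 },  /* n + COMBINING TILDE        -> n-tilde */
};

/* Return the composed codepoint for base+comb, or 0 if the pair is
 * unknown (the caller then emits the pair as-is). */
static uint32_t try_compose(uint32_t base, uint32_t comb)
{
    for (size_t i = 0; i < sizeof compose_tab / sizeof compose_tab[0]; i++)
        if (compose_tab[i].base == base && compose_tab[i].comb == comb)
            return compose_tab[i].composed;
    return 0;
}
```

Keeping the table this flat and small is what holds the cost under the "few hundred lines plus ~1K of tables" budget.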
There was some debate over representing non-BMP characters as UTF-8-coded surrogate pairs or as direct 4-byte UTF-8 sequences; offhand, I don't remember for certain which I chose. I think it may have been the latter, due to fewer bytes, whereas I would usually have preferred UTF-8-coded surrogate pairs in other contexts. I do vaguely remember dealing with this issue in my normalization code, though.
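The byte-count tradeoff being weighed there, assuming "UTF-8-coded surrogate pairs" means the CESU-8-style form (each UTF-16 surrogate half coded as a 3-byte UTF-8 sequence):

```c
#include <stdint.h>

/* Bytes needed to encode cp as a direct UTF-8 sequence. */
static int utf8_len(uint32_t cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Bytes needed if non-BMP codepoints instead go through a UTF-16
 * surrogate pair, each half coded as 3-byte UTF-8 (CESU-8 style). */
static int surrogate_utf8_len(uint32_t cp)
{
    return (cp < 0x10000) ? utf8_len(cp) : 6;  /* two 3-byte halves */
}
```

So a non-BMP character costs 4 bytes directly versus 6 via surrogates, which is the "fewer bytes" argument for the direct form.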
Though, in this case, the UTF-8 normalization was dealt with in the VFS level rather than in the FS drivers.
There was also a concern, as I recall, that if a filename stored in the filesystem were normalized differently from the VFS's normalization, the file could become effectively inaccessible. IIRC, there was no good solution to this possibility.
The most likely partial answer, though, is that any filename-normalization rules should preferably be kept frozen once defined.
...