On 5/6/2025 6:19 PM, Janis Papanagnou wrote:
On 06.05.2025 20:01, BGB wrote:
[...]
The partial rationale here being that the directory entries in this case
were fixed size (like FAT, albeit with longer names), and this could
potentially make the difference between using a single directory entry
or needing a more complex LFN style scheme. Though, in this case, the
default name length is 48, and it is rare for a filename to not fit into
48 bytes.
You mean rare in your application areas?
This appears to me like a very conservative size. While I'd agree
that it's probably a sensible value for own files with explicitly
chosen file names a lot of files that are downloaded regularly do
have longer file names. A quick check of my "Documents" directory
(that contains both, downloaded files and own files) shows a ratio
of 1563:629, i.e. roughly about 30% files of "document" type with
lengths > 48 (there's no files with a file name length > 128).
I recall someone here recently spoke about chosen lengths of 255
(or some such) for file names, which seems to be plenty, OTOH.
Running quick-and-dirty stats over everything on my "K:" drive (roughly 2 million files of assorted types).
Stats (file names less than N bytes):
16: 66.40%
24: 87.85%
32: 95.38%
48: 99.31%
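For anyone wanting to reproduce this sort of count, a minimal sketch follows. The function names `coverage` and `walk_names` are made up for illustration, and it assumes "less than N bytes" means the byte length of the base name in the filesystem encoding; the original numbers may have been gathered differently.

```python
import os

def coverage(names, limits=(16, 24, 32, 48)):
    """Return {limit: percent of names whose encoded length is < limit}."""
    # os.fsencode gives the name as bytes in the filesystem encoding.
    lengths = [len(os.fsencode(n)) for n in names]
    total = len(lengths)
    if total == 0:
        return {limit: 0.0 for limit in limits}
    return {
        limit: 100.0 * sum(length < limit for length in lengths) / total
        for limit in limits
    }

def walk_names(root):
    """Yield the base name of every non-directory entry under root."""
    for _dirpath, _dirnames, filenames in os.walk(root):
        yield from filenames
```

Usage would be something like `coverage(walk_names("K:\\"))`, printing each limit with its percentage.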
The share of names under 48 bytes does drop to around 41.2% if the scope is limited to a directory of videos downloaded from YouTube (roughly 5k files), where the downloaders tend to use the video title as the filename; IME this is atypical of general file naming.
If the scope of the "K:" list is extended to stat file extensions, the top file extensions are (descending order):
h, c, html, txt, xml, png, s, o, gif, hpp, S, d, gz, cc, in, dts, ...
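The extension ranking can be sketched the same way. This is only an illustration (`top_extensions` is a hypothetical helper); it assumes the extension is whatever follows the last dot, and it keeps case, since "s" and "S" show up as separate entries in the list above.

```python
import os
from collections import Counter

def top_extensions(names, n=16):
    """Return the n most common extensions, most frequent first."""
    counts = Counter()
    for name in names:
        ext = os.path.splitext(name)[1]   # ".gz" for "x.tar.gz", "" for "README"
        if len(ext) > 1:                  # skip names with no extension or a bare trailing dot
            counts[ext[1:]] += 1          # drop the leading dot, keep case
    return [ext for ext, _count in counts.most_common(n)]
```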
Checking on my "E:" drive:
Under 48: 98.86%
Top file extensions for the "E:" drive:
c, h, wav, html, png, txt, jpg, java, cpp, gif, js, o, class, obj, s, bmp
Checking on my "C:" drive:
Under 32: 76.80%
Under 48: 84.49%
Top file extensions on C:
h, png, dll, manifest, html, cat, xml, c, mui, gz, hpp, py, pyc, cpp, exe, java.
Drive file count: around 3 million.
So, seemingly, a 48-byte name limit does in fact cover most files, and the remaining 1-15% can fall back to an LFN-like encoding, where the next sizes up are 100 and 160 bytes, which give roughly 100% coverage across all files.
But, in terms of dirent size + name size, a 64 byte directory entry with a 48 byte name seemed best "on average".
A 128 byte dirent with 112 byte name would have already had ~100% coverage, but this also wastes roughly half the bytes in the average case (so, more space efficient to go smaller).
Meanwhile, a 32 byte dirent with a 16 byte base name would have a significantly lower hit rate, and would also have been less space-efficient for encoding LFNs (so more space-efficient to go bigger).
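The 64- versus 128-byte comparison can be checked with some rough arithmetic. The sketch below is an estimate, not the filesystem's actual accounting: it assumes a name that does not fit costs exactly two directory entries (names past 100 bytes would cost more, so this is a lower bound), and `avg_dirent_bytes` is a made-up helper. The 32-byte case is left out because its LFN overhead is not spelled out here.

```python
def avg_dirent_bytes(entry_size, fit_pct, overflow_entries=2):
    """Average directory bytes per file.

    fit_pct is the percentage of names that fit in a single entry;
    the remainder are assumed to take overflow_entries entries each.
    """
    fit = fit_pct / 100.0
    return entry_size * (fit + (1.0 - fit) * overflow_entries)

small_k = avg_dirent_bytes(64, 99.31)   # "K:" drive coverage at 48 bytes
small_c = avg_dirent_bytes(64, 84.49)   # "C:" drive coverage at 48 bytes
large = avg_dirent_bytes(128, 100.0)    # 112-byte name, ~100% coverage

print(f"64B dirent, K: stats: {small_k:.2f} bytes/file")
print(f"64B dirent, C: stats: {small_c:.2f} bytes/file")
print(f"128B dirent:          {large:.2f} bytes/file")
```

Under those assumptions the 64-byte layout comes out at a bit over 64 bytes per file on the "K:" numbers and around 74 on the "C:" numbers, against a flat 128.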
Can note that otherwise, my experimental filesystem uses an inode table, and each directory entry points to an inode (which holds a block indirection table and other things). In this case, the scheme for encoding block numbers is similar to EXT2 (currently with a 256 byte inode size; with very small files and symlinks able to be stored entirely inside the inode).
Current theoretical volume size limits:
128PB with 48-bit LBA (limit of existing storage devices)
256PB with 1K blocks
1EB with 4K blocks
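These figures work out if block numbers are 48 bits wide and the units are binary (PB = 2^50 bytes, EB = 2^60 bytes). The 48-bit block-number width and the 512-byte sector size for the LBA line are inferred from the numbers rather than stated, so treat this as a consistency check only.

```python
PIB = 1 << 50   # binary petabyte
EIB = 1 << 60   # binary exabyte

def volume_limit_bytes(block_size, addr_bits=48):
    """Largest addressable volume: 2**addr_bits blocks of block_size bytes."""
    return (1 << addr_bits) * block_size

lba_limit = volume_limit_bytes(512)    # 48-bit LBA, assuming 512-byte sectors
limit_1k = volume_limit_bytes(1024)    # 1K blocks
limit_4k = volume_limit_bytes(4096)    # 4K blocks

print(lba_limit // PIB, "PB")   # 128 PB
print(limit_1k // PIB, "PB")    # 256 PB
print(limit_4k // EIB, "EB")    # 1 EB
```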
It also has a few structural properties in common with NTFS, although the design does prioritize not being needlessly complicated. Currently does not have journaling as this was not worthwhile in my current use cases.
Janis
[...]