On 5/7/2025 8:45 AM, Scott Lurndal wrote:
BGB <cr88192@gmail.com> writes:
On 5/6/2025 6:19 PM, Janis Papanagnou wrote:
On 06.05.2025 20:01, BGB wrote:
[...]
The partial rationale here being that the directory entries in this case
were fixed size (like FAT, albeit with longer names), and this could
potentially make the difference between using a single directory entry
or needing a more complex LFN style scheme. Though, in this case, the
default name length is 48, and it is rare for a filename to not fit into
48 bytes.
You mean rare in your application areas?
This appears to me like a very conservative size. While I'd agree
that it's probably a sensible value for one's own files with explicitly
chosen file names, a lot of files that are downloaded regularly do
have longer file names. A quick check of my "Documents" directory
(which contains both downloaded files and my own files) shows a ratio
of 1563:629, i.e. roughly 30% of "document" type files with name
lengths > 48 (there are no files with a file name length > 128).
I recall someone here recently spoke about choosing lengths of 255
(or some such) for file names, which seems to be plenty, OTOH.
>
Running quick/dirty stats of everything on my "K:" drive: roughly 2
million files of assorted types.
Stats (file names less than N bytes):
16: 66.40%
24: 87.85%
32: 95.38%
48: 99.31%
Are you considering the entire path, or just the final component?
Here, only the final component:
"whatever.txt"
Along with each directory, counted on the lines where the directory itself is the final component.
The directory names are not counted again for every line beneath them though, as this would over-count them. So, effectively, only the final name component of each line is counted.
The list of drive contents in this case was generated in WSL via the "find" command.
These component names are what the filesystem actually stores.
Combined path length can be somewhat longer.
A traditional limit is 260 chars though.
There wasn't a well-defined path-length limit in my projects, though informally it was typically somewhere between 256 and 768 bytes.
It is rare to see a path over 256 bytes, but "if doing it properly" a consistent length limit of, say, 512 or 768 would make sense. Going any bigger is likely overkill.
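For reference, roughly this sort of thing can tally such stats from the "find" output (an illustrative sketch, not necessarily the exact method used; it buckets the length of each line's final component against the same thresholds as above):

/* Tally name-length stats: read "find" output from stdin, take the
 * final path component of each line, bucket its length. */
#include <stdio.h>
#include <string.h>

int main(void) {
    static const int thr[] = { 16, 24, 32, 48 };
    long hits[4] = { 0 }, total = 0;
    char line[4096];

    while (fgets(line, sizeof(line), stdin)) {
        line[strcspn(line, "\n")] = 0;      /* strip trailing newline */
        char *base = strrchr(line, '/');    /* final path component */
        base = base ? base + 1 : line;
        size_t len = strlen(base);
        if (!len) continue;                 /* skip empty components */
        total++;
        for (int i = 0; i < 4; i++)
            if (len < (size_t)thr[i]) hits[i]++;
    }
    for (int i = 0; i < 4; i++)
        printf("%2d: %6.2f%%\n", thr[i],
            total ? 100.0 * hits[i] / total : 0.0);
    return 0;
}

Pipe the "find" output into it and it prints cumulative percentages for the same thresholds (16/24/32/48).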
Full paths will frequently exceed 48 bytes; this, for example,
is 142 bytes.
/work/music/Blues/Howlin' Wolf/The Chess Box (1963 - 1973) (disc 3)/20 - The Red Rooster (London Sessions w false start and dialog) (1970).wav
We don't usually store full paths in a filesystem, as each directory exists as its own entity.
So, say, if broken down (each count here including the leading '/'):
5 chars ("work")
6 chars ("music")
6 chars ("Blues")
13 chars ("Howlin' Wolf")
37 chars ("The Chess Box (1963 - 1973) (disc 3)")
75 chars (the final file name)
Longest name counted here would be 75.
Can note that in my experimental filesystem, directory contents were stored in a structure resembling an AVL tree.
It is not all that uncommon for directories to get larger than a limit where linear search is inefficient.
Hashing works, but it adds a constant space cost, and a given hash-table size only makes sense over a certain range of directory sizes (outside that range, it merely divides the linear search length by a constant).
Though, if one assumes 99% of directories have fewer than 1024 files and the target search length is 16 or less, one could just use a 64-entry hash table and call it done. But a 64-entry hash table merely wastes space for small directories.
Though, for a 64B dirent, a 64-entry hash table (at 4 bytes per entry) would eat 4 dirents, so hashed directories could still fit in a single 1K disk block if the directory has fewer than 11 files (also semi-common).
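As a purely hypothetical sketch of what such a 64-byte dirent might look like (only the 64-byte entry size and 48-byte name field follow from the above; the rest of the field layout is made up for illustration):

#include <stdint.h>

struct dirent64 {
    char     name[48];   /* name, NUL-padded; 48 bytes covers ~99% of names */
    uint32_t inode;      /* file/inode reference (hypothetical) */
    uint16_t left;       /* tree left-child dirent index (hypothetical) */
    uint16_t right;      /* tree right-child dirent index (hypothetical) */
    uint16_t flags;      /* entry type / long-name continuation (hypothetical) */
    uint8_t  depth;      /* subtree depth or balance info (hypothetical) */
    uint8_t  resv[5];    /* pad out to 64 bytes */
};
_Static_assert(sizeof(struct dirent64) == 64, "dirent must be 64 bytes");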
B-Trees are arguably the more efficient option for large directories, but B-Trees are also more complicated and would negatively affect the efficiency of small directories (if also used for small directories).
So, effectively, I had rejected both the EXTn and NTFS approaches here.
Well, and also rejected EXTn's use of variable-length directory entries.
Modified AVL trees seemed like the most efficient option for the sizes of directories considered.
Small N (16 or less): No significant overhead compared with linear search.
Medium N (16 to 1024): Comparable efficiency to what B-Trees would provide.
Over 1024: B-Trees would overtake AVL trees for efficiency, but very few directories have more than 1024 items (and one still has an average-case search length of 10 blocks in this case).
The main modification to AVL trees here was that I had loosened the balancing constraints to +/- 2, as the looser constraint significantly reduces the number of node rotations needed when inserting items, while not significantly increasing average search length (or tree depth).
Though, in a theoretical worst case, a +/- 2 tree could have 2x the depth of a +/- 1 tree. This would likely only happen if items were inserted in a rather contrived ordering.
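As an illustrative in-memory sketch of the loosened balancing rule (node layout and names are made up here; an on-disk version would use dirent indices rather than pointers, and a single rotation pass like this reduces rather than strictly restores the imbalance):

#include <stdlib.h>
#include <string.h>

#define BAL_LIMIT 2                 /* loosened constraint: +/- 2 */

typedef struct tnode {
    char name[48];                  /* fixed-size lookup key */
    struct tnode *l, *r;
    int h;                          /* subtree height */
} tnode;

static int height(tnode *n) { return n ? n->h : 0; }
static int imax(int a, int b) { return a > b ? a : b; }
static void fixh(tnode *n) { n->h = 1 + imax(height(n->l), height(n->r)); }

static tnode *rot_right(tnode *n) { /* n->l becomes the new subtree root */
    tnode *p = n->l;
    n->l = p->r; p->r = n;
    fixh(n); fixh(p);
    return p;
}
static tnode *rot_left(tnode *n) {  /* n->r becomes the new subtree root */
    tnode *p = n->r;
    n->r = p->l; p->l = n;
    fixh(n); fixh(p);
    return p;
}

static tnode *rebalance(tnode *n) {
    int bf = height(n->l) - height(n->r);
    if (bf > BAL_LIMIT) {           /* too left-heavy */
        if (height(n->l->l) < height(n->l->r))
            n->l = rot_left(n->l);  /* left-right case */
        n = rot_right(n);
    } else if (bf < -BAL_LIMIT) {   /* too right-heavy */
        if (height(n->r->r) < height(n->r->l))
            n->r = rot_right(n->r); /* right-left case */
        n = rot_left(n);
    }
    return n;                       /* no rotation while |bf| <= 2 */
}

tnode *tree_insert(tnode *n, const char *name) {
    if (!n) {
        n = calloc(1, sizeof(*n));
        strncpy(n->name, name, sizeof(n->name) - 1);
        n->h = 1;
        return n;
    }
    int c = strcmp(name, n->name);
    if (c < 0)      n->l = tree_insert(n->l, name);
    else if (c > 0) n->r = tree_insert(n->r, name);
    fixh(n);
    return rebalance(n);
}

Compared with a strict AVL, the only change is that rotations are skipped while the height difference stays within +/- 2, which is where most of the rotation savings come from.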
In the design, I did make the "possibly controversial" decision that the tree-walking only works with 48-byte names.
For longer names, the "short name" is a modified form:
First 40 bytes: The first part of the name;
Last 8 bytes: A hash value of the full name.
Lookup happens with this shorter name, with the final dirent holding the rest of the name (which is spread across multiple dirents, sort of like in VFAT). This modified name is never exposed outside of the FS driver itself though.
Though, this may be indirectly exposed in that calls like readdir would no longer report longer names in sorted order if the difference does not occur within the first 40 characters (the hash does not respect collation ordering).
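Roughly, the 48-byte lookup key could be built something like this (the actual hash function isn't specified above; 64-bit FNV-1a is used purely as a stand-in):

#include <stdint.h>
#include <string.h>

static uint64_t fnv1a64(const char *s) {
    uint64_t h = 0xcbf29ce484222325ull;
    while (*s) { h ^= (uint8_t)*s++; h *= 0x100000001b3ull; }
    return h;
}

/* Builds the 48-byte key used for tree lookups (not NUL-terminated). */
void make_short_name(char key[48], const char *name) {
    size_t len = strlen(name);
    memset(key, 0, 48);
    if (len <= 48) {
        memcpy(key, name, len);       /* short name: stored directly */
    } else {
        uint64_t h = fnv1a64(name);
        memcpy(key, name, 40);        /* first 40 bytes of the name */
        memcpy(key + 40, &h, 8);      /* last 8 bytes: hash of the full name */
    }
}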
Generally, the filesystem assumes an overall name-length limit of 256. It just opts for the smaller storage when the name is 48 bytes or less (which is true most of the time).
Though, FWIW, my FAT32/VFAT driver had also ignored the SFN when LFNs were present; rather than using the Windows-style handling for LFNs, it essentially just turns the SFN into a Base32 hash value and uses only the LFN as the name. The driver would still use the SFN if the name could be exactly represented in 8.3 form (thus, LFN entries being unnecessary).
Seemingly, modern Windows doesn't notice/care that the 8.3 name is merely a hash (and, in contexts where an 8.3 name does appear, it seems to be synthetic).
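A rough sketch of the hashed-SFN idea (the exact hash and how it gets packed into the 8.3 fields are assumptions here; this just encodes 40 bits of a stand-in hash as 8 Base32 characters for the name part and keeps up to 3 extension characters):

#include <stdint.h>
#include <string.h>

static const char b32[] = "0123456789ABCDEFGHIJKLMNOPQRSTUV";

static uint64_t hash_name(const char *s) {   /* stand-in hash, not the real one */
    uint64_t h = 0xcbf29ce484222325ull;
    while (*s) { h ^= (uint8_t)*s++; h *= 0x100000001b3ull; }
    return h;
}

/* Fills an 11-byte "NAME    EXT" style SFN field from a long name. */
void make_hashed_sfn(char sfn[11], const char *lfn) {
    uint64_t h = hash_name(lfn);
    memset(sfn, ' ', 11);
    for (int i = 0; i < 8; i++) {            /* 8 Base32 chars = 40 bits of hash */
        sfn[i] = b32[h & 31];
        h >>= 5;
    }
    const char *dot = strrchr(lfn, '.');     /* keep up to 3 extension chars */
    if (dot && dot[1]) {
        for (int i = 0; i < 3 && dot[1 + i]; i++) {
            char c = dot[1 + i];
            sfn[8 + i] = (c >= 'a' && c <= 'z') ? c - 32 : c;
        }
    }
}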
For now, though, the project still mostly uses FAT32...