On 5/7/2025 1:57 PM, Scott Lurndal wrote:
BGB <cr88192@gmail.com> writes:
On 5/7/2025 8:45 AM, Scott Lurndal wrote:
BGB <cr88192@gmail.com> writes:
On 5/6/2025 6:19 PM, Janis Papanagnou wrote:
On 06.05.2025 20:01, BGB wrote:
[...]
These component names are what the filesystem actually stores.
>
>
Combined path length can be somewhat longer.
A traditional limit is 260 chars though.
>
There wasn't a well defined path-length limit in my projects, though
informally typically somewhere between 256 and 768 bytes.
>
It is rare to see a path over 256, but, "if doing it properly" a
consistent length limit of, say, 512 or 768 would make sense. Going any
bigger is likely needlessly overkill.
>
>
Full paths will frequently exceed the 48 bytes; this, for example,
is 142 bytes.
>
/work/music/Blues/Howlin' Wolf/The Chess Box (1963 - 1973) (disc 3)/20 - The Red Rooster (London Sessions w false start and dialog) (1970).wav
>
We don't usually store full paths in a filesystem, as each directory
exists as its own entity.
>
So, say, if broken down (each count includes the leading '/'):
5 chars ("work")
6 chars ("music")
6 chars ("Blues")
13 chars ("Howlin' Wolf")
37 chars
75 chars
>
Longest name counted here would be 75.
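For reference, a quick throwaway sketch of how to reproduce this kind of breakdown (counting the leading '/' with each component, which is how the numbers above add up to 142; the path literal is elided here, paste in the full path from above):

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      const char *path = "/work/music/...";    /* full path elided */
      const char *p = path;
      size_t total = 0;

      while (*p == '/')
      {
          const char *q = strchr(p + 1, '/');  /* end of this component */
          size_t n = q ? (size_t)(q - p) : strlen(p);
          printf("%2zu chars: %.*s\n", n, (int)(n - 1), p + 1);
          total += n;
          p += n;
      }
      printf("total: %zu\n", total);
      return 0;
  }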
It doesn't matter from the application perspective; an
application needs to be written to support the largest
path length an implementation allows (e.g. POSIX PATH_MAX).
Very few applications actually walk directories, and thus
they don't particularly notice per-directory limits.
As noted, I was talking about filesystem file-name lengths though, not overall path lengths.
The filesystem driver does do directory walking, so this is what I was writing about here.
POSIX has NAME_MAX as the maximum size that an implementation
supports (POSIX also defines the minimum value that an
implementation must support, which is 14 for NAME_MAX).
14 bytes was the maximum filename length for the original unix
filesystems (later known as S5).
$ getconf -a | grep NAME_MAX
NAME_MAX 255
_POSIX_NAME_MAX 255
256 is a typical limit for this. I was also using 256.
$ getconf -a | grep PATH_MAX
PATH_MAX 4096
_POSIX_PATH_MAX 4096
IMO though, 4096 is a little overkill.
The traditional Windows path limit of 260 is a little small.
As noted, I would be more inclined towards 512 or 768 here, as that is "plenty long" without being so long as to become overly wasteful or problematic.
>
It is not all that uncommon for directories to grow past the
point where linear search becomes inefficient.
Large directories are not generally considered a good
idea for multiple reasons.
More stats (from the gathered drive dumps).
Files in a directory:
Under 10: 83.10%
Under 16: 88.74%
Under 50: 96.65%
Under 100: 98.81%
Under 500: 99.91%
Under 1000: 99.96%
Under 2000: 99.99%
So, the vast majority of directories have a relatively small number of files.
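For what it's worth, a minimal sketch of how per-directory counts like these could be gathered on a POSIX system (the bucket thresholds mirror the table above; this is an illustration, not the tool actually used for the drive dumps):

  #include <stdio.h>
  #include <string.h>
  #include <dirent.h>
  #include <sys/stat.h>

  static long bucket_lim[7] = { 10, 16, 50, 100, 500, 1000, 2000 };
  static long bucket_hit[8];               /* last slot: 2000 or more */
  static long total_dirs;

  static void count_dir(const char *path)
  {
      DIR *d = opendir(path);
      struct dirent *de;
      long n = 0;
      int i;

      if (!d)
          return;

      while ((de = readdir(d)) != NULL)
      {
          char sub[4096];
          struct stat st;

          if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
              continue;
          n++;

          snprintf(sub, sizeof(sub), "%s/%s", path, de->d_name);
          if (lstat(sub, &st) == 0 && S_ISDIR(st.st_mode))
              count_dir(sub);              /* recurse into subdirectory */
      }
      closedir(d);

      total_dirs++;
      for (i = 0; i < 7 && n >= bucket_lim[i]; i++)
          ;
      bucket_hit[i]++;
  }

  int main(int argc, char **argv)
  {
      long running = 0;
      int i;

      count_dir(argc > 1 ? argv[1] : ".");

      for (i = 0; i < 7; i++)
      {
          running += bucket_hit[i];
          printf("Under %4ld: %6.2f%%\n", bucket_lim[i],
                 100.0 * running / total_dirs);
      }
      return 0;
  }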
Most of the directories are either in the domain where linear search is fastest (but the added cost of an AVL tree walk does not matter), or in the domain where AVL is the fastest option.
The main area where a B-Tree reaches break-even covers less than 0.04% of directories, and the area where a B-Tree's advantages are "actually relevant" covers less than 0.01%.
And, if the minimal B-Tree takes more than a single disk block, or is much more expensive than a simple linear search over 10 or so items, it has already lost...
At best, one would end up with most of the directories consisting of a single B-Tree leaf block.
A typical B-Tree has a few drawbacks here:
Needs to duplicate keys between leaf and node blocks;
Leaf blocks are, on average, only around 50% full.
Splitting a leaf means at least 3 blocks;
...
The argued merit is that a smaller number of nodes need to be checked, say, log16(n) rather than log2(n), ... But, this effect only really starts to come into play once 'n' gets bigger.
Similarly, most of the directories are effectively in a "zone of suckage" as far as B-Tree efficiency goes, and in this case one can't push the K factor far enough to move them out of that zone.
Where, for small n and small K, the range:
K < n < 0.5*K*K
is an area where B-Trees actually kinda suck.
They mostly start to pay off when n >= K*K.
Almost implicitly though, people assume n > K*K.
Well, but also you want to keep logK(n) small,
which is the usual argument against AVL trees and similar.
Say, log2(n) > log16(n) or similar.
Well, and not n<=K, where it is effectively just a flat array...
I guess a similar "zone of suckage" exists for AVL trees, but in this case it is a tree with 2 items, and we can mostly ignore it (it is functionally equivalent to a 2-element linked list). So, I guess, it is mostly a case of: we quickly blow past this first limit, but n is small enough that we don't eat the cost of the 'log2(n)' tree-walking part. (Though log2(1024) = 10, which is crossing back into "starting to suck again" territory, where log16(1024) = 2.5 is still good; but that doesn't matter as much if it is 0.04% of the directories...)
One could argue for maybe having multiple types of directory (like ext2 and friends do), but this adds complexity.
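As a rough sketch of what that sort of selection could look like in a directory-handling path, following the break-even reasoning above (the K=16 fanout and exact thresholds are illustrative assumptions; with K=16 the bad zone is roughly 16 < n < 128, with pay-off starting somewhere around n >= 256):

  /* Illustrative only: pick an index strategy from the entry count n,
     assuming a B-Tree style fanout of K=16. */
  enum dir_index { DIRIDX_LINEAR, DIRIDX_AVL, DIRIDX_BTREE };

  static enum dir_index pick_dir_index(long n)
  {
      const long K = 16;

      if (n <= K)            /* fits one leaf block: flat array / linear scan */
          return DIRIDX_LINEAR;
      if (n < K * K)         /* B-Tree overhead not yet paid for */
          return DIRIDX_AVL;
      return DIRIDX_BTREE;   /* n >= K*K: log16(n) walks start to win */
  }

Per the stats above, the DIRIDX_BTREE case would be hit by well under 0.04% of the directories.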
...
Statistical average file size seems to be somewhere around 256K ( ~ total_space_used / total_files_on_drive).
Though, I suspect the actual distribution is more like: most files are under 100K, but a significant minority of files are considerably larger.
Like, with video files:
There are comparatively few of them, all things said, but they are often ~ 100-200 MB each (so they can eat lots of HDD space despite being small in number).
Can note, the FS design I had ended up using a block allocation strategy like:
Uncompressed file with < 2^32 disk blocks (32-bit block numbers):
16 entries: direct block numbers (16K with 1K blocks);
8 entries: indirect blocks (2MB with 1K blocks);
4 entries: double indirect (256MB with 1K blocks);
2 entries: triple indirect (128GB);
1 entry: quad indirect (16TB);
1 entry: N/A
(would be 5x indirect, but bigger than the volume size, so is N/A)
Uncompressed file with more than 2^32 blocks (64-bit block numbers):
8 entries: direct block numbers (8K with 1K blocks);
4 entries: indirect blocks (512K with 1K blocks);
1 entry: double indirect (16MB);
1 entry: triple indirect (2GB);
1 entry: quad indirect (256GB);
1 entry: 5x indirect (32TB)
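To make that a bit more concrete, roughly what the block-pointer area of such an inode could look like (field names and the union framing are mine, not the actual on-disk structures; the entry counts follow the lists above):

  #include <stdint.h>

  struct blkptr32 {           /* uncompressed, < 2^32 disk blocks */
      uint32_t direct[16];    /* 16 direct block numbers            */
      uint32_t indir1[8];     /*  8 indirect blocks                 */
      uint32_t indir2[4];     /*  4 double-indirect blocks          */
      uint32_t indir3[2];     /*  2 triple-indirect blocks          */
      uint32_t indir4;        /*  1 quad-indirect block             */
      uint32_t resv;          /*  1 unused (5x exceeds volume size) */
  };

  struct blkptr64 {           /* uncompressed, 64-bit block numbers */
      uint64_t direct[8];     /*  8 direct block numbers            */
      uint64_t indir1[4];     /*  4 indirect blocks                 */
      uint64_t indir2;        /*  1 double-indirect block           */
      uint64_t indir3;        /*  1 triple-indirect block           */
      uint64_t indir4;        /*  1 quad-indirect block             */
      uint64_t indir5;        /*  1 5x-indirect block               */
  };

  union blkptr {              /* both variants come out to 128 bytes */
      struct blkptr32 b32;
      struct blkptr64 b64;
  };

Conveniently, both layouts occupy the same 128 bytes, so they could overlay the same area of the inode.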
Though, theoretical max file size goes up to 4PB here with a 4K block size. There is another indirection table that also allows for larger maximum file size, but would require the use of a 512 byte inode.
Compressed files use a similar structure, with 64-bit block numbers, although the final leaf block size is 16x larger (So, 16K for 1K blocks, or 64K for 4K blocks). This is partly based on the mechanism used for file compression (it compresses each logical file block, and stores it in 1 to 16 disk blocks).
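A rough sketch of what storing one compressed logical block could look like under that scheme (compress_block() and write_disk_blocks() are placeholder names, not the actual driver API):

  #include <stdint.h>

  /* placeholders for whatever the real driver uses */
  int compress_block(const uint8_t *src, int ssz, uint8_t *dst, int dmax);
  int write_disk_blocks(const uint8_t *buf, int nblk);

  #define DISK_BLK  1024
  #define LOG_BLK   (16 * DISK_BLK)  /* 16K logical block with 1K disk blocks */

  static int store_logical_block(const uint8_t *data, uint8_t *tmp)
  {
      /* try to compress the logical block into a temporary buffer */
      int csz = compress_block(data, LOG_BLK, tmp, LOG_BLK);

      if (csz <= 0 || csz >= LOG_BLK)
          return write_disk_blocks(data, 16);  /* incompressible: store raw */

      /* round up to whole disk blocks; uses 1..16 of them */
      return write_disk_blocks(tmp, (csz + DISK_BLK - 1) / DISK_BLK);
  }

The "store raw if it doesn't compress" fallback is my assumption; the description above only says each logical block is compressed and stored in 1 to 16 disk blocks.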
Can note that the general allocation of disk blocks works in a superficially similar way to the Linux EXTn (ext2/ext3/ext4) filesystems.
Different tradeoffs exist, but generally block sizes between 1K and 4K seem to make the most sense. Larger block sizes could be possible, but need end-packing to not waste a lot of space.
Small files on FAT32 volumes are often a bad case as FAT32 tends to default towards fairly large cluster sizes, which end up wasting a significant amount of space (when a statistical majority of the files don't even fill a single disk cluster).
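As a toy illustration of how that slack adds up (the 32K cluster size is a common FAT32 default on larger volumes; the file size and count are just example numbers):

  #include <stdio.h>

  int main(void)
  {
      long long cluster = 32 * 1024;   /* example FAT32 cluster size */
      long long filesz  = 8 * 1024;    /* example small-file size    */
      long long nfiles  = 100000;      /* example file count         */

      /* space actually consumed: file size rounded up to a whole cluster */
      long long used  = ((filesz + cluster - 1) / cluster) * cluster;
      long long slack = used - filesz;

      printf("per-file slack: %lld KB, total: ~%lld MB\n",
             slack / 1024, slack * nfiles / (1024 * 1024));
      return 0;
  }

With these example numbers, each file wastes 24K of slack, or roughly 2.3 GB across 100000 files.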
Though, when using NTFS, I often end up enabling file compression for most of the drive, as in many contexts enabling file compression actually makes stuff faster.
Though, from what I gather, NTFS file compression works file-at-a-time, whereas my experimental FS's compression works in terms of blocks.
...