Subject : Re: IFS=$'\n'
From : mortonspam (at) *nospam* gmail.com (Ed Morton)
Newsgroups : comp.unix.shell
Date : 15 Aug 2024, 13:30:18
Organisation : A noiseless patient Spider
Message-ID : <v9kosa$udgv$1@dont-email.me>
References : 1 2 3 4 5
User-Agent : Mozilla Thunderbird
On 8/14/2024 5:55 PM, Ralf Damaschke wrote:
> Christian Weisgerber wrote:
>> If sufficiently many files accrue, find(1) will invoke ls(1) several
>> times, which will not produce the expected result. That may be unlikely
>> in this specific example, but it can happen in the general case.
>>
>> Wait, you say, xargs(1) will also split its input across multiple
>> invocations. I mean, that's very much the point of xargs. Which is why
>> Helmut added the -x flag, which is supposed to prevent this behavior.
> I see the point, but I hope I never meet a use case that says
> "do something with the files found, but throw the list away if it can't
> be done all at once". I would rather first assemble the list, try to
> execute the command with it and if needed switch to some different
> approach of handling the files.
Needing to process all of the files at once happens more often than you might think. For example, to merge CSVs we need to retain the header line from only the first file read, so the naive approach would be:
    find . -type f -name '*.csv' -exec awk 'NR==1; FNR>1' {} +
but that would fail if `find` had to split the file list across several invocations of awk: `NR==1` would then be true once per invocation, so the header lines from multiple files would be printed. The solution is something like (untested):
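    awk -v RS='\0' '
        # stdin (the - operand) is read first, with RS set to NUL;
        # RS is reset to newline before the queued files are read
        NR == FNR { ARGV[ARGC++] = $0; next }   # queue each name as an input file
        (FNR == 1) && !doneHdr++                # header from the first file only
        FNR > 1                                 # data lines from every file
    ' - RS='\n' < <(find . -type f -name '*.csv' -print0)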
We have to read the output of `find` in `awk` to populate `ARGV[]`, rather than passing that output to `awk` as an argument list, because if the list is so long that `find` has to split it up in the first script above, then it is also too long to pass to `awk` as a single argument list. Having `-print0` is obviously useful in that situation, since the names arrive unambiguously even if they contain newlines.
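To make the difference concrete, here is a small sketch that can be run in a scratch directory. The file names a.csv and b.csv and their contents are made up for illustration, and the `for` loop merely stands in for `find` splitting a long list into batches. To stay portable across awks it feeds newline-separated names on stdin, which assumes no file name contains a newline; the `-print0` and `RS='\0'` variant above is the one to use for arbitrary names.

```shell
# Hypothetical sample data, for illustration only.
tmp=$(mktemp -d)
printf 'id,name\n1,a\n' > "$tmp/a.csv"
printf 'id,name\n2,b\n' > "$tmp/b.csv"

# Two separate awk processes, standing in for two batches from find:
# NR restarts at 1 in each process, so the header is printed twice.
batched=$(for f in "$tmp/a.csv" "$tmp/b.csv"; do awk 'NR==1; FNR>1' "$f"; done)

# One awk process that reads the names from stdin (the - operand) and
# appends them to ARGV, so every file is handled by the same process.
# Assumption: names contain no newlines (newline-separated list).
merged=$(
  find "$tmp" -type f -name '*.csv' | sort |
  awk '
    NR == FNR { ARGV[ARGC++] = $0; next }   # stdin: queue each name as an input file
    FNR == 1 && !doneHdr++                  # header: first line of the first file only
    FNR > 1                                 # data lines of every file
  ' -
)

printf 'batched:\n%s\n\nmerged:\n%s\n' "$batched" "$merged"
rm -rf "$tmp"
```

The batched run repeats the `id,name` header before each file's data, while the merged run prints it once, because no argument list is ever built on the command line.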
Regards,
Ed.