Subject: Re: [gawk] Handling variants of CSV input data formats
From: janis_papanagnou+ng (at) *nospam* hotmail.com (Janis Papanagnou)
Newsgroups: comp.lang.awk
Date: 26. Aug 2024, 13:54:04
Organization: A noiseless patient Spider
Message-ID: <vahttd$2f666$1@dont-email.me>
References: 1 2
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0

On 26.08.2024 13:26, Ed Morton wrote:
> On 8/25/2024 1:00 AM, Janis Papanagnou wrote:
>> Myself I'm usually not using CSV format(s), but recently I advertised
>> GNU Awk (given that newer versions support CSV data processing) to a
>> friend seeking CSV solutions.
>>
>> I was quite astonished when I stumbled across a StackOverflow article
>> about CSV processing with contemporary versions of GNU Awk and read
>> that you are restricted to comma as separator and double quotes to
>> enclose strings. The workarounds provided at SO were extremely clumsy.
>>
>> Given that using ',', ';', '|' (or other delimiters) and also various
>> types of quotes are just a lexical (no functional) difference I wonder
>> whether it would be sensible to be able to define them, say, through
>> setting a PROCINFO element?
>>
>> Janis
>>
>> https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk
>
> FYI gawk just inherited those behaviors (plus mandatory stripping of the
> quotes from quoted fields, see
> https://lists.gnu.org/archive/html/bug-gawk/2023-11/msg00018.html) from
> Kernighan's awk.

Thanks.

My opinion on this is that I wouldn't expect GNU Awk to become yet
another CSV processor. What is very convenient is easy input of CSV
data, so that it can be processed like other tabular data with Awk. So
removal of the (outer) quotes, transforming "inner" quotes of fields
according to the CSV standard(s), and handling the escape symbol would
meet my expectations. (I don't need CSV output formatting, but I
understand if there is such a demand.)
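
For those who do want CSV output, a minimal sketch of the usual quoting rule in plain awk: a field is wrapped in double quotes only if it contains a comma, a double quote, or a newline, and embedded double quotes are doubled. The names to_csv and csv_quote are made up for this illustration, and tab-separated input is merely an assumption to have something to convert.

```shell
# to_csv: hypothetical helper; converts tab-separated stdin to comma-separated CSV.
to_csv() {
    awk -F'\t' '
    # Quote a field only when it needs it; double any embedded double quotes.
    function csv_quote(s) {
        if (s ~ /[",\n]/) {
            gsub(/"/, "\"\"", s)
            s = "\"" s "\""
        }
        return s
    }
    {
        line = ""
        for (i = 1; i <= NF; i++)
            line = line (i > 1 ? "," : "") csv_quote($i)
        print line
    }'
}

# Usage (fields separated by tabs on input):
#   printf 'a\tb,c\n' | to_csv
```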

Flexible support for (at least the typical) field separators is IMO
more pressing. Whether that is done through PROCINFO[] (as I suggested
ad hoc above) or through FS is only a detail.

But given the current implementation, this warning

    $ awk -F';' --csv '...' data.csv
    awk: warning: assignment to FS/FIELDWIDTHS/FPAT has no effect when using --csv

indicates that some FS consistency logic already exists. So instead of
introducing another PROCINFO attribute it would probably be more
obvious, and appear more consistent to the user, to re-use FS for
defining the CSV field delimiter.

And we also don't need to define the quotes explicitly through PROCINFO,
since the quotes could be identified (and handled) implicitly anyway.

Or would there be any issues if data like

    "Hi there!",42,'Hello "world"?',"Ed's post",3.14

were provided (in $1..$5) as follows?

    Hi there!,42,Hello "world"?,Ed's post,3.14
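
To make the idea concrete, here is a minimal sketch in plain awk of such implicit quote handling. It is not something gawk does today; the names csv_fields and csv_split are made up for this illustration. It assumes a single-character separator, assumes that a field starting with " or ' is quoted with that character, treats a doubled quote character inside a quoted field as one literal quote, and ignores fields that span lines.

```shell
# csv_fields SEP: hypothetical helper (not a gawk feature). Reads delimited
# lines on stdin and prints the unquoted fields of each line joined by "|".
csv_fields() {
    awk -v sep="$1" '
    # Split line into out[1..n] and return n. \047 is the single quote.
    function csv_split(line, out, sep,    n, i, len, c, q, field) {
        n = 0; i = 1; len = length(line)
        while (i <= len + 1) {
            field = ""
            c = substr(line, i, 1)
            if (c == "\"" || c == "\047") {
                q = c; i++                      # quoted field: remember its quote
                while (i <= len) {
                    c = substr(line, i, 1)
                    if (c == q) {
                        if (substr(line, i + 1, 1) == q) {
                            field = field q     # doubled quote: one literal quote
                            i += 2
                            continue
                        }
                        i++                     # closing quote
                        break
                    }
                    field = field c
                    i++
                }
                # ignore stray text between closing quote and separator
                while (i <= len && substr(line, i, 1) != sep) i++
            } else {
                while (i <= len && (c = substr(line, i, 1)) != sep) {
                    field = field c
                    i++
                }
            }
            out[++n] = field
            i++                                 # step over the separator
        }
        return n
    }
    {
        n = csv_split($0, f, sep)
        s = ""
        for (k = 1; k <= n; k++) s = s (k > 1 ? "|" : "") f[k]
        print s
    }'
}

# Usage:
#   csv_fields ',' < data.csv
#   csv_fields ';' < data.csv
```

The same function serves ',', ';' or '|' as separator, which is the point: delimiter and quote characters are only lexical details of the input.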

Where I do have a generally uneasy feeling is when locales influence
the processing. Personally I have set LC_NUMERIC=C.UTF-8, so my real
numbers use a decimal point for the fraction anyway, but if the
processed data uses a locale-specific number representation with
decimal commas then there might be issues (and not only with the CSV
commas).

I'd have liked to provide more concrete information here, but at the
moment I'm not even able to reproduce Awk's behavior as documented in
its manual. I've tried the following command with various locales

    $ echo 4,321 | LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }'
    -| 5,321

but always got just 5 as the result.
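
A guess at the cause, not verified on my system: as far as I read the gawk manual, gawk parses numeric input with a period as decimal point regardless of the locale, unless POSIX mode (--posix or POSIXLY_CORRECT) or the --use-lc-numeric option is in effect, and the manual's example sets POSIXLY_CORRECT first. A minimal sketch under that assumption follows. The helper name add_one is made up here, awk is assumed to be gawk for the second call, and the en_DK.utf-8 locale has to be installed, otherwise the C locale is silently used.

```shell
# add_one [awk-options]: hypothetical helper; adds 1 to the first field
# of each line read from stdin.
add_one() {
    awk "$@" '{ print $1 + 1 }'
}

# Default mode: only the period counts as decimal point, so "4,321" is
# read as 4 and the comma and everything after it is ignored:
#   echo 4,321 | LC_ALL=en_DK.utf-8 add_one
#
# Asking gawk to honor the decimal point of the locale for input data:
#   echo 4,321 | LC_ALL=en_DK.utf-8 add_one --use-lc-numeric
```

If that reading is right, it also means a decimal comma in the data and a comma as CSV separator can only be told apart by quoting.
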
Janis