Subject : Re: Experiences with match() subexpressions?
From : janis_papanagnou+ng (at) *nospam* hotmail.com (Janis Papanagnou)
Newsgroups : comp.lang.awk
Date : 10. Apr 2025, 12:55:07
Organisation : A noiseless patient Spider
Message-ID : <vt8bit$2uiq5$1@dont-email.me>
References : 1 2 3
User-Agent : Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
On 10.04.2025 13:08, Kenny McCormack wrote:
In article <vt7qs4$2gior$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
On 10.04.2025 09:06, Janis Papanagnou wrote:
I'm looking for subexpressions of regexp-matches using GNU Awk's
third parameter of match(). For example
>
data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
>
The result stored in 'arr' seems to be determined by the static
parenthesis structure, so with the pattern repetition {2,5} only
the last matched data in the subexpression (r3) seems to persist
in arr. - I suppose there's no cute way to achieve what I wanted?
>
To clarify; what I wanted is access of the values "r1", "r2", "r3",
and "e" through 'arr'.
I have to admit that I (still) don't really understand how this match third
arg stuff works.
I've never used that before but it seems to be quite simple; for every
parenthesis group expression in the regexp it provides (statically, as
the parentheses are written, from left to right) an array element with
the expanded matched subexpression.
I.e., I can never predict what will happen, so I always
just dump out the array and try to reverse-engineer it each time I need to
use it.
I adapted your code into the following test script:
--- Cut Here ---
#!/bin/sh
gawk 'BEGIN {
data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
for (i in arr) print i,arr[i]
}'
# To clarify; what I wanted is access of the values "r1", "r2", "r3",
# and "e" through 'arr'.
--- Cut Here ---
The output I get is:
--- Cut Here ---
0start 1
0length 18
3start 18
1start 11
2start 13
3length 1
2length 2
1length 5
The above output appears because 'arr' also stores additional elements
with the positions of the matches.
I don't need those, so I'm only interested in the data below and
iterate with an index-counted loop...
0 R=r1,R=r2,R=r3,E=e
the whole expression
1 R=r3,
the expression in the first parenthesis
2 r3
the expression in the second, embedded parenthesis
3 e
the expression in the final parenthesis
--- Cut Here ---
After playing around a bit, I could not come up with any sensible way of
getting what you want to get.
Yeah, Arnold just told me the same; that it's impossible because the
underlying GNU regexp library doesn't support what I'm looking for.
What I considered a possible workaround (in this case) is to write out
the (...){2,5} repetition as an explicit sequence of (...) and (...)?
groups. (But in the general case, for larger ranges than 2-5, that's
neither feasible nor sensible any more.)
As an alternative, it sounds like you could just split the
string on the comma; that would get you:
Yes, that was also how I did such things in the past. Only when I saw
that "third argument" to match() I hoped the two-level parsing could
be simplified into one step. The reason was that I thought I had seen
other languages (Perl, maybe?) that support such a feature.
R=r1
R=r2
R=r3
E=e
Or, for finer control, you could use patsplit().
I think I'll do the parsing the straightforward two-step way as I did
before the GNU Awk specific functions were available; it's probably
also the clearest way to program that functionality.
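Roughly, the two-step way could look like the sketch below. It uses
only split(), index() and substr(), so any POSIX awk should do; the
function name parse_two_step and the variable names are invented for
this example.

```sh
#!/bin/sh
# parse_two_step is a hypothetical helper name.
parse_two_step() {
    awk -v data="$1" 'BEGIN {
        # Step 1: split the record on commas.
        n = split(data, part, ",")
        # Mirror {2,5}: two to five R items plus the final E item.
        if (n < 3 || n > 6)
            exit 1
        # Step 2: split each item at its first "=" and check the key.
        for (i = 1; i <= n; i++) {
            eq = index(part[i], "=")
            key = substr(part[i], 1, eq - 1)
            value[i] = substr(part[i], eq + 1)
            if (key != (i < n ? "R" : "E") || value[i] == "")
                exit 1
        }
        # Print only after the whole record has been validated.
        for (i = 1; i <= n; i++)
            print value[i]
    }'
}

parse_two_step "R=r1,R=r2,R=r3,E=e"
```

One difference from the original regexp: E=(.+) would accept commas
inside the E value, while splitting on the comma first does not.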
Janis