Subject : Re: GNU Awk's types of regular expressions
From : arnold (at) *nospam* freefriends.org (Aharon Robbins)
Newsgroups : comp.lang.awk
Date : 01. Dec 2024, 21:20:22
Organisation : The Friends of Rational Range Interpretation
Message-ID : <674cc506$0$711$14726298@news.sunsite.dk>
References : 1
User-Agent : trn 4.0-test77 (Sep 1, 2010)
Hi. Mack The Knife pointed me at this question.
This kind of query should go to the bug list (where I'll see it).
I skim the help list occasionally but don't reply to mails there.
In article <viac5m$l8oh$1@dont-email.me> Janis writes:
> In GNU Awk there are currently three types of regular expressions: in
> addition to the standard regexp-constants (/regex/) and the dynamic
> regexps ("regex", or variables containing "regex"), newer versions
> also support first class regexp objects (@/regex/, "Strongly Typed
> Regexp Constants").
>
> One principal advantage of regexp-constants is that the engine to
> parse the regexp can be created in advance, while a dynamic regexp
> may be constructed dynamically (from strings) and needs an explicit
> runtime-step to create the engine before the matching can be done.
Even for such dynamically created regexps, the regexp is compiled once and
cached, not compiled each time it's used (as long as it doesn't change).
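A minimal sketch of the dynamic-regexp case, for illustration only. The
function name `count_matches' and the sample data are hypothetical. The
caching itself happens inside gawk and cannot be observed from the
script; the point is only that `pat' is an ordinary string that is used
unchanged for every match in the loop.

```awk
# Count the whitespace-separated words of `text` that match the
# dynamic regexp held in the string `pat`.
function count_matches(text, pat,    n, i, words, hits) {
    n = split(text, words, " ")
    hits = 0
    for (i = 1; i <= n; i++)
        if (words[i] ~ pat)    # `pat` is a string, used here as a regexp
            hits++
    return hits
}

BEGIN {
    data = "12 ab 345 6x 7"
    print count_matches(data, "^[0-9]+$")    # same pattern text for every word
    print count_matches(data, "^[a-z]+$")    # different text, hence a different regexp
}
```

Within each call the pattern text never changes, which is the situation
in which recompiling on every match would be wasted work.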
> Now I assumed that @/regex-const/ would in that respect behave as
> /regex-const/ ... - until I found in the GNU Awk manual this text:
>
> | Thus, if you have something like this:
> |
> |     re = @/don't panic/
> |     sub(/don't/, "do", re)
> |     print typeof(re), re
> |
> | then re retains its type, but now attempts to match the string ‘do
> | panic’. This provides a (very indirect) way to create regexp-typed
> | variables at runtime.
>
> (I'm astonished that first class regexp objects can be dynamically
> changed. But that is not my point here; I'm interested in potential
> pre-compiles of regexp constants...)
Since `re' is a variable, it can be changed, just as when you do

    str = "don't panic"
    sub(/don't/, "do", str)
> This would imply that the first class regexp constants can be changed
> like dynamic regexps and that there's no regexp pre-compile involved.
"Not so, Watson! Not so!" When you do
    re = @/don't panic/

gawk uses reference-counted pointers to the original object; the
original strongly typed regexp is precompiled and remains that way.
As soon as you go to *change* `re', gawk makes a copy of the string
value of the original regexp, makes the substitution, notes that
it's a strongly typed regexp, and compiles the new regexp. From then
on, the cached compiled regexp is used for matching.
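The behaviour described above can be sketched from the script's side as
follows. This assumes a recent gawk that supports strongly typed regexp
constants and typeof(). The variable names `copy', `hit_new' and
`hit_old' are hypothetical, and the reference counting itself is
internal to gawk, so only its visible effects are shown.

```awk
BEGIN {
    re = @/don't panic/        # strongly typed regexp constant
    copy = re                  # second variable holding the same regexp value

    sub(/don't/, "do", re)     # changing `re` yields a new regexp; `copy` is untouched

    print typeof(re), re       # `re` should still be regexp-typed, per the manual
    print typeof(copy), copy   # `copy` should still hold the original regexp

    hit_new = ("do panic" ~ re)      # matched against the changed regexp
    hit_old = ("do panic" ~ copy)    # matched against the original regexp
    print hit_new, hit_old
}
```

Keeping `copy' around shows that the substitution acts on `re' alone
and leaves the original regexp as it was.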
> This would also raise suspicion that the "normal" regexp-constants are
> probably also not precomputed.
Also not true.
> So constant-regexps (both forms) have (only?) the advantage that the
> regexp-syntax can be (initially during awk parsing) checked, e.g.,
>
>     re = @/don't panic[/
>     ^ unterminated regexp
Incorrect, they are compiled when the program is parsed.
> And dynamic regexps and first class regexps that got changed (e.g.
> by code like
>
>     sub(/don't/, "do[", re)
>
> in the above sample snippet) would both create runtime errors, e.g.
>
>     error: Unmatched [, [^, [:, [., or [=: /do[ panic/
>     fatal: could not make typed regex
>
> (as all ill-formed regexp-types will produce a runtime error).
Well, of course.
In short, I jump through a lot of hoops in order to avoid recompiling
regexps if it's not necessary.
Hope this helps,
Arnold
-- Aharon (Arnold) Robbins arnold AT skeeve DOT com