Subject: Re: GNU Awk's types of regular expressions
From: 643-408-1753 (at) *nospam* kylheku.com (Kaz Kylheku)
Newsgroups: comp.lang.awk
Date: 29 Nov 2024, 05:13:43
Organization: A noiseless patient Spider
Message-ID: <20241128200247.439@kylheku.com>
References: 1
User-Agent: slrn/pre1.0.4-9 (Linux)

On 2024-11-28, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
> In GNU Awk there's currently three types of regular expressions, in
> addition to the standard regexp-constants (/regex/) and the dynamic
> regexps ("regex", or variables containing "regex") there's in newer
> versions also first class regexp objects (@/regex/, "Strongly Typed
> Regexp Constants") supported.
>
> One principal advantage of regexp-constants is that the engine to
> parse the regexp can be created in advance, while a dynamic regexp
> may be constructed dynamically (from strings) and needs an explicit
> runtime-step to create the engine before the matching can be done.
> Now I assumed that @/regex-const/ would in that respect behave as
> /regex-const/ ... - until I found in the GNU Awk manual this text:
>
> |
> | Thus, if you have something like this:
> |
> |     re = @/don't panic/
> |     sub(/don't/, "do", re)
> |     print typeof(re), re
> |
> | then re retains its type, but now attempts to match the string ‘do
> | panic’. This provides a (very indirect) way to create regexp-typed
> | variables at runtime.
> |
>
> (I'm astonished that first class regexp objects can be dynamically
> changed. But that is not my point here; I'm interested in potential
> pre-compiles of regexp constants...)

I would flatly reject a commit to do such a thing. Yikes!

What representation is it working on? If the regex contains
a match for a literal backslash using escaping, does that
count as two backslash characters when you operate on it?
Or is it a single backslash? Can you replace the second
backslash with an 'n' and have the pair turn into a newline?

Is it just tromboning back to printed representation,
and then parsing again?

In TXR Lisp, I provide this:

  1> (regex-source #/a.*b(c|d)/)
  (compound #\a (0+ wild) #\b (or #\c #\d))

You can get the source code of the regex object as a nested
list with symbols, characters and other objects.

When you have this, you can analyze and transform it.
Then you can call regex-compile on the result.

For instance, prepend a match for the z character:

  2> (regex-compile ^(compound #\z ,*(cdr *1)))
  #/za.*b(c|d)/

This is robust; you're not dealing with any character-syntax issues like
escapes, because you have the abstract syntax tree of the regex.

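For readers without TXR at hand, the same idea can be roughly sketched in
Python. This is only an analogue under stated assumptions: the tuple-based
tree, the WILD marker and the render() helper are all hypothetical names
made up for this illustration; they are not part of TXR, GNU Awk or
Python's re module. The point it shows is that literals are escaped in
exactly one place, when the tree is turned back into pattern text, so
transforming the tree never runs into escape-syntax questions.

```python
import re

# Hypothetical mini-AST: a str is a literal character sequence,
# a tuple is (operator, operands...). Not any library's real format.
WILD = ("wild",)

def render(node):
    """Turn the tiny regex tree into Python regex source text."""
    if isinstance(node, str):
        return re.escape(node)  # escaping happens here and nowhere else
    op, *args = node
    if op == "compound":
        return "".join(render(a) for a in args)
    if op == "or":
        return "(?:" + "|".join(render(a) for a in args) + ")"
    if op == "0+":
        return "(?:" + render(args[0]) + ")*"
    if op == "wild":
        return "."
    raise ValueError(f"unknown operator: {op!r}")

# Roughly the tree shown above for /a.*b(c|d)/
tree = ("compound", "a", ("0+", WILD), "b", ("or", "c", "d"))

# Prepend a match for the z character by editing the tree, not the text.
new_tree = ("compound", "z", *tree[1:])
new_regex = re.compile(render(new_tree))

# A literal backslash followed by 'n' stays two literal characters;
# it cannot silently turn into a newline escape.
backslash_n = re.compile(render(("compound", "\\", "n")))
```

Here new_regex matches strings such as "za--bd", and backslash_n matches
a backslash followed by the letter n but not a newline character.
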
> This would imply that the first class regexp constants can be changed
> like dynamic regexps and that there's no regexp pre-compile involved.

Not necessarily; it could be that a new regex is compiled, and put into
the re variable, clobbering the old regex, which is freed (if it
hits a refcount of zero or whatever memory management is used).

It could also (in combination with this) be lazy. So that is to say
@/abc/ will just store the textual source code of the regex into
the regex object, but not compile anything. When it comes time to
use the regex, on first use, it is compiled and then cached into
that object. When the regex is edited, the cache is invalidated.

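The lazy scheme just described can be sketched in a few lines of Python.
This is a guess at one possible design, not a description of what GNU Awk
actually does; the LazyRegex class, its edit() method and the
compile_count counter are hypothetical names invented for the
illustration.

```python
import re

class LazyRegex:
    """Hypothetical regex object: stores source text, compiles on first use."""

    def __init__(self, source):
        self.source = source
        self._compiled = None   # cache, filled in lazily
        self.compile_count = 0  # instrumentation for the demo only

    def _engine(self):
        # Compile only when there is no valid cached engine.
        if self._compiled is None:
            self._compiled = re.compile(self.source)
            self.compile_count += 1
        return self._compiled

    def search(self, text):
        return self._engine().search(text)

    def edit(self, pattern, replacement):
        """Edit the source text (like sub() on the variable); drop the cache."""
        self.source = re.sub(pattern, replacement, self.source, count=1)
        self._compiled = None

r = LazyRegex("don't panic")
r.search("don't panic")   # first use: compiles and caches
r.search("something")     # second use: reuses the cached engine
r.edit("don't", "do")     # source becomes "do panic"; cache invalidated
r.search("do panic")      # recompiled on the next use
```

Under this particular design, an edit that produces an invalid regex
(for example turning the source into "do[ panic") raises nothing at edit
time; the error only surfaces on the next use, when the source is
recompiled.
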
Someone will undoubtedly chime in confirming or refuting these
hypotheses.

It would be pretty silly if these regex objects didn't cache a compiled
regex across multiple uses.

> And dynamic regexps and first class regexps that got changed (e.g.
> by code like
>
>     sub(/don't/, "do[", re)
>
> in above sample snippet) would both create runtime errors, e.g.

Have you tried this? Do you get an error at sub() time, or when
you later try to use re?

-- 
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca