Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~ ... ]]
From: janis_papanagnou+ng (at) *nospam* hotmail.com (Janis Papanagnou)
Newsgroups: comp.unix.shell
Date: 24. Jul 2024, 10:41:48
Organization: A noiseless patient Spider
Message-ID: <v7qi8t$1mi8m$1@dont-email.me>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
On 23.07.2024 20:34, Kaz Kylheku wrote:
> [...]
> In the old math/CS papers about regular expressions, regular expressions
> are typically represented in terms of some input symbol alphabet
> (usually just letters a, b, c ...) and only the operators | and *,
> and parentheses (other than when advanced operators are being discussed,
> like intersection and complement, which are not easily constructed from
> these.)
> I think character classes might have been a pragmatic invention in
> regex implementations. The theory doesn't require [a-c] because
> that can be encoded as (a|b|c).
While formally we can restrict ourselves to some basic elements,
that's quite inconvenient in practice. I recall that in Compiler
Construction and Automata Theory we regularly used the 'all-but'
operator for complementing input symbol sets, and also obvious
abbreviations for larger sets of symbols (like 'digits', etc.).
Certainly no one wants to waste time, not only in practical regexp
implementations but also in education.
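
Both points are easy to check with bash's [[ ... =~ ... ]] (a
minimal sketch, assuming POSIX EREs and invented test strings):

  # A bracket expression and the equivalent alternation accept
  # exactly the same strings:
  alt='^(a|b|c)$'
  for s in a b c d; do
    [[ $s =~ ^[a-c]$ ]] && echo "[a-c]   matches '$s'"
    [[ $s =~ $alt ]]    && echo "(a|b|c) matches '$s'"
  done
  # The 'all-but' complement is what negated bracket expressions
  # give us in practice:
  [[ d =~ ^[^abc]$ ]] && echo "[^abc] matches 'd'"
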
> The ? operator is not required because (R)? can be written (R)(R)*.
ITYM: (R)+
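
That is, (R)(R)* is one or more repetitions of R, i.e. (R)+, while
(R)? is zero or one. A quick check in bash (a sketch, nothing more):

  plus='^(x)(x)*$' opt='^x?$'
  for s in "" x xx; do
    [[ $s =~ $plus ]] && echo "(x)(x)* matches '$s'"
    [[ $s =~ $opt ]]  && echo "x?      matches '$s'"
  done
  # x? matches '' and 'x'; (x)(x)* matches 'x' and 'xx'.
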
> Escaping is not required because the operators and input symbols are
> distinct; the idea that ( could be an input symbol is something that
> occurs in implementations, not in the theory.
>
> Regex implementors take the theory and adjust it to taste,
> and add necessary details such as character escape sequences for
> control characters, and escaping to allow the operator characters
> themselves to be matched. Plus character classes, with negation
> and ranges and all that.
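
Since that escaping is precisely this thread's subject, a small
sketch for bash's [[ ... =~ ... ]] (bash 3.2+ assumed, strings
invented):

  s='f(x)*'
  # Backslash-quoting at the shell level makes the parenthesis a
  # literal in the ERE:
  [[ $s =~ \(x\) ]] && echo "matched: ${BASH_REMATCH[0]}"
  # Bracket expressions work as well and port to other ERE dialects:
  [[ $s =~ [(]x[)][*] ]] && echo "matched: ${BASH_REMATCH[0]}"
  # Keeping the pattern in a variable avoids most quoting pitfalls:
  re='\(x\)\*'
  [[ $s =~ $re ]] && echo "matched: ${BASH_REMATCH[0]}"
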
There are (at least) two different types of such adjustments. One
is the convenience enhancements (like '+', the multiplicity
operator '{m,n}', Perl's '\d' and '\D', etc.) that, from a
complexity perspective, all stay within the same [theoretical]
class of the Regular Languages. (The other type covers extensions,
found in implementations, that leave that language class.)
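
For illustration (GNU userland assumed):

  # '{m,n}' adds convenience, not power; it unfolds into plain
  # concatenation:
  [[ 2024 =~ ^[0-9]{4}$ ]]             && echo "four digits"
  [[ 2024 =~ ^[0-9][0-9][0-9][0-9]$ ]] && echo "same language, unfolded"
  # A back-reference leaves the class: no finite automaton accepts
  # 'the same word twice'. POSIX BREs allow it nevertheless:
  echo "abc abc" | grep '^\(.*\) \1$'  # matches
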
> Not all implementations follow solid theory. For instance, the branch
> operator | is supposed to be commutative. There is no difference
> between R1|R2 and R2|R1. But in many implementations (particularly
> backtracking ones like PCRE and similar), there is a difference: these
> implementations implement R1|R2|R3 by trying the expressions in left to
> right order and stop at the first match.
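
A concrete contrast (GNU grep assumed; -P requires a PCRE-enabled
build):

  # POSIX semantics are leftmost-*longest*, so the order of the
  # alternatives does not matter:
  echo foobar | grep -oE 'foo|foobar'    # prints: foobar
  re='foo|foobar'
  [[ foobar =~ $re ]] && echo "${BASH_REMATCH[0]}"   # foobar (glibc)
  # Backtracking semantics take the *first* alternative that
  # matches, so R1|R2 and R2|R1 differ:
  echo foobar | grep -oP 'foo|foobar'    # prints: foo
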
In my book it's not necessary to follow "solid theory", if, and
only if, the deviation is documented, correctly/sensibly
implemented, and its implications are made clear to the programmer.
There are two common examples that come to mind. I think it's okay
if there's support for, e.g., back-references, as long as it's
clearly stated that you should not expect such code to run in O(N),
linear complexity. But there have been implementations - I don't
recall whether in Java, Perl, or both - where the implemented
"generalizations" of the Regexp processing had the runtime effect
that even for *true* Regular Expressions you no longer had an O(N)
guarantee; and this is IMO a fatal decision! If the programmer uses
only RE elements from the class of Regular Languages, complexity
O(N) should always be guaranteed.
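
The classic demonstration (again GNU grep; exact limits and
timings vary by version):

  # '(a+)+b' is a *true* regular expression; a DFA-based matcher
  # answers in linear time:
  s=$(printf 'a%.0s' {1..40})          # forty 'a's and no 'b'
  time grep -E '^(a+)+b$' <<< "$s"     # returns immediately: no match
  # A backtracking engine explores ~2^40 ways to split the 'a's
  # and either crawls or aborts with a backtracking-limit error:
  time grep -P '^(a+)+b$' <<< "$s"
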
> This matters when regexes are used for matching a prefix of the input;
> if the regex is interpreted according to the theory, it should match
> the longest possible prefix; it cannot ignore R3, which matches
> thousands of symbols, because R2 matched three symbols.
I think we should differentiate the matching process
(implementation-specific syntax for RE formulas, matching
implementation methods) from the theory of Regular Languages; the
complete texts that we match are in general not part of the Regular
Language, the Regular Expression only specifies the subset of
(matching) strings that belongs to the respective Regular Language.
And, WRT complexity, we should also be aware that the O(N) of
Regular Language matching refers to the length N of the match, not
to the length of the data that is scanned for a match. The various
algorithms combine these two complexities in supposedly efficient
ways. Some RE parsers failed at that.
Janis