On 2024-07-24, Ben Bacarisse <ben@bsb.me.uk> wrote:
Kaz Kylheku <643-408-1753@kylheku.com> writes:
On 2024-07-23, Ben Bacarisse <ben@bsb.me.uk> wrote:
Kaz Kylheku <643-408-1753@kylheku.com> writes:
This matters when regexes are used for matching a prefix of the input;
if the regex is interpreted according to the theory, it should match
the longest possible prefix; it cannot ignore R3, which matches
thousands of symbols, because R2 matched three symbols.
This is more a consequence of the different views. In the formal
theory there is no notion of "matching". Regular expressions define
languages (i.e. sets of sequences of symbols) according to a recursive
set of rules. The whole idea of an RE matching a string is from their
use in practical applications.
Under the set view, we can ask, what is the longest prefix of
the input which belongs to the language R1|R2. The answer is the
same for R2|R1, which denote the same set, since | corresponds
to set union.
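The set-view question can be sketched in a few lines of Python. This is only an illustration: the helper name longest_prefix_in_language is made up here, and Python's re.fullmatch is used merely as a membership test for the language, trying every prefix from longest to shortest. Asked this way, the order of the branches cannot matter.

```python
import re

def longest_prefix_in_language(pattern, text):
    """Return the longest prefix of text that belongs to the language
    denoted by pattern, or None if no prefix (not even "") belongs to it.

    Hypothetical helper: it brute-forces the set-view question by testing
    each prefix for whole-string membership, longest first.
    """
    for end in range(len(text), -1, -1):
        # fullmatch asks "is this entire prefix a member of the set?"
        if re.fullmatch(f"(?:{pattern})", text[:end]) is not None:
            return text[:end]
    return None

text = "abbbbbbbc"
# Both orderings denote the same set, so both give the same answer.
print(longest_prefix_in_language("a|ab*", text))  # abbbbbbb
print(longest_prefix_in_language("ab*|a", text))  # abbbbbbb
```

The loop is quadratic at best and is not how a real matcher would work; it is only meant to pin down what "longest prefix that is a member of the set" means.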
What is "the input" in the set view? The set view is simply a recursive
definition of the language.
It is a separate string under consideration.
We have a set, and are asking the question "what is the longest prefix
of the given string which is a member of the set".
Broken regular expressions identify the longest prefix, except
when the | operator is used; then they just identify a prefix,
not necessarily longest.
What is a "broken" RE in the set view?
Inconsistency in being able to answer the question "what is the longest
prefix of the string which is a member of the set".
Broken regexes contain a pitfall: they deliver the right answer
for expressions like ab*. If the input is "abbbbbbbc",
they identify the entire "abbbbbbb" prefix. But if the branch
operator is used, as in "a|ab*", oops, they short-circuit.
The "a" matches a prefix of the input, and so that's done; no need
to match the "ab*" part of the branch.
The "a" prefix is in the language described by the expression; a
set element has been identified. But it's not the longest one.
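The short-circuit is easy to reproduce. The sketch below uses Python's re module as one example of an engine whose | tries branches left to right and keeps the first that succeeds; it is not necessarily the engine under discussion in this thread, and the helper name first_match is made up for the illustration.

```python
import re

def first_match(pattern, text):
    """Return the prefix of text that re.match reports for pattern,
    or None if the pattern does not match at position 0."""
    m = re.match(pattern, text)
    return m.group() if m else None

text = "abbbbbbbc"
print(first_match(r"ab*", text))    # abbbbbbb  (greedy star, longest prefix)
print(first_match(r"a|ab*", text))  # a         (first branch wins, stops)
print(first_match(r"ab*|a", text))  # abbbbbbb  (same set, different answer)
```

The last two patterns denote the same language, yet the reported prefix depends on the order in which the branches are written.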
It is an inconsistency. If the longest match is not required, why
bother finding one for "ab*"? For that expression, the "a" prefix could
also just be returned.