Re: Bad performance of back-references

Liste des Groupes 
Sujet : Re: Bad performance of back-references
De : janis_papanagnou+ng (at) *nospam* hotmail.com (Janis Papanagnou)
Groupes : comp.unix.shell
Date : 11. Jun 2025, 09:31:43
Autres entêtes
Organisation : A noiseless patient Spider
Message-ID : <102beth$1sdmr$1@dont-email.me>
References : 1 2 3 4 5
User-Agent : Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
On 09.06.2025 14:17, Dan Cross wrote:
In article <1022c30$3cdq3$1@dont-email.me>,
Janis Papanagnou  <janis_papanagnou+ng@hotmail.com> wrote:
>
 grep -E -o '(.{42}).*\1'
>
[...]
>
Yes, of course. It's just the magnitude that is so frustrating,
especially in comparison to other solutions. (See below.)
 
Well, factorials get big really fast.

Of course; given the algorithm(s) the complexity follows.

[...]
 
(Myself I'm rarely using expressions that go beyond the Regular
Expressions formal class, since I usually want to rely on the O(N)
or O(N+K) complexity. But in the current case the simple solution
was just tempting.)
 
Yup.  That's why they're there.  But they're deceptive with
respect to their complexity.  Moreover, since authors like
Friedl misrepresent the theory in popular books on the subject,

I haven't read his books. Are they in any way (sort of) standard
literature?

people use what they think are "regular expressions" and get
confused by the observed runtimes.

*shrug* Can't tell. (For me it's obvious, and in the past I've
also regularly pointed that out in the respective discussions.)

But I recall some bad matching behavior in (I think) some Perl
version; a very primitive RE (with no back-references or such)
led to exponential runtime behavior.

 
Traditionally, those tools rolled their own versions.  The
regular expression libraries people have used over the years
came somewhat later, in the evolutinary timeline.

I'm lacking an overview here. One thing I know of GNU Awk - but
that's of course just one example - is that a standard library
had been used, and more recently an alternative library has been
examined. All I remember is that they said it does implement in
some respect improvements. GNU Awk is relying a lot on existing
functions as sort of an adapter.

(But Awk doesn't support back-references, so it won't apply to
the topic of this thread.)

[...]
 
I don't know that anyone knows of a better way for the general
case.  As I said, there is no known fast (as in, less than
exponential) and general solution to matching with back
references.
>
It's certainly a correct observation that if you have a suboptimal
but "general" algorithmic solution that it's tempting to use that.
 
For a general tool, it's unclear to me how one could do anything
else.

Well, it's not uncommon to have tools separate function classes.
In case of Regexps you could certainly tell apart RE expressions
that use [true] Regular Expression patterns and those beyond.

 
One can probably write something faster for specific use cases,
but that would obviously be specific to those cases.
>
To be honest, I haven't to any significant depth studied details
of various "beyond-regular" implementations; with formal strictly
Regular Expressions it's obvious and simple.
 
Mmm, I'm not sure I entirely agree with that.  NDFA analysis can
be subtle, but the real thing that twists people's noggins is
how DFA _creation_ can be exponential.

Well, I agree if you mean that you can formulate non-deterministic
RE patterns that would lead to NDFAs whose transformation to DFAs
may become non-trivial - although patterns are mostly short compared
to the data they are applied to, so even in those rarer cases it may
not be an issue.

Myself I seem to have generally always defined deterministic Regular
Expression patterns. The linear complexity follows.

 
[...]
>
If a "general solution" can be so costly then, I think, something
better should be implemented than a "one-backtracking-fits-all".
 
Well, if someone comes up with something better, they will have
solved a very important open problem in computer science.  :-)

As said, I haven't dived deeper into the "beyond RE" case. But are
you saying that this class of patterns can *only* be solved by
expensive backtracking? - I'd doubt that, but I'm too old and too
lazy to do any fundamental research. ;-)

Janis

[...]



Date Sujet#  Auteur
1 Jun 25 * Bad performance of back-references9Janis Papanagnou
1 Jun 25 `* Re: Bad performance of back-references8Dan Cross
1 Jun 25  `* Re: Bad performance of back-references7Janis Papanagnou
2 Jun 25   `* Re: Bad performance of back-references6Dan Cross
7 Jun 25    `* Re: Bad performance of back-references5Janis Papanagnou
9 Jun 25     `* Re: Bad performance of back-references4Dan Cross
11 Jun 25      `* Re: Bad performance of back-references3Janis Papanagnou
11 Jun 25       `* Re: Bad performance of back-references2Dan Cross
12 Jun 25        `- Re: Bad performance of back-references1Janis Papanagnou

Haut de la page

Les messages affichés proviennent d'usenet.

NewsPortal