Re: Bad performance of back-references

Subject : Re: Bad performance of back-references
From : cross (at) *nospam* spitfire.i.gajendra.net (Dan Cross)
Newsgroups : comp.unix.shell
Date : 11. Jun 2025, 14:48:31
Organization : PANIX Public Access Internet and UNIX, NYC
Message-ID : <102c1ff$27o$1@reader1.panix.com>
References : 1 2 3 4
User-Agent : trn 4.0-test77 (Sep 1, 2010)
In article <102beth$1sdmr$1@dont-email.me>,
Janis Papanagnou  <janis_papanagnou+ng@hotmail.com> wrote:
>On 09.06.2025 14:17, Dan Cross wrote:
>[...]
>
>>>(Myself I'm rarely using expressions that go beyond the Regular
>>>Expressions formal class, since I usually want to rely on the O(N)
>>>or O(N+K) complexity. But in the current case the simple solution
>>>was just tempting.)
>>
>>Yup.  That's why they're there.  But they're deceptive with
>>respect to their complexity.  Moreover, since authors like
>>Friedl misrepresent the theory in popular books on the subject,
>
>I haven't read his books. Are they in any way (sort of) standard
>literature?

No.  He wrote a book called, "Mastering Regular Expressions"
where he mixes up some of the formal terminology.  In particular
he calls things "NDFAs" that are not, in fact, non-deterministic
finite automata.  When corrected, he got somewhat indignant.

That is to say, he's made factually incorrect statements (even
in his book!) and refused to correct them.

>>people use what they think are "regular expressions" and get
>>confused by the observed runtimes.
>
>*shrug* Can't tell. (For me it's obvious, and in the past I've
>also regularly pointed that out in the respective discussions.)
>
>But I recall some bad matching behavior in (I think) some Perl
>version; a very primitive RE (with no back-references or such)
>led to exponential runtime behavior.

An easy way to implement even real regular expressions is with
backtracking, and building an NDFA simulation is rather more
work.  Consequently, a lot of early "regexp" libraries took this
approach, and could suffer from exponential blowup, even using
standard regular expressions that would be linear in a DFA
implementation.  Perl, in particular, did this, borrowing Henry
Spencer's RE library, which used backtracking.  The version of
regular expressions described in "The Practice of Programming"
by Kernighan and Pike also uses backtracking; it's only a page
or two, but exhibits exponential behavior in the worst case.
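That matcher really is tiny. As a rough sketch (my own Python transliteration in the spirit of the Kernighan/Pike C code, not their actual code), a backtracking matcher for `^`, `$`, `.`, literals, and `*` looks like this:

```python
def match(regexp, text):
    """Search for regexp anywhere in text (supports ^ $ . * literals)."""
    if regexp.startswith("^"):
        return match_here(regexp[1:], text)
    while True:                     # try each starting position in turn
        if match_here(regexp, text):
            return True
        if text == "":
            return False
        text = text[1:]

def match_here(regexp, text):
    """Match regexp at the beginning of text."""
    if regexp == "":
        return True
    if len(regexp) >= 2 and regexp[1] == "*":
        return match_star(regexp[0], regexp[2:], text)
    if regexp == "$":
        return text == ""
    if text and (regexp[0] == "." or regexp[0] == text[0]):
        return match_here(regexp[1:], text[1:])
    return False

def match_star(c, regexp, text):
    """Match c* followed by regexp; backtrack over how much c* consumes."""
    while True:
        if match_here(regexp, text):
            return True
        if not (text and (c == "." or c == text[0])):
            return False
        text = text[1:]
```

The backtracking lives in `match_star`: every `*` retries all possible split points, so nested or repeated stars (e.g. `a*a*a*a*b` against a long run of a's) multiply the retries, which is where the bad worst-case behavior comes from.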

>>Traditionally, those tools rolled their own versions.  The
>>regular expression libraries people have used over the years
>>came somewhat later, in the evolutionary timeline.
>
>I'm lacking an overview here. One thing I know of GNU Awk - but
>that's of course just one example - is that a standard library
>had been used, and more recently an alternative library has been
>examined. All I remember is that they said it implements some
>improvements. GNU Awk relies a lot on existing functions, as
>sort of an adapter.
>
>(But Awk doesn't support back-references, so it won't apply to
>the topic of this thread.)

GNU awk appears to have imported a regular expression library
that does DFA construction, if possible, and an NDFA simulation
if not.  The DFA code was written by Mike Haertel; I assume it
came via gnulib or something.

It's quite a bit of code, wrapped in a few layers.  It doesn't
seem like they link to e.g. a system library or something that
already comes with the base OS.

[...]
 
>>I don't know that anyone knows of a better way for the general
>>case.  As I said, there is no known fast (as in, less than
>>exponential) and general solution to matching with back
>>references.
>
>It's certainly a correct observation that if you have a suboptimal
>but "general" algorithmic solution, it's tempting to use it.
>
>>For a general tool, it's unclear to me how one could do anything
>>else.
>
>Well, it's not uncommon to have tools separate function classes.
>In case of Regexps you could certainly tell apart RE expressions
>that use [true] Regular Expression patterns and those beyond.

Tcl does something like that.

>>>To be honest, I haven't studied the details of various
>>>"beyond-regular" implementations to any significant depth; with
>>>formal, strictly Regular Expressions it's obvious and simple.
>>
>>Mmm, I'm not sure I entirely agree with that.  NDFA analysis can
>>be subtle, but the real thing that twists people's noggins is
>>how DFA _creation_ can be exponential.
>
>Well, I agree if you mean that you can formulate non-deterministic
>RE patterns that would lead to NDFAs whose transformation to DFAs
>may become non-trivial - although patterns are mostly short compared
>to the data they are applied to, so even in those rarer cases it may
>not be an issue.
>
>Myself, I seem to have generally always defined deterministic Regular
>Expression patterns. The linear complexity follows.

Careful here; many people assume that but really end up with
something that's simulated by an NDFA, which is superlinear
(though still not exponential).  Any DFA can be _simulated_ by
an NDFA; it's a tradeoff between construction and runtime.
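To make that tradeoff concrete (my own illustration, not from the thread): the textbook language "the n-th symbol from the end is an a" has an NDFA with n+1 states, but any DFA for it needs 2^n states, since the DFA must remember the last n symbols seen. Simulating the NDFA with a set of current states avoids ever building that DFA, at the cost of set bookkeeping per input symbol:

```python
def nth_from_end_is_a(text, n):
    """NDFA simulation over {a,b}: accept iff the n-th symbol from the
    end of text is 'a'.

    States: 0 is the start (loops on any symbol); state i (1..n) means
    "guessed that an 'a' occurred i symbols ago".  Accept if state n is
    reachable after the last symbol.  The equivalent DFA needs 2**n
    states; this simulation tracks a set of at most n+1 states instead.
    """
    states = {0}
    for ch in text:
        nxt = set()
        for s in states:
            if s == 0:
                if ch in "ab":
                    nxt.add(0)      # keep looping at the start state
                if ch == "a":
                    nxt.add(1)      # nondeterministic guess: this 'a'
                                    # is the n-th symbol from the end
            elif s < n and ch in "ab":
                nxt.add(s + 1)      # count symbols since the guess
        states = nxt
    return n in states
```

Each input symbol costs at most O(n) work, so the whole match is linear in the input for a fixed pattern; the exponential cost only appears if you insist on determinizing up front.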

[...]
>>>If a "general solution" can be so costly then, I think, something
>>>better should be implemented than a "one-backtracking-fits-all".
>>
>>Well, if someone comes up with something better, they will have
>>solved a very important open problem in computer science.  :-)
>
>As said, I haven't dived deeper into the "beyond RE" case. But are
>you saying that this class of patterns can *only* be solved by
>expensive backtracking? - I'd doubt that, but I'm too old and too
>lazy to do any fundamental research. ;-)

What I'm saying is that no one actually knows, but if there's a
quicker way to do it, no one has found it yet, and a lot of
people have tried over a lot of years.

As of a few years ago, the best known algorithm for supporting
backreferences had exponential complexity.  As far as I know,
nothing has changed about that since the last time I looked
closely.  People _have_ done some work with fixing the number
of back references and so on, and maybe shown that that problem
is in P (it's somewhat unclear) but that's a) not the general
case, and b) the runtimes are still awful.  Like, O(N^(2k)), where
k is the number of capture groups, kind of awful.
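One way to see why back-references resist the fast automaton-based algorithms (my example, not from the thread): a single back-reference is enough to match the "square" language { ww }, which no finite automaton can recognize (it isn't even context-free), so an engine has no choice but to try candidate splits and backtrack. In Python's re module:

```python
import re

# ^(.+)\1$ matches exactly the "squares": some nonempty string w
# followed by an identical copy of itself.  The group captures a
# candidate w, and \1 demands a literal repeat of whatever it
# captured; the engine backtracks over where the split falls.
square = re.compile(r"^(.+)\1$")

print(bool(square.match("abab")))   # True: w = "ab"
print(bool(square.match("aa")))     # True: w = "a"
print(bool(square.match("aba")))    # False: no split yields ww
```

Since the language is non-regular, no amount of cleverness in DFA/NDFA construction helps; any engine supporting this syntax carries the worst-case costs discussed above.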

Again, I point to Russ Cox's reference on the subject, which is
a fairly gentle introduction for the lay reader:
https://swtch.com/~rsc/regexp/regexp1.html

- Dan C.

