Re: Bad performance of back-references

Liste des Groupes 
Sujet : Re: Bad performance of back-references
De : janis_papanagnou+ng (at) *nospam* hotmail.com (Janis Papanagnou)
Groupes : comp.unix.shell
Date : 07. Jun 2025, 22:48:14
Autres entêtes
Organisation : A noiseless patient Spider
Message-ID : <1022c30$3cdq3$1@dont-email.me>
References : 1 2 3 4
User-Agent : Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
On 02.06.2025 13:20, Dan Cross wrote:
In article <101hjas$242rp$1@dont-email.me>,
Janis Papanagnou  <janis_papanagnou+ng@hotmail.com> wrote:
On 01.06.2025 14:52, Dan Cross wrote:
In article <101hfq7$22v3c$1@dont-email.me>,
Janis Papanagnou  <janis_papanagnou+ng@hotmail.com> wrote:
In a recent application case I used back-references to find duplicate
strings in large data sets. I tested that with grep as in
>
 grep -E -o '(.{42}).*\1'
>
While this is obviously a neat way to formulate and solve the task in
principle it is impractical for performance reasons.[*]
>
Applied on my MB sized data the task did not terminate and I killed
the process after a day.
>
I also implemented the desired function explicitly (using two nested
loops) in a couple of languages (interpreted or compiled). All those
hand-crafted and non-optimized implementations terminated and did
that within minutes or up to only few hours (depending on the pattern
length).
>
My astonishment is why the back-reference implementation performs so
badly here with 'grep'.
>
Janis
>
[*] Note: Back-references are not from the Regular Expression functions
class so they cannot be done in O(N) or O(N+K); so I don't expect this
complexity where I use them. This is not the question here, just to be
clear.
>
Your results don't surprise me in the the least.
>
First, "back references" make "regular expressions" not regular,
in the formal sense sense that they are no longer isomorphic to
deterministic finite automata or their NDFA simulations.
>
I'm surprised you give me that answer since with my footnote
above I explicitly intended to exactly prevent focusing on
or distracting about this already known fact.
 
Oh, sorry; I missed the footnote.  It's odd, because that really
is the question in some sense.

Yes, of course. It's just the magnitude that is so frustrating,
especially in comparison to other solutions. (See below.)

(Myself I'm rarely using expressions that go beyond the Regular
Expressions formal class, since I usually want to rely on the O(N)
or O(N+K) complexity. But in the current case the simple solution
was just tempting.)

 
Matching DFA is inherently linear, but _creating_ DFAs can be
exponential; NDFAs can be created in reasonable time (I forget
the exact complexity, I'm afraid) though executing them may be
superlinear; regardless it's much better than exponential.
>
Second, most implementations that support backreferences use
backtracking to do so, which can be exponential in both space
and time.
>
I'd think that most applications that support back-references
might rely on the same library (and not reinvent the wheel).
 
I don't see how, when the functionality is implemented in many
different languages.

Oh, I wasn't primarily speaking about different languages. Here
I had tools in mind that are supposedly all written in "C", like
'grep', 'sed', etc. But of course also other libraries and tools
written in other languages may access (for example) "C" libraries
that provide that functionality; many languages support external
access to functions written in other languages.

That said; there's of course also various other implementations
anyway. (I hope not all do the same mistakes.)

 
Historically, regular expressions were one of those things where
library support wasn't super awesome.  This is why Spencer's
library was so popular, but that used backtracking.
 
If the accessible implementations will all resort to backtracking
as the sole implementation for back-references it indeed explains
something.
 
I don't know that anyone knows of a better way for the general
case.  As I said, there is no known fast (as in, less than
exponential) and general solution to matching with back
references.

It's certainly a correct observation that if you have a suboptimal
but "general" algorithmic solution that it's tempting to use that.

 
One can probably write something faster for specific use cases,
but that would obviously be specific to those cases.

To be honest, I haven't to any significant depth studied details
of various "beyond-regular" implementations; with formal strictly
Regular Expressions it's obvious and simple.

I've asked that question mainly because the performance measures
where too far apart. My simple 'grep' based solution '(.{12}).*\1'
terminated (finally!) after 27.5 hours (more than a day).[*] Where
a bulky hand-crafted nested loops required only a fraction of that;
don't recall exactly but I think it was in the low minutes(!) range.
That was some Awk code that even did a lot of unnecessary costly
copying (because of the necessity, with the given functions, of
repeated unavoidable huge copies). With yet another approach I had
got that task down to 4 seconds, but that's not my expectation with
a standard regexp library. But 27.5 hours is way off, IMO, compared
to the straight forward primitive approach with 2 minutes; a factor
of magnitude approx. 1000 times slower.

If a "general solution" can be so costly then, I think, something
better should be implemented than a "one-backtracking-fits-all".

Janis

[*] Note the comparable small string size of 12 of the substring to
match. (This value is also much smaller than the sample in my OP.)
For sized of {6} on a MB sized file it's okay, then it's still in
the seconds range. But, as opposed to FSMs, the algorithm is badly
scaling.

[...]



Date Sujet#  Auteur
1 Jun 25 * Bad performance of back-references9Janis Papanagnou
1 Jun 25 `* Re: Bad performance of back-references8Dan Cross
1 Jun 25  `* Re: Bad performance of back-references7Janis Papanagnou
2 Jun 25   `* Re: Bad performance of back-references6Dan Cross
7 Jun 25    `* Re: Bad performance of back-references5Janis Papanagnou
9 Jun 25     `* Re: Bad performance of back-references4Dan Cross
11 Jun 25      `* Re: Bad performance of back-references3Janis Papanagnou
11 Jun 25       `* Re: Bad performance of back-references2Dan Cross
12 Jun 25        `- Re: Bad performance of back-references1Janis Papanagnou

Haut de la page

Les messages affichés proviennent d'usenet.

NewsPortal