Sujet : Re: Bad performance of back-references
De : janis_papanagnou+ng (at) *nospam* hotmail.com (Janis Papanagnou)
Groupes : comp.unix.shellDate : 01. Jun 2025, 14:07:38
Autres entêtes
Organisation : A noiseless patient Spider
Message-ID : <101hjas$242rp$1@dont-email.me>
References : 1 2
User-Agent : Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
On 01.06.2025 14:52, Dan Cross wrote:
In article <101hfq7$22v3c$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
In a recent application case I used back-references to find duplicate
strings in large data sets. I tested that with grep as in
>
grep -E -o '(.{42}).*\1'
>
While this is obviously a neat way to formulate and solve the task in
principle it is impractical for performance reasons.[*]
>
Applied on my MB sized data the task did not terminate and I killed
the process after a day.
>
I also implemented the desired function explicitly (using two nested
loops) in a couple of languages (interpreted or compiled). All those
hand-crafted and non-optimized implementations terminated and did
that within minutes or up to only few hours (depending on the pattern
length).
>
My astonishment is why the back-reference implementation performs so
badly here with 'grep'.
>
Janis
>
[*] Note: Back-references are not from the Regular Expression functions
class so they cannot be done in O(N) or O(N+K); so I don't expect this
complexity where I use them. This is not the question here, just to be
clear.
Your results don't surprise me in the the least.
First, "back references" make "regular expressions" not regular,
in the formal sense sense that they are no longer isomorphic to
deterministic finite automata or their NDFA simulations.
I'm surprised you give me that answer since with my footnote
above I explicitly intended to exactly prevent focusing on
or distracting about this already known fact.
Matching DFA is inherently linear, but _creating_ DFAs can be
exponential; NDFAs can be created in reasonable time (I forget
the exact complexity, I'm afraid) though executing them may be
superlinear; regardless it's much better than exponential.
Second, most implementations that support backreferences use
backtracking to do so, which can be exponential in both space
and time.
I'd think that most applications that support back-references
might rely on the same library (and not reinvent the wheel).
If the accessible implementations will all resort to backtracking
as the sole implementation for back-references it indeed explains
something.
There's some good background information here:
https://swtch.com/~rsc/regexp/regexp1.html
The bottom line is that regexp that use back tracking are not
actually regular expressions, and there is no known way to make
them fast generally.
- Dan C.
Thanks.
Janis