Sujet : Bad performance of back-references De : janis_papanagnou+ng (at) *nospam* hotmail.com (Janis Papanagnou) Groupes :comp.unix.shell Date : 01. Jun 2025, 13:07:33 Autres entêtes Organisation : A noiseless patient Spider Message-ID :<101hfq7$22v3c$1@dont-email.me> User-Agent : Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
In a recent application case I used back-references to find duplicate strings in large data sets. I tested that with grep as in
grep -E -o '(.{42}).*\1'
While this is obviously a neat way to formulate and solve the task in principle it is impractical for performance reasons.[*]
Applied on my MB sized data the task did not terminate and I killed the process after a day.
I also implemented the desired function explicitly (using two nested loops) in a couple of languages (interpreted or compiled). All those hand-crafted and non-optimized implementations terminated and did that within minutes or up to only few hours (depending on the pattern length).
My astonishment is why the back-reference implementation performs so badly here with 'grep'.
Janis
[*] Note: Back-references are not from the Regular Expression functions class so they cannot be done in O(N) or O(N+K); so I don't expect this complexity where I use them. This is not the question here, just to be clear.