Bad performance of back-references

Liste des GroupesRevenir à cu shell 
Sujet : Bad performance of back-references
De : janis_papanagnou+ng (at) *nospam* hotmail.com (Janis Papanagnou)
Groupes : comp.unix.shell
Date : 01. Jun 2025, 13:07:33
Autres entêtes
Organisation : A noiseless patient Spider
Message-ID : <101hfq7$22v3c$1@dont-email.me>
User-Agent : Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
In a recent application case I used back-references to find duplicate
strings in large data sets. I tested that with grep as in

  grep -E -o '(.{42}).*\1'

While this is obviously a neat way to formulate and solve the task in
principle it is impractical for performance reasons.[*]

Applied on my MB sized data the task did not terminate and I killed
the process after a day.

I also implemented the desired function explicitly (using two nested
loops) in a couple of languages (interpreted or compiled). All those
hand-crafted and non-optimized implementations terminated and did
that within minutes or up to only few hours (depending on the pattern
length).

My astonishment is why the back-reference implementation performs so
badly here with 'grep'.

Janis

[*] Note: Back-references are not from the Regular Expression functions
class so they cannot be done in O(N) or O(N+K); so I don't expect this
complexity where I use them. This is not the question here, just to be
clear.

Date Sujet#  Auteur
1 Jun13:07 * Bad performance of back-references4Janis Papanagnou
1 Jun13:52 `* Re: Bad performance of back-references3Dan Cross
1 Jun14:07  `* Re: Bad performance of back-references2Janis Papanagnou
2 Jun12:20   `- Re: Bad performance of back-references1Dan Cross

Haut de la page

Les messages affichés proviennent d'usenet.

NewsPortal