Subject : Re: How good is Linux OCR?
From : stefan (at) *nospam* mailchuck.com (Stefan Claas)
Newsgroups : sci.crypt
Date : 09. Jun 2025, 08:18:51
Organization : Victor Usenet Postings
Message-ID : <10261sr$3rs24$1@news.tcpreset.net>
References : 1 2 3 4
User-Agent : flnews/1.3.0pre31 (for GNU/Linux)
Rich wrote:
Stefan Claas <stefan@mailchuck.com> wrote:
Rich wrote:
Note that Tesseract will (I think) compile for Windows too, so if you
wanted to know "how well Tesseract worked" you could just install the
Windows version and see for yourself.
I tried Tesseract under Linux. It is horrible, because of too many errors.
Fair enough. The Windows version will do the same.
Two other options I'm aware of for Linux:
http://slackbuilds.org/repository/15.0/office/gocr/
http://slackbuilds.org/repository/15.0/libraries/cuneiform/
I have never used either, so I can't comment on how well they work.
Your original image, however, is one that will be hard to OCR, so it is
quite amazing that whatever OCR engine MS supplies is actually able to
convert it with some accuracy.
With 100 % accuracy. :-)
If your goal is to store binary data (keys/messages) as these text
strings, then you also want to consider that OCR engines often confuse
similar-looking characters. I've seen 5 (five) become S
(letter ess) or 1 (one) become I (letter eye). I'm not sure I've seen
I become 1, but it is possible, esp. with a font with little to no
difference between those glyphs.
O (letter oh) and 0 (numeral zero) are often confused for each other as
well.
So you might want to restrict your character set to not include the
"easy to confuse" letter pairs. If they don't exist on the "printouts"
then they can't be confused for each other.
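The restricted character set suggested above could look something like the following. This is only a minimal sketch: the names ALPHABET, encode and decode are made up for illustration, and the choice of uppercase letters plus digits as the starting set is an assumption. It drops the pairs mentioned in this thread (5/S, 1/I, O/0) and writes each byte as two symbols from what remains.

```python
import string

# Characters this thread names as easily confused by OCR: 5/S, 1/I, O/0.
CONFUSABLE = set("5S1IO0")

# Hypothetical alphabet: uppercase letters and digits minus the confusable ones.
ALPHABET = "".join(
    c for c in string.ascii_uppercase + string.digits if c not in CONFUSABLE
)
BASE = len(ALPHABET)  # 30 symbols, so two symbols cover one byte (30 * 30 >= 256)


def encode(data: bytes) -> str:
    """Encode each byte as exactly two symbols from ALPHABET."""
    return "".join(ALPHABET[b // BASE] + ALPHABET[b % BASE] for b in data)


def decode(text: str) -> bytes:
    """Inverse of encode(); raises ValueError on invalid input."""
    if len(text) % 2:
        raise ValueError("encoded text must have an even length")
    out = bytearray()
    for i in range(0, len(text), 2):
        high, low = ALPHABET.index(text[i]), ALPHABET.index(text[i + 1])
        out.append(high * BASE + low)  # ValueError if the pair exceeds 255
    return bytes(out)
```

A fixed width of two symbols per byte wastes some space, but it keeps the printout easy to check by eye and makes a dropped or extra character show up as a length error.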
My 'atze' encoder uses only the letters ATZE. These four letters spell a
German (slang) word for brother.
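For readers wondering how four letters can carry arbitrary data: each letter can stand for two bits, so one byte becomes four letters. The sketch below illustrates that idea only. The bit-to-letter mapping (A=00, T=01, Z=10, E=11) and the function names are assumptions, and the actual 'atze' tool may work differently.

```python
# Assumed mapping of 2-bit values to letters; the real tool may differ.
LETTERS = "ATZE"


def atze_encode(data: bytes) -> str:
    """Write each byte as four letters, most significant bit pair first."""
    return "".join(
        LETTERS[(b >> shift) & 0b11] for b in data for shift in (6, 4, 2, 0)
    )


def atze_decode(text: str) -> bytes:
    """Inverse of atze_encode(); raises ValueError on invalid input."""
    if len(text) % 4:
        raise ValueError("encoded text length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(text), 4):
        value = 0
        for ch in text[i:i + 4]:
            # LETTERS.index raises ValueError for any letter outside ATZE.
            value = (value << 2) | LETTERS.index(ch)
        out.append(value)
    return bytes(out)
```

With this assumed mapping the byte 0x1b (binary 00 01 10 11) encodes to "ATZE". The output is four times the size of the input, which is the price of an alphabet with no confusable glyphs.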
As an alternative, there are also the "OCR-A"
(https://en.wikipedia.org/wiki/OCR-A) and "OCR-B"
(https://en.wikipedia.org/wiki/OCR-B) fonts, which were designed to be
easy for early OCR engines to read. Either might still be "easier to
read" even though OCR engines have progressed since those fonts were
created.
Well, I always use mono fonts.
Regards
Stefan