Re: OCR image to text

Subject: Re: OCR image to text
From: confused (at) *nospam* nospam.net (Peter)
Newsgroups: rec.photo.digital
Date: 14 Jul 2024, 03:03:21
Organization: -
Message-ID: <v6vbl9$3v4cr$1@dont-email.me>
References: 1 2
User-Agent: Forte Agent 3.3/32.846
Geoff <geoff@geoffwood.org> wrote:

> Is there a way to easily OCR a PDF to actual text on Windows for free?
>
> https://letmegooglethat.com/?q=free+ocr+to+pdf
>
> geoff

You've never actually run that search, have you?
If you had, you'd know that all you get are advertising shills,
all of them online PDF converters, which are huge privacy scams.

As far as I am aware, there is only one free Windows OCR converter extant.
That's GNU OCR (GOCR, aka JOCR): https://jocr.sourceforge.net/

The gocr help just says it works on "pnm,pgm,pbm,ppm,pcx..." files.
https://jocr.sourceforge.net/examples.html
https://www-e.ovgu.de/jschulen/ocr/download.html
"Windows-binary gocr049.exe" v0.49 154kB by Peter B L Meijer, Oct 2010
http://www-e.uni-magdeburg.de/jschulen/ocr/gocr049.exe
Name: gocr049.exe
Size: 153600 bytes (150 KiB)
SHA256: 1FFC4CD29A5B275F40FBC5F6F9194ED72B8D2BCCBD46019F088C9E5DE2923F59
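
For anyone who wants to check a download against the size and SHA256 listed above, here is a minimal Python sketch. The helper names sha256_of and verify_download are made up for illustration and are not part of gocr; only the size and hash values come from the listing above.

```python
import hashlib
from pathlib import Path

# Values copied from the listing above for gocr049.exe.
EXPECTED_SIZE = 153600
EXPECTED_SHA256 = "1FFC4CD29A5B275F40FBC5F6F9194ED72B8D2BCCBD46019F088C9E5DE2923F59"


def sha256_of(path):
    """Return the uppercase hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def verify_download(path, expected_size=EXPECTED_SIZE, expected_sha256=EXPECTED_SHA256):
    """Hypothetical helper: True only if both the size and the hash match."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return sha256_of(path) == expected_sha256.upper()
```

Calling verify_download("gocr049.exe") in the download folder should return True for an intact copy. On Windows, certutil -hashfile gocr049.exe SHA256 prints the same digest without needing Python.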

gocr049.exe
 Optical Character Recognition --- gocr 0.49 20100924
 Copyright (C) 2001-2010 Joerg Schulenburg  GPG=1024D/53BDFBE3
 released under the GNU General Public License
 use option -h for help

gocr049.exe -h
 Optical Character Recognition --- gocr 0.49 20100924
 Copyright (C) 2001-2010 Joerg Schulenburg  GPG=1024D/53BDFBE3
 released under the GNU General Public License
 using: gocr [options] pnm_file_name  # use - for stdin
 options (see gocr manual pages for more details):
 -h, --help
 -i name   - input image file (pnm,pgm,pbm,ppm,pcx,...)
 -o name   - output file  (redirection of stdout)
 -e name   - logging file (redirection of stderr)
 -x name   - progress output to fifo (see manual)
 -p name   - database path including final slash (default is ./db/)
 -f fmt    - output format (ISO8859_1 TeX HTML XML UTF8 ASCII)
 -l num    - threshold grey level 0<160<=255 (0 = autodetect)
 -d num    - dust_size (remove small clusters, -1 = autodetect)
 -s num    - spacewidth/dots (0 = autodetect)
 -v num    - verbose (see manual page)
 -c string - list of chars (debugging, see manual)
 -C string - char filter (ex. hexdigits: 0-9A-Fx, only ASCII)
 -m num    - operation modes (bitpattern, see manual)
 -a num    - value of certainty (in percent, 0..100, default=95)
 -u string - output this string for every unrecognized character
 examples:
        gocr -m 4 text1.pbm                   # do layout analyzis
        gocr -m 130 -p ./database/ text1.pbm  # extend database
        djpeg -pnm -gray text.jpg | gocr -    # use jpeg-file via pipe

 webpage: http://jocr.sourceforge.net/
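
Since the help text names pnm-family files as input, one quick sanity check before calling gocr is to look at the first two bytes of the file. This is only a sketch, assuming the standard Netpbm magic numbers P1 through P6; the function name looks_like_pnm is made up for illustration, and it does not cover pcx or whatever else the "..." in the help text stands for.

```python
from pathlib import Path

# Netpbm magic numbers: P1/P4 = pbm, P2/P5 = pgm, P3/P6 = ppm.
PNM_MAGIC = {b"P1", b"P2", b"P3", b"P4", b"P5", b"P6"}


def looks_like_pnm(path):
    """Hypothetical helper: True if the file starts with a Netpbm magic number."""
    with Path(path).open("rb") as handle:
        return handle.read(2) in PNM_MAGIC
```

A docx (a zip archive, which starts with "PK") or a pdf (which starts with "%PDF") fails this check, so neither is something gocr can be expected to read directly.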

When I tested it just now, it ran, but it's prone to spelling errors
even on perfectly clean text. So it works, but it doesn't work well.

a. I couldn't get gocr to convert a docx or pdf to anything:
   gocr049.exe -i "testpage.docx" -o testpage.txt -f UTF8
b. Then I couldn't get ImageMagick to convert the pdf to anything:
   convert testpage.pdf testpage.pnm
c. So I saved testpage.pdf as testpage.png and converted that with ImageMagick:
   convert testpage.png testpage.pnm
d. Then gocr accepted the pnm file:
   gocr049.exe -i "testpage.pnm" -o testpage.txt -f UTF8
   (the output had a tremendous number of spelling errors, but it worked)
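
Steps c and d can be scripted. Here is a minimal Python sketch, assuming ImageMagick and gocr049.exe are both on the PATH. The function names are made up for illustration; the command lines are the same ones shown above. Two caveats: on Windows a bare "convert" can resolve to the unrelated system convert.exe, and ImageMagick 7 ships a "magick" executable, so the converter name is left as a parameter. ImageMagick also relies on Ghostscript to read PDFs, which may be why step b failed, though that is a guess.

```python
import subprocess


def build_convert_cmd(src_image, pnm_path, convert_exe="convert"):
    """Command line for step c: ImageMagick image -> pnm."""
    return [convert_exe, src_image, pnm_path]


def build_gocr_cmd(pnm_path, txt_path, gocr_exe="gocr049.exe", fmt="UTF8"):
    """Command line for step d, using the -i, -o and -f options from the help text."""
    return [gocr_exe, "-i", pnm_path, "-o", txt_path, "-f", fmt]


def ocr_image(src_image, pnm_path, txt_path, convert_exe="convert", run=subprocess.run):
    """Run both steps in order; check=True raises if either tool exits with an error."""
    commands = (
        build_convert_cmd(src_image, pnm_path, convert_exe),
        build_gocr_cmd(pnm_path, txt_path),
    )
    for cmd in commands:
        run(cmd, check=True)
```

For example, ocr_image("testpage.png", "testpage.pnm", "testpage.txt", convert_exe="magick") would run the two commands one after the other. It does nothing about the recognition quality, which stays as poor as described above.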

As far as I'm aware, there is no other Windows OCR freeware extant.

