Path: ...!news.nobody.at!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Peter <confused@nospam.net>
Newsgroups: rec.photo.digital
Subject: Re: OCR image to text
Date: Sun, 14 Jul 2024 03:03:21 +0100
Organization: -
Lines: 72
Message-ID: <v6vbl9$3v4cr$1@dont-email.me>
References: <v6v7av$80e0$1@matrix.hispagatos.org> <v6v9qq$3r1al$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 14 Jul 2024 04:03:22 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="02c4f814930cc15ee37f798515491961";
	logging-data="4166043"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18RQ+rfMpMYLl529T6pbTn5"
Cancel-Lock: sha1:ZMJI7BPypKG3BgG+SCYh1kuMoAw=
X-No-Archive: yes
X-Newsreader: Forte Agent 3.3/32.846
Bytes: 4157

Geoff <geoff@geoffwood.org> wrote:

>> Is there a way to easily OCR a PDF to actual text on Windows for free?
> 
> https://letmegooglethat.com/?q=free+ocr+to+pdf
> 
> geoff

You've never actually run that search, have you?
If you had, you'd know that all you get are advertising shills,
all of them online PDF converters, which are huge privacy scams.

As far as I am aware, there is only one free Windows OCR converter extant.
That's GNU OCR (GOCR, aka JOCR): https://jocr.sourceforge.net/

The gocr help just says it works on "pnm,pgm,pbm,ppm,pcx..." files.
https://jocr.sourceforge.net/examples.html
https://www-e.ovgu.de/jschulen/ocr/download.html
"Windows-binary gocr049.exe" v0.49 154kB by Peter B L Meijer, Oct 2010
http://www-e.uni-magdeburg.de/jschulen/ocr/gocr049.exe
Name: gocr049.exe
Size: 153600 bytes (150 KiB)
SHA256: 1FFC4CD29A5B275F40FBC5F6F9194ED72B8D2BCCBD46019F088C9E5DE2923F59
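
For anyone who wants to check a downloaded copy against the size and SHA256 posted above, here is a minimal sketch in Python. The function names are made up for illustration; only the file name, size and hash come from this post.

```python
import hashlib
import os

# Values copied from the post above.
EXPECTED_SIZE = 153600
EXPECTED_SHA256 = "1FFC4CD29A5B275F40FBC5F6F9194ED72B8D2BCCBD46019F088C9E5DE2923F59"


def sha256_of(path, chunk_size=65536):
    """Return the uppercase hex SHA256 of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def matches_posted_values(path):
    """True only if both the size and the hash match the posted values."""
    return (os.path.getsize(path) == EXPECTED_SIZE
            and sha256_of(path) == EXPECTED_SHA256)
```

Calling matches_posted_values("gocr049.exe") on the downloaded file should return True if it is the same binary described here.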

gocr049.exe
 Optical Character Recognition --- gocr 0.49 20100924
 Copyright (C) 2001-2010 Joerg Schulenburg  GPG=1024D/53BDFBE3
 released under the GNU General Public License
 use option -h for help

gocr049.exe -h
 Optical Character Recognition --- gocr 0.49 20100924
 Copyright (C) 2001-2010 Joerg Schulenburg  GPG=1024D/53BDFBE3
 released under the GNU General Public License
 using: gocr [options] pnm_file_name  # use - for stdin
 options (see gocr manual pages for more details):
 -h, --help
 -i name   - input image file (pnm,pgm,pbm,ppm,pcx,...)
 -o name   - output file  (redirection of stdout)
 -e name   - logging file (redirection of stderr)
 -x name   - progress output to fifo (see manual)
 -p name   - database path including final slash (default is ./db/)
 -f fmt    - output format (ISO8859_1 TeX HTML XML UTF8 ASCII)
 -l num    - threshold grey level 0<160<=255 (0 = autodetect)
 -d num    - dust_size (remove small clusters, -1 = autodetect)
 -s num    - spacewidth/dots (0 = autodetect)
 -v num    - verbose (see manual page)
 -c string - list of chars (debugging, see manual)
 -C string - char filter (ex. hexdigits: 0-9A-Fx, only ASCII)
 -m num    - operation modes (bitpattern, see manual)
 -a num    - value of certainty (in percent, 0..100, default=95)
 -u string - output this string for every unrecognized character
 examples:
        gocr -m 4 text1.pbm                   # do layout analyzis
        gocr -m 130 -p ./database/ text1.pbm  # extend database
        djpeg -pnm -gray text.jpg | gocr -    # use jpeg-file via pipe

 webpage: http://jocr.sourceforge.net/

When I tested it just now, it ran, but it makes spelling errors even on
perfectly good text, so while it works, it doesn't work well.
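
The help text lists -l (threshold grey level, 0 = autodetect), so one thing that might be worth trying is running the same image at a few thresholds and comparing the outputs. This is an untested sketch: the levels, the output naming and the function name are arbitrary choices for illustration, and there is no guarantee it reduces the errors.

```python
def threshold_sweep(pnm_path, levels=(0, 120, 160, 200), gocr_exe="gocr049.exe"):
    """Build one gocr command line per threshold level (nothing is run here)."""
    commands = []
    for level in levels:
        # Hypothetical naming scheme: one output file per level.
        out_path = f"{pnm_path}.l{level}.txt"
        commands.append([gocr_exe, "-i", pnm_path, "-o", out_path,
                         "-l", str(level), "-f", "UTF8"])
    return commands
```

Each list can be handed to subprocess.run() as is, and the resulting text files compared by eye.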

a. I couldn't get gocr to convert a docx or pdf to anything
   gocr049.exe -i "testpage.docx" -o testpage.txt -f UTF8
b. Then I couldn't get ImageMagick to convert the pdf to anything
   convert testpage.pdf testpage.pnm
c. So I saved testpage.pdf as testpage.png and converted that with ImageMagick
   convert testpage.png testpage.pnm
d. Then I ran gocr on the resulting pnm
   gocr049.exe -i "testpage.pnm" -o testpage.txt -f UTF8
   (it had a tremendous amount of spelling errors, but it worked)
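
Steps c and d, the two that worked, could be chained in a small Python script, roughly as below. This is a sketch: it assumes "convert" and "gocr049.exe" are both on the PATH, and the function names are invented for illustration. As for step b, ImageMagick relies on Ghostscript to read PDFs, so a missing Ghostscript is one possible cause of that failure, though that was not checked here.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(png_path, gocr_exe="gocr049.exe", convert_exe="convert"):
    """Return the command lines for steps c and d as argument lists."""
    png = Path(png_path)
    pnm = png.with_suffix(".pnm")
    txt = png.with_suffix(".txt")
    return [
        [convert_exe, str(png), str(pnm)],                         # step c
        [gocr_exe, "-i", str(pnm), "-o", str(txt), "-f", "UTF8"],  # step d
    ]


def run_pipeline(png_path, **kwargs):
    """Run both steps in order, stopping at the first failure."""
    for cmd in build_commands(png_path, **kwargs):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)
```

run_pipeline("testpage.png") would then leave testpage.pnm and testpage.txt next to the input file.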

As far as I'm aware, there is no other Windows OCR freeware extant.