From: Peter <confused@nospam.net>
Newsgroups: rec.photo.digital
Subject: Re: OCR image to text
Date: Sun, 14 Jul 2024 03:03:21 +0100
Message-ID: <v6vbl9$3v4cr$1@dont-email.me>
References: <v6v7av$80e0$1@matrix.hispagatos.org> <v6v9qq$3r1al$1@dont-email.me>

Geoff <geoff@geoffwood.org> wrote:
>> Is there a way to easily OCR a PDF to actual text on Windows for free?
>
> https://letmegooglethat.com/?q=free+ocr+to+pdf
>
> geoff

You've never actually run that search, have you?
If you had, you'd know that all you get are advertising shills,
all of them online PDF converters, which are huge privacy scams.

As far as I am aware, there is only one free Windows OCR converter extant.
That's GNU OCR (GOCR, aka JOCR):
 https://jocr.sourceforge.net/

The gocr help just says it works on "pnm,pgm,pbm,ppm,pcx..." files.
 https://jocr.sourceforge.net/examples.html
 https://www-e.ovgu.de/jschulen/ocr/download.html
 "Windows-binary gocr049.exe" v0.49 154kB by Peter B L Meijer, Oct 2010
 http://www-e.uni-magdeburg.de/jschulen/ocr/gocr049.exe

 Name: gocr049.exe
 Size: 153600 bytes (150 KiB)
 SHA256: 1FFC4CD29A5B275F40FBC5F6F9194ED72B8D2BCCBD46019F088C9E5DE2923F59

gocr049.exe
 Optical Character Recognition --- gocr 0.49 20100924
 Copyright (C) 2001-2010 Joerg Schulenburg GPG=1024D/53BDFBE3
 released under the GNU General Public License
 use option -h for help

gocr049.exe -h
 Optical Character Recognition --- gocr 0.49 20100924
 Copyright (C) 2001-2010 Joerg Schulenburg GPG=1024D/53BDFBE3
 released under the GNU General Public License
 using: gocr [options] pnm_file_name # use - for stdin
 options (see gocr manual pages for more details):
 -h, --help
 -i name - input image file (pnm,pgm,pbm,ppm,pcx,...)
 -o name - output file (redirection of stdout)
 -e name - logging file (redirection of stderr)
 -x name - progress output to fifo (see manual)
 -p name - database path including final slash (default is ./db/)
 -f fmt - output format (ISO8859_1 TeX HTML XML UTF8 ASCII)
 -l num - threshold grey level 0<160<=255 (0 = autodetect)
 -d num - dust_size (remove small clusters, -1 = autodetect)
 -s num - spacewidth/dots (0 = autodetect)
 -v num - verbose (see manual page)
 -c string - list of chars (debugging, see manual)
 -C string - char filter (ex. hexdigits: 0-9A-Fx, only ASCII)
 -m num - operation modes (bitpattern, see manual)
 -a num - value of certainty (in percent, 0..100, default=95)
 -u string - output this string for every unrecognized character
 examples:
 gocr -m 4 text1.pbm # do layout analyzis
 gocr -m 130 -p ./database/ text1.pbm # extend database
 djpeg -pnm -gray text.jpg | gocr - # use jpeg-file via pipe
 webpage: http://jocr.sourceforge.net/

When I tested it just now, it worked, but it's prone to spelling
errors even on perfectly good text. So while it works, it doesn't
work well.

a. I couldn't get gocr to convert a docx or pdf to anything:
    gocr049.exe -i "testpage.docx" -o testpage.txt -f UTF8
b. Then I couldn't get ImageMagick to convert the pdf to anything:
    convert testpage.pdf testpage.pnm
c. So I saved testpage.pdf as testpage.png for ImageMagick to convert:
    convert testpage.png testpage.pnm
d. Then I ran gocr on the resulting pnm:
    gocr049.exe -i "testpage.pnm" -o testpage.txt -f UTF8
   (it had a tremendous amount of spelling errors, but it worked)

As far as I'm aware, there is no other Windows OCR freeware extant.
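
For anyone who wants to repeat steps c and d without retyping them, here is a minimal Python sketch that chains the same two commands. It assumes ImageMagick's convert and gocr049.exe are both on the PATH. The function names build_commands and ocr_image are made up for illustration and are not part of gocr or ImageMagick. On ImageMagick 7 the command is usually "magick" rather than "convert", so the executable name is a parameter. As for step b, ImageMagick hands PDF reading off to Ghostscript, so a missing Ghostscript install is one possible reason it failed, though that is only a guess.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(image, gocr_exe="gocr049.exe", convert_exe="convert"):
    """Return the command lines for steps c and d: image -> PNM -> text.

    The executable names are assumptions; adjust them to match what is
    actually installed (e.g. convert_exe="magick" on ImageMagick 7).
    """
    image = Path(image)
    pnm = image.with_suffix(".pnm")   # intermediate file gocr can read
    txt = image.with_suffix(".txt")   # final OCR output
    to_pnm = [convert_exe, str(image), str(pnm)]
    to_text = [gocr_exe, "-i", str(pnm), "-o", str(txt), "-f", "UTF8"]
    return to_pnm, to_text


def ocr_image(image, **kwargs):
    """Run both steps in order and return the path of the text file."""
    for cmd in build_commands(image, **kwargs):
        # Fail early with a readable message if a tool is not installed.
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} was not found on the PATH")
        subprocess.run(cmd, check=True)  # raises if the tool exits non-zero
    return Path(image).with_suffix(".txt")
```

Calling ocr_image("testpage.png") should leave testpage.pnm and testpage.txt next to the input file. It does nothing for the recognition quality; it only saves typing.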