Path: news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Rich <rich@example.invalid>
Newsgroups: sci.crypt
Subject: Re: How good is Linux OCR?
Date: Sun, 8 Jun 2025 23:03:14 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 45
Message-ID: <10254ri$38nl$1@dont-email.me>
References: <1023in3$3djnq$1@news.tcpreset.net> <1024uvt$2iq7$1@dont-email.me> <1024vhq$3fnkt$1@news.tcpreset.net>
Injection-Date: Mon, 09 Jun 2025 01:03:17 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="75d37cf7f27322b6b5d6c7277ce0e4b1";
	logging-data="107253"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19YoB27AnTGmG2nVwyEU40l"
User-Agent: tin/2.6.1-20211226 ("Convalmore") (Linux/5.15.139 (x86_64))
Cancel-Lock: sha1:Zduyr63oUUGhtEPO3WrQ8l8v2aQ=

Stefan Claas <stefan@mailchuck.com> wrote:
> Rich wrote:
> 
>> Note that Tesseract will (I think) compile for windows too, so if you 
>> wanted to know "how well tesseract worked" you could just install the 
>> windows version and see for yourself.
> 
> I tried tesseract under Linux. It is horrible, because of to many errors.

Fair enough.  The Windows version will do the same.

Two other options I'm aware of for Linux:

http://slackbuilds.org/repository/15.0/office/gocr/

http://slackbuilds.org/repository/15.0/libraries/cuneiform/

I have never used either, so I can't comment on how well they work.

Your original image, however, is one that will be hard to OCR, so it is 
quite amazing that whatever OCR engine MS supplies is actually able to 
convert it with some accuracy.

If your goal is to store binary data (keys/messages) as these text 
strings, then you should also consider that OCR engines often confuse 
similar-looking characters.  I've seen 5 (five) become S (letter ess) 
or 1 (one) become I (letter eye).  I'm not sure I've seen I become 1, 
but it is possible, esp. with a font with little to no difference 
between those glyphs.

O (letter oh) and 0 (numeral zero) are often confused for each other as 
well.

So you might want to restrict your character set to not include the 
"easy to confuse" letter pairs.  If they don't exist on the "printouts" 
then they can't be confused for each other.
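
A rough sketch of that idea in Python, in case it helps.  The 
16-symbol alphabet and the encode/decode names are my own invention 
for illustration, not any standard, and I have not run this against 
real OCR output.  It maps each 4-bit nibble to one symbol and leaves 
out 0/O, 1/I/l and 5/S:

```python
# Hypothetical 16-symbol alphabet, one symbol per 4-bit nibble.
# It deliberately omits 0/O, 1/I/l and 5/S, the pairs mentioned above.
ALPHABET = "2346789ACEFHKPRT"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode(data: bytes) -> str:
    """Turn each byte into two symbols (high nibble first)."""
    return "".join(ALPHABET[b >> 4] + ALPHABET[b & 0x0F] for b in data)


def decode(text: str) -> bytes:
    """Reverse encode(); whitespace is ignored, unknown symbols are errors."""
    symbols = [ch for ch in text.upper() if not ch.isspace()]
    if len(symbols) % 2:
        raise ValueError("odd number of symbols")
    try:
        nibbles = [INDEX[ch] for ch in symbols]
    except KeyError as exc:
        raise ValueError(f"symbol outside the alphabet: {exc.args[0]!r}") from None
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))
```

A side effect of the restricted set is that if the OCR engine does 
emit an O, I or S, decode() rejects it, so at least you find out that 
the scan went wrong instead of silently getting the wrong key.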

As an alternative, there are also the "OCR-A" 
(https://en.wikipedia.org/wiki/OCR-A) and "OCR-B" 
(https://en.wikipedia.org/wiki/OCR-B) fonts, which were designed to be 
easy for early OCR engines to read.  Either might still be "easier to 
read" even though OCR engines have progressed since those fonts were 
created.