Path: news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Rich <rich@example.invalid>
Newsgroups: sci.crypt
Subject: Re: How good is Linux OCR?
Date: Sun, 8 Jun 2025 23:03:14 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 45
Message-ID: <10254ri$38nl$1@dont-email.me>
References: <1023in3$3djnq$1@news.tcpreset.net> <1024uvt$2iq7$1@dont-email.me> <1024vhq$3fnkt$1@news.tcpreset.net>
Injection-Date: Mon, 09 Jun 2025 01:03:17 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="75d37cf7f27322b6b5d6c7277ce0e4b1";
logging-data="107253"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19YoB27AnTGmG2nVwyEU40l"
User-Agent: tin/2.6.1-20211226 ("Convalmore") (Linux/5.15.139 (x86_64))
Cancel-Lock: sha1:Zduyr63oUUGhtEPO3WrQ8l8v2aQ=
Stefan Claas <stefan@mailchuck.com> wrote:
> Rich wrote:
>
>> Note that Tesseract will (I think) compile for windows too, so if you
>> wanted to know "how well tesseract worked" you could just install the
>> windows version and see for yourself.
>
> I tried tesseract under Linux. It is horrible, because of to many errors.
Fair enough. The Windows version will behave the same.
Two other options I'm aware of for Linux:
http://slackbuilds.org/repository/15.0/office/gocr/
http://slackbuilds.org/repository/15.0/libraries/cuneiform/
I have never used either, so I can't comment on how well they work.
Your original image, however, is one that will be hard to OCR, so it is
quite amazing that whatever OCR engine MS supplies is actually able to
convert it with some accuracy.
If your goal is to store binary data (keys/messages) as these text
strings, then you also want to consider the fact that many OCR engines
confuse similar-looking characters. I've seen 5 (five) become S
(letter ess) and 1 (one) become I (letter eye). I'm not sure I've seen
I become 1, but it is possible, esp. with a font with little to no
difference between those glyphs.
O (letter oh) and 0 (numeral zero) are often confused for each other as
well.
So you might want to restrict your character set to not include the
"easy to confuse" letter pairs. If they don't exist on the "printouts"
then they can't be confused for each other.
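A rough sketch of that idea in Python, in case it helps. The alphabet
below is just one I made up for illustration (16 symbols, leaving out
0/O, 1/I and 5/S), and encode()/decode() are hypothetical helper names,
not part of any OCR tool. Each byte becomes two symbols, so the output
is twice the size of the input:

```python
# Hypothetical 16-symbol alphabet (an assumption for this sketch):
# it leaves out 0/O, 1/I and 5/S, the pairs mentioned above.
ALPHABET = "2346789ACEFHJKMN"
DECODE = {ch: i for i, ch in enumerate(ALPHABET)}


def encode(data: bytes) -> str:
    """Map each byte to two symbols: high nibble first, then low nibble."""
    return "".join(ALPHABET[b >> 4] + ALPHABET[b & 0x0F] for b in data)


def decode(text: str) -> bytes:
    """Inverse of encode(). Whitespace is ignored; any symbol outside
    the alphabet raises ValueError, which flags a likely OCR misread."""
    symbols = [ch for ch in text.upper() if not ch.isspace()]
    if len(symbols) % 2:
        raise ValueError("odd number of symbols")
    try:
        values = [DECODE[ch] for ch in symbols]
    except KeyError as exc:
        raise ValueError(f"symbol not in alphabet: {exc.args[0]!r}") from None
    return bytes((hi << 4) | lo for hi, lo in zip(values[0::2], values[1::2]))


if __name__ == "__main__":
    key = bytes.fromhex("00ff10a5")
    printed = encode(key)
    print(printed)
    assert decode(printed) == key
```

A side benefit: if the OCR engine does hand back an O, I, S, 0, 1 or 5,
decode() rejects the string outright instead of silently producing the
wrong bytes. It does not catch a misread that lands on another symbol
in the alphabet, so a checksum would still be worth adding.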
As an alternative, there are also the "OCR-A"
(https://en.wikipedia.org/wiki/OCR-A) and "OCR-B"
(https://en.wikipedia.org/wiki/OCR-B) fonts, which were designed to be
easy for early OCR engines to read. Either might still be "easier to
read" even though OCR engines have progressed since those fonts were
created.