How to do OCR from the Linux command line using Tesseract

A terminal window on a Linux laptop.
Fatmawati Achmad Zaenuri / Shutterstock

You can extract text from images on the Linux command line using the Tesseract OCR engine. It’s fast, accurate, and works in approximately 100 languages. Here’s how to use it.

Optical character recognition

Optical character recognition (OCR) is the ability to look at and find words in an image and then extract them as editable text. This simple task for humans is very difficult for computers to perform. The first efforts were clumsy, to say the least. Computers were often confused if the font or size was not to the liking of the OCR software.

However, the pioneers in this field were still highly esteemed. If you lost an electronic copy of a document but still had a printed version, OCR could recreate an electronic, editable version. Even if the results weren’t 100 percent accurate, this was a huge time saver.

With a little manual tidying up, you would get your document back. People forgave the mistakes the software made because they understood the complexity of the task facing an OCR package. Also, it was better than retyping the entire document.

Things have improved significantly since then. The Tesseract OCR application, written by Hewlett Packard, started in the 1980s as a commercial application. It was open-sourced in 2005 and is now supported by Google. It has multi-language capabilities, is considered one of the most accurate OCR systems available, and you can use it for free.

Installing Tesseract OCR

To install Tesseract OCR on Ubuntu, use this command:

sudo apt-get install tesseract-ocr

In Fedora, the command is:

sudo dnf install tesseract

In Manjaro, you must type:

sudo pacman -Syu tesseract
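
Whichever distribution you use, a quick check confirms that the install worked. The sketch below assumes only that the package puts a tesseract command on your PATH; the check_tesseract function name is our own and is not part of Tesseract.

```shell
# Report the installed Tesseract version, or explain that it is missing.
# check_tesseract is a hypothetical helper name used for this example.
check_tesseract() {
    if command -v tesseract >/dev/null 2>&1; then
        # The first line of the --version output names the release.
        tesseract --version 2>&1 | head -n 1
        return 0
    fi
    echo "tesseract not found; install it with your package manager" >&2
    return 1
}

check_tesseract || true
```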

Using Tesseract OCR

We are going to pose a number of challenges for Tesseract OCR. Our first image containing text is an extract from Recital 63 of the General Data Protection Regulation (GDPR). Let’s see if OCR can read this (and stay awake).

Extract from Recital 63 of the GDPR.

It is a tricky image because each sentence begins with a faint superscript number, which is typical in legislative documents.

We need to give the tesseract command certain information, including:

  • The name of the image file that we want it to process.
  • The name of the text file that it will create to contain the extracted text. We do not have to provide the file extension (it will always be .txt). If a file with the same name already exists, it will be overwritten.
  • We can use the --dpi option to tell tesseract what the resolution of the image is, in dots per inch (dpi). If we don’t provide a dpi value, tesseract will try to work it out.

Our image file is called “recital-63.png” and its resolution is 150 dpi. We are going to create a text file called “recital.txt”.

Our command looks like this:

tesseract recital-63.png recital --dpi 150

The results are very good. The only problem is the superscripts: they were too faint to read correctly. A good quality image is essential for good results.
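
If you only want to glance at the result, tesseract accepts the special output name stdout in place of a file name and prints the text in the terminal instead of writing a .txt file. The sketch below wraps that in a small function; ocr_to_terminal is a hypothetical name, and the 150 dpi default is an assumption taken from the image used in this article.

```shell
# Print the text found in an image straight to the terminal.
# Usage: ocr_to_terminal IMAGE [DPI]
ocr_to_terminal() {
    if [ ! -f "$1" ]; then
        echo "no such image: $1" >&2
        return 1
    fi
    # "stdout" replaces the output base name, so no .txt file is written.
    tesseract "$1" stdout --dpi "${2:-150}"
}

# Example call, run only when the article's image and tesseract are present.
if [ -f recital-63.png ] && command -v tesseract >/dev/null 2>&1; then
    ocr_to_terminal recital-63.png 150
fi
```

Because the text goes to standard output, it can also be piped into tools such as less or grep.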

Text taken from recital 63.

tesseract has interpreted the superscript numbers as quotation marks (“) and degree symbols (°), but the actual text has been extracted perfectly (the right side of the image had to be cropped to fit here).

The final character is a byte with the hexadecimal value of 0x0C, which is a form feed (page break) character.
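
The trailing 0x0C byte is the ASCII form feed character, and it is easy to inspect or remove with standard tools. The following is a minimal sketch that uses a stand-in file rather than real tesseract output; the function names last_byte_hex and strip_form_feeds are our own.

```shell
# Show the last byte of a file as two hexadecimal digits.
last_byte_hex() {
    tail -c 1 "$1" | od -An -tx1 | tr -d ' \n'
}

# Copy file $1 to file $2, dropping every form feed (0x0C) byte.
strip_form_feeds() {
    tr -d '\f' < "$1" > "$2"
}

# Demonstration on a stand-in file (not real tesseract output).
demo_dir=$(mktemp -d)
printf 'Sample extracted text\n\f' > "$demo_dir/sample.txt"
last_byte_hex "$demo_dir/sample.txt"; echo      # prints 0c
strip_form_feeds "$demo_dir/sample.txt" "$demo_dir/clean.txt"
last_byte_hex "$demo_dir/clean.txt"; echo       # prints 0a
```

On a real run you would pass “recital.txt” as the first argument instead of the sample file.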

Below is another image with text in different sizes, both bold and italic.

Image with different sizes of bold and italic text.

The name of this file is “bold-italic.png”. We want to create a text file called “bold.txt”, so our command is:

tesseract bold-italic.png bold --dpi 150

This was not a problem and the text was perfectly extracted.

Using different languages

Tesseract OCR supports about 100 languages. To use a language, you must first install it. When you find the language you want to use in the list of supported languages, make a note of its abbreviation. We are going to install support for Welsh. Its abbreviation is “cym”, which is short for “Cymraeg”, the Welsh word for the Welsh language.

The installation package is called “tesseract-ocr-” with the language abbreviation tagged onto the end. To install the Welsh language file on Ubuntu, we will use:

sudo apt-get install tesseract-ocr-cym
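
To confirm that a language file is available, tesseract can list its installed languages with the --list-langs option. The sketch below checks that list for “cym”; the lang_in_list helper is a hypothetical name, and the 2>&1 is there on the assumption that some versions print the list on standard error.

```shell
# Succeed if language code $1 appears as a whole line on standard input.
lang_in_list() {
    grep -qx -- "$1"
}

if command -v tesseract >/dev/null 2>&1; then
    if tesseract --list-langs 2>&1 | lang_in_list cym; then
        echo "cym is installed"
    else
        echo "cym is missing; install tesseract-ocr-cym first"
    fi
fi
```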

The image with the text is below. It is the first stanza of the Welsh national anthem.

Image containing the text of the first verse of the national anthem of Wales.

Let’s see if Tesseract OCR is up to the challenge. We will use the -l (language) option to let tesseract know the language in which we want to work:

tesseract hen-wlad-fy-nhadau.png anthem -l cym --dpi 150

tesseract copes perfectly, as shown in the extracted text below. Well done, Tesseract OCR.

Excerpted Welsh text.

If your document contains two or more languages (like a Welsh to English dictionary, for example), you can use a plus sign (+) to tell tesseract to add another language, like so:

tesseract image.png textfile -l eng+cym+fra
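
If the list of languages comes from a script, the plus-separated value can be built rather than typed by hand. This is a small sketch; join_langs is a hypothetical helper, and the final line only prints the command so you can check it before running it.

```shell
# Join language codes with "+" to form the value for the -l option.
# The function body runs in a subshell so the IFS change does not leak out.
join_langs() (
    IFS=+
    printf '%s\n' "$*"
)

langs=$(join_langs eng cym fra)
echo "tesseract image.png textfile -l $langs"
# prints: tesseract image.png textfile -l eng+cym+fra
```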

Using Tesseract OCR with PDF files

The tesseract command is designed to work with image files; it cannot read PDF files. However, if you need to extract text from a PDF, you can use another utility first to generate a set of images. A single image will represent a single page of the PDF.

The pdftoppm utility you need should already be installed on your Linux computer. The PDF we will use for our example is a copy of Alan Turing’s seminal article on artificial intelligence, “Computing Machinery and Intelligence.”

PDF of the cover of "Computing Machinery and Intelligence" by A. M. Turing.

We use the -png option to specify that we want to create PNG files. The file name of our PDF is “turing.pdf”. We will call our image files “turing-01.png”, “turing-02.png”, and so on:

pdftoppm -png turing.pdf turing
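
pdftoppm also has options to limit the page range (-f and -l for the first and last page) and to set the image resolution in dpi (-r). The sketch below combines them; the pdf_pages_to_png name and the 300 dpi default are assumptions chosen for illustration, not settings used in this article.

```shell
# Convert a page range of a PDF to PNG files at a chosen resolution.
# Usage: pdf_pages_to_png FILE.pdf PREFIX FIRST LAST [DPI]
pdf_pages_to_png() {
    if [ ! -f "$1" ]; then
        echo "no such PDF: $1" >&2
        return 1
    fi
    # -f and -l select the first and last page; -r sets the resolution in dpi.
    pdftoppm -png -r "${5:-300}" -f "$3" -l "$4" "$1" "$2"
}

# Example call, run only when the PDF and pdftoppm are present.
if [ -f turing.pdf ] && command -v pdftoppm >/dev/null 2>&1; then
    pdf_pages_to_png turing.pdf turing 1 2 300
fi
```

If you render the pages at a resolution other than the one you normally use, pass the same value to tesseract with --dpi.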

To run tesseract on each image file using a single command, we need to use a for loop. For each of our “turing-nn.png” files, we run tesseract and create a text file called “text-” plus the “turing-nn” part of the image file name:

for i in turing-??.png; do tesseract "$i" "text-${i%.png}" -l eng; done

To combine all the text files into one, we can use cat:

cat text-turing* > complete.txt
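
The loop and the cat step can be wrapped in one reusable function. This is a sketch under a few assumptions: the ocr_pages name is our own, the images sit in the current directory, the page numbers are zero-padded so that the shell sorts them in page order, and the text is in English.

```shell
# OCR every PNG that matches a prefix, then combine the results in page order.
# Usage: ocr_pages PREFIX OUTPUT.txt
ocr_pages() {
    for img in "$1"-*.png; do
        [ -f "$img" ] || continue      # the pattern matched no files
        # ${img%.png} drops the extension: turing-01.png gives text-turing-01.txt
        tesseract "$img" "text-${img%.png}" -l eng || return 1
    done
    # The shell expands the pattern in sorted order, so pages stay in sequence.
    cat text-"$1"-*.txt > "$2"
}

# Example call, run only when page images and tesseract are present.
if [ -f turing-01.png ] && command -v tesseract >/dev/null 2>&1; then
    ocr_pages turing complete.txt
fi
```

The function stops at the first page that tesseract fails on, so a partial “complete.txt” is not written by mistake.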

So how did it do? Very well, as you can see below. The first page looks quite challenging, though. It has different styles and sizes of text, as well as decoration. There is also a vertical “watermark” on the right edge of the page.

Even so, the output is close to the original. Obviously the formatting was lost, but the text is correct.

First page of text extracted from Turing PDF.

The vertical watermark was transcribed as a line of gibberish at the bottom of the page. The text was too small for tesseract to read accurately, but that line would be pretty easy to find and remove. The worst result would have been stray characters at the end of each line.

Interestingly, the individual letters at the beginning of the question and answer list on page two have been ignored. The relevant section of the PDF is shown below.

A list of questions and answers from the PDF of the Turing document.

As you can see below, the questions remain, but the “Q” and “A” at the beginning of each line were lost.

Text taken from the question and answer page of the Turing PDF.

The diagrams will not be transcribed correctly either. Let’s see what happens when we try to extract the one shown below from the Turing PDF.

A diagram of "Input" and "Last state" from the Turing PDF.

As you can see from our output below, the characters were read, but the diagram format was lost.

Text extracted from a diagram in the Turing PDF.

Again, tesseract struggled with the small size of the subscripts, and they were rendered incorrectly.

In fairness, it was still a good result. The diagram did not come out as clean, straightforward text, but this example was deliberately chosen because it presented a challenge.

A good solution when you need it

OCR is not something you will need to use every day. However, when the need arises, it’s good to know that you have one of the best OCR engines at your disposal.
