How To Make Tesseract Faster

March 09, 2024

This is a long shot, but I have to ask. I need any ideas that might make the Tesseract OCR engine faster. I'm processing 2M PDFs consisting of about 20M pages of text, and I need to ge

Solution 1:

I also have huge OCR needs and Tesseract is prohibitively slow. I ended up going for a custom feedforward net similar to this one. You don't have to build it yourself, though; you can use a high-performance library like Nervana neon, which happens to be easy to use.

Then there are two parts to the problem:

1) Separate characters from non-characters.
2) Feed characters to the net.

Let's say you feed characters in batches of 1000, that you resize each character to 8 x 8 (64 pixels), and that you want to recognize the 26 letters in both lowercase and uppercase (52), plus 10 digits and 10 special characters (72 glyphs total). Then parsing all 1000 characters ends up being two (non-associative!) matrix products:

(A dot B) dot C.

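A would be a 1000 x 64 matrix (one row of pixels per character), B would be a 64 x 256 matrix (input-to-hidden weights), C would be a 256 x 72 matrix (hidden-to-output weights, one column per glyph).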

For me, this is several orders of magnitude faster than Tesseract. Just benchmark how fast your computer can do those matrix products (the elements are floats).

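The matrix products are non-associative because after the first one you have to apply a (cheap) element-wise function called a ReLU, which replaces every negative entry with zero. Plain matrix multiplication is associative, but with the ReLU in between you can't precompute B dot C once and reuse it; you have to compute A dot B, apply the ReLU, and only then multiply by C.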
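To make the shapes concrete, here is a minimal NumPy sketch of that computation together with a rough timing loop. It is not the net I actually used: the weights are random and untrained, bias terms are left out, and the names forward and benchmark are just made up for this example. It only shows how much work one batch of 1000 characters is.

```python
import time

import numpy as np

# Sizes taken from the description above.
BATCH, PIXELS, HIDDEN, GLYPHS = 1000, 64, 256, 72


def forward(A, B, C):
    """Return glyph scores for a batch: relu(A dot B) dot C."""
    hidden = np.maximum(A @ B, 0.0)  # ReLU sits between the two products
    return hidden @ C


def benchmark(repeats=100, seed=0):
    """Time the forward pass on random float32 data (untrained weights)."""
    rng = np.random.default_rng(seed)
    A = rng.random((BATCH, PIXELS), dtype=np.float32)  # stand-in for pixel data
    B = rng.standard_normal((PIXELS, HIDDEN)).astype(np.float32)
    C = rng.standard_normal((HIDDEN, GLYPHS)).astype(np.float32)

    forward(A, B, C)  # warm-up run, not timed
    start = time.perf_counter()
    for _ in range(repeats):
        scores = forward(A, B, C)
    elapsed = time.perf_counter() - start
    return scores, elapsed / repeats


if __name__ == "__main__":
    scores, seconds = benchmark()
    predicted = scores.argmax(axis=1)  # index of the best-scoring glyph per character
    print(f"{scores.shape} scores, {seconds * 1e3:.3f} ms per batch of {BATCH} characters")
```

With trained weights, the argmax over each row of the result would be the predicted glyph for that character. The per-batch time depends entirely on your machine and BLAS build, so measure it rather than trusting anyone's numbers.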

It took me a few months to get this whole enchilada to work from scratch, but OCR was a major part of my project.

Also, segmenting characters is non-trivial. Depending on your PDFs, it can be anything from an easy exercise in computer vision to an open research problem in artificial intelligence.
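For the easy end of that range, one common starting point is connected-component labeling on a binarized page. The sketch below is an assumption about what "easy" could look like, not what I shipped: it takes a page already thresholded to 0/1 (1 meaning ink), and find_glyph_boxes is a hypothetical helper name. It will split the dot off an "i" and merge characters that touch, which is exactly where things start getting hard.

```python
from collections import deque


def find_glyph_boxes(image):
    """Return bounding boxes (top, left, bottom, right) of 4-connected ink blobs.

    `image` is a list of rows of 0/1 values, where 1 means ink.
    Boxes are inclusive and sorted left to right.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []

    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))

    return sorted(boxes, key=lambda box: box[1])


if __name__ == "__main__":
    page = [
        [1, 1, 0, 0, 1],
        [1, 0, 0, 0, 1],
        [0, 0, 1, 0, 1],
    ]
    for box in find_glyph_boxes(page):
        print(box)
```

Each box would then be cropped, resized to 8 x 8 and flattened into one 64-pixel row of the batch that gets fed to the net.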

I'm not claiming this is the easiest or most effective way to do this... This is simply what I did!
