How To Make Tesseract Faster
Solution 1:
I also have huge OCR needs and Tesseract is prohibitively slow. I ended up going for a custom feedforward net similar to this one. You don't have to build it yourself, though; you can use a high-performance library like Nervana neon, which happens to be easy to use.
Then there are two parts to the problem:
1) Separate characters from non-characters.
2) Feed the characters to the net.
Let's say you feed characters in batches of size 1000, that you resize each character to dimensions 8 x 8 (64 pixels), and that you want to recognize 26 letters (both lowercase and uppercase), 10 digits, and 10 special characters (72 glyphs total). Then parsing all 1000 characters ends up being two (non-associative!) matrix products:
(A dot B) dot C.
A would be a 1000 x 64 matrix, B would be a 64 x 256 matrix, and C would be a 256 x 72 matrix.
For me, this is several orders of magnitude faster than Tesseract. Just benchmark how fast your computer can do those matrix products (the elements are floats).
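A rough way to run that benchmark is sketched below, assuming NumPy is installed. The function name `time_matrix_products`, the random float32 data, and the repeat count are all illustrative choices, not part of my original setup; the numbers you get depend entirely on your machine and BLAS build.

```python
import time

import numpy as np


def time_matrix_products(batch=1000, pixels=64, hidden=256, glyphs=72, repeats=100):
    """Return (average seconds per batch, output shape) for (A dot B) dot C."""
    rng = np.random.default_rng(0)
    # Random float32 stand-ins: A is the batch of flattened 8 x 8 characters,
    # B and C play the role of the two weight matrices.
    A = rng.random((batch, pixels), dtype=np.float32)
    B = rng.random((pixels, hidden), dtype=np.float32)
    C = rng.random((hidden, glyphs), dtype=np.float32)

    start = time.perf_counter()
    for _ in range(repeats):
        out = (A @ B) @ C
    elapsed = time.perf_counter() - start
    return elapsed / repeats, out.shape


if __name__ == "__main__":
    seconds, shape = time_matrix_products()
    print(f"output shape {shape}, about {seconds * 1e3:.3f} ms per batch of 1000 characters")
```

Dividing the per-batch time by 1000 gives a per-character figure you can compare against whatever Tesseract manages on the same machine.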
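The matrix products are non-associative because after the first one you have to apply a (cheap) elementwise function called a ReLU, which replaces every negative value with zero. That means you can't precompute B dot C and collapse the two products into one.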
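To make the role of the ReLU concrete, here is a minimal NumPy sketch of that forward pass. The helper names (`relu`, `forward`, `predict`) are hypothetical, the weights are hand-picked toy values rather than trained ones, and biases are left out because the description above only mentions the two products.

```python
import numpy as np


def relu(x):
    """Elementwise ReLU: negative entries become zero."""
    return np.maximum(x, 0.0)


def forward(A, B, C):
    """Compute relu(A dot B) dot C, one row of scores per input character."""
    hidden = relu(A @ B)   # (batch, pixels) x (pixels, hidden)
    return hidden @ C      # (batch, hidden) x (hidden, glyphs)


def predict(A, B, C):
    """Pick the index of the highest-scoring glyph for each character."""
    return np.argmax(forward(A, B, C), axis=1)


# Tiny example showing why the ReLU blocks regrouping the products.
A = np.array([[1.0, 1.0]])
B = np.array([[1.0, -1.0],
              [0.0, -1.0]])
C = np.array([[1.0],
              [1.0]])

with_relu = forward(A, B, C)   # A @ B = [1, -2], ReLU gives [1, 0], so the score is 1
without_relu = A @ (B @ C)     # plain regrouped product gives -1
print(with_relu, without_relu)
```

With real data, A would be the 1000 x 64 batch and B, C the trained 64 x 256 and 256 x 72 weight matrices, so `predict` would return 1000 glyph indices.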
It took me a few months to get this whole enchilada to work from scratch, but OCR was a major part of my project.
Also, segmenting characters is non-trivial. Depending on your PDFs, it can be anything from an easy exercise in computer vision to an open research problem in artificial intelligence.
I'm not claiming this is the easiest or most effective way to do this... This is simply what I did!