Tuesday, July 03, 2007

Installing tesseract command-line OCR on Mac OS X

Installing libpng from source:
http://kenno.wordpress.com/2006/04/20/compiling-libpng-for-mac-os-x/

Install libjpeg, aspell and aspell-en with fink:
fink install libjpeg aspell aspell-en

I will want to create my own aspell dictionary using taxonomic names:
http://www.mail-archive.com/code4lib@listserv.nd.edu/msg01545.html
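
A minimal sketch of preparing such a word list, assuming the dictionary is built from a plain list of single words, one per line (check the linked post for the exact aspell steps). The function name names_to_wordlist and the sample names are hypothetical, chosen only for illustration.

```python
import re

def names_to_wordlist(names):
    """Split taxonomic names into unique single words for a word list."""
    words = set()
    for name in names:
        for token in name.split():
            # Keep purely alphabetic tokens; this drops authorities such as "L."
            if re.fullmatch(r"[A-Za-z]+", token):
                words.add(token)
    return sorted(words)

# Hypothetical sample names, not taken from the real dictionary.
sample = ["Marchantia polymorpha L.", "Physcomitrella patens", "Marchantia paleacea"]
wordlist = names_to_wordlist(sample)
print("\n".join(wordlist))
```

Writing the result to a file, one word per line, gives an input that can then be fed to whichever dictionary-building step is used.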

Download and install tesseract, following the install instructions:
http://code.google.com/p/tesseract-ocr/downloads/list

Install xpdf with fink to get pdfimages, which extracts images from a PDF:
pdfimages -j LandPlants_paper.pdf LandPlantImg

To convert to an uncompressed TIFF for tesseract with ImageMagick (pdfimages numbers its output files, e.g. LandPlantImg-000.jpg):
convert LandPlantImg-000.jpg -compress None test.tif

Using tesseract (it appends .txt to the output name, so this writes out.txt):
tesseract test.tif out
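
The three commands above can be chained from a small script. This is only a sketch: it assumes pdfimages, convert and tesseract are on the PATH, that the first extracted image is named LandPlantImg-000.jpg, and that tesseract takes an output base name. The helper names build_commands and run_pipeline are made up for this example.

```python
import subprocess

def build_commands(pdf_path, image_root, page_image, tif_path, out_base):
    """Return the argument lists for the extract, convert and OCR steps."""
    return [
        # Extract embedded images, keeping JPEGs as JPEG (-j).
        ["pdfimages", "-j", pdf_path, image_root],
        # Convert one extracted image to an uncompressed TIFF.
        ["convert", page_image, "-compress", "None", tif_path],
        # Run OCR on the TIFF; tesseract adds the .txt extension itself.
        ["tesseract", tif_path, out_base],
    ]

def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)

commands = build_commands(
    "LandPlants_paper.pdf", "LandPlantImg",
    "LandPlantImg-000.jpg", "test.tif", "out",
)
```

Calling run_pipeline(commands) would execute the steps; building the argument lists separately makes it easy to print and check them first.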

I have now got a script to extract the names and check them against a dictionary of taxonomic names from spira.
I am thinking that using information from the article itself might provide even better results. When tesseract 2.0 comes out, there will also be a way of training the program to improve the character recognition. OCRopus also looks like an interesting program for layout detection, but it doesn't work on Mac OS X yet.
The line extraction is proving to be much more difficult than I first thought, mainly because of the lack of a consistent format and the labelling at the nodes, which gets in the way of edge detection. I have tried a number of methods for cleaning up the image, and bit by bit I will get there, I hope.
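
For the name check mentioned above, here is a rough sketch of one way to do it. It is not the actual script: it assumes names look like "Genus species" pairs and uses Python's difflib for fuzzy matching, and the dictionary entries, OCR text and function names are all hypothetical.

```python
import difflib
import re

def extract_candidates(ocr_text):
    """Find capitalised word + lowercase word pairs that look like binomials."""
    return re.findall(r"\b[A-Z][a-z]+ [a-z]+\b", ocr_text)

def check_names(candidates, dictionary, cutoff=0.8):
    """Map each candidate to its closest dictionary name, or None."""
    results = {}
    for cand in candidates:
        # n=1 keeps only the best match at or above the similarity cutoff.
        matches = difflib.get_close_matches(cand, dictionary, n=1, cutoff=cutoff)
        results[cand] = matches[0] if matches else None
    return results

# Hypothetical dictionary and OCR output with one misread character.
dictionary = ["Marchantia polymorpha", "Physcomitrella patens"]
ocr_text = "Marchantla polymorpha\nPhyscomitrella patens\nZea mays"

result = check_names(extract_candidates(ocr_text), dictionary)
for found, corrected in result.items():
    print(found, "->", corrected)
```

The cutoff is a judgment call: a lower value corrects more OCR errors but risks matching unrelated names.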





