Creating searchable PDF with ExactImage 0.6
Friday, September 19th, 2008
ExactImage 0.6 now comes with a revamped PDF writer and a hocr2pdf front-end. Together with a patch to cuneiform that annotates each recognized glyph with an hOCR-like bounding box, it allows the creation of accurately positioned, searchable PDF files with open source software!
hocr2pdf reads the hOCR input from STDIN (we could also add a -h/--html option to read it from a file) and the image from the filename passed with -i/--input. The resulting PDF filename is specified with -o/--output.
Additionally, -s/--sloppy-text groups the words on a line, which sometimes improves search and cut'n'paste results with older PDF viewers. The -n/--no-image option skips the image, which normally covers the text, either to save storage space or to check how exactly the glyphs are positioned. The short introductory usage boils down to:
cuneiform -f hocr -o test.hocr ocr-test.tif
hocr2pdf -i ocr-test.tif -o test.pdf < test.hocr
And the searchable PDF is there. The cuneiform hOCR patch is now in the HEAD/tip of cuneiform's Bazaar repository on Launchpad.
It’s also already in use on the Archivista Box - a complete and open source long-term archiving solution.
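For repeated use, the two commands can be wrapped in a small shell function. The following is only a sketch: the function names pdf_name and make_searchable_pdf are made up for illustration, the intermediate .hocr file is simply written next to the image, and it assumes a cuneiform build that includes the hOCR patch. Only the options described above (-i, -o, -s, -n) are used.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of ExactImage.

# Derive the PDF name from the image name, e.g. ocr-test.tif -> ocr-test.pdf
pdf_name() {
    printf '%s.pdf\n' "${1%.*}"
}

# OCR one image with cuneiform, then build the searchable PDF with hocr2pdf.
# Extra arguments (such as -s or -n) are passed through to hocr2pdf.
make_searchable_pdf() {
    img=$1
    shift
    hocr="${img%.*}.hocr"
    cuneiform -f hocr -o "$hocr" "$img" &&
        hocr2pdf -i "$img" -o "$(pdf_name "$img")" "$@" < "$hocr"
}

# Usage, once the functions are defined:
#   make_searchable_pdf ocr-test.tif       # image with hidden text layer
#   make_searchable_pdf ocr-test.tif -s    # group the words on a line
#   make_searchable_pdf ocr-test.tif -n    # text only, to inspect glyph positions
```

Chaining the two commands with && means hocr2pdf is skipped when the OCR step fails, so a stale .hocr file from an earlier run is not silently reused.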