René Rebe

OCRKit 2.6 - how we wrote our own PDF lib

We just released another OCRKit milestone release: OCRKit 2.6.

The biggest change in this new version is the switch to our own -written from scratch- PDF library.

Years ago, when we started ExactScan for Mac around 2005 we already wrote PDF ourselves. Back in that day we already had an own image processing library, so why depend on some proprietary, platform specific code? We just implemented writing standard conforming files ourselves. Problem solved.

However, writing PDF files is relatively easy. Reading, and correctly rasterizing PDF files is a really complex challenge. Mostly because the PDF standard is over 1000 pages long, and the various compression formats, encryption, 3D objects, forms, annotations, etc. pp. makes it even more difficult to implement all the various combinations of that.

Initially we were only interested to deliver an awesome OCR, and thus we decided to get the page images by handing the PDF to the Mac OS X frameworks and get the rasterized page back. While this “solved the problem”, it came with some drawbacks. For example we do not know what is on the page. Just one image? Everything black and white? 200dpi, or 300? All implementation details are completely hidden from us. And worst of all the introduction of HiDPI Retina scaling altered the results so that we got 2x scaled and clipped images, and had to rewrite part of the code interfacing with Mac OS X to compensate for this.

In the meantime we are not only interested about Mac apps anymore, more and more customers ask for Windows, and even Linux solutions. And for neither Windows nor Linux we could re-use this PDF interfacing code. For classic Windows there is no system-wide PDF support, and for Linux we could base on some ghostscript, poppler or so.

However, we wanted full control about the feature set, behaviour, and not source in other’s security issues and bugs into our apps, … and something that just works on Windows, too. So we decided we better start our own PDF parsing and rasterization code. It was quite some effort, but the results start to pay off: We have a much, much better understanding of the PDF internals, and more powerful, faster code that just works on any-OS.

Our new Windows version is obviously using it since the beginning (in fact it already processed hundreds of thousands, TBs of PDFs at customers already).

With OCRKit 2.6 the same new code now comes to the Mac version. Improving program behaviour, e.g. not to rely on our previous color detection to determine if a page was black & white, gray or color, and thus retain the exact image appearance, compression, and in some cases vastly speed up the processing. For example on a 2,3 GHz Intel Core i7 15″ Retina MacBook Pro a 44 page test file is decompressed, OCR’ed and re-written in just 16 seconds - down from over 40 seconds in the previous version of OCRKit. This is over two times as fast (and OCRKit already was fast)! And bringing down OCR time on this multi-core machine from approximately one second per page, to 0.36 seconds (yes, zero point three six - nearly a third of a second!) per page!!!

And best of all: OCRKit v2.6 is still a free update for all our existing OCRKit users since 2010. Enjoy and spread the word!

OCRKit - Recognition revisited.

This entry was posted on Monday, June 30th, 2014 at 15:16 and is filed under Software, Life. You can follow any responses to this entry through the RSS 2.0 feed. You can skip to the end and leave a response. Pinging is currently not allowed.

You must be logged in to post a comment.

OCRKit 2.6 - how we wrote our own PDF lib

Leave a Reply