Archive for June, 2014

OCRKit 2.6 - how we wrote our own PDF lib

Monday, June 30th, 2014

We just released another OCRKit milestone release: OCRKit 2.6.

The biggest change in this new version is the switch to our own, written-from-scratch PDF library.

Years ago, when we started ExactScan for Mac around 2005, we already wrote PDF files ourselves. Back in those days we already had our own image processing library, so why depend on some proprietary, platform-specific code? We simply implemented writing standard-conforming files ourselves. Problem solved.
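
Writing a standard-conforming PDF by hand really is manageable. As a rough illustration (this is not our actual library code, just a minimal sketch in Python), a valid single-page PDF with a correct cross-reference table can be emitted like this:

```python
def write_minimal_pdf(path):
    # Three objects: the document catalog, the page tree, and one empty page.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for i, body in enumerate(objects, start=1):
        offsets.append(len(out))            # byte offset of "i 0 obj"
        out += b"%d 0 obj\n%s\nendobj\n" % (i, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"         # entry 0: head of the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off    # each xref entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1, xref_pos)
    with open(path, "wb") as f:
        f.write(out)
    return bytes(out)
```

The only fiddly part is that the xref table stores exact byte offsets, so the file has to be assembled in order; everything else is plain text dictionaries.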

However, writing PDF files is relatively easy. Reading, and correctly rasterizing, PDF files is a really complex challenge, mostly because the PDF standard is over 1000 pages long, and the various compression formats, encryption, 3D objects, forms, annotations, etc. make it even more difficult to implement all the possible combinations.

Initially we were only interested in delivering an awesome OCR, and thus we decided to get the page images by handing the PDF to the Mac OS X frameworks and getting the rasterized pages back. While this “solved the problem”, it came with some drawbacks. For example, we did not know what was on a page. Just one image? Everything black and white? 200 dpi, or 300? All implementation details were completely hidden from us. And worst of all, the introduction of HiDPI Retina scaling altered the results so that we got 2x-scaled and clipped images, and had to rewrite part of the code interfacing with Mac OS X to compensate for this.

In the meantime we are no longer interested in Mac apps only; more and more customers ask for Windows, and even Linux, solutions. And on neither Windows nor Linux could we reuse this PDF interfacing code. Classic Windows has no system-wide PDF support, and on Linux we could only have built on something like Ghostscript or Poppler.

However, we wanted full control over the feature set and behaviour, did not want to pull others’ security issues and bugs into our apps, … and wanted something that just works on Windows, too. So we decided we had better start our own PDF parsing and rasterization code. It was quite some effort, but the results are starting to pay off: we have a much, much better understanding of the PDF internals, and more powerful, faster code that just works on any OS.

Our new Windows version has obviously used it from the beginning (in fact it has already processed hundreds of thousands of PDFs, terabytes of data, at customer sites).

With OCRKit 2.6 the same new code now comes to the Mac version. It improves program behaviour, e.g. we no longer rely on our previous color detection to determine whether a page was black and white, gray, or color, and thus retain the exact image appearance and compression, and in some cases vastly speed up the processing. For example, on a 2.3 GHz Intel Core i7 15″ Retina MacBook Pro, a 44-page test file is decompressed, OCR’ed and re-written in just 16 seconds, down from over 40 seconds in the previous version of OCRKit. That is more than twice as fast (and OCRKit already was fast)! It brings the OCR time on this multi-core machine down from approximately one second per page to 0.36 seconds (yes, zero point three six, nearly a third of a second!) per page!

And best of all: OCRKit v2.6 is still a free update for all our existing OCRKit users since 2010. Enjoy and spread the word!

OCRKit - Recognition revisited.

Deduplicating the Internet

Friday, June 27th, 2014

So we learned that the NSA and their FVEY (Five Eyes) friends, plus other fellows such as the German BND et al., are effectively making their own copy of all the data going through their optical fibre splitters and such.

Now all this is of course seriously bad, unconstitutional, and creates exactly the police state Orwell already imagined in the now-famous 1984, and that we could already see in East Germany’s (GDR) Stasi.

This anti-democratic setup and those questions aside, … they are effectively duplicating most (if not all, minus the YouTube video streams) of the data on the internet.

Now, imagine, just for a brief, unlikely moment, that they would stop doing this. This would effectively free up a whole lot of bandwidth, roughly doubling the Internet’s capacity, and make everything fast and snappy. Imagine how many 4k video streams that would be!

And, actually, I also wonder how many connection issues all this surveillance causes. Certainly not the optical fibre splitters, but other kinds of non-optical, lawful-interception inspection certainly cause some connection drops left and right. And yes, I have seen proprietary commercial firewall code, … ! :-/

Why monolithic kernels are fail

Friday, June 27th, 2014

Yes, there have been flamewars without end, and we know where Minix and Linux stand today with regard to the installed device base, … however, modern Windows NT and Mac OS X are a bit micro-kernel’ish to some degree, …

In the wake of the last twelve months (and counting) of NSA, GCHQ & Co. revelations, let’s look at the processes running on a typical network appliance:

PID Uid VmSize Stat Command
1 root 364 S init
2 root SW [keventd]
3 root RWN [ksoftirqd_CPU0]
4 root SW [kswapd]
5 root SW [bdflush]
6 root SW [kupdated]
8 root SW [mtdblockd]
35 root SWN [jffs2_gcd_mtd2]
87 root 364 S logger -s -p 6 -t
89 root 364 S init
96 root 376 S syslogd -C 16
99 root 348 S klogd
230 root 388 S udhcpc -b -p /var/run/udhcpc.eth0.1.pid -i eth0.1
314 root 388 S /usr/sbin/dropbear -g

Hm, ok, so aside from the minimal logging, the DHCP client, and the dropbear SSH server for administration tasks, we have nothing separated into a user-mode context. All the networking, packet filtering, firewalling, load balancing, WiFi stack and whatnot runs in kernel context.

Yeah, right, exactly the kernel context where a typo, off-by-one, etc. will sooner or later crash (oops) the whole system, or give you a root login.

Would it not be nice if such a typo or bug in the NIC driver, the IPv4 or v6 stack, the firewall, or almost anywhere else would just segfault and restart an associated user-space ipv4d, iptabled, hosted?

With more isolated drivers and sub-systems we should certainly have fewer security issues, and given Linus’ famous performance quotes, I would rather trade a few percent of performance for more security. Besides, nowadays we run most systems virtualized anyway, with even more performance overhead, … for security, management and scalability.
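
The respawn-on-crash model above can be sketched as a tiny user-space supervisor. This is just a toy illustration (the daemon names like ipv4d are hypothetical, and a real microkernel does this through its IPC layer, not fork/exec), but it shows the core idea: a crash in one isolated service kills only that process, and it simply gets restarted.

```python
import os
import time

def supervise(argv, max_restarts=None):
    """Run a service and respawn it whenever it dies by a signal
    (segfault, bus error, ...).  A clean exit stops the supervisor.
    Returns the number of restarts performed."""
    restarts = 0
    while True:
        pid = os.fork()
        if pid == 0:                       # child: become the service
            os.execvp(argv[0], argv)
        _, status = os.waitpid(pid, 0)     # parent: wait for it to exit
        if os.WIFSIGNALED(status):         # crashed: respawn it
            restarts += 1
            if max_restarts is not None and restarts > max_restarts:
                return restarts
            time.sleep(0.1)                # minimal back-off before respawn
        else:
            return restarts                # clean exit: we are done
```

The whole system keeps running; only the faulty sub-system blinks out and comes back, which is exactly what a kernel-context oops cannot give you.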

Microsoft Surface 3

Tuesday, June 24th, 2014

I kind of like some of the innovations of the Surface 3, like the kickstand and the improved keyboard. What I also like is that the keyboard stays cool, while the tablet, with its CPU/GPU behind the display, gets warm there, leaving my typing experience without heated fingers, … Of course the stylus input is nice to have.

One major drawback, however, is that due to the gap between the tablet and the kickstand one cannot really use the Surface on airline seat tray tables, … :-/

However, what is a complete shame at this date and time is the fact that the Surface is absolutely not repairable.

Of course I do not expect to upgrade the CPU or RAM in such laptops. But upgrading the storage (SSD) in a year or two (or when all its write cycles are exhausted), or swapping in a new battery in a similar timeframe, … (each of my classic MacBooks has got a second battery by now, … and in some even the second battery has died since then, …). This is not too much to ask, and even a must considering today’s level of environmental pollution and landfills, …

A tablet / laptop like this can be put to good use, even second-hand, for five if not ten years. A glued construction like this seriously limits these possibilities without any good reason. Besides maximum company profits, … of course.

Kabel Deutschland vs. IPv6

Sunday, June 22nd, 2014

Since these days one no longer really needs a phone line, I have for some time been testing cable (Kabel Deutschland) for internet, for the first time. In principle this 20 MBit/s cable connection worked from the beginning; however, some sites were completely unreliable and slow, in the range of 5 minutes and clicking “Reload” 7 times. Various sites were affected: Slashdot, SemiAccurate, Tagesschau.de, Apple’s iTunes Store, PayPal, etc.

Last December their internal backbone routing was apparently so broken that almost nothing worked, and, for example, my own SSH sessions to my own servers were evidently routed to completely different servers at the other end of the world. (A cynic might think of the NSA and BND here, …)

The fundamental connection problem with some servers was finally solved recently by my asking the support person to enable IPv4 for my line. After the usual “no, we can’t do that” excuses, a friendly support person eventually enabled IPv4 after all, and lo and behold: since then all connection problems are gone. Either their carrier-grade NAT is simply overloaded, or it has other configuration problems, or some websites still do not work fully over native IPv6, … Whatever, … just ask for IPv4 for now, …
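
To tell whether it is the IPv4 path (e.g. the carrier-grade NAT) or the IPv6 path that is broken, it helps to force one address family at a time. This is not the tooling I used back then, just a quick hedged sketch of the idea in Python:

```python
import socket

def connect_via(host, port, family=socket.AF_UNSPEC, timeout=3.0):
    """Try to reach host:port, optionally forcing IPv4 (socket.AF_INET)
    or IPv6 (socket.AF_INET6).  Returns the address that worked, or None."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return None                      # no address of that family at all
    for af, socktype, proto, _, addr in infos:
        s = socket.socket(af, socktype, proto)
        s.settimeout(timeout)
        try:
            s.connect(addr)
            return addr[0]               # this path works
        except OSError:
            continue                     # try the next address
        finally:
            s.close()
    return None
```

If `connect_via(site, 443, socket.AF_INET)` succeeds while the `AF_INET6` variant times out (or vice versa), you know which side of the dual stack to complain to your provider about.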

The new entry-level iMac14,4 and ARM transition

Thursday, June 19th, 2014

At first glance, the “new” entry-level (aka low-cost) 21″ iMac with its 1.4 GHz ULV CPU looks mighty underpowered for a desktop-class machine in 2014.

At second glance, I wonder if this is a test run by Apple to see how many people complain about such a low-performance machine, ahead of a potential transition to Apple’s own ARM A7 CPUs. Currently this 1.4 GHz (base clock) iMac’s peak (turbo boost) performance is just twice as fast as an iPad Air. Apple already called the A7 “desktop class performance”, and the next Apple A8 ARM64 CPUs at a higher clock will certainly close this gap further, …

Update: I personally do not believe in a short-term success of possible ARM Macs. The initial performance will be far below current high-end Intel chips, a gap Apple will only be able to close in half a decade or so, if Intel does not simply keep its lead forever, … The far bigger problem is that Apple would lose all the customers who buy Macs and still run Windows (or Linux) on them, natively or in a VM. Either intentionally, due to build quality and design, or accidentally, getting their first Mac due to advertising and later finding out their office warez do not work on the Mac, and thus sticking with Windows for the time being. I would estimate this to be too large a portion of Mac sales to be easily lost in such a transition.

Translucency in iOS and Mac OS X 10.10

Tuesday, June 3rd, 2014

When Apple introduced the massively blurred translucency in iOS 7, I was already skeptical whether that vast amount of number crunching is really worth anything. Wasted on some background effect, just making everything less snappy and draining battery life.

Now Apple extends this blurry transparency to Mac OS X 10.10, Yosemite (a really bad name to pronounce internationally, btw.), and also lets apps like iPhoto scroll the view content, blurred, under the window’s title and tool bars.

I recently got a pretty fast 15″ rMBP with an Nvidia GPU to drive my 4k display at work. Of course OS X, 10.9 and such, is super snappy on it. Now guess what? The current Mac OS X 10.10 beta, with these blurred, transparent windows, makes them all a bit sluggish to drag around the screen. This flat UI should normally be a snap for the GPU to draw: no gradients to compute, no bitmaps to blit, … just solid fills. But no, let’s just waste all the GPU power while at it :-/!

Planned obsolescence at its best. For nothing. Well, except blurry background content.

You can go ahead and google Gaussian blur, and do the calculation of the operations required for it. Not to forget the massive radius Apple must be using for this, …

There would be some optimization possibilities, though, such as not using every pixel, but only every 4th or 8th, for this blurry madness, …
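
For a rough back-of-envelope calculation (the blur radius and the idea of blurring a quarter-scale copy are my guesses, not Apple’s actual implementation), here is the multiply-accumulate count for a naive 2D Gaussian versus a separable one on a 15″ Retina panel:

```python
def blur_ops(width, height, radius, separable=True, subsample=1):
    """Rough multiply-accumulate count for a Gaussian blur.

    A naive 2D kernel touches (2r+1)^2 taps per pixel; a separable
    implementation does two 1D passes of (2r+1) taps each.  `subsample`
    models blurring a 1/n-scale copy of the background instead.
    """
    w, h = width // subsample, height // subsample
    taps = 2 * radius + 1
    per_pixel = 2 * taps if separable else taps * taps
    return w * h * per_pixel

# 2880x1800 Retina panel, a guessed radius of 30 px:
naive   = blur_ops(2880, 1800, 30, separable=False)  # ~1.9e10 MACs per frame
sep     = blur_ops(2880, 1800, 30)                   # ~6.3e8 MACs per frame
quarter = blur_ops(2880, 1800, 30, subsample=4)      # ~4.0e7 MACs per frame
```

Even the separable version is hundreds of millions of operations per blurred frame, which is why the every-4th-pixel trick pays off so dramatically; and all of that just for a background effect.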