Tuesday, March 12, 2013

Creating PDFs from EBrary Reader

It turns out that Brown has a subscription to at least part of the EBrary archive of online books. I found this out while looking for a book in our library's online catalog: the book turned out to be available online. Very cool!

Well, kind of cool, until I found out what this involved. By default, you basically have to read the document online. You can download the "EBrary Reader", which is a Java application, and read documents using it. But it's kind of clunky, to say the least. What I wanted was a PDF that I could then use as I wanted. How to get one?

I noticed that the EBrary Reader would allow me to print, so I thought maybe I could print the file as a PDF, since the default print dialog for Fedora lets you do that. Unfortunately, the Java application was not using the system dialog, but a Java dialog, so that didn't work.

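A little googling led me to the cups-pdf package, which installs a system-wide PDF printer for the Common Unix Printing System. A quick "sudo yum install cups-pdf" was enough to give me access to that. Since the new printer shows up like any other CUPS printer, it was available in the Java print dialog too, and printing the document to it gave me a PDF file.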

The next step was to convert this file to DjVu, which tends to be much smaller than the corresponding PDF. I've done this a million times before, so I figured it would be pretty easy. Unfortunately, it was not.

The first step was to run the pdfimages command (from the poppler-utils package):
pdfimages -p file.pdf p
to extract the page images. Imagine my surprise when I got 420 images from a 20-page paper! It turned out that each page was constructed from 21 different images, stacked on top of each other. (To help with download times?)
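
A quick way to confirm the images-per-page count, rather than eyeballing a directory listing, might look like the sketch below. It assumes that pdfimages -p named the files p-PPP-NNN.ppm (page number, then image number), which is what the file names in this post suggest; the function name count_images_per_page is made up for illustration.

```shell
# Print "<page> <count>" for every page that has extracted images.
# Assumes file names of the form p-<page>-<image>.ppm, e.g. p-001-000.ppm.
count_images_per_page() {
  dir="${1:-.}"
  for f in "$dir"/p-*-*.ppm; do
    [ -e "$f" ] || continue   # the glob matched nothing
    name="${f##*/}"           # strip the directory part
    page="${name#p-}"         # drop the "p-" prefix
    page="${page%%-*}"        # keep only the page number
    printf '%s\n' "$page"
  done | sort | uniq -c | awk '{print $2, $1}'
}
```

Running count_images_per_page in the directory holding the .ppm files should print one line per page. If every line ends in the same number, that number is the one to use for the tile height when stacking.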

Fortunately, I've had enough experience with ImageMagick to know this was a problem that could be solved. It took a little more googling, and a little experimentation, but eventually I found out that:
for i in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20; do
  montage p-0${i}*.ppm -geometry +0+0 -background none -tile 1x21 page-$i.tiff
done
would stack all the images back on top of each other.
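
For a document with a different number of pages, or a different number of slices per page, the same idea can be written without hardcoding either number. This is only a sketch under the same naming assumption (p-PPP-NNN.ppm files from pdfimages -p); list_pages, stack_pages and the MONTAGE override variable are names invented here, not part of ImageMagick.

```shell
# List the distinct page numbers found in p-<page>-<image>.ppm file names.
list_pages() {
  for f in "${1:-.}"/p-*-*.ppm; do
    [ -e "$f" ] || continue
    name="${f##*/}"
    page="${name#p-}"
    printf '%s\n' "${page%%-*}"
  done | sort -u
}

# Stack each page's slices into one column, one TIFF per page.
# Set MONTAGE=echo to print the commands instead of running them.
stack_pages() {
  dir="${1:-.}"
  for page in $(list_pages "$dir"); do
    set -- "$dir"/p-"$page"-*.ppm   # this page's slices, in glob order
    ${MONTAGE:-montage} "$@" -geometry +0+0 -background none \
      -tile "1x$#" "$dir/page-$page.tiff"
  done
}
```

Trying MONTAGE=echo stack_pages . first shows exactly which files would go into which page before any images are written. The slices are stacked in glob order, so this relies on the image numbers being zero-padded.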

So now I had 20 page images, all as TIFFs, and those could then be fed to ScanTailor for processing on the way to creating a DjVu.
