Fixing Fouled up e-Books

Whilst there are plenty of legal "free" e-books out there, the downside is that much of the time they are glitchy and have egregious formatting and layout issues. You can fix them with these tools.

Oct 20, 2024


As I have mentioned many times, I have been a long-time, satisfied user of my reader and ebooks. Certainly better than hauling around a lot of dead trees when I travel.

All good. I have been building a collection for more than 5 years now from a variety of sources: many commercial, but also many of the free sources (Project Gutenberg), as well as some other sources for out-of-print books that are, ahem, less than legit.

Most of the commercial options are DRM encumbered, so that I can’t peek inside with impunity. But all the others are open books, so to speak, mostly ePub format. There are some great tools to work with.

SIGIL – A WYSIWYG EPUB EDITOR

Sigil is free, open source, and pretty solid. It will help you put together a book, and fix minor errors.

It is a good place to start to figure out the ePub format.

ePubs are pretty straightforward HTML with some special attributes. You can do just about anything that you can put on a web page (within reason: no JavaScript or animations).

But you can tweak up the look and feel of the book with stylesheets, inserted graphical elements, and all the other tricks that you can use with web pages.
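If you want to peek inside a book without opening Sigil, it helps to know that an ePub file is an ordinary ZIP archive holding the HTML, stylesheets, and images. Here is a minimal sketch in Python using only the standard library. The function names are made up for illustration, and the list of file extensions is an assumption about how the chapters are named.

```python
import zipfile


def list_epub_contents(epub_path):
    """Return the names of all files stored inside an ePub (a ZIP archive)."""
    with zipfile.ZipFile(epub_path) as book:
        return book.namelist()


def chapter_files(epub_path):
    """Return only the HTML/XHTML files, which hold the book's text.

    Assumes the text files use one of these common extensions.
    """
    return [
        name
        for name in list_epub_contents(epub_path)
        if name.lower().endswith((".xhtml", ".html", ".htm"))
    ]
```

Calling chapter_files on one of your own books is a quick way to see whether it was built with one file per chapter or as one giant blob.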

CALIBRE – AN OPEN SOURCE LIBRARY MANAGER

Of course, your reader probably comes with software to manage its files, but you will find that it is pretty limited. Perhaps you have some old files in one of the dead or dying formats (.lit, .lrf, BBeB, etc.). Additionally, there are a lot of eBooks in plain text or Microsoft Word format.

It is helpful to be able to shift formats, and to clean up some of the glitches.

Enter Calibre: an open source, multi-platform (Mac, Windows, Linux) environment for managing your library. It groks all the standard formats, and converts between them seamlessly. It is extensible with plugins, and it can help you clean up books as well as transcode them. Additionally, it connects with several sources to get covers, metadata, and other extras to improve the user experience.

It can be used to take HTML files or word processing files (RTF or .DOCX) and turn them into eBooks in any format.
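Calibre also ships a command-line tool, ebook-convert, that takes an input file and an output file and picks the formats from the file extensions. A minimal sketch of driving it from Python, assuming Calibre's command-line tools are installed and on your PATH. The function names and file names are hypothetical, and no conversion options are passed, so you get Calibre's defaults.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source, target_format="epub"):
    """Build the ebook-convert argument list for one input file.

    The output file sits next to the source, with the new extension.
    """
    source = Path(source)
    target = source.with_suffix("." + target_format.lstrip("."))
    return ["ebook-convert", str(source), str(target)]


def convert(source, target_format="epub"):
    """Run the conversion, if ebook-convert can be found on PATH."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH")
    subprocess.run(build_convert_command(source, target_format), check=True)
```

For example, convert("old_scan.docx") would write old_scan.epub beside the original.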

Being a powerful package, to get the most out of it, you really need to understand what it is doing, and how to optimize the settings. By default it does an OK job, but as in many cases, garbage in equals garbage out.

SOME ISSUES

Why is this a problem? Well, it is because a lot of the free or community books are poorly formatted to begin with. Also, some sources in general suck. Often, I will find an out-of-print book that was scanned and OCR’d, and then turned into an MS Word file. Until recently, you needed to save that file as an HTML file and run it through Calibre.

Calibre uses some pretty heavy stylesheets that mostly look OK. The ambitious person can customize them easily, if they know what they are doing. Of course, not every reader can handle all stylesheet formats, so it can be a trial and error process.

Of course, there are some things that really foul up any book. Anything output by Microsoft Word uses a class structure that is insane. If you see class=”msonormalxx”, you know that you are going to have an ugly book.
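A quick way to see how bad a file is, and to take the worst of it out, is to look for class attributes whose names start with "Mso" (such as MsoNormal), which is the prefix Word uses for its built-in styles. The sketch below is a crude regular-expression pass, not a real HTML parser, and the function names are invented for illustration. It assumes the attributes use straight quotes.

```python
import re

# Matches a class attribute whose value starts with "Mso", e.g. class="MsoNormal".
# Accepts straight single or double quotes and ignores case.
MSO_CLASS = re.compile(r'\s+class\s*=\s*(["\'])mso[^"\']*\1', re.IGNORECASE)


def count_word_classes(html):
    """Count Word-style class attributes; a rough measure of the mess."""
    return sum(1 for _ in MSO_CLASS.finditer(html))


def strip_word_classes(html):
    """Remove Word-style class attributes, leaving the tags themselves alone."""
    return MSO_CLASS.sub("", html)
```

Other class attributes are left untouched, so any styling you added yourself survives.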

RTF files are not much better. They typically have a lot fewer funky classes tossed in, but the conversion does glitch in some spectacular ways.

EPUB VERSUS OTHER FORMATS

I have a pretty large collection of books in the Microsoft ebook format (.lit) and the old Sony reader format (.lrf) that I convert to read. Both these formats can be problematic.

The Sony format leads to ePubs with some really wacky XHTML coding in them. Really ugly to try to clean up. Additionally, they have odd chapter breaks, and pretty non-functional tables of contents.

Fortunately, it isn’t too difficult to clean them up, but it is time consuming. You need a few tools.

An HTML stripper. There are several options, but I use a simple app for my Mac, HTML Stripper, a reasonably priced utility. There are some free ones, but I like to support small vendors, and $15 is a good price for this tool.
A Markdown editor. The HTML stripper will give you good plain text, but you will need to reformat that into clean HTML. Fortunately, Markdown is a fabulous way to do this. I use Mou for the Mac (free, but do donate to them), and MarkdownPad on my PC. Again free, but the pro version has some nice extensions, so it might be worth spending the $15 to buy it (I have).

THE CLEAN UP WORKFLOW

First I extract the raw HTML. I do this chapter by chapter. It is best to create an ePub with one source file per chapter. That makes for clean chapter breaks, and a well functioning table of contents.

Then I run it through my HTML stripper. That gives me a clean text file. It will likely have odd numbers of breaks in paragraphs, and some other interesting things. Fortunately that doesn’t matter.
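If you don't have a stripper app handy, the same step can be roughed out with Python's standard library. This is only a sketch of the idea, not how any particular app works. The class and function names are made up, and the choice of which tags count as paragraph boundaries is my assumption. It also collapses the odd runs of blank lines into exactly one blank line between paragraphs.

```python
import re
from html.parser import HTMLParser

# Tags treated as paragraph boundaries (an assumption; adjust to taste).
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr"}


class TextExtractor(HTMLParser):
    """Collects text content, dropping tags plus anything in script/style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(html):
    """Return plain text with one blank line between paragraphs."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Split on blank lines, then squeeze whitespace inside each paragraph.
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
    return "\n\n".join(p for p in paragraphs if p)
```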

I then import that text into my Markdown editor. Add a chapter title as an h1, and then you have a nice complete chapter to drop back into the ePub. (Every Markdown editor has a “copy to HTML” function. Works great.)
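For chapters that are nothing but a title and plain paragraphs, the Markdown round trip boils down to wrapping each paragraph in tags. Here is a small stand-in written in Python. It is not a Markdown parser and handles no emphasis, lists, or links. The function name is hypothetical, and it assumes paragraphs are separated by blank lines.

```python
from html import escape


def chapter_to_html(title, text):
    """Wrap a chapter title and blank-line-separated paragraphs in HTML tags."""
    lines = ["<h1>" + escape(title, quote=False) + "</h1>"]
    for block in text.split("\n\n"):
        # Squeeze stray line breaks and runs of spaces inside the paragraph.
        paragraph = " ".join(block.split())
        if paragraph:
            lines.append("<p>" + escape(paragraph, quote=False) + "</p>")
    return "\n".join(lines)
```

The escape call matters: a stray ampersand or angle bracket in the text would otherwise produce a chapter that fails validation.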

Lastly, I build a new ePub using Sigil. Add metadata, a cover, and construct a table of contents, and you have a nice book.

BUT WHAT IF YOU WANT TO READ IT ON YOUR KINDLE?

Of course, the Amazon Kindle doesn’t support the ePub format. So you need to convert it into either an .AZW3 or a .mobi format file.

Calibre to the rescue again. Trivial, and the defaults are pretty good for conversion.

And naturally, you use Calibre to transfer or manage your library on the Kindle (this is only for files you didn’t buy from Amazon). Works like a charm.

CODA

I got into cleaning up ebooks because of my collection of old Doc Savage books. Circa 2008 I found a repository of them in Sony format (I had a PRS 700 reader then), and the 181 original Doc Savage stories were a joy to read.

But they converted poorly into ePub. When I lost my PRS 700, and replaced it with the PRS 600, the support for .lrf files was removed. My only option was to convert them. Calibre converted them, but it did a lousy job.

The last few days, I have been using the workflow above to clean some of these books. It takes me about 35 minutes to create a crisp, clean, and standards-compliant ePub from a completely ugly converted ePub.

A labor of love.

Sweaty Spice
