CHAPTER I
A SHIFTING REEF
<span style='font-size:13.5pt;color:black'>The year 1866 was signalised by a
remarkable incident, a mysterious and puzzling phenomenon, which doubtless no
one has yet forgotten. Not to mention rumours which agitated the maritime
population and excited the public mind, even in the interior of continents,
seafaring men were particularly excited. Merchants, common sailors, captains of
vessels, skippers, both of Europe and America, naval officers of all countries,
and the Governments of several States on the two continents, were deeply
interested in the matter.</span>
<span style='font-size:13.5pt;color:black'>For some time past vessels had been met by
“an enormous thing,” a long object, spindle-shaped, occasionally
phosphorescent, and infinitely larger and more rapid in its movements than a
whale.</span>
Notice that the headings and paragraphs all have margins and other styles in the style attribute, and that other styles, like font size and color, have been added to the span tags. Here is what that same text would look like when it is cleaned up:
PART ONE
CHAPTER I
A SHIFTING REEF
The year 1866 was signalised by a remarkable incident, a mysterious and puzzling phenomenon, which doubtless no one has yet forgotten. Not to mention rumours which agitated the maritime population and excited the public mind, even in the interior of continents, seafaring men were particularly excited. Merchants, common sailors, captains of vessels, skippers, both of Europe and America, naval officers of all countries, and the Governments of several States on the two continents, were deeply interested in the matter.
For some time past vessels had been met by “an enormous thing,” a long object, spindle-shaped, occasionally phosphorescent, and infinitely larger and more rapid in its movements than a whale.
As you can tell, this code is much cleaner and easier to understand. The formatting has been trimmed down, and all the extraneous styles and tags have been removed.
PDF HTML
HTML exported from an Adobe PDF file is even more bloated and messy than the HTML Word produces. I took the same Word document we used above, created a PDF from it using Adobe Acrobat, and exported it as HTML from Acrobat. Here is what it gave me:
>PART ONE
>
>
>
>CHAPTER I
>
>A SHIFTING REEF
>
>
>
>The year 1866 wa
>
>s signalised by a remarkable incident, a mysterious and puzzling phenomenon, which doubtless no one has yet forgotten. Not to mention rumours which agitated the maritime population and excited the public mind, even in the interior of continents, seafaring men were particularly excited. Merchants, common sailors, captains of vessels, skippers, both of Europe and America, naval officers of all countries, and the Governments of several States on the two continents, were deeply interested in the matter.
>
>
>
>For some time past vessels had been met by “an enormous thing,†a long object, spindle-shaped, occasionally phosphorescent, and infinitely larger and more rapid in its movements than a whale.
>
>
>
There are a lot of differences between this output and the Word output above. First, there is a lot more code added to the file. There are more tags, more attributes, and even some added ids. Second, the line breaks are added inside the tags themselves and at odd places. Third, the curly quotes (“ and ”), which were fine in the Word document, came over as garbled text from the PDF (“ and â€).
These differences can cause many problems as you try to clean up the code and make it more useable. This is why I suggest that you convert the PDF to Word before converting to HTML. When you go that route, the code may look something like this:
style='font-size:20.0pt;color:black'>PART ONE
style='font-size:24.0pt;color:black'>
CHAPTER
I
A SHIFTING REEF
The year 1866 was signalised by a remarkable incident, a
mysterious and puzzling phenomenon, which doubtless no one has yet forgotten.
Not to mention rumours which agitated the maritime population and excited the
public mind, even in the interior of continents, seafaring men were
particularly excited. Merchants, common sailors, captains of vessels, skippers,
both of Europe and America, naval officers of all countries, and the
Governments of several States on the two continents, were deeply interested in
the matter.
style='font-size:13.5pt'>For some time past vessels had been met by “an
enormous thing,” a long object, spindle-shaped, occasionally phosphorescent,
and infinitely larger and more rapid in its movements than a whale.
style='font-size:11.5pt'>
This HTML is not exactly like the code we got from Word directly, but it is certainly cleaner than the HTML we got from the PDF.
Mobipocket HTML
Mobipocket Creator does a better job of creating clean HTML, but it also has some issues of which you should be aware.
PART ONE
CHAPTER I
A SHIFTING REEF
The year 1866 was signalised by a remarkable incident, a mysterious and
puzzling phenomenon, which doubtless no one has yet forgotten. Not to mention
rumours which agitated the maritime population and excited the public mind, even in
the interior of
continents, seafaring men were particularly excited. Merchants,
common sailors, captains of vessels, skippers, both of Europe and America, naval
officers of all countries, and the Governments of several States on the two continents,
were deeply interested in the matter.
For some time past vessels had been met by “an enormous thing,” a long object,
spindle-shaped, occasionally phosphorescent, and infinitely larger and more rapid in
its movements than a whale.
Notice that the bloat is all gone, but the heading is not in a heading tag and there are some other issues that will make formatting a bit harder to do. Overall, though, the code could be much easier to work with.
Joining Paragraph Lines
One thing you may have noticed in the above examples, and which you will see in your own file after you convert it into HTML, is that there are line breaks added throughout the file. These line breaks are not a problem for HTML since it will only start a new paragraph when you have a <p> tag; however, they do make editing the file more difficult, especially if you are using regular expressions and making a lot of changes to your file.
The easiest way to remove these line breaks is to create a Perl script that will do the work for you, and run it on your file. Here is a simple script that will work well for that purpose:
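#!/usr/bin/perl
use strict;
use warnings;

my $book;
my $in  = "MyBook.html";
my $out = "MyBook.linebreaksremoved.html";

# Read the whole file into $book in one go.
{
    open my $fh, '<', $in or die "Cannot open $in: $!";
    local $/;
    $book = <$fh>;
    close $fh;
}

# Within the body only: join the lines inside each paragraph, heading,
# table cell, and list item, squeeze repeated whitespace down to one
# space, then collapse runs of blank lines.
$book =~ s{(<body.*?</body>)}{
    my $body = $1;
    $body =~ s{<(p|h[1-6]|td|li|dt|dd)\b.*?</\1>}{
        my $all = $&;
        $all =~ s/\n/ /g;
        $all =~ s/\s\s+/ /g;
        $all;
    }gesi;
    while ($body =~ s{\n\n}{\n}g) {}
    $body;
}esi;

open my $ofh, '>', $out or die "Cannot write $out: $!";
print $ofh $book;
close $ofh;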
This script is also available in the Book Tools section on my website.
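For readers who prefer Python, the same idea can be sketched as follows. This is only a rough equivalent under my own assumptions, not the script from the Book Tools section: the function name `join_lines` and the sample text are made up for illustration, and the sketch treats the same set of tags (p, h1 to h6, td, li, dt, dd) as the places where lines should be joined.

```python
import re

# Block-level elements whose inner line breaks should be joined.
BLOCK = re.compile(r"<(p|h[1-6]|td|li|dt|dd)\b.*?</\1>", re.IGNORECASE | re.DOTALL)


def join_lines(html: str) -> str:
    """Collapse line breaks inside block-level elements of the <body>."""

    def fix_block(match: re.Match) -> str:
        # Turn newlines into spaces, then squeeze runs of whitespace.
        text = match.group(0).replace("\n", " ")
        return re.sub(r"\s\s+", " ", text)

    def fix_body(match: re.Match) -> str:
        body = BLOCK.sub(fix_block, match.group(0))
        # Collapse runs of blank lines left between elements.
        return re.sub(r"\n{2,}", "\n", body)

    return re.sub(r"<body.*?</body>", fix_body, html,
                  flags=re.IGNORECASE | re.DOTALL)


sample = (
    "<html><body>\n"
    "<p>The year 1866 was\nsignalised by a\nremarkable incident.</p>\n\n\n"
    "<p>For some\ntime past.</p>\n"
    "</body></html>"
)
cleaned = join_lines(sample)
print(cleaned)
```

As with the Perl version, anything outside the body element is left untouched, so the head of the file keeps its original line breaks.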
Removing Extraneous Styles and Tags
The next step in cleaning up your document is to remove the unneeded styles and tags that were inserted by Word or Acrobat. Which styles and tags you remove will be completely up to you, but I highly suggest that you strip the HTML down to its most basic tags. Doing so will remedy most display problems and make the book consistent throughout.
As you are stripping out extra tags and styles, you will want to replace them with tags and formatting that work well in the Kindle. For instance, if all of your chapter headings look like this:
you will want to turn them into actual heading tags in the HTML file, like this:
Chapter 1
If you only turn that into a regular paragraph (
The majority of tags and styles present in your file will actually be helpful in your efforts to convert the file to clean, Kindle-ready HTML. You can use unneeded styles like margins to help you give headings the right spacing, or to find places where your file has a blank line between paragraphs to show a scene change. The difficulty is that there are most likely also margins in your file that are really not needed. Discerning what to use and what to remove will require some investigation.
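When I am cleaning up a file I usually start with the easy pickings, like the regular paragraphs. In most books the paragraphs just need to be formatted as a <p> tag, but you will probably see something more like one of these examples in your HTML:
Notice the variety in formatting. All of that is due to the settings used by the authors when they were formatting their books in Word. In most books, changing these to <p> tags will make the book code much more manageable. Be careful to ensure that the tags you replace are actually the regular paragraphs, not a specially styled paragraph, a poem, or something else. You will want to handle those individually.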
Next, it is usually best to attack the chapter headings, and any subheadings your book may have. Just as with paragraphs, you may find a variety of styles applied to headings. The main difference is that headings will probably not be formatted as consistently as the paragraphs, which makes them harder to replace in bulk. You may find that searching in the HTML file for “Chapter” is the easiest way to find them all. You may also notice a pattern in the font size formatting for the various headings, such as all top-level (chapter) headings being formatted in “font-size:20.0pt;” and all the second-level subheadings being formatted in “font-size:16.0pt;”. The key, as in all of the cleanup process, is to look for patterns and put them to good use.
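One way to spot such patterns is to tally the font sizes that appear in the file. The sketch below is only an illustration: the helper name `font_size_counts` and the sample markup are my own, and it assumes the sizes are written in points inside style attributes, as in Word's output.

```python
import re
from collections import Counter


def font_size_counts(html: str) -> Counter:
    """Count how often each font-size value (in pt) appears in the markup."""
    return Counter(re.findall(r"font-size:\s*([0-9.]+pt)", html))


sample = (
    "<p><span style='font-size:20.0pt'>Chapter 1</span></p>"
    "<p><span style='font-size:13.5pt'>Text</span></p>"
    "<p><span style='font-size:13.5pt'>More text</span></p>"
    "<p><span style='font-size:16.0pt'>A Subheading</span></p>"
)
counts = font_size_counts(sample)
for size, n in counts.most_common():
    print(size, n)
```

The most common size is usually the body text, and the rarer, larger sizes are good candidates for headings.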
In that vein, let’s work out a RegEx that might come in handy with your headings. Say you have a chapter heading like this one:
but when you look at Chapter 2 you see that it is slightly different, with a top margin of .50in. To catch both of these in one fell swoop, you will want to create a RegEx that ignores the top margin and bases its search on something else that you know is standardized, like the font-size. Here is an example of what that could look like:
Find: <p[^>]*><span[^>]+font-size:20.0pt[^>]*>(Chapter [^<]+)</span></p>
Replace: <h2>\1</h2>
Of course, there are other RegExes you can use in a situation like this, but that should give you the general idea.
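A find-and-replace of this kind can be tried out in Python before running it on the whole book. The sketch below rests on several assumptions of mine: that each heading is a paragraph tag wrapping a span tag, that the font size is 20.0pt, that the first heading's top margin is .25in, and that chapter headings should become h2 tags. Adjust all of these to match your own file.

```python
import re

# Ignore whatever is in the <p> tag (including the margin) and key on the
# span's font size instead. The dot in 20.0pt is escaped to match literally.
HEADING = re.compile(
    r"<p[^>]*><span[^>]+font-size:20\.0pt[^>]*>(Chapter [^<]+)</span></p>"
)


def convert_headings(html: str) -> str:
    """Replace matching styled paragraphs with plain heading tags."""
    return HEADING.sub(r"<h2>\1</h2>", html)


sample = (
    "<p style='margin-top:.25in'><span style='font-size:20.0pt'>Chapter 1</span></p>\n"
    "<p style='margin-top:.50in'><span style='font-size:20.0pt'>Chapter 2</span></p>\n"
    "<p><span style='font-size:13.5pt'>Chapter and verse.</span></p>"
)
converted = convert_headings(sample)
print(converted)
```

Both headings are caught despite their different margins, while the regular paragraph that happens to begin with the word "Chapter" is left alone because its font size does not match.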
The next step I usually take is to get rid of all the span tags, since they are the worst bloat-creators in program-generated HTML. You will want to search for “
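Removing the span tags can also be sketched in Python. This is a minimal illustration with a hypothetical helper name, `strip_spans`, and it assumes that none of the spans carry formatting you want to keep. Check for spans that apply italics or bold before running something like this.

```python
import re

# Matches opening span tags (with any attributes) and closing span tags.
SPAN_TAG = re.compile(r"</?span\b[^>]*>", re.IGNORECASE)


def strip_spans(html: str) -> str:
    """Remove span tags while keeping the text they wrap."""
    return SPAN_TAG.sub("", html)


sample = "<p><span style='font-size:13.5pt;color:black'>For some time past</span></p>"
stripped = strip_spans(sample)
print(stripped)
```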
When you have finished those three pieces of your process, you have probably handled the majority of the basic cleanup your file needs. Now it is time to learn about the formatting that the Kindle supports and how to make your book look great on the device.
Chapter 5
Formatting Your Book
While the Kindle format is essentially HTML, the device only supports a small portion of the tags and styles that are supported in most Web browsers and other HTML viewers. That actually works out well for you as an author or publisher, since it removes some complexity from the formatting process.
In this chapter I will cover the HTML tags and styles that work in the Kindle. I have also included a list of supported tags and styles in Appendix A, and there is a printable copy of the same information in the Book Tools section of my website.
Font Formatting
To start out, let’s take a look at some of the basic text formatting tools you have at your disposal.
Bold and Italics
To make text bold in your book, you will need to apply the <b> tag, and to italicize text in your book, you will need to apply the <i> tag. For example:
I entered, and found Captain Nemo deep in algebraical calculations of x and other quantities.
You can also apply bold to any tag in your style sheet using the font-weight: bold; property, and italics using the font-style: italic; property.
The <em> tag and <strong> tag are often thought of as replacements for the <i> and <b> tags. These tags are intended for use in specific situations when the text being marked up requires emphasis or strong emphasis. Like most browsers, the Kindle will format <em> as italics and <strong> as bold.
Underline
To underline text in the Kindle, use the <u> tag.
Henry, O. The Four Million. New York: McClure, Phillips & Co., 1906.
You can also apply an underline style to any tag in your style sheet using the text-decoration: underline; property.
Big and Small
There are times when making some text bigger or smaller than the default size is necessary. While the Kindle does allow a small amount of tweaking with the CSS font-size property, the easiest and most consistent way to adjust font sizes in your text is by using the <big> and <small> tags. These tags can also be nested to enhance the effect.
Three examples of the use of <big> and <small> come to mind. The first is using the <big> tag to create a drop cap of sorts. Since the Kindle does not allow floating elements, the large letter will not actually “drop,” but the overall effect is similar. For example:
<big>T</big>here were two or three things...
The second example is using the <small> tag on a copyright page. I do this by default in most of my books because it more closely matches most hardcopies.
<small>The Four Million, copyright © 1906 by O. Henry.</small>
The third example is using the <small> tag to create the impression of small caps. The default font of the Kindle does not, unfortunately, allow the use of small caps, but to give the same effect just put <small> tags around the small caps text, like this:
WILLIAM SYDNEY PORTER
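If a book has many names or phrases to set this way, the markup can be generated rather than typed by hand. The sketch below is one possible approach, not a required method: the helper name `small_caps` is hypothetical, and it assumes that the first letter of each word stays at full size while the remaining letters are uppercased and wrapped in <small> tags.

```python
import re


def small_caps(text: str) -> str:
    """Uppercase each word and wrap all but its first letter in <small>."""

    def wrap(match: re.Match) -> str:
        word = match.group(0).upper()
        if len(word) == 1:
            return word
        return f"{word[0]}<small>{word[1:]}</small>"

    return re.sub(r"[A-Za-z]+", wrap, text)


result = small_caps("William Sydney Porter")
print(result)
```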
Superscript and Subscript