by Matt Garrish
Here’s the entry for our primary chapter heading, for example:
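(A sketch of such an entry follows; the par id, audio filename, and exact clip offsets are illustrative.)

<par id="heading1">
    <text src="chapter_001.xhtml#c01h01"/>
    <audio src="audio/chapter_001.mp3"
           clipBegin="0:00:24.500"
           clipEnd="0:00:29.268"/>
</par>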
The text element contains an src attribute that identifies the filename of the content document to synchronize with, and a fragment identifier (the value after the # character) that indicates the unique identifier of a particular element within that content document. In this case, we’re indicating that chapter_001.xhtml needs to be loaded and the element with the id c01h01 displayed (the h1 in our sample content, as expected).
The audio element likewise identifies the source file containing the audio narration in its src attribute, and defines the starting and ending offsets within it using the clipBegin and clipEnd attributes. As indicated by these attributes, the narration of the heading text begins around the 24.5 second mark (to skip past the preliminary announcements) and ends just after the 29 second mark. The milliseconds on the end of the start and end values give an idea of the level of precision needed to create overlays, and why people typically don’t mark them up by hand. If you are only as precise as a second, the reading system may move readers to the new prose at the wrong time, or start the narration in the middle of a word or at the wrong word.
But those concerns aside, that’s all there is to basic text and audio synchronization. So, as you can now see, no reading system witchcraft was required to synchronize the text document with its audio track! Instead, the audio playback is controlled by timestamps that precisely determine how an audio recording is mapped to the text structure. Whether synchronizing down to the word or moving through by paragraph, this process doesn’t change.
To synchronize the first three words “Call me Ishmael” in the first paragraph, for example, we simply repeat the process of matching element ids and audio offsets:
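(The word ids, audio filename, and clip offsets in this sketch are illustrative.)

<par id="word1">
    <text src="chapter_001.xhtml#c01w001"/>
    <audio src="audio/chapter_001.mp3"
           clipBegin="0:00:29.268"
           clipEnd="0:00:29.441"/>
</par>
<par id="word2">
    <text src="chapter_001.xhtml#c01w002"/>
    <audio src="audio/chapter_001.mp3"
           clipBegin="0:00:29.441"
           clipEnd="0:00:29.640"/>
</par>
<par id="word3">
    <text src="chapter_001.xhtml#c01w003"/>
    <audio src="audio/chapter_001.mp3"
           clipBegin="0:00:29.640"
           clipEnd="0:00:30.001"/>
</par>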
You’ll notice each clipEnd matches the next element’s clipBegin here because we have a single continuous playback track. Finding each of these synchronization points manually is not so easy, though, as you might imagine.
Synchronizing to the sentence level, however, means only one synchronization point is required for all the words the sentence contains, reducing the time and complexity of the process by several orders of magnitude. The par is otherwise constructed exactly like the previous example:
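(Again, the sentence id and clip offsets are illustrative.)

<par id="sentence1">
    <text src="chapter_001.xhtml#c01s001"/>
    <audio src="audio/chapter_001.mp3"
           clipBegin="0:00:29.268"
           clipEnd="0:00:32.123"/>
</par>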
As is no doubt becoming clear, the process of creating overlays is complicated primarily by the number of time and text synchronizations involved. Moving up another level, paragraph-level synchronization reduces the process by several more orders of magnitude, as all the sentences can be skipped. Here’s the single entry we’d have to make for the entire 28-second paragraph:
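(The paragraph id and clip offsets are illustrative; the clip spans roughly 28 seconds.)

<par id="paragraph1">
    <text src="chapter_001.xhtml#c01p001"/>
    <audio src="audio/chapter_001.mp3"
           clipBegin="0:00:29.268"
           clipEnd="0:00:57.416"/>
</par>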
The complexity isn’t limited only to the number of entries and finding the audio points, however; otherwise technology would easily overcome the problem. Narrating at a heading, paragraph, or even sentence level can be done relatively easily with trained narrators, as each of these structures provides a natural pause point for the person reading, a luxury not available when performing word-level synchronization.
In a real-world recording scenario, for example, the narrator would typically load their ebook into the recording application and synchronize the text as they narrate, which speeds up the process immensely (e.g., pressing the forward arrow or spacebar each time they start a new paragraph so the recording program automatically sets the new synchronization point). Performing the synchronization at the natural pause points is not problematic in this scenario, as the person reading is briefly not focused on that task and/or the person assisting has enough of a break to cleanly resynchronize. Trying to narrate and synchronize at the word level, however, is a tricky process to perform effectively, as people naturally talk more fluidly than any such process can keep up with, even if two people are involved.
Note
The real-world experience I describe here comes from the creation of DAISY talking books, to be clear. Similar tools for the production of EPUB 3 overlays will undoubtedly appear in time as well, but as of this writing they are in short supply.
Ultimately, the only advice that can be given is to strive for the finest granularity you can. Paragraphs may be easier to synchronize than sentences, but if the viewing screen isn’t large enough to display the entire paragraph, the invisible part won’t ever come into view as the narration plays (the reading system only knows to resynchronize at the next point; it can’t intrinsically know that the narration has to match what is on screen, nor does it have any way to determine what is on screen at any given time).
We’re not completely done yet, though. There are a few quick details to run through in order to now include this overlay in our EPUB.
Note
The following instructions assume a basic level of familiarity with EPUB publication files. Refer to the EPUB Publications 3.0 specification for more information.
Assuming we’ve saved our overlay as chapter_001_overlay.smil, the first step is simply to include an entry in the manifest:
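(The item id in this sketch is illustrative; any unique id will do.)

<item id="chapter_001_overlay"
      href="chapter_001_overlay.smil"
      media-type="application/smil+xml"/>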
We then need to include a media-overlay attribute on the manifest item for the corresponding content document for chapter one:
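(The content document’s item id and href here are illustrative.)

<item id="chapter_001"
      href="chapter_001.xhtml"
      media-type="application/xhtml+xml"
      media-overlay="chapter_001_overlay"/>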
The value of this attribute is the id we created for the overlay in the previous step.
And finally we need to add metadata to the publication file indicating the total length of the audio for each individual overlay and for the publication as a whole. For completeness, we’ll also include the name of the narrator.
<meta property="media:duration"
      refines="#chapter_001_overlay">0:14:43</meta>
<meta property="media:duration">0:23:46</meta>
<meta property="media:narrator">Stuart Wills</meta>
The refines attribute on the first meta element specifies the id of the manifest item we created for the overlay, as this is how we match the time value up to the content file it belongs to. The lack of a refines attribute on the next duration meta element indicates it contains the total time for the publication (only one can omit the refines attribute).
There’s one final metadata item left to add and then we’re done:
<meta property="media:active-class">-epub-media-overlay-active</meta>
This special media:active-class meta property tells the reading system which CSS class to apply to the active element when the audio narration is being played (i.e., the highlighting to give it).
For example, to apply a yellow background to each section of prose as it is read, as is traditionally found in accessible talking books, you would add the following rule to your CSS file:
.-epub-media-overlay-active { background-color: yellow; }
And that’s the long and short of creating overlays.
Structural Considerations
I briefly touched on the need to escape nested structures, and skip unwanted ones, but let’s go back to this functionality as a first best practice, as it is critical to the usability of the overlays feature in exactly the same way markup is to content-level navigation.
If you have the bad idea in your head that only par elements matter for playback, and you can go ahead and make overlays that are nothing more than a continuous sequence of these elements, get that idea back out of your head. It’s the equivalent of tagging everything in the body of a content file using div or p tags.
Using seq elements for problematic structures like lists and tables provides the information a reading system needs to escape from them.
Here’s how to structure a simple list, for example:
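(The ids, audio filename, and offsets in the following sketch are illustrative; the epub:type values come from the EPUB 3 structural semantics vocabulary.)

<seq id="list01" epub:type="list" epub:textref="chapter_001.xhtml#list01">
    <par epub:type="list-item">
        <text src="chapter_001.xhtml#li001"/>
        <audio src="audio/chapter_001.mp3"
               clipBegin="0:10:02.000"
               clipEnd="0:10:05.500"/>
    </par>
    <par epub:type="list-item">
        <text src="chapter_001.xhtml#li002"/>
        <audio src="audio/chapter_001.mp3"
               clipBegin="0:10:05.500"
               clipEnd="0:10:09.250"/>
    </par>
</seq>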
A reading system can now discover from the epub:type attribute the nature of the seq element and of each par it contains. If the reader indicates at any point during the playback of the par element list items that they want to jump to the end, the reading system simply continues playback at the next seq or par element following the parent seq. If the list contained sub-lists, you could similarly wrap each in its own seq element to allow the reader to escape back up through all the levels.
A similar nested seq process is critical for table navigation: a seq to contain the entire table, individual seq elements for each row, and table-cell semantics on the par elements containing the text data.
A simple three-cell table could be marked up using this model as follows:
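(Again, the ids and offsets are illustrative; note the seq for the table, the seq for its single row, and the table-cell semantics on each par.)

<seq id="table01" epub:type="table" epub:textref="chapter_001.xhtml#table01">
    <seq id="row01" epub:type="table-row" epub:textref="chapter_001.xhtml#row01">
        <par epub:type="table-cell">
            <text src="chapter_001.xhtml#cell01"/>
            <audio src="audio/chapter_001.mp3"
                   clipBegin="0:12:00.000"
                   clipEnd="0:12:03.000"/>
        </par>
        <par epub:type="table-cell">
            <text src="chapter_001.xhtml#cell02"/>
            <audio src="audio/chapter_001.mp3"
                   clipBegin="0:12:03.000"
                   clipEnd="0:12:06.500"/>
        </par>
        <par epub:type="table-cell">
            <text src="chapter_001.xhtml#cell03"/>
            <audio src="audio/chapter_001.mp3"
                   clipBegin="0:12:06.500"
                   clipEnd="0:12:10.000"/>
        </par>
    </seq>
</seq>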
You could also use a seq for the table cells if they contained complex data:
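(A sketch of one such cell; the ids and offsets are again illustrative.)

<seq id="cell01" epub:type="table-cell" epub:textref="chapter_001.xhtml#cell01">
    <par>
        <text src="chapter_001.xhtml#cell01-p01"/>
        <audio src="audio/chapter_001.mp3"
               clipBegin="0:12:00.000"
               clipEnd="0:12:01.750"/>
    </par>
    <par>
        <text src="chapter_001.xhtml#cell01-p02"/>
        <audio src="audio/chapter_001.mp3"
               clipBegin="0:12:01.750"
               clipEnd="0:12:03.000"/>
    </par>
</seq>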
But attention shouldn’t only be given to seq elements when it comes to applying semantics. Readers also benefit when par elements are identifiable, particularly for skipping:
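(The note id and offsets here are illustrative; the footnote semantic comes from the EPUB 3 structural semantics vocabulary.)

<par id="note01" epub:type="footnote">
    <text src="chapter_001.xhtml#fn01"/>
    <audio src="audio/chapter_001.mp3"
           clipBegin="0:14:20.000"
           clipEnd="0:14:28.500"/>
</par>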
If all notes receive the semantic as in the above example, a reader could disable all note playback, ensuring the logical reading order is maintained. All secondary content that isn’t part of the logical reading order should be so identified so that it can be skipped.
This small extra effort to mark up structures and secondary content does a lot to make your content more accessible.
Tell It Like It Is: Text-to-Speech (TTS)
An alternative (and complement) to human narration, and the associated costs of creating and distributing it, is speech synthesis—when done right, that is. The mere thought of synthesized speech is enough to make some people cringe, though, as it’s still typically equated with the likes of poor old much-maligned Microsoft Sam and his tinny, often-incomprehensible renderings. Modern high-end voices are getting harder and harder to distinguish as synthesized, however, and the voices on most reading systems and computers are getting progressively more natural sounding and pleasant to the ears for extended listening.
But whatever you think of the voices, the need to be able to synthesize the text of your ebook is always going to be vital to a segment of your readers, especially when human narration is not available. It’s also generally useful to the broader reading demographic, as I’ll return to.
And the voice issues are a bit of a red herring. The real issue here is not how the voices sound but the mispronunciations the rendering engines make, and the frequency with which they make them. The constant mispronunciation of words disrupts comprehension and ruins reading enjoyment, as it breaks the narrative flow and leaves the reader to guess what the engine was actually trying to speak. It doesn’t have to be this way, though; the errors occur because the mechanisms to enhance default synthetic renderings haven’t been made available in ebooks, not because there aren’t any.
But to step back slightly, synthetic speech engines aren’t inherently riddled with errors; they fail because word pronunciation can be an incredibly complex task, one that requires more than just the simple recognition of character data. Picture yourself learning a new language and struggling to understand why some vowels are silent in some situations and not in others, or why their pronunciation changes in seemingly haphazard ways, not to mention trying to grasp where phonetic boundaries are and so on. A rendering engine faces the same issues with less intelligence and no ability to learn on its own or from past mistakes.
The issue is sometimes as simple as not being able to parse parts of speech. For example, consider the following sentence:
An official group record of past achievements was never kept.
A speech engine may or may not say “record” properly, because record used as a noun is not pronounced the same way as record used as a verb in English.
The result is that most reading systems with built-in synthetic speech capabilities will do a decent job with the most common words in any language, but can trip over themselves when trying to pronounce complex compound words, technical terms, proper names, abbreviations, numbers, and the like. Heteronyms—words that are spelled the same way but have different pronunciations and meanings—also offer a challenge, as you can’t always be sure which pronunciation will come out. The word bass in English, for example, is pronounced one way to indicate a fish (rhyming with “mass”) and another to indicate an instrument (sounding like “base”).
When you add up the various problem areas, it’s no surprise that errors occur so frequently. These failings are especially problematic in educational, science, medical, legal, tax, and similar technical publishing fields, as you might expect, as the proper pronunciation of terms is critical to comprehension and to being able to communicate with peers.
The ability to correctly voice individual words is a huge benefit to all readers, in other words, which is why you should care about the synthetic rendering quality of your ebooks, as I said I’d get back to. Even if all your readers aren’t going to read your whole book via synthetic speech, everyone comes across words they aren’t sure how to pronounce, weird-looking character names, etc. In the print world, they’d just have to guess at the pronunciation and live with the nuisance of wondering for the rest of the book whether they have it right in their head or not (barring the rare pronunciation guide in the back, of course).
The embedded dictionaries and pronunciations that reading systems offer are a step up from print, but typically are of little-to-no help in many of these cases, since specialized terms and names don’t appear in general dictionaries. Enhancing your ebooks even just to cover the most complicated names and terms goes a long way to making the entire experience better for all. Enhanced synthetic speech capabilities are a great value-add to set you apart from the crowd, especially if you’re targeting broad audience groups.
Synthetic speech can also reduce the cost to produce audio-enhanced ebooks. Human narration is costly, as I mentioned at the outset, and typically only practical for novels, general non-fiction, and the like. But even in those kinds of books, are you going to have a person narrate the bibliographies and indexes and other complex structures in the back matter, or would it make more sense to leave them to the reader’s device to voice? Having the pronunciation of words consistent across the human-machine divide takes on a little more importance in this light, unless you want to irk your readers with rotten sounding back matter (or worse, omitted material).
And as I mentioned in the overlays section, there are reading systems that already give word-level text-audio synchronization in synthetic speech playback mode, surpassing what most people would attempt with an overlay and human narration. As each word is fed for rendering it gets highlighted on the screen auto-magically; there’s nothing special you have to do.
The cost and effort of improving synthetic speech also have the potential to decrease over time as you build reusable lexicons and processes to enhance your books.
But enough selling of benefits. You undoubtedly want to know how EPUB 3 helps you, so let’s get on with the task.
The new specification adds three mechanisms specifically aimed at synthetic speech production: PLS lexicon files, SSML markup, and CSS3 Speech style sheets. We’ll go into each of these in turn and explore how you can now combine them to optimize the quality of your ebooks.
PLS Lexicons
The first of the new synthetic speech enhancement layers we’ll look at is PLS files, which are XML lexicon files that conform to the W3C Pronunciation Lexicon Specification. The entries in these files identify the word(s) each pronunciation rule applies to, along with the correct phonetic spelling, which provides the text-to-speech engine with the proper pronunciation to render.
Perhaps a simpler way of thinking about PLS files, though, is as containing globally-applicable pronunciation rules: the entries you define in these files will be used for all matching cases in your content. Instead of having to add the pronunciation over and over every time the word is encountered in your markup, as SSML requires, these lexicons are used as global lookups.
PLS files are consequently the ideal place to define all the proper names and technical terms and other complex words that do not change based on the context in which they are used. Even in the case of heteronyms, it’s good to define the pronunciation you deem the most commonly used in your PLS file, as it may be the only case in your ebook(s). It also ensures that you know how the heteronym will always be pronounced by default, to remove the element of chance.
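To give a sense of the format, here is a minimal sketch of a PLS lexicon with a single entry (the word and its IPA transcription are purely illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa"
      xml:lang="en">
    <lexeme>
        <grapheme>Queequeg</grapheme>
        <phoneme>ˈkwiːkwɛɡ</phoneme>
    </lexeme>
</lexicon>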