Note
See the epubReadingSystem object for more information on how to query what scripting capabilities a system has.
Timed Tracks
Improved access to the content and the playback controls is only one half of the problem; your content still needs to be accessible to be useful. To this end, both the audio and video elements allow timed text tracks to be embedded using the HTML5 track element.
If you’re wondering what timed text tracks are, though, you’re probably more familiar with their practical names, like captions, subtitles, and descriptions. A timed track provides the instructions on how to synchronize text (or its rendering) with an audio or video resource: to overlay text as a video plays, to include synthesized voice descriptions, to provide signed descriptions, to allow navigation within the resource, etc.
As I touched on when talking about accessibility at the start of the guide, don’t underestimate the usefulness of subtitles and captions. They are not a niche accessibility need. There are many cases where readers would prefer not to be bothered with the noise while reading, are reading in an environment where enabling sound would bother others, or are unable to hear clearly or accurately what is going on because of background noise (e.g., on a subway, bus, or airplane). The irritation such readers will feel at having to return to the video later, when they are in a more amenable environment, pales next to the frustration of someone who is not provided any access to that information at all.
It probably bears repeating at this point, too, that subtitles and captions are not the same thing, and both have important uses that necessitate their inclusion. Subtitles provide the dialogue being spoken, whether in the same language as in the video or translated, and there’s typically an assumption the reader is aware which person is speaking. Captions, however, are descriptive and provide ambient and other context useful for someone who can’t hear what else might be going on in the video in addition to the dialogue (which typically will shift location on the screen to reflect the person speaking).
A typical aside at this point would be to show a simple example of how to create one of these tracks using one of the many available technologies, but plenty of these kinds of examples abound on the Web. Understanding a bit of the technology is not a bad thing, but, similar to writing effective descriptions for images, the bigger issue is having the experience and knowledge about the target audience to create meaningful and useful captions and descriptions. These issues are outside the realm of EPUB 3, so the only advice I’ll give is if you don’t have the expertise, engage those who do. Transcription costs are probably much less than you’d expect, especially considering the small amounts of video and audio ebooks will likely include.
We’ll instead turn our attention to how these tracks can be attached to your audio or video content using the track element. The following example shows a subtitle and a caption track being added to a video (a representative sketch; the file names and paths are illustrative):
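    <video width="320" height="180" controls="controls">
        <source src="video/v001.webm" type="video/webm"/>
        <track kind="subtitles" src="video/captions/en/v001.vtt"
               srclang="en" label="English"/>
        <track kind="captions" src="video/captions/en/v001.cc.vtt"
               srclang="en" label="English Captions"/>
    </video>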
The first three attributes on the track element provide information about the relation to the referenced video resource: the kind attribute indicates the nature of the timed track you’re attaching; the src attribute provides the location of the timed track in the EPUB container; and the srclang attribute indicates the language of that track.
The label attribute differs in that it provides the text to render when presenting the options the reader can select from. The advantage, as you might expect, is that you aren’t limited to a single version of any one type of track so long as each has a unique label. We could expand our previous example to include translated French subtitles as follows (again, the path is only illustrative):
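    <track kind="subtitles" src="video/captions/fr/v001.vtt"
           srclang="fr" label="Français"/>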
I’ve intentionally only used the language name for the label here to highlight one of the prime deficiencies of the track element for accessibility purposes, however. Different disabilities have different needs, and how you caption a video for someone who is deaf is not necessarily how you might caption it for someone with cognitive disabilities, for example.
The weak semantics of the label attribute are unfortunately all that is available to convey the target audience. The HTML5 specification, for example, currently includes a captions track along the following lines (adjusted here to be XHTML-compliant):
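    <track kind="captions" src="brave.en.hoh.vtt"
           srclang="en" label="English for the Hard of Hearing"/>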
You can match the kind of track and language to a reader’s preferences, but you can’t make finer distinctions about who is the intended audience without reading the label. Machines not only haven’t mastered the art of reading, but native speakers find many ways to say the same thing, scuttling heuristic tests.
The result is that reading systems are going to be limited in terms of being able to automatically enable the appropriate captioning for any given user. In reality, getting one caption track would be a huge step forward compared to the Web, but it takes away a tool from those who do target these reader groups and introduces a frustration for the readers in that they have to turn on the proper captioning for each video.
I mentioned the difference between subtitles and captions at the outset, but the kind attribute can additionally take the following two values of note:
descriptions — specifying this value indicates that the track contains a text description of the video. A descriptions track is designed to provide missing information to readers who can hear the audio but not see the video (which includes blind and low-vision readers, but also anyone for whom the video display is obscured or not available). The track is intended to be voiced by a text-to-speech engine.
chapters — a chapters track includes navigational aid within the resource. If your audio or video is structured in a meaningful way (e.g., scenes), adding a chapters track will enable readers of all abilities to more easily navigate through it.
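To illustrate, a video might carry these two kinds of tracks alongside its subtitles and captions (a sketch; the file names and labels are illustrative):

    <track kind="descriptions" src="video/descriptions/en/v001.vtt"
           srclang="en" label="English Audio Descriptions"/>
    <track kind="chapters" src="video/chapters/en/v001.vtt"
           srclang="en" label="Scene Navigation"/>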
But now I’m going to trip you up a bit. The downside of the track element that I’ve been trying to hold off on is that it remains unsupported in browsers as of this writing (at least natively), which means EPUB readers also may not support tracks right away. There are some JavaScript libraries that claim to be able to provide support now (polyfills, as they’re colloquially called), but using them assumes the reader has a JavaScript-enabled reading system.
Embedding the tracks directly in your video resources is, of course, another option if native support does not materialize right away.
Talk to Me: Media Overlays
When you watch words get highlighted in your reading system as a narrator speaks, the term media overlay probably doesn’t immediately jump to mind as the best marketing buzzword to describe the experience. But what you are in fact witnessing is a media type (audio) being overlaid on your text content, and tech trumps marketing in standards!
The audio-visual magic that overlays enable in EPUBs is just the tip of the iceberg, however. Overlays represent a bridge between the audio and video worlds, and between mainstream and accessibility needs, which is what really makes this new technology so exciting. They offer accessible audio navigation for blind and low-vision readers. They can improve the reading experience for persons with trouble focusing across a page and improve reading comprehension for many types of readers. They can also provide a unique read-along experience for young readers.
From a mainstream publisher’s perspective, overlays provide a bridge between audiobook and ebook production streams. Create a single source using overlays and you could transform and distribute across the spectrum from audio-only to text-only. With full-text and full-audio synchronization ebooks, you can transform down to a variety of formats. If you’re going to create an audiobook version of your ebook, it doesn’t make sense not to combine production, which is exactly what EPUB 3 now allows you to do. Your source content is more valuable by virtue of its completeness, and you can also choose to target and distribute your ebook with different modalities enabled for different audiences.
From a reader’s perspective, the appeal is a format that provides adaptive modalities to meet their reading needs: readers can choose which playback format they prefer, or purchase a book with multiple modalities and switch between them as the situation warrants (listening while driving and visually reading when back at home, for example).
Media overlays are the answer to a lot of problems on both sides of the accessibility fence, in other words.
If you’re coming to this guide from accessibility circles, however, you’re probably wondering why this is considered new and exciting when it sounds an awful lot like the SMIL technology that has been at the core of the DAISY talking book specifications for more than a decade. And you’re right…sort of. Overlays are not a new technology, but a new evolution of the DAISY standard, to which EPUB 3 is a successor. What is really exciting from an accessibility perspective is the chance to move this production back to the source and get high-quality synchronized text and audio directly from publishers. Another benefit overlays provide over older talking book formats is that synchronization information no longer has to be encoded in the content files themselves, which greatly simplifies production.
But knowing what overlays are and how they can enhance ebooks doesn’t get us any closer to understanding how they work and the considerations involved in making them, which is the real goal for this section. If you like to believe in magic, though, here’s an early warning that by the end it won’t seem all that fantastic how your reading system makes words and paragraphs highlight as a voice narrates the text. Prepare to be disappointed that your reading system doesn’t have superpowers.
To begin moving under the hood of an EPUB, though, the first thing to understand is that overlays are just specialized XML documents that contain the instructions a reading system uses to synchronize the text display with the audio playback. They’re expressed using a subset of SMIL that we’ll cover as we move along, combined with the epub:type attribute we ran into earlier for semantic inflection.
Note
SMIL (pronounced “smile”) is the shorthand way of referring to the Synchronized Multimedia Integration Language. For more information on this technology, see http://www.w3.org/TR/SMIL
The order of the instructions in the overlay document defines the reading order for the ebook when in playback mode. A reading system will move through the instructions one at a time, or a reader can manually navigate in similar fashion to how assistive technologies enable navigation through the markup (i.e., escaping and skipping).
As a reading system encounters each synchronization point, it determines from the provided information which element in which content file has to be loaded (by its id) and the corresponding position in the audio track at which to start the narration. The reading system will then load and highlight the word or passage for you at the same time that you hear the audio start. When the audio track reaches the end point you’ve specified—or the end of the audio file if you haven’t specified one—the reading system checks the next synchronization point to see what text and audio to load next.
This process of playback and resynchronization continues over and over until you reach the end of the book, giving the appearance to the reader that their system has come alive and is guiding them through it.
Note
This portrayal is intentionally simple. In practice, overlay synchronization points may, for example, omit an audio reference when the reading system is expected to synthetically render the text, or when the text reference points to a multimedia object (such as an audio or video element) that the reading system is expected to initiate. Refer to the Media Overlays specification for more information on the full range of features.
As you might suspect at this point, the reading system can’t synchronize or play content back any way but what has been defined; as a reader you cannot, for example, dynamically change from word-to-word to paragraph-by-paragraph read-back as you desire. The magic is only as magical as you make it, at least at this time.
With only a single level of playback granularity available, the decision on how fine a playback experience to provide has typically been influenced by the disability you’re targeting, going back to the origins of the functionality in talking books. Books for blind and low-vision readers are often only synchronized to the heading level, for example, and omit the text entirely. Readers with dyslexia or cognitive issues, however, may benefit more from word-level synchronization using full-text full-audio playback.
Coarser synchronization—for example, at the phrase or paragraph level—can be useful in cases where the defining characteristics of a particular human narration (flow, intonation, emphasis) add an extra dimension to the prose, such as with spoken poetry or religious verses. The production costs associated with synchronizing human-narrated ebooks to the word level, however, have typically meant that only short-prose works (such as children’s books) get this treatment.
Let’s turn to the practical construction of an overlay to discover why the complexity increases at each finer level of synchronization, though. Understanding the issues will give better insight into which model you ultimately decide to use.
Building an Overlay
Every overlay document begins with a root smil element and a body, as exemplified in the following markup:
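    <smil xmlns="http://www.w3.org/ns/SMIL"
          xmlns:epub="http://www.idpf.org/2007/ops"
          version="3.0">
        <body>
        </body>
    </smil>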
There’s nothing exciting going on here but a couple of namespace declarations and a version attribute on the root. These are static in EPUB 3, so of little interest beyond their existence. There is no required metadata in the overlays themselves, which is why we don’t need to add a head element.
Of course, to illustrate how to build up this shell and include it in an EPUB, we’re going to need some content. For the rest of this section, I’m going to use the Moby Dick ebook that Dave Cramer, a member of the EPUB working group, built as a proof of concept of the specification. This book is available from the EPUB 3 Sample Content project page.
If we look at the content file for chapter one, we can see that the HTML markup has been structured to showcase different levels of text/audio synchronization. After the chapter heading, for example, the first paragraph has been broken down to achieve fine synchronization granularity (word and sentence level), whereas the following paragraph hasn’t been divided into smaller parts.
Compressing the markup in the file to just what we’ll be looking at, we have:
    <!-- abridged; id values follow the conventions used in the sample -->
    <section id="xchapter_001">
        <h3 id="c01h01">Chapter 1. Loomings.</h3>
        <p id="c01p0001">
            <span id="c01w00001">Call</span>
            <span id="c01w00002">me</span>
            <span id="c01w00003">Ishmael.</span>
            <span id="c01s0002">Some years ago…</span>
            …
        </p>
        <p id="c01p0002">There now is your insular city of the Manhattoes…</p>
    </section>
You’ll notice that each element containing text content has an id attribute, as that’s what we’ll be referencing when we get to synchronizing with the audio track.
The markup additionally includes span tags to differentiate words and sentences in the first p tag. The second paragraph only has an id attribute on it, however, as we’re going to omit synchronization on the individual text components it contains to show paragraph-level synchronization.
We can now take this information to start building the body of our overlay. Switching back to our empty overlay document, the first element we’re going to include in the body is a seq:
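    <body>
        <!-- file and id values here follow the Moby Dick sample -->
        <seq id="id1" epub:textref="chapter_001.xhtml#xchapter_001"
             epub:type="bodymatter chapter">
        </seq>
    </body>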
This element serves the same grouping function the corresponding section element does in the markup, and you’ll notice the epub:textref attribute references the section’s id. The logical grouping of content inside the seq element likewise enables escaping and skipping of structures during playback, as we’ll return to when we look at some structural considerations later.
In this case, the epub:type attribute conveys that this seq represents a chapter in the body matter. Although the attribute isn’t required, there’s little benefit in adding seq elements if you omit any semantics, as a reading system will not be able to provide skippability and escapability behaviors unless it can identify the purpose of the structure.
It may seem redundant to have the same semantic information in both the markup and overlay, but remember that each is tailored to different rendering and playback methods. Without this information in the overlay, the reading system would have to inspect the markup file to determine what the synchronization elements represent, and then resynchronize the overlay using the markup as a guide. Not a simple process. A single channel of information is much more efficient, although it does translate into a bit of redundancy (you also typically wouldn’t be crafting these documents by hand, and a recording application could potentially pick up the semantics from the markup and apply them to the overlay for you).
We can now start defining synchronization points by adding par elements to the seq, which is the only other step in the process. Each par contains a child text and a child audio element, which define the fragment of your content and the associated portion of an audio file to render in parallel, respectively.
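For example, a first synchronization point for the chapter heading might look like the following sketch (the audio file name and clip offsets are placeholders; real values would come from your narration recording):

    <par id="heading1">
        <text src="chapter_001.xhtml#c01h01"/>
        <audio src="audio/chapter_001.mp4"
               clipBegin="0:00:24.500" clipEnd="0:00:29.250"/>
    </par>

When this par is reached, the reading system loads the heading identified by c01h01 and plays the specified clip of the audio file in parallel.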