
Accessible EPUB 3


by Matt Garrish


  Note

  The PLS specification does define a role attribute to enable context-dependent pronunciations (e.g., to differentiate the pronunciation of a word when used as a verb or noun), but support for it is not widespread and no vocabulary is defined for standard use. As a result, I’ll defer context-dependent differentiation to SSML, even though a measure of it is technically possible in PLS files.

  But let’s take a look at a minimal example of a complete PLS file to see how they work in practice. Here we’ll define a single entry for “acetaminophen” to cure our pronunciation headaches:

  <lexicon
      version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="x-sampa"
      xml:lang="en">
    <lexeme>
      <grapheme>acetaminophen</grapheme>
      <phoneme>@"sit@'mIn@f@n</phoneme>
    </lexeme>
  </lexicon>

  To start breaking this markup down, the alphabet attribute on the root lexicon element defines the phonetic alphabet we’re going to use to write our pronunciations. In this case, I’m indicating that I’m going to write them using X-SAMPA.

  Note

  X-SAMPA is the Extended Speech Assessment Methods Phonetic Alphabet. Being an ASCII-based phonetic alphabet, I’ve chosen to use it here only because it is more easily writable (by mere mortals like this author) than the International Phonetic Alphabet (IPA). It is not clear at this time which alphabet(s) will receive the most widespread support in reading systems, however.

  The version and xmlns namespace declaration attributes are static values, so nothing exciting to see there, as usual. The xml:lang attribute, however, is required, and must reflect the language of the entries contained in the lexicon. Here we’re declaring that all the entries are in English.

  The root would normally contain many more lexeme elements than in this example, as each defines the word(s) the rule applies to in the child grapheme element(s). (Graphemes, of course, don’t have to take the form of words, but for simplicity of explanation I’ll stick to the general concept.) When the string is matched, the pronunciation in the phoneme element gets rendered in place of the default rendering the engine would have performed.

  Or, if it helps conceptualize, when the word “acetaminophen” is encountered in the prose, before passing the word to the rendering engine to voice, an internal lookup of the defined graphemes occurs. Because we’ve defined a match, the phoneme and the alphabet it adheres to are swapped in instead for voicing.
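  If it helps to see that lookup expressed in code, here’s a rough Python sketch of the process. This is purely illustrative (reading systems don’t literally work this way, and the function names are my own invention); the lexicon content is the acetaminophen example above:

```python
import xml.etree.ElementTree as ET

# PLS elements live in this namespace
PLS_NS = "{http://www.w3.org/2005/01/pronunciation-lexicon}"

# A minimal lexicon equivalent to the example above
PLS_SOURCE = """<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="x-sampa" xml:lang="en">
  <lexeme>
    <grapheme>acetaminophen</grapheme>
    <phoneme>@"sit@'mIn@f@n</phoneme>
  </lexeme>
</lexicon>"""

def load_lexicon(source):
    """Parse a PLS document into a grapheme -> phoneme dictionary."""
    lookup = {}
    for lexeme in ET.fromstring(source).iter(PLS_NS + "lexeme"):
        phoneme = lexeme.find(PLS_NS + "phoneme")
        # every grapheme spelling in the lexeme maps to the same phoneme
        for grapheme in lexeme.findall(PLS_NS + "grapheme"):
            lookup[grapheme.text] = phoneme.text
    return lookup

def voice(word, lexicon):
    """Return the lexicon pronunciation if one matches; otherwise hand
    the word back so the engine applies its own default rules."""
    return lexicon.get(word, word)

lexicon = load_lexicon(PLS_SOURCE)
```

  A matched grapheme swaps in the phoneme; anything else falls through to the engine untouched, which is the behavior described above.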

  That you can include multiple graphemes may not seem immediately useful, but it enables you to create a single entry for regional variations in spelling, for example. British and American variants of “defense” could be defined in a single rule as:

  <lexeme>
    <grapheme>defense</grapheme>
    <grapheme>defence</grapheme>
    <phoneme>dI'fEns</phoneme>
  </lexeme>

  It is similarly possible to define more than one pronunciation by adding multiple phoneme elements. We could add the IPA spelling to the last example as follows, in case reading systems end up only supporting one or the other alphabet:

  <lexeme>
    <grapheme>defense</grapheme>
    <grapheme>defence</grapheme>
    <phoneme>dI'fEns</phoneme>
    <phoneme alphabet="ipa">dɪˈfɛns</phoneme>
  </lexeme>

  The alphabet attribute on the new phoneme element is required because its spelling doesn’t conform to the default defined on the root. If the rendering engine doesn’t support X-SAMPA, it could now possibly make use of this embedded IPA version instead.

  The phoneme doesn’t have to be in another alphabet, however; you could add a regional dialect as a secondary pronunciation, for example. The specification unfortunately doesn’t provide any mechanisms to indicate why you’ve included such additional pronunciations or when they should be used, so there’s not much value in doing so at this time.

  There’s much more to creating PLS files than can be covered here, of course, but you’re now versed in the basics and ready to start compiling your own lexicons. You only need to attach your PLS file to your publication to complete the process of enhancing your ebook.

  The first step is to include an entry for the PLS file in the EPUB manifest:

  <item id="pls" href="EPUB/lexicon.pls" media-type="application/pls+xml"/>

  The href attribute defines the location of the file relative to the EPUB container root and the media-type attribute value “application/pls+xml” identifies to a reading system that we’ve attached a PLS file.

  Including one or more PLS files does not mean they apply by default to all your content, however; in fact, they apply to none of it by default. You next have to explicitly tie each PLS lexicon to each XHTML content document it is to be used with by adding a link element to the document’s header:

  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      …
      <link
          rel="pronunciation"
          href="lexicon.pls"
          type="application/pls+xml"/>
      …
    </head>
    …
  </html>

  There are a number of differences between the declaration for the PLS file in the publication manifest above and in the content file here. The first is the use of the rel attribute to include an explicit relationship (that the referenced file represents pronunciation information). This attribute represents somewhat redundant information, however, since the media type is once again specified (here in the type attribute). But as it is a required attribute in HTML5, it can’t be omitted.

  You may have also noticed that the location of the PLS file appears to have changed. We’ve dropped the EPUB subdirectory from the path in the href attribute because reading systems process the EPUB container differently than they do content files. Resources listed in the manifest are referenced by their location from the container root. Content documents, on the other hand, reference resources relative to their own location in the container. Since we’ll store both our content document and lexicon file in the EPUB subdirectory, the href attribute contains only the filename of the PLS lexicon.
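  The two frames of reference can be sketched with a little path arithmetic. This Python fragment is only an illustration of the resolution rules just described; the file and directory names are hypothetical:

```python
import posixpath

# Manifest hrefs: resolved from the EPUB container root
manifest_href = "EPUB/lexicon.pls"
manifest_path = posixpath.normpath(posixpath.join("/", manifest_href))

# Content-document hrefs: resolved from the document's own directory
content_doc = "/EPUB/chapter01.xhtml"   # hypothetical content file location
link_href = "lexicon.pls"               # only the filename, as noted above
link_path = posixpath.normpath(
    posixpath.join(posixpath.dirname(content_doc), link_href))

# Both references resolve to the same resource inside the container
```

  Even though the two href values differ, both point at the same lexicon file once each is resolved against its own base.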

  The HTML link element also includes an additional piece of information to allow selective targeting of lexicons: the hreflang attribute. This attribute specifies the language to which the included pronunciations apply. For example, if you have an English document (as defined in the xml:lang attribute on the html root element) that embeds French prose, you could include two lexicon files:

  <link rel="pronunciation" href="english.pls"
        type="application/pls+xml" hreflang="en"/>
  <link rel="pronunciation" href="french.pls"
        type="application/pls+xml" hreflang="fr"/>

  Assuming all your French passages have xml:lang attributes on them, the reading system can selectively apply the lexicons to prevent any possible pronunciation confusion:

 

  <p>
    It's the Hunchback of
    <i xml:lang="fr">Notre Dame</i>
    not of Notre Dame.
  </p>



  A unilingual person reading this prose probably would not understand the distinction being made here: that the French pronunciation is not the same as the Americanization. Including separate lexicons by language, however, would ensure that readers would hear the Indiana university name differently than the French cathedral if they turn on TTS:

  In the English lexicon:

  <lexeme>
    <grapheme>Notre Dame</grapheme>
    <phoneme>noUt@r 'deIm</phoneme>
  </lexeme>

  And in the French lexicon:

  <lexeme>
    <grapheme>Notre Dame</grapheme>
    <phoneme>n%oUtr@ d"Am</phoneme>
  </lexeme>

  When the contents of the i tag are encountered, and identified as French, the pronunciation from the corresponding lexicon gets applied instead of the one from the default English lexicon.
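  The language-based selection can be sketched as a dictionary of lexicons keyed by the hreflang values of the linked PLS files. Again, this is an illustration of the behavior, not how any particular reading system is implemented, and the helper name is my own:

```python
def pick_phoneme(word, lang, lexicons, default_lang="en"):
    """Look the word up in the lexicon matching its language tag,
    falling back to the document's default language lexicon."""
    lexicon = lexicons.get(lang) or lexicons.get(default_lang, {})
    # unmatched words fall through to the engine's default rendering
    return lexicon.get(word, word)

# Lexicons keyed by the hreflang of each linked PLS file
lexicons = {
    "en": {"Notre Dame": "noUt@r 'deIm"},
    "fr": {"Notre Dame": 'n%oUtr@ d"Am'},
}
```

  Prose tagged xml:lang="fr" consults the French lexicon, and everything else uses the English one, which is exactly the selective application described above.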

  Now that we know how to globally define pronunciation rules, let’s turn to how we can override and/or define behavior at the markup level.

  SSML

  Although PLS files are a great way to globally set the pronunciation of words, their primary failing is that they aren’t a lot of help where context matters in determining the correct pronunciation. Leave the pronunciation of heteronyms to chance, for example, and you’re invariably going to be disappointed by the result; the cases where context might not significantly influence comprehension (e.g., an English heteronym like “mobile”) are going to be dwarfed by the ones where it does.

  By way of example, when talking about PLS files I mentioned bass the instrument and bass the fish as an example of how context influences pronunciation. Let’s take a look at this problem in practice now:

 

  <p>The guitarist was playing a bass that was shaped like a bass.</p>



  Human readers won’t have much of a struggle with this sentence, despite the contrived oddity of it. A guitarist is not going to be playing a fish shaped like a guitar, and it would be strange to note that the bass guitar is shaped like a bass guitar. From context you’re able to determine without much pause that we’re talking about someone playing a guitar shaped like a fish.

  All good and simple. Now consider your reaction if, when listening to a synthetic speech engine pronounce the sentence, you heard both words pronounced the same way, which is the typical result. The process to correct the mistake takes you out of the flow of the narrative. You’re going to wonder why the guitar is shaped like a guitar, admit it.

  Synthetic narration doesn’t afford you the same ease of moving forward and back through the prose that visual reading does, as words are only announced as they’re voiced. The engine may be applying heuristic tests to attempt to better interpret the text for you behind the scenes, but you’re at its mercy. You can back up and listen to the word again to verify whether the engine said what you thought it did, but it’s an intrusive process that requires you to interact with the reading system. If you still can’t make sense of the word, you can have the reading system spell it out as a last resort, but now your train of thought is entirely taken up with deciphering the word.

  And this is an easy example. A blind reader used to synthetic speech engines would probably just keep listening past this sentence, having quickly assumed that the engine should have said something else, but that’s not a justification for neglect. The problems only get more complex and less avoidable, no matter your familiarity. And asking your readers to compensate is a major red flag that you’re not being accessible, as mispronunciations are not always easily overcome, depending on the reader’s disability. It also doesn’t reflect well on your ebooks if readers turn to synthetic speech engines for help with pronunciation and find gibberish, as I touched on in the last section.

  And the problems are rarely one-time occurrences. When the reader figures out what the engine was trying to say they will, in all likelihood, have to make a mental note on how to translate the synthetic gunk each time it is re-encountered to avoid repeatedly going through the same process. If you don’t think that makes reading comprehension a headache, try it sometime.

  But this is where the Speech Synthesis Markup Language (SSML) comes in, allowing you to define individual pronunciations at the markup level. EPUB 3 adds the ssml:alphabet and ssml:ph attributes, which allow you to specify the alphabet you’re using and the phonemic pronunciation of the containing element’s content, respectively. These attributes work in very much the same way as the PLS entries we just reviewed, as you might already suspect.

  For example, we could revise our earlier example as follows to ensure the proper pronunciation for each use of bass:

 

  <p>
    The guitarist was playing a
    <span ssml:alphabet="x-sampa" ssml:ph="beIs">bass</span>
    that was shaped like a
    <span ssml:alphabet="x-sampa" ssml:ph="b&amp;s">bass</span>.
  </p>



  The ssml:alphabet attribute on each span element identifies that the pronunciation carried in the ssml:ph attribute is written in X-SAMPA, identically to the PLS alphabet attribute. We don’t need a grapheme to match against, because we’re telling the synthetic speech engine to replace the content of the span element. The engine will now voice the provided pronunciations instead of applying its own rules. In other words, no more ambiguity and no more rendering problem; it really is that simple.

  Note

  The second ssml:ph attribute includes an &amp; entity, as the actual X-SAMPA spelling is b&s. Ampersands are special characters in XHTML that denote the start of a character entity, so they have to be converted to entities themselves in order for your document to be valid. When passed to the synthetic speech engine, however, the entity will be converted back to the ampersand character. (In other words, the extra characters needed to encode the character will not affect the rendering.)

  Single and double quote characters in X-SAMPA representations would similarly need to be escaped depending on the characters you use to enclose the attribute value.
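  Python’s standard library can handle this escaping for you if you’re generating the markup programmatically, and it provides a quick way to confirm the round-trip behavior the note describes:

```python
from xml.sax.saxutils import escape, quoteattr, unescape

# The actual X-SAMPA spelling, with a raw ampersand
raw = "b&s"

# escape() converts the ampersand to an entity for use in markup
escaped = escape(raw)

# The engine sees the original spelling again after entity expansion
restored = unescape(escaped)

# quoteattr() also wraps the value in quotes, switching quote style
# when the value itself contains a double quote (as in d"Am)
attr = quoteattr('d"Am')
```

  Serializing attribute values with quoteattr() sidesteps the quote-escaping concerns entirely, since it picks a safe quoting style for you.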

  It bears a quick note that the pronunciation in the ssml:ph attribute has to match the prose contained in the element it is attached to. By wrapping span elements around each individual word in this example, I’ve limited the translation of text to phonetic code to just the problematic words I want to fix. If I put the attribute on the parent p element, I’d have to transcode the entire sentence.

  The upside of the granularity SSML markup provides should be clear now, though: you can overcome any problem no matter how small (or big) with greater precision than PLS files offer. The downside, of course, is having to work at the markup level to correct each instance that has to be overridden.

  To hark back to the discussion of PLS files for a moment, though, we could further simplify the correction process by moving the more common pronunciation to our PLS lexicon and only fix the differing heteronym:

  <lexeme>
    <grapheme>bass</grapheme>
    <phoneme>beIs</phoneme>
  </lexeme>

  <p>
    The guitarist was playing a bass that was shaped like a
    <span ssml:alphabet="x-sampa" ssml:ph="b&amp;s">bass</span>.
  </p>



  It’s also not necessary to define the ssml:alphabet attribute every time. If we were only using a single alphabet throughout the document, which would be typical of most ebooks, we could instead define the alphabet once on the root html element:

  <html ssml:alphabet="x-sampa" …>

  So long as the alphabet is defined on an ancestor of the element carrying the ssml:ph attribute, a rendering engine will interpret it correctly (and your document will be valid). (The root element is the ancestor of all the elements in the document, which is why these kinds of declarations are invariably found on it, in case you’ve ever wondered but were afraid to ask.)

  Our markup can now be reduced to the much more legible and easily maintained:

 

  <p>
    The guitarist was playing a
    <span ssml:ph="beIs">bass</span>
    that was shaped like a
    <span ssml:ph="b&amp;s">bass</span>.
  </p>



  Note

  If you’re planning to share content across ebooks or across content files within one, it’s better to keep the attributes paired so that there is no confusion about which alphabet was used to define the pronunciation. It’s not a common requirement, however.

  But heteronyms are far from the only case for SSML. Any language construct that can be voiced differently depending on the context in which it is used is a candidate for SSML. Numbers are always problematic, as are weights and measures:

 

  <p>
    There are
    <span ssml:ph="wVn "TaUz@nd "twEnti "fOr">1024</span>
    bits in a kilobit, not
    <span ssml:ph="tEn "twEnti "fOr">1024</span>,
    as the year is pronounced.
  </p>

  <p>
    It reached a high of
    <span ssml:ph=""T3rti "sEv@n dI"griz "sElsi@s">37C</span>
    in the sun as I stood outside
    <span ssml:ph=""T3rti "sEv@n si">37C</span>
    waiting for someone to answer my knocks and let me in.
  </p>

  <p>
    You'll be an
    <span ssml:ph=""Ekstr@ "lArdZ">XL</span>
    by the end of Super Bowl
    <span ssml:ph=""fOrti">XL</span>
    at the rate you're eating.
  </p>



  But there’s unfortunately no simple guideline to give in terms of finding issues. It takes an eye for detail and an ear for possible different aural renderings. Editors and indexers are good starting resources for the process, as they should be able to quickly flag problem words during production so they don’t have to be rooted out after the fact. Programs that can analyze books and report on potentially problematic words, although not generally available, are not just a fantasy. Their prevalence will hopefully grow now that EPUB 3 incorporates more facilities to enhance default renderings, as they can greatly reduce the human burden.

  The only other requirement when using the SSML attributes that I haven’t touched on is that you always have to declare the SSML namespace. I’ve omitted the declaration from the previous examples for clarity, and because the namespace is typically only specified once on the root html element as follows:

  <html xmlns="http://www.w3.org/1999/xhtml"
        xmlns:ssml="http://www.w3.org/2001/10/synthesis"
        ssml:alphabet="x-sampa"
        …>

  Similar to the alphabet attribute, we could have equally well attached the namespace declaration to each instance where we used the attributes:

  <span
      xmlns:ssml="http://www.w3.org/2001/10/synthesis"
      ssml:alphabet="x-sampa"
      ssml:ph="beIs">bass</span>

  But that’s a verbose approach to markup, and generally only makes sense when content is encapsulated and shared across documents, as I just noted, or expected to be extracted into foreign playback environments where the full document context is unavailable.

  The question you may still be wondering about at this point: if a PLS file contains a pronunciation rule that matches a word also carrying an SSML pronunciation, how can you be sure which one wins? You don’t have to worry, however, as the EPUB 3 specification defines a precedence rule stating that the SSML pronunciation must be honored. Otherwise there’d be no way to override the global PLS definitions, which would make SSML largely useless in resolving conflicts.
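  The precedence order reduces to a simple cascade, sketched here in Python (an illustration of the rule, not an actual reading system API; the function name is mine):

```python
def effective_pronunciation(word, ssml_ph=None, lexicon=None):
    """EPUB 3 precedence: an ssml:ph attribute on the markup wins over
    any matching PLS lexicon entry; the engine default comes last."""
    if ssml_ph is not None:
        return ssml_ph          # local SSML override always wins
    if lexicon and word in lexicon:
        return lexicon[word]    # global PLS rule applies next
    return word                 # engine falls back to its own rules

# Global PLS rule for the more common pronunciation (the instrument)
pls_lexicon = {"bass": "beIs"}
```

  With this ordering, the PLS lexicon handles the common case globally while a single ssml:ph attribute can still override it wherever the heteronym appears.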

  But to wrap up, a final note is that there is no reason why you couldn’t make all your improvements in SSML. It’s not the ideal way to tackle the problem, at least in this author’s opinion, because of the text-level recognition and tagging it requires. But it may make more sense for internal production to use only a single technology, and support for PLS may not prove universal (it’s too early to know yet).

 
