Summary

Setting the language ensures that assistive technologies correctly interpret and render the text and that reading systems can make language enhancements available for users.

Techniques

Example

Example 1 — Declaring the package document language

The xml:lang attribute is set to English on the package element to ensure the metadata in the package is correctly interpreted.

<package … xml:lang="en">

Example 2 — Overriding the default package document language

The xml:lang attribute is used to indicate the author's name is in Japanese.

<package … xml:lang="en">
   <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
     …
     <dc:creator id="creator" xml:lang="ja">村上 春樹</dc:creator>
     …
   </metadata>
   …
</package>

Example 3 — Identifying the primary languages of a publication

The dc:language element is used to indicate the primary languages of the content are French and English.

<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
   …
   <dc:language>fr</dc:language>
   <dc:language>en</dc:language>
   …
</metadata>

Example 4 — Setting the language in XHTML and SVG content documents

Although the xml:lang attribute sets the language for XML grammars like XHTML and SVG, it is also good practice to include a lang attribute with the same value. Refer to the section on setting the lang attribute for more information.

<html … xml:lang="en" lang="en">
   …
</html>

<svg … xml:lang="en" lang="en">
   …
</svg>

Frequently Asked Questions

Do I need to list every language used in the publication?

No, the dc:language elements should only list the primary languages of the content. If a publication contains a few phrases in a foreign language, for example, that language is not listed.

Explanation

Setting the language of a publication is an important step in ensuring its accessibility as it helps assistive technologies pronounce the text correctly. Without language declarations, assistive technologies will read the text in the default language of the user. This can lead to the entire text being mispronounced (when reading a publication in another language) or individual phrases being mangled (for inline foreign phrases).

This tutorial covers how to set the language in the EPUB package document as well as in XHTML and SVG content documents so that the information is available to assistive technologies and reading systems.

What are language tags

Before we can get into the mechanisms for setting the language, it is important to first understand what you are setting. Languages are declared using language tags, which are hyphen-separated codes that identify the language, region, script, etc.

At a minimum, each language tag consists of a primary language subtag, which is a two- or three-character code that identifies the language.

The following table lists some common language codes:

Code    Language
de      German
en      English
he      Hebrew
hi      Hindi
ko      Korean

For the complete list of language codes, refer to the IANA Language Subtag Registry (search for the language's name to find its code).

For many languages, all that you have to specify is the primary language code. For languages with regional dialects, however, you can add a region subtag for more precision.

American and British pronunciations, for example, can differ significantly, but a code of "en" will not inform an assistive technology which to apply when it matters (e.g., when reading an American or British novel where the characters talk in regional dialects). Adding the region solves this problem as it allows the assistive technology to pick a more appropriate voice for playback.

The region subtag is added to the language using a hyphen. For example, "en-US" indicates that the text is in English as spoken in the US.

The following table lists some common language tags with their region subtags:

Code     Language
en-GB    British English
en-US    US English
fr-CA    French as spoken in Canada
fr-FR    French as spoken in France

Note that although it is common convention to capitalize region subtags, this is not a requirement. Language tags are processed case-insensitively.
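
As a small illustration of the case rule, the two paragraphs in the following sketch carry the same language tag written with different capitalization, so a conforming processor should treat them identically. The wrapper element and sample sentence are invented for this example.

```xml
<div>
   <!-- Both declarations identify English as spoken in the US;
        only the conventional capitalization of the region differs. -->
   <p xml:lang="en-US">The color of the truck was gray.</p>
   <p xml:lang="en-us">The color of the truck was gray.</p>
</div>
```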

You can also specify the script the text is written in using a script subtag. Simplified and traditional Chinese, for example, can be differentiated using the "Hans" and "Hant" script subtags, as in "zh-Hans" and "zh-Hant". You should only use script subtags when a language is commonly written in more than one script.
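
A brief sketch of how these tags might appear in markup follows. It assumes a hypothetical document that labels each script with its own name: the first paragraph reads "Simplified Chinese" and the second reads "Traditional Chinese".

```xml
<div>
   <!-- Chinese written in the Simplified script -->
   <p xml:lang="zh-Hans">简体中文</p>
   <!-- Chinese written in the Traditional script -->
   <p xml:lang="zh-Hant">繁體中文</p>
</div>
```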

For a more in-depth explanation of language tags, refer to the W3C article Language tags in HTML and XML.

Language declaration mechanisms

Now that you understand what language tags are, it is time to turn to how to express those tags in markup languages.

In XML-based markup languages, like XHTML, SVG and the EPUB package document, the standard mechanism for declaring the language of the text is the xml:lang attribute, where the value of this attribute is a language tag.

Best practice is to always declare a language on the root element (i.e., the element that contains all the other markup). For example, the language of an XHTML document can be specified as follows:

<html … xml:lang="en-US">
   …
</html>

Language information is inherited, so by setting the attribute on the root element you automatically declare the language for all the elements and text in the document.

Overriding the language

Not all publications are written in a single language. Multilingual publications may switch between languages often, while other publications may contain short phrases or single words in another language.

To indicate a change of language, you only need to declare the new language on a tag that surrounds the foreign text. The change in language only exists within that tag, as shown in the following example:

<p xml:lang="en">
    This is in English
    <span xml:lang="fr">mais ceci en français</span>
    and back to English again.
</p>

Note

The lang attribute is omitted from these XHTML examples for clarity. Refer to the section on lang in XHTML and SVG for why it is useful to include.

The text of markup documents always inherits the language of the nearest ancestor tag with a language declaration, so there is no limit on how many times the language can change:

<p xml:lang="en">
    English
    <span xml:lang="fr">
       French
       <span xml:lang="es">Spanish</span>
       French
    </span>
    English
</p>

It is important to indicate when the primary language changes so that text-to-speech engines can pronounce the foreign language phrases correctly. Without the correct language information they will try to pronounce the text according to the rules for the default language.

It is not necessary to indicate a language change for terms and phrases that have become part of the default language, however. Words like "café" and "coup d'état", although French in origin, are now considered common English phrases. Text-to-speech engines can typically handle these words as English.
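
For instance, a sentence like the following (invented for illustration) would typically need no inline language markup, because both borrowed terms are established English vocabulary:

```xml
<p xml:lang="en">
    They discussed the coup d'état over coffee at the café.
</p>
```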

Setting the package document language

As an EPUB publication is a collection of documents, there are multiple places where the language of the content must be specified. The first spot we will look at is the package document.

The package document is central to an EPUB publication as it contains the metadata about the work, the resources that belong to it, and how to order those resources into a reading order. As you may have guessed already, because the package document contains metadata such as the title and author names, it is important to tell reading systems what language this information is in.

The most common way to do this is to declare a language tag on the package element, as in the following example:

<package … xml:lang="en">

Because the package element is the root element (i.e., it contains all the other elements), the language you specify on this element will apply to all the metadata it contains.

Note

EPUB 2 does not allow a global language declaration using the xml:lang attribute on the package element. You must declare an xml:lang attribute on every metadata tag.
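
A minimal sketch of what this might look like in an EPUB 2 package document follows, assuming an English-language edition of King Lear. The publisher name is a placeholder, and only a few metadata tags that contain natural-language text are shown.

```xml
<package … version="2.0">
   <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
      …
      <!-- No xml:lang on the package element, so each tag declares its own -->
      <dc:title xml:lang="en">King Lear</dc:title>
      <dc:creator xml:lang="en">William Shakespeare</dc:creator>
      <dc:publisher xml:lang="en">Example Press</dc:publisher>
      …
   </metadata>
   …
</package>
```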

With a global language declaration on the package element, you only need to override that declaration if metadata is written in another language. For example, if the book is a translation, you can indicate the language of the author's name by adding a language declaration to their dc:creator tag:

<dc:creator xml:lang="fr">Albert Camus</dc:creator>

One limitation of the package document metadata is that it is not possible to override the language of part of the text within a metadata tag. If you have a title that includes a foreign-language term or phrase, for example, you cannot indicate that the phrase is in a different language. It will have to be read in the language declared for the tag.

Note that it is rarely helpful to use region codes (e.g., adding "-US" for American English) in the package document metadata. Users will typically expect to hear the metadata announced in their preferred regional dialect.

Setting the language of the package document metadata is only the first step in defining the needed language information for reading systems. It is also necessary to specify the language of the publication content in the package document, as is covered in the next section.

Note

EPUB does not currently have a method for adding translations of metadata. Consider the following two titles:

<dc:title>King Lear</dc:title>
<dc:title xml:lang="fr">Le roi Lear</dc:title>

A reading system will treat the second title as a French subtitle (if it recognizes it at all). It is possible, however, to provide metadata in an alternate script using the alternate-script property.
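
As a sketch of the alternate-script approach, the following fragment assumes an author whose name is given in Latin script by default and in Japanese script as an alternative. The id value is arbitrary; the meta tag uses its refines attribute to point at the tag it describes.

```xml
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
   …
   <dc:creator id="creator">Haruki Murakami</dc:creator>
   <!-- The same name in Japanese script, attached to the creator above -->
   <meta refines="#creator" property="alternate-script" xml:lang="ja">村上 春樹</meta>
   …
</metadata>
```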

Setting the publication language

Although the xml:lang attribute specifies the language of the package document metadata, it does not tell reading systems the language of the content of the publication. The language of the metadata and content is often the same, but there are good reasons why a separate method of specifying this is included. For example, the work may be multilingual, or it may be written in a specific regional dialect.

EPUB requires authors to include at least one dc:language tag in the package document metadata to identify the primary language(s) of the content. As with the xml:lang attribute, the value of this element is a language tag:

<dc:language>es</dc:language>

If a publication is written in more than one language (e.g., a new language learning guide), you can repeat the dc:language element for each language (refer to example 3). Do not place all the languages into a single tag. The order in which you list the languages indicates their primacy (i.e., the first dc:language element defines the primary language of the work).
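
The difference between combining values and repeating the element might look like the following sketch, which assumes a publication whose primary language is French and whose secondary language is English:

```xml
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
   …
   <!-- Avoid: several languages combined in a single tag.
   <dc:language>fr, en</dc:language>
   -->

   <!-- Preferred: one tag per language, with the primary language first. -->
   <dc:language>fr</dc:language>
   <dc:language>en</dc:language>
   …
</metadata>
```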

The language information contained in the dc:language tags is only informative, however. Setting this property helps reading systems optimize the rendering of the publication. They might use this information to preload a language-specific dictionary, for example, or to preload a text-to-speech engine so that users do not encounter a delay when they try to voice the content. It is still necessary to set both the language of the package document metadata and the language of each content document in the publication.
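
Putting the two package-level declarations together, a package document might look like the following sketch. It assumes metadata written in English and content written in Canadian French; both choices are purely illustrative.

```xml
<!-- xml:lang on the root: the language of the metadata text -->
<package … xml:lang="en">
   <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
      …
      <!-- dc:language: the primary language of the publication content -->
      <dc:language>fr-CA</dc:language>
      …
   </metadata>
   …
</package>
```

Neither declaration replaces the language declarations in the content documents themselves.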

Setting the content language

Although the language settings in the package document are important, it is even more critical to specify the language of each content document. The information set in the package document does not automatically filter down.

Setting the language of XHTML and SVG content documents, the two primary formats EPUB supports, is no different than setting the language in the package document. The primary language of the documents is set on the respective root element of each document (refer to example 4). You can then indicate that terms and phrases are in another language by wrapping them in any of HTML's or SVG's various tags.

<html … xml:lang="en" lang="en">
   …
   <body>
      …
      <p>
          As the French would say, there is a
          certain "<span xml:lang="fr" lang="fr">je ne
          sais quoi</span>" about the way that …
      </p>
      …
   </body>
</html>
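
The same pattern applies to SVG content documents. The following is a rough sketch, assuming a simple image with two text labels; the coordinates and label text are placeholders.

```xml
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 60"
     xml:lang="en" lang="en">
   <title>Greetings</title>
   <!-- Inherits English from the root element -->
   <text x="10" y="25">Hello</text>
   <!-- Overrides the inherited language for this label only -->
   <text x="10" y="50" xml:lang="fr" lang="fr">Bonjour</text>
</svg>
```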

Note

For more information about setting the language in XHTML documents, refer to the HTML Language topic in the knowledge base.

The lang attribute

Although you are only required to use the xml:lang attribute with XHTML and SVG documents, it is best practice to also add a lang attribute. When doing so, the language tag expressed in the xml:lang and lang attributes must match. For example:

<html … xml:lang="en-US" lang="en-US">

The reason it is recommended to add both attributes is that XHTML and SVG documents in EPUB publications may not always be processed as XML, despite the requirements of the standard. A browser-based reading system might, for example, default to processing all the XHTML documents as regular HTML. In this case, HTML processors ignore the xml:lang attribute as they only recognize the lang attribute. By always adding both attributes, you help ensure that the correct language information is available to users regardless of how the document is processed.

Related Links