Put Encoding Metadata Inside your Documents

Internal Encoding Label

It is considered best practice to embed within your XML and HTML documents an indication of the encoding used to create the documents.

For example, in XML documents encoding information is put in the XML declaration:

      <?xml version="1.0" encoding="UTF-8"?>

In HTML documents encoding information is put in the meta tag:

      <html>
          <head>
              <meta http-equiv="Content-Type" content="text/html; Charset=UTF-8"  /> 

Why? Shouldn't encoding metadata be external to a document? The next two sections explain why it may be preferable to put encoding information inside your documents.

External Encoding Label

XML and HTML documents are exchanged on the Web using HTTP. The HTTP Content-Type header has a charset parameter that may be used to indicate the encoding of its payload (i.e. the XML document or the HTML document), e.g.

      Content-Type: application/xml; Charset=UTF-8

Isn't the (external) HTTP header sufficient to specify a document's encoding? Not always. Here's why:

Suppose you have a big web server with lots of sites and hundreds of pages, contributed by lots of people in lots of different languages. The web server wouldn't know the encoding of each document.

Note: if an HTTP header specifies an encoding then any encoding within the payload document will be ignored.
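
As a small illustration of reading the external label, the charset parameter can be pulled out of a Content-Type value like the one shown above. This is a minimal sketch using Python's standard email package; the function name charset_from_content_type is hypothetical and not part of any specification.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() matches the parameter name case-insensitively
    # and returns the value in lower case.
    return msg.get_content_charset()


print(charset_from_content_type("application/xml; Charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                       # None
```

When the function returns None, there is no external label and the document itself has to supply the encoding.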

Outside the Web

XML and HTML documents reside in many places, not just the Web. They may be stored in repositories or on hard disks, for instance, where they may not be accompanied by an external encoding label.

Thus, it is considered best practice to specify the encoding within the document.

Chicken-and-Egg Problem

Suppose there is no external information to indicate the encoding of a document. Then an intriguing problem arises: in order to read the document you need to know its encoding, but to know its encoding you must read the document!

Stated differently, for an XML parser to know how to interpret the bit strings in a document it must know the encoding, but to know the encoding it must read the document!

We have a chicken-and-egg problem. How is this resolved? That's described next.

Determining a Document's Encoding

Consider the problem of an XML parser trying to determine the encoding of an XML document. Suppose the XML document has this XML declaration:

      <?xml version="1.0" encoding="UTF-8"?>

Note that the characters in an XML declaration are restricted to characters from the ASCII repertoire (however encoded).

One thing an XML parser must do is determine whether the encoding works in units of 1, 2, or 4 bytes. That is, the XML parser must determine whether the encoding is 1-byte-oriented (such as UTF-8), 2-byte-oriented (such as UTF-16), or 4-byte-oriented (such as UTF-32).

With multi-byte encodings there are two approaches to arranging the bytes: store the most significant byte first or store the most significant byte last. The two storage approaches are called big-endian and little-endian. Thus, an XML parser must determine whether the characters in the XML declaration are stored as big-endian or little-endian.
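
The difference between the two byte orders is easy to see by encoding the same two characters both ways. The snippet below is only an illustration; it assumes Python 3.8 or later for the separator argument of bytes.hex().

```python
text = "<?"

# Same characters, same 2-byte encoding, opposite byte order.
big = text.encode("utf-16-be")
little = text.encode("utf-16-le")

print(big.hex(" "))     # 00 3c 00 3f
print(little.hex(" "))  # 3c 00 3f 00

# A 4-byte-oriented encoding of "<" in big-endian order.
print("<".encode("utf-32-be").hex(" "))  # 00 00 00 3c
```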

An XML document may or may not have a Byte-Order Mark (BOM) within its first four bytes. The BOM indicates the ordering of the bytes, i.e. the endianness. So the first thing an XML parser will do is check for the presence of a BOM. If a BOM is present then the endianness is determined. Also, the BOM can be used to determine whether the document's encoding uses 1, 2, or 4 bytes per character. The following table shows how the BOM is analyzed by an XML parser to determine the endianness and byte-size:

Analyzing a BOM for endianness and byte size
Hex Values     Endianness       Encoding Byte Size
00 00 FE FF    big-endian       4-byte
FF FE 00 00    little-endian    4-byte
FE FF ## ##    big-endian       2-byte
FF FE ## ##    little-endian    2-byte
EF BB BF       n/a              1-byte

The notation ## is used to denote any byte value except that two consecutive ##s cannot be both 00.
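
The BOM table can be turned into a lookup fairly directly. The following is a minimal sketch, not how any particular parser is implemented; the function name sniff_bom and the shape of its return value are assumptions made for illustration.

```python
def sniff_bom(data):
    """Return (endianness, unit_size, bom_length), or None if no BOM is found.

    endianness is None for the 1-byte case, where byte order does not apply.
    """
    # The 4-byte patterns are tested first because FF FE 00 00 also
    # starts with the 2-byte pattern FF FE.
    if data[:4] == b"\x00\x00\xfe\xff":
        return ("big-endian", 4, 4)
    if data[:4] == b"\xff\xfe\x00\x00":
        return ("little-endian", 4, 4)
    # The ## ## rule: the two bytes after a 2-byte BOM must not both be 00.
    if data[:2] == b"\xfe\xff" and data[2:4] != b"\x00\x00":
        return ("big-endian", 2, 2)
    if data[:2] == b"\xff\xfe":
        return ("little-endian", 2, 2)
    if data[:3] == b"\xef\xbb\xbf":
        return (None, 1, 3)
    return None


print(sniff_bom(b"\xef\xbb\xbf<a/>"))  # (None, 1, 3)
print(sniff_bom(b"\xff\xfe<\x00"))     # ('little-endian', 2, 2)
```

The third element of the result tells the caller how many bytes to skip before the document text begins.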

If the document does not have a BOM then an XML parser will look for the initial characters of the XML declaration, "<?xm". The next table shows how the endianness and byte-size may be determined by examining the initial byte sequence in the document. Note that 3C indicates "<", 3F indicates "?", 78 indicates "x", and 6D indicates "m".

Analyzing a 4-byte sequence for endianness and byte size
Hex Values     Character(s)    Endianness       Encoding Byte Size
00 00 00 3C    <               big-endian       4-byte
3C 00 00 00    <               little-endian    4-byte
00 3C 00 3F    <?              big-endian       2-byte
3C 00 3F 00    <?              little-endian    2-byte
3C 3F 78 6D    <?xm            n/a              1-byte
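
The same table can be expressed as a small lookup over the first four bytes. This is a sketch under the assumption that the document starts with an XML declaration and has no BOM; the names DECLARATION_PREFIXES and sniff_declaration are hypothetical.

```python
# One entry per table row: first four bytes -> (endianness, unit size).
DECLARATION_PREFIXES = [
    (b"\x00\x00\x00\x3c", ("big-endian", 4)),
    (b"\x3c\x00\x00\x00", ("little-endian", 4)),
    (b"\x00\x3c\x00\x3f", ("big-endian", 2)),
    (b"\x3c\x00\x3f\x00", ("little-endian", 2)),
    (b"\x3c\x3f\x78\x6d", (None, 1)),
]


def sniff_declaration(data):
    """Return (endianness, unit_size) from the first 4 bytes, or None."""
    head = data[:4]
    for prefix, result in DECLARATION_PREFIXES:
        if head == prefix:
            return result
    return None


declaration = '<?xml version="1.0" encoding="UTF-16"?>'
print(sniff_declaration(declaration.encode("utf-16-be")))  # ('big-endian', 2)
print(sniff_declaration(declaration.encode("ascii")))      # (None, 1)
```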

By this point an XML parser will know the endianness and byte-size being used by the document's encoding. So it can parse the XML declaration!

It parses the XML declaration until it arrives at the encoding. With the encoding information the XML parser now knows exactly how to parse the remainder of the document!
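
Once the endianness and byte-size are known, reading the declared encoding is a matter of decoding a short prefix and picking out the encoding value. The sketch below assumes any BOM has already been skipped, so the data starts at "<". The PROVISIONAL mapping, the regular expression, and the 200-character limit are illustrative choices, not taken from the XML Recommendation.

```python
import re

# Hypothetical mapping from (endianness, unit size) to a codec that is
# good enough to read the ASCII-only XML declaration.
PROVISIONAL = {
    (None, 1): "ascii",
    ("big-endian", 2): "utf-16-be",
    ("little-endian", 2): "utf-16-le",
    ("big-endian", 4): "utf-32-be",
    ("little-endian", 4): "utf-32-le",
}

ENCODING_RE = re.compile(
    r'<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data, endianness, unit_size, limit=200):
    """Return the encoding named in the XML declaration, or None."""
    codec = PROVISIONAL[(endianness, unit_size)]
    # Decode only a prefix, cut on a code-unit boundary. errors="replace"
    # keeps non-ASCII bytes after the declaration from raising.
    head = data[: limit * unit_size].decode(codec, errors="replace")
    match = ENCODING_RE.match(head)
    return match.group(1) if match else None


doc = b'<?xml version="1.0" encoding="UTF-8"?><a/>'
print(declared_encoding(doc, None, 1))  # UTF-8
```

A document with no XML declaration, or a declaration without an encoding, yields None, which is the case the next sections deal with.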

We do not discuss here how an HTML parser determines the document's encoding.

Internal Encoding Label is Optional

In an XML document the XML declaration is optional. How does an XML parser determine the encoding when there is no XML declaration? Likewise, the meta tag is optional in HTML, so how is an HTML document's encoding determined?

Algorithm for Detecting the Character Encoding when there is no Internal Encoding Label

First, a parser looks for external information, at the operating-system or transport-protocol level, to determine the encoding.

If the external information is unreliable or unavailable then a parser examines the first 4 bytes of the document to see whether they contain a BOM. If there is a BOM and it indicates a 2-byte per character encoding then the parser defaults to UTF-16. If the BOM indicates a 1-byte per character encoding then the parser defaults to UTF-8. If the BOM indicates a 4-byte per character encoding then the parser defaults to UTF-32.

If a parser can't determine the encoding from external information and there is no BOM then it defaults to UTF-8.
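
The fallback steps above can be sketched as a single function. This is an illustration of the order of the checks only; the name default_encoding and its external argument (standing in for whatever the operating system or transport protocol reported) are assumptions.

```python
def default_encoding(data, external=None):
    """Pick an encoding for a document that has no internal encoding label."""
    # Step 1: trust external information when it is available.
    if external:
        return external
    # Step 2: look for a BOM. 4-byte BOMs are tested before 2-byte BOMs
    # because FF FE 00 00 also starts with FF FE.
    if data[:4] in (b"\x00\x00\xfe\xff", b"\xff\xfe\x00\x00"):
        return "UTF-32"
    if data[:2] in (b"\xfe\xff", b"\xff\xfe"):
        return "UTF-16"
    if data[:3] == b"\xef\xbb\xbf":
        return "UTF-8"
    # Step 3: no external information and no BOM.
    return "UTF-8"


print(default_encoding(b"\xff\xfe<\x00a\x00/\x00>\x00"))           # UTF-16
print(default_encoding(b"<a/>"))                                   # UTF-8
print(default_encoding(b"<a/>", external="ISO-8859-1"))            # ISO-8859-1
```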

Recommendations

  1. XML Documents: always use an XML declaration, with encoding.
  2. HTML Documents: always use the meta tag and place it at the top of the header section, i.e. the first child of <head>.

Further Information

The above discussion simplifies some things. For example, recall the discussion of determining the encoding where the XML document does not contain a BOM. We said that 3C indicates "<", 3F indicates "?", 78 indicates "x", and 6D indicates "m". That is true in the vast majority of encodings. However, there are some exceptions. For example, the EBCDIC encoding uses 4C to indicate "<", 6F to indicate "?", A7 to indicate "x", and 94 to indicate "m". So an XML parser must deal with that situation. Since each implementation is assumed to support only a finite set of character encodings, the problem is tractable.
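
The EBCDIC byte values can be checked by encoding the same four characters with an EBCDIC codec. The snippet uses Python's cp037 codec as one representative EBCDIC code page, which is an assumption: other EBCDIC variants exist and may differ for other characters. The separator argument of bytes.hex() needs Python 3.8 or later.

```python
prefix = "<?xm"

# ASCII-compatible encodings versus one EBCDIC code page (cp037).
print(prefix.encode("ascii").hex(" "))  # 3c 3f 78 6d
print(prefix.encode("cp037").hex(" "))  # 4c 6f a7 94
```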

Further information about how the encoding of a document is determined may be found in the XML 1.0 Recommendation, Appendix F: Autodetection of Character Encodings, http://www.w3.org/TR/REC-xml/#sec-guessing.

This is a good place to get information about encodings in HTML and XHTML documents: Tutorial: Character sets & encodings in XHTML, HTML and CSS http://www.w3.org/International/tutorials/tutorial-char-enc/en/all.html

This is another nice web site for character encoding, from the Web Standards Project: Specifying Character Encoding, http://www.webstandards.org/learn/articles/askw3c/dec2002/

Acknowledgements

The following people contributed to the creation of this document:

Last Updated: September 21, 2007