Viewing the Bytes of a UTF-8 Encoded XML Document

Example

Here is a simple XML document. Almost all of its characters are ASCII; the only non-ASCII character is é, which appears twice in the word Résumé:

   <?xml version="1.0" encoding="UTF-8"?>
   <Resume>
       <Title>My Résumé</Title>
       <Age>39</Age>
   </Resume>

In UTF-8, every ASCII character is encoded as a single byte, and every non-ASCII character is encoded as 2, 3, or 4 bytes. The byte values for each character in the XML document above are shown below. Notice that the é character uses two bytes, whereas every other character uses one byte.

Hex values of the XML document
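
The byte values can also be inspected directly. The following Python sketch is one way to do that; the names xml_text and char_bytes are illustrative, and the exact whitespace and line endings in the string are assumptions rather than part of the original example.

```python
# The XML document from the example, held as a Python string.
# The indentation and "\n" line endings are assumed for illustration.
xml_text = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<Resume>\n'
    '    <Title>My Résumé</Title>\n'
    '    <Age>39</Age>\n'
    '</Resume>\n'
)


def char_bytes(text):
    """Return (character, hex of its UTF-8 bytes) for each character."""
    pairs = []
    for ch in text:
        encoded = ch.encode("utf-8")
        pairs.append((ch, " ".join(f"{b:02X}" for b in encoded)))
    return pairs


# Show the bytes for the title text only, to keep the output short.
for ch, hex_bytes in char_bytes("My Résumé"):
    print(repr(ch), hex_bytes)

# The whole document is two bytes longer than its character count,
# because each of the two é characters takes one extra byte.
print(len(xml_text), len(xml_text.encode("utf-8")))
```

Each é should print as C3 A9, while every other character in the title prints as a single byte.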

Figuring Out the Bytes Needed to Encode the é Character in UTF-8

The Unicode code point of é is: U+00E9

All code points greater than or equal to U+0080 require more than one byte.
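
As a rough illustration of this rule, the sketch below returns the number of bytes UTF-8 uses for a given code point, based on the standard UTF-8 ranges. The function name utf8_length is hypothetical, and the sketch does not treat the surrogate range (U+D800 to U+DFFF) specially.

```python
def utf8_length(code_point):
    """Return how many bytes UTF-8 uses for a Unicode code point."""
    if code_point < 0:
        raise ValueError("code point must not be negative")
    if code_point < 0x80:        # U+0000 to U+007F: ASCII
        return 1
    if code_point < 0x800:       # U+0080 to U+07FF
        return 2
    if code_point < 0x10000:     # U+0800 to U+FFFF
        return 3
    if code_point <= 0x10FFFF:   # U+10000 to U+10FFFF
        return 4
    raise ValueError("beyond the Unicode range")


print(utf8_length(ord("M")))   # 1
print(utf8_length(0xE9))       # 2, the é character
```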

Hex E9 = decimal 233, which is binary: 11101001

Take the low six bits of that binary value (101001) and substitute them for the z's in this: 10zzzzzz. Take the remaining upper two bits (11), pad them on the left with zeros to make five bits (00011), and substitute them for the y's in this: 110yyyyy

Mapping a binary to two bytes in UTF-8

Thus we arrive at these two bytes: 11000011 10101001

These bytes correspond to hex C3 and hex A9.

Thus, é is encoded in UTF-8 as: C3A9
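
The same bit manipulation can be sketched in Python. The function name encode_two_byte is hypothetical, and the sketch covers only code points in the two-byte range (U+0080 to U+07FF); it then compares the result with Python's built-in encoder.

```python
def encode_two_byte(code_point):
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte range")
    low_six = code_point & 0b111111      # the z bits
    upper_five = code_point >> 6         # the y bits, zero-padded
    first = 0b11000000 | upper_five      # 110yyyyy
    second = 0b10000000 | low_six        # 10zzzzzz
    return bytes([first, second])


encoded = encode_two_byte(0xE9)
print(encoded.hex().upper())             # C3A9
print(encoded == "é".encode("utf-8"))    # True
```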

The Unicode code points of the other characters (e.g. M, y, R, s, u, m) are all less than U+0080, and so the UTF-8 encoding of each of those characters uses only one byte.

Online Hex Tool

This online tool allows you to see the hex values for UTF-8 encoding, UTF-16 encoding, and others: http://people.w3.org/rishida/scripts/uniview/conversion.php

Acknowledgements

The following people contributed to the creation of this document:

Last Updated: September 28, 2007