Dangers of Copying Text into XML

Issue

What problems may arise in copying text from a document and pasting it into an XML document?

Example

Consider this XML document:

<?xml version="1.0" encoding="UTF-8"?>
<Document>
      <Para id="...">...</Para>
</Document> 

Suppose that text is copied from a document and pasted into the XML document, either as the content of the <Para> element or as the value of the id attribute. What problems may be introduced?

Problems

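  1. Reserved XML Characters: The text may contain these reserved characters: {<, >, ', ", &}. A literal < or & is a syntax error in both element content and attribute values, and a quote character is a syntax error inside an attribute value that uses the same quote as its delimiter. Such characters need to be escaped with the predefined entities &lt;, &gt;, &apos;, &quot; and &amp;.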
  2. Encoding Mismatch: The editor used to create the text may use a different encoding than the XML document's encoding. Two types of problems may result when there is a mismatch of encodings:
    • A byte sequence that represents a character in one encoding may represent a different character in another encoding. Consequently, if the text was created in an editor that uses a different encoding than the XML document then the characters that result from pasting the text into the XML document may not be the same.
    • A byte sequence that represents a character in one encoding may not represent any character in another encoding. Consequently, if the text was created in an editor that uses a different encoding than the XML document then pasting the text into the XML document may result in introducing illegal characters.
      • Example: Microsoft Word uses Windows-1252 encoding, in which the left curly (a.k.a. smart) double quote is the single byte x93. In UTF-8 the same character is the three-byte sequence xE2 x80 x9C, and a lone byte x93 does not represent any character (it may appear only as a continuation byte inside a multi-byte sequence). If a left curly quote copied from a Word document reaches a UTF-8 XML document as the byte x93, the XML document can no longer be decoded as UTF-8.
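  3. Non-XML Characters: The text might contain characters that XML does not allow anywhere in a document. In XML 1.0, for example, the control characters below x20 are not allowed, except for tab (x9), line feed (xA) and carriage return (xD). Pasting such a character into the XML document produces a document that is not well-formed.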
  4. Entity References: The text might contain entity references that are not defined in the XML document.
  5. Namespace Mismatch: The text may use namespace prefixes that are not declared in the XML document, it may redeclare a namespace that is already declared in the XML document, or it may contain markup that is not allowed in the XML document's namespace.
  6. Partial Markup: The text may contain an incomplete unit of markup, such as the start symbols of a CDATA section but not its end symbols, a start tag but not its end tag, or the first delimiter of an attribute value but not its end delimiter.
  7. Uniqueness Collision: If the text is pasted into an attribute that is of datatype ID then the attribute's value may collide with an ID value elsewhere in the document. Similarly, if the text is pasted into an element that has a uniqueness constraint then the element's value may no longer be unique.
  8. Invalid Datatype: The text may not be of the appropriate datatype for the element or attribute.
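
Problem 1 can be illustrated with a minimal Python sketch. It assumes the pasted text arrives as an ordinary string and uses the standard library's xml.sax.saxutils helpers. The function name build_para and the sample values are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_para(pasted_text: str, pasted_id: str) -> str:
    """Return a <Para> element with the pasted text and id escaped.

    build_para is a hypothetical helper, not part of any library.
    """
    # escape() replaces &, < and > so they are safe in element content.
    content = escape(pasted_text)
    # quoteattr() escapes the value and wraps it in suitable quotes.
    id_attr = quoteattr(pasted_id)
    return f"<Para id={id_attr}>{content}</Para>"


# Sample pasted text that contains all five reserved characters.
xml_text = build_para('if a < b && c > "d"', "it's-1")

# The escaped result parses, and the original text is recovered intact.
para = ET.fromstring(xml_text)
```

Pasting the same sample text unescaped between the <Para> tags would leave a bare < and & in the content, which a parser rejects.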

Minimizing Problems

Some of the above problems may go unnoticed when you copy and paste using a plain-text editor that doesn't understand XML or character encodings (e.g. Notepad, vi). In particular, the problems go unnoticed during editing, only to surface later when the document is parsed by an XML parser.
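
The following Python sketch shows how an encoding mismatch can sit silently in a file and surface only at parse time. It assumes the left curly quote was written into the file as its Windows-1252 byte, and it uses the standard library's xml.etree.ElementTree parser. The helper name parses and the id value p1 are hypothetical.

```python
import xml.etree.ElementTree as ET

left_quote = "\u201c"  # left curly double quote

# The same character encoded two ways.
cp1252_bytes = left_quote.encode("cp1252")  # single byte 0x93
utf8_bytes = left_quote.encode("utf-8")     # bytes 0xE2 0x80 0x9C

header = b'<?xml version="1.0" encoding="UTF-8"?><Document><Para id="p1">'
footer = b"</Para></Document>"


def parses(document: bytes) -> bool:
    """Return True if the bytes parse as XML (hypothetical helper)."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# The document declares UTF-8, so only the UTF-8 bytes are acceptable.
good = parses(header + utf8_bytes + footer)
bad = parses(header + cp1252_bytes + footer)
```

A plain-text editor would save either byte sequence without complaint. Only the parser notices that the second document does not match its declared encoding.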

The dangers are significantly reduced when you use an editor that is aware of both character encodings and XML. Such an editor can detect the problems earlier, during editing rather than during XML parsing.
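
As an illustration of early detection, here is a minimal Python sketch of one check that an XML-aware tool might run on text before accepting a paste: scanning for characters that XML 1.0 does not allow. It is not taken from any particular editor. The function name find_non_xml_chars is hypothetical, and the character ranges assume the XML 1.0 definition of a legal character.

```python
import re

# Anything outside the ranges XML 1.0 allows: tab, LF, CR,
# x20-xD7FF, xE000-xFFFD and x10000-x10FFFF.
_NON_XML_CHAR = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_non_xml_chars(text: str) -> list[tuple[int, str]]:
    """Return (offset, code point) pairs for characters XML 1.0 forbids.

    find_non_xml_chars is a hypothetical helper, not a library function.
    """
    return [
        (match.start(), f"U+{ord(match.group()):04X}")
        for match in _NON_XML_CHAR.finditer(text)
    ]


# A vertical tab and a NUL hidden in otherwise ordinary text.
problems = find_non_xml_chars("ok\x0btext\x00")

# Curly quotes are legal XML characters, so nothing is reported here.
clean = find_non_xml_chars("plain \u201cquoted\u201d text\n")
```

A check like this covers only Problem 3. Reserved characters, undefined entity references and ID collisions each need their own checks.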

Acknowledgements

The following people contributed to the creation of this document:

Last Updated: September 12, 2012