Experiment on the Indexability of XML versus XHTML by Searchbots

Problem Statement

How well do web search engines index XML documents versus XHTML documents?

In particular, how well do web search engines index documents marked up using non-standard, proprietary tags versus documents marked up using the standard XHTML tag-set?

Experiment

I created an XHTML document and an XML document containing the same data. The XHTML document was marked up using the XHTML tags. The XML document mimicked the XHTML document but used tags that I created. The XHTML document was validated using the W3C Markup Validation Service. The XML document was checked for well-formedness by opening it in Internet Explorer, version 7.
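Opening the file in a browser is one way to check well-formedness; the same check can also be scripted. The following is a minimal sketch, not the procedure used in this experiment, assuming Python's standard xml.etree.ElementTree parser. The helper name is_well_formed and the two sample strings are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A small document using invented tags, in the spirit of the XML version.
good = "<Word_Definition><Header>Definition</Header></Word_Definition>"
# The same document with a mismatched closing tag (hypothetical error case).
bad = "<Word_Definition><Header>Definition</Word_Definition>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

This tests well-formedness only. It does not validate a document against a DTD or schema, so it is not a substitute for the W3C validation applied to the XHTML document.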

The documents were very simple, containing a definition of graceful degradation. The title of each document was "Definition of Graceful Degradation".

January 29, 2008

I posted the XHTML and XML versions on my public website (xfront.com) and linked my homepage to each version. I made the links empty so that no human visitors would see them (and potentially influence the results); thus, only searchbots would find them.

February 10, 2008

Using seven different online web search tools, I searched for "Definition of Graceful Degradation" and recorded how the XHTML and XML versions ranked in the results.

XHTML Version

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
	"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

<head>
    <meta http-equiv="content-type" content="application/xhtml+xml;charset=UTF-8" />
    <meta http-equiv="content-language" content="en" />
    <meta name="author" content="Roger L. Costello" />
    <meta name="description" content="Definition of Graceful Degradation" />
    <meta name="keywords" content="Definition, Graceful Degradation" />
    <title>Definition of Graceful Degradation</title>
</head>
<body>

    <h1>Definition of Graceful Degradation</h1>

    <p><dfn>Graceful Degradation</dfn> is the ability to continue working, albeit with reduced functionality, 
       when some expected capability is absent.</p>
    
</body>
</html>

This XHTML was placed in a file, with the filename: GracefulDegradation.html

XML Version

<?xml version="1.0" encoding="UTF-8"?>
<Graceful_Degradation>

    <Metadata>
        <Content_Language>en</Content_Language>
        <Author>Roger L. Costello</Author>
        <Description>Definition of Graceful Degradation</Description>
        <Keywords>Definition, Graceful Degradation</Keywords>
        <Title>Definition of Graceful Degradation</Title>
    </Metadata>
    <Word_Definition>

        <Header>Definition of Graceful Degradation</Header>

        <Definition>Graceful Degradation is the ability to continue working, albeit with reduced functionality, 
                    when some expected capability is absent.</Definition>
    
    </Word_Definition>
</Graceful_Degradation>

This XML was placed in a file, with the filename: GracefulDegradation.xml
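The claim that both versions carry the same data can be checked mechanically. The sketch below is one possible approach, assuming Python's standard xml.etree.ElementTree parser; the helper names normalize, xhtml_values, and xml_values are hypothetical, and the two embedded documents are abbreviated copies of the versions above rather than the full files. It collects the values in each document and reports any value in the XML version that does not appear in the XHTML version.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def normalize(text):
    """Collapse runs of whitespace so line breaks do not affect comparison."""
    return " ".join(text.split())


def xhtml_values(xhtml_text):
    """Collect meta content values and element text from an XHTML document."""
    root = ET.fromstring(xhtml_text)
    values = set()
    for meta in root.iter(XHTML_NS + "meta"):
        values.add(normalize(meta.get("content", "")))
    for tag in ("title", "h1", "p"):
        for elem in root.iter(XHTML_NS + tag):
            # itertext() also picks up text inside child elements such as <dfn>.
            values.add(normalize("".join(elem.itertext())))
    return values


def xml_values(xml_text):
    """Collect the text of every leaf element in the custom XML document."""
    root = ET.fromstring(xml_text)
    return {normalize(e.text or "") for e in root.iter() if len(e) == 0}


# Abbreviated copies of the two versions (some metadata omitted for brevity).
xhtml_doc = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta name="author" content="Roger L. Costello" />
    <title>Definition of Graceful Degradation</title>
</head>
<body>
    <h1>Definition of Graceful Degradation</h1>
    <p><dfn>Graceful Degradation</dfn> is the ability to continue working, albeit with reduced functionality,
       when some expected capability is absent.</p>
</body>
</html>"""

xml_doc = """<Graceful_Degradation>
    <Metadata>
        <Author>Roger L. Costello</Author>
        <Title>Definition of Graceful Degradation</Title>
    </Metadata>
    <Word_Definition>
        <Header>Definition of Graceful Degradation</Header>
        <Definition>Graceful Degradation is the ability to continue working, albeit with reduced functionality,
                    when some expected capability is absent.</Definition>
    </Word_Definition>
</Graceful_Degradation>"""

# Values present in the XML version but absent from the XHTML version.
missing = xml_values(xml_doc) - xhtml_values(xhtml_doc)
print(missing)
```

An empty set means every value in the XML version also appears in the XHTML version. The comparison is deliberately one-directional, because the XHTML version carries an extra content-type declaration that has no counterpart in the XML version.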

Results

Here are the results of searching for "Definition of Graceful Degradation" with the seven search tools:

Two of the seven search engines did not index the XML document at all. All of the search engines that did index the XML document ranked it lower than the XHTML version in the search results.

Conclusions

To maximize the ability of search engines to find your data, use a standardized tag-set such as the XHTML tag-set, rather than a non-standard, proprietary tag-set.