PHP: XPath on HTML and XHTML

A discussion of how to load HTML files properly and run XPath queries on the resulting DOMDocument objects.

Loading HTML content

The correct way to load an HTML document is to fetch it via HTTP with an Accept header that lists all the MIME types your application understands:

Accept: application/xhtml+xml; q=1, application/xml; q=0.9,
        text/xml; q=0.9, text/html; q=0.5

Before loading the contents into a DOMDocument, check the response's Content-Type header and call loadXML or loadHTML depending on its value.
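
The dispatch described above could look roughly like this. It is a minimal sketch, assuming the response body and the Content-Type header value have already been fetched; loadMarkup is a hypothetical helper name, and the list of XML MIME types simply mirrors the Accept header shown earlier.

```php
<?php
/**
 * Parse a response body with the parser that matches its Content-Type.
 * Hypothetical helper, not part of PHP's DOM extension.
 */
function loadMarkup(string $body, string $contentType): DOMDocument
{
    // Drop parameters such as "; charset=utf-8" and normalise the case.
    $mime = strtolower(trim(explode(';', $contentType)[0]));

    // Assumption: these are the XML types the application asked for.
    $xmlTypes = ['application/xhtml+xml', 'application/xml', 'text/xml'];

    $doc = new DOMDocument();
    if (in_array($mime, $xmlTypes, true)) {
        $doc->loadXML($body);   // strict, namespace-aware XML parser
    } else {
        $doc->loadHTML($body);  // lenient HTML parser, no namespaces
    }
    return $doc;
}

// Inline sample documents stand in for real HTTP responses.
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml">'
       . '<head><title>T</title></head><body/></html>';
$html  = '<html><head><title>T</title></head><body></body></html>';

$fromXml  = loadMarkup($xhtml, 'application/xhtml+xml; charset=utf-8');
$fromHtml = loadMarkup($html, 'text/html');

var_dump($fromXml->documentElement->namespaceURI);
var_dump($fromHtml->documentElement->namespaceURI);
```

Only the XML-loaded document should report the XHTML namespace on its root element, which is why the XPath expressions for the two modes differ.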

HTML

Extracting data from HTML documents with PHP is easy: use DOMDocument's loadHTML method (or loadHTMLFile to read from a file or URL), then use DOMXPath to access the title or some other element:

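<?php
// Parse the page with the lenient HTML parser; elements get no namespace.
$doc = new DOMDocument();
$doc->loadHTMLFile('http://cweiske.de/');

// string() yields the text content of the first matching node.
$xpath = new DOMXPath($doc);
$title = $xpath->evaluate('string(/html/head/title)');

echo "Document title is: " . $title . "\n";
?>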
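
The same approach works on markup that is already in a string. The following is a small sketch, assuming a made-up sample page; it uses loadHTML instead of loadHTMLFile and switches libxml to internal error handling so that sloppy markup does not produce PHP warnings.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><head><title>Example page</title></head>'
      . '<body><p>First</p><p>Second</p></body></html>';

// Collect parser messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Discard collected messages and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

// evaluate() with string() returns a plain string.
$title = $xpath->evaluate('string(/html/head/title)');

// query() returns a DOMNodeList that can be iterated.
$paragraphs = [];
foreach ($xpath->query('/html/body/p') as $p) {
    $paragraphs[] = $p->textContent;
}

echo $title, "\n", implode(', ', $paragraphs), "\n";
```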

XHTML

To properly use XHTML with all of its features (e.g. CDATA sections), you need to use DOMDocument's loadXML method (or load to read from a file or URL). loadHTML does not handle these features properly and emits at least warnings.

With XML, the document's namespace is respected and all nodes are in their proper namespace. The XPath expression has to account for this:

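<?php
// Parse the page as XML; elements keep the XHTML namespace.
$doc = new DOMDocument();
$doc->load('http://cweiske.de/');

// Bind the prefix "h" to the XHTML namespace for use in the expression.
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml');
// Third argument false: do not auto-register the context node's prefixes.
$title = $xpath->evaluate('string(/h:html/h:head/h:title)', null, false);

echo "Document title is: " . $title . "\n";
?>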

By default, DOMXPath::evaluate and DOMXPath::query both automatically register the namespace prefixes that are in scope on the context node, which may override the prefix you registered yourself. It is therefore important to pass false as the third parameter (registerNodeNS).
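
The effect can be reproduced with a contrived document. This is a sketch under assumptions: the root element and the urn:example:other namespace are invented for illustration, chosen so that the context node binds the prefix h to something other than XHTML.

```php
<?php
// Hypothetical document: <root> binds "h" to a non-XHTML namespace.
$xml = '<root xmlns:h="urn:example:other">'
     . '<html xmlns="http://www.w3.org/1999/xhtml">'
     . '<head><title>T</title></head></html>'
     . '</root>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml');

$root  = $doc->documentElement;
$query = 'string(h:html/h:head/h:title)';

// Default (registerNodeNS = true): the prefixes in scope on $root are
// registered too, so "h" resolves to urn:example:other and nothing matches.
$withNodeNs = $xpath->evaluate($query, $root);

// registerNodeNS = false: only our own registration applies.
$withoutNodeNs = $xpath->evaluate($query, $root, false);

var_dump($withNodeNs, $withoutNodeNs);
```

With the default setting the lookup silently returns an empty string instead of failing, which makes this kind of clash hard to spot.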

XPath for XML and HTML

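The problem now is that an XPath expression only works for one of the loading modes: the non-namespaced /html/head/title matches only documents loaded with loadHTML, while the prefixed /h:html/h:head/h:title matches only documents loaded with loadXML.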

In 2006, someone asked the PHP core developers about exactly this problem and requested that loadHTML should automatically put all tags within the HTML namespace, http://www.w3.org/1999/xhtml.

They declined and told him to use tidy to convert broken HTML into XML and then load XML exclusively. Unfortunately, this means that we are stuck in XPath hell for the foreseeable future.

At least there is a way to write XPath that works on both namespaced and non-namespaced HTML:

$query = '/*[self::html or self::h:html]'
    . '/*[self::head or self::h:head]'
    . '/*[self::title or self::h:title]';

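It's pretty ugly and makes the XPath quite unreadable, but it's the only way without resorting to external tools to XMLify HTML. Note that the h prefix has to be registered with registerNamespace in both cases, because the expression refers to it.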
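
To see the combined expression at work, here is a minimal sketch that runs it against one document per loading mode. documentTitle is a hypothetical helper name and both sample documents are made up.

```php
<?php
// Hypothetical helper: read the title no matter how the document was loaded.
function documentTitle(DOMDocument $doc): string
{
    $xpath = new DOMXPath($doc);
    // The prefix is registered even for non-namespaced documents,
    // because the expression mentions it.
    $xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml');

    $query = '/*[self::html or self::h:html]'
        . '/*[self::head or self::h:head]'
        . '/*[self::title or self::h:title]';

    return $xpath->evaluate('string(' . $query . ')', null, false);
}

// Loaded with the HTML parser: elements have no namespace.
$htmlDoc = new DOMDocument();
$htmlDoc->loadHTML(
    '<html><head><title>Plain HTML</title></head><body></body></html>'
);

// Loaded with the XML parser: elements are in the XHTML namespace.
$xhtmlDoc = new DOMDocument();
$xhtmlDoc->loadXML(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    . '<head><title>XHTML</title></head><body/></html>'
);

$htmlTitle  = documentTitle($htmlDoc);
$xhtmlTitle = documentTitle($xhtmlDoc);

echo $htmlTitle, "\n", $xhtmlTitle, "\n";
```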

XML distinguishes between tags in different casing variants, while HTML does not: <body> is not <BODY> in XML.
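
A short sketch of the casing difference, assuming a made-up upper-case document that happens to be well-formed XML. It relies on the libxml2 HTML parser behind loadHTML normalising element names to lower case, while the XML parser keeps them as written.

```php
<?php
// Hypothetical upper-case markup that both parsers accept.
$markup = '<HTML><HEAD><TITLE>Upper case</TITLE></HEAD><BODY></BODY></HTML>';

// XML parser: element names keep their original case.
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($markup);
$xmlXpath = new DOMXPath($xmlDoc);
$xmlLower = $xmlXpath->query('/html/head/title')->length;
$xmlUpper = $xmlXpath->query('/HTML/HEAD/TITLE')->length;

// HTML parser: element names are normalised to lower case.
$htmlDoc = new DOMDocument();
$htmlDoc->loadHTML($markup);
$htmlXpath = new DOMXPath($htmlDoc);
$htmlLower = $htmlXpath->query('/html/head/title')->length;

var_dump($xmlLower, $xmlUpper, $htmlLower);
```

A lower-case expression therefore misses upper-case elements in a document loaded as XML, even when namespaces are not involved.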

You might be inclined to use XPath's local-name() function instead. This only works as long as the document contains no other namespace with the same tag names, so it is better avoided.
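
The pitfall is easy to trigger with embedded SVG, which has a title element of its own. The following sketch uses an invented sample document to compare a local-name() match with a namespace-aware one.

```php
<?php
// Hypothetical XHTML page with an inline SVG that has its own <title>.
$xml = '<html xmlns="http://www.w3.org/1999/xhtml">'
     . '<head><title>Page</title></head>'
     . '<body><svg xmlns="http://www.w3.org/2000/svg">'
     . '<title>Icon</title></svg></body></html>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml');

// local-name() ignores the namespace and matches both title elements.
$byLocalName = $xpath->query('//*[local-name() = "title"]', null, false);

// The prefixed name test matches the XHTML title only.
$byNamespace = $xpath->query('//h:title', null, false);

var_dump($byLocalName->length, $byNamespace->length);
```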

Written by Christian Weiske.
