All my blog posts are XHTML, because I can load and manipulate them with an XML parser. I do that with scripts when adding IDs for better referencing, and when compiling the blog posts by adding navigation, header and footer.
The pages have no XML declaration because the W3C validator complains about it:
Saw <?. Probable cause: Attempt to use an XML processing instruction in HTML. (XML processing instructions are not supported in HTML.)
But when I load such an XHTML page with PHP's SimpleXML library and serialize it back to XML to save it, non-ASCII characters get encoded as entities:
<?php
$xml = '<html><head><meta charset="utf-8"/><title>ÄÖÜ</title></head></html>';
$sx = simplexml_load_string($xml);
echo $sx->asXML() . "\n";
?>
This script outputs numeric character references instead of the original characters:
<?xml version="1.0"?>
<html><head><meta charset="utf-8"/><title>&#xC4;&#xD6;&#xDC;</title></head></html>
I found the solution to that problem in a Stack Overflow answer: you have to declare the encoding manually, even though the XML standard says that UTF-8 is the default when no declaration is given.
dom_import_simplexml($sx)->ownerDocument->encoding = 'UTF-8';
Now the generated XML has proper un-encoded characters:
<html><head><meta charset="utf-8"/><title>ÄÖÜ</title></head></html>
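Put together, the fix could be wrapped in a small helper like the sketch below. The function name `toUtf8Xml` and its `$withDeclaration` flag are made up for illustration and are not part of SimpleXML. The sketch also assumes that serializing only the root element through `DOMDocument::saveXML()` is an acceptable way to leave out the XML declaration that the validator complains about.

```php
<?php
/**
 * Hypothetical helper (not from SimpleXML itself): serialize a
 * SimpleXMLElement as UTF-8 without turning non-ASCII characters
 * into numeric character references.
 */
function toUtf8Xml(SimpleXMLElement $sx, bool $withDeclaration = true): string
{
    // SimpleXML and DOM share the same underlying document,
    // so the encoding set here also affects $sx->asXML().
    $dom = dom_import_simplexml($sx)->ownerDocument;
    $dom->encoding = 'UTF-8';

    // Passing a node to saveXML() serializes only that node,
    // which leaves out the XML declaration.
    $out = $withDeclaration
        ? $sx->asXML()
        : $dom->saveXML($dom->documentElement);

    if ($out === false) {
        throw new RuntimeException('Could not serialize XML');
    }
    return $out;
}

$xml = '<html><head><meta charset="utf-8"/><title>ÄÖÜ</title></head></html>';
$sx = simplexml_load_string($xml);

// Assumed use case: write the page back without an XML declaration.
echo toUtf8Xml($sx, false) . "\n";
```

Calling `toUtf8Xml($sx)` without the second argument keeps the declaration, which may be preferable when the result is fed to another XML tool instead of being published as a page.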