Christians Tagebuch: xml

The latest posts in full-text for feed readers.


XML XSD: This element is not expected

When testing a patch for Gerbera I wanted to validate my configuration file, saw that the config2.xsd file was broken and decided to fix it.

The Gerbera configuration XML has two <container> tags with different meanings, attributes and children. The XML schema definition had two top-level <element name="container"> entries, which xmllint criticised:

$ xmllint --noout --schema test.xsd data.xml
test.xsd:17: element element: Schemas parser error : Element '{http://www.w3.org/2001/XMLSchema}element': A global element declaration '{http://example.org/cw}container' does already exist.
WXS schema test.xsd failed to compile

This happened because the Gerbera config XSD does not use nested declarations but a long list of top-level elements, just like we know it from DTDs.

A solution to the problem is not to declare the second "container" tag but to only define its type, and reference that:

data.xml

<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://example.org/cw">
    <resources>
        <container/>
        <order/>
        <!-- element and root names reconstructed from the error messages -->
    </resources>
</config>
test.xsd

<?xml version="1.0" encoding="UTF-8"?>
<!-- reconstructed schema: element names follow the article, content models are illustrative -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.org/cw"
            xmlns="http://example.org/cw">

    <xsd:element name="config">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element ref="resources"/>
            </xsd:sequence>
        </xsd:complexType>
    </xsd:element>

    <xsd:element name="resources">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element name="container" type="containerType"/>
                <xsd:element ref="order"/>
            </xsd:sequence>
        </xsd:complexType>
    </xsd:element>

    <xsd:complexType name="containerType">
        <xsd:sequence>
            <xsd:any minOccurs="0" processContents="lax"/>
        </xsd:sequence>
    </xsd:complexType>

    <xsd:element name="container" type="xsd:string"/>

    <xsd:element name="order" type="xsd:string"/>
</xsd:schema>

Now xmllint gave a strange error:

$ xmllint --noout --schema test.xsd data.xml
data.xml:4: element container: Schemas validity error : Element '{http://example.org/cw}container': This element is not expected. Expected is ( container ).
data.xml fails to validate

It gets a namespaced <container> tag but wants a non-namespaced container! This problem does not happen with the <order> tag, which is referenced from the <resources> type with ref="" - it only happens when the nested xsd:element uses a name="" attribute.

I did not find any solutions when searching for "xsd" "This element is not expected. Expected is one of" element "ref" "name" on DuckDuckGo and Google, and had already started to write a question on Stack Overflow when its "Similar Questions" list showed Sub-elements and namespaces in XSD. One tiny but important comment contained the solution:

Add elementFormDefault="qualified" to your schema element (<xsd:schema ...>) and you should be good to go.

mechanical_meat, Jun 22, 2009 at 21:22

And indeed this was it:

new test.xsd

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.org/cw"
            xmlns="http://example.org/cw"
            elementFormDefault="qualified">
...
$ xmllint --noout --schema test.xsd data.xml
data.xml validates

elementFormDefault?

The elementFormDefault attribute is documented in XSD primer: 3.2 Qualified Locals and "documented" in XSD Structures: 3.15.2 XML Representations of Schemas, but I would never have guessed that it is related to my problem - which proves what is written in the criticism section of the XML Schema Wikipedia page:

It is too complicated (the spec is several hundred pages in a very technical language), so it is hard to use by non-experts — but many non-experts need schemas to describe data formats. The W3C Recommendation itself is extremely difficult to read.

Published on 2024-05-30


PHP: Saving XHTML creates entity references

All my blog posts are XHTML, because I can load and manipulate them with an XML parser. I do that with scripts when adding IDs for better referencing, and when compiling the blog posts by adding navigation, header and footer.

The pages have no XML declaration because the W3C validator complains that

Saw <?. Probable cause: Attempt to use an XML processing instruction in HTML. (XML processing instructions are not supported in HTML.)

But when loading such an XHTML page with PHP's SimpleXML library and generating the XML to save it, entities get encoded:

<?php
// the XHTML string is reconstructed; note: no XML declaration
$xml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>ÄÖÜ</p></body></html>';
$sx = simplexml_load_string($xml);
echo $sx->asXML() . "\n";
?>

This script generates encoded entities:


<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"><body><p>&#xC4;&#xD6;&#xDC;</p></body></html>

I found the solution for that problem in a Stack Overflow answer: you have to manually declare the encoding - despite the standard saying that UTF-8 is the default when no declaration is given.

dom_import_simplexml($sx)->ownerDocument->encoding = 'UTF-8';

Now the generated XML has proper un-encoded characters:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><body><p>ÄÖÜ</p></body></html>
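Putting it all together - a minimal sketch of the full round-trip with the fix applied:

<?php
$xml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>ÄÖÜ</p></body></html>';
$sx = simplexml_load_string($xml);
// declare the encoding on the underlying DOM document before serializing
dom_import_simplexml($sx)->ownerDocument->encoding = 'UTF-8';
echo $sx->asXML() . "\n";
?>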

Published on 2021-03-07


Validating an Atom feed locally

The Atom feed format was invented in 2005. I prefer Atom over the four mutually incompatible RSS formats because it is properly standardized.

After making an Atom feed, it is important to validate it to see if it's correct and every feed reader is able to understand it.

Online services

There are two web services to validate feeds: the W3C Feed Validation Service (https://validator.w3.org/feed/) and feedvalidator.org.

Offline validation

At the time of writing, feedvalidator.org was broken and could not be used. Also, during development the feed is most often not available at a publicly accessible URL, so validation by URL does not work - and copy&pasting is cumbersome. Validating the Atom feed on your own machine, without needing the network, is preferable.

Atom feeds have to be validated on two levels:

  1. XML well-formedness
  2. Schema validity

Well-formedness

To check if your feed complies with the XML rules, simply check if it is well-formed:

$ xmllint --noout /path/to/feed.atom

If you get no output all is fine and the feed is valid XML (e.g. its tags are properly nested).

Schema validity

Apart from following the XML rules, Atom feeds also have to adhere to the rules that RFC 4287 defines. The RFC even contains a machine-readable Atom feed schema in appendix B: RELAX NG Compact Schema.

Unfortunately xmllint is not able to work with RELAX NG compact files, but trang can be used to convert .rnc to "normal" .rng files:

$ trang -I rnc -O rng atom.rnc atom.rng

Now we can use the atom.rng schema file to validate our feed:

$ xmllint --noout --relaxng atom.rng http://cweiske.de/tagebuch/feed/
http://cweiske.de/tagebuch/feed/ validates

XML schema

At the time of writing in 2017, I do not know of a single working XML Schema file for the Atom feed specification.

www.kbcafe.com/rss/atom.xsd.xml does not even detect a missing <id> tag and thus cannot be trusted.

The OASIS CMIS atom feed schema is broken; xmllint reports an error when I try to use it:

complex type 'atomPersonConstruct': The content model is not determinist.

Simply use the atom.rng file linked above instead.

Published on 2017-10-20


XHTML breakages

The HTML pages on my blog are served with the MIME content type application/xhtml+xml. This forces browsers to use an XML parser instead of a lenient HTML parser, and they will bail out with an error message if the XML is not well-formed.
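If a page is generated by PHP, opting into this strict mode is a single header call - a minimal sketch (my pages are static files, so in my case the web server configuration sets the type):

<?php
// serve as XHTML so browsers use the strict XML parser
header('Content-Type: application/xhtml+xml; charset=utf-8');
?>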

Yesterday someone complained by e-mail that he could not read my blog because Firefox showed an XML parsing error. In addition, the archive.org version of my blog also showed only an XML parsing error.

Internet Archive

The internet archive version is broken because their software injects an additional navigation header into the content, which is not well-formed at all:

Example: Goodbye, CAcert.org @2017-06-06.

[Screenshots: Chromium 60 and Firefox 55 each displaying an XML error for my blog's page on archive.org]

I opened a bug report: internetarchive/wayback #156 xhtml pages broken

Firefox

But my contact person also complained that his browser showed an XML parsing error:

XML Parsing Error: not well-formed
Location: http://cweiske.de/tagebuch/
Line Number 42, Column 328:
function cleanCSS2277284469133491(d) { if (typeof d != 'string') return d; var fc = fontCache2277284469133491; var p = /font(\-family)?([\s]*:[\s]*)(((["'][\w\d\s\.\,\-@]*["'])|([\w\d\s\.\,\-@]))+)/gi; function r(m, pa, p0, p1, o, s) { var p1o = p1; p1 = p1.replace(/(^\s+)|(\s+$)/gi, '').replace(/\s+/gi, ' '); if (p1.length < 2) { p1o = ''; } else if (fc.indexOf(p1) == -1) { if (fc.length < fontCacheMax2277284469133491) { fc.push(p1); } else { p1o = fc[0]; } } return 'font' + pa + p0 + p1o; } fontCache2277284469133491 = fc; return d.replace(p, r); } 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------^

It turned out that he had the firegloves extension installed, which also injects non-well-formed HTML tags: #2: Breaks XHTML pages delivered as application/xhtml+xml.

Why XHTML?

My blog is static hand-written HTML, and I have a couple of scripts that help me write articles: image gallery creator, TOC creator, ID attribute adder and so on. Using an XML parser for those tools is so much easier than using an HTML5-compliant parser.
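The ID attribute adder, for example, boils down to a handful of DOM lines - a rough sketch, not my actual script (file name and slug rules made up):

<?php
// add id attributes to all <h2> headings of an XHTML file
$doc = new DOMDocument();
$doc->load('raw/post.htm');
foreach ($doc->getElementsByTagName('h2') as $heading) {
    if (!$heading->hasAttribute('id')) {
        $slug = trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($heading->textContent)), '-');
        $heading->setAttribute('id', $slug);
    }
}
$doc->save('raw/post.htm');
?>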

Moving from my old lightbox gallery script to Photoswipe was only possible because I could automatically transform the XHTML code with XML command line tools.

Published on 2017-09-19


The itemscope attribute in XHTML

At work we're using xmllint to syntax check TYPO3 Fluid template files. Sometimes microdata attributes like itemscope are used which don't have a value - and xmllint bails out because <div itemscope> is not well-formed:

$ xmllint --noout file.html
file.html:26: parser error : Specification mandate value for attribute itemscope

What now?

Specifications!

The microdata specification's itemscope section says:

The itemscope attribute is a boolean attribute.

A boolean attribute may actually have values:

If the attribute is present, its value must either be the empty string or a value that is an ASCII case-insensitive match for the attribute's canonical name, with no leading or trailing whitespace.

Solution

So both the following variants are correct:

<div itemscope="">...</div>
<div itemscope="itemscope">...</div>
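With either variant in the file, xmllint has nothing to complain about anymore:

$ xmllint --noout file.html
$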

Published on 2017-01-18


TYPO3: Well-formed fluid templates

The Fluid template engine - developed for the Flow3 project - is used more and more in TYPO3's core and extensions.

Fluid templates look like XML. All functionality is implemented as custom XML tags or tag attributes - very unlike e.g. Smarty or Twig, which invented terse template markup languages that are easy to write.

Basic fluid functionality is wrapped in tags that are prefixed with f:, like <f:if> or <f:comment>.

Unfortunately, fluid's inventors did not follow the way of XML to the end: Most fluid templates are not even well-formed.

Fluid-XML

This is a typical "partial" template:

<!-- reconstructed partial; the markup is illustrative -->
<h2>{headline}</h2>

<f:format.html>{text}</f:format.html>

Several problems make the file non-well-formed:

  1. XML declaration missing
  2. XML requires a single root tag, but the template contains multiple
  3. Namespace prefix f is not defined

Let's fix the issues:


<?xml version="1.0" encoding="UTF-8"?>
<html xmlns:f="http://typo3.org/ns/TYPO3/CMS/Fluid/ViewHelpers">

<h2>{headline}</h2>

<f:format.html>{text}</f:format.html>

</html>

The XML is well-formed now. Unfortunately, the rendered template is broken:


<?xml version="1.0" encoding="UTF-8"?>
<html xmlns:f="http://typo3.org/ns/TYPO3/CMS/Fluid/ViewHelpers">

<h2>Headline 1</h2>

Text 1

</html>
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns:f="http://typo3.org/ns/TYPO3/CMS/Fluid/ViewHelpers">

<h2>Headline 2</h2>

Text 2

</html>
...

The fluid template engine renders both the XML declaration and the additional root tag, although we neither want nor need them. Depending on the context of our partial, we might even get invalid HTML.

Attempt to fix

To get rid of the root tag in our output, we could try to use an <f:if> tag with an always-true condition:

<?xml version="1.0" encoding="UTF-8"?>
<f:if xmlns:f="http://typo3.org/ns/TYPO3/CMS/Fluid/ViewHelpers" condition="1">

<h2>{headline}</h2>

<f:format.html>{text}</f:format.html>

</f:if>

Let's render it and ... congratulations, you just ran into bug #56481:

#1237823695: Argument "xmlns:f" was not registered

But even if my patch gets merged some day, the XML declaration will still get rendered.

Fixing broken output

Fluid's <f:render> tag supports a section attribute. It simply says that a certain section within the given template shall be rendered instead of the whole file:

<f:render partial="Box" section="content"/>

Now we just have to wrap our partial's HTML code with a section tag:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns:f="http://typo3.org/ns/TYPO3/CMS/Fluid/ViewHelpers">
  <f:section name="content">

<h2>{headline}</h2>

<f:format.html>{text}</f:format.html>

  </f:section>
</html>

That's it. You can use that solution for templates and partials - but not for layouts.

Why?

Why do I want well-formed fluid templates?

Well-formed XML can be validated automatically through git pre-commit hooks. Utilizing them lets developers spot errors earlier.
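Such a hook can be tiny - a sketch, assuming the templates use the .html suffix:

#!/bin/sh
# .git/hooks/pre-commit: reject commits containing non-well-formed templates
for f in $(git diff --cached --name-only --diff-filter=ACM | grep '\.html$'); do
    xmllint --noout "$f" || exit 1
done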

Update 2015-06: A solution for TYPO3 7.3+

One year after my bug had been rejected, a new one was opened and got fixed.

TYPO3 version 7.3 now allows you to use xmlns declarations on any elements in a fluid template. As long as their value begins with http://typo3.org/ns/, they are removed from the output.

The root tag issue was also fixed with that commit; adding data-namespace-typo3-fluid="true" on a root HTML tag in a fluid template causes it to not be rendered in the output.
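A template using both features could look like this (a sketch; the body markup is illustrative):

<html xmlns:f="http://typo3.org/ns/TYPO3/CMS/Fluid/ViewHelpers"
      data-namespace-typo3-fluid="true">

<h2>{headline}</h2>

<f:format.html>{text}</f:format.html>

</html>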

This makes it possible to have well-formed layout templates, too - which was not possible with the section workaround.


You might want to read how to write well-formed templates with dynamic tags.

Published on 2014-03-28


Shell command of the day: Image size as XML attributes

for file in `grep -l 'rel="shadowbox' raw/*.htm`; do echo $file; for imgsrc in `xmlstarlet sel -q -t -v '//_:a[@rel and not(@data-size)]/@href' "$file"`; do size=`exiftool -T -Imagesize raw/$imgsrc`; echo $imgsrc $size; xmlstarlet ed --inplace -P --append "//_:a[@href='$imgsrc' and not(@data-size)]" --type attr -n data-size --value "$size" "$file"; done; done

For this blog I wanted to have an image gallery that works on mobile devices. I found the open source PhotoSwipe library, and after some days I had it integrated in my blog.

PhotoSwipe requires you to specify the full image size when initializing; it does not auto-detect it. I had 29 blog posts with image galleries, and over a hundred images in them - adding the image sizes manually was not an option.

I opted for an HTML5 data attribute on the link to the large image:

<a href="image.jpg" data-size="1200x800">..

What I had to do:

  1. Find all files with galleries

    $ grep -l 'rel="shadowbox' raw/*.htm
  2. Extract image paths from the HTML files

    $ xmlstarlet sel -q -t -v '//_:a[@rel and not(@data-size)]/@href' "$file"
  3. Extract the image size

    $ exiftool -T -Imagesize "raw/$imgsrc"
  4. Add the data-size attribute to the link tags which link to the image:

    $ xmlstarlet ed --inplace -P --append "//_:a[@href='$imgsrc' and not(@data-size)]" --type attr -n data-size --value "$size" "$file"

And all of this combined into one nice shell script:

for file in `grep -l 'rel="shadowbox' raw/*.htm`
do
    echo $file
    for imgsrc in `xmlstarlet sel -q -t -v '//_:a[@rel and not(@data-size)]/@href' "$file"`
    do
        size=`exiftool -T -Imagesize raw/$imgsrc`
        echo $imgsrc $size
        xmlstarlet ed --inplace -P --append "//_:a[@href='$imgsrc' and not(@data-size)]" --type attr -n data-size --value "$size" "$file"
    done
done

This all only worked because my blog posts are XHTML.

You can see the new galleries in e.g. Kinderzimmerlampe im Eigenbau and Playing Tomb Raider 1 on OUYA.

Published on 2016-05-20


Good idea of the day: checkstyle on jenkins

When I started a new project at work, I configured our Jenkins server to automatically deploy to production and testing servers when either git master or develop branches get pushed to - but only if all the tests pass.

In our case, it's only syntax checks for HTML, PHP, SCSS, SQL and XML files as well as coding style checks for those files.

But when things get time-critical, nobody can use the excuse that "it has to be quick now" - your code will just not go live unless you follow the rules. And not only your code: everybody else's will not go live because of you, either.

This really helps keep developers playing by the rules :)

Tools

In case you were wondering which tools we use for syntax and style checking:

php -l
PHP syntax checking
xmllint --noout
XML syntax checking (and HTML, since I enforce XHTML well-formedness)
php-sqllint
SQL file syntax checking
scss-lint
SCSS syntax and coding style checking
phpcs (PHP_CodeSniffer)
PHP coding style checking
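Wired into Jenkins, the checks are a single shell build step, roughly like this (paths and rule sets are illustrative):

#!/bin/sh
# fail the build on the first syntax or coding style error
find src -name '*.php' -print0 | xargs -0 -n1 php -l || exit 1
find src -name '*.xml' -o -name '*.htm' | xargs xmllint --noout || exit 1
php-sqllint src/sql/*.sql || exit 1
scss-lint src/scss/ || exit 1
phpcs --standard=PSR2 src/ || exit 1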

Published on 2015-12-15


Importing huge XML files using PHP5 - efficiently and conveniently

At work I had the task of implementing the synchronization between an online shop and a commodity management system. The data exchange format was XML - one big XML file for all of the products (some thousands, with dozens of attributes each). The big question: how do I import the file in a way that is most convenient for me as a programmer - and without exceeding the machine's RAM when loading a 1 GiB file?

I personally prefer SimpleXML for everything XML-related in PHP - even for generating XML, although it was never primarily meant for that. The big problem is that SimpleXML uses DOM in the background, which builds the whole XML tree in memory. That's a no-go for large files.

So what's left? Yes, our old and rusty SAX parser. It's not really convenient - you have to handle all these events for open tags, close tags, data sections etc. - but it reads the XML file iteratively. Parsing huge files is no problem if you use SAX. PHP5's slightly enhanced SAX implementation/wrapper is XMLReader, which I chose to make use of.
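The basic XMLReader usage is a pull loop - a minimal sketch (the tag name matches the sample XML below):

<?php
$reader = new XMLReader();
$reader->open('programs.xml');
while ($reader->read()) {
    // inspect the node the cursor currently points at
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'program') {
        // a <program> element starts here
    }
}
$reader->close();
?>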

On the other side - in my program that synced the data with the database - I wanted to have something dead simple, like a foreach loop. So the task was to combine XMLReader and SPL's Iterator interface.

Sample XML

<?xml version="1.0" encoding="utf-8"?>
<!-- reconstructed sample; tag names are illustrative -->
<programs>
 <program>
  <name>Kate</name>
  <license>LGPL</license>
  <description>
   <type>Editor</type>
   <teaser>Nice KDE text editor</teaser>
  </description>
  <stable>3.5.9</stable>
  <devel>4.0.5</devel>
 </program>
 <program>
  <name>gedit</name>
  <license>LGPL</license>
  <description>
   <type>Editor</type>
   <teaser>Standard gnome text editor</teaser>
  </description>
  <stable>2.22.3</stable>
  <devel>2.22.4-rc1</devel>
 </program>
</programs>

Preferred PHP import code

The following code is as easy and beautiful as reading an XML file can get:

<?php
// ProgramIterator is the class defined below
require_once 'ProgramIterator.php';
$programs = new ProgramIterator('programs.xml');
foreach ($programs as $program) {
    echo $program['name'] . ': ' . $program['stable'] . "\n";
}
?>
Iterator code

Here is the iteration code - without comments - in case you (or /me) need to do the same thing again.

<?php
class ProgramIterator implements Iterator
{
    protected $strFile;
    protected $strObjectTagname = 'program';
    protected $reader;
    protected $program;
    protected $nKey;

    public function __construct($strFile)
    {
        $this->strFile = $strFile;
    }

    public function current() {
        return $this->program;
    }

    public function key() {
        return $this->nKey;
    }

    public function next() {
        $this->program = null;
    }

    public function rewind() {
        $this->reader = new XMLReader();
        $this->reader->open($this->strFile);
        $this->program = null;
        $this->nKey    = null;
    }

    public function valid() {
        if ($this->program === null) {
            $this->loadNext();
        }

        return $this->program !== null;
    }

    /**
     * Loads the next program
     *
     * @return void
     */
    protected function loadNext()
    {
        $strElementName = null;
        $bCaptureValues = false;
        $arValues       = array();
        $arNesting      = array();

        while ($this->reader->read()) {
            switch ($this->reader->nodeType) {
                case XMLReader::ELEMENT:
                    $strElementName = $this->reader->name;
                    if ($bCaptureValues) {
                        if ($this->reader->isEmptyElement) {
                            $arValues[$strElementName] = null;
                        } else {
                            $arNesting[] = $strElementName;
                            $arValues[implode('-', $arNesting)] = null;
                        }
                    }
                    if ($strElementName == $this->strObjectTagname) {
                        $bCaptureValues = true;
                    }
                    break;

                case XMLReader::TEXT:
                    if ($bCaptureValues) {
                        $arValues[implode('-', $arNesting)] = $this->reader->value;
                    }
                    break;

                case XMLReader::CDATA:
                    if ($bCaptureValues) {
                        $arValues[implode('-', $arNesting)] = trim($this->reader->value);
                    }
                    break;

                case XMLReader::END_ELEMENT:
                    if ($this->reader->name == $this->strObjectTagname) {
                        $this->program = $arValues;
                        ++$this->nKey;
                        break 2;
                    }
                    if ($bCaptureValues) {
                        array_pop($arNesting);
                    }
                    break;
            }
        }
    }
}

There are some things missing, like namespace and attribute support, handling of tags with the same name in different hierarchy levels (especially the main tag), and generally tags that may show up several times. I didn't need those features, so add them yourself if necessary.
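Attribute support, for example, could be patched into the XMLReader::ELEMENT case along these lines (an untested sketch; the "tag@attribute" key scheme is made up):

// inside the XMLReader::ELEMENT case, after the nesting bookkeeping
if ($bCaptureValues && $this->reader->hasAttributes) {
    $strBase = implode('-', $arNesting);
    while ($this->reader->moveToNextAttribute()) {
        $arValues[$strBase . '@' . $this->reader->name] = $this->reader->value;
    }
    $this->reader->moveToElement();
}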

Changelog

2008-08-22
First version
2011-04-27
Support for empty elements
2015-12-13
CDATA support

Published on 2008-08-22


Fixing XML in databases with CLI tools

Recently I had to edit XML that was stored in columns of a MySQL database table. Instead of hacking a small PHP script, I chose to use a command line XML editing tool to master the task.

This article has originally been published on my employer's blog:
Fixing XML in databases with CLI tools @ netresearch .

The problem

In one of our TYPO3 projects we use Flux to add custom configuration options to page records. Those dynamic settings are stored, as usual in TYPO3, in an XML format called “FlexForms” which is then put into a column of the database table:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<T3FlexForms>
  <data>
    <sheet index="sDEF">
      <language index="lDEF">
        <!-- field names are illustrative -->
        <field index="settings.showNavigation">
          <value index="vDEF">0</value>
        </field>
        <field index="settings.hideSubpages">
          <value index="vDEF">0</value>
        </field>
      </language>
    </sheet>
  </data>
</T3FlexForms>

Now, due to some update in either TYPO3 itself or the Flux extension, the options did not get stored in the sDEF sheet anymore but in a new sheet options:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<T3FlexForms>
  <data>
    <sheet index="sDEF">
      <language index="lDEF">
        <!-- stale values from earlier saves; field names are illustrative -->
        <field index="settings.showNavigation">
          <value index="vDEF">0</value>
        </field>
      </language>
    </sheet>
    <sheet index="options">
      <language index="lDEF">
        <field index="settings.cssClass">
          <value index="vDEF">info</value>
        </field>
        <field index="settings.showNavigation">
          <value index="vDEF">0</value>
        </field>
      </language>
    </sheet>
  </data>
</T3FlexForms>

TYPO3 does not remove old data when saving flexform fields, thus the old sDEF sheet as well as the new options sheet were both in the XML. Unfortunately, the TYPO3 API has a preference for sDEF - when it is set, the values from that sheet are used.

This led to the situation that, although we changed the settings, they were not used by TYPO3 at all. The only way to fix it was to remove the sDEF sheet from the XML in the database columns tx_fed_page_flexform and tx_fed_page_flexform_sub of the pages table in the TYPO3 MySQL database.

Solution #1: A PHP script

A solution would have been to write a PHP script that connects to the TYPO3 database, fetches all records from the pages table, loads the flexform column data into a SimpleXML object, runs XPath on it, removes the node, re-serializes the XML and updates the database records.
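Sketched out, such a script would have looked roughly like this (connection details and column handling are made up):

<?php
$db  = new PDO('mysql:host=localhost;dbname=typo3', 'user', 'secret');
$upd = $db->prepare('UPDATE pages SET tx_fed_page_flexform = ? WHERE uid = ?');
$res = $db->query(
    "SELECT uid, tx_fed_page_flexform FROM pages"
    . " WHERE tx_fed_page_flexform != ''"
);
foreach ($res as $row) {
    $sx = simplexml_load_string($row['tx_fed_page_flexform']);
    foreach ($sx->xpath("//sheet[@index='sDEF']") as $sheet) {
        unset($sheet[0]); // SimpleXML idiom for removing the matched node
    }
    $upd->execute(array($sx->asXML(), $row['uid']));
}
?>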

This sounded like too much effort, given that I know that editing XML on the command line is a breeze with xmlstarlet.

Solution #2: mysqlfuse + xmlstarlet

XMLStarlet is a set of command line tools to edit and query XML files.

Removing the sDEF sheet node from an XML file is as simple as executing the following command:

$ xmlstarlet ed --inplace --delete "//sheet[@index='sDEF']" flexform.xml

The only question left was how to access the MySQL pages table with XMLStarlet.

FUSE and mysqlfuse

Linux has a mechanism called FUSE, the Filesystem in Userspace. With it, it’s possible to write user-space file system drivers that can expose about anything as a file system. FTPfs and SSHfs are examples, as well as WikipediaFS which allows you to read and edit wikipedia articles with a normal text editor.

There is also mysqlfuse, which is able to expose complete MySQL databases as a directory tree. Each record in a table is a directory, and each column is a file – exactly what I needed for my task.

Mounting the database

Mounting the MySQL database as file system was easy:

  1. Install python-fuse and python-mysqldb
  2. Download mysqlfuse:

    $ git clone https://github.com/clsn/mysqlfuse.git
  3. Mount your database

Now I could list the tables (listings shortened):

$ ls mnt/
... pages  sys_template  tt_content ...

And the pages table:

$ ls mnt/pages/
uid

Every primary key is turned into a directory, uid is the only one in the pages table. Inside that directory, we have all records listed with their uid:

$ ls mnt/pages/uid/
1  2  3  ...

And each record directory exposes all columns as files:

$ ls mnt/pages/uid/1/
... title  tx_fed_page_flexform  tx_fed_page_flexform_sub ...

Examining the contents of a column is as easy as reading it with cat:

$ cat mnt/pages/uid/1/tx_fed_page_flexform
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<T3FlexForms>
  <data>
    <sheet index="sDEF">
[...]

Fixing the XML

With the mount in place, running XMLStarlet was simple:

$ for i in mnt/pages/uid/*/tx_fed_page_flexform; do\
   test -s "$i" && (\
    xmlstarlet ed --delete "//sheet[@index='sDEF']" "$i" > ~/tmp-flexdata;\
    cat ~/tmp-flexdata > $i\
   ); done

The shell command loops through all records with a tx_fed_page_flexform, checks if there is actual content in them (some records have no flexform options saved), edits and saves the resulting XML into a temporary file. The contents of the temp file are then written back to the column file.

I did the same for the tx_fed_page_flexform_sub column and was all set.

A tiny bug

Examining the database, I noted that the XML in the flexform columns was not modified at all.

Debugging the issue with Wireshark revealed a bug: the python-mysqldb library had changed since mysqlfuse was written and now automatically disables MySQL's autocommit feature. Since mysqlfuse only executes the UPDATE SQL queries but never calls commit, the database never wrote the changes back to disk.

A bugfix was quickly written, and now the columns were properly updated in the database.

Final words

Apart from the mysqlfuse bug I had to fix, the whole task was a breeze, and much quicker than writing a script in PHP or another language.

I’ll keep mysqlfuse in my toolbox for more tasks that can be solved with some unix command line tools and a bit of ingenuity.

Published on 2015-01-19