Fixing XML in databases with CLI tools

Recently I had to edit XML that was stored in columns of a MySQL database table. Instead of hacking a small PHP script, I chose to use a command line XML editing tool to master the task.

This article has originally been published on my employer's blog:
Fixing XML in databases with CLI tools @ netresearch .

The problem

In one of our TYPO3 projects we use Flux to add custom configuration options to page records. Those dynamic settings are stored, as usual in TYPO3, in an XML format called “FlexForms” which is then put into a column of the database table:

<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<T3FlexForms>
  <data>
    <sheet index="sDEF">
      <language index="lDEF">
        <field index="_TOGGLE">0</field>
        <field index="ROOT">
          <el index="el">
            <field index="pageicon">
              <value index="vDEF"></value>
            </field>
            <field index="settings.entryLevel">
              <value index="vDEF">0</value>
            </field>
          </el>
        </field>
      </language>
    </sheet>
  </data>
</T3FlexForms>

Now, due to some update in either TYPO3 itself or the Flux extension, the options did not get stored in the sDEF sheet anymore but in a new sheet options:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<T3FlexForms>
  <data>
    <sheet index="sDEF">
      <language index="lDEF">
        <field index="ROOT">
          <el index="el">
            <field index="pageicon">
              <value index="vDEF"/>
            </field>
          </el>
        </field>
        <field index="_TOGGLE">0</field>
      </language>
    </sheet>
    <sheet index="options">
      <language index="lDEF">
        <field index="pageicon">
          <value index="vDEF"/>
        </field>
        <field index="brand">
          <value index="vDEF">info</value>
        </field>
        <field index="icon">
          <value index="vDEF"/>
        </field>
        <field index="disabled">
          <value index="vDEF">0</value>
        </field>
      </language>
    </sheet>
  </data>
</T3FlexForms>

TYPO3 does not remove old data when saving flexform fields, thus the old sDEF sheet as well as the new options sheet were both in the XML. Unfortunately, the TYPO3 API has a preference for sDEF - when it’s set, the values from that sheet are used:

TYPO3\CMS\Frontend\ContentObject\ContentObjectRenderer::readFlexformIntoConf(..)
{
    //...
    if (is_array($flexData) && isset($flexData['data']['sDEF']['lDEF'])) {
        $flexData = $flexData['data']['sDEF']['lDEF'];
    }
    //...
}

This led to the situation that, although we changed the settings, they were not used by TYPO3 at all. The only way to fix it was to remove the sDEF sheet from the XML in the database columns tx_fed_page_flexform and tx_fed_page_flexform_sub of the pages table in the TYPO3 MySQL database.

Solution #1: A PHP script

A solution would have been to write a PHP script that connects to the TYPO3 database, fetches all records from the pages table, loads the flexform column data into a SimpleXML object, runs XPath on it, removes the node, re-serializes the XML and updates the database records.

This sounded like too much effort, given that I know that editing XML on the command line is a breeze with xmlstarlet.

Solution #2: mysqlfuse + xmlstarlet

XMLStarlet is a set of command line tools to edit and query XML files.

Removing the sDEF sheet node from an XML file is as simple as executing the following command:

$ xmlstarlet ed -d '/T3FlexForms/data/sheet[@index="sDEF"]' file.xml

The only question left was how to access the MySQL pages table with XMLStarlet.

FUSE and mysqlfuse

Linux has a mechanism called FUSE, the Filesystem in Userspace. With it, it’s possible to write user-space file system drivers that can expose about anything as a file system. FTPfs and SSHfs are examples, as well as WikipediaFS which allows you to read and edit wikipedia articles with a normal text editor.

There is also mysqlfuse, which is able to expose complete MySQL databases as a directory tree. Each record in a table is a directory, and each column is a file – exactly what I needed for my task.

Mounting the database

Mounting the MySQL database as file system was easy:

  1. Install python-fuse and python-mysqldb
  2. Download mysqlfuse:

    $ git clone https://github.com/clsn/mysqlfuse.git
  3. Mount your database:

    $ mkdir mydb
    $ ./mysqlfuse.py -o host=localhost,user=dbuser,passwd=dbpass,db=mydb mydb
    

Now I could list the tables:

$ ls mydb
be_groups    fe_groups        pages
be_sessions  fe_sessions      pages_language_overlay
be_users     fe_session_data  ...

And the pages table:

$ ls mydb/pages
uid

Every primary key is turned into a directory, uid is the only one in the pages table. Inside that directory, we have all records listed with their uid:

$ ls mydb/pages/uid
1  6  10  14
2  7  11  20
4  8  12  21
5  9  13  ...

And each record directory exposes all columns as files:

$ ls mydb/pages/uid/20
abstract                   perms_userid
alias                      php_tree_stop
and_ship_areas             pid
[...]
login_required             tx_contexts_disable
media                      tx_contexts_enable
nav_title                  tx_fed_page_flexform
newUntil                   tx_fed_page_flexform_sub
[...]

Examining the contents of a column is as easy as reading it with cat:

$ cat mydb/pages/uid/20/tx_fed_page_flexform
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<T3FlexForms>
  <data>
    <sheet index="sDEF">
[...]

Fixing the XML

With the mount in place, running XMLStarlet was simple:

$ for i in mydb/pages/uid/*/tx_fed_page_flexform; do \
  test -s $i\
   && (xmlstarlet ed -d '/T3FlexForms/data/sheet[@index="sDEF"]' $i > ~/tmp-flexdata;\
   cat ~/tmp-flexdata > $i\
  ); done

The shell command loops through all records with a tx_fed_page_flexform, checks if there is actual content in them (some records have no flexform options saved), edits and saves the resulting XML into a temporary file. The contents of the temp file are then written back to the column file.

I did the same for the tx_fed_page_flexform_sub column and was set.

A tiny bug

Examining the database, I noted that the XML in the flexform columns was not modified at all.

Debbuging the issue with wireshark revealed a bug: The python-mysqldb library had changed since mysqlfuse was written and now automatically disables the MySQL autocommit feature. Since mysqlfuse only executes the UPDATE SQL queries but never calls commit, the database never writes the changes back to disc.

A bugfix was quickly written, and now the columns were properly updated in the database.

Final words

Apart from the mysqlfuse bug I had to fix, the whole task was a breeze and much quicker done than writing a script in PHP or another language.

I’ll keep mysqlfuse in my toolbox for more tasks that can be solved with some unix command line tools and a bit of ingenuity.

Written by Christian Weiske.

Comments? Please send an e-mail.