Importing huge XML files using PHP5 - efficiently and conveniently

At work I had to implement the synchronization between an online shop and a commodity management system. The data exchange format was XML - one big XML file containing all of the products (several thousand, each with dozens of attributes). The big question: How do I import the file in a way that is most convenient for me as a programmer - and without exhausting the machine's RAM when loading a 1 GiB file?

I personally prefer SimpleXML for everything XML-related in PHP - even for generating XML, although it was never primarily meant for that. The big problem is that SimpleXML, like DOM, builds the whole XML tree in memory. That's a no-go for large files.
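For comparison, this is roughly what the convenient SimpleXML version looks like. It is only a sketch: the inline `$xml` string is a made-up stand-in for the real file. The loop is pleasant to write, but `simplexml_load_string()` (just like `simplexml_load_file()`) has parsed the complete document before the first iteration starts.

```php
<?php
// Small stand-in document; the real file would be far too large for this approach.
$xml = <<<XML
<programs>
 <program><name>Kate</name><license>LGPL</license></program>
 <program><name>gedit</name><license>LGPL</license></program>
</programs>
XML;

// The whole tree is built in memory here, before any iteration happens.
$programs = simplexml_load_string($xml);

$names = array();
foreach ($programs->program as $program) {
    // Child elements are available as properties; cast to string to get the text.
    $names[] = (string) $program->name;
}

echo implode(', ', $names) . "\n";
```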

So what's left? Yes, our old and rusty SAX parser. It's not really convenient - you have to handle all those events for opening tags, closing tags, data sections etc. - but it reads the XML file iteratively, so parsing huge files is no problem. PHP5 also ships XMLReader, which streams the file in the same way but works as a pull parser: you ask for the next node yourself instead of registering callbacks. That is what I chose to use.
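To give an idea of what reading with XMLReader feels like, here is a minimal sketch that walks a tiny document node by node and records what it sees. The inline `$xml` string and the `$events` list are only there for illustration; for a real file you would call `open()` with a path instead of `XML()`.

```php
<?php
$xml = '<programs><program><name>Kate</name></program></programs>';

$reader = new XMLReader();
// XML() parses from a string; open() would take a file path or URI instead.
$reader->XML($xml);

$events = array();
// read() moves the cursor forward by one node; the document is never loaded as a whole.
while ($reader->read()) {
    switch ($reader->nodeType) {
        case XMLReader::ELEMENT:
            $events[] = 'open ' . $reader->name;
            break;
        case XMLReader::TEXT:
            $events[] = 'text ' . $reader->value;
            break;
        case XMLReader::END_ELEMENT:
            $events[] = 'close ' . $reader->name;
            break;
    }
}
$reader->close();

echo implode("\n", $events) . "\n";
```

All the bookkeeping (which tag am I in, where does one record end) is left to the caller, which is exactly the part worth hiding behind an iterator.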

On the other side - in my program that synced the data with the database - I wanted something dead simple, like a foreach loop. So the task was to combine XMLReader with PHP's Iterator interface.
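It helps to know how foreach drives an Iterator: it calls rewind() once, then valid() before every pass, current() (and key(), if the loop asks for keys) to fetch the element, and next() after the loop body. The sketch below just logs those calls; `LoggingIterator` is a made-up class for illustration. The `#[\ReturnTypeWillChange]` lines silence the deprecation notice that PHP 8.1 and later raise for Iterator methods without return types; older PHP versions treat them as plain comments.

```php
<?php
// Logs every Iterator method that foreach calls on it.
class LoggingIterator implements Iterator
{
    public $log = array();
    protected $items;
    protected $pos = 0;

    public function __construct(array $items)
    {
        $this->items = $items;
    }

    #[\ReturnTypeWillChange]
    public function rewind()
    {
        $this->log[] = 'rewind';
        $this->pos = 0;
    }

    #[\ReturnTypeWillChange]
    public function valid()
    {
        $this->log[] = 'valid';
        return $this->pos < count($this->items);
    }

    #[\ReturnTypeWillChange]
    public function current()
    {
        $this->log[] = 'current';
        return $this->items[$this->pos];
    }

    #[\ReturnTypeWillChange]
    public function key()
    {
        $this->log[] = 'key';
        return $this->pos;
    }

    #[\ReturnTypeWillChange]
    public function next()
    {
        $this->log[] = 'next';
        ++$this->pos;
    }
}

$it = new LoggingIterator(array('Kate', 'gedit'));
foreach ($it as $key => $name) {
    // Nothing to do here, we only want the call log.
}
echo implode(' ', $it->log) . "\n";
```

Because valid() is always asked before current(), it is a natural place to lazily read the next record from the XML stream.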

Sample XML

<?xml version="1.0" encoding="utf-8"?>
<programs>
 <program>
  <name>Kate</name>
  <license>LGPL</license>
  <details>
   <type>Editor</type>
   <description>Nice KDE text editor</description>
  </details>
  <version>
   <stable>3.5.9</stable>
   <beta>4.0.5</beta>
  </version>
 </program>
 
 <program>
  <name>gedit</name>
  <license>LGPL</license>
  <details>
   <type>Editor</type>
   <description>Standard gnome text editor</description>
  </details>
  <version>
   <stable>2.22.3</stable>
   <beta>2.22.4-rc1</beta>
  </version>
 </program>
</programs>

Preferred PHP import code

The following code is as easy and beautiful as reading an XML file can get:

$it = new ProgramIterator('programs.xml');
foreach ($it as $arProgram) {
    echo $arProgram['name'] . "\n";
    echo ' Latest stable version: ' . $arProgram['version-stable'] . "\n";
    //here you could do some db operations
}
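As for the db operations hinted at in the loop, a sketch could look like the following. Everything database-related here is an assumption for illustration: the in-memory SQLite database (needs the pdo_sqlite extension), the `programs` table and its columns. A plain array stands in for the iterator so that the snippet runs on its own.

```php
<?php
// Stand-in for the ProgramIterator: anything foreach-able that yields flattened arrays works.
$programs = array(
    array('name' => 'Kate',  'version-stable' => '3.5.9'),
    array('name' => 'gedit', 'version-stable' => '2.22.3'),
);

// In-memory SQLite keeps the sketch self-contained.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// Hypothetical table layout.
$db->exec('CREATE TABLE programs (name TEXT PRIMARY KEY, stable_version TEXT)');

// Prepare once, execute per program; REPLACE overwrites rows that already exist.
$stmt = $db->prepare(
    'REPLACE INTO programs (name, stable_version) VALUES (:name, :stable)'
);

// One transaction for the whole import avoids a commit per row.
$db->beginTransaction();
foreach ($programs as $arProgram) {
    $stmt->execute(array(
        ':name'   => $arProgram['name'],
        ':stable' => $arProgram['version-stable'],
    ));
}
$db->commit();

$count = (int) $db->query('SELECT COUNT(*) FROM programs')->fetchColumn();
echo $count . " programs stored\n";
```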

Iterator code

Here is the iterator code - without further commentary - in case you (or /me) need to do the same thing again.

class ProgramIterator implements Iterator
{
    /**
     * XML file path
     *
     * @var string
     */
    protected $strFile = null;
 
    /**
     * XML reader object
     *
     * @var XMLReader
     */
    protected $reader = null;
 
    /**
     * Current program
     *
     * @var array
     */
    protected $program = null;
 
    /**
     * Dummy-key for iteration.
     * Has no real value except maybe act as a counter
     *
     * @var integer
     */
    protected $nKey = null;
 
    /**
     * Name of the tag that wraps one program
     *
     * @var string
     */
    protected $strObjectTagname = 'program';
 
 
 
    public function __construct($strFile)
    {
        $this->strFile = $strFile;
    }
 
    public function current() {
        return $this->program;
    }
 
    public function key() {
        return $this->nKey;
    }
 
    public function next() {
        $this->program = null;
    }
 
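    public function rewind() {
        // Close a reader left over from a previous iteration
        if ($this->reader !== null) {
            $this->reader->close();
        }
        $this->reader = new XMLReader();
        if (!$this->reader->open($this->strFile)) {
            throw new RuntimeException('Cannot open XML file: ' . $this->strFile);
        }
        $this->program = null;
        $this->nKey    = null;
    }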
 
    public function valid() {
        if ($this->program === null) {
            $this->loadNext();
        }
 
        return $this->program !== null;
    }
 
    /**
     * Loads the next program
     *
     * @return void
     */
    protected function loadNext()
    {
        $strElementName = null;
        $bCaptureValues = false;
        $arValues       = array();
        $arNesting      = array();
 
        while ($this->reader->read()) {
            switch ($this->reader->nodeType) {
                case XMLReader::ELEMENT:
                    $strElementName = $this->reader->name;
                    if ($bCaptureValues) {
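                        if ($this->reader->isEmptyElement) {
                            // Empty elements get the same parent prefix as non-empty ones
                            $arValues[implode('-', array_merge($arNesting, array($strElementName)))] = null;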
                        } else {
                            $arNesting[] = $strElementName;
                            $arValues[implode('-', $arNesting)] = null;
                        }
                    }
                    if ($strElementName == $this->strObjectTagname) {
                        $bCaptureValues = true;
                    }
                    break;
 
                case XMLReader::TEXT:
                    if ($bCaptureValues) {
                        $arValues[implode('-', $arNesting)] = $this->reader->value;
                    }
                    break;
 
                case XMLReader::CDATA:
                    if ($bCaptureValues) {
                        $arValues[implode('-', $arNesting)] = trim($this->reader->value);
                    }
                    break;
 
                case XMLReader::END_ELEMENT:
                    if ($this->reader->name == $this->strObjectTagname) {
                        $this->program = $arValues;
                        ++$this->nKey;
                        break 2;
                    }
                    if ($bCaptureValues) {
                        array_pop($arNesting);
                    }
                    break;
            }
        }
    }
}

Some things are missing: namespace and attribute support, handling of tags that share a name at different hierarchy levels (especially a nested tag named like the main tag), and generally tags that may show up several times within one program. I didn't need any of that, so add it yourself if necessary.
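If you do need attributes or repeated tags, one possible direction is sketched below. It is not part of the iterator above, just a standalone loop with two made-up conventions: attributes are stored under the key "element@attribute", and every element's text is collected in a list so that repeated tags do not overwrite each other.

```php
<?php
$xml = '<program id="7"><name lang="en">Kate</name>'
     . '<tag>editor</tag><tag>kde</tag></program>';

$reader = new XMLReader();
$reader->XML($xml);

$arValues  = array();
$arNesting = array();

while ($reader->read()) {
    switch ($reader->nodeType) {
        case XMLReader::ELEMENT:
            // Remember this before the cursor moves onto the attributes.
            $bEmpty = $reader->isEmptyElement;
            // As in the iterator, the main tag itself is not part of the key.
            $bMain = ($reader->name == 'program');
            if (!$bMain) {
                $arNesting[] = $reader->name;
            }
            $strKey = implode('-', $arNesting);
            // Made-up convention: attributes live under "key@attribute".
            while ($reader->moveToNextAttribute()) {
                $arValues[$strKey . '@' . $reader->name] = $reader->value;
            }
            $reader->moveToElement();
            // Empty elements never produce an END_ELEMENT, so pop right away.
            if ($bEmpty && !$bMain) {
                array_pop($arNesting);
            }
            break;

        case XMLReader::TEXT:
            // Collect values in a list instead of overwriting earlier ones.
            $arValues[implode('-', $arNesting)][] = $reader->value;
            break;

        case XMLReader::END_ELEMENT:
            if ($reader->name != 'program') {
                array_pop($arNesting);
            }
            break;
    }
}
$reader->close();

print_r($arValues);
```

The price is that every text value now arrives as a list, even for tags that occur only once, so the consuming code has to know which keys it expects to be single-valued.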

Changelog

2008-08-22
First version
2011-04-27
Support for empty elements
2015-12-13
CDATA support

Written by Christian Weiske.
