Importing huge XML files using PHP5 - efficiently and conveniently

At work I had to implement the synchronization between an online shop and a commodity management system. The data exchange format was XML - one big XML file for all of the products (several thousand, each with dozens of attributes). The big question: How do I import the file in a way that is most convenient for me as a programmer, without exceeding the machine's RAM when loading a 1 GiB file?

I personally prefer SimpleXML for everything XML-related in PHP - even for generating XML, although it was never primarily meant for that. The big problem is that SimpleXML, like DOM, builds the whole XML tree in memory. That's a no-go for large files.
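To illustrate why SimpleXML is so pleasant for small documents, here is a minimal sketch of reading and generating XML with it. The element names are assumptions picked for illustration, and the whole document lives in memory, which is exactly the limitation described above.

```php
<?php
// Element names below are assumptions chosen for illustration.
$strXml = '<programs>'
    . '<program><name>Kate</name><license>LGPL</license></program>'
    . '<program><name>gedit</name><license>LGPL</license></program>'
    . '</programs>';

// The whole document is parsed into an in-memory tree in one go.
$xml = simplexml_load_string($strXml);

$arNames = array();
foreach ($xml->program as $program) {
    $arNames[] = (string) $program->name;
}

// Generating XML works as well, even if that is not its main purpose.
$new   = new SimpleXMLElement('<programs/>');
$entry = $new->addChild('program');
$entry->addChild('name', 'Kate');
$strGenerated = $new->asXML();
```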

So what's left? Yes, our old and rusty SAX-style parsing. It's not really convenient - you have to handle all these events for opening tags, closing tags, text sections etc. - but it reads the XML file iteratively, so parsing huge files is no problem. PHP5's XMLReader works in the same streaming fashion, but as a pull parser: instead of registering callbacks, you ask it for the next node yourself. That is what I chose to use.
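A minimal sketch of such a streaming read loop with XMLReader could look like this. The element names are assumptions, and the XML is passed as a string only to keep the example self-contained; for a big file you would call open() with a file name instead.

```php
<?php
// Element names are assumptions chosen for illustration.
$strXml = '<programs>'
    . '<program><name>Kate</name></program>'
    . '<program><name>gedit</name></program>'
    . '</programs>';

$reader = new XMLReader();
// For a real import, $reader->open($strFile) streams the file from disk.
$reader->XML($strXml);

// Each read() moves the cursor to the next node; only that node is inspected.
$arEvents = array();
while ($reader->read()) {
    switch ($reader->nodeType) {
    case XMLReader::ELEMENT:
        $arEvents[] = 'open ' . $reader->name;
        break;
    case XMLReader::TEXT:
        $arEvents[] = 'text ' . $reader->value;
        break;
    case XMLReader::END_ELEMENT:
        $arEvents[] = 'close ' . $reader->name;
        break;
    }
}
$reader->close();
```

The loop never sees more than one node at a time, which is what keeps memory usage flat regardless of the file size.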

On the other side - in my program that synced the data with the database - I wanted to have something dead simple, like a foreach loop. So the task was to combine XMLReader with the Iterator interface.
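To see what the Iterator interface demands, it helps to know the order in which foreach calls its five methods. The following sketch uses a hypothetical LoggingIterator class over a plain array (it is not part of the import code) and simply records every call.

```php
<?php
// Hypothetical helper class (not part of the import code): it records the
// order in which foreach calls the five Iterator methods.
class LoggingIterator implements Iterator
{
    public $arCalls = array();
    protected $arData;
    protected $nPos = 0;

    public function __construct(array $arData)
    {
        $this->arData = array_values($arData);
    }

    // The #[\ReturnTypeWillChange] lines are plain comments before PHP 8
    // and silence the return type deprecation notices from PHP 8.1 on.
    #[\ReturnTypeWillChange]
    public function rewind()
    {
        $this->arCalls[] = 'rewind';
        $this->nPos = 0;
    }

    #[\ReturnTypeWillChange]
    public function valid()
    {
        $this->arCalls[] = 'valid';
        return $this->nPos < count($this->arData);
    }

    #[\ReturnTypeWillChange]
    public function current()
    {
        $this->arCalls[] = 'current';
        return $this->arData[$this->nPos];
    }

    #[\ReturnTypeWillChange]
    public function key()
    {
        $this->arCalls[] = 'key';
        return $this->nPos;
    }

    #[\ReturnTypeWillChange]
    public function next()
    {
        $this->arCalls[] = 'next';
        ++$this->nPos;
    }
}

$it     = new LoggingIterator(array('Kate', 'gedit'));
$arSeen = array();
foreach ($it as $nKey => $strName) {
    $arSeen[$nKey] = $strName;
}
```

Since valid() is always called before current(), an XMLReader-backed iterator can defer the actual reading until valid() is asked whether another element exists.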

Sample XML

Kate LGPL
Editor Nice KDE text editor
3.5.9 4.0.5

gedit LGPL
Editor Standard gnome text editor
2.22.3 2.22.4-rc1

Preferred PHP import code

The following code is as easy and beautiful as reading an XML file can get:
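As a rough sketch of what the consuming side can look like: a plain foreach over an iterator that yields one flat array per program. The function name syncPrograms and the array keys ('name', 'license', 'version-stable') are assumptions for illustration, and an ArrayIterator stands in for the XML-backed iterator so the example runs without an XML file.

```php
<?php
// Hypothetical consumer: $programs can be any iterator that yields one
// associative array per program, e.g. an XMLReader-backed iterator.
function syncPrograms(Traversable $programs)
{
    $arNames = array();
    foreach ($programs as $nKey => $arProgram) {
        // A real import would write $arProgram to the database here.
        $arNames[$nKey] = $arProgram['name'] . ' ' . $arProgram['version-stable'];
    }
    return $arNames;
}

// Stand-in data; the keys are assumed element names, with nested
// elements joined by "-" (e.g. "version-stable").
$programs = new ArrayIterator(array(
    1 => array('name' => 'Kate',  'license' => 'LGPL', 'version-stable' => '3.5.9'),
    2 => array('name' => 'gedit', 'license' => 'LGPL', 'version-stable' => '2.22.3'),
));
$arNames = syncPrograms($programs);
```

The consumer does not need to know whether the data comes from an array or from a 1 GiB file; it only sees one program at a time.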

Iterator code

Here is the iteration code - without comments - in case you (or /me) need to do the same thing again.

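<?php
/**
 * Iterates over the objects in an XML file, one flat array per object.
 * The class name and the 'program' default tag name are assumptions.
 */
class ProgramIterator implements Iterator
{
    protected $strFile;
    protected $reader;
    protected $program;
    protected $nKey;
    protected $strObjectTagname = 'program';

    public function __construct($strFile) {
        $this->strFile = $strFile;
    }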

    public function current() {
        return $this->program;
    }

    public function key() {
        return $this->nKey;
    }

    public function next() {
        $this->program = null;
    }

    public function rewind() {
        $this->reader = new XMLReader();
        $this->reader->open($this->strFile);
        $this->program = null;
        $this->nKey    = null;
    }

    public function valid() {
        if ($this->program === null) {
            $this->loadNext();
        }

        return $this->program !== null;
    }

    /**
     * Loads the next program
     *
     * @return void
     */
    protected function loadNext()
    {
        $strElementName = null;
        $bCaptureValues = false;
        $arValues       = array();
        $arNesting      = array();

        while ($this->reader->read()) {
            switch ($this->reader->nodeType) {
                case XMLReader::ELEMENT:
                    $strElementName = $this->reader->name;
                    if ($bCaptureValues) {
                        if ($this->reader->isEmptyElement) {
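                            // Empty elements produce no END_ELEMENT node, so they
                            // are not pushed onto the nesting stack; they are still
                            // stored under their full path, like all other values.
                            $arPath   = $arNesting;
                            $arPath[] = $strElementName;
                            $arValues[implode('-', $arPath)] = null;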
                        } else {
                            $arNesting[] = $strElementName;
                            $arValues[implode('-', $arNesting)] = null;
                        }
                    }
                    if ($strElementName == $this->strObjectTagname) {
                        $bCaptureValues = true;
                    }
                    break;

                case XMLReader::TEXT:
                case XMLReader::CDATA:
                    if ($bCaptureValues) {
                        $arValues[implode('-', $arNesting)] = $this->reader->value;
                    }
                    break;

                case XMLReader::END_ELEMENT:
                    if ($this->reader->name == $this->strObjectTagname) {
                        $this->program = $arValues;
                        ++$this->nKey;
                        break 2;
                    }
                    if ($bCaptureValues) {
                        array_pop($arNesting);
                    }
                    break;
            }
        }
    }
}

Some things are missing: namespace and attribute support, and handling of tags that share a name across hierarchy levels (especially with the main tag) or that show up several times. I didn't need any of that, so add it yourself if necessary.
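If you do need attributes or repeated tags, one possible approach (a different technique from the iterator above, not something it supports) is to let XMLReader walk from object to object and hand each single object to SimpleXML via readOuterXml(). This is only a sketch; the element and attribute names are assumptions.

```php
<?php
// Element and attribute names are assumptions chosen for illustration.
$strXml = '<programs>'
    . '<program id="1"><name>Kate</name><tag>kde</tag><tag>editor</tag></program>'
    . '<program id="2"><name>gedit</name><tag>gnome</tag></program>'
    . '</programs>';

$reader = new XMLReader();
// For a real import, use $reader->open($strFile) instead.
$reader->XML($strXml);

// Advance to the first <program> element.
$bFound = false;
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'program') {
        $bFound = true;
        break;
    }
}

$arPrograms = array();
while ($bFound) {
    // Only this one element is turned into a SimpleXML tree.
    $program = new SimpleXMLElement($reader->readOuterXml());

    $arTags = array();
    foreach ($program->tag as $tag) {
        $arTags[] = (string) $tag;
    }
    $arPrograms[(string) $program['id']] = array(
        'name' => (string) $program->name,
        'tags' => $arTags,
    );

    // Skip the subtree and jump to the next element named "program".
    $bFound = $reader->next('program');
}
$reader->close();
```

The trade-off is that every object is parsed a second time by SimpleXML, but memory usage stays bounded by the size of a single object rather than the whole file.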

Changelog

2008-08-22
First version
2011-04-27
Support for empty elements

Written by Christian Weiske.

Comments? Please send an e-mail.