At work I was tasked with implementing the synchronization between an online shop and a commodity management system. The data exchange format was XML: one big XML file for all of the products (several thousand, each with dozens of attributes). The big question: how do I import the file in a way that is most convenient for me as a programmer, without exceeding the machine's RAM when loading a 1 GiB file?
I personally prefer SimpleXML for everything XML-related in PHP, even for generating XML, although that was never its primary purpose. The big problem is that SimpleXML, like DOM, builds the whole XML tree in memory. That's a no-go for large files.
So what's left? Yes, our old and rusty SAX parser. It's not really convenient, since you have to handle all those events for opening tags, closing tags, data sections etc., but it reads the XML file iteratively. Parsing huge files is no problem with a streaming parser. PHP 5 ships XMLReader, which strictly speaking is a pull parser rather than SAX: instead of registering callbacks you ask for the next node yourself, while the file is still streamed piece by piece. That is what I chose to use.
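To make the streaming idea concrete, here is a minimal sketch of a bare XMLReader loop. It parses a small inline string so it is self-contained; for a real file you would call `open()` with the path instead. The `$names` variable is only there for illustration.

```php
<?php
// Sketch: pull nodes one at a time instead of building the whole tree.
$xml = '<programs><program><name>Kate</name></program>'
     . '<program><name>gedit</name></program></programs>';

$reader = new XMLReader();
// For a real file: $reader->open('programs.xml');
$reader->XML($xml);

$names = [];
while ($reader->read()) {
    // read() advances to the next node; we only care about <name> start tags.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'name') {
        // readString() returns the text content of the current element.
        $names[] = $reader->readString();
    }
}
$reader->close();

echo implode(', ', $names), "\n";
```

The loop never needs more than the current node, which is why the file size hardly matters for memory usage.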
On the other side, in my program that synced the data with the database, I wanted to have something dead simple, like a foreach loop. So the task was to combine XMLReader with PHP's Iterator interface.
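As a reminder of what foreach expects from an Iterator, the following sketch spells out roughly the calls it makes. It uses PHP's built-in ArrayIterator as a stand-in so it runs on its own; the `$seen` array is just for illustration.

```php
<?php
// Sketch: a foreach over an Iterator is roughly equivalent to this for loop.
$it = new ArrayIterator(['a' => 1, 'b' => 2]);

$seen = [];
for ($it->rewind(); $it->valid(); $it->next()) {
    // current() and key() are only called after valid() returned true.
    $seen[$it->key()] = $it->current();
}

print_r($seen);
```

Since valid() is always asked before current(), it is a natural place to lazily read the next record from the XML stream, and next() only has to throw the old record away.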
Sample XML
<?xml version="1.0" encoding="utf-8"?>
<programs>
  <program>
    <name>Kate</name>
    <license>LGPL</license>
    <details>
      <type>Editor</type>
      <description>Nice KDE text editor</description>
    </details>
    <version>
      <stable>3.5.9</stable>
      <beta>4.0.5</beta>
    </version>
  </program>
  <program>
    <name>gedit</name>
    <license>LGPL</license>
    <details>
      <type>Editor</type>
      <description>Standard gnome text editor</description>
    </details>
    <version>
      <stable>2.22.3</stable>
      <beta>2.22.4-rc1</beta>
    </version>
  </program>
</programs>
Preferred PHP import code
The following code is as easy and beautiful as reading an XML file can get:
$it = new ProgramIterator('programs.xml');
foreach ($it as $arProgram) {
    echo $arProgram['name'] . "\n";
    echo ' Latest stable version: ' . $arProgram['version-stable'] . "\n";
    //here you could do some db operations
}
Iterator code
Here is the iterator code, without much commentary, in case you (or I) need to do the same thing again.
class ProgramIterator implements Iterator
{
    /**
     * XML file path
     *
     * @var string
     */
    protected $strFile = null;

    /**
     * XML reader object
     *
     * @var XMLReader
     */
    protected $reader = null;

    /**
     * Current program
     *
     * @var array
     */
    protected $program = null;

    /**
     * Dummy-key for iteration.
     * Has no real value except maybe act as a counter
     *
     * @var integer
     */
    protected $nKey = null;

    /**
     * Name of the tag that wraps one record
     *
     * @var string
     */
    protected $strObjectTagname = 'program';

    public function __construct($strFile)
    {
        $this->strFile = $strFile;
    }

    public function current()
    {
        return $this->program;
    }

    public function key()
    {
        return $this->nKey;
    }

    public function next()
    {
        // Only drop the current record; valid() loads the next one lazily.
        $this->program = null;
    }

    public function rewind()
    {
        // Re-open the file so that every foreach starts at the beginning.
        $this->reader = new XMLReader();
        $this->reader->open($this->strFile);
        $this->program = null;
        $this->nKey    = null;
    }

    public function valid()
    {
        if ($this->program === null) {
            $this->loadNext();
        }
        return $this->program !== null;
    }

    /**
     * Loads the next program
     *
     * @return void
     */
    protected function loadNext()
    {
        $strElementName = null;
        $bCaptureValues = false;
        $arValues       = array();
        // Stack of open tag names below the object tag; joined with "-"
        // it forms the array key, e.g. "version-stable".
        $arNesting      = array();

        while ($this->reader->read()) {
            switch ($this->reader->nodeType) {
            case XMLReader::ELEMENT:
                $strElementName = $this->reader->name;
                if ($bCaptureValues) {
                    if ($this->reader->isEmptyElement) {
                        // Empty elements emit no END_ELEMENT, so they must
                        // not be pushed onto the stack. They still get their
                        // full path as key, like every other element.
                        $arPath   = $arNesting;
                        $arPath[] = $strElementName;
                        $arValues[implode('-', $arPath)] = null;
                    } else {
                        $arNesting[] = $strElementName;
                        $arValues[implode('-', $arNesting)] = null;
                    }
                }
                if ($strElementName == $this->strObjectTagname) {
                    $bCaptureValues = true;
                }
                break;

            case XMLReader::TEXT:
                if ($bCaptureValues) {
                    $arValues[implode('-', $arNesting)] = $this->reader->value;
                }
                break;

            case XMLReader::CDATA:
                if ($bCaptureValues) {
                    $arValues[implode('-', $arNesting)] = trim($this->reader->value);
                }
                break;

            case XMLReader::END_ELEMENT:
                if ($this->reader->name == $this->strObjectTagname) {
                    // One complete record has been read: publish it and stop.
                    $this->program = $arValues;
                    ++$this->nKey;
                    break 2;
                }
                if ($bCaptureValues) {
                    array_pop($arNesting);
                }
                break;
            }
        }
    }
}
Some things are missing: namespace and attribute support, and handling of tags with the same name at different hierarchy levels (especially the main tag) or, more generally, tags that may show up several times (a repeated tag simply overwrites the earlier value under the same key). I didn't need any of that, so add it yourself if necessary.
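If you do need attributes or repeated tags, one possible way out (a different technique from the class above, not an extension of it) is to let XMLReader stream only up to each record and hand that single record to SimpleXML. The sketch below assumes PHP 5.5+ for generators; the function name `iteratePrograms`, the `id` attribute and the repeated `tag` elements are hypothetical and not part of the sample XML.

```php
<?php
/**
 * Hypothetical helper: yields every <program> element as a SimpleXMLElement.
 * Only one record at a time is expanded into a tree.
 */
function iteratePrograms(XMLReader $reader, $tagName = 'program')
{
    // Advance to the first start tag with the wanted name, if there is one.
    $found = false;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $tagName) {
            $found = true;
            break;
        }
    }
    while ($found) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            // readOuterXml() returns the markup of the current element only.
            yield new SimpleXMLElement($reader->readOuterXml());
        }
        // next($name) skips the current subtree and moves to the next node
        // with that name; it returns false at the end of the input.
        $found = $reader->next($tagName);
    }
}

// Usage with made-up data containing an attribute and repeated tags.
$xml = '<programs>'
     . '<program id="1"><name>Kate</name><tag>kde</tag><tag>editor</tag></program>'
     . '<program id="2"><name>gedit</name><tag>gnome</tag></program>'
     . '</programs>';

$reader = new XMLReader();
$reader->XML($xml);

$summary = [];
foreach (iteratePrograms($reader) as $program) {
    $tags = [];
    foreach ($program->tag as $tag) {
        $tags[] = (string) $tag;
    }
    $summary[] = $program['id'] . ':' . $program->name . ':' . implode('+', $tags);
}
$reader->close();

print_r($summary);
```

The price is that every record is briefly built as a small tree, which should be fine as long as a single record is small compared to the whole file.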
Changelog
- 2008-08-22
  - First version
- 2011-04-27
  - Support for empty elements
- 2015-12-13
  - CDATA support