At work I had the task of implementing the synchronization between an online shop and a commodity management system. The data exchange format was XML: one big XML file for all of the products (several thousand, each with dozens of attributes). The big question: how do I import the file in a way that is most convenient for me as a programmer, and without exceeding the machine's RAM when loading a 1 GiB file?
I personally prefer SimpleXML for everything XML-related in PHP, even for generating XML, although it was never primarily meant for that. The big problem is that SimpleXML, like DOM, builds the whole XML tree in memory. That's a no-go for large files.
So what's left? Yes, our old and rusty SAX-style streaming parsers. They are not really convenient, since you have to handle every event yourself (opening tags, closing tags, text sections etc.), but they read the XML file iteratively, so parsing huge files is no problem. PHP 5 ships XMLReader, which is a pull parser rather than a callback-based SAX parser: instead of registering handlers, you ask it for the next node. That is what I chose to use.
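To illustrate the pull style, here is a minimal sketch. It assumes a tiny in-memory sample; the tag names `programs`, `program` and `name` are made up for the example, and a real import would call `open()` on a file path instead of `XML()` on a string.

```php
<?php
// Sketch only: the sample markup and variable names are assumptions.
$xml = '<programs><program><name>Kate</name></program>'
     . '<program><name>gedit</name></program></programs>';

$reader = new XMLReader();
$reader->XML($xml);           // for a file on disk: $reader->open($path)

$names = array();
while ($reader->read()) {     // moves to the next node, false at end of input
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'name') {
        // readString() returns the text content of the current element
        $names[] = $reader->readString();
    }
}
$reader->close();
```

Only the current node is held in memory at any time, which is what makes this approach workable for very large files.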
On the other side, in my program that synced the data with the database, I wanted to have something dead simple, like a foreach loop. So the task was to combine XMLReader with PHP's Iterator interface.
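For anyone who has not implemented Iterator before, the sketch below logs which methods a foreach loop calls. `CallLogIterator` is a hypothetical helper written for this illustration only; it wraps a plain array instead of an XML file.

```php
<?php
// Hypothetical demo class: records each Iterator method that foreach invokes.
class CallLogIterator implements Iterator
{
    public $log = array();
    private $items;
    private $pos = 0;

    public function __construct(array $items)
    {
        $this->items = $items;
    }

    // The attribute lines avoid the PHP 8.1+ return type notice;
    // older PHP versions treat them as comments.
    #[\ReturnTypeWillChange]
    public function rewind()  { $this->log[] = 'rewind';  $this->pos = 0; }
    #[\ReturnTypeWillChange]
    public function valid()   { $this->log[] = 'valid';   return $this->pos < count($this->items); }
    #[\ReturnTypeWillChange]
    public function current() { $this->log[] = 'current'; return $this->items[$this->pos]; }
    #[\ReturnTypeWillChange]
    public function key()     { $this->log[] = 'key';     return $this->pos; }
    #[\ReturnTypeWillChange]
    public function next()    { $this->log[] = 'next';    $this->pos++; }
}

$it = new CallLogIterator(array('Kate', 'gedit'));
$seen = array();
foreach ($it as $key => $value) {
    $seen[$key] = $value;
}
// $it->log now shows the call sequence, starting with 'rewind'.
```

foreach calls rewind() once, asks valid() before every pass and calls next() after it. That makes valid() a convenient place to read the next record lazily from a stream.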
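Sample data (the element markup was not preserved, only the values): two programs, each with a name, license, category, description and two version numbers.
Kate: LGPL, Editor, "Nice KDE text editor", 3.5.9, 4.0.5
gedit: LGPL, Editor, "Standard gnome text editor", 2.22.3, 2.22.4-rc1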
The consuming code is as easy and beautiful as reading an XML file can get: a plain foreach loop over the iterator.
Here is the iterator code, in case you (or /me) need to do the same thing again.
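<?php
/**
 * Iterates over the objects in a large XML file, one at a time.
 *
 * The head of this class was lost from the post, so the class name and
 * the default value of $strObjectTagname are assumptions.
 */
class XmlObjectIterator implements Iterator
{
    protected $strFile;
    protected $strObjectTagname = 'program';
    protected $reader;
    protected $program;
    protected $nKey;

    public function __construct($strFile)
    {
        $this->strFile = $strFile;
    }

    // The #[\ReturnTypeWillChange] lines avoid the PHP 8.1+ return type
    // notice; older PHP versions read them as comments.
    #[\ReturnTypeWillChange]
    public function current()
    {
        return $this->program;
    }

    #[\ReturnTypeWillChange]
    public function key()
    {
        return $this->nKey;
    }

    #[\ReturnTypeWillChange]
    public function next()
    {
        // Only drop the current object; valid() loads the next one lazily.
        $this->program = null;
    }

    #[\ReturnTypeWillChange]
    public function rewind()
    {
        $this->reader = new XMLReader();
        $this->reader->open($this->strFile);
        $this->program = null;
        $this->nKey = null;
    }

    #[\ReturnTypeWillChange]
    public function valid()
    {
        if ($this->program === null) {
            $this->loadNext();
        }
        return $this->program !== null;
    }

    /**
     * Loads the next program
     *
     * @return void
     */
    protected function loadNext()
    {
        $strElementName = null;
        $bCaptureValues = false;
        $arValues = array();
        $arNesting = array();
        while ($this->reader->read()) {
            switch ($this->reader->nodeType) {
            case XMLReader::ELEMENT:
                $strElementName = $this->reader->name;
                if ($bCaptureValues) {
                    // Key each value by its path below the object tag.
                    $arNesting[] = $strElementName;
                    $arValues[implode('-', $arNesting)] = null;
                    // Self-closing tags emit no END_ELEMENT, so pop now.
                    if ($this->reader->isEmptyElement) {
                        array_pop($arNesting);
                    }
                }
                if ($strElementName == $this->strObjectTagname) {
                    $bCaptureValues = true;
                }
                break;
            case XMLReader::TEXT:
            case XMLReader::CDATA:
                if ($bCaptureValues) {
                    $arValues[implode('-', $arNesting)] = $this->reader->value;
                }
                break;
            case XMLReader::END_ELEMENT:
                if ($this->reader->name == $this->strObjectTagname) {
                    // Object complete: hand it over and stop reading.
                    $this->program = $arValues;
                    ++$this->nKey;
                    break 2;
                }
                if ($bCaptureValues) {
                    array_pop($arNesting);
                }
                break;
            }
        }
    }//protected function loadNext()
}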
A few things are missing: namespace and attribute support, and proper handling of tags that share a name at different hierarchy levels (especially one that matches the main tag) or that show up several times within one object, where later values overwrite earlier ones. I didn't need any of that, so add it yourself if necessary.
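As a starting point for attribute support, the sketch below collects the attributes of an element with XMLReader's cursor methods. The sample markup and the attribute names `id` and `license` are assumptions for illustration; this is not part of the iterator above.

```php
<?php
// Sketch only: the markup and attribute names are invented for the example.
$xml = '<programs><program id="1" license="LGPL">'
     . '<name>Kate</name></program></programs>';

$reader = new XMLReader();
$reader->XML($xml);

$attributes = array();
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT
        && $reader->name === 'program'
        && $reader->hasAttributes
    ) {
        // Step through the attributes of the current element one by one.
        while ($reader->moveToNextAttribute()) {
            $attributes[$reader->name] = $reader->value;
        }
        // Move the cursor back to the element that owns the attributes.
        $reader->moveToElement();
    }
}
$reader->close();
```

Inside the iterator, the same inner loop could run in the ELEMENT case and store the values under a key of its own, for example the element path plus an `@` prefix. That naming scheme is only a suggestion.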