PEARhd steaming on

My last weeks have mostly been spent - beside work and normal housekeeping tasks - on getting PEAR's documentation system (peardoc) to build on PhD, PHP's very own DocBook rendering system.

PhD, initiated as a $evilsearchenginename summer of code project, is a fully PHP-based tool to convert the documentation for PHP, which is written in Docbook 5, into XHTML, PDF and manpages. The reason for PhD to exist was that the previously used DSSSL-based system was slow: a full build (all formats and all languages) took 24 hours to complete. Further, the tools the system was based on were old and rusty, and nobody understood why they broke on some machines but worked on others. Having a PHP-based system for PHP ensures that there is always someone around who can fix it if it's broken. This wasn't the case with the old documentation build system.

In PEAR and peardoc, we relied on the same tools. The structure was a bit different, as were the styles, but the foundation was the same. This didn't bother me much until several months ago when - all of a sudden - peardoc wouldn't build for me anymore either. We had had some reports that people couldn't get it working on their machines, but for most of us it just worked. Until now.

For someone feeling responsible for PEAR's documentation, not being able to build the docs is a serious problem. So after having followed the trickle of lonely mailing list posts about peardoc and PhD over the last year, I finally took the time to fully convert peardoc to shiny new PhD.

P H what?

The first "issue" to solve was getting PhD actually working on my system. PhD releases are installable via its very own PEAR channel, but version 0.2 was too outdated compared to the state in CVS. The CVS version needed PHP 5.3 - not yet released - so I had to install that from CVS HEAD. It wasn't hard, and PhD worked on phpdoc.

Conversion to Docbook 5

The most far-reaching property of PhD is that it works on Docbook 5 files only. phpdoc had been converted from Docbook 4 to 5 already, and now I had to do the same with peardoc. docbook.org offers a db4-upgrade.xsl upgrade script, but it has several flaws:

It uses XSLT, which is really fine for transforming XML. Unfortunately, parsing the XML loads and expands all entities. Since we use entities to include the thousands of files in peardoc, the script could only be used to convert the whole manual into one large file.
The xsl script does not do everything right. Table title tags are converted like normal title tags and put into an info section, which is not allowed in tables.

Luckily, Brett had already written a script that took the single XML files, escaped entities into comments, did the same for CDATA sections and piped that prepared data through the conversion script. It had a flaw: pages with multiple CDATA sections kept only the contents of the first one after transformation, but that was fixed easily. It also failed to convert charsets, so I ended up having mixed utf-8 and iso-8859-1 chars in the same file at first.
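The escaping trick is easier to see in code. What follows is a minimal Python sketch of the idea, not Brett's actual script: the comment markers (`ENTITY:` and `CDATA:`), the function names and the base64 encoding of CDATA contents are all assumptions made for illustration. Base64 is used because a CDATA section may contain "--", which is not allowed inside an XML comment. The sketch also assumes the XSLT step copies comments through unchanged, and it does not handle entities inside attribute values.

```python
import base64
import re

# The five entities every XML parser knows; they can stay as they are.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)
ENTITY_RE = re.compile(r"&([A-Za-z_][A-Za-z0-9_.-]*);")
CDATA_MARK_RE = re.compile(r"<!--CDATA:([A-Za-z0-9+/=]*)-->")
ENTITY_MARK_RE = re.compile(r"<!--ENTITY:([A-Za-z_][A-Za-z0-9_.-]*)-->")


def protect(xml_text):
    """Hide CDATA sections and custom entities in comments before XSLT."""

    def hide_cdata(match):
        payload = base64.b64encode(match.group(1).encode("utf-8"))
        return "<!--CDATA:" + payload.decode("ascii") + "-->"

    def hide_entity(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)
        return "<!--ENTITY:" + name + "-->"

    # CDATA first, so entity-like text inside CDATA is left alone.
    # sub() replaces every section, not only the first one.
    text = CDATA_RE.sub(hide_cdata, xml_text)
    return ENTITY_RE.sub(hide_entity, text)


def restore(xml_text):
    """Turn the marker comments back into entities and CDATA sections."""

    def show_cdata(match):
        content = base64.b64decode(match.group(1)).decode("utf-8")
        return "<![CDATA[" + content + "]]>"

    text = ENTITY_MARK_RE.sub(lambda m: "&" + m.group(1) + ";", xml_text)
    return CDATA_MARK_RE.sub(show_cdata, text)
```

The intended use is `restore(transform(protect(source)))`, where `transform` stands for whatever runs the upgrade stylesheet on a single file.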

I spent three days tweaking the xsl script until the converted files satisfied xmllint. Unfortunately, my newly written configure.php told me that there were still more subtle errors that broke validation against the Docbook 5 DTD. We were using lines like

<parameter>$mode = &true;</parameter>

a lot. &true;, &false; and &null; were replaced with <constant>(true|false|null)</constant> - so we had a <constant> tag in <parameter>, which is not allowed by Docbook 5. Since the entities should be kept, using xslt to transform them away was not an option. I had to add fixes to the conversion script, which slowly grew into a small monster. After spending a day working full time on the conversion, the English version validated fully against the DTD.
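Finding these spots before validation complains is straightforward. The following Python sketch only detects the problem, it does not fix it, and it is not part of the conversion script described here; the function name is made up for this example. It works on the raw file text, so the entities are still unexpanded.

```python
import re

PARAMETER_RE = re.compile(r"<parameter\b[^>]*>(.*?)</parameter>", re.DOTALL)
CONSTANT_ENTITY_RE = re.compile(r"&(true|false|null);")


def find_constant_entities_in_parameters(xml_text):
    """Return (line number, entity) pairs for &true;, &false; and &null;
    that appear inside a <parameter> element."""
    hits = []
    for param in PARAMETER_RE.finditer(xml_text):
        for entity in CONSTANT_ENTITY_RE.finditer(param.group(1)):
            # Absolute offset of the entity within the whole file.
            offset = param.start(1) + entity.start()
            line = xml_text.count("\n", 0, offset) + 1
            hits.append((line, entity.group(0)))
    return hits
```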

P H fast!

Now that the XML was shiny, too, it was time to actually use PhD on it. The numbers were amazing: While a build for one format and one language took around 40 minutes with the old system on my machine (dual core Macbook with 2GHz and 2GiB RAM), building the same with PhD takes 45 seconds - more than 50 times faster!

Having a fast build system is essential: When a newbie translator/documentor writes his first manual page, he doesn't know much about docbook, its tags and so on. But he wants to see something, even if it's only a clear message telling him what he did wrong. Since it was so hard to set up the old build system with DSSSL, people committed files that had not been tested at all - the build broke, and if the commit happened on a Saturday or Sunday morning, the weekly manual rebuild on the live server was broken.

While I really hope that the new build system lowers the entry barrier for package developers to write nice documentation, the experiences of the phpdoc people are sobering: No new documentors appeared, and some old ones have not even been seen since.

Language support

PhD itself only swallows a huge xml file and spits out the desired format and theme, be it chunked xhtml, a big pdf page or files for pearweb. The translation system in peardoc (and phpdoc) works with entities: There is one large chapters.ent file that contains entities for all files in peardoc. When you want a different language, the entities need to reference different files, for example ja/package/mail.xml instead of en/package/mail.xml.

This is one of the things PhD does not do itself (although it's planned as PhDsetup). We need a config script which does it. While this started out as a file of 11 lines of PHP code, it has grown to 216 lines at the time of writing, with full command help, docblocks and all.

configure.php currently does three tasks:

Creating chapters.ent for the selected language, falling back to the English version if there is no translation for a file.
Creating a giant xml file out of the single ones. This is not necessary, but it speeds up PhD quite a lot.
Validating the manual against the Docbook 5 DTD.
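The first of these tasks, the language fallback, can be sketched in a few lines. This is a Python illustration and not the real configure.php: the function name, the entity naming scheme (path with dots instead of slashes) and the directory layout with one folder per language are assumptions based on the en/package/mail.xml example above.

```python
import os


def build_chapters_ent(doc_root, lang, fallback="en"):
    """Build the text of a chapters.ent file for one language.

    Every .xml file of the fallback language gets an entity. It points
    to the translated file if one exists, else to the fallback file.
    """
    lines = []
    fallback_dir = os.path.join(doc_root, fallback)
    for dirpath, dirnames, filenames in os.walk(fallback_dir):
        dirnames.sort()  # make the walk order deterministic
        for filename in sorted(filenames):
            if not filename.endswith(".xml"):
                continue
            full = os.path.join(dirpath, filename)
            rel = os.path.relpath(full, fallback_dir).replace(os.sep, "/")
            translated = os.path.join(doc_root, lang, *rel.split("/"))
            chosen = lang if os.path.exists(translated) else fallback
            # Hypothetical naming: package/mail.xml becomes package.mail
            entity = rel[: -len(".xml")].replace("/", ".")
            lines.append('<!ENTITY %s SYSTEM "%s/%s">' % (entity, chosen, rel))
    return "\n".join(lines) + "\n"
```

Walking the English tree rather than the translated one is what makes the fallback work: a file that exists only in English still gets its entity.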

So after language selection was implemented, I could validate the translations, which went relatively smoothly. Now the moment had come: I fully cleared peardoc's cvs module (after tagging the old state of course) and committed the Docbook 5 based files.

We're done! Well, not yet..

Now that we could tan ourselves in the reflections of shiny PEARhd (peardoc + PhD), thoughts drifted and I remembered the problems of the old xml structure. One of the biggest problems was that every package category was its own chapter, and the packages themselves had only section tags available to use.

This is a real problem since one could not properly structure a package's documentation. Also, integration of external documentation was nearly impossible. For example, Laurent Laville wrote TDGs for his packages - full <book>s in docbook format. But since they were books and not sections, there was no way to include them into peardoc.

So the xml files themselves needed to be restructured: Every package should get its own <book> tag. This transition was really daunting. It took a whole week with three new conversion scripts and a lot of manual fixing. I did what I could, but without the help of David and Ken, the French and Japanese translations still would not be done now.

While working to get the translations to build, I came across a problem that phpdoc solved with entity files: Package category pages in translations are often not updated when a new package has been documented, leaving the translated documents without even the English version of the package manual. Now, we have $category-entities.xml files for each category. They contain the list of package entities for that category and are shared between all translations. We should do the same thing in the package docs themselves, but that's yet to be done.

A look ahead

There are still many things left to do. The manual itself needs restructuring to make it easier to find answers to questions like "What is PEAR?", "Do I need to recompile PHP to use PEAR?" and so on. Currently, the manual starts with the developers' guide, which is not what most people expect.

Another task is thinking about renaming ids. xml:id attributes in sections, chapters and books determine their name in the chunked (multi-file) versions of the rendered manual. It would be cool to have all classes available under pear.php.net/manual/class.$classname.php, and packages as package.$packagename.php - without all the category fuss. The category structure is needed internally, but not for the generated files.
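To get a feeling for which file names a given set of ids would produce, a small listing helper is enough. The Python sketch below is an illustration only: it assumes that exactly books, chapters and sections with an xml:id become chunks and that the file name is simply the id plus ".php". PhD's real chunking rules may differ.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
DOCBOOK_NS = "{http://docbook.org/ns/docbook}"
# Assumption: only these elements produce their own file.
CHUNK_TAGS = {"book", "chapter", "section"}


def chunk_filenames(xml_source, extension=".php"):
    """List the file names implied by xml:id attributes, in document order."""
    root = ET.fromstring(xml_source)
    names = []
    for elem in root.iter():
        local_name = elem.tag
        if local_name.startswith(DOCBOOK_NS):
            local_name = local_name[len(DOCBOOK_NS):]
        chunk_id = elem.get(XML_ID)
        if local_name in CHUNK_TAGS and chunk_id:
            names.append(chunk_id + extension)
    return names
```

Running this over the manual before and after a renaming would show exactly which URLs change.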

We also need to find a way to put examples into their own files so we don't need to copy&paste them into xml. This would allow us to easily run the examples without extracting them first, and even to e.g. automatically pull package examples from the manual into package releases. Including external files can be done using xinclude, and I've already done this in php-gtk-doc.
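How a text xinclude pulls an example file into a listing can be shown with Python's standard library. This is a sketch and not the php-gtk-doc setup: the file name examples/hello.php, its contents and the in-memory loader are made up so that the example runs without touching the disk.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# A manual fragment that references the example instead of containing it.
MANUAL_PAGE = (
    '<programlisting xmlns:xi="http://www.w3.org/2001/XInclude" role="php">'
    '<xi:include href="examples/hello.php" parse="text"/>'
    '</programlisting>'
)

# Stand-in for real files on disk (hypothetical content).
EXAMPLE_FILES = {"examples/hello.php": "<?php\necho 'Hello PEAR';\n"}


def load_example(href, parse, encoding=None):
    """Loader callback for ElementInclude: return the example source."""
    if parse != "text":
        raise ValueError("only text includes are supported in this sketch")
    return EXAMPLE_FILES[href]


def resolve_examples(xml_source):
    """Parse the fragment and replace xi:include elements with file text."""
    root = ET.fromstring(xml_source)
    ElementInclude.include(root, loader=load_example)
    return root
```

With `parse="text"` the included file is inserted as plain character data, so the PHP code needs no escaping in the example file itself and stays directly runnable.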

Also on the TODO list is to make it easy to link to the API docs. The manual should give an overview of a package, show examples and explain how things work. It is not the place to tell the user about function parameters and return values, but should instead link to the API doc files. Currently, linking to methods or classes is really hard, and that needs to be made really simple.

PhD?

While the things I wrote here seem to suggest that everything is done and we only need to polish things a bit, that's wrong. PhD renders only the tags that are used in phpdoc, and it generates HTML that is not XHTML and not valid. It gets the TOCs in peardoc wrong. I've already been fixing things, but there's quite a bunch of work left. We also need to get the new build system set up on pear.php.net, which currently does not update the manual anymore. Our documentation coverage tool needs to be updated. We need to get CHM compilation working. And ...

No, we're not done yet. But peardoc made a great leap forward, and we're steadily getting closer to 100%.

Ah, and many thanks to Hannes for all his help on #php.doc!
