With SPARQLing eyes

In mid-November 2006 I finally found the topic for my diploma thesis: taking RAP (RDF API for PHP) and writing a better SPARQL engine that scales well on big models and operates directly on the database, instead of filtering and joining millions of triples in memory.

Slow beginnings

I began working on RAP in November, fixing small bugs that prevented RAP from working on case-sensitive file systems or with "short open tags" set to off, as well as some other outstanding issues.

By mid-December, I had a first basic version of my SparqlEngineDb that could handle simple SELECT statements with OPTIONAL clauses as well as LIMIT and OFFSET parts. I had almost no time in the second half of December and the beginning of January 2007, since exams were casting their shadows.

On 18 January, I got the existing unit tests for the memory-based SparqlEngine running unmodified against my DB engine. The first 10 or 15 of 140 unit tests passed - the most basic ones.

Specs & Order

Four days later, I hit a crisis when trying to implement ORDER BY support that fully adheres to the specs. In SPARQL, result variables may hold values of different categories and datatypes: literals, resources and blank nodes; strings, dateTime values, booleans and whatnot. The standard explicitly says that blank nodes sort before IRIs, which in turn sort before RDF literals. The different datatypes also have a specific order and, as if that were not enough, need to be cast according to their RDF datatype to sort correctly in SQL (e.g. "09" is numerically greater than "1", but since the value is stored as a blob, MySQL needs a CAST before it recognizes that). While this is easy for integers, there are also doubles, booleans and dateTime values, and each of them needs a different casting function - which forced me to split the query into multiple queries that each retrieve only values of a certain datatype:

   SELECT t0.object FROM statements as t0 ORDER BY t0.object
  

needs to be split:

   SELECT t0.object FROM statements as t0
   WHERE t0.l_datatype = "http://www.w3.org/2001/XMLSchema#integer"
   ORDER BY CAST(t0.object as INT)
   
   SELECT t0.object FROM statements as t0
   WHERE t0.l_datatype = "http://www.w3.org/2001/XMLSchema#boolean"
   ORDER BY CAST(t0.object as BOOL)
   
   ... not to speak of datetime values
  

The most natural thing to do now would be to build one huge UNION query and fetch all results at once. Wrong. UNION is a set operation, which means the order of its results is undefined! I can order the individual subqueries as nicely as I want; the combined result stays unordered unless, by coincidence, the database's memory happens to be in a state that returns ordered results. So my only option was to create separate SQL queries, send them to the server one after another and join the results on the client side - not the best option performance-wise, but nobody I discussed it with had a better idea. (It is possible to put an ORDER BY outside the UNION clauses and sort by parts of the union, but that would require me to generate what I called, in a CVS commit message, "queries of death".)
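The client-side join is cheaper than it may sound, because each per-datatype query already comes back sorted by the database. A minimal sketch of the idea (my own illustration, not RAP's actual code): concatenating the pre-sorted partitions in the spec-defined order - blank nodes, then IRIs, then literals grouped by datatype - yields a fully ordered result without any UNION.

```php
<?php
// Sketch only, not RAP's real code: each partition is the result of one
// per-datatype SQL query and is already sorted by that query's ORDER BY.
// Concatenating the partitions in the spec-defined order gives the total
// order that a single UNION query could not guarantee.
function mergeOrderedPartitions(array $partitions)
{
    $result = array();
    foreach ($partitions as $rows) {
        // no re-sorting needed - the database sorted each partition
        $result = array_merge($result, $rows);
    }
    return $result;
}

// Example: blank nodes first, then IRIs, then integer literals
// (the integers were sorted numerically by CAST in their own query)
var_export(mergeOrderedPartitions(array(
    array('_:b1'),
    array('http://example.org/a'),
    array('1', '09', '10'),
)));
```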

Having multiple queries return data forced me to write workaround code for another part: OFFSET and LIMIT. While transforming SPARQL OFFSET and LIMIT clauses into SQL is trivial, it no longer is when your data is distributed over multiple result sets. So another class saw the light of my hard disk: the offset engine.
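The core bookkeeping such an offset engine has to do can be sketched in a few lines (a simplified illustration of the problem, not RAP's actual class): the global OFFSET must be consumed across result-set boundaries before the LIMIT starts counting.

```php
<?php
// Illustration only - not RAP's real offset engine. A single SQL
// "LIMIT x OFFSET y" no longer works when the rows arrive split over
// several result sets, so the offset and limit are tracked globally.
function applyOffsetLimit(array $resultSets, $offset, $limit)
{
    $rows = array();
    foreach ($resultSets as $set) {
        foreach ($set as $row) {
            if ($offset > 0) {
                --$offset;      // still skipping rows from earlier sets
                continue;
            }
            if ($limit !== null && count($rows) >= $limit) {
                return $rows;   // limit reached, stop early
            }
            $rows[] = $row;
        }
    }
    return $rows;
}

// OFFSET 1 LIMIT 3 over two result sets: skips 1, returns 2, 3, 4
var_export(applyOffsetLimit(array(array(1, 2), array(3, 4, 5)), 1, 3));
```

In a real engine part of the skipping can be pushed down into the LIMIT clauses of the individual SQL queries; this sketch only shows the client-side accounting.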

Preparing to be fast

Since my SPARQL engine is to be used in POWL (the base for OntoWiki), we ran the first tests converting the POWL API to use SPARQL instead of direct SQL calls - this makes it easy to switch data backends. One problem was performance: while speed increased greatly with my new database-driven SPARQL engine, we were still way too slow to use OntoWiki properly - the POWL API generates and executes up to several hundred SPARQL queries to build a single page, and parsing all those queries took quite some time.

Prepared statements are the way to go in such a case, and that is the way I went. The SPARQL recommendation currently defines nothing in this direction, so I had to come up with a solution myself. Within a week, I had prepared statements for SPARQL implemented and working well.
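The principle behind the speedup can be shown with a mock (class and placeholder syntax are my own illustration, not RAP's actual API): the expensive parsing happens once when the statement is prepared, and each execution only substitutes the bound values.

```php
<?php
// Mock illustration of prepared statements for SPARQL - not RAP's real
// classes. prepare-once/execute-many avoids re-parsing the query text
// on every call, which is where the engine spent most of its time.
class MockSparqlStatement
{
    private $parts;

    public function __construct($query)
    {
        // the expensive step, done once: split at the '??' placeholders
        // (a real engine would parse the query and build the SQL here)
        $this->parts = explode('??', $query);
    }

    public function execute(array $values)
    {
        // the cheap step, done per call: interleave parts and values
        $query = $this->parts[0];
        foreach ($values as $i => $value) {
            $query .= $value . $this->parts[$i + 1];
        }
        return $query;  // a real engine would now run this query
    }
}

$stmt = new MockSparqlStatement('SELECT ?o WHERE { ?s ?p ?? }');
echo $stmt->execute(array('"42"')), "\n";
```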

The performance boost is dramatic: a simple query repeated 1000 times takes 3 seconds instead of 12 when using prepared statements - and that is without native prepared statements at the database driver level. ADOdb's mysqli driver currently does not support native prepared statements; with them, we will get yet another performance boost.

Filter

After the DAWG, sort and limit test cases passed, it was time to move on to the filter code, one of the big features still missing. Examining the filter code of the memory-based SparqlEngine, I found that it extracts the whole FILTER clause from a SPARQL query, applies some regular expressions to it and uses evil eval() to execute it as plain PHP code. After five minutes I had a working exploit that deletes all files on the web server a RAP SparqlEngine runs on - there were no validity or sanity checks on the regexed PHP code; it was simply executed in the hope that nothing would go wrong.

This approach may work for PHP, but not for SQL - and I didn't want to open another barn-door-wide hole by lending a hand to SQL injection attacks. So I sat down and extended the SparqlParser to fully parse FILTER clauses and put them into a nice tree. I tried several ways of building the filter parser: using a parser generator, writing it by hand by iterating over all characters, and so on. In the end, my do-it-yourself approach went in the same direction as the existing parser, and I finally understood how it worked and why it had been written that way.
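A small sketch of why the tree matters (my own simplified illustration - RAP's real tree classes look different): once the FILTER expression is a tree of known operators and values, it can be translated into an SQL condition with a whitelist of operators and escaped literals, instead of eval()ing untrusted text.

```php
<?php
// Illustration only, not RAP's actual code. A node is either a leaf
// (a '?variable' or a literal string) or array(operator, left, right).
// Only whitelisted operators are emitted; literals are escaped, never
// executed - the opposite of the old regex-plus-eval() approach.
function filterTreeToSql($node)
{
    if (!is_array($node)) {
        if ($node !== '' && $node[0] === '?') {
            // variable leaf: map to a column (simplified to one table)
            return 't0.' . substr($node, 1);
        }
        // literal leaf: escape instead of executing
        return "'" . addslashes($node) . "'";
    }
    list($op, $left, $right) = $node;
    $allowed = array('=' => '=', '!=' => '<>', '<' => '<', '>' => '>',
                     '&&' => 'AND', '||' => 'OR');
    if (!isset($allowed[$op])) {
        throw new Exception('Unknown filter operator: ' . $op);
    }
    return '(' . filterTreeToSql($left) . ' ' . $allowed[$op]
         . ' ' . filterTreeToSql($right) . ')';
}

// FILTER (?name = "Alice" || ?name = "Bob")
echo filterTreeToSql(array('||',
    array('=', '?name', 'Alice'),
    array('=', '?name', 'Bob'),
)), "\n";
```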

In the following weeks I implemented FILTER support and had it nearly fully working when I stumbled across UNIONs in the unit tests' SPARQL queries. I had almost forgotten about them, but now I needed to implement them. I pondered whether to write a poor man's solution that would handle the most common cases, or a full-fledged version that would require changes to half of my code. After seeing that POWL needs to generate queries that would not work the cheap way, I did the full work.

UNIONited in pleasure

Today, 28 April 2007, I got the following line when running the unit tests for SparqlEngineDb:

   Test cases run: 1/1, Passes: 140, Failures: 0, Exceptions: 0
  

After two months of working on it now and then, and three months of working nearly full-time on the engine, my SPARQL engine now passes all tests and fully implements the current specs. Yay!


My next and last task is to implement some extensions to SPARQL, such as aggregation support. After that, I'll write everything up and will (hopefully) be done with my diploma thesis.

Written by Christian Weiske.

Comments? Please send an e-mail.