Christians Tagebuch: sparql

The latest posts in full-text for feed readers.


Extracting birth dates from Wikipedia

For the demo installation of bdrem (birthday reminder tool) I needed a list of birth dates, preferably public ones.

Finding a source

The largest free source of person data is Wikipedia, so I looked there for a list of persons. Besides lists, Wikipedia also has a List of lists. Its people category was what I was looking for; by drilling down three times I found the list of notable German scientists.

Nearly every linked scientist's article has a "Born" field in the metadata box on the right. Now the question was how to extract those data without manually parsing all the HTML or MediaWiki markup.

DBpedia

I remembered DBpedia from the time I wrote my diploma thesis; it is a database containing all the metadata from Wikipedia, updated in near real time.

DBpedia is a SPARQL database (triple store) and has a public SPARQL query page.

But what should I use as query? What are the fields I can use?

Finding properties

The datasets page linked from the main page links to some resource pages, e.g. dbpedia.org/resource/Berlin.

I simply replaced Berlin in the URL with one of the scientists, Alexander_von_Humboldt, and got the resource page. There I saw the properties I was interested in:

dcterms:subject     category:German_scientists
dbpprop:dateOfBirth 1769-09-14
dbpprop:name        Alexander von Humboldt

SPARQL

SPARQL is a bit like SQL (SELECT, WHERE, LIMIT), but the conditions are triple patterns: subject, predicate, object. Knowing this and the properties above gave me the following query:

SELECT DISTINCT ?Name ?BirthDate
WHERE {
    ?Scientist dcterms:subject category:German_scientists.
    ?Scientist dbpprop:birthDate ?BirthDate.
    ?Scientist dbpprop:name ?Name.
}
LIMIT 100

And voilà - I had a list of scientists with their names and birth dates. The DBpedia SPARQL page also offers CSV export, which gave me:

"Name","BirthDate"
"Burkhard Rost",1961-07-11
"Rost, Burkhard",1961-07-11
"Victor Gustav Bloede",1849-03-14
"Bloede, Victor G",1849-03-14
"Wilhelm Körner",1839-04-20
...
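
For readers who want to fetch that CSV themselves, here is a rough sketch in Python (the post's tools are PHP, but the idea is language-independent). The endpoint URL and the format parameter are assumptions based on DBpedia's public SPARQL page and may differ between Virtuoso versions:

```python
import csv
import io
from urllib.parse import urlencode

# Endpoint URL and parameter names are assumptions based on DBpedia's
# public SPARQL page; they may differ between Virtuoso versions.
ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """SELECT DISTINCT ?Name ?BirthDate
WHERE {
    ?Scientist dcterms:subject category:German_scientists.
    ?Scientist dbpprop:birthDate ?BirthDate.
    ?Scientist dbpprop:name ?Name.
}
LIMIT 100"""

# Build the request URL that any HTTP client could then fetch.
url = ENDPOINT + "?" + urlencode({"query": QUERY, "format": "text/csv"})

def parse_birthdays(csv_text):
    """Turn the endpoint's CSV export into (name, birth date) tuples."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the "Name","BirthDate" header row
    return [(row[0], row[1]) for row in reader if len(row) >= 2]

# Two of the sample lines from above:
sample = '"Name","BirthDate"\n"Burkhard Rost",1961-07-11\n"Wilhelm Körner",1839-04-20\n'
```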

bdrem

bdrem version 0.6 gained support for CSV files. The following configuration makes it display the scientists' birth dates:

$source = array(
    'Csv',
    array(
        'filename' => 'german-scientists.csv',
        'columns' => array(
            'name'  => 0,
            'event' => false,
            'date'  => 1
        ),
        'defaultEvent' => 'Birthday',
    )
);

bdrem now renders the following output on the shell:

$ ./bin/bdrem.php
--------------------------------------------------------
Days  Age  Name                Event     Date        Day
--------------------------------------------------------
  -2  103  Braun, Wernher von  Birthday  23.03.1912  Mo
  10  144  Arthur Wehnelt      Birthday  04.04.1871  Sa
  10  144  Wehnelt, Arthur     Birthday  04.04.1871  Sa
  13   76  Bernd Brinkmann     Birthday  07.04.1939  Di
  13   76  Brinkmann, Bernd    Birthday  07.04.1939  Di

Published on 2015-03-25


Building the Semantic Web with PHP: The slides

My talk Utilizing and building the Semantic Web with PHP at FrOSCon 2010 is over. The slides might be of interest to you.

Published on 2010-08-21


Building the Semantic Web with PHP @ FrOSCon 2010

During the weekend of the 21st-22nd of August, I'll be visiting our beloved FrOSCon. As in previous years, there will be a dedicated PHP track, and I will be giving the talk Utilizing and building the Semantic Web with PHP. It is on Saturday the 21st at 14:00.

See you there!

Published on 2010-08-04


Semantic templates with LESS

I just gave a small talk about LESS at the Leipzig Semantic Web Day.

LESS is a semantic template engine that turns data from RDFa and SPARQL endpoints into HTML, so they can be displayed on your website, blog, or anywhere else.

The main part of the talk was a demonstration; the slides can be downloaded here.

Published on 2010-05-06


PHP Unconference 2008 in Hamburg

This weekend (26/27 April) I was in Hamburg for the PHP Unconference 2008, organized by the PHP Usergroup Hamburg. Tobias Struckmeier was kind enough to put me and at least six other people up for the weekend, which spared me the annoying hotel search or a night at the university. It also made for a longer evening; at 1 a.m., after the ride home from the Recycel Bar, we sat in front of our laptops showing each other cool self-built software :)

On Saturday, in the opening session, the sessions were decided - nicely grassroots-democratic, by the number of attendee votes. Not enough people (<8) wanted to hear my talk about PEAR2; the offered SPARQL talk got a full 9 points and was put into the last slot on Sunday at 16:00.

The free time thus "won" I could fill nicely by sitting in on the other sessions and picking up something new here and there. The great thing about the unconference is really - as everywhere - the conversations between the talks, of which there were plenty of interesting ones.

However, I felt ever more confirmed in the feeling of being active in the wrong fields: the PEAR talk was not accepted, the PEAR installer book was drawn second to last in the raffle, and a whole three people wanted to hear my SPARQL talk on Sunday afternoon... After the presentation of DBpedia, though, an interesting conversation developed about the availability of data about people. The data exist on the net anyway, but only ... organizations have the means to merge them. The conclusion: for everyone to have the same chances, information must be available and free for all. Someone always has access to it - so why not everyone?

In summary, it was a very interesting and enjoyable weekend. When, at the end, we were asked to submit criticism and suggestions for improvement, absolutely nothing came to mind - perfect!

Published on 2008-04-28


SPARQL Engines Benchmark Results

Some time ago I published the results of benchmarks I did for my diploma thesis. Since then, I have put Virtuoso on my list of competitors; furthermore, I added three new queries that are not as artificial as the previous ones, as they try to resemble queries used in the wild.

The competitors compared are:

  • RAP's old SparqlEngine
  • RAP's new SparqlEngineDb I wrote as part of my thesis
  • ARC, another PHP implementation made for performance (2006-10-24)
  • Jena SDB, a semantic web framework written in Java (beta 1)
  • Redland, a C implementation (1.0.6)
  • Virtuoso Open Source Edition 5.0.1

The tests were run on an Athlon XP 1700+ with 1 GiB of RAM; both PHP and Java were assigned 768 MiB of RAM. MySQL 5.0.38 on a current Gentoo Linux was used together with PHP 5.2.3 and Java 1.5.0_11. All the libraries except Virtuoso support MySQL as storage, so it was used as the backend.

I used the data generator of the Lehigh University Benchmark to obtain 200,000 RDF triples. Those triples were imported into a fresh MySQL database using the libraries' native tools.

To cut out the time spent loading classes or parsing the PHP files, I created a script that first included all the necessary files and then executed each query ten times in a row against the library. Measuring the time between the call and the return of the query function in milliseconds, I ran all queries against different database sizes: from 200,000 over 100,000, 50,000, ... down to 5 triples. All result data have been put into some nice diagrams.
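
The timing procedure can be sketched roughly like this (a Python illustration, not the original PHP harness; run_query is a stand-in for the library call being measured):

```python
import time

def run_query(query):
    """Stand-in for the library's query function being benchmarked."""
    return sum(i * i for i in range(10000))

def benchmark(query, runs=10):
    """Time `runs` consecutive executions of the query, in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query(query)
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings

# One cell of the result matrix: one query, one database size, ten runs.
timings = benchmark("SELECT ...", runs=10)
```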

Library notes

Jena needed some special care, since the first run was always slow - probably because the JVM needed to load all the classes during that run. So Jena got a dry run first, and the times were taken from the ten runs afterwards.

ARC didn't like the ?o2 != ?o1 part and threw an error. The complex queries returned a plain false after a few milliseconds; I assume something failed internally.

Redland was used through its PHP bindings. While this presumably makes it slower, I found that it seems to have a bug in librdf_free_query_results() that causes delays of up to 10 seconds, depending on the dataset size. In my benchmark script I did not call this method, in order to give the library a chance against the others. Had I freed the results after each query, librdf would have come in second to last.

Since I did not get the ODBC drivers working correctly, I used the isql program shipped with Virtuoso to benchmark the server. Virtuoso also had a bug in its regex handling, so I have no timings for those queries.

Results

The first set of SPARQL queries was chosen to be data independent, each concentrating on a single SPARQL feature. Three additional queries were created to see how the engines behave on complex queries found in the real world. The y axis is a logarithmically scaled time axis in seconds; the x axis shows the number of records in the database.

SELECT

Cross joining all triples

Regular expressions

ORDER BY

Complex queries

  • 7 connected triples
    
    PREFIX test: 
    SELECT
     ?univ ?dpt ?prof ?assProf ?publ ?mailProf ?mailAssProf
    WHERE {
     ?dpt test:subOrganizationOf ?univ.
     ?prof test:worksFor ?dpt.
     ?prof rdf:type 
  • OPTIONAL
    
    PREFIX test: 
    SELECT ?publ ?p1 ?p1mail ?p2 ?p2mail ?p3 ?p3mail
    WHERE {
     ?publ test:publicationAuthor ?p1.
     ?p1 test:emailAddress ?p1mail
     OPTIONAL {
      ?publ test:publicationAuthor ?p2.
      ?p2 test:emailAddress ?p2mail.
      FILTER(?p1 != ?p2)
     }
     OPTIONAL {
      ?publ test:publicationAuthor ?p3.
      ?p3 test:emailAddress ?p3mail.
      FILTER(?p1 != ?p3 && ?p2 != ?p3)
     }
    }
    LIMIT 10
  • UNION
    
    PREFIX test: 
    SELECT ?prof ?email ?telephone
    WHERE {
     {
      ?prof rdf:type test:FullProfessor.
      ?prof test:emailAddress ?email.
      ?prof test:telephone ?telephone
     }
     UNION
     {
      ?prof rdf:type test:Lecturer.
      ?prof test:telephone ?telephone
     }
     UNION
     {
      ?prof rdf:type test:AssistantProfessor.
      ?prof test:emailAddress ?email
     }
    }
    LIMIT 10

Average timings

Conclusion

Jena was the only engine besides SparqlEngineDb that executed all queries. ARC is not as fast as expected and failed on nearly half of the queries. Redland is reasonably fast, although I expected it to gain more performance given that it is written in plain C. Virtuoso, the only commercially developed product, is the fastest of all engines. But here and there other engines were faster, which is nice to see :) And my SparqlEngineDb - I think it's pretty good, although the benchmark has shown enough points at which it can be made better and faster.

Published on 2007-10-05


SPARQLer's best choice: SparqlEngineDb

For my diploma thesis I ran some benchmarks to compare my SparqlEngineDb implementation to some other implementations. The competitors were:

  • RAP's old SparqlEngine
  • ARC, another PHP implementation made for performance (2006-10-24)
  • Jena SDB, a semantic web framework written in Java (alpha 2)
  • Redland, a C implementation (1.0.6)

The tests were run on an Athlon XP 1700+ with 1 GiB of RAM; both PHP and Java were assigned 768 MiB of RAM. MySQL 5.0.38 on a current Gentoo Linux was used together with PHP 5.2.2 and Java 1.5.0_11. All the libraries support MySQL as storage, so it was used as the backend.

Without going into the same detail as in my thesis, here is some more information:

I used the data generator of the Lehigh University Benchmark to obtain 200,000 RDF triples. Those triples were imported into a fresh MySQL database using the libraries' native tools.

To cut out the time spent loading classes or parsing the PHP files, I created a script that first included all the necessary files and then executed each query ten times in a row against the library. Measuring the time between the call and the return of the query function in milliseconds, I ran all queries against different database sizes: from 200,000 over 100,000, 50,000, ... down to 5 triples. All result data have been put into some nice diagrams.

Library notes

Jena needed some special care, since the first run was always slow - probably because the JVM needed to load all the classes during that run. So Jena got a dry run first, and the times were taken from the ten runs afterwards.

ARC didn't like the ?o2 != ?o1 part and threw an error.

Redland was used through its PHP bindings. While this presumably makes it slower, I found that it seems to have a bug in librdf_free_query_results() that causes delays of up to 10 seconds, depending on the dataset size. In my benchmark script I did not call this method, in order to give the library a chance against the others. Had I freed the results after each query, librdf would have come in second to last.

Results

Testing only RAP's old engine and Jena at first, I was surprised to see that my engine is on average 10 times faster than Jena and 14 times faster than RAP's SparqlEngine. Seeing that I could take on the competition, I ran tests against ARC and Redland - and was surprised again. ARC says of itself that it is made for speed, using PHP arrays instead of objects; reading this, I had taken it for granted that my engine couldn't be faster. Redland, next, is written entirely in C, making it extremely fast - no chance for my lib to win against it. All the more surprised I was to see a speedup of 7.7 against ARC and 3.3 against Redland!

The SPARQL queries were chosen to be data independent. The y axis is a logarithmically scaled time axis in seconds; the x axis shows the number of records in the database.

Here are the diagrams.

As always with benchmarks, take them with a grain of salt. Don't believe any benchmarks you didn't fake yourself. Different queries might well produce different results. You can see that Jena is even a bit better at this one simplest query when my engine needs to instantiate 1000 results - creating objects in PHP is slow, so this is the point where a benchmark can make SparqlEngineDb look slow.

Published on 2007-07-02


With SPARQLing eyes

In mid-November 2006 I finally found the topic for my diploma thesis: taking RAP (RDF API for PHP) and writing a better SPARQL engine that scales well on big models and operates directly on the database, instead of filtering and joining millions of triples in memory.

Slow beginnings

I began working on RAP in November, fixing small bugs that prevented RAP from working on case-sensitive file systems and with "short open tags" set to off, as well as some other outstanding bugs.

By mid-December, I had a first basic version of my SparqlEngineDb that could do basic SELECT statements with OPTIONAL clauses and LIMIT as well as OFFSET parts. I had nearly no time in the second half of December and the beginning of January 2007, since the exams were casting their shadows ahead.

On the 18th of January, I got the existing unit tests for the memory-based SparqlEngine running unmodified against my DB engine. The first 10 or 15 of 140 unit tests passed - the most basic ones.

Specs & Order

Four days later, I had a crisis while trying to implement ORDER BY support that fully adheres to the specs. In SPARQL, result variables may hold values of different categories and datatypes: literals, resources and blank nodes; strings, dateTime values, booleans and whatnot else. The standard explicitly tells you that blank nodes are to be sorted before IRIs, which come before RDF literals. The different data types also have a specific order and, as if that were not enough, need to be cast depending on their RDF data type to get them sorted correctly in SQL (e.g. numerically "09" is greater than "1", but a string comparison says otherwise, so you need to cast the value - which is stored as a blob - in MySQL so that it is recognized as a number). While this is easy for integers, you also have doubles, booleans and dateTime values. Each of them needs a different casting function - which brought me to the necessity of splitting the query into multiple queries that each retrieve only values of a certain datatype:

   SELECT t0.object FROM statements as t0 ORDER BY t0.object
  

needs to be split:

   SELECT t0.object FROM statements as t0
   WHERE t0.l_datatype = "http://www.w3.org/2001/XMLSchema#integer"
   ORDER BY CAST(t0.object as INT)
   
   SELECT t0.object FROM statements as t0
   WHERE t0.l_datatype = "http://www.w3.org/2001/XMLSchema#boolean"
   ORDER BY CAST(t0.object as BOOL)
   
   ... not to speak of datetime values
  

The most natural thing to do now would be to create one huge UNION query and get all results at once. Wrong. UNION is a set operation, which means that the order of results is undefined! So no matter how nicely I order the individual queries, the overall result stays unordered unless coincidence leaves the database's memory in a state that happens to return ordered results. My only option was to create separate SQL queries, send them one after another to the server, and join the results on the client side - not the best option performance-wise, but nobody I spoke with about it had a better idea. (It is possible to use ORDER BY outside the UNION clauses and sort by parts of the union, but that would require generating, as I called them in a CVS commit message, "queries of death".)
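
The client-side join can be sketched like this (a Python illustration with made-up data; the real engine compares RDF terms by the SPARQL ordering rules, and the relative order of literal datatypes is fixed here only for the sketch):

```python
# Results of the separate per-datatype SQL queries, each already sorted
# by the database. Values are made up for the illustration.
ordered_parts = {
    "blank":    ["_:b1", "_:b2"],
    "iri":      ["http://example.org/a"],
    "integer":  [1, 9, 42],
    "dateTime": ["2007-04-28T00:00:00"],
}

# SPARQL sorts blank nodes before IRIs before literals.
CATEGORY_ORDER = ["blank", "iri", "integer", "dateTime"]

def join_results(parts):
    """Concatenate the pre-sorted result sets in category order -
    restoring the total order that a UNION query would not guarantee."""
    merged = []
    for category in CATEGORY_ORDER:
        merged.extend(parts.get(category, []))
    return merged

merged = join_results(ordered_parts)
```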

Now, having multiple queries return data forced me to write workaround code for another part: OFFSET and LIMIT. While transforming SPARQL OFFSET and LIMIT clauses into SQL is trivial, it no longer is when your data are distributed over multiple result sets. Another class saw the light of my hard disk: the offset engine.
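
A minimal sketch of what such an offset engine has to do, assuming the per-datatype result sets arrive already in their merged order (Python, illustrative only):

```python
from itertools import chain, islice

def apply_offset_limit(result_sets, offset, limit):
    """Apply a global OFFSET/LIMIT across several result sets that are
    consumed in their already-merged order."""
    merged = chain.from_iterable(result_sets)
    return list(islice(merged, offset, offset + limit))

# Three per-datatype result sets; OFFSET 2 LIMIT 4 spans two of them.
parts = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
window = apply_offset_limit(parts, offset=2, limit=4)  # [3, 4, 5, 6]
```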

Preparing to be fast

Since my SparqlEngine is to be used in Powl (the base for OntoWiki), we ran first tests converting the Powl API to use SPARQL instead of direct SQL calls - this allows switching data backends easily. One problem was performance: while speed increased greatly with my new database-driven SPARQL engine, we were still way too slow to actually use OntoWiki properly - the Powl API generates and executes up to several hundred SPARQL queries to build a single page, and parsing all those queries took quite some time.

Prepared statements are the way to go in such a case, and that is the way I went. Currently, the SPARQL recommendation does not define anything in this direction, so I had to come up with a solution myself. Within a week, I had prepared statements for SPARQL implemented and working well.

The performance boost is dramatic: a simple query repeated 1000 times takes 3 instead of 12 seconds when using prepared statements - and this is without native prepared statements at the database driver level. ADOdb's mysqli driver currently does not support native prepared statements; with them, we will get another performance boost.
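
The principle can be sketched like this (Python; the ?? placeholder syntax and the class are illustrative inventions, not the actual RAP API):

```python
import re

class PreparedQuery:
    """Parse a query template once; execute it many times with new values."""

    def __init__(self, template):
        # The expensive parsing step happens only once, at prepare time.
        self.template = template
        self.placeholders = re.findall(r"\?\?(\w+)", template)

    def execute(self, **values):
        # Binding values is cheap compared to re-parsing the whole query.
        query = self.template
        for name in self.placeholders:
            query = query.replace("??" + name, values[name])
        return query

stmt = PreparedQuery("SELECT ?s WHERE { ?s dbpprop:name ??name. }")
result = stmt.execute(name='"Humboldt"')  # the bound query string
```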

Filter

After the DAWG, sort and limit test cases passed, it was time to move on to the filter code, one of the big features still missing. Examining the filter code of the memory-based SparqlEngine, I found that it extracts the whole FILTER clause from a SPARQL query, applies some regexes, and uses evil eval() to execute it as plain PHP code. After five minutes I had a working exploit that deletes all files on the webserver a RAP SparqlEngine is running on - there were no checks for the validity or sanity of the regexed PHP code; it was just executed in the hope that nothing went wrong.

This approach may work for PHP, but not for SQL - and I didn't want to open another barn-door-wide hole by enabling SQL injection attacks. So I sat down and extended SparqlParser to fully parse FILTER clauses and put them into a nice tree. I tried several ways of implementing the filter parsing: using a parser generator, writing it by hand by iterating over all characters, ... In the end, my do-it-yourself approach went in the same direction as the existing parser, and I finally understood how it worked and why it had been written that way.
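
The difference between eval()'ing filter code and parsing it can be sketched with a toy parser that accepts only a single comparison (Python, purely illustrative; real SPARQL filters are far richer):

```python
import re

def parse_filter(expr):
    """Parse a single '?var OP ?var' comparison into an ('op', left, right)
    tuple instead of handing the text to eval()."""
    match = re.fullmatch(r"\s*(\?\w+)\s*(!=|=|<|>)\s*(\?\w+)\s*", expr)
    if match is None:
        # Anything outside the known grammar is rejected, not executed.
        raise ValueError("unsupported filter expression: " + expr)
    left, op, right = match.groups()
    return (op, left, right)

tree = parse_filter("?o1 != ?o2")  # ('!=', '?o1', '?o2')
```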

In the following weeks I actually implemented FILTER support and had it nearly fully working when I stumbled across UNIONs in the unit tests' SPARQL queries. I had almost forgotten about them, but now I needed to implement them. I pondered whether to implement a poor man's solution that would work in the most obvious cases, or a full-fledged version that would require changes to half of my code. After seeing that Powl needs to generate queries that would not work the cheap way, I did the full work.

UNIONited in pleasure

Today, the 28th of April 2007, I got the following line when running the unit tests for SparqlEngineDb:

   Test cases run: 1/1, Passes: 140, Failures: 0, Exceptions: 0
  

After two months of working on it now and then, and three months of working nearly full-time on the engine, my SPARQL engine now passes all tests and fully implements the current specs. Yay!


My next and last task is to implement some extensions to SPARQL, such as aggregation support. After that, I'll write everything down and will (hopefully) be done with my diploma work.

Published on 2007-04-28