Extracting birth dates from Wikipedia

For the demo installation of bdrem (birthday reminder tool) I needed a list of birth dates, preferably public ones.

Finding a source

The largest free source of person data is Wikipedia, so I looked there for a list of people. Besides lists, Wikipedia also has a List of lists. Its people category was what I was looking for; by drilling down three levels I found the list of notable German scientists.

Nearly every one of the linked scientists has a "Born" field in the meta data box (infobox) on the right side of the article. Now the question was how to extract that data without manually parsing all the HTML or MediaWiki markup.

DBpedia

I remembered DBpedia from the time I wrote my diploma thesis: it is a database containing all the meta data from Wikipedia, updated in near real time.

DBpedia is a SPARQL database (triple store) and has a public SPARQL query page.

But what should I use as the query? Which fields can I use?

Finding properties

The datasets page linked from the main page links to some resource pages, e.g. dbpedia.org/resource/Berlin.

I simply replaced Berlin in the URL with the name of one of the scientists, Alexander_von_Humboldt, and had his resource page. There I saw the properties that I was interested in:

dcterms:subject     category:German_scientists
dbpprop:dateOfBirth 1769-09-14
dbpprop:name        Alexander von Humboldt

SPARQL

SPARQL is a bit like SQL (SELECT, WHERE, LIMIT), but the actual conditions are sentences of the form subject predicate object. Knowing this and the properties above gave me the following query:

SELECT DISTINCT ?Name ?BirthDate
WHERE {
    ?Scientist dcterms:subject category:German_scientists.
    ?Scientist dbpprop:birthDate ?BirthDate.
    ?Scientist dbpprop:name ?Name.
}
LIMIT 100
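
The query does not have to go through the web form; it can also be sent to the endpoint as a plain HTTP request. The following Python sketch only builds the request URL. It assumes that the public endpoint lives at https://dbpedia.org/sparql and that it accepts a format parameter to select CSV output (a feature of the server software, not of the SPARQL protocol itself). The PREFIX lines spell out namespaces that the DBpedia query page predefines, the selected variables are separated by whitespace as standard SPARQL expects, and the name build_query_url is made up for this example.

```python
from urllib.parse import urlencode

# Assumption: URL of the public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"

# The scientists query with its prefixes declared explicitly,
# so that it does not rely on the endpoint's predefined namespaces.
QUERY = """\
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX category: <http://dbpedia.org/resource/Category:>
PREFIX dbpprop: <http://dbpedia.org/property/>
SELECT DISTINCT ?Name ?BirthDate
WHERE {
    ?Scientist dcterms:subject category:German_scientists.
    ?Scientist dbpprop:birthDate ?BirthDate.
    ?Scientist dbpprop:name ?Name.
}
LIMIT 100
"""


def build_query_url(query, endpoint=ENDPOINT, result_format="text/csv"):
    """Return a GET URL that asks the endpoint to run `query`."""
    # Assumption: the endpoint understands a "format" parameter.
    params = {"query": query, "format": result_format}
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_query_url(QUERY))
```

The resulting URL can be fetched with any HTTP client, for example urllib.request.urlopen or curl.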

And voilà: I had a list of scientists with their names and birth dates. The DBpedia SPARQL page also offers CSV export, and using it gave me:

"Name","BirthDate"
"Burkhard Rost",1961-07-11
"Rost, Burkhard",1961-07-11
"Victor Gustav Bloede",1849-03-14
"Bloede, Victor G",1849-03-14
"Wilhelm Körner",1839-04-20
...
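
Several people show up twice in the export, apparently because a resource can carry more than one name value ("Burkhard Rost" and "Rost, Burkhard"). A small Python sketch makes those variants visible by grouping the rows by birth date. It uses only the sample rows shown above; the function name names_by_birth_date is made up for this example.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the CSV export shown above.
SAMPLE = '''"Name","BirthDate"
"Burkhard Rost",1961-07-11
"Rost, Burkhard",1961-07-11
"Victor Gustav Bloede",1849-03-14
"Bloede, Victor G",1849-03-14
"Wilhelm Körner",1839-04-20
'''


def names_by_birth_date(csv_text):
    """Map each birth date to the list of names exported with it."""
    groups = defaultdict(list)
    # DictReader takes the column names from the header row.
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["BirthDate"]].append(row["Name"])
    return dict(groups)


if __name__ == "__main__":
    for birth_date, names in names_by_birth_date(SAMPLE).items():
        print(birth_date, names)
```

Two different people who share a birth date would land in the same group, so this is only good for inspecting the data, not for removing duplicates automatically.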

bdrem

bdrem gained support for CSV files in version 0.6. The following configuration is needed to make it display the scientists' birth dates:

$source = array(
    'Csv',
    array(
        'filename' => 'german-scientists.csv',
        'columns' => array(
            'name'  => 0,
            'event' => false,
            'date'  => 1
        ),
        'defaultEvent' => 'Birthday',
    )
);

bdrem now renders the following output on the shell:

$ ./bin/bdrem.php
--------------------------------------------------------
Days  Age  Name                Event     Date        Day
--------------------------------------------------------
  -2  103  Braun, Wernher von  Birthday  23.03.1912  Mo
  10  144  Arthur Wehnelt      Birthday  04.04.1871  Sa
  10  144  Wehnelt, Arthur     Birthday  04.04.1871  Sa
  13   76  Bernd Brinkmann     Birthday  07.04.1939  Di
  13   76  Brinkmann, Bernd    Birthday  07.04.1939  Di
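
The Days and Age columns can be reproduced with a little date arithmetic. The sketch below is not bdrem's code; it is a Python guess that happens to match the rows shown. It assumes that the listing was generated on 25.03.2015 (the date implied by the Days and Age values) and that Age is the age reached on the listed birthday. The function names are made up for this example.

```python
from datetime import date


def occurrence(birth, year):
    """Return the date of the birthday in the given year."""
    try:
        return birth.replace(year=year)
    except ValueError:
        # 29 February in a non-leap year: fall back to 1 March (an assumption).
        return date(year, 3, 1)


def days_and_age(birth, today):
    """Return (days until the nearest birthday, age reached on it).

    A negative day count means the birthday has already passed.
    """
    candidates = [occurrence(birth, today.year + offset) for offset in (-1, 0, 1)]
    nearest = min(candidates, key=lambda day: abs((day - today).days))
    return (nearest - today).days, nearest.year - birth.year


if __name__ == "__main__":
    today = date(2015, 3, 25)  # assumed date of the listing above
    print(days_and_age(date(1912, 3, 23), today))  # Wernher von Braun
    print(days_and_age(date(1871, 4, 4), today))   # Arthur Wehnelt
```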

Written by Christian Weiske.
