Fixing Text_LanguageDetect on PHP7

I recently upgraded my personal development laptop to PHP7, and suddenly writing blog posts was no longer possible. Text_LanguageDetect, the library I use to autodetect the language of my blog posts from their titles, had stopped working.

Debugging

The library has unit tests, and some of them failed on PHP7 but ran fine on PHP5:

1) Text_LanguageDetectTest::test_confidence
Failed asserting that two strings are equal.
--- Expected
+++ Actual
@@ @@
-'english'
+'portuguese'

The library was 1166 lines long, and I had no idea where to start.

Trace diff

Without knowing where to look, I thought that simply diffing the function traces would be a good idea: if the library's behavior in PHP7 differed from PHP5, a diff of the traces would show it.

So I added an xdebug_start_trace() call at the beginning of the failing test method and ran it on PHP7. Thanks to phpfarm I already had multiple PHP5 versions installed, and now I only had to install xdebug for my latest PHP5. I ran the test again, and then opened both trace files in meld:

$ phpunit --filter test_confidence tests/
$ cp /tmp/trace* trace7.xt

$ php-5.6.3 `which phpunit` --filter test_confidence tests/
$ cp /tmp/trace* trace5.xt

$ cut -b40- trace5.xt > cut-trace5.xt
$ cut -b40- trace7.xt > cut-trace7.xt

$ meld cut-trace5.xt cut-trace7.xt

I had to filter out memory and timing information with cut, because otherwise every line would have differed. But what I saw was not very helpful:

[Image: Function trace diffing]

The traces generated by xdebug in PHP5 and PHP7 were too different to be useful.

Migration guide

The PHP manual has a nice guide, "Migrating from PHP 5.6.x to PHP 7.0.x". I manually walked through every section of the "Backward incompatible changes" chapter.

For each incompatible change, I grepped the .php sources for the relevant keywords - but found nothing.

Coverage

The last idea I had was comparing the code coverage in the different PHP versions. I ran phpunit with --coverage-html, opened the generated HTML files in two browser tabs and manually inspected them for differences.

And then I found it. _unicode_block_name() had much more coverage in PHP7 than it had in PHP5.

[Image: PHPUnit coverage comparison in Firefox]

In PHP5, only the first if statement was covered; in PHP7, nearly all of the method was:

if ($unicode <= $blocks[0][1]) {
    return $blocks[0];
}

Let's dump those:

> var_dump($unicode, $blocks[0][1]);
int(116)
string(6) "0x007F"

And now I remembered: Hexadecimal strings are no longer considered numeric in PHP7.
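
The behavior change can be reproduced in isolation. The following is a minimal sketch that uses the two values from the dump above; the variable names are chosen for illustration only, and the commented results describe PHP 7 and later.

```php
<?php
// Values taken from the var_dump() output above.
$unicode  = 116;
$blockEnd = "0x007F"; // upper bound as stored in the data file

// PHP 7+ no longer treats hexadecimal strings as numeric.
var_dump(is_numeric($blockEnd));    // bool(false)

// The loose comparison therefore no longer sees 127 on the right side,
// so the first if statement is skipped. PHP 5 evaluated this to true.
var_dump($unicode <= $blockEnd);    // bool(false)

// Parsing the string explicitly as base 16 yields the intended bound.
$blockEndInt = intval($blockEnd, 16);
var_dump($blockEndInt);             // int(127)
var_dump($unicode <= $blockEndInt); // bool(true)
```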

Solution

The Unicode blocks are defined in data/unicode_blocks.dat as a serialized PHP array. That's why I did not find anything earlier: it's not a .php file.

The file contains Unicode blocks, built like this:

0 => array(
    0 => "0x0000",
    1 => "0x007F",
    2 => 'Basic Latin',
)

A regex replacement converted the hexadecimal strings to real integers, and with that Text_LanguageDetect works on PHP7.
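
The regex itself is not shown here. As an alternative illustration of the same conversion (not the method actually used), here is a minimal sketch that unserializes the data, turns the hex strings into integers, and serializes the result again. The function name convertBlocks and the two-entry sample array are hypothetical stand-ins; the real data file contains many more blocks.

```php
<?php
// Hypothetical helper: convert the "0x...." bounds of each block to integers.
function convertBlocks(array $blocks): array
{
    return array_map(function (array $block) {
        return array(
            intval($block[0], 16), // block start
            intval($block[1], 16), // block end
            $block[2],             // block name, unchanged
        );
    }, $blocks);
}

// Stand-in for the contents of data/unicode_blocks.dat (assumed sample).
$serialized = serialize(array(
    array("0x0000", "0x007F", 'Basic Latin'),
    array("0x0080", "0x00FF", 'Latin-1 Supplement'),
));

$converted = convertBlocks(unserialize($serialized));
var_dump($converted[0][1]); // int(127)

// The re-serialized string could then be written back to the data file.
$newSerialized = serialize($converted);
```

With integer bounds in the data, the comparison in _unicode_block_name() no longer depends on how PHP interprets numeric strings.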

Finally, I replaced PHP4-style constructors and object references with their PHP5/7 counterparts, added property and method visibility, fixed all code style errors and made the tests run on Travis CI.

Version 1.0.0 of Text_LanguageDetect was released on 2017-03-02. It detects the language of text samples via 3-gram frequencies and is installable via PEAR and Composer.

Written by Christian Weiske.
