Creating a custom magic file database

The Unix file utility uses a "magic" database to determine what type of data a file contains, independently of the file's name or extension.

Here is how I created a custom magic database for testing purposes:

Test files

First, I created some files to run the tests on:

test.html
<html></html>
test.foo
<?php echo 'foo'; ?>
test.23
Test 23
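
The three test files can be recreated with a few shell commands. This is a minimal sketch that assumes a POSIX shell and uses the file names that appear in the file output in this article:

```shell
# Create the three sample files in the current directory.
# printf '%s\n' writes the text literally, followed by a newline.
printf '%s\n' '<html></html>' > test.html
printf '%s\n' "<?php echo 'foo'; ?>" > test.foo
printf '%s\n' 'Test 23' > test.23
```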

Let's see what the standard magic database detects here:

$ file test.*
test.23:   ASCII text
test.foo:  PHP script, ASCII text
test.html: HTML document, ASCII text
 
$ file -i test.*
test.23:   text/plain; charset=us-ascii
test.foo:  text/x-php; charset=us-ascii
test.html: text/html; charset=us-ascii

Magic database

The magic database contains the rules that are used to detect a file's type.

It's a plain text file with a rule on each line. Lines may refer to the previous line, so that rules can be combined. The full documentation is available in the magic man page.

Here is my simple file that detects "23" within the first 16 bytes of the file and returns the "text/x-23" MIME type:

my-magic
0 search/16 23 File containing "23"
!:mime text/x-23

It can be used right away, without compiling it:

$ file -m my-magic test.23 
test.23: File containing "23", ASCII text
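
To illustrate how a line can refer to the previous one, here is a small sketch of a two-level rule. The file name combined-magic and both descriptions are made up for this example and are not part of my actual magic file. The second line starts with ">", so it is only evaluated when the line above it has matched:

```shell
# Write a hypothetical magic file with a continuation rule.
# The quoted 'EOF' keeps the shell from interpreting ">" and "\b".
cat > combined-magic <<'EOF'
# Level 0: the file starts with the string "Test".
0	string	Test	Text starting with "Test"
# Level 1 (">"): checked only if the line above matched.
# "\b" removes the space that file would otherwise put before this text.
>5	string	23	\b, followed by "23"
EOF
```

It should work in the same way as the simple file above, for example with file -m combined-magic test.23.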

Compilation

If you want to use it many times, you should compile it to a binary file, which file can load faster than the text version. The compiled database is written to my-magic.mgc:

$ file -C -m my-magic
$ file -m my-magic.mgc test.*
test.23:   File containing "23", ASCII text
test.foo:  ASCII text
test.html: ASCII text
 
$ file -i -m my-magic.mgc test.*
test.23:   text/x-23; charset=us-ascii
test.foo:  text/plain; charset=us-ascii
test.html: text/plain; charset=us-ascii

The HTML and PHP files that were detected properly earlier are no longer detected, because my own magic database does not contain the rules of the standard magic file (/usr/share/misc/magic.mgc).

You may, however, pass multiple magic files, separated by a colon (:):

$ file -i -m my-magic.mgc:/usr/share/misc/magic.mgc test.*
test.23:   text/x-23; charset=us-ascii
test.foo:  text/x-php; charset=us-ascii
test.html: text/html; charset=us-ascii
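
When more than two databases are involved, the colon-separated list can be built by a small helper. This is only a sketch: join_magic is a hypothetical function name, and the path of the standard database is the one from my system and may differ on yours.

```shell
# Join all arguments with ":" as expected by the -m option of file.
# The function body runs in a subshell, so changing IFS does not leak out.
join_magic() (
    IFS=:
    printf '%s\n' "$*"
)

# Usage (assumes my-magic.mgc exists in the current directory):
#   file -i -m "$(join_magic my-magic.mgc /usr/share/misc/magic.mgc)" test.*
```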

Programming language detection

With this knowledge, I wrote a magic file that detects the programming language in source code files, so that phorkie can automatically choose the correct file extension: MIME_Type_PlainDetect.

Written by Christian Weiske.
