The Unix file utility uses a "magic" database to determine what type of data a file contains, independently of the file's name or extension.
Here is how I created a custom magic database for testing purposes:
Test files
First, I created some files to run the tests on:
test.html:
<html></html>
test.foo:
<?php echo 'foo'; ?>
test.23:
Test 23
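To reproduce the setup, the three files can also be created with a short script. This is only a minimal sketch: the file names are taken from the file output in this post, while the trailing newline in each file and the helper name create_test_files are assumptions made for the example.

```python
from pathlib import Path

# File names follow the `file` output shown in this post.
# The trailing newline in each file is an assumption.
TEST_FILES = {
    "test.html": "<html></html>\n",
    "test.foo": "<?php echo 'foo'; ?>\n",
    "test.23": "Test 23\n",
}


def create_test_files(directory="."):
    """Write the three test files into `directory` and return their paths."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, content in TEST_FILES.items():
        path = target / name
        path.write_text(content, encoding="ascii")
        paths.append(path)
    return paths
```

Calling create_test_files() without arguments writes the files into the current directory.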
Let's see what the standard magic database detects here:
$ file test.*
test.23: ASCII text
test.foo: PHP script, ASCII text
test.html: HTML document, ASCII text
$ file -i test.*
test.23: text/plain; charset=us-ascii
test.foo: text/x-php; charset=us-ascii
test.html: text/html; charset=us-ascii
Magic database
The magic database contains the rules that are used to detect the type.
It's a plain text file with a rule on each line. A line that starts with > continues the test on the line before it, so that rules can be combined. The full documentation is available in the magic(5) man page.
Here is my simple magic file, my-magic, which detects "23" within the first 16 bytes of a file and returns the "text/x-23" MIME type:
0 search/16 23 File containing "23"
!:mime text/x-23
We can already use it:
$ file -m my-magic test.23
test.23: File containing "23", ASCII text
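To illustrate what the search/16 rule does, here is a simplified Python sketch of the matching logic. It is not how file is implemented. The names RULE, search_matches and apply_rule are made up for this example, and reading search/16 as "try the match at 16 starting positions from the offset" follows my understanding of the magic man page. The ", ASCII text" suffix that file appends is not modelled.

```python
# Hypothetical in-memory form of the one-rule magic file from this post.
RULE = {
    "offset": 0,
    "range": 16,
    "pattern": b"23",
    "description": 'File containing "23"',
    "mime": "text/x-23",
}


def search_matches(data, pattern, offset=0, search_range=16):
    """Return True if `pattern` starts at one of `search_range` positions.

    Assumption: the match is attempted at positions offset, offset + 1, ...,
    offset + search_range - 1, and the pattern must fit completely into
    the data at that position.
    """
    for start in range(offset, offset + search_range):
        if data[start:start + len(pattern)] == pattern:
            return True
    return False


def apply_rule(data, rule=RULE):
    """Return (description, mime) if the rule matches, otherwise None."""
    if search_matches(data, rule["pattern"], rule["offset"], rule["range"]):
        return rule["description"], rule["mime"]
    return None
```

With the contents of test.23, apply_rule(b"Test 23\n") returns the description and the text/x-23 MIME type, while the contents of test.html give None because "23" does not occur in them.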
Compilation
If you want to use it often, you should compile it into a binary file for speed. The compiled database is written to my-magic.mgc:
$ file -C -m my-magic
$ file -m my-magic.mgc test.*
test.23: File containing "23", ASCII text
test.foo: ASCII text
test.html: ASCII text
$ file -i -m my-magic.mgc test.*
test.23: text/x-23; charset=us-ascii
test.foo: text/plain; charset=us-ascii
test.html: text/plain; charset=us-ascii
The HTML and PHP files that were detected correctly earlier are no longer recognized, because my own magic database does not contain the rules of the standard magic file (/usr/share/misc/magic.mgc).
You can, however, pass multiple magic files, separated by a colon (:):
$ file -i -m my-magic.mgc:/usr/share/misc/magic.mgc test.*
test.23: text/x-23; charset=us-ascii
test.foo: text/x-php; charset=us-ascii
test.html: text/html; charset=us-ascii
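If the lookup is driven from a script, the colon-separated list can be assembled programmatically. The following is a minimal sketch, assuming that file is on the PATH and that -i prints the MIME type as in the examples here. build_file_command and detect are hypothetical helper names, not part of any library.

```python
import shutil
import subprocess


def build_file_command(paths, magic_files, mime=False):
    """Build the argument list for `file` with several magic databases.

    The databases are joined with ":" as in the command line above.
    """
    command = ["file"]
    if mime:
        # -i is the flag used in this post to print the MIME type.
        command.append("-i")
    command += ["-m", ":".join(magic_files)]
    command += list(paths)
    return command


def detect(paths, magic_files, mime=False):
    """Run `file` and return its output lines, one per input file."""
    if shutil.which("file") is None:
        raise RuntimeError("the file utility is not installed")
    result = subprocess.run(
        build_file_command(paths, magic_files, mime),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

For example, detect(["test.23"], ["my-magic.mgc", "/usr/share/misc/magic.mgc"], mime=True) runs the same command as shown above for a single file.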
Programming language detection
With this knowledge, I wrote a magic file that detects the programming language in source code files, so that phorkie can automatically choose the correct file extension: MIME_Type_PlainDetect.