PHP: Determine absolute link URLs

When parsing HTML and following links, it is necessary to calculate absolute URLs from the href attribute values in <a> and <link> tags.

Link classes

Several classes of links may occur in an HTML document:

Absolute URL
http://example.org/foo.html
A URL with scheme (protocol), host and path.
Absolute URL without scheme
//example.org/foo.html
The scheme is missing, but host and path are given. The document's protocol has to be used in this case, according to RFC 3986 section 4.2 and section 5.2.2.
Path-absolute URL without host
/path/to/file.html
Scheme, hostname and port are missing - only an absolute path is given.
Relative path
../foo/bar.html
A path relative to the location of the document.
Fragment only
#baz
An anchor with a hash sign in front. Links to another section in the same document.

To resolve those URLs, you need both the document URL and the link href value.
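The link classes above can be told apart with a few string checks. The following is only a rough sketch: the function name classifyHref() and the returned labels are made up for illustration, and other reference forms (for example query-only references such as "?page=2") simply end up in the "relative path" bucket.

```php
<?php
// Hypothetical helper: names the link class of an href value.
function classifyHref(string $href): string
{
    if ($href === '') {
        return 'empty';
    }
    if ($href[0] === '#') {
        return 'fragment only';
    }
    // A scheme is a letter followed by letters, digits, "+", "-" or ".", then ":"
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $href) === 1) {
        return 'absolute';
    }
    if (strncmp($href, '//', 2) === 0) {
        return 'absolute without scheme';
    }
    if ($href[0] === '/') {
        return 'path-absolute';
    }
    return 'relative path';
}

$examples = [
    'http://example.org/foo.html',
    '//example.org/foo.html',
    '/path/to/file.html',
    '../foo/bar.html',
    '#baz',
];
foreach ($examples as $href) {
    echo $href, ' => ', classifyHref($href), "\n";
}
```

Knowing the class only tells you which parts are missing; the missing parts still have to be taken from the document URL.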

Code

Implementing the whole resolving algorithm is tedious, and you don't have to do it yourself. There are several implementations out there.
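To give an idea of what such an implementation has to do, here is a minimal sketch of the reference resolution procedure from RFC 3986 section 5.2 in plain PHP. All function names are hypothetical, there is no validation or normalization beyond dot-segment removal, and it is meant for illustration only, not as a replacement for a tested library.

```php
<?php
// Split a URL into its five components; absent components become null.
function splitUrl(string $url): array
{
    preg_match(
        '~^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$~s',
        $url, $m, PREG_UNMATCHED_AS_NULL
    );
    return [
        'scheme' => $m[1] ?? null, 'authority' => $m[2] ?? null,
        'path' => $m[3] ?? '', 'query' => $m[4] ?? null, 'fragment' => $m[5] ?? null,
    ];
}

// Remove "." and ".." segments (RFC 3986 section 5.2.4).
function removeDotSegments(string $path): string
{
    $out = '';
    while ($path !== '') {
        if (strncmp($path, '../', 3) === 0) {
            $path = substr($path, 3);
        } elseif (strncmp($path, './', 2) === 0 || strncmp($path, '/./', 3) === 0) {
            $path = substr($path, 2);
        } elseif ($path === '/.') {
            $path = '/';
        } elseif (strncmp($path, '/../', 4) === 0 || $path === '/..') {
            $path = ($path === '/..') ? '/' : substr($path, 3);
            // Drop the last segment already written to the output
            $out = substr($out, 0, (int) strrpos($out, '/'));
        } elseif ($path === '.' || $path === '..') {
            $path = '';
        } else {
            // Move the first segment (with its leading slash) to the output
            $next = strpos($path, '/', 1);
            $out .= ($next === false) ? $path : substr($path, 0, $next);
            $path = ($next === false) ? '' : substr($path, $next);
        }
    }
    return $out;
}

// Merge a relative path with the base path (RFC 3986 section 5.2.3).
function mergePaths(array $base, string $refPath): string
{
    if ($base['authority'] !== null && $base['path'] === '') {
        return '/' . $refPath;
    }
    $slash = strrpos($base['path'], '/');
    return ($slash === false ? '' : substr($base['path'], 0, $slash + 1)) . $refPath;
}

// Resolve $href against $baseUrl (RFC 3986 sections 5.2.2 and 5.3).
function resolveUrl(string $baseUrl, string $href): string
{
    $b = splitUrl($baseUrl);
    $r = splitUrl($href);
    $t = $r;
    if ($r['scheme'] === null) {
        $t['scheme'] = $b['scheme'];
        if ($r['authority'] === null) {
            $t['authority'] = $b['authority'];
            if ($r['path'] === '') {
                $t['path'] = $b['path'];
                $t['query'] = $r['query'] ?? $b['query'];
            } elseif ($r['path'][0] !== '/') {
                $t['path'] = mergePaths($b, $r['path']);
            }
        }
    }
    if ($r['scheme'] !== null || $r['authority'] !== null || $r['path'] !== '') {
        $t['path'] = removeDotSegments($t['path']);
    }
    $url = ($t['scheme'] !== null ? $t['scheme'] . ':' : '')
        . ($t['authority'] !== null ? '//' . $t['authority'] : '') . $t['path'];
    $url .= ($t['query'] !== null ? '?' . $t['query'] : '');
    return $url . ($t['fragment'] !== null ? '#' . $t['fragment'] : '');
}

echo resolveUrl('http://example.org/foo/bar.htm', '../baz.jpg'), "\n";
```

Even this stripped-down version needs four functions and a fair number of special cases, which is a good argument for using an existing, tested implementation.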

Net_URL2

PEAR offers the Net_URL2 package. Its resolve() method implements the procedure properly, is unit-tested and has no other dependencies. Example:

<?php
require_once 'Net/URL2.php';
$base = new Net_URL2('http://example.org/foo/bar.htm');
$abs = $base->resolve('../baz.jpg');
// $abs is a Net_URL2 object; $abs->getURL() returns 'http://example.org/baz.jpg'
?>

Absolute URL deriver

absolute-url-deriver is a small composer-installable lib for resolving relative URLs.

While this library consists of a single file, it depends on another, much larger library that provides the URL handling.

Empty URLs

HTML5 allows empty action attributes in <form> tags. Both libraries listed above cope with that; they return the source URL when the "target" URL is empty.

Base href

HTML documents may have a <base href=".."/> tag in their head section. When resolving links, you need to use its href value as the base instead of the document's own URL. See my XPath article for more information about extracting attribute values from HTML.
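Picking the right base URL could look roughly like the following sketch, which uses PHP's DOM extension. The function name effectiveBaseUrl() is made up for this example, and the sketch assumes that the base href is already absolute; a relative base href would itself have to be resolved against the document URL first.

```php
<?php
// Hypothetical helper: returns the <base href> value if the document has one,
// otherwise the document URL itself.
function effectiveBaseUrl(string $html, string $documentUrl): string
{
    if (trim($html) === '') {
        return $documentUrl;
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; keep libxml warnings out of the output
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // string() yields the first matching attribute value, or '' if none exists
    $href = $xpath->evaluate('string(//base/@href)');

    return $href === '' ? $documentUrl : $href;
}

$html = '<html><head><base href="http://example.org/base/"/></head><body></body></html>';
echo effectiveBaseUrl($html, 'http://example.org/foo/bar.htm'), "\n";
```

The returned value is what you then hand to the resolver as the base URL.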

Written by Christian Weiske.
