When parsing HTML and following links, it is necessary to calculate absolute URLs from the href attribute values in <a> and <link> tags.
Several classes of links may occur in an HTML document:
- Absolute URL
- A URL with scheme (protocol), host and path.
- Absolute URL without scheme
- The scheme is missing, but host and path are given. The document's protocol has to be used in this case, according to RFC 3986 section 4.2 and section 5.2.2.
- Path-absolute URL without host
- Scheme, hostname and port are missing - only an absolute path is given.
- Relative path
- A simple relative path.
- Fragment only
- A fragment identifier with a hash sign in front. It links to another section in the same document.
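To make the link classes above concrete, here is a minimal sketch that names the class of a given href value. The function name `classifyHref()` and the sample hrefs are hypothetical and not part of any library; the sketch only inspects the first characters of the value and assumes PHP 7 or later.

```php
<?php
// Hypothetical helper, for illustration only: name the link class of an href.
function classifyHref(string $href): string
{
    $href = trim($href);
    if ($href === '') {
        return 'empty';
    }
    if ($href[0] === '#') {
        return 'fragment only';
    }
    // Must be checked before the single-slash case below.
    if (strncmp($href, '//', 2) === 0) {
        return 'absolute URL without scheme';
    }
    if ($href[0] === '/') {
        return 'path-absolute URL without host';
    }
    // RFC 3986 section 3.1: a scheme is a letter followed by
    // letters, digits, "+", "-" or ".", terminated by a colon.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $href) === 1) {
        return 'absolute URL';
    }
    // Everything else, including query-only values such as "?page=2".
    return 'relative path';
}

foreach (['http://example.org/foo/bar.htm', '//example.org/foo/bar.htm',
          '/foo/bar.htm', '../baz.jpg', '#section'] as $href) {
    echo $href, ' => ', classifyHref($href), "\n";
}
```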
To resolve those URLs, you need both the document URL and the link href value.
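As an illustration of how the two inputs are combined, the following is a rough sketch of the resolution steps from RFC 3986 section 5.2.2, with the dot-segment removal of section 5.2.4. The function names `resolveHref()` and `removeDotSegments()` are hypothetical. The sketch assumes the document URL is absolute and has a host, returns hrefs that carry a scheme unchanged, and has not been checked against the full RFC test suite, so treat it as a teaching aid rather than a replacement for a tested library.

```php
<?php
// Sketch of RFC 3986 section 5.2.4: drop "." and ".." path segments.
function removeDotSegments(string $path): string
{
    $segments = explode('/', $path);
    $last = end($segments);
    $out = [];
    foreach ($segments as $segment) {
        if ($segment === '.') {
            continue;
        }
        if ($segment === '..') {
            // Never pop the empty segment that stands for the leading slash.
            if (count($out) > 1 || (count($out) === 1 && $out[0] !== '')) {
                array_pop($out);
            }
            continue;
        }
        $out[] = $segment;
    }
    // A trailing "." or ".." names a directory, so keep the trailing slash.
    if ($last === '.' || $last === '..') {
        $out[] = '';
    }
    return implode('/', $out);
}

// Sketch of RFC 3986 section 5.2.2 for a base URL that has a host.
function resolveHref(string $base, string $href): string
{
    $b = parse_url($base);
    if ($b === false || !isset($b['scheme'], $b['host'])) {
        throw new InvalidArgumentException('Base must be an absolute URL with a host');
    }
    $userinfo = isset($b['user'])
        ? $b['user'] . (isset($b['pass']) ? ':' . $b['pass'] : '') . '@'
        : '';
    $authority = $userinfo . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath  = $b['path'] ?? '/';

    // Split off fragment and query by hand.
    $fragment = '';
    if (($pos = strpos($href, '#')) !== false) {
        $fragment = substr($href, $pos);
        $href = substr($href, 0, $pos);
    }
    $query = '';
    if (($pos = strpos($href, '?')) !== false) {
        $query = substr($href, $pos);
        $href = substr($href, 0, $pos);
    }

    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $href) === 1) {
        return $href . $query . $fragment;                       // absolute URL, returned as is
    }
    if (strncmp($href, '//', 2) === 0) {
        return $b['scheme'] . ':' . $href . $query . $fragment;  // scheme missing
    }
    if ($href === '') {
        // Empty path: keep the base path, and the base query unless one is given.
        $path = $basePath;
        if ($query === '' && isset($b['query'])) {
            $query = '?' . $b['query'];
        }
    } elseif ($href[0] === '/') {
        $path = removeDotSegments($href);                        // path-absolute
    } else {
        // Relative path: replace everything after the last slash of the base path.
        $dir  = substr($basePath, 0, strrpos($basePath, '/') + 1);
        $path = removeDotSegments($dir . $href);
    }
    return $b['scheme'] . '://' . $authority . $path . $query . $fragment;
}

echo resolveHref('http://example.org/foo/bar.htm', '../baz.jpg'), "\n";
```

Note that a fragment-only or empty href keeps the path and query of the document URL, which is why both inputs are always required.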
Implementing the whole resolution algorithm is tedious, and you don't have to do it yourself: several implementations already exist.
Net_URL2
<?php
require_once 'Net/URL2.php';

$base = new Net_URL2('http://example.org/foo/bar.htm');
$abs  = $base->resolve('../baz.jpg');
// resolve() returns a Net_URL2 object;
// $abs->getURL() is 'http://example.org/baz.jpg'
?>
Absolute URL deriver
This library consists of a single file, but it depends on another, much larger library that provides the URL handling.
Empty action attributes occur in <form> tags in practice, and browsers then submit the form to the document's own URL. Both libraries listed above cope with that; they return the source URL when the "target" URL is empty.
HTML documents may have a <base href=".."/> tag in their head section. When resolving links, you need to use its href value as the base instead of the document's own URL. See my XPath article for more information about extracting attribute values from HTML.
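A minimal sketch of that lookup, using PHP's bundled DOM extension, might look like this. The function name `findBaseHref()` is hypothetical, and the sketch assumes that only the first <base> element carrying an href attribute is relevant.

```php
<?php
// Hypothetical helper: return the href of the first <base href="..."> element,
// or null when the document does not declare one.
function findBaseHref(string $html): ?string
{
    if (trim($html) === '') {
        return null; // DOMDocument::loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely well-formed, so collect parser
    // warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // string() of an empty node set is '', which is mapped to null below.
    $href = trim($xpath->evaluate('string((//base[@href])[1]/@href)'));
    return $href === '' ? null : $href;
}

$html = '<html><head><base href="http://example.org/sub/"/></head>'
      . '<body><a href="x.htm">x</a></body></html>';
var_dump(findBaseHref($html));
```

The returned value may itself be a relative URL, so it should first be resolved against the document URL; the result is then the base for all other links in the document.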