Recipe 11.9 Extracting Links from an HTML File
11.9.1 Problem
You need to extract the URLs that are
specified inside an HTML document.
11.9.2 Solution
Use the pc_link_extractor( ) function shown in Example 11-2.
Example 11-2. pc_link_extractor( )
function pc_link_extractor($s) {
    $a = array();
    // $match[1] is the href value; $match[2] is the text between <a ...> and </a>
    if (preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i',
                       $s,$matches,PREG_SET_ORDER)) {
        foreach($matches as $match) {
            // store each link as a (target, linked text) pair
            array_push($a,array($match[1],$match[2]));
        }
    }
    return $a;
}
For example, if $page holds the HTML source of a document:
$links = pc_link_extractor($page);
11.9.3 Discussion
The pc_link_extractor( ) function returns an array. Each element
of that array is itself a two-element array. The first element is the
target of the link, and the second element is the text that is
linked. For example:
$links=<<<END
Click <a href="http://www.oreilly.com">here</a> to visit a computer book
publisher. Click <a href="http://www.sklar.com">over here</a> to visit
a computer book author.
END;
$a = pc_link_extractor($links);
print_r($a);
Array
(
    [0] => Array
        (
            [0] => http://www.oreilly.com
            [1] => here
        )

    [1] => Array
        (
            [0] => http://www.sklar.com
            [1] => over here
        )

)
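Because each element is a (target, text) pair, the result is easy to walk with foreach and list( ). The following is a minimal sketch; the sample array is written out by hand to mirror the output above, and the variable names $url, $text, and $lines are only illustrative:

```php
<?php
// Hand-written sample in the shape pc_link_extractor() returns:
// each element is array(link target, linked text).
$a = array(
    array('http://www.oreilly.com', 'here'),
    array('http://www.sklar.com', 'over here'),
);

$lines = array();
foreach ($a as $link) {
    // Unpack the two-element array into named variables.
    list($url, $text) = $link;
    $lines[] = "$text => $url";
}
print implode("\n", $lines) . "\n";
```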
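The regular expression in pc_link_extractor( )
won't work on all links, such as those that are
constructed with JavaScript or that use some hexadecimal escapes. It
also misses links whose text contains a line break, because . does not
match a newline without the s pattern modifier. Still, it should
function on the majority of reasonably well-formed HTML.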
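When the regular expression proves too brittle, one alternative is to let an HTML parser find the links. The sketch below is not part of the recipe: pc_link_extractor_dom( ) is a hypothetical name, and the code assumes PHP 5 or later with the DOM extension enabled. It returns the same (target, text) pairs as pc_link_extractor( ).

```php
<?php
// Hypothetical helper, not from the recipe: same return shape as
// pc_link_extractor(), but uses PHP's DOM extension instead of a regex.
// Assumes PHP 5 or later with the DOM extension available.
function pc_link_extractor_dom($s) {
    $a = array();
    if ($s === '') {
        return $a;              // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($s);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (! $loaded) {
        return $a;
    }
    foreach ($doc->getElementsByTagName('a') as $node) {
        // Skip named anchors that have no href attribute.
        if ($node->hasAttribute('href')) {
            $a[] = array($node->getAttribute('href'), $node->textContent);
        }
    }
    return $a;
}

// A line break between attributes, before href, defeats the regex in
// Example 11-2, since .*? cannot cross the newline.
$html = "<p>Click <a class=\"x\"\nhref='http://www.example.com/'>over here</a>.</p>";
$found = pc_link_extractor_dom($html);
print_r($found);
```

Note that textContent returns only the text of the link, so any markup nested inside the a element (an img or b tag, for instance) is dropped, whereas the regular expression version returns it verbatim.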
11.9.4 See Also
Recipe 13.8 for information on capturing text inside HTML tags;
documentation on preg_match_all( ) at
http://www.php.net/preg-match-all.