Recipe 11.11 Converting HTML to ASCII
11.11.1 Problem
You need to convert HTML to readable,
formatted ASCII text.
11.11.2 Solution
If you have access to an external program that formats HTML as ASCII,
such as lynx, call it like so:
$file = escapeshellarg($file);
$ascii = `lynx -dump $file`;
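When the HTML is in a string rather than a file, one option is to write it to a temporary file and run lynx on that. The following is a minimal sketch, not part of the original recipe: the helper name html_string_to_ascii( ) is hypothetical, and it assumes that the installed lynx accepts the -force_html option, which makes lynx treat the file as HTML even without an .html extension.

```php
<?php
// Hypothetical helper (not part of the recipe): format an HTML string with lynx.
// Returns the formatted text, or null if lynx produced no output.
function html_string_to_ascii($html) {
    // lynx reads from a file, so write the markup to a temporary file first
    $tmp = tempnam(sys_get_temp_dir(), 'h2a');
    if ($tmp === false) {
        return null;
    }
    file_put_contents($tmp, $html);

    // Assumption: the installed lynx supports -force_html
    $ascii = shell_exec('lynx -dump -force_html ' . escapeshellarg($tmp));
    unlink($tmp);

    // shell_exec() gives back a non-string or empty value when the
    // command fails or prints nothing
    return (is_string($ascii) && $ascii !== '') ? $ascii : null;
}
```

A null return value signals that lynx is missing or failed, so the caller can fall back to a pure-PHP converter.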
11.11.3 Discussion
If you can't use an external formatter, the
pc_html2ascii( ) function shown in Example 11-4 handles a reasonable subset of HTML (no tables
or frames, though).
Example 11-4. pc_html2ascii( )
function pc_html2ascii($s) {
// convert links
$s = preg_replace('/<a\s+.*?href="?([^\" >]*)"?[^>]*>(.*?)<\/a>/i',
'$2 ($1)', $s);
// convert <br>, <hr>, <p>, <div> to line breaks
$s = preg_replace('@<(b|h)r[^>]*>@i',"\n",$s);
$s = preg_replace('@<p[^>]*>@i',"\n\n",$s);
$s = preg_replace('@<div[^>]*>(.*)</div>@i',"\n".'$1'."\n",$s);
// convert bold and italic
$s = preg_replace('@<b[^>]*>(.*?)</b>@i','*$1*',$s);
$s = preg_replace('@<i[^>]*>(.*?)</i>@i','/$1/',$s);
// decode named entities
$s = strtr($s,array_flip(get_html_translation_table(HTML_ENTITIES)));
// decode numbered entities
$s = preg_replace_callback('/&#(\d+);/',
function ($m) { return chr((int) $m[1]); }, $s);
// remove any remaining tags
$s = strip_tags($s);
return $s;
}
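To see what the individual substitutions do, it can help to run a few of them on a small fragment. The sketch below copies the link, line-break, and bold patterns from the example above and applies them in the same order; the sample markup and its URL are made up for illustration.

```php
<?php
// Sample input; the URL is a placeholder for illustration only.
$html = 'Read <a href="http://www.example.com/">the <b>docs</b></a><br>now';

// Link pattern from the example: link text followed by the URL in parentheses
$text = preg_replace('/<a\s+.*?href="?([^\" >]*)"?[^>]*>(.*?)<\/a>/i',
                     '$2 ($1)', $html);
// <br> and <hr> become newlines
$text = preg_replace('@<(b|h)r[^>]*>@i', "\n", $text);
// Bold text is wrapped in asterisks
$text = preg_replace('@<b[^>]*>(.*?)</b>@i', '*$1*', $text);

echo $text;
```

This prints "Read the *docs* (http://www.example.com/)", a newline, and then "now". The order of the substitutions matters: the bold pattern would also match a <br> tag, so the line-break substitution has to run first.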
11.11.4 See Also
Recipe 9.9 for more on
get_html_translation_table( ); documentation on
preg_replace( ) at
http://www.php.net/preg-replace,
get_html_translation_table( ) at
http://www.php.net/get-html-translation-table,
and strip_tags( ) at
http://www.php.net/strip-tags.