Recipe 18.8 Processing Every Word in a File
18.8.1 Problem
You want to do something with every word
in a file.
18.8.2 Solution
Read in each line with fgets( ), separate the line
into words, and process each word:
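$fh = fopen('great-american-novel.txt','r') or die("Can't open file");
// fgets() returns false at end of file; testing with !== false
// keeps a line such as "0", which a plain truth test would skip
while (($s = fgets($fh)) !== false) {
    $words = preg_split('/\s+/',$s,-1,PREG_SPLIT_NO_EMPTY);
    // process words
}
fclose($fh) or die("Can't close file");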
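If the word processing lives elsewhere in a program, the same loop can be wrapped in a generator so callers simply iterate over words. What follows is a minimal sketch rather than part of the recipe: the function name words_in_file is hypothetical, and it assumes PHP 7 or later for the type declarations.

```php
<?php
// Hypothetical helper (not from the recipe): lazily yields each
// whitespace-delimited word in the file at $path.
function words_in_file(string $path): Generator
{
    $fh = fopen($path, 'r');
    if ($fh === false) {
        throw new RuntimeException("Can't open $path");
    }
    try {
        // fgets() returns false at end of file, which ends the loop
        while (($line = fgets($fh)) !== false) {
            $words = preg_split('/\s+/', $line, -1, PREG_SPLIT_NO_EMPTY);
            foreach ($words as $word) {
                yield $word;
            }
        }
    } finally {
        // Runs even if the caller stops iterating early
        fclose($fh);
    }
}
```

A caller would then write foreach (words_in_file('great-american-novel.txt') as $word) and process each $word inside the loop. Only one line is held in memory at a time, just as in the loop above.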
18.8.3 Discussion
Here's how to calculate average word length in a
file:
$word_count = $word_length = 0;
if ($fh = fopen('great-american-novel.txt','r')) {
    while (($s = fgets($fh)) !== false) {
        $words = preg_split('/\s+/',$s,-1,PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            $word_count++;
            $word_length += strlen($word);
        }
    }
    fclose($fh);
}
// Skip the report for an empty or unreadable file to avoid dividing by zero
if ($word_count > 0) {
    printf("The average word length over %d words is %.02f characters.",
           $word_count,
           $word_length/$word_count);
}
Processing every word proceeds differently depending on how
"word" is defined. The code in this
recipe uses the
Perl-compatible
regular-expression engine's \s
whitespace metacharacter, which includes space, tab, newline,
carriage return, and formfeed. Section 2.6
breaks apart a line into words by splitting on a space, which is
useful in that recipe because the words have to be rejoined with
spaces. The Perl-compatible engine also has a word-boundary assertion
(\b) that matches between a word character
(a letter, digit, or underscore) and a nonword character (anything else). Using
\b instead of \s to delimit
words makes the biggest difference for words with embedded
punctuation. The term 6 o'clock
is two words when split by whitespace (6 and
o'clock); it's four words when
split by word boundaries (6, o,
', and clock), plus the space between 6 and o,
which a split on \b returns as a piece of its own.
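The difference can be checked directly on the 6 o'clock example. This is a small sketch, not code from the recipe; the variable names are illustrative only.

```php
<?php
$term = "6 o'clock";

// Split on runs of whitespace: the apostrophe stays inside the second word
$by_space = preg_split('/\s+/', $term, -1, PREG_SPLIT_NO_EMPTY);
// $by_space is ["6", "o'clock"]

// Split at word boundaries: pieces alternate between word and nonword
// runs, so the space comes back as a piece of its own
$by_boundary = preg_split('/\b/', $term, -1, PREG_SPLIT_NO_EMPTY);
// $by_boundary is ["6", " ", "o", "'", "clock"]

// Drop whitespace-only pieces to leave the four words
$boundary_words = array_values(array_filter($by_boundary, function ($piece) {
    return trim($piece) !== '';
}));
// $boundary_words is ["6", "o", "'", "clock"]
```

Which split is appropriate depends on whether punctuation should count as part of a word for the task at hand.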
18.8.4 See Also
Section 13.3 for regular expressions that
match words; Section 1.5 for breaking apart a
line by words; documentation on fgets( ) at
http://www.php.net/fgets, on
preg_split( ) at http://www.php.net/preg-split, and on the
Perl-compatible regular expression extension at http://www.php.net/pcre.