Recipe 13.3 Matching Words
13.3.1 Problem
You want to pull out all words from a
string.
13.3.2 Solution
The key to this is carefully defining what you mean by a word. Once
you've created your definition, use the special
character types to create your regular expression:
/\S+/ // everything that isn't whitespace
/[A-Z'-]+/i // all upper and lowercase letters, apostrophes, and hyphens
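As a minimal sketch of how these two patterns might be applied, the following passes each one to preg_match_all(). The sample sentence and the variable names are assumptions made for illustration only.

```php
<?php
// Sample text (an assumption for illustration).
$text = "Don't over-think it: PHP's regex is well-known.";

// Definition 1: a word is any run of non-whitespace characters.
// Punctuation stays attached to its neighbor, e.g. "it:" and "well-known."
preg_match_all('/\S+/', $text, $matches);
$byWhitespace = $matches[0];

// Definition 2: a word is a run of letters, apostrophes, and hyphens.
// The hyphen is last in the class, so it is a literal, not a range.
preg_match_all("/[A-Z'-]+/i", $text, $matches);
$byLetters = $matches[0];

print_r($byWhitespace);
print_r($byLetters);
```

Both patterns find seven words here, but only the second strips the colon from "it:" and the trailing period from "well-known."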
13.3.3 Discussion
The simple question "what is a
word?" is surprisingly complicated. While Perl-compatible
regular expressions have a built-in word character type,
specified by \w, it's important to
understand exactly how PHP defines a word. Otherwise, your results
may not be what you expect.
Normally, because it comes directly from Perl's
definition of a word, \w encompasses all letters,
digits, and underscores; this means a_z is a word,
but the email address php@example.com is not.
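The difference can be checked directly. This is a short sketch; isSingleWord() is a hypothetical helper introduced here for illustration, not a built-in function.

```php
<?php
// Hypothetical helper: true when the whole string is a single run of \w characters.
// The D modifier stops $ from matching before a trailing newline.
function isSingleWord(string $s): bool
{
    return preg_match('/^\w+$/D', $s) === 1;
}

$underscore = isSingleWord('a_z');             // letters and an underscore
$email      = isSingleWord('php@example.com'); // contains @ and .

// \w+ breaks the address apart at the characters it does not accept.
preg_match_all('/\w+/', 'php@example.com', $m);
$pieces = $m[0];

var_dump($underscore, $email);
print_r($pieces);
```

Here a_z passes as one word, while the email address is split into php, example, and com.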
In this recipe, we only consider English
words, but other languages use different alphabets. Because
Perl-compatible regular expressions use the current locale to define
their settings, altering the locale can switch the definition of a
letter, which then redefines the meaning of a word.
To combat this, you may want to explicitly enumerate the characters
belonging to your words inside a character class. To add a
nonstandard character, use \ddd, where ddd is the
character's octal code.
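For example, here is a sketch of an explicit character class that accepts one extra letter. It assumes the text is in a single-byte encoding such as ISO-8859-1, where é is byte 0xE9 (octal 351). The sample string is made up for illustration, and UTF-8 text would need a different approach.

```php
<?php
// Assumption: $text is ISO-8859-1 encoded, so "\xE9" is the single byte for e-acute.
$text = "caf\xE9 au lait";

// Letters are listed explicitly instead of relying on \w or the locale.
// \351 is the octal escape for byte 0xE9; the pattern is single-quoted
// so the backslash reaches the regular expression engine unchanged.
preg_match_all('/[A-Za-z\351]+/', $text, $matches);
$words = $matches[0];

echo count($words), " words found\n";
```

Because every accepted character is spelled out in the class, the set of matches does not depend on which locale is active.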
13.3.4 See Also
Recipe 16.3 for information about setting
locales.