PHP CookBook Free Open Book

Recipe 13.5 Choosing Greedy or Nongreedy Matches

13.5.1 Problem

You want your pattern to match the smallest possible string instead of the largest.

13.5.2 Solution

Place a ? after a quantifier to alter that portion of the pattern:

// find all bolded sections
preg_match_all('#<b>.+?</b>#', $html, $matches);

Or, append the U pattern modifier to the end of the pattern to invert all quantifiers from greedy to nongreedy:

// find all bolded sections
preg_match_all('#<b>.+</b>#U', $html, $matches);

13.5.3 Discussion

By default, all regular expressions in PHP are what's known as greedy. This means a quantifier always tries to match as many characters as possible.

For example, take the pattern p.*, which matches a p and then 0 or more characters, and match it against the string php. A greedy regular expression finds one match, because after it grabs the opening p, it continues on and also matches the hp. A nongreedy regular expression, on the other hand, finds a pair of matches. It matches the opening p, but because .*? is satisfied with zero characters, it stops right there and leaves the h and the final p uncaptured. The search then resumes, skips the h (which can't begin a match), and a second match takes the closing p.

The following code shows that the greedy match finds only one hit; the nongreedy ones find two:

print preg_match_all('/p.*/', "php") . "\n";  // greedy
print preg_match_all('/p.*?/', "php") . "\n"; // nongreedy
print preg_match_all('/p.*/U', "php") . "\n"; // nongreedy
1
2
2
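
To see which substrings each call actually captures, and not only how many matches it finds, pass a third argument to preg_match_all( ). This is a minimal sketch; the variable names $greedy, $lazy, and $ungreedy are illustrative and not part of the original recipe:

```php
<?php
// Collect the matched substrings for each variant of the pattern.
preg_match_all('/p.*/', 'php', $greedy);     // greedy
preg_match_all('/p.*?/', 'php', $lazy);      // nongreedy via ?
preg_match_all('/p.*/U', 'php', $ungreedy);  // nongreedy via U

// Index 0 of the result holds the full-pattern matches.
print implode(',', $greedy[0]) . "\n";    // php
print implode(',', $lazy[0]) . "\n";      // p,p
print implode(',', $ungreedy[0]) . "\n";  // p,p
```

The greedy pattern swallows the whole string in one match, while each nongreedy match is just a single p.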

Greedy matching is also known as maximal matching and nongreedy matching can be called minimal matching, because these options match either the maximum or minimum number of characters possible.

Initially, all regular expressions were strictly greedy. Therefore, you can't use this syntax with ereg( ) or ereg_replace( ). Nongreedy matching isn't supported by the older engine that powers these functions; instead, you must use Perl-compatible functions.

Nongreedy matching is frequently useful when trying to perform simplistic HTML parsing. Let's say you want to find all text between bold tags. With greedy matching, you get this:

$html = '<b>I am bold.</b> <i>I am italic.</i> <b>I am also bold.</b>';
preg_match_all('#<b>(.+)</b>#', $html, $bolds);
print_r($bolds[1]);
Array
(
    [0] => I am bold.</b> <i>I am italic.</i> <b>I am also bold.
)

Because there's a second set of bold tags, the pattern extends past the first </b>, which makes it impossible to correctly break up the HTML. If you use minimal matching, each set of tags is self-contained:

$html = '<b>I am bold.</b> <i>I am italic.</i> <b>I am also bold.</b>';
preg_match_all('#<b>(.+?)</b>#', $html, $bolds);
print_r($bolds[1]);
Array
(
    [0] => I am bold.
    [1] => I am also bold.
)

Of course, this can break down if your markup isn't 100% valid, and there are stray bold tags lying around.[2] If your goal is just to remove all (or some) HTML tags from a block of text, you're better off not using a regular expression. Instead, use the built-in function strip_tags( ); it's faster and it works correctly. See Recipe 11.12 for more details.
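
As a rough illustration of the strip_tags( ) approach, the sketch below reuses the $html string from the earlier examples. The variable names $plain and $boldOnly are assumptions made for this example:

```php
<?php
$html = '<b>I am bold.</b> <i>I am italic.</i> <b>I am also bold.</b>';

// Remove every tag, leaving only the text.
$plain = strip_tags($html);
print $plain . "\n";
// I am bold. I am italic. I am also bold.

// Remove all tags except <b>, passed as the optional second argument.
$boldOnly = strip_tags($html, '<b>');
print $boldOnly . "\n";
// <b>I am bold.</b> I am italic. <b>I am also bold.</b>
```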

[2] It's possible to have valid HTML and still get into trouble. For instance, you might have bold tags inside a comment. A true HTML parser ignores this section, but our pattern won't.
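
The comment problem can be reproduced with a short sketch. The sample markup here is made up for illustration:

```php
<?php
// Valid HTML: the first pair of bold tags sits inside a comment.
$html = '<!-- <b>old heading</b> --> <b>I am bold.</b>';

// The pattern has no notion of comments, so it matches both pairs.
preg_match_all('#<b>(.+?)</b>#', $html, $bolds);
print_r($bolds[1]);
// Array
// (
//     [0] => old heading
//     [1] => I am bold.
// )
```

A browser would never display "old heading," yet the pattern reports it as bold text.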

Finally, even though the idea of nongreedy matching comes from Perl, the U modifier is incompatible with Perl and is unique to PHP's Perl-compatible regular expressions. It inverts all quantifiers, turning them from greedy to nongreedy and also the reverse. So, to get a greedy quantifier inside of a pattern operating under a trailing U, just add a ? to the end, the same way you would normally turn a greedy quantifier into a nongreedy one.
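
The inversion can be checked with a small sketch that reuses the $html string from the bold-tag examples. The names $minimal and $maximal are illustrative:

```php
<?php
$html = '<b>I am bold.</b> <i>I am italic.</i> <b>I am also bold.</b>';

// Under U, a bare + is nongreedy, so each set of tags matches separately.
preg_match_all('#<b>(.+)</b>#U', $html, $minimal);

// Under U, +? is greedy again, so one match spans to the last </b>.
preg_match_all('#<b>(.+?)</b>#U', $html, $maximal);

print count($minimal[1]) . "\n";  // 2
print count($maximal[1]) . "\n";  // 1
```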

13.5.4 See Also

Recipe 13.8 for more on capturing text inside HTML tags; Recipe 11.12 for more on stripping HTML tags; documentation on preg_match_all( ) at http://www.php.net/preg-match-all.
