PHP CookBook

Recipe 12.5 Parsing XML with SAX

12.5.1 Problem

You want to parse an XML document and format it on an event basis, such as when the parser encounters a new opening or closing element tag. For instance, you want to turn an RSS feed into HTML.

12.5.2 Solution

Use the parsing functions in PHP's XML extension:

$xml = xml_parser_create();
$obj = new Parser_Object;  // a class to assist with parsing

xml_set_object($xml,$obj);
xml_set_element_handler($xml, 'start_element', 'end_element');
xml_set_character_data_handler($xml, 'character_data');
xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, false);

$fp = fopen('data.xml', 'r') or die("Can't read XML data.");
while ($data = fread($fp, 4096)) {
  xml_parse($xml, $data, feof($fp)) or die("Can't parse XML data");
}       
fclose($fp);

xml_parser_free($xml);

12.5.3 Discussion

These XML parsing functions require the expat library. However, because Apache 1.3.7 and later is bundled with expat, this library is already installed on most machines. Therefore, PHP enables these functions by default, and you don't need to explicitly configure PHP to support XML.

expat parses XML documents and allows you to configure the parser to call functions when it encounters different parts of the file, such as an opening or closing element tag or character data (the text between tags). Based on the tag name, you can then choose whether to format or ignore the data. This is known as event-based parsing and contrasts with DOM XML, which uses a tree-based parser.
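To see what event-based parsing looks like in practice, here is a minimal sketch that records each event the parser reports for a one-element document. The $events array and the anonymous handler functions are illustrative choices, not part of the recipe's code.

```php
<?php
// Record every event the parser reports for a small document.
$events = array();

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

// Called for each opening and closing tag.
xml_set_element_handler(
    $parser,
    function ($p, $tag, $attributes) use (&$events) { $events[] = "start:$tag"; },
    function ($p, $tag) use (&$events) { $events[] = "end:$tag"; }
);

// Called for the text between tags; may fire more than once per element.
xml_set_character_data_handler(
    $parser,
    function ($p, $data) use (&$events) { $events[] = "text:$data"; }
);

xml_parse($parser, '<tag>data</tag>', true) or die("Can't parse XML data");

print implode("\n", $events) . "\n";
```

The list begins with start:tag and ends with end:tag, with the text reported in between. Nothing is kept in memory beyond what the handlers choose to store.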

A popular API for event-based XML parsing is SAX: Simple API for XML. Originally developed only for Java, SAX has spread to other languages. PHP's XML functions follow SAX conventions. For more on the latest version of SAX — SAX2 — see SAX2 by David Brownell (O'Reilly).

PHP supports two interfaces to expat: a procedural one and an object-oriented one. Since the procedural interface practically forces you to use global variables to accomplish any meaningful task, we prefer the object-oriented version. With the object-oriented interface, you can bind an object to the parser and interact with the object while processing XML. This allows you to use object properties instead of global variables.
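As a small sketch of keeping state in object properties rather than globals, the hypothetical Element_Counter class below counts opening tags. It assumes the handlers are passed as array(object, method) callables, which is an alternative to binding the object with xml_set_object( ).

```php
<?php
// Hypothetical handler class: its property replaces a global counter.
class Element_Counter {
    public $count = 0;

    function start_element($parser, $tag, $attributes) {
        $this->count++;
    }

    function end_element($parser, $tag) {
        // Nothing to do when an element closes.
    }
}

$counter = new Element_Counter;
$parser = xml_parser_create();

// Assumption: each handler is given as an array(object, method) callable.
xml_set_element_handler($parser,
    array($counter, 'start_element'),
    array($counter, 'end_element'));

xml_parse($parser, '<a><b/><b/></a>', true) or die("Can't parse XML data");

print $counter->count . "\n";  // 3
```

Because the count lives on $counter, two parsers with two counter objects can run side by side without interfering with each other.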

Here's an example application of expat that shows how to process an RSS feed and transform it into HTML. For more on RSS, see Recipe 12.12. The script starts with the standard XML processing code, followed by the objects created to parse RSS specifically:

$xml = xml_parser_create( );
$rss = new pc_RSS_parser;

xml_set_object($xml, $rss);
xml_set_element_handler($xml, 'start_element', 'end_element');
xml_set_character_data_handler($xml, 'character_data');
xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, false);

$feed = 'http://pear.php.net/rss.php';
$fp = fopen($feed, 'r') or die("Can't read RSS data.");
while ($data = fread($fp, 4096)) {
  xml_parse($xml, $data, feof($fp)) or die("Can't parse RSS data");
}       
fclose($fp);

xml_parser_free($xml);

After creating a new XML parser and an instance of the pc_RSS_parser class, configure the parser. First, bind the object to the parser; this tells the parser to call the object's methods instead of global functions. Then call xml_set_element_handler( ) and xml_set_character_data_handler( ) to specify the method names the parser should call when it encounters elements and character data. The first argument to both functions is the parser instance; the other arguments are the function names. With xml_set_element_handler( ), the middle and last arguments are the functions to call when a tag opens and closes, respectively. The xml_set_character_data_handler( ) function takes only one additional argument — the function to call when it processes character data.

Because an object has been associated with our parser, when that parser finds the string <tag>data</tag>, it calls $rss->start_element( ) when it reaches <tag>; $rss->character_data( ) when it reaches data; and $rss->end_element( ) when it reaches </tag>. The parser can't be configured to automatically call individual methods for each specific tag; instead, you must handle this yourself. However, the PEAR package XML_Transform provides an easy way to assign handlers on a tag-by-tag basis.
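One way to handle per-tag dispatch yourself is to look for a method named after the tag. The sketch below is an assumption about how you might do it, not how XML_Transform works; the Tag_Dispatcher class and the tag_start_ and tag_end_ method prefixes are hypothetical.

```php
<?php
// Hypothetical dispatcher: routes each event to a tag-specific method, if one exists.
class Tag_Dispatcher {
    public $log = array();

    function start_element($parser, $tag, $attributes) {
        // The fixed prefix keeps document-supplied names from reaching other methods.
        $method = 'tag_start_' . $tag;
        if (method_exists($this, $method)) {
            $this->$method($attributes);
        }
    }

    function end_element($parser, $tag) {
        $method = 'tag_end_' . $tag;
        if (method_exists($this, $method)) {
            $this->$method();
        }
    }

    // Handlers for <item> only; every other tag is silently ignored.
    function tag_start_item($attributes) { $this->log[] = 'item opened'; }
    function tag_end_item() { $this->log[] = 'item closed'; }
}

$dispatcher = new Tag_Dispatcher;
$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
xml_set_element_handler($parser,
    array($dispatcher, 'start_element'),
    array($dispatcher, 'end_element'));

$xml = '<channel><title>News</title><item><title>Hi</title></item></channel>';
xml_parse($parser, $xml, true) or die("Can't parse XML data");

print implode(', ', $dispatcher->log) . "\n";  // item opened, item closed
```

Adding support for another tag then means adding a pair of methods rather than extending an if/elseif chain.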

The last XML parser configuration option tells the parser not to automatically convert all tags to uppercase. By default, the parser folds tags into capital letters, so <tag> and <TAG> both become the same element. Since XML is case-sensitive, and most feeds use lowercase element names, this feature should be disabled.
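The effect of case folding is easy to check. In this sketch, reported_tags( ) is a hypothetical helper that returns the element names the parser passes to the start handler, with folding switched on or off.

```php
<?php
// Return the element names the parser reports for $xml.
function reported_tags($xml, $fold) {
    $tags = array();
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $fold);
    xml_set_element_handler(
        $parser,
        function ($p, $tag, $attributes) use (&$tags) { $tags[] = $tag; },
        function ($p, $tag) { /* closing tags are not recorded */ }
    );
    xml_parse($parser, $xml, true) or die("Can't parse XML data");
    return $tags;
}

print implode(',', reported_tags('<rss><item/></rss>', true)) . "\n";   // RSS,ITEM
print implode(',', reported_tags('<rss><item/></rss>', false)) . "\n";  // rss,item
```

With folding left on, a comparison such as 'item' == $tag never matches, because the handler receives ITEM.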

With the parser configured, feed the data to the parser:

$feed = 'http://pear.php.net/rss.php';
$fp = fopen($feed, 'r') or die("Can't read RSS data.");
while ($data = fread($fp, 4096)) {
  xml_parse($xml, $data, feof($fp)) or die("Can't parse RSS data");
}       
fclose($fp);

To curb memory usage, load the file in 4096-byte chunks and feed each piece to the parser in turn. Your handler functions must therefore accommodate text that arrives across multiple calls rather than assume the entire string comes in at once.
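The following sketch imitates chunked reading on a string, using a deliberately tiny chunk size so that both the tags and the text are split across calls to xml_parse( ). The collect_text( ) helper and its 10-byte chunk size are assumptions made for illustration.

```php
<?php
// Feed $xml to the parser in pieces of $chunk_size bytes and collect the text.
function collect_text($xml, $chunk_size) {
    $text = '';
    $calls = 0;

    $parser = xml_parser_create();
    xml_set_character_data_handler($parser,
        function ($p, $data) use (&$text, &$calls) {
            $text .= $data;   // append: one element's text can arrive in pieces
            $calls++;
        });

    $chunks = str_split($xml, $chunk_size);
    $last = count($chunks) - 1;
    foreach ($chunks as $i => $chunk) {
        // The third argument is true only for the final piece.
        xml_parse($parser, $chunk, $i == $last) or die("Can't parse XML data");
    }
    return array($text, $calls);
}

list($text, $calls) = collect_text('<title>PHP 5.0 Released!</title>', 10);
print "$text (delivered in $calls handler calls)\n";
```

The number of handler calls depends on the underlying parser, but the concatenated text is the same as when the document is parsed in one piece. A handler that assigned instead of appending would keep only the last fragment.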

Last, while PHP cleans up any open parsers when the request ends, you can also manually close the parser by calling xml_parser_free( ) .

Now that the generic parsing is properly set up, add the pc_RSS_item and pc_RSS_parser classes, as shown in Example 12-1 and Example 12-2, to handle an RSS document.

Example 12-1. pc_RSS_item
class pc_RSS_item {

  var $title = '';
  var $description = '';
  var $link = '';

  function display() {
    // Escape every value, including the link, before placing it in HTML.
    printf('<p><a href="%s">%s</a><br />%s</p>',
            htmlspecialchars($this->link),
            htmlspecialchars($this->title),
            htmlspecialchars($this->description));
  }
}

Example 12-2. pc_RSS_parser
class pc_RSS_parser {
  
  var $tag;
  var $item;
  
  function start_element($parser, $tag, $attributes) {
    if ('item' == $tag) {
      $this->item = new pc_RSS_item;
    } elseif (!empty($this->item)) {
      $this->tag = $tag;
    }
  }
  
  function end_element($parser, $tag) {
    if ('item' == $tag) {
      $this->item->display();
      unset($this->item); 
    }
  }
  
  function character_data($parser, $data) {
    if (!empty($this->item)) {
      if (isset($this->item->{$this->tag})) {
        $this->item->{$this->tag} .= trim($data);
      }
    }
  }
}  

The pc_RSS_item class provides an interface to an individual feed item. This removes the details of displaying each item from the general parsing code and makes it easy to reset the data for a new item by calling unset( ).

The pc_RSS_item::display( ) method prints out an HTML-formatted RSS item. It calls htmlspecialchars( ) to reencode any necessary entities, because expat decodes them into regular characters while parsing the document. This reencoding, however, breaks on feeds that place HTML in the title and description instead of plaintext.
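A short sketch shows both halves of this behavior: the parser hands the handler decoded text, and htmlspecialchars( ) encodes it again. The parsed_text( ) helper is hypothetical and exists only to collect an element's character data.

```php
<?php
// Return the character data of a one-element document as the handler sees it.
function parsed_text($xml) {
    $text = '';
    $parser = xml_parser_create();
    xml_set_character_data_handler($parser,
        function ($p, $data) use (&$text) { $text .= $data; });
    xml_parse($parser, $xml, true) or die("Can't parse XML data");
    return $text;
}

$title = parsed_text('<title>Q &amp; A: is 1 &lt; 2?</title>');
print $title . "\n";                    // Q & A: is 1 < 2?
print htmlspecialchars($title) . "\n";  // Q &amp; A: is 1 &lt; 2?

// A feed that embeds escaped HTML: the markup is re-escaped, so a browser
// shows the literal <b> tags instead of bold text.
$description = parsed_text('<description>&lt;b&gt;New&lt;/b&gt; release</description>');
print htmlspecialchars($description) . "\n";  // &lt;b&gt;New&lt;/b&gt; release
```

The second case is the breakage described above: the output is safe to display, but the feed author's intended formatting is lost.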

Within the pc_RSS_parser class, the start_element( ) method takes three parameters: the XML parser, the name of the tag, and an array of attribute/value pairs (if any) from the element. PHP automatically supplies these values to the handler as part of the parsing process.

The start_element( ) method checks the value of $tag. If it's item, the parser has found a new RSS item, and a new pc_RSS_item object is instantiated. Otherwise, it checks whether $this->item is empty( ). If it isn't, the parser is inside an item element, and it's necessary to record the tag's name so that the character_data( ) method knows which property to assign its value to. If $this->item is empty, this part of the RSS feed isn't needed by our application, and it's ignored.

When the parser finds a closing item tag, the corresponding end_element( ) method first prints the RSS item, then cleans up by deleting the object.

Finally, the character_data( ) method is responsible for assigning the values of title, description, and link to the RSS item. After making sure it's inside an item element, it checks that the current tag is one of the properties of pc_RSS_item. Without this check, if the parser encountered an element other than those three, its value would also be assigned to the object. The braces ({ }) are needed so that PHP evaluates $this->tag first and uses its value as the property name. Notice how trim($data) is appended to the property instead of being assigned directly. This handles cases in which the character data is split across the 4096-byte chunks retrieved by fread( ); it also removes the surrounding whitespace found in the RSS feed. Be aware that trimming each piece separately also drops a space that happens to fall exactly at the boundary between two pieces.
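The property check and the brace syntax can be tried in isolation. In this sketch, Item_Sketch and append_field( ) are hypothetical stand-ins for pc_RSS_item and the body of character_data( ). It also shows one variation, offered as an assumption rather than the recipe's approach: append the raw pieces and trim once at the end, which keeps a space that lands on a piece boundary.

```php
<?php
// Hypothetical stand-in for pc_RSS_item with two of its properties.
class Item_Sketch {
    public $title = '';
    public $link = '';
}

// Append $data to the property named by $tag, but only if the class declares it.
function append_field($item, $tag, $data) {
    // The braces make PHP use the value of $tag as the property name.
    if (isset($item->{$tag})) {
        $item->{$tag} .= $data;
        return true;
    }
    return false;
}

$item = new Item_Sketch;

// Text for one element arriving in two pieces, with feed whitespace around it.
append_field($item, 'title', "\n  PHP ");
append_field($item, 'title', "5.0 Released!\n");

// An element the class doesn't know about is skipped, not added to the object.
$ignored = !append_field($item, 'guid', 'abc');

// Variation: trim once, after all pieces have been appended.
$item->title = trim($item->title);

print $item->title . "\n";  // PHP 5.0 Released!
var_dump($ignored);         // bool(true)
```

Had each piece been trimmed before appending, the title would read PHP5.0 Released!, because the space at the end of the first piece would be removed.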

If you run the code on this sample RSS feed:

<?xml version="1.0"?>
<rss version="0.93">
<channel>
  <title>PHP Announcements</title>
  <link>http://www.php.net/</link>
  <description>All the latest information on PHP.</description>

  <item>
    <title>PHP 5.0 Released!</title>
    <link>http://www.php.net/downloads.php</link>
    <description>The newest version of PHP is now available.</description>
  </item>
</channel>
</rss>

It produces this HTML:

<p><a href="http://www.php.net/downloads.php">PHP 5.0 Released!</a><br />
The newest version of PHP is now available.</p>

12.5.4 See Also

Recipe 12.4 for tree-based XML parsing with DOM; Recipe 12.12 for more on parsing RSS; documentation on xml_parser_create( ) at http://www.php.net/xml-parser-create, xml_set_element_handler( ) at http://www.php.net/xml-set-element-handler, xml_set_character_data_handler( ) at http://www.php.net/xml-set-character-data-handler, xml_parse( ) at http://www.php.net/xml-parse, and the XML functions in general at http://www.php.net/xml; the official SAX site at http://www.saxproject.org/.
