Processing XML with Java: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

Callback Interfaces

SAX uses the Observer design pattern to tell client applications what's in a document.[1] Java developers are most familiar with this pattern from the event architecture of the AWT and Swing. In that context, the client programmer implements an interface such as MouseListener that receives events through well-known methods. Then the programmer registers the MouseListener object with a component such as a Button using the addMouseListener() method. When the end user moves or clicks the mouse in the button's area, the button invokes a method in the registered MouseListener object. In this example, the Button class plays the role of the Subject; the MouseListener interface plays the role of the Observer; and the client-defined implementation of that interface plays the role of the ConcreteObserver.

[1] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Reading, Mass.: Addison-Wesley, 1995, 293–303.

SAX works in a very similar way, except that XMLReader plays the role of Subject and the org.xml.sax.ContentHandler interface plays the role of Observer. The biggest difference between the AWT and SAX is that SAX does not allow more than one listener to be registered with each XMLReader; registering a second ContentHandler replaces the first. Otherwise, the pattern is exactly the same.
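The one-listener rule can be sketched as follows. This is a minimal illustration, not part of the original examples: the class names SingleHandlerDemo and CountingHandler are hypothetical, the parser is obtained through JAXP's SAXParserFactory rather than XMLReaderFactory, and the handler extends the org.xml.sax.helpers.DefaultHandler adapter class only to keep the sketch short.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SingleHandlerDemo {

  // Hypothetical observer that counts the start-tags it is told about.
  static class CountingHandler extends DefaultHandler {
    int startTags = 0;

    @Override
    public void startElement(String namespaceURI, String localName,
     String qualifiedName, Attributes atts) {
      startTags++;
    }
  }

  // Registers two handlers in turn, parses the document, and returns
  // {start-tags seen by the first, start-tags seen by the second}.
  static int[] parseWithTwoHandlers(String xml) {
    try {
      XMLReader parser
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      CountingHandler first = new CountingHandler();
      CountingHandler second = new CountingHandler();
      parser.setContentHandler(first);
      parser.setContentHandler(second); // replaces first; it does not add
      parser.parse(new InputSource(new StringReader(xml)));
      return new int[] {first.startTags, second.startTags};
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    int[] counts = parseWithTwoHandlers("<a><b/><c/></a>");
    // Only the handler registered last receives events.
    System.out.println("first=" + counts[0] + " second=" + counts[1]);
  }

}
```

If several objects need to see the same events, one option is to register a single handler that forwards each call to the others.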

Example 6.2 shows the SAX ContentHandler interface. Almost any significant SAX program you write is going to use this interface in one form or another.

Example 6.2 The SAX ContentHandler Interface
package org.xml.sax;


public interface ContentHandler {

  public void setDocumentLocator(Locator locator);
  public void startDocument() throws SAXException;
  public void endDocument() throws SAXException;
  public void startPrefixMapping(String prefix, String uri)
   throws SAXException;
  public void endPrefixMapping(String prefix)
   throws SAXException;
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) throws SAXException;
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) throws SAXException;
  public void characters(char[] text, int start, int length)
   throws SAXException;
  public void ignorableWhitespace(char[] text, int start,
   int length) throws SAXException;
  public void processingInstruction(String target, String data)
   throws SAXException;
  public void skippedEntity(String name)
   throws SAXException;

}

The ContentHandler interface declares eleven methods. As the parser (that is, the XMLReader) reads a document, it invokes the methods in this interface. When the parser reads a start-tag, it calls the startElement() method. When the parser reads some text content, it calls the characters() method. When the parser reads an end-tag, it calls the endElement() method. When the parser reads a processing instruction, it calls the processingInstruction() method. The details of what the parser reads, for example the name and attributes of a start-tag, are passed as arguments to the method.
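To make the mapping from document items to callbacks concrete, here is a small sketch that records each callback as a line of text. The class name CallbackLogger and its log() helper are hypothetical, the parser comes from JAXP's SAXParserFactory (an assumption made for convenience), and the class extends the DefaultHandler adapter so that only four callbacks need bodies.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class CallbackLogger extends DefaultHandler {

  // One entry per callback, in the order the parser made the calls.
  final List<String> events = new ArrayList<>();

  @Override
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    events.add("startElement " + qualifiedName
     + " (" + atts.getLength() + " attributes)");
  }

  @Override
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) {
    events.add("endElement " + qualifiedName);
  }

  @Override
  public void characters(char[] text, int start, int length) {
    events.add("characters " + new String(text, start, length));
  }

  @Override
  public void processingInstruction(String target, String data) {
    events.add("processingInstruction " + target + " " + data);
  }

  // Hypothetical helper: parses an in-memory document and returns the log.
  static List<String> log(String xml) {
    try {
      CallbackLogger logger = new CallbackLogger();
      XMLReader parser
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      parser.setContentHandler(logger);
      parser.parse(new InputSource(new StringReader(xml)));
      return logger.events;
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String xml = "<note id=\"1\"><?render bold?>Hi</note>";
    for (String event : log(xml)) {
      System.out.println(event);
    }
  }

}
```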

Order is maintained throughout. That is, the parser always invokes these methods in the same order in which it sees items in the document. In many cases, the parser calls back to these methods immediately. For example, the parser calls the startElement() method as soon as it has read a complete start-tag. It will not read past that start-tag until the startElement() method has returned. This means that you'll generally receive some content from invalid and even malformed documents before the parser detects the error. Consequently, it's important not to take irreversible actions until you've reached the end of a document.
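The following sketch shows content arriving before a well-formedness error is detected. It is an illustration under stated assumptions: PartialParseDemo and eventsBeforeError() are hypothetical names, the parser is the one returned by JAXP's SAXParserFactory, and the sample document is deliberately malformed because its second item element is never closed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class PartialParseDemo {

  // Records start-tag and end-tag callbacks, then notes how the parse ended.
  static List<String> eventsBeforeError(String xml) {
    List<String> events = new ArrayList<>();
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String namespaceURI, String localName,
       String qualifiedName, Attributes atts) {
        events.add("start " + qualifiedName);
      }

      @Override
      public void endElement(String namespaceURI, String localName,
       String qualifiedName) {
        events.add("end " + qualifiedName);
      }
    };

    try {
      XMLReader parser
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      parser.setContentHandler(handler);
      parser.parse(new InputSource(new StringReader(xml)));
      events.add("parse completed");
    }
    catch (SAXException e) {
      // The callbacks made before the error have already run.
      events.add("parse failed");
    }
    catch (IOException | ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
    return events;
  }

  public static void main(String[] args) {
    // Malformed: the second <item> has no end-tag.
    String xml = "<order><item>clock</item><item>lamp</order>";
    for (String event : eventsBeforeError(xml)) {
      System.out.println(event);
    }
  }

}
```

A handler that writes to a database or sends mail from inside these callbacks would already have acted on the first item by the time the parse fails, which is why such work is better deferred until endDocument() or until parse() returns normally.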

Implementing ContentHandler

A concrete example should help clarify this. I'm going to write a very simple program that extracts all the text content from an XML document while stripping out all the tags, comments, and processing instructions. This will be divided into two parts: a class that implements ContentHandler and a class that feeds the document into the parser.

Example 6.3, TextExtractor, is the class that implements ContentHandler. It has to provide all eleven methods declared in ContentHandler. However, the only method that's actually needed is characters(). The other ten are do-nothing methods. They have empty method bodies, and nothing happens when the parser invokes them.

Example 6.3 A SAX ContentHandler That Writes All #PCDATA onto a java.io.Writer
import org.xml.sax.*;
import java.io.*;


public class TextExtractor implements ContentHandler {

  private Writer out;

  public TextExtractor(Writer out) {
    this.out = out;
  }

  public void characters(char[] text, int start, int length)
   throws SAXException {

    try {
      out.write(text, start, length);
    }
    catch (IOException e) {
      throw new SAXException(e);
    }

  }

  // do-nothing methods
  public void setDocumentLocator(Locator locator) {}
  public void startDocument() {}
  public void endDocument() {}
  public void startPrefixMapping(String prefix, String uri) {}
  public void endPrefixMapping(String prefix) {}
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {}
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) {}
  public void ignorableWhitespace(char[] text, int start,
   int length) {}
  public void processingInstruction(String target, String data) {}
  public void skippedEntity(String name) {}

}// end TextExtractor

In addition to the eleven methods declared in ContentHandler, TextExtractor has a constructor and an out field. The constructor sets this field to the Writer on which the parsed text will be output. You can always add as many additional methods, fields, and constructors as you need. You're not limited to just those declared in the interface.

All the real work of this class happens inside characters(). When the parser reads content between tags, it passes this text to the characters() method inside an array of chars. The index of the first character of the text inside that array is given by the start argument. The number of characters is given by the length argument. In this class, the characters() method writes the subarray of text from start to start+length onto the Writer stored in the out field.
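The role of start and length can be seen without a parser by calling characters() directly on a buffer that holds more than the text of interest. This is a simulation under stated assumptions: CharactersSliceDemo, TextCollector, and slice() are hypothetical names, and the buffer contents are made up for illustration. A real parser decides for itself what the array contains.

```java
import org.xml.sax.helpers.DefaultHandler;

class CharactersSliceDemo {

  // Hypothetical handler that accumulates text in memory.
  static class TextCollector extends DefaultHandler {
    final StringBuilder collected = new StringBuilder();

    @Override
    public void characters(char[] text, int start, int length) {
      // Append only the subarray; the rest of the array is not ours.
      collected.append(text, start, length);
    }
  }

  // Simulates one parser callback and returns what the handler kept.
  static String slice(char[] buffer, int start, int length) {
    TextCollector collector = new TextCollector();
    collector.characters(buffer, start, length);
    return collector.collected.toString();
  }

  public static void main(String[] args) {
    // The array holds markup as well as text, as a parser's
    // internal buffer might.
    char[] buffer = "<name>Chez Fred</name>".toCharArray();
    System.out.println(slice(buffer, 6, 9)); // prints "Chez Fred"
  }

}
```

Code that ignores the two int arguments, for example by calling new String(text), would pick up the surrounding markup as well.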

The characters() method in this class invokes the write() method in java.io.Writer. It happens that the write() method is declared to throw an IOException. The ContentHandler interface does not declare that characters() throws IOException; therefore, this exception must be caught. But rather than simply ignoring it or printing a pointless message on System.err, we can wrap the IOException inside SAXException, which characters() is declared to throw, and then throw that exception. This signals the parser that something went wrong, and the parser will pass the exception along to the client application. If the client application wants to know what originally went wrong, it can find out by invoking SAXException's getException() method.
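The round trip from IOException to SAXException and back can be sketched as follows. The names WrappedExceptionDemo, BrokenWriter, Extractor, and originalProblem() are hypothetical, the always-failing Writer exists only to force the error, and the parser is obtained from JAXP's SAXParserFactory as a convenience.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.Writer;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class WrappedExceptionDemo {

  // Hypothetical Writer whose every write fails.
  static class BrokenWriter extends Writer {
    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
      throw new IOException("disk full");
    }
    @Override
    public void flush() {}
    @Override
    public void close() {}
  }

  // Same wrapping technique as TextExtractor's characters() method.
  static class Extractor extends DefaultHandler {
    private final Writer out;

    Extractor(Writer out) {
      this.out = out;
    }

    @Override
    public void characters(char[] text, int start, int length)
     throws SAXException {
      try {
        out.write(text, start, length);
      }
      catch (IOException e) {
        throw new SAXException(e);
      }
    }
  }

  // Parses the document and returns the exception wrapped inside the
  // SAXException, or null if the parse succeeded.
  static Exception originalProblem(String xml) {
    try {
      XMLReader parser
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      parser.setContentHandler(new Extractor(new BrokenWriter()));
      parser.parse(new InputSource(new StringReader(xml)));
      return null;
    }
    catch (SAXException e) {
      return e.getException(); // the wrapped IOException
    }
    catch (IOException | ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(originalProblem("<a>text</a>"));
  }

}
```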

In contrast, none of the do-nothing methods such as startElement() and processingInstruction() will ever throw any exceptions. Therefore, they are not declared to throw SAXException even though ContentHandler would support this declaration. There's no need to clutter up the code with unnecessary throws clauses, nor is it good programming practice to advertise a possible exception in the method signature when you know that exception will never occur.

Using the ContentHandler

By itself the TextExtractor class does nothing. There's no code in the class to invoke any of the methods or parse a document. Although code to do this could be placed in a main() method in TextExtractor, I prefer to place it in a class of its own called ExtractorDriver, which is shown in Example 6.4.

Example 6.4 The Driver Method for the Text Extractor Program
import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.*;


public class ExtractorDriver {

  public static void main(String[] args) {

    if (args.length <= 0) {
      System.out.println(
       "Usage: java ExtractorDriver url"
      );
      return;
    }

    try {
      XMLReader parser = XMLReaderFactory.createXMLReader();

      // Since this just writes onto the console, it's best
      // to use the system default encoding, which is what
      // we get by not specifying an explicit encoding here.
      Writer out = new OutputStreamWriter(System.out);
      ContentHandler handler = new TextExtractor(out);
      parser.setContentHandler(handler);

      parser.parse(args[0]);

      out.flush();
    }
    catch (Exception e) {
      System.err.println(e);
    }

  }

}

The main() method in this class performs the following steps.

  1. Build an instance of XMLReader using the XMLReaderFactory.createXMLReader() method.

  2. Construct a new TextExtractor object.

  3. Pass this object to the setContentHandler() method of the XMLReader.

  4. Pass the URL of the document you want to parse (read from the command line) to the XMLReader's parse() method.

One thing to note: there's still no code that actually invokes the characters() or any other method in the TextExtractor class! This is for the same reason that you never see any code to invoke actionPerformed() or mouseClicked() when writing GUI programs in Java. The code that actually calls these methods is hidden deep inside the class library. You rarely need to concern yourself with it directly. Here the relevant code that calls characters() is hiding somewhere inside the parser-specific implementation of the XMLReader interface.

Let's suppose you run this program over the original XML order document, Example 1.2 from Chapter 1. The results look like this:

C:\>java ExtractorDriver order.xml 

  Chez Fred

    Birdsong Clock
    244
    12
    21.95


    135 Airline Highway
    Narragansett RI 02882

  263.40
  18.44
  8.95
  290.79

The text of the original document, including white space, has been preserved. However, the markup has all been stripped. This is exactly what we asked for.

In the next few sections, we'll explore the individual methods of the ContentHandler interface and their behavior in more detail.

The DefaultHandler Adapter Class

TextExtractor really only used one of the eleven methods declared in ContentHandler. The other ten methods were all do-nothing methods with empty bodies. In fact, few SAX programs actually use all eleven methods. Most of the time about half suffice. To take advantage of this, SAX includes the org.xml.sax.helpers.DefaultHandler convenience class that implements the ContentHandler interface (and several other callback interfaces discussed in the next chapter) with do-nothing methods:

public class DefaultHandler implements ContentHandler,
DTDHandler, EntityResolver, ErrorHandler

Rather than implementing ContentHandler directly and cluttering up your code with irrelevant methods, you can instead extend DefaultHandler. Then you only have to override the methods you actually care about, not all eleven. For example, if TextExtractor were built on top of DefaultHandler, it would be the smaller and simpler class shown in Example 6.5.

Example 6.5 A Subclass of DefaultHandler That Writes All #PCDATA onto a java.io.Writer
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import java.io.*;


public class TextExtractor extends DefaultHandler {

  private Writer out;

  public TextExtractor(Writer out) {
    this.out = out;
  }

  public void characters(char[] text, int start, int length)
   throws SAXException {
    try {
      out.write(text, start, length);
    }
    catch (IOException e) {
      throw new SAXException(e);
    }
  }

}

The programs in this book use content handlers that implement ContentHandler directly and content handlers that extend DefaultHandler, mostly depending on which subjectively feels more natural to the problem at hand. You should feel free to use whichever variation you prefer.
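As a further illustration of the adapter approach, the sketch below overrides only startElement() and endElement() to measure how deeply elements are nested. The class name DepthHandler and its measure() helper are hypothetical, and the parser is obtained from JAXP's SAXParserFactory as a convenience; the remaining callbacks are inherited from DefaultHandler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class DepthHandler extends DefaultHandler {

  private int depth = 0;
  private int maxDepth = 0;

  @Override
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    depth++;
    maxDepth = Math.max(maxDepth, depth);
  }

  @Override
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) {
    depth--;
  }

  // Hypothetical helper: returns the deepest element nesting level.
  static int measure(String xml) {
    try {
      DepthHandler handler = new DepthHandler();
      XMLReader parser
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      parser.setContentHandler(handler);
      parser.parse(new InputSource(new StringReader(xml)));
      return handler.maxDepth;
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(measure("<a><b><c/></b><d/></a>")); // prints 3
  }

}
```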
