Processing XML with Java: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

Parsing

Parsing is the process of reading an XML document and reporting its content to a client application while checking the document for well-formedness. SAX represents parsers as instances of the XMLReader interface. The specific class that implements this interface varies from parser to parser. For example, in Xerces it's org.apache.xerces.parsers.SAXParser. In Crimson it's org.apache.crimson.parser.XMLReaderImpl. Most of the time you don't construct instances of the implementing class directly; instead, you use the static XMLReaderFactory.createXMLReader() factory method to create a parser-specific instance of this interface. Then you pass InputSource objects containing XML documents to the parse() method of XMLReader. The parser reads the document and throws an exception if it detects any well-formedness errors.
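As a minimal sketch of the steps just described, the following class wraps a string in an InputSource and hands it to an XMLReader. The class name InputSourceDemo and the method isWellFormed() are hypothetical names chosen for this illustration; only XMLReader, XMLReaderFactory, InputSource, and SAXException come from SAX itself. Recent JDKs (Java 9 and later) mark XMLReaderFactory as deprecated, although it still works, so the sketch suppresses that warning.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper class for illustration; not part of SAX.
class InputSourceDemo {

  // Returns true if the text parses cleanly. Returns false if the
  // parser reports a problem (or if no parser could be created,
  // since that also surfaces as a SAXException).
  @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated since Java 9
  static boolean isWellFormed(String xml) {
    try {
      // Obtain whichever parser the factory is configured to supply.
      XMLReader parser = XMLReaderFactory.createXMLReader();
      // An InputSource can wrap a Reader, an InputStream, or a system ID.
      InputSource source = new InputSource(new StringReader(xml));
      parser.parse(source);
      return true;
    }
    catch (SAXException e) {
      return false;
    }
    catch (IOException e) {
      // Not expected when reading from an in-memory string.
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // A well-formed document, then one with a mismatched end-tag.
    System.out.println(isWellFormed("<greeting>Hello</greeting>"));
    System.out.println(isWellFormed("<greeting>Hello</greting>"));
  }
}
```

Depending on the parser, a diagnostic for the malformed document may also appear on System.err, because no error handler has been installed.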

SAX in Other Languages

SAX has been unofficially ported to several other object-oriented languages, including C++, Visual Basic, Python, and Perl. The general patterns and names of most functions are the same, but the details of implementation are quite a bit different. For example, C++ doesn't have interfaces but does have multiple inheritance, so ContentHandler, XMLReader and the like become classes containing nothing but pure virtual functions. And because C++ string classes can't handle Unicode, parsers must instead use pointers to arrays of custom types such as XMLCh. Unfortunately, there's no standard C++ binding for SAX, so the custom classes vary from one parser to the next, and you can't easily port C++ SAX programs between different compilers and platforms in either binary or source form.

Although supporting the "Desperate Perl Hacker" was a goal of the original XML working group, Perl has always lagged behind other languages quite a bit when it comes to XML. The initial problem was the lack of support for Unicode, a sine qua non for XML. Today Perl has decent Unicode support. To really handle XML, you need at least version 5.005_52 of Perl, and preferably Perl 5.6.1 or later and ideally Perl 5.8.

Several XML parsers are available for Perl, but far and away the most popular is Larry Wall and Clark Cooper's XML::Parser. This is a wrapper around James Clark's expat [http://www.jclark.com/xml/expat.html], an XML parser written in C. However, this parser isn't truly SAX compatible, even though it's used in a lot of legacy code. New projects should use XML::SAX [http://sax.perl.org/] instead.

In my opinion, however, even with this module, Perl is still not as ideal a language for processing XML as you might expect. Perl's strength is its ability to work with the implicit structure in text documents, such as tab-delimited text files and comma-separated values (CSV) files. However, XML documents tend to have very explicit structure that is easily addressed by a language like Java. Perl's strengths don't come into play, but you still suffer the numerous well-known disadvantages of working with Perl. The inevitable obfuscation of Perl code seems to me too high a price to pay.

Python probably has the best support for SAX and XML of any of the non-Java languages. XML parsing, including a SAX port, has been a standard part of Python since version 2.0. Furthermore, Python has a standard Unicode string type. This is not quite the same as Python's regular string type, but Python's dynamic typing means this isn't nearly as big an inconvenience as it is in C++. However, the fact remains that SAX was designed in and for Java, and Java is certainly the most convenient language with which to write SAX programs.

Example 6.1 demonstrates the complete process with a simple program whose main() method parses a document found at a URL entered on the command line. If this document is well-formed, a simple message to that effect is printed on System.out. Otherwise, if the document is not well-formed, the parser throws a SAXException. If an I/O error such as a broken network connection occurs, then the parse() method throws an IOException. In this case, you don't know whether or not the document is well-formed.

Example 6.1 A SAX Program That Parses a Document
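import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.IOException;

public class SAXChecker {

  public static void main(String[] args) {

    // The URL of the document to check is the only argument.
    if (args.length <= 0) {
      System.out.println("Usage: java SAXChecker URL");
      return;
    }

    try {
      // Ask the factory for whichever parser is configured.
      XMLReader parser = XMLReaderFactory.createXMLReader();
      // parse(String) treats its argument as a system ID (a URL).
      parser.parse(args[0]);
      System.out.println(args[0] + " is well-formed.");
    }
    catch (SAXException e) {
      // parse() throws SAXException on a well-formedness error.
      System.out.println(args[0] + " is not well-formed.");
    }
    catch (IOException e) {
      // The document could not be read, so nothing is known about it.
      System.out.println(
       "Due to an IOException, the parser could not check "
       + args[0]
      );
    }
  }

}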

Note

Don't forget that you'll probably need to install a parser such as Xerces or Ælfred somewhere in your class path before you can compile or run this program. Only Java 1.4 and later include a built-in parser.


This program's output is straightforward. For example, here's the output I got when I first ran it across my Cafe con Leche home page:

%java SAXChecker http://www.cafeconleche.org 
http://www.cafeconleche.org is not well-formed.

After I located and fixed the bugs in that document, I got this output:

%java SAXChecker http://www.cafeconleche.org 
http://www.cafeconleche.org is well-formed.
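Example 6.1 says only whether a document is well-formed, not where it goes wrong. The following variation is a sketch of one way to locate the first error, assuming the parser reports well-formedness errors as SAXParseException, the SAXException subclass that carries a line and column number. The class name SAXErrorLocator and the method firstError() are hypothetical names invented for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical variation on SAXChecker; not part of SAX.
class SAXErrorLocator {

  // Returns null if the document parses cleanly; otherwise returns
  // a description of the first problem the parser reported.
  @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated since Java 9
  static String firstError(InputSource source) {
    try {
      XMLReader parser = XMLReaderFactory.createXMLReader();
      parser.parse(source);
      return null;
    }
    catch (SAXParseException e) {
      // Line and column are -1 if the parser cannot supply them.
      return "line " + e.getLineNumber()
       + ", column " + e.getColumnNumber()
       + ": " + e.getMessage();
    }
    catch (SAXException e) {
      // Some other SAX problem, such as no parser being available.
      return "SAX problem: " + e;
    }
    catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    if (args.length <= 0) {
      System.out.println("Usage: java SAXErrorLocator URL");
      return;
    }
    try {
      // InputSource(String) treats its argument as a system ID (a URL).
      String error = firstError(new InputSource(args[0]));
      if (error == null) {
        System.out.println(args[0] + " is well-formed.");
      }
      else {
        System.out.println(args[0] + " is not well-formed; " + error);
      }
    }
    catch (UncheckedIOException e) {
      System.out.println("Could not read " + args[0]);
    }
  }
}
```

The exact position and wording of the message vary from parser to parser, so treat the reported location as a starting point for the search, not a precise diagnosis.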

However, some readers will encounter a different result when they run this program. In particular, you may get this output:

%java SAXChecker http://www.cafeconleche.org 
org.xml.sax.SAXException: System property org.xml.sax.driver not
specified

What this really means is that your parser has not properly customized its version of the XMLReaderFactory class. Unfortunately, far too many parsers, including Xerces and Crimson, fail to do this. Consequently, you need to set the org.xml.sax.driver Java system property to the fully package-qualified name of the Java class for your parser. For Xerces, it's org.apache.xerces.parsers.SAXParser. For Crimson, it's org.apache.crimson.parser.XMLReaderImpl. For other parsers, consult the parser documentation. You can specify a one-time value for this property using the -D flag to the Java interpreter like this:

%java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser 
 SAXChecker http://www.cafeconleche.org
http://www.cafeconleche.org is well-formed.
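
If you would rather not depend on the -D flag, the same choice can be made in code. The sketch below first tries the no-argument factory method and, only if that fails, falls back to a list of known parser classes using XMLReaderFactory.createXMLReader(String). The class name ParserLocator and the method createReader() are hypothetical names for this illustration; the two driver class names are the Xerces and Crimson classes mentioned above, and either may be absent from your class path.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper class for illustration; not part of SAX.
class ParserLocator {

  // Tries the factory's default first, then each named driver class
  // in order. Rethrows the original failure if nothing can be loaded.
  @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated since Java 9
  static XMLReader createReader(String... driverClassNames)
   throws SAXException {
    try {
      // Uses the org.xml.sax.driver property if it is set.
      return XMLReaderFactory.createXMLReader();
    }
    catch (SAXException noDefault) {
      for (String name : driverClassNames) {
        try {
          return XMLReaderFactory.createXMLReader(name);
        }
        catch (SAXException notAvailable) {
          // This driver is not on the class path; try the next one.
        }
      }
      throw noDefault;
    }
  }

  public static void main(String[] args) throws SAXException {
    XMLReader parser = createReader(
     "org.apache.xerces.parsers.SAXParser",
     "org.apache.crimson.parser.XMLReaderImpl"
    );
    System.out.println("Using " + parser.getClass().getName());
  }
}
```

Trying the default first means a value supplied with -D still takes precedence over the hard-coded list.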