Processing XML with Java: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

DOM Parsers for Java

DOM is defined almost completely in terms of interfaces rather than classes. Different parsers provide their own custom implementations of these standard interfaces. This offers a great deal of flexibility. Generally you do not install the DOM interfaces on their own. Instead they come bundled with a parser distribution that provides the detailed implementation classes. DOM isn't quite as broadly supported as SAX, but most of the major Java parsers provide it, including Crimson, Xerces, XML for Java, the Oracle XML Parser for Java, and GNU JAXP.

DOM is not complete unto itself. Almost all significant DOM programs need to use some parser-specific classes. DOM programs are not too difficult to port from one parser to another, but a recompile is normally required. You can't just change a system property to switch from one parser to another, as you can with SAX. In particular, DOM2 does not specify how one parses a document, creates a new document, or serializes a document into a file or onto a stream. These important functions are all performed by parser-specific classes.

JAXP, the Java API for XML Processing, fills in a few of the holes in DOM by providing standard parser-independent means to parse existing documents, create new documents, and serialize in-memory DOM trees to XML files. Most current Java parsers that support DOM2 also support JAXP 1.1. JAXP is a standard part of Java 1.4. Although JAXP is not included in earlier versions of Java, it does work with Java 1.1 and later and is bundled with most parser class libraries. DOM3 promises to fill the same holes that JAXP fills (that is, parsing, serializing, and bootstrapping), but it is not yet finished and not yet widely supported by parsers.
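The three operations JAXP standardizes (parsing, creating, and serializing) can be sketched as follows. This is only a minimal illustration, assuming a JAXP implementation is available on the class path; the class name JaxpDomSketch and its helper methods are hypothetical, and serialization here goes through an identity TrAX transform, which is one common approach rather than the only one.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative only.
public class JaxpDomSketch {

    // Locate whichever parser JAXP is configured to use.
    static DocumentBuilder newBuilder() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder();
        } catch (Exception e) {
            throw new IllegalStateException("No usable JAXP parser", e);
        }
    }

    // Parse XML text into a DOM Document without naming a parser class.
    static Document parse(String xml) {
        try {
            return newBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    // Create a new, otherwise empty Document with a single root element.
    static Document create(String rootName) {
        Document doc = newBuilder().newDocument();
        doc.appendChild(doc.createElement(rootName));
        return doc;
    }

    // Serialize a DOM tree to a string by way of an identity transform.
    static String serialize(Document doc) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        Document parsed = parse("<greeting>Hello</greeting>");
        System.out.println(parsed.getDocumentElement().getTagName());
        System.out.println(serialize(create("greeting")));
    }
}
```

Nothing in this sketch imports a parser-specific class, so switching parsers should be a matter of changing the class path or the JAXP configuration rather than the code.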

Because DOM depends so heavily on parser classes, its performance characteristics vary widely from one parser to the next. Speed is something of a concern, but memory consumption is a much bigger issue for most applications. All DOM implementations I've seen use more space for the in-memory DOM tree than the actual file on the disk occupies. Generally the in-memory DOM trees range from three to ten times as large as the actual XML text. Some parsers, including Xerces, offer a "lazy DOM" that leaves most of the document on the disk and reads into memory only those parts of the document that the client actually requests.
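As a rough sketch of how a client might ask for the lazy DOM, Xerces documents a feature named http://apache.org/xml/features/dom/defer-node-expansion. The following assumes that the parser JAXP locates is Xerces or a derivative that recognizes this feature URI; other parsers may reject it, in which case the sketch falls back to that parser's defaults. The class name DeferredDomSketch and its methods are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative only.
public class DeferredDomSketch {

    // Feature URI documented by Xerces. This is an assumption about the
    // underlying parser: other implementations may not recognize it.
    static final String DEFER_NODE_EXPANSION =
        "http://apache.org/xml/features/dom/defer-node-expansion";

    // Returns true if the factory accepted the requested setting.
    static boolean requestDeferred(DocumentBuilderFactory factory, boolean deferred) {
        try {
            factory.setFeature(DEFER_NODE_EXPANSION, deferred);
            return true;
        } catch (ParserConfigurationException e) {
            return false; // this parser does not know the feature
        }
    }

    // Parse XML text, asking for the lazy DOM when the parser allows it.
    static Document parse(String xml, boolean deferred) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            if (!requestDeferred(factory, deferred)) {
                // Fall back to a fresh factory with the parser's defaults.
                factory = DocumentBuilderFactory.newInstance();
            }
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        System.out.println("Feature accepted: " + requestDeferred(factory, true));
        Document doc = parse("<a><b/></a>", true);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Whether the lazy DOM actually saves memory depends on how much of the tree the program ends up visiting, so it is worth measuring with your own documents.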

Another distinguishing factor between different DOM implementations is the extra features the parser provides. Most parsers provide methods to parse XML documents and serialize DOM trees to XML. Other useful features include schema validation, database access, XInclude, XSLT, XPath, support for different character sets, and application-specific DOMs like the MathML, SVG, and WML DOMs.

For example, the Oracle and Xerces parsers provide schema validation. Ælfred and Crimson don't. Ælfred has partial support for XInclude. The other three don't. The Oracle XML parser can produce a DOM Document object from a SQL query against a relational database or a JDBC ResultSet object. The other three can't. The Oracle XML parser can decode the WAP binary XML format. The other three can't. Xerces has specialized DOMs for HTML and WML documents. The other three don't. These are all nonstandard features; but if they're useful to you, that would be a good reason to choose one parser over another. Table 9.2 summarizes parser support for various useful features.

Measuring DOM Size

To test the memory usage of various implementations, I wrote a simple program that loaded the second edition of the XML 1.0 specification into a DOM Document object. The specification's text format is 197K (not including the DTD, which adds another 56K but isn't really modeled by DOM at all). Following is the approximate amount of memory that various parsers used to build Document objects from this file:

  • Xerces-J 2.0.1: 1489K

  • Crimson 1.1.3 (JDK 1.4 default): 1230K

  • Oracle XML Parser for Java 9.2.0.2.0: 2500K

I used a couple of different techniques to measure the memory used. In one case, I used OptimizeIt and the Java Virtual Machine Profiling Interface (JVMPI) to check the heap size. I ran the program both with and without loading the document. I subtracted the total heap memory used without loading the document from the memory used when the document was loaded to get the numbers reported above. In the other test, I used the Runtime class to measure the total memory and the free memory before and after the Document was created. In both cases, I garbage collected before taking the final measurements. The results from the separate tests were within 15 percent of each other. I performed all tests in Sun's JDK 1.4.0 using HotSpot on Windows NT 4.0 SP6.
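The second technique, the one based on the Runtime class, can be sketched roughly as follows. This is not the original test program: the class name DomMemoryEstimate, the synthetic document built by sampleXml(), and the choice to parse from a string rather than a file are all assumptions made to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical measurement harness; the names are illustrative only.
public class DomMemoryEstimate {

    // Heap currently in use, after asking the VM to collect garbage.
    // gc() is only a request, so the result is approximate.
    static long usedHeap() {
        Runtime runtime = Runtime.getRuntime();
        runtime.gc();
        return runtime.totalMemory() - runtime.freeMemory();
    }

    // Build a synthetic document with the given number of <item> elements.
    static String sampleXml(int items) {
        StringBuilder sb = new StringBuilder("<items>");
        for (int i = 0; i < items; i++) {
            sb.append("<item id=\"").append(i).append("\">text ")
              .append(i).append("</item>");
        }
        return sb.append("</items>").toString();
    }

    // Parse with whichever DOM parser JAXP locates.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String xml = sampleXml(20000);
        long before = usedHeap();
        Document doc = parse(xml);
        long after = usedHeap();
        System.out.println("XML text: " + xml.length() + " characters");
        System.out.println("Approximate DOM size: " + (after - before) / 1024 + "K");
        // Use the document after measuring so it stays reachable until here.
        System.out.println("Root: " + doc.getDocumentElement().getTagName());
    }
}
```

Numbers obtained this way shift from run to run because the VM is free to ignore or delay the collection request, so they are best treated as rough estimates.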

I don't claim these numbers to be exact, and I certainly don't think this one test document justifies any claims whatsoever about the relative efficiency of the different DOM implementations. The difference between Crimson and Xerces is well within my margin of error. A more serious test would have to look at how the different implementations scale with the size of the initial document, and perhaps graph the curves of memory size versus file size. For example, it's possible that each of these requires a minimum of 1024K per document, but grows relatively slowly after that point. I did run the same tests on a minimal document that contained a single empty element. The results ranged from 3K to 131K for this document. However, these numbers were extremely sensitive to exactly when and how garbage was collected. I wouldn't claim the results are accurate to better than ±300K. However, I do think that together these tests demonstrate just how inefficient DOM is.

Table 9.2. DOM Parser Features
(X = supported; - = not supported)

Feature          Xerces       Ælfred   Oracle  Crimson
DTDs             X            X        X       X
Schemas          X            -        X       -
Namespaces       X            X        X       X
Lazy DOM         X            -        -       -
HTML DOM         X            -        -       -
Views            -            -        -       -
Stylesheets      -            -        -       -
CSS              -            -        -       -
CSS2             -            -        -       -
Events           X            X        X       -
UI events        -            X        -       -
Mouse events     -            -        -       -
Mutation events  X            X        -       -
HTML events      -            X        -       -
Traversal        X            Partial  X       -
Range            -            -        X       -
XSLT/XPath       Via Xalan-J  -        X       -
XInclude         -            X        -       -
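
The standard DOM modules in this table (though not extras such as schema validation or XInclude) can also be checked at runtime through the hasFeature() method of DOMImplementation. The following is a minimal sketch, assuming JAXP is used to locate the implementation and that the DOM Level 2 feature names apply; the class name DomFeatureProbe is illustrative. The answers reflect only what the installed implementation claims about itself, so they need not match the parser versions listed in the table.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;

// Hypothetical probe class; the names are illustrative only.
public class DomFeatureProbe {

    // Feature names defined for the DOM Level 2 modules.
    static final String[] MODULES = {
        "Core", "XML", "HTML", "Views", "StyleSheets", "CSS", "CSS2",
        "Events", "UIEvents", "MouseEvents", "MutationEvents", "HTMLEvents",
        "Traversal", "Range"
    };

    // Obtain the DOMImplementation of whichever parser JAXP locates.
    static DOMImplementation implementation() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().getDOMImplementation();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable JAXP parser", e);
        }
    }

    // Ask the implementation whether it claims DOM Level 2 support for a module.
    static boolean supports(String module) {
        return implementation().hasFeature(module, "2.0");
    }

    public static void main(String[] args) {
        for (String module : MODULES) {
            System.out.println(module + ": " + supports(module));
        }
    }
}
```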
