Processing XML with Java: A Guide to SAX, DOM, JDOM, JAXP, and TrAX


Handling Attributes

Attributes are not reported through separate callbacks. Instead an Attributes object containing all the attributes of an element is passed to the startElement() method for the start-tag or empty-element tag of the element that possesses the attributes. Example 6.8 summarizes the Attributes interface.

Example 6.8 The SAX Attributes Interface
package org.xml.sax;


public interface Attributes {

  public int    getLength();

  public String getQName(int index);
  public String getURI(int index);
  public String getLocalName(int index);
  public int    getIndex(String uri, String localPart);
  public int    getIndex(String qualifiedName);
  public String getType(String uri, String localName);
  public String getType(String qualifiedName);
  public String getType(int index);
  public String getValue(String uri, String localName);
  public String getValue(String qualifiedName);
  public String getValue(int index);

}

If you know the qualified name or namespace URI and local name of the attribute you want, Attributes can look up its value and type. If you don't know the names of the attributes at compile-time, you can iterate through all of the attributes of an element instead. Attributes are unordered. However, for programmer convenience the Attributes interface is designed as a list. You can ask for the value, local name, qualified name, type, and namespace URI of an attribute by giving its index into the list. Just don't assume that the order of the attributes in this list is necessarily the same as in the original document. More often than not, it isn't.
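Iterating by index might look something like the following minimal sketch. The class name AttributeLister and its list() helper are invented for illustration, and the sketch assumes the JAXP parser bundled with the JDK rather than any particular parser discussed in this chapter.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of SAX
public class AttributeLister extends DefaultHandler {

  // One line of text per attribute seen
  private final List<String> lines = new ArrayList<String>();

  @Override
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    // Walk the list by index. The order is not guaranteed
    // to match the order in the document.
    for (int i = 0; i < atts.getLength(); i++) {
      lines.add(qualifiedName + "/@" + atts.getQName(i)
       + " = \"" + atts.getValue(i) + "\" (" + atts.getType(i) + ")");
    }
  }

  // Parses the XML text and returns one line per attribute
  public static List<String> list(String xml) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    XMLReader parser = factory.newSAXParser().getXMLReader();
    AttributeLister lister = new AttributeLister();
    parser.setContentHandler(lister);
    parser.parse(new InputSource(new StringReader(xml)));
    return lister.lines;
  }

  public static void main(String[] args) throws Exception {
    for (String line : list("<book id=\"b1\" lang=\"en\"/>")) {
      System.out.println(line);
    }
  }

}
```

Because the handler never asks for an attribute by name, it works for any document, at the cost of making no assumptions about which attribute comes first.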

The type of the attribute is reported as one of these nine constant strings, exactly as types would be indicated in an ATTLIST declaration in a DTD:

  • CDATA

  • ID

  • IDREF

  • IDREFS

  • NMTOKEN

  • NMTOKENS

  • ENTITY

  • ENTITIES

  • NOTATION

Enumerated types are reported as having type NMTOKEN. Undeclared attributes are reported as having type CDATA. SAX does not yet support schema types such as int or gYear. Maybe in SAX 3.0.

Caution

A few parsers are not 100 percent compliant with the SAX specification here. In particular, Crimson and Xerces 2.0.x use the string ENUMERATION for enumerated types instead of NMTOKEN. Xerces 1.4. reports an enumerated type as a string containing the actual enumeration, for example, ( yes | no | maybe).
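As a rough illustration of type reporting, the sketch below parses a small document whose internal DTD subset declares two attributes and leaves a third undeclared. The class name AttributeTypes, the typesIn() helper, and the sample document are all made up for this example. It assumes the JAXP parser bundled with the JDK and deliberately avoids enumerated types, since those are where parsers disagree.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of SAX
public class AttributeTypes extends DefaultHandler {

  // Sample document: id and friends are declared, nickname is not
  public static final String SAMPLE =
      "<!DOCTYPE person [\n"
    + "  <!ATTLIST person id      ID     #REQUIRED\n"
    + "                   friends IDREFS #IMPLIED>\n"
    + "]>\n"
    + "<person id='p1' friends='p1' nickname='Al'/>";

  // Maps each attribute's qualified name to its reported type
  private final Map<String, String> types = new HashMap<String, String>();

  @Override
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    for (int i = 0; i < atts.getLength(); i++) {
      types.put(atts.getQName(i), atts.getType(i));
    }
  }

  // Parses the XML text and returns the type of every attribute seen
  public static Map<String, String> typesIn(String xml) throws Exception {
    XMLReader parser
     = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    AttributeTypes handler = new AttributeTypes();
    parser.setContentHandler(handler);
    parser.parse(new InputSource(new StringReader(xml)));
    return handler.types;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(typesIn(SAMPLE));
  }

}
```

The undeclared nickname attribute should come back as CDATA, while the two declared attributes should carry the types given in the ATTLIST declaration.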


If a declared attribute has any type other than CDATA, then the parser normalizes its value. This means that all tabs, carriage returns, and linefeeds are converted to a single space; runs of spaces are converted to a single space; and leading and trailing white space is stripped. Only normalized values are reported by the getValue() methods. However, in order to determine an attribute type, the parser must read the DTD. If an attribute is declared in the external DTD subset, then nonvalidating parsers that do not read the external subset will assume the attribute has type CDATA, and fail to normalize.
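The effect of normalization can be seen by giving the same padded value to a declared NMTOKENS attribute and to an undeclared one. This is only a sketch: the class name NormalizationDemo, the valueOf() helper, and the sample document are invented here, and the JAXP parser bundled with the JDK is assumed. The declaration sits in the internal DTD subset, so even a nonvalidating parser reads it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of SAX
public class NormalizationDemo extends DefaultHandler {

  // tokens is declared NMTOKENS; free is undeclared and so CDATA
  public static final String SAMPLE =
      "<!DOCTYPE doc [\n"
    + "  <!ATTLIST doc tokens NMTOKENS #IMPLIED>\n"
    + "]>\n"
    + "<doc tokens='  red   green  ' free='  red   green  '/>";

  private final String wanted;  // qualified name to look for
  private String value;         // value found, or null

  private NormalizationDemo(String wanted) {
    this.wanted = wanted;
  }

  @Override
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    String found = atts.getValue(wanted);
    if (found != null) value = found;
  }

  // Returns the value the parser reports for the named attribute
  public static String valueOf(String xml, String name) throws Exception {
    XMLReader parser
     = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    NormalizationDemo handler = new NormalizationDemo(name);
    parser.setContentHandler(handler);
    parser.parse(new InputSource(new StringReader(xml)));
    return handler.value;
  }

  public static void main(String[] args) throws Exception {
    // Brackets make leading and trailing spaces visible
    System.out.println("[" + valueOf(SAMPLE, "tokens") + "]");
    System.out.println("[" + valueOf(SAMPLE, "free") + "]");
  }

}
```

Moving the ATTLIST declaration into an external DTD subset that the parser does not read would be expected to leave both values padded, which is the failure mode described above.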

If you ask an Attributes object for information about an attribute (for example, type, name, or value) that is not in that particular list, then all of the methods that normally return a String return null instead. The getIndex() methods return -1. None of these methods throws any exceptions. However, if you try to use the return values without checking for null or -1 first, then you're asking for a NullPointerException or an ArrayIndexOutOfBoundsException. SAX 2.0 does not distinguish between attributes that were present in the instance document and attributes that were defaulted in from the DTD or schema. This may be added in SAX 2.1.
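One way to guard against these return values is to wrap the lookups in small helper methods. The following sketch uses the AttributesImpl helper class from org.xml.sax.helpers to build a list by hand, so no parser is needed. The class name SafeAttributes and its two methods are hypothetical conveniences, not part of SAX.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical example class; not part of SAX
public class SafeAttributes {

  // Returns the attribute's value, or the fallback if it is absent
  public static String valueOrDefault(Attributes atts,
   String qualifiedName, String fallback) {
    String value = atts.getValue(qualifiedName);
    return (value != null) ? value : fallback;
  }

  // getIndex() returns -1 for an attribute that is not in the list
  public static boolean has(Attributes atts, String qualifiedName) {
    return atts.getIndex(qualifiedName) != -1;
  }

  public static void main(String[] args) {
    AttributesImpl atts = new AttributesImpl();
    // Arguments: namespace URI, local name, qualified name, type, value
    atts.addAttribute("", "lang", "lang", "CDATA", "en");

    System.out.println(has(atts, "lang"));
    System.out.println(has(atts, "missing"));
    System.out.println(valueOrDefault(atts, "lang", "unknown"));
    System.out.println(valueOrDefault(atts, "missing", "unknown"));
  }

}
```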

For an example, I'm going to develop a web spider that follows simple XLinks. XLink is an attribute-based syntax for embedding hypertext in arbitrary XML documents. Elements are identified as XLinks by an xlink:type attribute with the value simple. (There's also a more powerful and more complex extended XLink, which I'm going to ignore for the purposes of this example.) The URL the link points to is contained in an xlink:href attribute. The xlink prefix is mapped to the namespace URI http://www.w3.org/1999/xlink. As always, the prefix can change as long as the URI stays the same. For example, this is an XLink that points to The Nation's home page:

<magazine xmlns:xlink="http://www.w3.org/1999/xlink"
 xlink:type="simple" xlink:href="http://www.thenation.com/">
  The Nation
</magazine>

Note especially that the element name and content are irrelevant to the link, which is encoded purely in attributes. The same link could be written as follows:

<foo xmlns:xlink="http://www.w3.org/1999/xlink"
 xlink:type="simple" xlink:href="http://www.thenation.com/">
  Foo
</foo>
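To see why the lookup should go through the namespace URI rather than the prefix, consider this small sketch. It uses the namespace URI defined by the XLink specification, http://www.w3.org/1999/xlink, and it finds the link whether the document binds that URI to the prefix xlink or to some other prefix. The class name XLinkFinder and the find() helper are invented for illustration, and the JAXP parser bundled with the JDK is assumed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of SAX
public class XLinkFinder extends DefaultHandler {

  public static final String XLINK_NS = "http://www.w3.org/1999/xlink";

  private final List<String> hrefs = new ArrayList<String>();

  @Override
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    // Look up by namespace URI and local name; the prefix is irrelevant
    String type = atts.getValue(XLINK_NS, "type");
    String href = atts.getValue(XLINK_NS, "href");
    if ("simple".equals(type) && href != null) {
      hrefs.add(href);
    }
  }

  // Returns the targets of all simple XLinks in the XML text
  public static List<String> find(String xml) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    // JAXP factories are not namespace aware unless asked to be
    factory.setNamespaceAware(true);
    XMLReader parser = factory.newSAXParser().getXMLReader();
    XLinkFinder finder = new XLinkFinder();
    parser.setContentHandler(finder);
    parser.parse(new InputSource(new StringReader(xml)));
    return finder.hrefs;
  }

  public static void main(String[] args) throws Exception {
    String doc = "<foo xmlns:xl='" + XLINK_NS + "'"
     + " xl:type='simple' xl:href='http://www.thenation.com/'>Foo</foo>";
    System.out.println(find(doc));
  }

}
```

A lookup by qualified name, such as getValue("xlink:href"), would miss the link in the document above because its prefix is xl.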

All of the information required to process the link is included in the attributes. Consequently, we can use the Attributes interface and the startElement() method to design a spider that follows XLinks. Example 6.9 is such a program. Currently this spider does nothing more than follow the links and print their URLs, but it would not be hard to add code to load the discovered documents into a database or perform some other useful operation.

Example 6.9 A ContentHandler Class That Spiders XLinks
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;
import java.util.*;


public class SAXSpider extends DefaultHandler {

  // Need to keep track of where we've been
  // so we don't get stuck in an infinite loop
  private List spideredURIs = new Vector();

  // This linked list keeps track of where we're going.
  // Although the LinkedList class does not guarantee queue-like
  // access, I always access it in a first-in/first-out fashion.
  private LinkedList queue = new LinkedList();

  private String    currentURI;
  private XMLReader parser;

  public SAXSpider(XMLReader parser, String uri) {
    this.parser = parser;
    this.currentURI = uri;
  }

  public void endDocument() {

    spideredURIs.add(currentURI);
    System.out.println("Visited " + currentURI);
    String uri;
    try {
      uri = (String) queue.removeLast();
    }
    catch (NoSuchElementException e) {
      // The queue is empty; we're finished.
      return;
    }
    this.currentURI = uri;
    try {
      parser.parse(uri);
    }
    catch (Exception e) {
      // just skip this one and move on to the next
      this.endDocument();
    }

  }

  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {

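    String type
     = atts.getValue("http://www.w3.org/1999/xlink", "type");
    // Only simple links are followed; extended links are ignored
    if ("simple".equals(type)) {
      String href
       = atts.getValue("http://www.w3.org/1999/xlink", "href");
      if (href != null) {
        // Skip URLs already visited or already waiting in the queue
        if (!spideredURIs.contains(href) && !queue.contains(href)) {
          queue.addFirst(href);
        }
      }
    }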

  }


  public static void main(String[] args) {

    if (args.length == 0) {
      System.out.println("Usage: java SAXSpider URL1");
      // Nothing to parse; without this return args[0] would throw
      return;
    }
    String uri = args[0];

    try {
      XMLReader parser = XMLReaderFactory.createXMLReader(
        "org.apache.xerces.parsers.SAXParser"
      );

      // Install the ContentHandler
      ContentHandler spider = new SAXSpider(parser, uri);
      parser.setContentHandler(spider);
      parser.parse(uri);

    }
    catch (Exception e) {
      System.err.println(e);
    }

  }// end main

}// end SAXSpider

The startElement() method simply inspects each element's attributes for the two relevant XLink attributes, xlink:type and xlink:href. It looks for them by namespace URI and local name. If it finds a link whose URL it hasn't yet visited, then it adds that URL to the queue of URLs that need to be visited.

The endDocument() method prints out the URL of the document it has just finished parsing. Then it removes the oldest URL from the queue and parses it. This program is a little unusual in that not only does the XMLReader call back to the ContentHandler, but the ContentHandler also calls back to its XMLReader.

The main() method reads the starting URL from the command line, constructs an XMLReader and a SAXSpider, and parses the initial URL. The program runs automatically from there. There's no limit to the depth or number of documents this spider will search, although currently the paucity of XLinked documents on the Web makes it unlikely that this program will run forever. Furthermore, because it isn't designed to run in parallel, there's little chance of it overwhelming anybody's server. Nonetheless, limiting its search depth would be a good feature to add.
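A depth limit could be added in many ways. One possibility, sketched below, is to store each queued URL together with its distance from the starting document and to refuse links found in documents that are already at the maximum depth. Everything here is hypothetical: the DepthLimitedFrontier class, its offer() and next() methods, and the idea of having startElement() call offer() and endDocument() call next() are assumptions for illustration, not part of Example 6.9.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical helper; not part of SAX or of Example 6.9
public class DepthLimitedFrontier {

  // A queued URL plus its distance in links from the start document
  private static final class Entry {
    final String uri;
    final int depth;
    Entry(String uri, int depth) {
      this.uri = uri;
      this.depth = depth;
    }
  }

  private final Queue<Entry> queue = new ArrayDeque<Entry>();
  private final Set<String> seen = new HashSet<String>();
  private final int maxDepth;
  private int currentDepth = 0;  // depth of the document being parsed

  public DepthLimitedFrontier(String startURI, int maxDepth) {
    this.maxDepth = maxDepth;
    seen.add(startURI);
  }

  // Call for each link found in the current document.
  // Returns true if the link was queued.
  public boolean offer(String uri) {
    if (currentDepth >= maxDepth) return false;  // too deep
    if (!seen.add(uri)) return false;            // already known
    queue.add(new Entry(uri, currentDepth + 1));
    return true;
  }

  // Returns the next URL to parse in first-in/first-out order,
  // or null when nothing is left
  public String next() {
    Entry entry = queue.poll();
    if (entry == null) return null;
    currentDepth = entry.depth;
    return entry.uri;
  }

  public static void main(String[] args) {
    DepthLimitedFrontier frontier
     = new DepthLimitedFrontier("http://www.example.com/start.xml", 1);
    System.out.println(frontier.offer("http://www.example.com/a.xml"));
    System.out.println(frontier.next());
    // a.xml is at depth 1, the maximum, so its links are refused
    System.out.println(frontier.offer("http://www.example.com/b.xml"));
  }

}
```

Tracking every URL in a single seen set also keeps a document from being queued twice when several pages link to it.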
