Processing XML with Java: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

XML as a Message Format

One of the major uses of XML is the exchange of data between heterogeneous systems. Given almost any collection of data, it's straightforward to design some XML markup that fits it. Because XML is natively supported on essentially any platform of interest, you can send data encoded in such an XML application from point A to point B without worrying about whether point A and point B agree on how many bytes there are in a float, whether ints are big-endian or little-endian, whether strings are null delimited or use an initial length byte, or any of the myriad of other issues that arise when moving data between systems. As long as both ends of the connection agree on the XML application used, they can exchange information regardless of what software produced the data. One side can use Perl and the other Java. One can use Windows and the other Unix. One can run on a mainframe and the other on a Mac. The document can be passed over HTTP, e-mail, NFS, BEEP, Jabber, or sneakernet. Everything except the XML document itself can be ignored.

The details of the XML markup used depend heavily on the information being exchanged. If you're exchanging financial data, you might use the Open Financial Exchange (OFX) [http://www.ofx.net/ofx/]. If you're exchanging genetic codes, you might use the Gene Expression Markup Language (GEML) [http://www.rosettabio.com/products/conductor/geml/]. If you're exchanging news articles in a syndication service, you might use NewsML [http://www.xmlnews.org/NewsML/]. And if no standard XML application exists that fits your needs, you'll probably invent your own. But whatever XML application you choose, certain features will crop up again and again that can benefit from standardization. These include the envelope used to pass the data and the representations of basic data types, such as integer and date.

Envelopes

An envelope may not be needed if (a) only two systems are involved, (b) they talk only to each other, and (c) they always send the same type of message. It's enough for one system to send the other the message in the agreed-upon XML format. However, when there are many dozens, hundreds, or even thousands of different systems exchanging many different kinds of messages in many different ways, it's useful to have some standards that are independent of the message content. This offers up some hope that when a message in an unrecognized format is received, it can still be processed in a reasonable fashion. For example, a system might receive a message ordering 1,000 "Frodo Lives" buttons but not know how to handle that order. However, it may be able to read enough information from the envelope to route the request to the program that does know how to process the order.
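The routing idea can be sketched in a few lines of Java. Everything about the message format here is an assumption made up for illustration: the Envelope, Header, Recipient, and Body element names, the order-desk address, and the EnvelopeRouter class do not come from any standard. The point is only that the router reads the header and never needs to understand what's inside the body.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class EnvelopeRouter {

    /** Returns the text of Envelope/Header/Recipient, or null if it is missing. */
    static String recipientOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // Look only at the header; the body may use any vocabulary at all.
            Element header = child(doc.getDocumentElement(), "Header");
            Element recipient = (header == null) ? null : child(header, "Recipient");
            return (recipient == null) ? null : recipient.getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed message", e);
        }
    }

    /** Finds the first direct child element with the given name. */
    private static Element child(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && name.equals(n.getNodeName())) {
                return (Element) n;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical message: the router does not know what a ButtonOrder is.
        String message =
            "<Envelope>"
          + "<Header><Recipient>order-desk</Recipient></Header>"
          + "<Body><ButtonOrder slogan='Frodo Lives' quantity='1000'/></Body>"
          + "</Envelope>";
        System.out.println("Route to: " + recipientOf(message));
    }
}
```

A real envelope format would normally put its elements in a namespace so that they can't collide with names used in the body.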

In XML-RPC, the envelope is essentially all the markup, and the data inside the envelope is all the text content. SOAP and RSS are a little more complex. For SOAP, the envelope is an XML document, and the data is too. In some ways RSS, especially RSS 1.0, is the most complex of all because it's based on the relatively complex RDF syntax. RDF mixes the envelope and the data together so that you can't point to any one element in the document and say, "That's the envelope," or "That element is the data." Instead, pieces of both the envelope and the data are intermingled throughout the complete document. In all three cases, however, it's straightforward to extract the data from the envelope for further processing.

Data Representation

Another area ripe for standardization is the proper representation of low-level data such as dates and numbers. Nobody really cares how many bytes there are in an int, as long as there are enough to hold all of the values they want to hold. Nobody really cares whether dates are written Day-Month-Year or Month-Day-Year, as long as it's easy to tell which is which. It doesn't really matter how this information is passed, as long as there's one standard way of doing it that everyone can agree on and process without excessive hassle.

In XML all data of any type must be passed as text, but the proper textual representation of simple data types such as integer and date is trickier than most developers initially assume. For example, integers can be represented straightforwardly in the form 42, -76, +34562, 0, and so forth. The normal base-10 representation with optional plus or minus signs is fully adequate for most needs. However, consider the number 28562476535, the dollar value of Bill Gates' Microsoft stock holdings alone as of July 24, 2002. This is a perfectly good integer, albeit a large one. However, it's larger than 2,147,483,647, the maximum value of a 32-bit signed integer, so trying to use it in many applications will lead to a crash or some other form of error.
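A short Java sketch makes the problem concrete. The IntegerRanges class and its narrowestType method are names I've invented for this illustration. The code shows that the stock figure can't be parsed as an int, though it fits comfortably in a long, and that java.math.BigInteger accepts a value of any size.

```java
import java.math.BigInteger;

class IntegerRanges {

    /** Names the narrowest of int, long, or BigInteger that can hold the decimal string. */
    static String narrowestType(String text) {
        BigInteger value = new BigInteger(text.trim());
        // bitLength() excludes the sign bit, so 31 bits fit an int and 63 fit a long.
        if (value.bitLength() <= 31) return "int";
        if (value.bitLength() <= 63) return "long";
        return "BigInteger";
    }

    public static void main(String[] args) {
        String holdings = "28562476535";
        try {
            Integer.parseInt(holdings);
        } catch (NumberFormatException e) {
            // The value exceeds Integer.MAX_VALUE, so parsing fails.
            System.out.println("Not an int: " + holdings);
        }
        System.out.println("As a long: " + Long.parseLong(holdings));
        System.out.println("Narrowest type: " + narrowestType(holdings));
    }
}
```

Languages that silently wrap or truncate on overflow are arguably worse off than Java here, because the error goes unreported.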

Floating-point numbers are even worse. Two different computers can look at an unambiguous string such as 65431987467.324345192 and interpret it as two different numbers. Dates cause problems even for humans. Is 07/04/01 the Fourth of July, 2001? the Fourth of July, 1901? the seventh of April, 2001? Some other date? These are all very real issues that cause real problems in systems today.
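Both problems are easy to reproduce in Java. In the sketch below, the AmbiguousText class and its methods are my own names, and the two date patterns are simply two plausible guesses at what the sender meant. The same decimal string becomes a different number depending on whether the receiver stores it as a float, a double, or a BigDecimal, and the same date string becomes a different day depending on the pattern the receiver assumes.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class AmbiguousText {

    /** True if the float and double parsed from the text are the same number. */
    static boolean floatAndDoubleAgree(String text) {
        return (double) Float.parseFloat(text) == Double.parseDouble(text);
    }

    /** Parses a date string under an assumed pattern such as MM/dd/yy. */
    static LocalDate parse(String text, String pattern) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern));
    }

    public static void main(String[] args) {
        String number = "65431987467.324345192";
        System.out.println("float:      " + Float.parseFloat(number));
        System.out.println("double:     " + Double.parseDouble(number));
        System.out.println("BigDecimal: " + new BigDecimal(number));
        System.out.println("same value? " + floatAndDoubleAgree(number));

        String date = "07/04/01";
        System.out.println("US reading:       " + parse(date, "MM/dd/yy"));
        System.out.println("European reading: " + parse(date, "dd/MM/yy"));
    }
}
```

Note that java.time resolves a two-digit yy year into the range 2000 to 2099, which is itself just one more convention the sender may not share.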

XML itself doesn't standardize the text representation of data, but the W3C XML Schema Language does. In particular, schemas define the 44 simple data types shown in Table 2.1. By assigning these data types to particular elements, you can clearly state what a particular string means in a syntax everyone can understand. And if these data types aren't enough, the W3C XML Schema Language also lets you define new types that are combinations or restrictions of these basic types.

Table 2.1. Simple Data Types Defined in the W3C XML Schema Language
Data Type Meaning
xsd:string The schema equivalent of #PCDATA, any string of Unicode characters that may appear in an XML document.
xsd:boolean Either true or false; 1 and 0 are accepted as equivalents.
xsd:decimal A decimal number, such as 44.145629 or -0.32, with an arbitrary size and precision; similar to the java.math.BigDecimal class.
xsd:float The four-byte IEEE-754 floating point number that best approximates the specified decimal string; equivalent to Java's float type.
xsd:double The eight-byte IEEE-754 floating point number that best approximates the specified decimal string; equivalent to Java's double type.
xsd:integer An integer of arbitrary size; similar to the java.math.BigInteger class.
xsd:positiveInteger An integer strictly greater than zero.
xsd:nonPositiveInteger An integer less than or equal to zero.
xsd:negativeInteger An integer strictly less than zero.
xsd:nonNegativeInteger An integer greater than or equal to zero.
xsd:long An integer between -9223372036854775808 and +9223372036854775807 inclusive; equivalent to Java's long primitive data type.
xsd:int An integer between -2147483648 and 2147483647 inclusive; equivalent to Java's int primitive data type.
xsd:short An integer between -32768 and 32767 inclusive; equivalent to Java's short primitive data type.
xsd:byte An integer between -128 and 127 inclusive; equivalent to Java's byte primitive data type.
xsd:unsignedLong An integer between 0 and 18446744073709551615.
xsd:unsignedInt An integer between 0 and 4294967295.
xsd:unsignedShort An integer between 0 and 65535.
xsd:unsignedByte An integer between 0 and 255.
xsd:duration A length of time given in the ISO 8601 extended format: PnYnMnDTnHnMnS. The number of seconds can be a decimal or an integer. All other values must be nonnegative integers. For example, P1Y2M3DT4H5M6.7S represents 1 year, 2 months, 3 days, 4 hours, 5 minutes, and 6.7 seconds.
xsd:dateTime A particular moment of time on a particular day up to an arbitrary fraction of a second in the ISO 8601 format: CCYY-MM-DDThh:mm:ss. This can have a Z suffix to indicate Coordinated Universal Time (UTC) or an offset from UTC. For example, Neil Armstrong set foot on the moon at 1969-07-20T21:28:00-05:00 by the clock in Houston mission control, alternately represented as 1969-07-21T02:28:00Z.
xsd:time A certain time of day on no particular day in the ISO 8601 format: hh:mm:ss.sss. A time zone specified as an offset from UTC is optional. For example, on most days I wake up about 07:00:00.000-05:00 and go to bed about 23:30:00.000-05:00.
xsd:date A particular date in history given in ISO 8601 format: YYYY-MM-DD, for example, 2001-07-06 or 1969-09-20.
xsd:gYearMonth A certain month in a certain year, for example, 2001-12 or 1999-03.
xsd:gYear A year in the Gregorian calendar, such as 0001, 2001, 9999, 10000, 10001, and beyond. Earlier years are written -0001, -0002, -0003, and so forth back to the big bang. There is no year zero, however.
xsd:gMonthDay A specific day of a specific month in no particular year, in the form --02-28. For example, Christmas falls on --12-25.
xsd:gDay A particular day of no particular month, in the form ---01, ---02, ---03, through ---31.
xsd:gMonth A particular month in no particular year, in the form --01, --02, --03, through --12.
xsd:hexBinary Hexadecimal encoded binary data; each byte of the data is replaced by the two hexadecimal digits that represent its unsigned value.
xsd:base64Binary Base-64 encoded binary data.
xsd:anyURI An absolute or relative URL or a URN.
xsd:QName An optionally prefixed XML name such as SOAP-ENV:Body or Body. Unprefixed names must be in the default namespace.
xsd:NOTATION The name of a notation declared in the current schema.
xsd:normalizedString A string in which carriage returns (\r), linefeeds (\n) and tab (\t) characters should be treated the same as spaces.
xsd:token A string in which all runs of white space should be treated the same as a single space.
xsd:language An RFC 1766 [http://www.ietf.org/rfc/rfc1766.txt] language identifier such as en, fr-CA, or i-klingon.
xsd:NMTOKEN An XML name token.
xsd:NMTOKENS A white-space-separated list of XML name tokens.
xsd:Name An XML name.
xsd:NCName An XML name that does not contain any colons; that is, an unprefixed name.
xsd:ID An NCName that is unique among all values of type ID in the same document.
xsd:IDREF An NCName used as an ID somewhere in the document.
xsd:IDREFS A white-space-separated list of IDREFs.
xsd:ENTITY An NCName that has been declared as an unparsed entity in the document's DTD.
xsd:ENTITIES A white-space-separated list of ENTITY names.
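Several of these types have direct counterparts in the standard Java library, as the following sketch suggests. The SchemaTypes class and its helper methods are hypothetical names, and the two dateTime strings are illustrative values I chose because they name the same instant in different time zones. Be aware that OffsetDateTime.parse requires a Z or a numeric offset, whereas xsd:dateTime also permits values with no time zone at all, so this is not a complete parser for the type.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.time.OffsetDateTime;
import javax.xml.datatype.DatatypeConfigurationException;
import javax.xml.datatype.DatatypeConstants;
import javax.xml.datatype.DatatypeFactory;
import javax.xml.datatype.Duration;

class SchemaTypes {

    /** True if two xsd:dateTime strings with explicit offsets name the same moment. */
    static boolean sameInstant(String a, String b) {
        return OffsetDateTime.parse(a).toInstant()
            .equals(OffsetDateTime.parse(b).toInstant());
    }

    /** Parses an xsd:duration string such as P1Y2M3DT4H5M6.7S. */
    static Duration duration(String text) {
        try {
            return DatatypeFactory.newInstance().newDuration(text);
        } catch (DatatypeConfigurationException e) {
            throw new IllegalStateException("No DatatypeFactory available", e);
        }
    }

    public static void main(String[] args) {
        // xsd:integer and xsd:decimal have no size limit, like these two classes.
        System.out.println(new BigInteger("28562476535"));
        System.out.println(new BigDecimal("44.145629"));

        // The same instant written with an offset and in UTC.
        System.out.println(
            sameInstant("2001-07-06T12:00:00-05:00", "2001-07-06T17:00:00Z"));

        Duration d = duration("P1Y2M3DT4H5M6.7S");
        System.out.println(d.getYears() + " year, " + d.getMonths() + " months, "
            + d.getField(DatatypeConstants.SECONDS) + " seconds");
    }
}
```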

Even without using schema validation or the full schema apparatus, you can use these data types in your own documents. Simply attach an xsi:type attribute to any element identifying the type of that element's content. The xsi prefix is mapped to the http://www.w3.org/2001/XMLSchema-instance namespace URI. Example 2.1 is an XML document that uses these data types to label different parts of an order document. Notice that some things that naively might be assumed to be numeric types are in fact strings.

Example 2.1 An XML Document That Labels Elements with Schema Simple Types
<?xml version="1.0" encoding="ISO-8859-1"?>
<Order xmlns:xsd="http://www.w3.org/2001/XMLSchema"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Customer id="c32" xsi:type="xsd:string">Chez Fred</Customer>
  <Product>
    <Name xsi:type="xsd:string">Birdsong Clock</Name>
    <SKU xsi:type="xsd:string">244</SKU>
    <Quantity xsi:type="xsd:positiveInteger">12</Quantity>
    <Price currency="USD" xsi:type="xsd:decimal">21.95</Price>
    <ShipTo>
      <Street xsi:type="xsd:string">135 Airline Highway</Street>
      <City xsi:type="xsd:string">Narragansett</City>
      <State xsi:type="xsd:NMTOKEN">RI</State>
      <Zip xsi:type="xsd:string">02882</Zip>
    </ShipTo>
  </Product>
  <Product>
    <Name xsi:type="xsd:string">Brass Ship's Bell</Name>
    <SKU xsi:type="xsd:string">258</SKU>
    <Quantity xsi:type="xsd:positiveInteger">1</Quantity>
    <Price currency="USD" xsi:type="xsd:decimal">144.95</Price>
    <Discount xsi:type="xsd:decimal">.10</Discount>
    <ShipTo>
      <GiftRecipient xsi:type="xsd:string">
        Samuel Johnson
      </GiftRecipient>
      <Street xsi:type="xsd:string">271 Old Homestead Way</Street>
      <City xsi:type="xsd:string">Woonsocket</City>
      <State xsi:type="xsd:NMTOKEN">RI</State>
      <Zip xsi:type="xsd:string">02895</Zip>
    </ShipTo>
    <GiftMessage xsi:type="xsd:string">
      Happy Father's Day to a great Dad!

      Love,
      Sam and Beatrice
    </GiftMessage>
  </Product>
  <Subtotal currency='USD' xsi:type="xsd:decimal">
    393.85
  </Subtotal>
  <Tax rate="7.0"
       currency='USD' xsi:type="xsd:decimal">28.20</Tax>
  <Shipping method="USPS" currency='USD'
            xsi:type="xsd:decimal">8.95</Shipping>
  <Total currency='USD' xsi:type="xsd:decimal">431.00</Total>
</Order>

Besides explicit labeling with xsi:type attributes, a document can rely on a schema to indicate the type of each element. However, right now the APIs for such things aren't finished, so it's best to explicitly label elements when the types are important.
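Reading the explicit labels requires nothing more than a namespace-aware parser. The following is a minimal sketch, assuming DOM and a class I've named TypedContent for the purpose. It handles only three of the types used in Example 2.1 and treats everything else, including unlabeled elements, as a plain string.

```java
import java.io.StringReader;
import java.math.BigDecimal;
import java.math.BigInteger;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TypedContent {

    static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";
    static final String XSD = "http://www.w3.org/2001/XMLSchema";

    /** Converts an element's text according to its xsi:type label, if any. */
    static Object valueOf(Element element) {
        String type = element.getAttributeNS(XSI, "type"); // "" when absent
        String text = element.getTextContent();
        int colon = type.indexOf(':');
        String prefix = (colon < 0) ? null : type.substring(0, colon);
        String local = type.substring(colon + 1);
        // Match on the namespace URI, not on the literal prefix "xsd".
        if (!XSD.equals(element.lookupNamespaceURI(prefix))) return text;
        switch (local) {
            case "decimal":
                return new BigDecimal(text.trim());
            case "integer":
                return new BigInteger(text.trim());
            case "positiveInteger": {
                BigInteger n = new BigInteger(text.trim());
                if (n.signum() <= 0) {
                    throw new IllegalArgumentException("Not positive: " + n);
                }
                return n;
            }
            default:
                return text; // xsd:string and every type this sketch ignores
        }
    }

    /** Parses a document with namespace support and returns its root element. */
    static Element parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Element product = parse(
            "<Product xmlns:xsd='" + XSD + "' xmlns:xsi='" + XSI + "'>"
          + "<SKU xsi:type='xsd:string'>244</SKU>"
          + "<Quantity xsi:type='xsd:positiveInteger'>12</Quantity>"
          + "<Price currency='USD' xsi:type='xsd:decimal'>21.95</Price>"
          + "</Product>");
        for (Node n = product.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                Object value = valueOf((Element) n);
                System.out.println(n.getNodeName() + " = " + value
                    + " (" + value.getClass().getSimpleName() + ")");
            }
        }
    }
}
```

Because the SKU is labeled xsd:string, it stays a String even though it looks like a number. The same treatment preserves the leading zero in a zip code such as 02882.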

XML-RPC uses only the int, boolean, double, dateTime, and base64 types as well as a string type that's restricted to ASCII. Furthermore, it does not allow the NaN, INF, and -INF values for double. It does not use xsi:type attributes, relying instead on predefined semantics for particular elements. SOAP allows all 44 types and does use xsi:type attributes to label elements.
