Processing XML with Java: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

Input

By far the hardest part of this or any similar problem is parsing the non-XML input data. Everything else pales by comparison. Unlike parsing XML, you generally cannot rely on a library to do the hard work for you. You have to do it yourself. And also unlike XML, there's little guarantee that the data is well-formed. More likely than not, you will encounter incorrectly formatted data.

In this case, because the records are separated into lines, I'll read each line, one at a time, using the readLine() method of java.io.BufferedReader. This method works well enough as long as the data is in a file, although it's potentially buggy when the data is served over a network socket.

Each line is dissected into its component fields inside the splitLine() method. Each record is stored in its own map. The keys for the map are read from a constant array, because the fields are always in the same position in each record.

Caution

For parsing the data out of each line, a lot of Java developers immediately reach for the java.util.StringTokenizer or java.io.StreamTokenizer classes. Don't. These classes are very strangely designed and rarely do what developers expect them to do. For example, if StreamTokenizer encounters a \n inside a string literal, it will convert it to a linefeed. This makes sense when parsing Java source code, but in most other environments \n is just another two characters with no special meaning. Java's tokenizer classes are designed for and suited to parsing Java source code. They are not suitable for reading tab- or comma-delimited data. If you want to design your program around a tokenization function, you should write one yourself that behaves appropriately for your data format.
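
As a rough illustration of the kind of surprise described above, the following sketch feeds two made-up comma-delimited lines to java.util.StringTokenizer. The class name TokenizerPitfall, the tokenize() helper, and the sample data are invented for this illustration and are not part of the budget code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerPitfall {

  // Collects the tokens StringTokenizer reports for a comma-delimited line.
  // Hypothetical helper, used only to make the behavior visible.
  public static List<String> tokenize(String line) {
    List<String> tokens = new ArrayList<String>();
    StringTokenizer st = new StringTokenizer(line, ",");
    while (st.hasMoreTokens()) {
      tokens.add(st.nextToken());
    }
    return tokens;
  }

  public static void main(String[] args) {
    // Three fields, the middle one empty: only two tokens come back,
    // because adjacent delimiters are treated as a single separator
    System.out.println(tokenize("184,,Department of State"));
    // A quoted field containing a comma is split in two,
    // because StringTokenizer knows nothing about quotes
    System.out.println(tokenize("\"Salaries, Expenses\",1976"));
  }

}
```

In both cases the number of tokens no longer matches the number of fields, so a positional mapping from tokens to field names would silently shift.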


Example 4.1 shows the input code. To use this, open an input stream to the file containing the budget data and pass that stream as an argument to the parse() method. You'll get back a List containing the parsed data. Each object in this list is a Map containing the data for one line item. Both keys and values for this map are strings. Because the keys are constant, they're stored in a final static array named keys. At various times I plan to use the keys as XML element names, XML attribute names, or SQL field names. Therefore, each of them must begin with a letter. Thus the keys for the fiscal year fields will be named FY1976, FY1977, FY1978, and so forth instead of just 1976, 1977, 1978, and so forth. This means the keys can't simply be stored as ints. However, that would not have been possible anyway because one of the year fields is the transitional quarter in 1976, which does not represent a full year and does not have a numeric name.
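
The naming rule for the fiscal year keys can be sketched in a few lines. This is only an illustration: the class FiscalYearKeys and its two methods are hypothetical, and the letter check is a simplification of the real XML name rules, which permit other start characters as well.

```java
import java.util.ArrayList;
import java.util.List;

public class FiscalYearKeys {

  // Hypothetical helper: builds the fiscal year keys in column order,
  // placing the transitional quarter right after fiscal year 1976
  public static List<String> yearKeys(int firstYear, int lastYear) {
    List<String> keys = new ArrayList<String>();
    for (int year = firstYear; year <= lastYear; year++) {
      keys.add("FY" + year);
      if (year == 1976) keys.add("TransitionQuarter");
    }
    return keys;
  }

  // Simplified check that a key begins with a letter
  public static boolean startsWithLetter(String key) {
    return key.length() > 0 && Character.isLetter(key.charAt(0));
  }

  public static void main(String[] args) {
    System.out.println(yearKeys(1976, 1979));
    System.out.println(startsWithLetter("1976"));   // a bare year fails the check
    System.out.println(startsWithLetter("FY1976")); // the prefixed key passes
  }

}
```

The transitional quarter is the one entry that cannot be derived from a year number, which is why the keys have to be strings in any case.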

Caution

In 1976 the government's fiscal year shifted forward one quarter. As a result, the 1977 fiscal year started in October, a quarter after the 1976 fiscal year ended. There was a transitional quarter from July through September that year; therefore, some of the data actually represents less than a whole year. Here, the special case is very much a result of the data itself. Thus the data can't be fixed but still requires extra code, making the examples less clean than they otherwise would be.

This sort of funky data (a year with only three months in it that can easily be confused with another year) is exactly the sort of thing you have to watch out for when processing legacy data. The real world does not always fit into neatly typed categories. There's almost always some outlier data that just doesn't fit the schema. All too often it's been forced into the existing system by some manager or data entry clerk in ways the original designers never intended. This happens all the time. You cannot assume the data actually adheres to its schema, either implicit or explicit.


The code to parse each line of input is hidden inside the private splitLine() method. This code is relatively complex. It iterates through the record looking for comma delimiter characters but has to ignore commas that appear inside quoted strings. Furthermore, it must recognize that the end of the string delimits the last token. Even so, this method is not very robust. It will throw an uncaught exception if any quotes are omitted, or if there are too few fields. It will not notice and report the error if a record contains too many fields.

Example 4.1 A Class That Parses Comma-Separated Values into a List of HashMaps
import java.io.*;
import java.util.*;


public class BudgetData {

  public static List parse(InputStream src) throws IOException {

    // The document as published by the OMB is encoded in Latin-1
    InputStreamReader isr = new InputStreamReader(src, "8859_1");
    BufferedReader in = new BufferedReader(isr);
    List records = new ArrayList();
    String lineItem;
    while ((lineItem = in.readLine()) != null) {
      records.add(splitLine(lineItem));
    }
    return records;

  }

  // the field names in order
  public final static String[] keys = {
    "AgencyCode",
    "AgencyName",
    "BureauCode",
    "BureauName",
    "AccountCode",
    "AccountName",
    "TreasuryAgencyCode",
    "SubfunctionCode",
    "SubfunctionTitle",
    "BEACategory",
    "On-Off-BudgetIndicator",
    "FY1976", "TransitionQuarter", "FY1977", "FY1978", "FY1979",
    "FY1980", "FY1981", "FY1982", "FY1983", "FY1984", "FY1985",
    "FY1986", "FY1987", "FY1988", "FY1989", "FY1990", "FY1991",
    "FY1992", "FY1993", "FY1994", "FY1995", "FY1996", "FY1997",
    "FY1998", "FY1999", "FY2000", "FY2001", "FY2002", "FY2003",
    "FY2004", "FY2005", "FY2006"
   };

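  private static Map splitLine(String record) {

    record = record.trim();

    // index is the current position in the record; it carries over
    // from one field to the next
    int index = 0;
    Map result = new HashMap();
    for (int i = 0; i < keys.length; i++) {
      // accumulate characters up to the next comma outside quotes
      StringBuffer sb = new StringBuffer();
      char c;
      boolean inString = false;
      while (true) {
        // throws StringIndexOutOfBoundsException if the record
        // has too few fields
        c = record.charAt(index);
        if (!inString && c == '"') inString = true;       // opening quote
        else if (inString && c == '"') inString = false;  // closing quote
        else if (!inString && c == ',') break;            // end of field
        else sb.append(c);
        index++;
        // the end of the record delimits the last field
        if (index == record.length()) break;
      }
      String s = sb.toString().trim();
      result.put(keys[i], s);
      index++; // skip the comma
    }

    return result;

  }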

}
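
The limitations noted for splitLine() (no report of missing quotes, too few fields, or too many fields) could be addressed along the following lines. This is a minimal sketch, not the code used in the rest of the chapter: the class StrictSplitter, its split() method, and the expectedFields parameter are assumptions made for illustration, and like splitLine() it does not handle quote characters escaped inside a quoted string.

```java
import java.util.ArrayList;
import java.util.List;

public class StrictSplitter {

  // Splits one comma-delimited record into exactly expectedFields values.
  // Commas inside double-quoted strings are treated as data.
  // Throws IllegalArgumentException for an unterminated quote or a
  // wrong number of fields.
  public static List<String> split(String record, int expectedFields) {
    List<String> fields = new ArrayList<String>();
    StringBuilder field = new StringBuilder();
    boolean inString = false;
    for (int i = 0; i < record.length(); i++) {
      char c = record.charAt(i);
      if (c == '"') {
        inString = !inString;            // entering or leaving a quoted string
      } else if (c == ',' && !inString) {
        fields.add(field.toString().trim());
        field.setLength(0);              // start the next field
      } else {
        field.append(c);
      }
    }
    if (inString) {
      throw new IllegalArgumentException("Unterminated quote in: " + record);
    }
    // the end of the record delimits the last field
    fields.add(field.toString().trim());
    if (fields.size() != expectedFields) {
      throw new IllegalArgumentException("Expected " + expectedFields
          + " fields but found " + fields.size());
    }
    return fields;
  }

}
```

A caller could pass keys.length as expectedFields and pair each returned value with the key at the same position. Collecting all the fields before checking the count is what lets this version notice extra fields as well as missing ones.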