MySQL Cookbook Free Open Book

MySQL Cookbook

Previous Section Next Section

10.38 A LOAD DATA Diagnostic Utility

10.38.1 Problem

LOAD DATA or mysqlimport indicates a nonzero warning count when you load a datafile into MySQL, but you have no idea which rows or columns were problematic.

10.38.2 Solution

Run the file through a utility that diagnoses which data values caused the warnings.

10.38.3 Discussion

As a bulk loader, LOAD DATA is very efficient; it can run many times faster than a set of INSERT statements that adds the same rows. However, LOAD DATA also is not very informative. It returns only a message that indicates the number of records processed, and a few other status counts. For example, in the previous section, we generated a datafile managers.txt to use with guess_table.pl for guessing the structure of the baseball1.com managers table. If you create that table using the resulting CREATE TABLE statement and then load the datafile into it, you will observe the following result:

mysql> LOAD DATA LOCAL INFILE 'managers.txt' INTO TABLE managers
    -> IGNORE 1 LINES;
Query OK, 2841 rows affected (0.06 sec)
Records: 2841  Deleted: 0  Skipped: 0  Warnings: 5082

Evidently, there were a quite a few problems with the file. Unfortunately, the message produced by LOAD DATA doesn't tell you anything about which rows and columns caused them. The mysqlimport program is similarly terse, because its message is the same as the one returned by LOAD DATA.

We'll revisit this example at the end of the section, but first consider LOAD DATA's output style. On the one hand, the minimal-report approach is the right one to take. If warning information were to be returned to the client, it potentially could include a diagnostic message for each input row, or even for each column! This might be overwhelming and certainly would entirely defeat the high-efficiency nature of LOAD DATA. On the other hand, more information about the source of errors could be useful for fixing the file to eliminate the warnings.

It's on the MySQL development "to do" list to allow LOAD DATA errors to be logged to another table so that you can get extended diagnostic information. In the meantime, you can use the load_diag.pl utility included in the transfer directory of the recipes distribution. load_diag.pl is useful for "pre-flighting" a datafile to get an idea of how well the file will load into the table you intend it for, and to pinpoint problems so that you can clean up the file before loading it into MySQL "for real."

load_diag.pl also can help you identify patterns of problems for situations in which it may be beneficial to write a preprocessing filter. Suppose you periodically receive files containing data to be loaded into a given MySQL table. The more frequently this occurs, the more highly motivated you'll be to automate as much of the data transfer process as possible. This may involve writing a filter to convert data values from the format in which you receive them to a format more appropriate for MySQL. Running the datafiles through load_diag.pl can help you assess which columns tend to be problematic and thereby assist you in determining where to concentrate your efforts in creating a transformation program for rewriting the files so they will load cleanly into MySQL.

To run load_diag.pl, specify the name of the database and table you intend to load the datafile into, as well as the name of the file itself:

% load_diag.pl  db_name  tbl_name  file_name 

load_diag.pl won't actually load anything into the table named on the command line, but it needs to know what the table is so that it can create a temporary table that has the same column structure to use for testing.

Initially, load_diag.pl loads the entire datafile into the temporary table to see if there are any warnings. If not, there's nothing else to do, so load_diag.pl drops the temporary table and exits. Otherwise, it loads each line of the datafile into the table individually to determine which lines caused problems, using the following procedure:

  • It writes the line to a temporary file and issues a LOAD DATA statement to load the file into the table. If the warning count is zero, the line is assumed to be okay.

  • If the warning count for the line is nonzero, load_diag.pl examines each of its columns in turn by using a series of single-column LOAD DATA statements to find out which ones generate warnings.

  • If a column-specific warning occurs and the data value is empty, load_diag.pl determines whether the warning goes away by loading a NULL value instead. It does this because if a datafile contains empty values, you can often get better results by loading NULL than by loading empty strings. (For example, if you load an empty string into an INT column, MySQL converts the value to 0 and issues a warning.) If a datafile turns out to have a significant reduction in warnings when loading NULL rather than empty strings, you may find it useful to run the file through to_null.pl before loading it.

  • It's also possible for warnings to occur if a line contains fewer or more columns than the number of columns in the table, so load_diag.pl checks that, too.

load_diag.pl prints diagnostic information about its findings while testing each input line, then prints a summary report after the entire file has been processed. The report indicates the number of lines in the file, how many warnings the initial full-file load caused, and the number of lines that had too few or too many columns. The report also includes a list that shows for each column how many values were missing, the number of warnings that occurred, how many of those warnings occurred for empty values, and the number of empty-value warnings that went away by loading NULL instead.

As you might guess, all this activity means that load_diag.pl isn't nearly as efficient as LOAD DATA. In fact, it has the potential to exercise your server rather heavily! But its goal is to provide maximal information, not minimal execution time. (Note too that if your MySQL server has logging enabled, using load_diag.pl with large datafiles can cause the logs to grow quickly.)

To see how load_diag.pl works, assume you have a simple table named diag_test that contains string, date, and number columns:

CREATE TABLE diag_test
(
    str     CHAR(10),
    date    DATE,
    num     INT
);

Assume you also have a datafile named diag_sample.dat that you plan to load into the table:

str1    01-20-2001      97
str2    02-28-2002
        03-01-2002      64      extra junk

To see if the file will have any problems loading, check it like this:

% load_diag.pl cookbook diag_test diag_sample.dat
line 1: 1 warning
  column 2 (date): bad value = (01-20-2001)
line 2: 2 warnings
  too few columns
  column 2 (date): bad value = (02-28-2002)
  column 3 (num): missing from input line
  column 3 (num): bad value = ( ) (inserting NULL worked better)
line 3: 1 warning
  excess number of columns

Number of lines in file: 3
Warnings found when loading entire file: 4
Lines containing too few column values: 1
Lines containing excess column values: 1

Warnings per column:

Column    Times      Total     Warnings for     Improved
         missing    warnings   empty columns    with NULL
str            0           0               0            0
date           0           2               0            0
num            1           1               1            1

It appears that the dates don't load very well. That's not surprising, because they appear to be in U.S. format and should be rewritten in ISO format. Converting empty fields to \N may also be beneficial, and you can get rid of the extra column value in line 3. Using some of the utilities developed earlier in this chapter, perform all those transformations, writing the result to a temporary file:

% yank_col.pl --columns=1-3 diag_sample.dat \
    | cvt_date.pl --iformat=us --oformat=iso \
    | to_null.pl > tmp

The tmp file produced by that command looks like this:

str1    2001-01-20      97
str2    2002-02-28      \N
\N      2002-03-01      64

Using load_diag.pl to check the new file produces the following result:

% load_diag.pl cookbook diag_test tmp
File loaded with no warnings, no per-record tests performed

This indicates that if you load tmp into the diag_test table, you should get good results, and indeed that is true:

mysql> LOAD DATA LOCAL INFILE 'tmp' INTO TABLE diag_test;
Records: 3  Deleted: 0  Skipped: 0  Warnings: 0

Clearly, that's a lot of messing around just to make a three-line file load into MySQL better. But the point of the example is to illustrate that the feedback load_diag.pl provides can help you figure out what's wrong with a datafile so that you can clean it up.

In addition to the required arguments that name the database, table, and datafile, load_diag.pl understands several options:

--columns= name1,name2,name3,...

By default, load_diag.pl assumes the datafile contains columns that correspond in number and order to the columns in the table. If that is not true, use this option to specify the names of the columns that are present in the file, and in what order.

--labels

This option indicates that the datafile contains an initial row of labels that should be skipped. (Loading labels into a table typically results in spurious warnings.)

--skip-full-load

Skip the initial test that loads the entire datafile.

--tmp-table= tbl_name

Specify the name to use for the temporary table. The default is _load_diag_n, where n is load_diag.pl's process ID.

If necessary, you can also specify standard connection parameter options like --user or --host. Any options must precede the database name argument.

Use of load_diag.pl is subject to the following constraints and limitations:

  • The input must be in tab-delimited, linefeed-terminated format.

  • Record loading is performed with the LOCAL option of the LOAD DATA statement. LOCAL capability requires MySQL 3.22.15 or higher (and, as of 3.23.49, requires that your MySQL distribution not have been built with that capability disabled).

  • When load_diag.pl creates the temporary table, it omits any indexes that are present in the original table. This results in faster record loading time (particularly for the initial test that loads the entire datafile). On the other hand, not using indexes means that load_diag.pl won't find warnings that result from duplicate key values on unique indexes.

Returning to the example with which this section began, what about all those warnings that resulted from loading the managers.txt file into the managers table? load_diag.pl identifies them all as being due to missing or empty columns at the end of some of the lines:

% load_diag.pl --labels cookbook managers managers.txt
line 2: 2 warnings
  column 14 (postwins): bad value = ( ) (inserting NULL worked better)
  column 15 (postlosses): bad value = ( ) (inserting NULL worked better)
line 3: 2 warnings
  column 14 (postwins): bad value = ( ) (inserting NULL worked better)
  column 15 (postlosses): bad value = ( ) (inserting NULL worked better)
...
line 2839: 2 warnings
  column 14 (postwins): bad value = ( ) (inserting NULL worked better)
  column 15 (postlosses): bad value = ( ) (inserting NULL worked better)
line 2842: 2 warnings
  column 14 (postwins): bad value = ( ) (inserting NULL worked better)
  column 15 (postlosses): bad value = ( ) (inserting NULL worked better)

Number of lines in file: 2842
Warnings found when loading entire file: 5082
Lines containing too few column values: 416
Lines containing excess column values: 0

Warnings per column:

Column        Times      Total     Warnings for     Improved
             missing    warnings   empty columns    with NULL
lahmanid           0           0               0            0
year               0           0               0            0
team               0           0               0            0
lg                 0           0               0            0
div                0           0               0            0
g                  0           0               0            0
w                  0           0               0            0
l                  0           0               0            0
pct                0           0               0            0
std                0           0               0            0
half               0           0               0            0
mgrorder           0           0               0            0
plyrmgr           16           0               0            0
postwins         416        2533            2533         2533
postlosses       416        2533            2533         2533

From this result, we can determine that 416 lines were missing the postwins and postlosses columns (and 16 of those were missing the plyrmgr column as well). The remaining errors were due to lines for which the postwins and postlosses columns were present but empty. The entire-file warning count of 5082 can be accounted for as the the number of plyrmgr values that were missing, plus the total warnings from the postwins and postlosses columns (16+2533+2533 = 5082).

The Total warnings value for the plyrmgr column is zero because it's a CHAR column, and thus loading empty values into it is legal. The Total warnings value for postwins and postlosses is nonzero because they are INT columns and loading empty values into them result in a conversion-to-zero operations. All of these problems are of the sort that can be made to go away by converting empty or missing values to \N. Run the file through yank_col.pl to force each line to have 15 columns, and run the result through to_null.pl to convert empty values to \N:

% yank_col.pl --columns=1-15 managers.txt | to_null.pl > tmp

Then see what load_diag.pl has to say about the resulting file:

% load_diag.pl --labels cookbook managers tmp
File loaded with no warnings, no per-record tests performed

If you load tmp into the managers table, no problems should occur:

mysql> LOAD DATA LOCAL INFILE 'tmp' INTO TABLE managers IGNORE 1 LINES;
Query OK, 2841 rows affected (0.13 sec)
Records: 2841  Deleted: 0  Skipped: 0  Warnings: 0
    Previous Section Next Section
    Index: [SYMBOL][A][B][C][D][E][F][G][H][I][J][K][L][M][N][O][P][Q][R][S][T][U][V][W][X][Y][Z]


         Main Menu
    Main Page
    Table of content
    Copyright
    Preface
    Chapter 1. Using the mysql Client Program
    Chapter 2. Writing MySQL-Based Programs
    Chapter 3. Record Selection Techniques
    Chapter 4. Working with Strings
    Chapter 5. Working with Dates and Times
    Chapter 6. Sorting Query Results
    Chapter 7. Generating Summaries
    Chapter 8. Modifying Tables with ALTER TABLE
    Chapter 9. Obtaining and Using Metadata
    Chapter 10. Importing and Exporting Data
    10.1 Introduction
    10.2 Importing Data with LOAD DATA and mysqlimport
    10.3 Specifying the Datafile Location
    10.4 Specifying the Datafile Format
    10.5 Dealing with Quotes and Special Characters
    10.6 Importing CSV Files
    10.7 Reading Files from Different Operating Systems
    10.8 Handling Duplicate Index Values
    10.9 Getting LOAD DATA to Cough Up More Information
    10.10 Don't Assume LOAD DATA Knows More than It Does
    10.11 Skipping Datafile Lines
    10.12 Specifying Input Column Order
    10.13 Skipping Datafile Columns
    10.14 Exporting Query Results from MySQL
    10.15 Exporting Tables as Raw Data
    10.16 Exporting Table Contents or Definitions in SQL Format
    10.17 Copying Tables or Databases to Another Server
    10.18 Writing Your Own Export Programs
    10.19 Converting Datafiles from One Format to Another
    10.20 Extracting and Rearranging Datafile Columns
    10.21 Validating and Transforming Data
    10.22 Validation by Direct Comparison
    10.23 Validation by Pattern Matching
    10.24 Using Patterns to Match Broad Content Types
    10.25 Using Patterns to Match Numeric Values
    10.26 Using Patterns to Match Dates or Times
    10.27 Using Patterns to Match Email Addresses and URLs
    10.28 Validation Using Table Metadata
    10.29 Validation Using a Lookup Table
    10.30 Converting Two-Digit Year Values to Four-Digit Form
    10.31 Performing Validity Checking on Date or Time Subparts
    10.32 Writing Date-Processing Utilities
    10.33 Using Dates with Missing Components
    10.34 Performing Date Conversion Using SQL
    10.35 Using Temporary Tables for Data Transformation
    10.36 Dealing with NULL Values
    10.37 Guessing Table Structure from a Datafile
    10.38 A LOAD DATA Diagnostic Utility
    10.39 Exchanging Data Between MySQL and Microsoft Access
    10.40 Exchanging Data Between MySQL and Microsoft Excel
    10.41 Exchanging Data Between MySQL and FileMaker Pro
    10.42 Exporting Query Results as XML
    10.43 Importing XML into MySQL
    10.44 Epilog
    Chapter 11. Generating and Using Sequences
    Chapter 12. Using Multiple Tables
    Chapter 13. Statistical Techniques
    Chapter 14. Handling Duplicates
    Chapter 15. Performing Transactions
    Chapter 16. Introduction to MySQL on the Web
    Chapter 17. Incorporating Query Resultsinto Web Pages
    Chapter 18. Processing Web Input with MySQL
    Chapter 19. Using MySQL-Based Web Session Management
    Appendix A. Obtaining MySQL Software
    Appendix B. JSP and Tomcat Primer
    Appendix C. References
    Colophone
    Index


    More Books
    PHP Hacks
    Processing Xml With Java - A Guide To Sax, Dom, Jdom, Jaxp, And Trax
    The Koran (Holy Qur'an)
    Macromedia Flash 8 Bible
    Search Engine Optimization for Dummies
    YouTube Traffic
    PHP 5 for Dummies
    Harry Potter and The Chamber of Secrets
    Harry Potter and the Sorcerer's Stone
    The Pilgrim's Progress
    Wireless Hacks
    Flash Hacks. 100 Industrial-Strength Tips & Tools
    PayPal Hacks. 100 Industrial-Strength Tips and Tools
    Amazon Hacks
    Pdf Hacks
    The Da Vinci Code
    Google Hacks
    The Holy Bible
    Windows XP For Dummies
    Harry Potter and the Half-Blood Prince
    Seo Book
    Upgrading and Repairing Networks
    Macromedia Dreamweaver 8 UNLEASHED
    Windows XP Annoyances
    Windows XP Hacks
    Microsoft Windows XP Power Toolkit
    Teach Yourself MS Office In 24Hours
    iPod & iTunes Missing Manual
    PC Hacks 100 Industrial-Strength Tips and Tools
    PC Overclocking, Optimization, and Tuning - 2th Edition
    PC Hardware In A Nutshell 3rd Edition
    PC Hardware in a Nutshell, 2nd Edition
    Upgrading and Repairing PCs
    Google for Dummies
    MySQL Cookbook
    Teach Yourself Macromedia Flash 8 In 24 Hours
    PHP CookBook
    Sams Teach Yourself JavaScript in 24 Hours
    PHP5 Manual
    Free Games Paper Airplanes
    500 Juegos Gratis 500 Giochi Gratis 500 Jeux Gratuits 500 Jogos Gratis 500 Kostenlose Spiele