MySQL Cookbook Free Open Book

MySQL Cookbook

Previous Section Next Section

10.37 Guessing Table Structure from a Datafile

10.37.1 Problem

Someone gives you a datafile and says, "Here, put this into MySQL for me." But no table yet exists to hold the data.

10.37.2 Solution

Write the CREATE TABLE statement yourself. Or use a utility that guesses the table structure by examining the contents of the datafile.

10.37.3 Discussion

Sometimes you need to import data into MySQL for which no table has yet been set up. You can create the table yourself, based on any knowledge you might have about the contents of the file. Or you may be able to avoid some of the work by using guess_table.pl, a utility located in the transfer directory of the recipes distribution. guess_table.pl reads the datafile to see what kind of information it contains, then attempts to produce an appropriate CREATE TABLE statement that matches the contents of the file. This script is necessarily imperfect, because column contents sometimes are ambiguous. (For example, a column containing a small number of distinct strings might be a CHAR column or an ENUM.) Still, it's often easier to tweak the statement that guess_table.pl produces than to write the entire statement from scratch. This utility also has a diagnostic function, though that's not its primary purpose. For example, you might believe a column contains only numbers, but if guess_table.pl indicates that it should be created using a CHAR type, that tells you the column contains at least one non-numeric value.

guess_table.pl assumes that its input is in tab-delimited, linefeed-terminated format. It also assumes valid input, because any attempt to guess column types based on possibly flawed data is doomed to failure. This means, for example, that if a date column is to be recognized as such, it should be in ISO format. Otherwise, guess_table.pl may characterize it as a CHAR column. If a datafile doesn't satisfy these assumptions, you may be able to reformat it first using the cvt_file.pl and cvt_date.pl utilities described in Recipe 10.19 and Recipe 10.32.

guess_table.pl understands the following options:

--labels

Interpret the first input line as a row of column labels and use them for table column names. If this option is omitted, guess_table.pl uses default column names of c1, c2, and so forth. Note that if the file contains a row of labels and you neglect to specify this option, the labels will be treated as data values by guess_table.pl. The likely result is that the script will mischaracterize any numeric column as a CHAR column, due to the presence of a non-numeric value in the column.

--lower, --upper

Force column names in the CREATE TABLE statement to be lowercase or uppercase.

--quote-names

Quote table and column names in the CREATE TABLE statement with ` characters (for example, `mytbl`). This can be useful if a name is a reserved word. The resulting statement requires MySQL 3.23.6, because quoted names are not understood by earlier versions.

--report

Generate a report rather than a CREATE TABLE statement. The script displays the information that it gathered about each column.

--tbl_name= tbl_name

Specify the table name to use in the CREATE TABLE statement. The default name is t.

Here's an example of how guess_table.pl works, using the managers.csv file from the CSV version of the baseball1.com baseball database distribution. This file contains records for team managers. It begins with a row of column labels, followed by rows containing data values:

LahmanID,Year,Team,Lg,DIV,G,W,L,Pct,Std,Half,Order,PlyrMgr,PostWins,PostLosses
cravebi01,1871,TRO,NA,,25,12,12,0.5,6,0,2,,,
deaneha01,1871,KEK,NA,,5,2,3,0.4,8,0,2,,,
hastisc01,1871,ROK,NA,,25,4,21,0.16,9,0,0,,,
paborch01,1871,CLE,NA,,29,10,19,0.345,7,0,0,,,
wrighha01,1871,BOS,NA,,31,20,10,0.667,3,0,0,,,
youngni99,1871,OLY,NA,,32,15,15,0.5,5,0,0,,,
clappjo01,1872,MAN,NA,,24,5,19,0.208,8,0,0,,,
clintji01,1872,ECK,NA,,11,0,11,0,9,0,1,,,
fergubo01,1872,BRA,NA,,37,9,28,0.243,6,0,0,,,
...

The first row indicates the column labels, and the following rows contain data records, one per line. guess_table.pl requires input in tab-delimited, linefeed-terminated format, so to work with the managers.csv file, first convert it using cvt_file.pl, writing the result to a temporary file, managers.txt:

% cvt_file.pl --iformat=csv --ieol="\r\n" managers.csv > managers.txt

Then run the temporary file through guess_table.pl (the command shown here uses --lower because I prefer lowercase column names):

% guess_table.pl --table=managers --labels --lower managers.txt > managers.sql

The CREATE TABLE statement that guess_table.pl writes to managers.sql looks like this:

CREATE TABLE managers
(
    lahmanid CHAR(9) NOT NULL,
    year INT UNSIGNED NOT NULL,
    team CHAR(3) NOT NULL,
    lg CHAR(2) NOT NULL,
    div CHAR(1) NULL,
    g INT UNSIGNED NOT NULL,
    w INT UNSIGNED NOT NULL,
    l INT UNSIGNED NOT NULL,
    pct FLOAT NOT NULL,
    std INT UNSIGNED NOT NULL,
    half INT UNSIGNED NOT NULL,
    order INT UNSIGNED NOT NULL,
    plyrmgr CHAR(1) NULL,
    postwins INT UNSIGNED NULL,
    postlosses INT UNSIGNED NULL
);

guess_table.pl produces that statement based on deductions such as the following:

  • If a column contains only integer values, it's assumed to be an INT. If none of the values are negative, the column is likely to be UNSIGNED as well.

  • If a column contains no empty values, guess_table.pl assumes that it's probably NOT NULL.

  • Columns that cannot be classified as numbers or dates are taken to be CHAR columns, with a length equal to the longest value present in the column.

You might want to edit the CREATE TABLE statement that guess_table.pl produces, to make modifications such as increasing the size of character fields, changing CHAR to VARCHAR, or adding indexes. Another reason to edit the statement is that if a column has a name that is a reserved word in MySQL, you can change it to a different name. For example, the managers table definition created by guess_table.pl contains a column named order, which is a reserved keyword. The column represents the order of the manager during the season (in case a team had more than one manager), so a reasonable alternative name is mgrorder. After editing the statement in the managers.sql file to make that change, execute it to create the table:

% mysql cookbook < managers.sql

Then you can load the datafile into the table (skipping the initial row of labels):

mysql> LOAD DATA LOCAL INFILE 'managers.txt' INTO TABLE managers
    -> IGNORE 1 LINES;

(When you do this, you'll notice that LOAD DATA reports some warnings. These are investigated further in Recipe 10.38.)

The baseball1.com database also is available in Access format. The Access database contains explicit information about the structure of the managers table, and this information is available to utilities like DBTools and MySQLFront that can use it to create the MySQL table for you. (See Recipe 10.39 for information about these programs.) This affords us the opportunity to see how well guess_table.pl guesses the table structure using only the datafile, compared to programs that have more information available to them.

One problem with utilities like DBTools or MySQLFront is that if an Access table column has a name that is a reserved word, you cannot import it into MySQL without changing the Access table to use a different column name. This is the case for the Order column in the managers table. With guess_table.pl, that wasn't a problem, because you can just edit the CREATE TABLE statement that it produces to change the name to something legal.[7] However, to deal with the Order column in the managers table for purposes of DBTools or MySQLFront, you should change the Access database itself to rename the column (for example, to MgrOrder.)

[7] Another approach is to use the --quote-names option when you run guess_table.pl. That allows you to create the table without changing the column name, although then you must put the name within backticks whether you refer to it.

The managers table structure produced by DBTools looks like this:

CREATE TABLE managers (
  LahmanID char(9) NOT NULL default '',
  Year int(11) default NULL,
  Team char(3) default NULL,
  Lg char(2) default NULL,
  Div char(2) default NULL,
  G int(11) default NULL,
  W int(11) default NULL,
  L int(11) default NULL,
  Pct double default NULL,
  Std int(11) default NULL,
  Half int(11) default NULL,
  MgrOrder int(11) default NULL,
  PlyrMgr char(1) default NULL,
  PostWins int(11) default NULL,
  PostLosses int(11) default NULL,
  KEY LahmanID (LahmanID)
);

MySQLFront creates the table like this:

CREATE TABLE managers (
  LahmanID longtext,
  Year int(11) default NULL,
  Team longtext,
  Lg longtext,
  Div longtext,
  G int(11) default NULL,
  W int(11) default NULL,
  L int(11) default NULL,
  Pct float default NULL,
  Std int(11) default NULL,
  Half int(11) default NULL,
  MgrOrder int(11) default NULL,
  PlyrMgr longtext,
  PostWins int(11) default NULL,
  PostLosses int(11) default NULL
);

Of the three programs, DBTools does the best job of determining the structure of the MySQL table. It uses the index information present in the Access file to write the KEY definition, and to create string columns with the proper lengths. MySQLFront doesn't produce the key definition and it defines strings as LONGTEXT columns—even the PlyrMgr column, which never contains a value longer than one character. The quality of the output produced by guess_table.pl appears to be somewhere in between. It doesn't write the key definition, but neither does it write every string column as the longest possible type. (On the other hand, the column lengths are somewhat conservative.) All in all, that's not bad, considering that guess_table.pl doesn't have available to it all the information contained in the original Access file. And you can use it on a cross-platform basis.

These results indicate that if you're using Windows and your records are stored in an Access file, you're probably best off letting DBTools create your MySQL tables for you. In other situations (such as when you're running under Unix or your datafile comes from a source other than Access), guess_table.pl can be beneficial.

    Previous Section Next Section
    Index: [SYMBOL][A][B][C][D][E][F][G][H][I][J][K][L][M][N][O][P][Q][R][S][T][U][V][W][X][Y][Z]


         Main Menu
    Main Page
    Table of content
    Copyright
    Preface
    Chapter 1. Using the mysql Client Program
    Chapter 2. Writing MySQL-Based Programs
    Chapter 3. Record Selection Techniques
    Chapter 4. Working with Strings
    Chapter 5. Working with Dates and Times
    Chapter 6. Sorting Query Results
    Chapter 7. Generating Summaries
    Chapter 8. Modifying Tables with ALTER TABLE
    Chapter 9. Obtaining and Using Metadata
    Chapter 10. Importing and Exporting Data
    10.1 Introduction
    10.2 Importing Data with LOAD DATA and mysqlimport
    10.3 Specifying the Datafile Location
    10.4 Specifying the Datafile Format
    10.5 Dealing with Quotes and Special Characters
    10.6 Importing CSV Files
    10.7 Reading Files from Different Operating Systems
    10.8 Handling Duplicate Index Values
    10.9 Getting LOAD DATA to Cough Up More Information
    10.10 Don't Assume LOAD DATA Knows More than It Does
    10.11 Skipping Datafile Lines
    10.12 Specifying Input Column Order
    10.13 Skipping Datafile Columns
    10.14 Exporting Query Results from MySQL
    10.15 Exporting Tables as Raw Data
    10.16 Exporting Table Contents or Definitions in SQL Format
    10.17 Copying Tables or Databases to Another Server
    10.18 Writing Your Own Export Programs
    10.19 Converting Datafiles from One Format to Another
    10.20 Extracting and Rearranging Datafile Columns
    10.21 Validating and Transforming Data
    10.22 Validation by Direct Comparison
    10.23 Validation by Pattern Matching
    10.24 Using Patterns to Match Broad Content Types
    10.25 Using Patterns to Match Numeric Values
    10.26 Using Patterns to Match Dates or Times
    10.27 Using Patterns to Match Email Addresses and URLs
    10.28 Validation Using Table Metadata
    10.29 Validation Using a Lookup Table
    10.30 Converting Two-Digit Year Values to Four-Digit Form
    10.31 Performing Validity Checking on Date or Time Subparts
    10.32 Writing Date-Processing Utilities
    10.33 Using Dates with Missing Components
    10.34 Performing Date Conversion Using SQL
    10.35 Using Temporary Tables for Data Transformation
    10.36 Dealing with NULL Values
    10.37 Guessing Table Structure from a Datafile
    10.38 A LOAD DATA Diagnostic Utility
    10.39 Exchanging Data Between MySQL and Microsoft Access
    10.40 Exchanging Data Between MySQL and Microsoft Excel
    10.41 Exchanging Data Between MySQL and FileMaker Pro
    10.42 Exporting Query Results as XML
    10.43 Importing XML into MySQL
    10.44 Epilog
    Chapter 11. Generating and Using Sequences
    Chapter 12. Using Multiple Tables
    Chapter 13. Statistical Techniques
    Chapter 14. Handling Duplicates
    Chapter 15. Performing Transactions
    Chapter 16. Introduction to MySQL on the Web
    Chapter 17. Incorporating Query Resultsinto Web Pages
    Chapter 18. Processing Web Input with MySQL
    Chapter 19. Using MySQL-Based Web Session Management
    Appendix A. Obtaining MySQL Software
    Appendix B. JSP and Tomcat Primer
    Appendix C. References
    Colophone
    Index


    More Books
    PHP Hacks
    Processing Xml With Java - A Guide To Sax, Dom, Jdom, Jaxp, And Trax
    The Koran (Holy Qur'an)
    Macromedia Flash 8 Bible
    Search Engine Optimization for Dummies
    YouTube Traffic
    PHP 5 for Dummies
    Harry Potter and The Chamber of Secrets
    Harry Potter and the Sorcerer's Stone
    The Pilgrim's Progress
    Wireless Hacks
    Flash Hacks. 100 Industrial-Strength Tips & Tools
    PayPal Hacks. 100 Industrial-Strength Tips and Tools
    Amazon Hacks
    Pdf Hacks
    The Da Vinci Code
    Google Hacks
    The Holy Bible
    Windows XP For Dummies
    Harry Potter and the Half-Blood Prince
    Seo Book
    Upgrading and Repairing Networks
    Macromedia Dreamweaver 8 UNLEASHED
    Windows XP Annoyances
    Windows XP Hacks
    Microsoft Windows XP Power Toolkit
    Teach Yourself MS Office In 24Hours
    iPod & iTunes Missing Manual
    PC Hacks 100 Industrial-Strength Tips and Tools
    PC Overclocking, Optimization, and Tuning - 2th Edition
    PC Hardware In A Nutshell 3rd Edition
    PC Hardware in a Nutshell, 2nd Edition
    Upgrading and Repairing PCs
    Google for Dummies
    MySQL Cookbook
    Teach Yourself Macromedia Flash 8 In 24 Hours
    PHP CookBook
    Sams Teach Yourself JavaScript in 24 Hours
    PHP5 Manual
    Free Games Paper Airplanes
    500 Juegos Gratis 500 Giochi Gratis 500 Jeux Gratuits 500 Jogos Gratis 500 Kostenlose Spiele