PHP Hacks

Hack 47. Search Microsoft Word Documents

Search the text in Microsoft Word documents by parsing WordML files.

A lot of valuable data is locked up in Microsoft Word documents. Resumes are especially tempting for data-mining applications: job boards need code that parses Word documents and finds keywords or phrases to categorize job candidates. This hack demonstrates how to search Word documents saved as WordML for text strings.

5.15.1. The Code

Save the code shown in Example 5-37 as index.php.

Example 5-37. HTML that handles data uploads
<html>
<body>
<form enctype="multipart/form-data" action="search.php" method="post">
	<input type="hidden" name="MAX_FILE_SIZE" value="2000000" />
	WordML file: <input type="file" name="file" /><br/>
	<input type="submit" value="Upload" />
</form>
</body>
</html>

Save the code in Example 5-38 as search.php. This script extracts the unique words from the uploaded WordML file and lists them.

Example 5-38. Script that handles searching
<html>
<body>
<?php
// Unique words found in the document, stored as array keys
$wordlist = array();

$dom = new DOMDocument();
if ( !empty( $_FILES['file']['tmp_name'] ) )
{
	// Parse the uploaded WordML and collect the text (t) elements
	$dom->load( $_FILES['file']['tmp_name'] );
	$found = $dom->getElementsByTagName( "t" );

	foreach( $found as $element )
	{
		// explode() replaces split(), which was removed in PHP 7
		$words = explode( ' ', $element->nodeValue );
		foreach( $words as $word )
		{
			// Strip commas and periods, then any surrounding whitespace
			$word = preg_replace( '/[,.]/', '', $word );
			$word = trim( $word );
			if ( strlen( $word ) > 0 )
			{
				$word = strtolower( $word );
				$wordlist[ $word ] = 0;
			}
		}
	}
}
$words = array_keys( $wordlist );
sort( $words );

foreach( $words as $word ) {
?>
<?php echo( htmlspecialchars( $word ) ); ?><br/>
<?php } ?>
</body>
</html>

The search.php script starts by loading the uploaded WordML file into an XML DOM object. It then finds all of the t nodes, which are where the text of the document is stored. It splits the text of each node into words, strips commas and periods from each word, lowercases it, and stores it as a key in a hash table called $wordlist, so each word is recorded only once. Finally, the script sorts the words and writes them out.
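The same steps can be pulled into a function that is easier to reuse and test. The following is a minimal sketch, not part of the original hack: the extractWords() helper and the WORDML_NS constant are hypothetical names, and the sketch assumes the document uses the Word 2003 WordML namespace for its w:t elements. It matches the elements by namespace rather than by bare tag name, and splits on any run of whitespace.

```php
<?php
// Namespace URI that Word 2003 WordML binds to the w: prefix (assumed here).
const WORDML_NS = 'http://schemas.microsoft.com/office/word/2003/wordml';

/**
 * Hypothetical helper: returns the sorted, unique, lowercased words
 * found in the w:t elements of a WordML string.
 */
function extractWords(string $xml): array
{
    $dom = new DOMDocument();
    // Give up quietly on empty or malformed input.
    if ($xml === '' || !@$dom->loadXML($xml)) {
        return [];
    }

    $wordlist = [];
    foreach ($dom->getElementsByTagNameNS(WORDML_NS, 't') as $element) {
        // Split on runs of whitespace, dropping empty pieces.
        $words = preg_split('/\s+/', $element->nodeValue, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            // Strip commas and periods, then lowercase.
            $word = strtolower(str_replace([',', '.'], '', $word));
            if ($word !== '') {
                $wordlist[$word] = true; // array keys act as a set
            }
        }
    }

    $words = array_keys($wordlist);
    sort($words);
    return $words;
}

// A tiny hand-written WordML fragment used only for illustration.
$sample = '<?xml version="1.0"?>'
    . '<w:wordDocument xmlns:w="' . WORDML_NS . '">'
    . '<w:body><w:p><w:r><w:t>Hello, world.</w:t></w:r>'
    . '<w:r><w:t> Hello PHP</w:t></w:r></w:p></w:body>'
    . '</w:wordDocument>';

print_r(extractWords($sample));
```

For the sample fragment, the function returns hello, php, and world. In search.php you could pass it the result of file_get_contents() on the uploaded file instead of the sample string.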

5.15.2. Running the Hack

Write a simple Microsoft Word 2003 document and save it as a WordML file somewhere on your disk. Then upload index.php and search.php to your web server and point your browser at index.php. The page should look like Figure 5-19.

Click on the Browse button and select the WordML file, then click on the Upload button. That sends the file to the search.php script, which uses the XML DOM to read it. The words found in the WordML file are sorted and listed on the HTML page, as shown in Figure 5-20.
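If the upload fails, for example because the file exceeds the MAX_FILE_SIZE value in the form, search.php simply shows an empty page. One way to report the problem is to inspect the error code that PHP stores with each upload. This is a rough sketch under that assumption; describeUploadProblem() is a hypothetical helper, not part of the original hack.

```php
<?php
/**
 * Hypothetical helper: returns null when the upload succeeded,
 * or a short message describing what went wrong.
 */
function describeUploadProblem(array $file): ?string
{
    // Treat a missing entry the same as "no file uploaded".
    $error = $file['error'] ?? UPLOAD_ERR_NO_FILE;

    switch ($error) {
        case UPLOAD_ERR_OK:
            return null;
        case UPLOAD_ERR_INI_SIZE:   // larger than upload_max_filesize in php.ini
        case UPLOAD_ERR_FORM_SIZE:  // larger than MAX_FILE_SIZE in the form
            return 'The file is larger than the allowed upload size.';
        case UPLOAD_ERR_NO_FILE:
            return 'No file was uploaded.';
        default:
            return 'The upload failed (error code ' . $error . ').';
    }
}

// "file" is the name of the file input in index.php.
$problem = describeUploadProblem($_FILES['file'] ?? []);
if ($problem !== null) {
    echo htmlspecialchars($problem);
}
```

Calling this at the top of search.php, before loading the DOM, lets the page explain a failed upload instead of printing nothing.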

From here, you can look for specific words, or count the occurrence of certain words [Hack #24].
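As a sketch of what that might look like, the code below counts occurrences and checks for a list of keywords. It assumes you have collected every word from the document into a plain array, duplicates included, rather than the unique keys that search.php keeps in $wordlist. The countWords() and findKeywords() helpers and the sample data are hypothetical.

```php
<?php
/**
 * Hypothetical helper: maps each lowercased word to the number of
 * times it appears, most frequent first.
 */
function countWords(array $words): array
{
    $counts = array_count_values(array_map('strtolower', $words));
    arsort($counts);
    return $counts;
}

/**
 * Hypothetical helper: returns the keywords that appear in the
 * counts, keyed by keyword, with the number of occurrences.
 */
function findKeywords(array $counts, array $keywords): array
{
    $found = [];
    foreach ($keywords as $keyword) {
        $key = strtolower($keyword);
        if (isset($counts[$key])) {
            $found[$keyword] = $counts[$key];
        }
    }
    return $found;
}

// Sample word list standing in for the words of a resume.
$words = ['php', 'mysql', 'php', 'xml', 'PHP'];
$counts = countWords($words);

print_r(findKeywords($counts, ['PHP', 'Java', 'XML']));
```

With the sample list, PHP is found three times, XML once, and Java not at all. A job board could use counts like these to decide which categories a candidate belongs in.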

Figure 5-19. The upload page


Figure 5-20. The words found in the uploaded document


WordML is only supported by Microsoft Word 2003 and later versions. It's not currently supported on the Macintosh, though I expect it will be in later versions. To support older versions of Microsoft Word, you might want to rewrite the hack code to parse RTF instead of WordML. Every recent version of Microsoft Word supports RTF.


5.15.3. See Also
