PHP Hacks Free Open Book

Hack 44. Scrape Web Pages for Data

Use regular expressions to scrape data from sources like Metacritic.

What do you do when you want the data from a site, but the site won't let you export that data in a predictable format (like XML [Hack #38] or CSV [Hack #43])? One popular option is to perform what's called a screen scrape on the HTML to extract the data. Screen scraping starts with downloading the contents of the page containing the data into either a string in memory or a file. Regular expressions are then used to extract the relevant data from the string or file.

You can scrape almost any web site for data; for the example in this hack, I chose the Metacritic DVD review page (http://www.metacritic.com/video/).

Figure 5-9. The resulting generated PHP


Metacritic is a site where movies, music, and video games are given a review score based on a selection of reviews. Figure 5-10 shows the Metacritic page that I scraped for this hack. On the lefthand side of the window is a list of movies ordered by name, along with their review scores.

I can tell from the size of the page that I want only a small portion of the HTML. I use View Source to see what the code looks like, and indeed the scores sit in a section clearly delimited by a div tag:

	</TR>
	</TABLE>
	  <DIV ID="sortbyname1">
	  <P CLASS="listing">
	  <SPAN CLASS="yellow">51</SPAN>
		  <A HREF="/video/titles/800bullets">800 Bullets</A><BR>
	  <SPAN CLASS="yellow">58</SPAN>
		  <A HREF="/video/titles/actsofworship">Acts of Worship</A><BR>
	  <SPAN CLASS="green">81</SPAN>
		<A HREF="/video/titles/badeducation"><B>Bad Education</B></A><IMG
	  SRC="/_images/scores/star.gif" WIDTH="11" HEIGHT="11" ALIGN="absmiddle"><BR>
	  …

The first step will be to extract just this div tag. Then we need to use another regular expression to pick out each movie entry from text within the div tag. Notice that each movie listing starts with a span tag and ends with a br tag; that's good enough to delineate each movie. The third listing has some extra stuff around the movie title that I strip out with another set of regular expressions.
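The three steps can be tried offline before touching the live site. What follows is only a sketch: $html is a hypothetical stand-in for the downloaded page, built from the fragment above, and its closing </P> and </DIV> tags are assumed because the listing is truncated.

```php
<?php
// Hypothetical stand-in for the downloaded page, based on the fragment above.
// The closing </P> and </DIV> are assumed; the original listing is truncated.
$html = <<<HTML
<DIV ID="sortbyname1">
<P CLASS="listing">
<SPAN CLASS="yellow">51</SPAN>
  <A HREF="/video/titles/800bullets">800 Bullets</A><BR>
<SPAN CLASS="green">81</SPAN>
  <A HREF="/video/titles/badeducation"><B>Bad Education</B></A><IMG
SRC="/_images/scores/star.gif" WIDTH="11" HEIGHT="11" ALIGN="absmiddle"><BR>
</P>
</DIV>
HTML;

// Step 1: isolate the div that holds the name-sorted list
preg_match( '/<DIV ID="sortbyname1">(.*?)<\/DIV>/is', $html, $div );

// Step 2: split the div into entries, one per SPAN ... BR run
preg_match_all( '/<SPAN.*?>(.*?)<\/SPAN>.*?<A.*?>(.*?)<BR>/is',
	$div[1], $entries, PREG_SET_ORDER );

// Step 3: strip leftover tags from each title. The s modifier lets
// a tag that spans a line break (like the IMG tag) match as well.
$movies = array();
foreach( $entries as $entry )
{
	$title = trim( preg_replace( '/<.*?>/s', '', $entry[2] ) );
	$movies []= array( $entry[1], $title );
}

print_r( $movies );
```

Keeping each step in its own expression means that when one of them stops matching, you can print the intermediate value and see which stage broke.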

Figure 5-10. The Metacritic DVD and Video Review page


I strongly recommend using a divide-and-conquer technique when writing screen-scraping code. Don't try to do all of the work with a single regular expression, or you'll end up with indecipherable code that even you can't maintain.

5.12.1. The Code

Save the code in Example 5-33 as scrapecritic.php.

Example 5-33. PHP for loading a URL and scraping content from it
<html>
<?php
// Set up the cURL handle
$ch = curl_init( "http://www.metacritic.com/video/" );

// Fake out the User Agent
curl_setopt( $ch, CURLOPT_USERAGENT, "Internet Explorer" );

// Start the output buffering
ob_start();

// Get the HTML from Metacritic
curl_exec( $ch );
curl_close( $ch );

// Get the contents of the output buffer
$str = ob_get_contents();
ob_end_clean();

$movies = array();

// Get just the list sorted by name; $byname[1] holds the text inside the div
if( preg_match( "/<DIV ID=\"sortbyname1\">(.*?)<\/DIV>/is", $str, $byname ) )
{
	// Get each of the movie entries
	preg_match_all( "/<SPAN.*?>(.*?)<\/SPAN>.*?<A.*?>(.*?)<BR>/is",
		$byname[1], $moviedata );

	// Work through the raw movie data
	for( $i = 0; $i < count( $moviedata[1] ); $i++ )
	{
		// The score is ok already
		$score = $moviedata[1][$i];

		// We need to remove tags from the title (the s modifier lets
		// a tag span a line break) and decode the HTML entities
		$title = $moviedata[2][$i];
		$title = preg_replace( "/<.*?>/s", "", $title );
		$title = trim( html_entity_decode( $title ) );

		// Then add the movie to the array
		$movies []= array( $score, $title );
	}
}
?>
<body>
<table>
<tr>
<th>Name</th><th>Score</th>
</tr>
<?php foreach( $movies as $movie ) { ?>
<tr>
<td><?php echo( htmlspecialchars( $movie[1] ) ); ?></td>
<td><?php echo( htmlspecialchars( $movie[0] ) ); ?></td>
</tr>
<?php } ?>
</table>
</body>
</html>

The scrapecritic.php script starts by downloading the current contents of the Metacritic DVD page into a string. By default, curl_exec() prints the downloaded page straight to the output, so the script wraps the call in ob_start(), ob_get_contents(), and ob_end_clean() to capture that text in a string instead of sending it to the browser.
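cURL can also hand the page back as a return value, which avoids the output buffer altogether. The following is a minimal sketch of that alternative using the CURLOPT_RETURNTRANSFER option; the fetch_page() name is made up for illustration and is not part of the hack's script.

```php
<?php
// fetch_page() is a hypothetical helper name, not part of the original script.
// Returns the response body as a string, or false if the transfer failed.
function fetch_page( $url )
{
	$ch = curl_init( $url );
	curl_setopt( $ch, CURLOPT_USERAGENT, "Internet Explorer" );

	// Ask curl_exec() to return the body instead of printing it
	curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );

	// The handle is released when $ch goes out of scope
	return curl_exec( $ch );
}

// Usage (performs a live request, so it is left commented out):
// $str = fetch_page( "http://www.metacritic.com/video/" );
```

Either approach ends with the HTML in a string; this one simply has fewer moving parts and cannot leak partial output to the browser.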

The next step is to grab just the div tag that corresponds to the list sorted by name, using a preg_match() with a regular expression customized to this particular page. This is a clear demonstration of the primary technical problem with screen scraping: if the site being scraped changes its formatting in even the slightest way, it can (and probably will) break the scraping code. It's always better to get an XML feed for the data if that's possible. A feed is published as data, so it is far less likely to change when the site's visual design does.
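Because a layout change makes the regular expression silently match nothing, it helps to check the result of preg_match() and report the failure. Here is one possible sketch; extract_name_list() is a hypothetical helper, and only the div ID comes from the page source shown earlier.

```php
<?php
// extract_name_list() is a hypothetical helper. It returns the inner HTML
// of the name-sorted div, or null when the marker is missing.
function extract_name_list( $html )
{
	if( preg_match( '/<DIV ID="sortbyname1">(.*?)<\/DIV>/is', $html, $m ) )
	{
		return $m[1];
	}
	return null;
}

// Simulate a page whose layout no longer contains the div
$list = extract_name_list( '<html><body>Redesigned page</body></html>' );
if( $list === null )
{
	// In a real script you might log this or fall back to older data
	echo "Layout changed: sortbyname1 div not found\n";
}
```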

With the name-sorted list in hand, the script then uses preg_match_all() to extract all of the movie names and scores into an array. A loop then walks through the matches, stripping any extraneous tags from each movie name and decoding its HTML entities.

At this point, the data is cleaned and ready to be presented. The script uses a simple foreach loop to create a table that shows the name of the movie and the aggregated review score.

5.12.2. Running the Hack

To run the hack, copy the file onto your PHP server and surf to it in your web browser. The result should look like Figure 5-11.

Another use for screen scraping is content type conversion. You can take what was an HTML page and turn it into a WML page for web-enabled phones, or an RSS feed for news aggregators.
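As a rough illustration of the RSS case, the scraped array can be written out as a small RSS 2.0 document. This is a sketch under assumptions: movies_to_rss() is a hypothetical helper, the channel title and description are placeholder text, and the scraped strings are assumed to be UTF-8.

```php
<?php
// movies_to_rss() is a hypothetical helper. It expects the same
// array( score, title ) pairs that scrapecritic.php builds in $movies.
function movies_to_rss( $movies )
{
	$xml  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
	$xml .= "<rss version=\"2.0\">\n<channel>\n";
	$xml .= "<title>Scraped review scores</title>\n";
	$xml .= "<link>http://www.metacritic.com/video/</link>\n";
	$xml .= "<description>Scores scraped from the name-sorted list</description>\n";
	foreach( $movies as $movie )
	{
		// Escape the text so characters such as & do not break the XML
		$title = htmlspecialchars( $movie[1] . " (" . $movie[0] . ")", ENT_QUOTES );
		$xml .= "<item><title>" . $title . "</title></item>\n";
	}
	$xml .= "</channel>\n</rss>\n";
	return $xml;
}

echo movies_to_rss( array(
	array( "51", "800 Bullets" ),
	array( "81", "Bad Education" ) ) );
```

When serving this from a web page, you would typically send an XML content type with header() before echoing the feed.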


5.12.3. Problems with Screen Scraping

There are two major problems with screen scraping. The first is technical and the second is legal. On the technical side, screen scraping is inclined to break when the site being scraped changes its format. In addition, the scraping code for one site will likely not work on other sites because of formatting issues. Finally, screen scraping can be slow or even break when the target site is not responding to web requests in a timely manner.
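The slow-site problem can be contained by telling cURL how long it is allowed to wait. The sketch below uses the CURLOPT_CONNECTTIMEOUT and CURLOPT_TIMEOUT options; fetch_with_timeout() is a hypothetical helper, and the 5 and 10 second limits are arbitrary example values.

```php
<?php
// fetch_with_timeout() is a hypothetical helper; the default limits
// are example values, not recommendations from the original hack.
// Returns the page body, or null if the transfer failed or timed out.
function fetch_with_timeout( $url, $connectSeconds = 5, $totalSeconds = 10 )
{
	$ch = curl_init( $url );
	curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );

	// Give up if the connection cannot be opened in time
	curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $connectSeconds );

	// Give up if the whole transfer takes too long
	curl_setopt( $ch, CURLOPT_TIMEOUT, $totalSeconds );

	$body = curl_exec( $ch );
	if( $body === false )
	{
		// curl_error() describes what went wrong
		error_log( "Scrape failed: " . curl_error( $ch ) );
		return null;
	}
	return $body;
}

// Usage (performs a live request, so it is left commented out):
// $str = fetch_with_timeout( "http://www.metacritic.com/video/" );
// if( $str === null ) { echo "The review site is not responding."; }
```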

Choosing which pages to scrape is also important. Look for pages that were generated by a web application, as opposed to written by hand. Handwritten pages tend to have inconsistent markup; application-generated pages usually have a predictable format, which makes it much easier to write regular expressions that match it.

Figure 5-11. The resulting screen-scraped page


Web application pages normally end with extensions such as .php, .jsp, .asp, or some similar variant. Handwritten pages usually have the .htm or .html extension.


On the legal side, you must always make sure that you have permission to use the data in this way before adding this functionality to your site. There's nothing worse than writing lots of screen-scraping code only to find out that the content you've scraped was obtained illegally and cannot be used.

5.12.4. See Also
