Google Hacks

Hack 78 Scraping Yahoo! Buzz for a Google Search

A proof-of-concept hack that scrapes the buzziest item from Yahoo! Buzz and submits it to a Google search.

No web site is an island. Billions of hyperlinks link to billions of documents. Sometimes, however, you want to take information from one site and apply it to another site.

Unless that site has a web service API like Google's, your best bet is scraping. Scraping is where you use an automated program to extract specific bits of information from a web page. Commonly scraped elements include stock quotes, news headlines, prices, and so forth. You name it and someone's probably scraped it.

There's some controversy about scraping. Some sites don't mind it, while others can't stand it. If you decide to scrape a site, do it gently; take the minimum amount of information you need and, whatever you do, don't hog the scrapee's bandwidth.

So, what are we scraping?

Google has a query popularity page; it's called Google Zeitgeist (http://www.google.com/press/zeitgeist.html). Unfortunately, the Zeitgeist is only updated once a week and contains only a limited amount of scrapable data. That's where Yahoo! Buzz (http://buzz.yahoo.com/) comes in. The site is rich with constantly updated information. Its "Buzz Index" keeps tabs on what's hot in popular culture: celebs, games, movies, television shows, music, and more.

This hack grabs the buzziest of the buzz, top of the "Leaderboard," and searches Google for all it knows on the subject. And to keep things current, only pages indexed by Google within the past few days [Hack #11] are considered.
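The date restriction relies on Google's daterange: operator, which takes Julian day numbers. As a minimal sketch of that arithmetic, the query string can be built with core Perl alone. The helper names julian_day_utc and buzz_query are made up for illustration, and the calculation works in UTC, whereas the hack itself uses Time::JulianDay and local time, so the two can differ by a day around midnight.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Julian day number for the UTC calendar day containing $epoch.
# 1970-01-01 corresponds to Julian day 2440588.
sub julian_day_utc {
    my ($epoch) = @_;
    return int($epoch / 86_400) + 2_440_588;
}

# Build a phrase query limited to the last $days_back days.
sub buzz_query {
    my ($term, $epoch, $days_back) = @_;
    my $today = julian_day_utc($epoch);
    return sprintf '"%s" daterange:%d-%d', $term, $today - $days_back, $today;
}

# Query for the past three days, counted back from now
print buzz_query('Eminem', time, 3), "\n";
```

Passing a fixed epoch time instead of time makes the result reproducible, which is handy when checking the query format by hand.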

This hack requires two Perl modules beyond SOAP::Lite: Time::JulianDay (http://search.cpan.org/search?query=Time%3A%3AJulianDay) and LWP::Simple (http://search.cpan.org/search?query=LWP%3A%3ASimple). It won't run without them.

78.1 The Code

#!/usr/local/bin/perl
# buzzgle.pl
# Pull the top item from the Yahoo Buzz Index and query the last
# three days' worth of Google's index for it
# Usage: perl buzzgle.pl

# Your Google API developer's key
my $google_key='insert key here';

# Location of the GoogleSearch WSDL file
my $google_wdsl = "./GoogleSearch.wsdl";

# Number of days back to go in the Google index
my $days_back = 3;

use strict;

use SOAP::Lite;
use LWP::Simple;
use Time::JulianDay;

# Scrape the top item from the Yahoo Buzz Index

# Grab a copy of http://buzz.yahoo.com

my $buzz_content = get("http://buzz.yahoo.com/")
  or die "Couldn't grab the Yahoo! Buzz page\n";

# Find the first item on the Buzz Index list
my($buzziest) = $buzz_content =~ m!<TR BGCOLOR=white.+?1.+?<a href="http://search.yahoo.com/search\?p=.+?&cs=bz">(.+?)</a>!i;
die "Couldn't figure out the Yahoo! buzz\n" unless $buzziest;

# Figure out today's Julian date
my $today = int local_julian_day(time);

# Build the Google query
my $query = "\"$buzziest\" daterange:" . ($today - $days_back) . "-$today"; 

print 
  "The buzziest item on Yahoo Buzz today is: $buzziest\n",
  "Querying Google for: $query\n",
  "Results:\n\n";

# Create a new SOAP::Lite instance, feeding it GoogleSearch.wsdl
my $google_search = SOAP::Lite->service("file:$google_wdsl");

# Query Google
my $results = $google_search -> 
    doGoogleSearch(
      $google_key, $query, 0, 10, "false", "",  "false",
      "", "latin1", "latin1"
    );

# No results?
@{$results->{resultElements}} or die "No results";

# Loop through the results
foreach my $result (@{$results->{'resultElements'}}) {
  my $output =
    join "\n",
    $result->{title} || "no title",
    $result->{URL},
    $result->{snippet} || 'no snippet',
    "\n";
  $output =~ s!<.+?>!!g; # drop all HTML tags
  print $output;
}

78.2 Running the Hack

The script runs from the command line without need of arguments of any kind. Probably the best thing to do is to direct the output to a pager (a command-line application that allows you to page through long output, usually by hitting the spacebar), like so:

% perl buzzgle.pl | more

Or you can direct the output to a file for later perusal:

% perl buzzgle.pl > buzzgle.txt

As with all scraping applications, this code is fragile, subject to breakage if (read: when) HTML formatting of the Yahoo! Buzz page changes. If you find you have to adjust to match Yahoo!'s formatting, you'll have to alter the regular expression match as appropriate:

my($buzziest) = $buzz_content =~ m!<TR BGCOLOR=white.+?1.+?<a href="http://search.yahoo.com/search\?p=.+?&cs=bz">(.+?)</a>!i;

Regular expressions and general HTML scraping are beyond the scope of this book. For more information, I suggest you consult O'Reilly's Perl and LWP (http://www.oreilly.com/catalog/perllwp/) or Mastering Regular Expressions (http://www.oreilly.com/catalog/regex/).
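As a rough illustration of what such an adjustment might look like, the sketch below pulls the text of the first search link out of a page fragment. The sample markup and the first_buzz_item helper are invented for this example; they are not Yahoo!'s actual HTML, so treat the pattern as a starting point only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the link text of the first Yahoo! search link in $html,
# or undef if nothing matches. Hypothetical helper for illustration.
sub first_buzz_item {
    my ($html) = @_;
    my ($item) = $html =~ m{
        <a \s+ href="http://search\.yahoo\.com/search\?p=[^"]*"   # the search link
        [^>]* >                                                    # rest of the opening tag
        (.+?)                                                      # link text, captured
        </a>
    }xis;
    return unless defined $item;
    $item =~ s/<[^>]+>//g;      # drop nested tags such as <b>
    $item =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    return $item;
}

# Made-up fragment standing in for the real page
my $sample = <<'HTML';
<tr bgcolor=white><td>1</td>
<td><a href="http://search.yahoo.com/search?p=Eminem&cs=bz"><b>Eminem</b></a></td></tr>
<tr><td>2</td>
<td><a href="http://search.yahoo.com/search?p=Example&cs=bz">Example</a></td></tr>
HTML

my $top = first_buzz_item($sample);
print defined $top ? "Top item: $top\n" : "No match; the pattern needs adjusting\n";
```

Keeping the pattern in its own subroutine and testing it against a saved copy of the page means that, when the markup changes, only the pattern needs to be revisited.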

78.3 The Results

At the time of this writing (and probably for some time yet to come), musical sensation Eminem is all the rage:

% perl buzzgle.pl | less
The buzziest item on Yahoo Buzz today is: Eminem
Querying Google for: "Eminem" daterange:2452593-2452596
Results:
Eminem World
http://www.eminemworld.com/
Eminem World specializing in Eminem News and Information. With
Pictures, Discography, Lyrics ... your #1 Eminem Resource. Eminem
World, ...
Eminem
http://www.eminem.com/frameset.asp?PageName=eminem
no snippet
Eminem Planet - Your Ultimate Resource
http://www.eminem-planet.com/
Eminem Planet - A Great Resource about the Real Slim Shady. .:8 Mile .:News
.:Biography ... More News. ::Order Eminem's book. Click Here to Check ...
...

78.4 Hacking the Hack

Here are some ideas for hacking the hack:

  • The program as it stands returns 10 results. You could change that to one result and immediately open that result instead of returning a list. Bravo, you've just written I'm Feeling Popular!, as in Google's I'm Feeling Lucky!

  • This version of the program searches the last three days of indexed pages. Because there's a slight lag in indexing news stories, I would search at least the last two days' worth of indexed pages, but you could extend the window to seven days or even a month. Simply change the value of the $days_back variable in the line my $days_back = 3;.

  • You could create a "Buzz Effect" hack by running the Google query for the top Buzz item with and without the date-range limitation. How do the results change between a full search and a search of the last few days?

  • Yahoo!'s Buzz has several different sections. This one looks at the Buzz summary, but you could create other ones based on Yahoo!'s other buzz charts (television, http://buzz.yahoo.com/television/, for instance).
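The first idea above mostly comes down to taking one element from the result structure. Here is a minimal sketch, assuming the same $results shape the script reads (a resultElements array of hashes with URL and title keys). The top_result_url helper and the stand-in data are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the URL of the first result, or undef when there are none.
sub top_result_url {
    my ($results) = @_;
    my $elements = ($results && $results->{resultElements}) || [];
    return @$elements ? $elements->[0]{URL} : undef;
}

# Stand-in for what doGoogleSearch returns; not real API output
my $fake_results = {
    resultElements => [
        { title => 'Example one', URL => 'http://www.example.com/one' },
        { title => 'Example two', URL => 'http://www.example.com/two' },
    ],
};

my $url = top_result_url($fake_results);
print defined $url ? "I'm Feeling Popular: $url\n" : "No results\n";
```

In the real script you would probably also change the 10 in the doGoogleSearch call to 1, then hand the URL to whatever browser launcher your platform provides.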
