Google Hacks

Hack 60 Date-Range Searching with a Client-Side Application

Monitor a set of queries for new finds added to the Google index yesterday.

The GooFresh [Hack #42] hack is a simple web form-driven CGI script for building date range [Hack #11] Google queries. A simple web-based interface is fine when you want to search for only one or two items at a time. But what of performing multiple searches over time, saving the results to your computer for comparative analysis?

A better fit for this task is a client-side application that you run from the comfort of your own computer's desktop. This Perl script feeds specified queries to Google via the Google Web API, limiting results to those indexed yesterday. New finds are appended to a comma-delimited text file per query, suitable for import into Excel or your average database application.

This hack requires an additional Perl module, Time::JulianDay (http://search.cpan.org/author/MUIR/); it just won't work until you have the module installed.

60.1 The Queries

First, you'll need to prepare a few queries to feed the script. Try these out via the Google search interface itself first to make sure you're receiving the kind of results you're expecting. Your queries can be anything you'd be interested in tracking over time: topics of long-lasting or current interest, searches for new directories of information [Hack #21] coming online, unique quotes from articles or other sources that you want to monitor for signs of plagiarism.

Use whatever special syntaxes you like except for link:; as you might remember, link: can't be used in concert with any other special syntax like daterange:, upon which this hack relies. If you insist on trying anyway (e.g., link:www.yahoo.com daterange:2452421-2452521), Google will simply treat link as yet another query word (e.g., link www.yahoo.com), yielding some unexpected and useless results.

Put each query on its own line. A sample query file will look something like this:

"digital archives"
intitle:"state library of"
intitle:directory intitle:resources
"now * * time for all good men * come * * aid * * party"

Save the text file somewhere memorable; alongside the script you're about to write is as good a place as any.
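
Before handing the file to the script, it can help to weed out lines that will not behave. The following is a minimal sketch, not part of the original hack: the helper name usable_queries is hypothetical, and it simply skips blank lines and any query containing link:, on the assumption that such queries are never wanted here.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the queries worth sending, skipping
# blank lines and anything using link:, which cannot be combined
# with daterange:.
sub usable_queries {
    my @usable;
    for my $line (@_) {
        (my $query = $line) =~ s/^\s+|\s+$//g;   # trim whitespace and newline
        next if $query eq '';
        if ($query =~ /\blink:/i) {
            warn "Skipping '$query': link: does not mix with daterange:\n";
            next;
        }
        push @usable, $query;
    }
    return @usable;
}

my @queries = usable_queries(
    qq{"digital archives"\n},
    "\n",
    "link:www.yahoo.com\n",
    qq{intitle:"state library of"\n},
);
print "$_\n" for @queries;
```

The check is deliberately crude: it would also reject a quoted phrase that happens to contain the text link:, which is probably acceptable for a small query file.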

60.2 The Code

#!/usr/local/bin/perl -w
# goonow.pl
# feeds queries specified in a text file to Google, querying
# for recent additions to the Google index.  The script appends
# to CSV files, one per query, creating them if they don't exist.
# usage: perl goonow.pl [query_filename]

# My Google API developer's key
my $google_key='insert key here';

# Location of the GoogleSearch WSDL file
my $google_wsdl = "./GoogleSearch.wsdl";

use strict;

use SOAP::Lite;
use Time::JulianDay;

$ARGV[0] or die "usage: perl goonow.pl [query_filename]\n";

# Julian date for the day before yesterday
my $julian_date = int local_julian_day(time) - 2;

my $google_search  = SOAP::Lite->service("file:$google_wsdl");

open QUERIES, $ARGV[0] or die "Couldn't read $ARGV[0]: $!";

while (my $query = <QUERIES>) {
  chomp $query;
  warn "Searching Google for $query\n";

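  # Name the output file after the bare query, before daterange: is
  # appended, so that each day's finds land in the same file.
  (my $outfile = $query) =~ s/\W/_/g;
  $query .= " daterange:$julian_date-$julian_date";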
  open (OUT, ">> $outfile.csv")
    or die "Couldn't open $outfile.csv: $!\n";
   
  my $results = $google_search ->
    doGoogleSearch(
      $google_key, $query, 0, 10, "false", "",  "false",
      "", "latin1", "latin1"
    );
  foreach (@{$results->{'resultElements'}}) {
    print OUT '"' . join('","', (
      map {
        s!\n!!g; # drop spurious newlines
        s!<.+?>!!g; # drop all HTML tags
        s!"!""!g; # double escape " marks
        $_;
      } @$_{'title','URL','snippet'}
    ) ) . "\"\n";
  }
}

You'll notice that GooNow checks the day before yesterday's rather than yesterday's additions (my $julian_date = int local_julian_day(time) - 2;). Google indexes some pages very frequently; these show up in yesterday's additions and really bulk up your search results. So if you search for yesterday's results, in addition to updated pages you'll get a lot of noise, pages that Google indexes every day, rather than the fresh content you're after. Skipping back one more day is a nice hack to get around the noise.
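
To see what the daterange: value actually is, here is a rough sketch that computes a Julian day number with core Perl only. The helper names utc_julian_day and daterange_term are hypothetical. Unlike the script, which uses local_julian_day from Time::JulianDay and so follows your local time zone, this version works in UTC, so the two may differ by a day around midnight.

```perl
use strict;
use warnings;

# Julian day number of the Unix epoch, 1 January 1970.
use constant EPOCH_JULIAN_DAY => 2440588;

# Hypothetical helper: Julian day number for the UTC date containing
# $epoch_seconds (seconds since 1970-01-01 00:00:00 UTC).
sub utc_julian_day {
    my ($epoch_seconds) = @_;
    return int($epoch_seconds / 86400) + EPOCH_JULIAN_DAY;
}

# Hypothetical helper: build the daterange: term for a single day,
# $days_back days before $epoch_seconds.
sub daterange_term {
    my ($epoch_seconds, $days_back) = @_;
    my $day = utc_julian_day($epoch_seconds) - $days_back;
    return "daterange:$day-$day";
}

# The day before yesterday, as the script uses.
print daterange_term(time, 2), "\n";
```

Changing the second argument to 1 would give yesterday's additions instead, noise and all.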

60.3 Running the Hack

This script is invoked on the command line like so:

$ perl goonow.pl query_filename

Where query_filename is the name of the text file holding all the queries to be fed to the script. The file can be located either in the local directory or elsewhere; if the latter, be sure to include the entire path (e.g., /mydocu~1/hacks/queries.txt).

Bear in mind that all output is directed to CSV files, one per query. So don't expect any fascinating output on the screen.
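
If you're wondering where to look for a given query's output, this small sketch applies the same s/\W/_/g substitution the script uses, here to the query text itself. The helper name csv_name_for is hypothetical.

```perl
use strict;
use warnings;

# Same substitution the script applies: every non-word character
# (anything other than letters, digits, and underscore) becomes "_".
sub csv_name_for {
    my ($query) = @_;
    (my $name = $query) =~ s/\W/_/g;
    return "$name.csv";
}

for my $query ('"digital archives"', 'intitle:"state library of"') {
    printf "%-30s -> %s\n", $query, csv_name_for($query);
}
```

Note that two different queries can collapse to the same filename (for example, queries differing only in punctuation), in which case their results would share a file.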

60.4 The Results

Here's a quick look at one of the CSV output files created, intitle__state_library_of_.csv:

"State Library of Louisiana","http://www.state.lib.la.us/"," ... Click here if you have any questions or comments. Copyright © 1998-2001 State Library of Louisiana Last modified: August 07, 2002. "
"STATE LIBRARY OF NEW SOUTH WALES, SYDNEY AUSTRALIA","http://www.slnsw.gov.au/"," ... State Library of New South Wales Macquarie St, Sydney NSW Australia 2000 Phone: +61 2 9273 1414 Fax: +61 2 9273 1255. Your comments You could win a prize! ...  "
"State Library of Victoria","http://www.slv.vic.gov.au/"," ... clicking on our logo. State Library of Victoria Logo with link to homepage State Library of Victoria. A world class cultural resource ...  "
...
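
Should you want to read these files back in Perl rather than in Excel, the following sketch splits one line into its three fields. The helper name parse_row is hypothetical, and it assumes the layout the script writes: every field wrapped in double quotes, with any embedded quote marks doubled. It is not a general-purpose CSV parser.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line written by the script into its
# fields. Assumes every field is double-quoted and that embedded quote
# marks were doubled, as the script does when writing.
sub parse_row {
    my ($line) = @_;
    chomp $line;
    my @fields;
    while ($line =~ /\G"((?:[^"]|"")*)"(?:,|$)/g) {
        (my $field = $1) =~ s/""/"/g;   # undo the doubled quote marks
        push @fields, $field;
    }
    return @fields;
}

my ($title, $url, $snippet) = parse_row(
    qq{"State Library of Victoria","http://www.slv.vic.gov.au/"," ... A ""world class"" resource ... "\n}
);
print "$title\n$url\n$snippet\n";
```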

60.5 Hacking the Hack

The script keeps appending new finds to the appropriate CSV output file. If you wish to reset the CSV files associated with particular queries, simply delete them and the script will create them anew.

Or you can make one slight adjustment to have the script create the CSV files anew each time, overwriting the previous version, like so:

...
(my $outfile = $query) =~ s/\W/_/g;
open (OUT, "> $outfile.csv")
  or die "Couldn't open $outfile.csv: $!\n";
my $results = $google_search ->
  doGoogleSearch(
    $google_key, $query, 0, 10, "false", "", "false",
    "", "latin1", "latin1"
  );
...

Notice that the only change in the code is the removal of one of the > characters when the output file is opened: open (OUT, "> $outfile.csv") instead of open (OUT, ">> $outfile.csv"). A single > truncates the file before writing, while >> appends to whatever is already there.
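
If you'd rather not edit the script each time you change your mind, one option is to pick the open mode at run time. This is only a sketch under stated assumptions: the helper write_rows and its $overwrite flag are hypothetical, and the demonstration writes to a temporary directory instead of alongside the script.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: write rows to $path, appending by default or
# starting the file afresh when $overwrite is true.
sub write_rows {
    my ($path, $overwrite, @rows) = @_;
    my $mode = $overwrite ? '>' : '>>';
    open(my $out, $mode, $path) or die "Couldn't open $path: $!\n";
    print {$out} "$_\n" for @rows;
    close($out) or die "Couldn't close $path: $!\n";
}

my $dir  = tempdir(CLEANUP => 1);   # removed automatically on exit
my $path = "$dir/demo.csv";

write_rows($path, 1, '"first run"');    # start afresh
write_rows($path, 0, '"second run"');   # append: the file now has two rows
write_rows($path, 1, '"third run"');    # overwrite: back to a single row
```

The three-argument form of open used here keeps the mode separate from the filename, which is safer when filenames are derived from query text.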
