Google Hacks

Hack 81 SafeSearch Certifying URLs

Feed URLs to Google's SafeSearch to determine whether or not they point at questionable content.

Only three things in life are certain: death, taxes, and accidentally visiting a once family-safe web site that now contains text and images that would make a horse blush.

As you probably know if you've ever put up a web site, domain names are registered for finite lengths of time. Sometimes registrations accidentally expire; sometimes businesses fold and allow the registrations to expire. Sometimes other companies take them over.

Some of these companies just want the domain name, some want the traffic that the defunct site generated, and in a few cases the new owners try to hold the domain name "hostage," offering to sell it back to the original owners for a great deal of money. (This doesn't work as well as it used to because of the dearth of Internet companies that actually have a great deal of money.)

When a site isn't what it once was, that's no big deal. When it's not what it once was and is now rated X, that's a bigger deal. When it's not what it once was, is now rated X, and is on the link list of a site you run, that's a really big deal.

But how to keep up with all the links? You can go visit every link periodically and see if it's still okay, or you can wait for the hysterical emails from site visitors, or you can just not worry about it. Or you can put the Google API to work.

This program lets you provide a list of URLs and check them in Google's SafeSearch mode. If they appear in SafeSearch mode, they're probably okay. If they don't appear, they're either not in Google's index or not good enough to pass Google's filter. The program then rechecks the URLs missing from SafeSearch with an unfiltered search. If they do not appear in the unfiltered search either, they're labeled as "unindexed." If they do appear in the unfiltered search, they're labeled as "suspect."
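
The decision described above boils down to two yes/no questions per URL. As a minimal sketch (the classify helper is hypothetical and is not part of suspect.pl, which does the same thing inline), it might look like this:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of suspect.pl.
# Both arguments are booleans: did the URL show up in the filtered
# (SafeSearch) results, and did it show up in the unfiltered results?
sub classify {
    my ($in_safesearch, $in_unfiltered) = @_;
    return 'safe'    if $in_safesearch;   # passed the filter
    return 'suspect' if $in_unfiltered;   # indexed, but filtered out
    return 'unindexed';                   # not found either way
}

print classify(1, 1), "\n";   # safe
print classify(0, 1), "\n";   # suspect
print classify(0, 0), "\n";   # unindexed
```

Because the filtered search is checked first, the unfiltered search only matters for URLs that failed it, which is why the program can skip the second query for anything already marked safe.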

81.1 Danger Will Robinson

While Google's SafeSearch filter is good, it's not infallible. (I have yet to see an automated filtering system that is infallible.) So if you run a list of URLs through this hack and they all show up in a SafeSearch query, don't take that as a guarantee that they're all completely inoffensive. Take it merely as a pretty good indication that they are. If you want absolute assurance, you're going to have to visit every link personally and often.

Here's a fun idea if you need an Internet-related research project. Take 500 or so domain names at random and run this program on the list once a week for several months, saving the results to a file each time. It'd be interesting to see how many domains/URLs end up being filtered out of SafeSearch over time.
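
If you try that project, you'll need some way to compare one week's results with the next. Here is a rough sketch of one approach, assuming each run was saved in the quoted, two-column CSV format this hack prints. The parse_results and newly_filtered helpers and the sample data are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse the CSV that suspect.pl prints into a hash of url => status.
# Assumes every field is double-quoted and contains no embedded quotes.
# The header row is skipped because its second field is not a status.
sub parse_results {
    my ($text) = @_;
    my %status;
    for my $line (split /\n/, $text) {
        next unless $line =~ /^"([^"]*)","(safe|suspect|unindexed)"/;
        $status{$1} = $2;
    }
    return \%status;
}

# Return, in sorted order, the URLs that were "safe" in the earlier
# run but are missing or no longer "safe" in the later one.
sub newly_filtered {
    my ($before, $after) = @_;
    my @changed = grep {
        $before->{$_} eq 'safe' && ($after->{$_} || '') ne 'safe'
    } sort keys %$before;
    return @changed;
}

# Made-up data standing in for two saved runs.
my $week1 = parse_results(<<'END');
"url","safe/suspect/unindexed"
"example.com","safe"
"example.org/links/","safe"
END

my $week2 = parse_results(<<'END');
"url","safe/suspect/unindexed"
"example.com","safe"
"example.org/links/","suspect"
END

print "$_\n" for newly_filtered($week1, $week2);
```

With the sample data, only example.org/links/ is reported, since it is the one URL whose status moved away from safe between the two runs.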

81.2 The Code

#!/usr/local/bin/perl
# suspect.pl
# Feed URLs to a Google SafeSearch. If inurl: returns results, the
# URL probably isn't questionable content.  If inurl: returns no 
# results, either it points at questionable content or isn't in
# the Google index at all. 

# Your Google API developer's key
my $google_key = 'put your key here';

# Location of the GoogleSearch WSDL file
my $google_wsdl = "./GoogleSearch.wsdl";

use strict;

use SOAP::Lite;

$|++; # turn off buffering

my $google_search = SOAP::Lite->service("file:$google_wsdl");

# CSV header
print qq{"url","safe/suspect/unindexed"\n};

while (my $url = <>) {
  chomp $url;
  $url =~ s!^\w+?://!!;
  $url =~ s!^www\.!!;

  # SafeSearch
  my $results = $google_search -> 
      doGoogleSearch(
      $google_key, "inurl:$url", 0, 10, "false", "",  "true",
      "", "latin1", "latin1"
    );

  print qq{"$url",};

  # \Q...\E quotes regex metacharacters (., ?, +) in the URL
  if (grep /\Q$url\E/, map { $_->{URL} } @{$results->{resultElements}}) {
    print qq{"safe"\n};
  } 
  else {
    # unSafeSearch
    my $results = $google_search -> 
        doGoogleSearch(
        $google_key, "inurl:$url", 0, 10, "false", "",  "false",
        "", "latin1", "latin1"
      );

    # Unsafe or Unindexed?
    print (
      (scalar grep /\Q$url\E/, map { $_->{URL} } @{$results->{resultElements}}) 
        ? qq{"suspect"\n}
        : qq{"unindexed"\n}
      );
  }
}      
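
The two substitutions at the top of the loop trim each URL before it goes into the inurl: query. To see what they do on their own, here is a small sketch that wraps them in a helper; the name normalize is hypothetical and does not appear in the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same two substitutions as the main loop of suspect.pl, wrapped in a
# hypothetical helper so they can be tried in isolation.
sub normalize {
    my ($url) = @_;
    $url =~ s!^\w+?://!!;   # drop a leading scheme such as http://
    $url =~ s!^www\.!!;     # drop a leading www.
    return $url;
}

for my $url ('http://www.oreilly.com/catalog/essblogging/',
             'hipporhinostricow.com') {
    print normalize($url), "\n";
}
# Prints:
#   oreilly.com/catalog/essblogging/
#   hipporhinostricow.com
```

This is why the URLs in the program's output no longer carry their http:// or www. prefixes.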

81.3 Running the Hack

To run the hack, you'll need a text file that contains the URLs you want to check, one line per URL. For example:

http://www.oreilly.com/catalog/essblogging/
http://www.xxxxxxxxxx.com/preview/home.htm
hipporhinostricow.com

The program runs from the command line. Enter perl, the name of the script, a less-than sign, and the name of the text file that contains the URLs you want to check. The program will return results that look like this:

% perl suspect.pl < urls.txt
"url","safe/suspect/unindexed"
"oreilly.com/catalog/essblogging/","safe"
"xxxxxxxxxx.com/preview/home.htm","suspect"
"hipporhinostricow.com","unindexed"

The first item is the URL being checked. The second is its probable safety rating, as follows:

safe

The URL appeared in a Google SafeSearch for the URL.

suspect

The URL did not appear in a Google SafeSearch, but did in an unfiltered search.

unindexed

The URL appeared in neither a SafeSearch nor unfiltered search.

You can redirect output from the script to a file for import into a spreadsheet or database:

% perl suspect.pl < urls.txt > urls.csv
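
If a spreadsheet is more than you need, a few lines of Perl can summarize the saved results. The following is only a sketch: the tally helper is hypothetical, and the lines are hardcoded from the sample output above rather than read from urls.csv.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how many rows carry each status. Takes a list of CSV lines in
# the format suspect.pl prints; the header row is ignored because its
# second field is not one of the three statuses.
sub tally {
    my %count = (safe => 0, suspect => 0, unindexed => 0);
    for my $line (@_) {
        $count{$1}++ if $line =~ /^"[^"]*","(safe|suspect|unindexed)"/;
    }
    return \%count;
}

# Sample rows copied from the output shown earlier.
my @lines = (
    '"url","safe/suspect/unindexed"',
    '"oreilly.com/catalog/essblogging/","safe"',
    '"xxxxxxxxxx.com/preview/home.htm","suspect"',
    '"hipporhinostricow.com","unindexed"',
);

my $count = tally(@lines);
printf "%-10s %d\n", $_, $count->{$_} for sort keys %$count;
```

To run it against a real file, one option is to replace @lines with the lines read from <> and pass urls.csv on the command line.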

81.4 Hacking the Hack

You can use this hack interactively, feeding it URLs one at a time. Invoke the script with perl suspect.pl, but don't feed it a text file of URLs to check. Enter a URL and hit the return key on your keyboard. The script will reply in the same manner as it did when fed multiple URLs. This is handy when you just need to spot-check a couple of URLs on the command line. When you're ready to quit, break out of the script using Ctrl-D under Unix or Ctrl-Break on a Windows command line.

Here's a transcript of an interactive session with suspect.pl:

% perl suspect.pl
"url","safe/suspect/unindexed"
http://www.oreilly.com/catalog/essblogging/
"oreilly.com/catalog/essblogging/","safe"
http://www.xxxxxxxxxx.com/preview/home.htm
"xxxxxxxxxx.com/preview/home.htm","suspect"
hipporhinostricow.com
"hipporhinostricow.com","unindexed"
^d
%