Google Hacks

Hack 74 Restricting Searches to Top-Level Results

Separate out search results by the depth at which they appear in a site.

Google's a mighty big haystack under which to find the needle you seek. And there's more, so much more; some experts believe that Google and its ilk index only a bare fraction of the pages available on the Web.

Because the Web's getting bigger all the time, researchers have to come up with lots of different tricks to narrow down search results. Tricks and—thanks to the Google API—tools. This hack separates out search results appearing at the top level of a domain from those beneath.

Why would you want to do this?

  • Clear away clutter when searching for proper names. If you're searching for general information about a proper name, this is one way to clear out mentions in news stories, etc. For example, the name of a political leader like Tony Blair might be mentioned in a story without any substantive information about the man himself. But if you limited your results to only those pages on the top level of a domain, you would avoid most of those "mention hits."

  • Find patterns in the association of highly ranked domains and certain keywords.

  • Narrow search results to only those bits that sites deem important enough to have in their virtual foyers.

  • Skip past subsites, the likes of home pages created by J. Random User on their service provider's web server.

74.1 The Code

#!/usr/local/bin/perl
# gootop.cgi
# Separates out top level and sub-level results
# gootop.cgi is called as a CGI with form input

# Your Google API developer's key
my $google_key='insert key here';

# Location of the GoogleSearch WSDL file
my $google_wdsl = "./GoogleSearch.wsdl";

# Number of times to loop, retrieving 10 results at a time
my $loops = 10;

use strict;

use SOAP::Lite;
use CGI qw/:standard *table/;

print
  header(  ),
  start_html("GooTop"),
  h1("GooTop"),
  start_form(-method=>'GET'),
  'Query: ', textfield(-name=>'query'),
  '   ',
  submit(-name=>'submit', -value=>'Search'),
  end_form(  ), p(  );

my $google_search  = SOAP::Lite->service("file:$google_wdsl");

if (param('query')) {
  my $list = { 'toplevel' => [], 'sublevel' => [] };

  for (my $offset = 0; $offset < $loops*10; $offset += 10) {
    my $results = $google_search ->
      doGoogleSearch(
        $google_key, param('query'), $offset,
        10, "false", "",  "false", "", "latin1", "latin1"
      );

    foreach (@{$results->{'resultElements'}}) {
      push @{
        $list->{ $_->{URL} =~ m!://[^/]+/?$!
        ? 'toplevel' : 'sublevel' }
      },
      p(
        b($_->{title}||'no title'), br(  ),
        a({href=>$_->{URL}}, $_->{URL}), br(  ),
        i($_->{snippet}||'no snippet')
      );
    }
  }

  print
    h2('Top-Level Results'),
    join("\n", @{$list->{toplevel}}),
    h2('Sub-Level Results'),
    join("\n", @{$list->{sublevel}});
}

print end_html;

Gleaning a decent number of top-level results means throwing out quite a bit. It's for this reason that this script runs the specified query a number of times, as specified by my $loops = 10;, each loop picking up 10 results, some subset being top-level. To alter the number of loops per query, simply change the value of $loops. Realize that each invocation of the script burns through $loops number of queries, so be sparing and don't bump that number up to anything ridiculous; even 100 will eat through a daily allotment in just 10 invocations.
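
The query budget described above can be sketched with a little arithmetic. This is only an illustration: the helper names runs_per_day and offsets_for are hypothetical, and the 1,000-query daily quota is an assumption inferred from the "100 loops, 10 invocations" figure in the text.

```perl
use strict;
use warnings;

# Hypothetical helper: how many full runs of the script fit in a daily quota.
sub runs_per_day {
    my ($loops, $daily_quota) = @_;
    return int($daily_quota / $loops);
}

# Offsets requested by one run (0, 10, 20, ...), one API query per offset.
sub offsets_for {
    my ($loops) = @_;
    return map { $_ * 10 } 0 .. $loops - 1;
}

# Assumption: a quota of 1,000 queries per day, inferred from the text.
my $daily_quota = 1000;

for my $loops (10, 100) {
    my @offsets = offsets_for($loops);
    printf "%3d loops: %3d queries per run, last offset %d, %d runs per day\n",
        $loops, scalar @offsets, $offsets[-1],
        runs_per_day($loops, $daily_quota);
}
```

Raising $loops buys more candidate results per search, but the number of searches you can afford in a day shrinks in direct proportion.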

The heart of the script, and what differentiates it from your average Google API Perl script [Hack #50], lies in this snippet of code:

push @{
  $list->{ $_->{URL} =~ m!://[^/]+/?$!
  ? 'toplevel' : 'sublevel' }
}

What that jumble of characters is scanning for is :// (as in http://) followed by a run of anything other than a / (slash) that lasts to the end of the URL, thereby sifting top-level finds (e.g., http://www.berkeley.edu) from sublevel results (e.g., http://www.berkeley.edu/welcome.html or http://www.berkeley.edu/students/john_doe/my_dog.html). If you're Perl savvy, you may have noticed the trailing /?$; this allows for the eventuality that a top-level URL ends with a slash (e.g., http://www.berkeley.edu/), as is often true.
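
A quick way to see how the pattern treats a given URL is to run it over a few samples outside the CGI. The sketch below wraps the same regular expression in a small sub; the name classify and the list of sample URLs are illustrative and not part of gootop.cgi.

```perl
use strict;
use warnings;

# Same pattern as in gootop.cgi: "://", a host containing no slashes,
# an optional trailing slash, and then the end of the URL.
sub classify {
    my ($url) = @_;
    return $url =~ m!://[^/]+/?$! ? 'toplevel' : 'sublevel';
}

for my $url (
    'http://www.berkeley.edu',
    'http://www.berkeley.edu/',
    'http://www.berkeley.edu/welcome.html',
    'http://www.berkeley.edu/students/john_doe/my_dog.html',
) {
    printf "%-8s %s\n", classify($url), $url;
}
```

Because the pattern is anchored at the end of the string, any path after the host, even a single filename, sends the URL to the sublevel list.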

74.2 Running the Hack

This hack runs as a CGI script. Figure 6-16 shows the results of a search for non-gmo (Genetically Modified Organisms, that is).

Figure 6-16. GooTop search for non-gmo

74.3 Hacking the Hack

There are a couple of ways to hack this hack.

74.3.1 More depth

Perhaps your interests lie in just how deep results are within a site or sites. A minor adjustment or two to the code, and you have results grouped by depth:

...
    # Note: @list is an array here; declare "my @list;" in place of the $list hash
    foreach (@{$results->{'resultElements'}}) {
      push @{ $list[scalar ( split(/\//, $_->{URL} . ' ') - 3 ) ] },
        p(
          b($_->{title}||'no title'), br(  ),
          a({href=>$_->{URL}}, $_->{URL}), br(  ),
          i($_->{snippet}||'no snippet')
        );
    }
  }

  for my $depth (1..$#list) {
    print h2("Depth: $depth");
    ref $list[$depth] eq 'ARRAY' and print join "\n",@{$list[$depth]};
  }  
}

print end_html;
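
To see what the depth expression yields, here is a minimal standalone sketch of the same split-and-subtract idea. The sub name url_depth and the sample URLs are hypothetical, and the array is declared with my @list; on the assumption that it replaces the $list hash from the original script.

```perl
use strict;
use warnings;

# Depth as computed in the hack: the number of "/"-separated fields,
# minus the three fields that make up "http://host".
sub url_depth {
    my ($url) = @_;
    # The appended space keeps split from dropping a trailing empty
    # field, so "http://host/" counts one level deeper than "http://host".
    my @fields = split m{/}, $url . ' ';
    return @fields - 3;
}

# Assumption: a plain array of array references, indexed by depth.
my @list;

for my $url (
    'http://www.berkeley.edu',
    'http://www.berkeley.edu/',
    'http://www.berkeley.edu/welcome.html',
    'http://www.berkeley.edu/students/john_doe/my_dog.html',
) {
    push @{ $list[ url_depth($url) ] }, $url;
}

for my $depth (0 .. $#list) {
    # Skip depths at which no URL was found.
    next unless ref($list[$depth]) eq 'ARRAY';
    print "Depth $depth:\n";
    print "  $_\n" for @{ $list[$depth] };
}
```

Note that a URL with no trailing slash after the host comes out at depth 0, so a loop that starts at 1, like the one in the listing above, will not print it.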

Figure 6-17 shows that non-gmo search again using the depth hack.

Figure 6-17. non-gmo search using depth hack

74.3.2 Query tips

Along with the aforementioned code hacking, here are a few query tips to use with this hack:

  • Consider feeding the script a date range [Hack #11] query to further narrow results.

  • Keep your searches specific, but not so specific that they turn up no top-level results. Instead of cats, for example, use "burmese cats", but don't go as far as "burmese breeders" feeding.

  • Try the link: [Section 1.5] syntax. This is a nice use of a syntax otherwise not allowed in combination [Hack #8] with any others.

  • On occasion, intitle: works nicely with this hack. Try your query without special syntaxes first, though, and work your way up, making sure you're getting results after each change.
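
As a rough illustration of the date-range tip, the sketch below builds a query string with a daterange: restriction. It assumes that daterange: expects Julian day numbers rather than calendar dates; the helper names julian_day and with_daterange are hypothetical, and the resulting string would simply be typed into the form's Query field.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Julian day number for a calendar date (UTC).
# 2440588 is the Julian day number of 1970-01-01, the Unix epoch.
sub julian_day {
    my ($year, $month, $day) = @_;
    my $epoch = timegm(0, 0, 0, $day, $month - 1, $year);
    return int($epoch / 86400) + 2440588;
}

# Hypothetical helper: append a daterange: restriction to a query.
# $from and $to are array references of the form [year, month, day].
sub with_daterange {
    my ($query, $from, $to) = @_;
    return sprintf '%s daterange:%d-%d',
        $query, julian_day(@$from), julian_day(@$to);
}

print with_daterange('"burmese cats"', [2003, 1, 1], [2003, 1, 31]), "\n";
```

Widen the range if a narrow window leaves you with no top-level results at all.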
