Google Hacks

Hack 43 Building Google Directory URLs

This hack uses ODP category information to build URLs for the Google Directory.

The Google Directory (http://directory.google.com/) overlays the Open Directory Project (or "ODP" or "DMOZ," http://www.dmoz.org/) ontology onto the Google core index. The result is a Yahoo!-like directory hierarchy of search results and their associated categories with the added magic of Google's popularity algorithms.

The ODP opens its entire database of listings to anybody—provided you're willing to download a 205 MB file (and that's compressed!). While you're probably not interested in all the individual listings, you might want particular ODP categories. Or you may be interested in watching new listings flowing into certain categories.

Unfortunately, the ODP does not offer a way to search by keyword for sites added within a recent time period. (Yahoo! does offer this.) So instead of searching for recently added sites, the best way to get new site information from the ODP is to monitor categories.

Because the Google Directory is built on ODP information, you can use the ODP category hierarchy to generate Google Directory URLs. This hack searches the ODP category hierarchy for the keywords you specify, then builds Google Directory URLs and checks them to make sure they're active.

You'll need to download the category hierarchy information from the ODP to get this hack to work. The compressed file containing this information is available from http://dmoz.org/rdf.html. The specific file you're after is http://dmoz.org/rdf/structure.rdf.u8.gz. Before using it, you must uncompress it using a decompression application specific to your operating system. In the Unix environment, this looks something like:

% gunzip structure.rdf.u8.gz

Bear in mind that the full category hierarchy is over 35 MB. If you just want to experiment with the structure, you can get an excerpt at http://dmoz.org/rdf/structure.example.txt. This version is a plain text file and does not require uncompressing.
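If you would rather not keep an uncompressed copy on disk, the category file can also be read while still gzipped. The following is a minimal sketch, assuming the IO::Uncompress::Gunzip module that ships with recent versions of Perl; the open_structure helper is a hypothetical name introduced here for illustration and is not part of the hack itself.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Return a read handle for the category file. Names ending in .gz are
# decompressed on the fly; anything else is opened as plain text.
# (open_structure is a hypothetical helper, not part of the original hack.)
sub open_structure {
  my ($path) = @_;
  if ($path =~ /\.gz\z/) {
    my $fh = IO::Uncompress::Gunzip->new($path)
      or die "Cannot gunzip $path: $GunzipError\n";
    return $fh;
  }
  open my $fh, '<', $path or die "Cannot open $path: $!\n";
  return $fh;
}

# When a file name is given on the command line, show its first ten lines
if (@ARGV) {
  my $fh    = open_structure($ARGV[0]);
  my $shown = 0;
  while (defined(my $line = <$fh>)) {
    print $line;
    last if ++$shown >= 10;
  }
}
```

Saved under a name of your choosing, this would be run with either structure.rdf.u8.gz or structure.example.txt as its only argument.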

43.1 The Code

#!/usr/bin/perl
# google_dir.pl
# Uses ODP category information to build URLs into the Google Directory.
# Usage: perl google_dir.pl "keywords" < structure.rdf.u8

use strict;
use warnings;

use LWP::Simple;

# Turn off output buffering
$|++;

my $directory_url = "http://directory.google.com";

@ARGV
  or die qq{usage: perl google_dir.pl "{query}" < structure.rdf.u8\n};

# Grab the command-line keywords and build an alternation, escaping
# any regular expression metacharacters in the individual words
my $keywords = join '|', map { quotemeta($_) } split ' ', shift @ARGV;

# A place to store topics already seen
my %topics;

# Loop through the DMOZ category file; for each new matching category,
# build its Google Directory URL and print it if the page responds
while (<>) {
  /"(Top\/[^"]*(?:$keywords)[^"]*)"/i and !$topics{$1}++
    and head("$directory_url/$1")
    and print "$directory_url/$1\n";
}
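To see the matching step on its own, without the category file or a network connection, here is a minimal sketch. It assumes that a multi-word query should match a category containing any one of the words, so it escapes each word and groups the alternatives. The build_matcher and matching_topics names are hypothetical helpers, and the sample lines are made up to resemble the Topic and narrow entries in the ODP file rather than copied from it.

```perl
use strict;
use warnings;

# Build a case-insensitive pattern that captures a quoted "Top/..." category
# path containing any one of the whitespace-separated keywords.
sub build_matcher {
  my ($query) = @_;
  my $alternation = join '|', map { quotemeta($_) } split ' ', $query;
  return qr/"(Top\/[^"]*(?:$alternation)[^"]*)"/i;
}

# Return the distinct category paths found in a list of lines, in the
# order they first appear.
sub matching_topics {
  my ($query, @lines) = @_;
  my $matcher = build_matcher($query);
  my (%seen, @topics);
  for my $line (@lines) {
    next unless $line =~ $matcher;
    push @topics, $1 unless $seen{$1}++;
  }
  return @topics;
}

# Hypothetical sample lines, not copied from the real file
my @sample = (
  '<Topic r:id="Top/Arts/Crafts/Mosaics">',
  '  <narrow r:resource="Top/Arts/Crafts/Mosaics/Glass"/>',
  '<Topic r:id="Top/Arts/Crafts/Mosaics/Glass">',
  '<Topic r:id="Top/Arts/Crafts/Quilting">',
);

print "http://directory.google.com/$_\n"
  for matching_topics('mosaic glass', @sample);
```

Grouping the alternatives with (?:...) keeps the leading "Top/ anchor attached to every keyword, and [^"]* stops a match from running past the closing quote of the category path.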

43.2 Running the Hack

Run the script from the command line, passing it a query and redirecting the contents of the DMOZ category file to its standard input:

% perl google_dir.pl "keywords" < structure.rdf.u8

If you're using the shorter category excerpt, structure.example.txt, use:

% perl google_dir.pl "keywords" < structure.example.txt

43.3 The Results

Feeding this hack the keyword mosaic produces output something like:

% perl google_dir.pl "mosaic" < structure.rdf.u8
http://directory.google.com/Top/Arts/Crafts/Mosaics
http://directory.google.com/Top/Arts/Crafts/Mosaics/Glass
http://directory.google.com/Top/Arts/Crafts/Mosaics/Ceramic_and_Broken_China
http://directory.google.com/Top/Arts/Crafts/Mosaics/Associations_and_Directories
http://directory.google.com/Top/Arts/Crafts/Mosaics/Stone
http://directory.google.com/Top/Shopping/Crafts/Mosaics
http://directory.google.com/Top/Shopping/Crafts/Supplies/Mosaics
...

43.4 Hacking the Hack

There isn't much hacking you can do to this hack; it's designed to take ODP data, create Google URLs, and verify those URLs. How well you can get this to work for you really depends on the types of search words you choose.

Do choose words that are more general. If you're interested in a particular state in the U.S., for example, choose the name of the state and major cities, but don't choose the name of a very small town or of the governor. Do choose the name of a company and not of its CFO. A good rule of thumb is to choose the keywords that you might find as entry names in an encyclopedia or almanac. You can easily imagine finding a company name as an encyclopedia entry, but it's a rare CFO who would rate an entry to themselves in an encyclopedia.
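One way to judge whether a keyword is general enough is to count how many categories it turns up before committing to it. The following is a rough sketch under that assumption; count_categories is a hypothetical helper, and the script expects the category file name as its first argument with candidate keywords after it.

```perl
use strict;
use warnings;

# Count the distinct "Top/..." category paths that mention a keyword.
# (count_categories is a hypothetical helper, not part of the original hack.)
sub count_categories {
  my ($keyword, @lines) = @_;
  my %seen;
  for my $line (@lines) {
    $seen{$1} = 1 if $line =~ /"(Top\/[^"]*\Q$keyword\E[^"]*)"/i;
  }
  return scalar keys %seen;
}

# With a file name as the first argument and keywords after it, report
# how many categories each keyword would turn up.
if (@ARGV > 1) {
  my ($file, @keywords) = @ARGV;
  open my $fh, '<', $file or die "Cannot open $file: $!\n";
  my @lines = <$fh>;
  close $fh;
  printf "%-20s %d\n", $_, count_categories($_, @lines) for @keywords;
}
```

A keyword that reports zero categories is probably too specific, while one that reports thousands may need a more specific companion.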
