Google Hacks

Hack 46 Scraping Google Groups

Pulling results from Google Groups searches into a comma-delimited file.

It's easy to look at the Internet and say it's web pages, or it's computers, or it's networks. But look a little deeper and you'll see that the core of the Internet is discussions: mailing lists, online forums, and even web sites, where people hold forth in glorious HTML, waiting for people to drop by, consider their philosophies, make contact, or buy their products and services.

Nowhere is the Internet-as-conversation idea more prevalent than in Usenet newsgroups. Google Groups has an archive of over 700 million messages from years of Usenet traffic. If you're doing timely research, searching and saving Google Groups message pointers comes in really handy.

Because Google Groups is not searchable by the current version of the Google API, you can't build an automated Google Groups query tool without violating Google's TOS. However, you can scrape the HTML of a page you visit personally and save to your hard drive.

46.1 Saving Pages

The first thing you need to do is run a Google Groups search. See the Google Groups [Hack #30] discussion for some hints on best practices for searching this message archive.

It's best to put pages you're going to scrape in order of date; that way if you're going to scrape more pages later, it's easy to look at them and check the last date the search results changed. Let's say you're trying to keep up with uses of Perl in programming the Google API; your query might look like this:

perl group:google.public.web-apis

On the righthand side of the results page is an option to sort either by relevance or date. Sort it by date. Your results page should look something like Figure 4-1.

Figure 4-1. Results of a Google Groups search

Save this page to your hard drive, naming it something memorable like groups.html.

46.2 Scraping Caveats

There are a couple of things to keep in mind when it comes to scraping pages, Google or not:

  • Scraping is brittle. A scraper is based on the way a page is formatted at the time the scraper is written. This means one minor change in the page, and things break down rather quickly.

  • There are myriad ways of scraping any particular page. This is just one of them, so experiment!
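One way to soften the first caveat is to have a scraper complain when its pattern stops matching, rather than quietly writing an empty file. What follows is only a minimal sketch of that idea, not part of the hack itself: the extract_links and extract_or_warn helpers, the deliberately simple link pattern, and the sample markup are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: pull (url, text) pairs out of a chunk of HTML with a
# deliberately simple pattern. A real scraper's pattern is tied to the page
# layout at the time it was written.
sub extract_links {
    my ($html) = @_;
    my @links;
    while ( $html =~ m!<a href="?([^">\s]+)"?[^>]*>(.+?)</a>!gis ) {
        push @links, { url => $1, text => $2 };
    }
    return @links;
}

# Hypothetical wrapper: treat "no matches at all" as a hint that the page
# format has probably changed, and say so on STDERR.
sub extract_or_warn {
    my ($html) = @_;
    my @links = extract_links($html);
    warn "No results matched; the page layout may have changed.\n"
        unless @links;
    return @links;
}

# Made-up sample markup, for illustration only.
my $sample_page = '<p><a href="/groups?selm=example">Sample thread</a></p>';
my @found       = extract_or_warn($sample_page);
printf "%d link(s) found\n", scalar @found;
```

A zero-match warning will not catch every layout change, but it does turn the most common failure (a pattern that no longer fits anything) into something you notice right away.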

46.3 The Code

# groups2csv.pl
# Google Groups results exported to CSV suitable for import into Excel
# Usage: perl groups2csv.pl < groups.html > groups.csv

# The CSV Header
print qq{"title","url","group","date","author","number of articles"\n};

# The base URL for Google Groups
my $url = "http://groups.google.com";

# Rake in those results
my($results) = (join '', <>);

# Perform a regular expression match to glean individual results
while ( $results =~ m!<p><a href="?(.+?)"?>(.+?)</a><font size=-1(.+?)<br>
<font color=green><a href=.+?>(.+?)</a>\s+-\s+(.+?)\s+by\s+(.+?)\s+-.+?\((\d+) articles!mgis ) {
    my($path, $title, $snippet, $group, $date, $author, $articles) =
        ($1||'',$2||'',$3||'',$4||'',$5||'',$6||'',$7||'');
    $title =~ s!<.+?>!!g; # drop all HTML tags
    s!"!""!g for $title, $group, $date, $author; # double any " marks so each CSV field stays intact
    print qq{"$title","$url$path","$group","$date","$author","$articles"\n};
}
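To see what the pattern captures, it can help to run it against a single result block. The sketch below wraps the same match in a hypothetical scrape_groups function and feeds it a made-up fragment written to fit the pattern; real saved pages carry far more markup around each result. It also assumes the line break inside the printed pattern is meant literally, so it is written here as \n.

```perl
use strict;
use warnings;

my $base = "http://groups.google.com";

# Hypothetical wrapper around the listing's pattern. Assumption: the line
# break inside the printed pattern is literal, so it appears here as \n.
sub scrape_groups {
    my ($html) = @_;
    my @rows;
    while ( $html =~ m!<p><a href="?(.+?)"?>(.+?)</a><font size=-1(.+?)<br>\n<font color=green><a href=.+?>(.+?)</a>\s+-\s+(.+?)\s+by\s+(.+?)\s+-.+?\((\d+) articles!gis ) {
        # $3 is the snippet, which the CSV does not use
        my ($path, $title, $group, $date, $author, $articles)
            = ($1, $2, $4, $5, $6, $7);
        $title =~ s!"!""!g;      # double any " marks for CSV
        $title =~ s!<.+?>!!g;    # drop HTML tags such as <b>...</b>
        push @rows,
            qq{"$title","$base$path","$group","$date","$author","$articles"};
    }
    return @rows;
}

# Made-up result block, shaped to fit the pattern; not real Google markup.
my $sample_result = join "\n",
    '<p><a href="/groups?selm=example">Re: <b>Perl</b> "quoting"</a><font size=-1> snippet<br>',
    '<font color=green><a href=/groups?group=example.group>example.group</a> - Aug. 23, 2002 by A. Author - thread (2 articles)';

print "$_\n" for scrape_groups($sample_result);
```

With this sample, the title loses its bold tags and has its quotation marks doubled, while the path is prefixed with the base URL to form a complete link.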

46.4 Running the Hack

Run the script from the command line, redirecting the Google Groups results page you saved earlier to its input and redirecting its output to the CSV file you wish to create or append to. For example, using groups.html as your input and groups.csv as your output:

$ perl groups2csv.pl < groups.html > groups.csv

Leaving off the > and CSV filename sends the results to the screen for your perusal.

Using >> in place of > before the CSV filename appends the current set of results to the CSV file, creating it if it doesn't already exist. This is useful for combining more than one set of results, represented by more than one saved results page:

$ perl groups2csv.pl < results_1.html > results.csv
$ perl groups2csv.pl < results_2.html >> results.csv
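Because the script prints its header line on every run, a file built up by appending will contain that header once per run. The sketch below shows one possible cleanup, keeping the first line and dropping any later line identical to it. The drop_repeated_headers helper and the sample rows are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: given the lines of a combined CSV file, keep the
# first line (the header) and drop any later line identical to it.
sub drop_repeated_headers {
    my @lines = @_;
    return () unless @lines;
    my $header = $lines[0];
    return ( $header, grep { $_ ne $header } @lines[ 1 .. $#lines ] );
}

# Made-up rows standing in for two appended runs.
my @combined = (
    qq{"title","url","group","date","author","number of articles"\n},
    qq{"First thread","http://groups.google.com/groups?selm=a","example.group","Aug. 23, 2002","A. Author","2"\n},
    qq{"title","url","group","date","author","number of articles"\n},
    qq{"Second thread","http://groups.google.com/groups?selm=b","example.group","Jun. 29, 2002","B. Author","3"\n},
);

print drop_repeated_headers(@combined);
```

The same function could be handed the lines of results.csv read from a filehandle; only exact repeats of the first line are removed, so data rows are left alone.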

46.5 The Results

Scraping the results of a search for perl group:google.public.web-apis (anything mentioning the Perl programming language on the Google APIs discussion forum) produces output like the following:

$ perl groups2csv.pl < groups.html
"title","url","group","date","author","number of articles"
"Re: Perl Problem?","http://groups.google.com/groups?q=perl+group:google.public.web-apis&hl=en&lr=&ie=UTF-8&output=search&selm=5533bb12.0208230215.365a093d%40posting.google.com&rnum=1","google.public.web-apis","Aug. 23, 2002","S Anand","2"
"Proxy usage from Perl script","http://groups.google.com/groups?q=perl+group:google.public.web-apis&hl=en&lr=&ie=UTF-8&output=search&selm=575db61f.0206290446.1dfe4ea7%40posting.google.com&rnum=2","google.public.web-apis","Jun. 29, 2002","Varun","3"
...
"The Google Velocity","http://groups.google.com/groups?q=perl+group:google.public.web-apis&hl=en&lr=&ie=UTF-8&output=search&selm=18a1ac72.0204221336.47fdee71%40posting.google.com&rnum=29","google.public.web-apis","Apr. 22, 2002","John Graham-Cumming","2"
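If you would rather read the CSV back into Perl than into Excel, a short routine can split each line into fields. The sketch below is one possible approach and handles only the fully quoted fields this script writes; the parse_csv_line helper is hypothetical, and a general-purpose CSV module would cope with far more variations.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line of fully quoted CSV into its fields,
# turning doubled quotes ("") back into single ones.
sub parse_csv_line {
    my ($line) = @_;
    my @fields;
    while ( $line =~ /"((?:[^"]|"")*)"(?:,|$)/g ) {
        my $field = $1;
        $field =~ s/""/"/g;    # undo the quote doubling
        push @fields, $field;
    }
    return @fields;
}

# The first data row from the results above.
my $row = q{"Re: Perl Problem?","http://groups.google.com/groups?q=perl+group:google.public.web-apis&hl=en&lr=&ie=UTF-8&output=search&selm=5533bb12.0208230215.365a093d%40posting.google.com&rnum=1","google.public.web-apis","Aug. 23, 2002","S Anand","2"};

my ($title, $link, $group, $date, $author, $articles) = parse_csv_line($row);
print "$title ($articles articles) by $author on $date\n";
```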