Google Hacks

Hack 54 NoXML, Another SOAP::Lite Alternative

NoXML is a regular expressions-based, XML Parser-free drop-in alternative to SOAP::Lite.

XML jockeys might well want to avert their eyes for this one. What is suggested here is so preposterous that it just might prove useful, and indeed it does. NoXML is a drop-in alternative to SOAP::Lite. As its name suggests, this home-brewed module doesn't make use of an XML parser of any kind, relying instead on some dead-simple regular expressions and other bits of programmatic magic.

If you have only a basic Perl installation at your disposal and are lacking both the SOAP::Lite [Hack #52] and XML::Parser Perl modules, NoXML will do in a pinch, playing nicely with just about every Perl hack in this book.

As any XML guru will attest, there's simply no substitute for an honest-to-goodness XML parser. And they'd be right. There are encoding and hierarchy issues that a regular expression-based parser simply can't fathom. NoXML is simplistic at best. That said, it does what needs doing, the very essence of "hacking."

Best of all, NoXML can fill in for SOAP::Lite with little more than a two-line alteration to the target hack.

54.1 The Code

The heart of this hack is NoXML.pm, which should be saved into the same directory as your hacks themselves.

# NoXML.pm
# NoXML [pronounced "no xml"] is a dire-need drop-in 
# replacement for SOAP::Lite designed for Google Web API hacking.

package NoXML;

use strict;
no strict "refs";

# LWP for making HTTP requests; the SOAP response is picked apart
# with regular expressions, so no XML module is loaded
use LWP::UserAgent;
use HTTP::Request;

# Create a new NoXML
sub new {
  my $self = {};
  bless($self);
  return $self;
}

# Replacement for the SOAP::Lite-based doGoogleSearch method
sub doGoogleSearch {
  my($self, %args);
  ($self, @args{qw/ key q start maxResults filter restrict 
  safeSearch lr ie oe /}) = @_;

  # grab SOAP request from __DATA__
  my $tell = tell(DATA);
  my $soap_request = join '', <DATA>;
  seek(DATA, $tell, 0);
  $soap_request =~ s/\$(\w+)/$args{$1}/ge; #interpolate variables

  # Make (POST) a SOAP-based request to Google
  my $ua = LWP::UserAgent->new;
  my $req = HTTP::Request->new(POST => 'http://api.google.com/search/beta2');
  $req->content_type('text/xml');
  $req->content($soap_request);
  my $res = $ua->request($req);
  my $soap_response = $res->as_string;

  # Drop the HTTP headers and so forth until the initial xml element
  $soap_response =~ s/^.+?(<\?xml)/$1/migs;

  # Drop element namespaces for tolerance of future prefix changes
  $soap_response =~ s!(<\/?)[\w-]+?:([\w-]+?)!$1$2!g;

  # Set up a return dataset
  my $return;

  # Unescape escaped HTML in the resultset
  my %unescape = ('&lt;'=>'<', '&gt;'=>'>', '&amp;'=>'&', '&quot;'=>'"', '&apos;'=>"'"); 
  my $unescape_re = join '|' => keys %unescape;

  # Divide the SOAP response into the results and other metadata
  my($before, $results, $after) = $soap_response =~ 
    m#(^.+)<resultElements.+?>(.+?)</resultElements>(.+$)#migs ;
  my $before_and_after = $before . $after;

  # Glean as much metadata as possible (while being somewhat lazy ;-)
  # Match <name attributes>text</name>: $1 is the element name, $3 its text
  while ($before_and_after =~ m#<(\w+)(\s[^>]*)?>([^<]*?)</\1>#migs) {
    $return->{$1} = $3; # pack the metadata into the return dataset
  }

  # Glean the results
  my @results;
  while ($results =~ m#<item\b[^>]*>(.+?)</item>#migs) {
    my $item = $1;
    my $pairs = {};
    while ( $item =~ m#<(\w+)(?:\s[^>]*)?>([^<]*)</\1>#migs ) {
      my($element, $value) = ($1, $2);
      $value =~ s/($unescape_re)/$unescape{$1}/g;
      $pairs->{$element} = $value;
    }
    push @results, $pairs;
  }

  # Pack the results into the return dataset
  $return->{resultElements} = \@results;

  # Return nice, clean, usable results
  return $return;
}

1;

# This is the SOAP message template sent to api.google.com. Variables
# signified with $variablename are replaced by the values of their 
# counterparts sent to the doGoogleSearch subroutine.

__DATA__
<?xml version='1.0' encoding='UTF-8'?>
<SOAP-ENV:Envelope 
 xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" 
 xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance" 
 xmlns:xsd="http://www.w3.org/1999/XMLSchema">
  <SOAP-ENV:Body>
   <ns1:doGoogleSearch xmlns:ns1="urn:GoogleSearch" 
    SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
     <key xsi:type="xsd:string">$key</key>
     <q xsi:type="xsd:string">$q</q>
     <start xsi:type="xsd:int">$start</start>
     <maxResults xsi:type="xsd:int">$maxResults</maxResults>
     <filter xsi:type="xsd:boolean">$filter</filter>
     <restrict xsi:type="xsd:string">$restrict</restrict>
     <safeSearch xsi:type="xsd:boolean">$safeSearch</safeSearch>
     <lr xsi:type="xsd:string">$lr</lr>
     <ie xsi:type="xsd:string">$ie</ie>
     <oe xsi:type="xsd:string">$oe</oe>
   </ns1:doGoogleSearch>
 </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
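
The placeholder substitution that doGoogleSearch performs on this template can be tried on its own. The following is a minimal sketch, not part of NoXML.pm: fill_template and xml_escape are hypothetical helper names, and the template is shortened to two elements. As an added assumption, it escapes &, <, and > in each value, something the module as listed does not do, so a query containing those characters would otherwise produce malformed XML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in XML text.
sub xml_escape {
  my $text = shift;
  $text =~ s/&/&amp;/g;   # ampersand first, so later entities survive
  $text =~ s/</&lt;/g;
  $text =~ s/>/&gt;/g;
  return $text;
}

# Hypothetical helper: replace each $name in the template with the
# escaped value of $args{name}; unknown names become empty strings.
sub fill_template {
  my ($template, %args) = @_;
  $template =~ s/\$(\w+)/xml_escape(defined $args{$1} ? $args{$1} : '')/ge;
  return $template;
}

# Single quotes keep $q and $start literal until fill_template runs
my $template = '<q xsi:type="xsd:string">$q</q>'
             . '<start xsi:type="xsd:int">$start</start>';

print fill_template($template, q => 'cats & dogs', start => 0), "\n";
# prints: <q xsi:type="xsd:string">cats &amp; dogs</q><start xsi:type="xsd:int">0</start>
```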

Here's a little script to show NoXML in action. It's no different, really, from any number of hacks in this book. The only alterations necessary to make use of NoXML instead of SOAP::Lite are the two commented-out SOAP::Lite lines and the NoXML lines that replace them.

#!/usr/bin/perl
# noxml_google2csv.pl
# Google Web Search Results via NoXML ("no xml") module
# exported to CSV suitable for import into Excel
# Usage: noxml_google2csv.pl "{query}" [> results.csv]

# Your Google API developer's key
my $google_key='insert key here';

use strict;

# use SOAP::Lite;
use NoXML;

$ARGV[0]
  or die qq{usage: perl noxml_google2csv.pl "{query}"\n};

# my $google_search = SOAP::Lite->service("file:$google_wdsl");
my $google_search = new NoXML;

my $results = $google_search -> 
  doGoogleSearch(
    $google_key, shift @ARGV, 0, 10, "false", 
    "", "false", "", "latin1", "latin1"
  );
@{$results->{'resultElements'}} or die('No results');

print qq{"title","url","snippet"\n};

foreach (@{$results->{'resultElements'}}) {
  $_->{title} =~ s!"!""!g; # double escape " marks
  $_->{snippet} =~ s!"!""!g;
  my $output = qq{"$_->{title}","$_->{URL}","$_->{snippet}"\n};
  $output =~ s!<.+?>!!g; # drop all HTML tags
  print $output;
} 
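
The quoting done inline in the loop above can also be factored into a small helper. This is a minimal sketch rather than part of the original script; csv_line is a hypothetical name, and the sample values are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a list of values into one CSV line,
# stripping HTML tags and doubling embedded double quotes.
sub csv_line {
  my @fields = @_;              # work on a copy of the arguments
  foreach my $field (@fields) {
    $field = '' unless defined $field;
    $field =~ s/<.+?>//g;       # drop HTML tags such as <b>...</b>
    $field =~ s/"/""/g;         # CSV escapes a quote by doubling it
  }
  return join(',', map { qq{"$_"} } @fields) . "\n";
}

print csv_line('A <b>bold</b> "title"', 'http://www.example.com/', 'snippet text');
# prints: "A bold ""title""","http://www.example.com/","snippet text"
```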

54.2 Running the Hack

Run the script from the command line, passing your query as an argument and redirecting the output to the CSV file you wish to create (use >> rather than > to append to an existing file). For example, using "no xml" as the query and results.csv as the output file:

$ perl noxml_google2csv.pl "no xml" > results.csv

Leaving off the > and CSV filename sends the results to the screen for your perusal.

54.3 The Results

$ perl noxml_google2csv.pl "no xml"
"title","url","snippet"
"site-comments@w3.org from January 2002: No XML specifications",
"http://lists.w3.org/Archives/Public/site-comments/2002Jan/0015.html",
"No XML specifications. From: Prof. ... Next message: Ian B. Jacobs: 
&quot;Re: No XML specifications&quot;; Previous message: Rob Cummings: 
&quot;Website design...&quot;; ...  "
...
"Re: [xml] XPath with no XML Doc",
"http://mail.gnome.org/archives/xml/2002-March/msg00194.html",
" ... Re: [xml] XPath with no XML Doc. From: &quot;Richard Jinks&quot; 
<cyberthymia yahoo co uk>; To: <xml gnome org>; Subject: 
Re: [xml] XPath with no XML Doc; ...  "

54.4 Applicability and Limitations

In the same manner, you can adapt just about any SOAP::Lite-based hack in this book, or any you've written yourself, to use NoXML:

  1. Place NoXML.pm in the same directory as the hack at hand.

  2. Replace use SOAP::Lite; with use NoXML;.

  3. Replace my $google_search = SOAP::Lite->service("file:$google_wdsl"); with my $google_search = new NoXML;.
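
If a hack should run whether or not SOAP::Lite is installed, the choice between the two modules can be made at run time. What follows is a minimal sketch under that assumption, not something the steps above require; first_available is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the name of the first module in the
# list that can be loaded, or undef if none of them can.
sub first_available {
  foreach my $module (@_) {
    # Convert Some::Module into the Some/Module.pm path require expects
    (my $file = "$module.pm") =~ s{::}{/}g;
    return $module if eval { require $file; 1 };
  }
  return;
}

my $backend = first_available('SOAP::Lite', 'NoXML');
print defined $backend ? "Using $backend\n" : "Neither module found\n";
```

A hack could then branch on the returned name and build its search object with either SOAP::Lite or NoXML. Note that Perl 5.26 and later no longer search the current directory for modules by default, so a line such as use lib '.'; may be needed before NoXML.pm in the hack's directory is found.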

There are, however, some limitations. While NoXML works nicely for extracting the results themselves and top-level values such as <estimatedTotalResultsCount />, it falls down on some of the more advanced result elements like <directoryCategories />, an array of categories turned up by the query.
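
To see why nested elements are a problem, consider a minimal sketch of leaf-element extraction with a regular expression. The pattern and the sample fragment are assumptions made for illustration (the fragment is hand-written, not captured API output), and extract_leaves is a hypothetical helper. A pattern of this kind records only elements whose content is plain text, so a container such as <directoryCategories> never appears as a key, and repeated children overwrite one another.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect name => text for every element whose
# content holds no further tags (a "leaf" element).
sub extract_leaves {
  my $xml = shift;
  my %pairs;
  while ($xml =~ m#<(\w+)(?:\s[^>]*)?>([^<]*)</\1>#g) {
    $pairs{$1} = $2;   # a later element of the same name wins
  }
  return \%pairs;
}

# Hand-written sample fragment, for illustration only
my $fragment = <<'XML';
<estimatedTotalResultsCount xsi:type="xsd:int">42</estimatedTotalResultsCount>
<directoryCategories>
  <item><fullViewableName>Top/Computers</fullViewableName></item>
  <item><fullViewableName>Top/Science</fullViewableName></item>
</directoryCategories>
XML

my $leaves = extract_leaves($fragment);
foreach my $name (sort keys %$leaves) {
  print "$name => $leaves->{$name}\n";
}
```

Run against the sample, this reports estimatedTotalResultsCount as 42 and a single fullViewableName of Top/Science; the first category and the grouping under directoryCategories are lost, which is the hierarchy problem a real XML parser avoids.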

In general, bear in mind that your mileage may vary and don't be afraid to tweak.

54.5 See Also

  • PoXML [Hack #53], a plain old XML alternative to SOAP::Lite

  • XooMLE [Hack #36], a third-party service offering an intermediary plain old XML interface to the Google Web API
