PHP CookBook

Recipe 11.14 Parsing a Web Server Log File

11.14.1 Problem

You want to do calculations based on the information in your web server's access log file.

11.14.2 Solution

Open the file and parse each line with a regular expression that matches the log file format. This regular expression matches the NCSA Combined Log Format:

$pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) '
         . '([0-9\-]+) "(.*)" "(.*)"$/';

11.14.3 Discussion

This program parses the NCSA Combined Log Format lines and displays a list of pages sorted by the number of requests for each page:

$log_file = '/usr/local/apache/logs/access.log';
// the pattern is split with concatenation so that no newline ends up inside it
$pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) '
         . '([0-9\-]+) "(.*)" "(.*)"$/';

$fh = fopen($log_file,'r') or die("Can't open $log_file");
$i = 1;
$requests = array();
while (! feof($fh)) {
    // read each line and trim off leading/trailing whitespace
    if ($s = trim(fgets($fh,16384))) {
        // match the line to the pattern
        if (preg_match($pattern,$s,$matches)) {
            /* put each part of the match in an appropriately-named
             * variable */
            list($whole_match,$remote_host,$logname,$user,$time,
                 $method,$request,$protocol,$status,$bytes,$referer,
                 $user_agent) = $matches;
            // keep track of the count of each request
            if (! isset($requests[$request])) {
                $requests[$request] = 0;
            }
            $requests[$request]++;
        } else {
            // complain if the line didn't match the pattern 
            error_log("Can't parse line $i: $s");
        }
    }
    $i++;
}
fclose($fh) or die("Can't close $log_file");

// sort the array (in reverse) by number of requests 
arsort($requests);

// print formatted results
foreach ($requests as $request => $accesses) {
    printf("%6d   %s\n",$accesses,$request);
}

The pattern used in preg_match( ) matches Combined Log Format lines such as:

10.1.1.162 - david [20/Jul/2001:13:05:02 -0400] "GET /sklar.css HTTP/1.0" 200 278 "-" "Mozilla/4.77 [en] (WinNT; U)"
10.1.1.248 - - [14/Mar/2002:13:31:37 -0500] "GET /php-cookbook/colors.html HTTP/1.1" 200 460 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"
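
As a quick check of how the pattern splits a line, the following minimal sketch applies it to the first sample entry (written as the single line it would be in a real log) and prints each captured field. The $field_names array is a hypothetical helper introduced only for this illustration; it is not part of the recipe.

```php
<?php
// The Combined Log Format pattern from this recipe, built with
// concatenation so that it contains no embedded newline.
$pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) '
         . '([0-9\-]+) "(.*)" "(.*)"$/';

// First sample entry, joined into one line.
$line = '10.1.1.162 - david [20/Jul/2001:13:05:02 -0400] '
      . '"GET /sklar.css HTTP/1.0" 200 278 "-" "Mozilla/4.77 [en] (WinNT; U)"';

// Hypothetical list of names, in the same order as the capture groups.
$field_names = array('remote_host','logname','user','time','method',
                     'request','protocol','status','bytes','referer',
                     'user_agent');

$fields = array();
if (preg_match($pattern, $line, $matches)) {
    // drop the whole-match element, then pair each name with its value
    array_shift($matches);
    $fields = array_combine($field_names, $matches);
    foreach ($fields as $name => $value) {
        printf("%-11s %s\n", $name, $value);
    }
}
```

Keying the captured values by name makes it easier to see which part of the line each group consumed than reading $matches by number.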

In the first line, 10.1.1.162 is the IP address that the request came from. Depending on the server configuration, this could be a hostname instead. When the $matches array is assigned to the list of separate variables, the hostname is stored in $remote_host. The next hyphen (-) means that the remote host didn't supply a username via identd,[1] so $logname is set to -.

[1] identd, defined in RFC 1413, is supposed to be a good way to identify users remotely. However, it's not very secure or reliable. A good explanation of why is at http://www.clock.org/~fair/opinion/identd.html.

The string david is a username provided by the browser using HTTP Basic Authentication and is put in $user. The date and time of the request, stored in $time, is in brackets. The captured string, brackets included, isn't in a form you can hand straight to strtotime( ), so if you want to do calculations based on request date and time, you have to do some further processing to extract each piece of the formatted time string. Next, in quotes, is the first line of the request. This is composed of the method (GET, POST, HEAD, etc.), which is stored in $method; the requested URI, which is stored in $request; and the protocol, which is stored in $protocol. For GET requests, the query string is part of the URI. For POST requests, the request body that contains the variables isn't logged.
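
One possible way to do that further processing, assuming a PHP version that provides the DateTime class, is to strip the brackets and parse the rest with DateTime::createFromFormat( ). This is a sketch rather than the recipe's own method; the variable names are illustrative.

```php
<?php
// Value of $time as captured by the pattern, brackets included.
$time = '[20/Jul/2001:13:05:02 -0400]';

// Remove the surrounding brackets, then parse day/month-name/year,
// the time of day, and the numeric timezone offset.
$dt = DateTime::createFromFormat('d/M/Y:H:i:s O', trim($time, '[]'));

if ($dt === false) {
    error_log("Can't parse time: $time");
} else {
    // Unix timestamp, usable for comparisons and arithmetic
    $timestamp = $dt->getTimestamp();
    // same instant expressed in UTC
    print gmdate('Y-m-d H:i:s', $timestamp) . "\n";
}
```

Once the time is a Unix timestamp, comparing two requests or grouping them by hour or day is ordinary integer arithmetic.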

After the request comes the request status, stored in $status. Status 200 means the request was successful. After the status is the size in bytes of the response, stored in $bytes. The last two elements of the line, each in quotes, are the referring page, if any, stored in $referer,[2] and the user agent string identifying the browser that made the request, stored in $user_agent.
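
The status and byte count are captured as strings, and the pattern also accepts a hyphen in either position. The sketch below totals the bytes sent per status code for the two sample entries, treating a hyphen in the bytes field as zero. That treatment is an assumption based on Apache logging - when a response has no body. The $bytes_by_status array is a name chosen for this example.

```php
<?php
$pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) '
         . '([0-9\-]+) "(.*)" "(.*)"$/';

// The two sample entries from this recipe, one per array element.
$lines = array(
    '10.1.1.162 - david [20/Jul/2001:13:05:02 -0400] '
  . '"GET /sklar.css HTTP/1.0" 200 278 "-" "Mozilla/4.77 [en] (WinNT; U)"',
    '10.1.1.248 - - [14/Mar/2002:13:31:37 -0500] '
  . '"GET /php-cookbook/colors.html HTTP/1.1" 200 460 "-" '
  . '"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"',
);

$bytes_by_status = array();
foreach ($lines as $line) {
    if (! preg_match($pattern, $line, $matches)) {
        continue;
    }
    $status = $matches[8];
    // assumption: a hyphen means no bytes were sent, so count it as zero
    $bytes = ($matches[9] == '-') ? 0 : (int) $matches[9];
    if (! isset($bytes_by_status[$status])) {
        $bytes_by_status[$status] = 0;
    }
    $bytes_by_status[$status] += $bytes;
}

foreach ($bytes_by_status as $status => $total) {
    printf("%s   %d bytes\n", $status, $total);
}
```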

[2] The correct way to spell this word is "referrer." However, since the original HTTP specification (RFC 1945) misspelled it as "referer," the three-R spelling is frequently used in context.

Once the log file line has been parsed into distinct variables, you can do the needed calculations. In this case, just keep a counter in the $requests array of how many times each URI is requested. After looping through all lines in the file, print out a sorted, formatted list of requests and counts.

Calculating statistics this way from web server access logs is easy, but it's not very flexible. The program needs to be modified for different kinds of reports, restricted date ranges, report formatting, and many other features. A better solution for comprehensive web site statistics is to use a program such as analog, available for free at http://www.analog.cx. It has many types of reports and configuration options that should satisfy just about every need you may have.
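
To illustrate the kind of modification a restricted date range requires, here is a minimal sketch that counts only requests made during 2002. It assumes the DateTime class is available. The log_time_to_timestamp( ) function, the $range_start and $range_end variables, and the $entries array (which stands in for the $time and $request values parsed from each line) are all hypothetical names introduced for this example.

```php
<?php
// Hypothetical helper: turn a bracketed log time into a Unix timestamp,
// or return false if the string can't be parsed.
function log_time_to_timestamp($time) {
    $dt = DateTime::createFromFormat('d/M/Y:H:i:s O', trim($time, '[]'));
    return $dt ? $dt->getTimestamp() : false;
}

// Keep requests from 2002-01-01 00:00:00 UTC up to, but not including,
// 2003-01-01 00:00:00 UTC.
$range_start = gmmktime(0, 0, 0, 1, 1, 2002);
$range_end   = gmmktime(0, 0, 0, 1, 1, 2003);

// Stand-ins for the $time and $request values from the two sample entries.
$entries = array(
    array('[20/Jul/2001:13:05:02 -0400]', '/sklar.css'),
    array('[14/Mar/2002:13:31:37 -0500]', '/php-cookbook/colors.html'),
);

$requests = array();
foreach ($entries as $entry) {
    list($time, $request) = $entry;
    $timestamp = log_time_to_timestamp($time);
    // skip unparseable times and anything outside the range
    if ($timestamp === false || $timestamp < $range_start
        || $timestamp >= $range_end) {
        continue;
    }
    if (! isset($requests[$request])) {
        $requests[$request] = 0;
    }
    $requests[$request]++;
}

arsort($requests);
foreach ($requests as $request => $accesses) {
    printf("%6d   %s\n", $accesses, $request);
}
```

Every additional report option means another change of this sort to the program, which is the inflexibility described above.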

11.14.4 See Also

Documentation on preg_match( ) at http://www.php.net/preg-match; information about common log file formats is available at http://httpd.apache.org/docs/logs.html.
