From time to time I like to parse my Apache access_log file from the command line. There are a lot of great tools outside the command line for parsing access logs and retrieving information from them, but when you want to see what’s happening right this second, nothing equals simply tailing the access_log from a shell.
I found myself frequently rendering the data a specific way by piping it through multiple utilities. Specifically:
cat /var/log/httpd/access_log|awk '{print $1}'|sort -t. -k 1,1n -k 2,2n -k 3,3n -k 4,4n|uniq -c|sort -k 1n
Which produces:
41 66.249.65.52
42 61.186.161.156
47 208.101.2.194
58 124.161.238.40
63 67.182.236.189
64 87.241.212.2
64 98.100.108.130
335 97.126.173.32
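For anyone who wants to experiment, the same counts can be produced with a slightly shorter pipeline. This is only a sketch: the function name count_ips is made up for this example, and it assumes the client address is the first whitespace-separated field, as in Apache's default log format. Since uniq -c only needs identical lines to be adjacent, a plain sort is enough before counting, and the final awk just strips the padding that uniq puts in front of each count.

```shell
# count_ips FILE: print "count ip" pairs, least frequent first.
# Hypothetical helper for illustration, not part of the original scripts.
count_ips() {
    awk '{ print $1 }' "$1" |   # keep only the first field (client IP)
        sort |                  # put identical addresses next to each other
        uniq -c |               # prefix each address with its count
        sort -k 1,1n |          # order by count, ascending
        awk '{ print $1, $2 }'  # normalise uniq's padded output
}
```

Calling count_ips /var/log/httpd/access_log should list the same addresses and counts as the longer command.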
I find this rendering useful because it shows me which IPs (or hostnames, if you’re resolving them) are visiting my site and how much interaction each one is having with it. Since I don’t have Apache performing reverse DNS lookups, my logs only contain IPs. Once I’ve rendered the data this way I sometimes wonder what domains the top visitors are coming from. To obtain that information I construct another series of piped commands:
cat /var/log/httpd/access_log|awk '{print $1}'|sort -t. -k 1,1n -k 2,2n -k 3,3n -k 4,4n|uniq -c|sort -k 1n|awk '{print $2}'|xargs -l1 dig -x |egrep 'SOA|PTR'|awk '{print $5}'
Which produces:
crawl-66-249-65-52.googlebot.com.
ns1.apnic.net.
208.101.2.194-static.reverse.softlayer.com.
dnssvr3.169ol.com.
c-67-182-236-189.hsd1.ut.comcast.net.
ns1.enforta.com.
rrcs-98-100-108-130.central.biz.rr.com.
97-126-173-32.slkc.qwest.net.
When I can’t see a PTR record I’m happy to simply see who has authority over that IP’s DNS information, which is why I also grep for the SOA. I mention this in case anyone feels the need to explain that PTR and SOA records are not interchangeable bits of information. They’re not, but for my purposes here either is informative.
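The reason awk can simply print the fifth field is that dig writes each record as "name ttl class type data", so for a PTR record the fifth field is the hostname and for an SOA record it is the primary name server. The question section of dig's output also contains the word PTR on a line that starts with a semicolon, so a plain egrep may let through a line with no fifth field at all. Here is a rough sketch of a stricter filter. The function name names_from_dig is invented for this example, and it assumes dig's usual record layout.

```shell
# names_from_dig: read `dig -x` output on stdin and print the PTR target,
# or the SOA primary name server when that is the record dig returned.
# Hypothetical helper; assumes the "name ttl class type data" layout.
names_from_dig() {
    awk '!/^;/ && $3 == "IN" && ($4 == "PTR" || $4 == "SOA") { print $5 }'
}
```

It would be used in place of the egrep and awk stages, for example: dig -x 66.249.65.52 | names_from_dig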
Now, typing these commands out is a bit of a pain. I could stick them in a simple bash script that reads $1, the first argument passed on the command line, and treats it as the file to parse. So I did that. Well, I did it one better: I used getopts and wrote a quick little script that takes the argument to -f as the file to parse and accepts a -c option to turn on the printing of hit counts. This way you can parse a file, sort it, print it without the count, and pipe the output to xargs -l1 dig -x and so on to find out who is visiting your site.
#!/bin/bash
#
# Script: unique_addresses
# Author: Ryan Bonnett
# Date: 10/02/2009
#
# Purpose: This script should properly parse any file as long
# as the first field of each line is an IP address.
# However, it was primarily written to parse Apache's
# access_log inasmuch as Apache is using the default
# logging format. The script returns either a list of
# IP addresses or a list of IP addresses with a count
# specifying how often that IP showed up in the log file.
# I recommend piping the output to:
# |xargs -l1 dig -x|egrep 'SOA|PTR'|awk {'print $5'}
# that will attempt to resolve the IP addresses to
# hostnames/SOA provider name.
#
############################################
#
# set -n # Uncomment to check command syntax without execution
# set -x # Uncomment to debug
file=/var/log/httpd/access_log
countflag=0
######## sortlog function
sortlog()
{
    cat "$file" | awk '{print $1}' \
        | sort -t. -k 1,1n -k 2,2n -k 3,3n -k 4,4n \
        | uniq -c \
        | sort -k 1,1n
}
######## testfile function
testfile ()
{
    # -s: exists and is not empty; -r: we can read it
    if [ ! -s "$file" ] || [ ! -r "$file" ]
    then
        echo "$file doesn't exist, is empty, or we don't have read permissions." >&2
        exit 1
    fi
}
######## usage function
usage ()
{
    echo "Usage: $0 [options]"
    echo ""
    echo "-c takes no arguments, shows IP address hit count."
    echo "-f follow this option with the file you're processing."
    echo ""
    echo "Example: unique_addresses -c -f /var/log/httpd/access_log"
    exit 1
}
######## Main body of script
while getopts ":cf:" options
do
    case $options in
        c) countflag=1 ;;
        f) file=$OPTARG ;;
        :) usage ;;     # -f was given without a file name
        \?) usage ;;    # unknown option
    esac
done
# Always check the file, whether or not we're counting.
testfile
if [ "$countflag" = 1 ]
then
    sortlog
else
    sortlog | awk '{print $2}'
fi
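The script above only reads a named file and always takes the first field. As a rough sketch of how a shell version might accept piped input and a different column, the hypothetical function below hands a separator and a field number to awk and falls back to standard input when no file is given. The name count_field and its argument order are assumptions made for this example.

```shell
# count_field SEP FIELD [FILE]
# Hypothetical helper: counts the values found in column FIELD (1-based),
# splitting on SEP, reading FILE or standard input when FILE is omitted.
count_field() {
    sep=$1
    field=$2
    src=${3:--}     # "-" tells awk to read standard input
    awk -F"$sep" -v f="$field" '{ print $f }' "$src" |
        sort | uniq -c | sort -k 1,1n |
        awk '{ print $1, $2 }'
}
```

For example, count_field ' ' 1 /var/log/httpd/access_log mirrors the script with -c, while some_command | count_field , 2 would count the second column of comma-separated input.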
After writing this short script I decided that it was lacking in many ways. I could pipe data out of it, but I couldn’t pipe data in; I could only read from a file. I couldn’t specify a field separator, I couldn’t specify the field to read, I couldn’t resolve hostnames, and who knows what other features I may want in the future. A lot of files track IP address information and I wanted to be able to parse any of them and render the information sorted, counted, and possibly name resolved. So I decided to create a more fully featured utility. That’s where perl comes in:
#!/usr/bin/perl -w
#
# Script: ipmolest
# Purpose: This script sorts a list of IP addresses, listing the
# IP address that occurs most frequently in the list last.
# The script can also give a count of how often each IP
# appeared in the list. And it can attempt to resolve the
# IP addresses to hostnames.
# By default it will parse the apache access_log.
# With the optional flags you can specify an alternate
# field separator or column to parse. You can also pipe
# data into this script as you would with many other UNIX
# utilities.
#
# Author: Ryan Bonnett
# Date: 10/07/09
#
###############################################################
############################ Modules
use Socket;
use IO::Select;
use Getopt::Long;
############################ Subs
sub gethost { # This sub simply grabs a PTR if one is found.
    # Make sure we're looking at an IP address.
    return $_[0] if ($_[0] !~ m/^\d+\.\d+\.\d+\.\d+$/);
    my $ip = inet_aton $_[0];
    return $_[0] if (!defined $ip);
    # Look the name up once; scalar context returns just the hostname.
    my $name = gethostbyaddr($ip, AF_INET);
    return defined $name ? $name : $_[0];
}
sub usage {
    print "Usage: ipmolest [options]\n";
    printf " %-20s%s\n",
        "--count, -c", "Track how many times each IP was seen.";
    printf " %-20s%s\n",
        "--field, -d", "The input field (counting from 0) that contains an IP address.";
    printf " %-20s%s\n",
        "--file, -f", "Define the file you'd like to parse.";
    printf " %-20s%s\n",
        "--separator, -fs", "Specify a field separator.";
    printf " %-20s%s\n",
        "--help, -[h?]", "View the usage information.";
    printf " %-20s%s\n",
        "--resolve, -r", "Resolve IPs to hostnames where possible.";
    printf " %-20s%s\n",
        "--tail, -t", "Works like a pipe through tail, saves work.";
    exit;
}
############################ Default values and GetOptions
$file = "/var/log/httpd/access_log";
$field = 0;
$separator = " ";
# Read options passed to the script at the command line.
GetOptions ( 'c|count' => \$count
, 'd|field=i' => \$field
, 'f|file=s' => \$file
, 'fs|separator=s' => \$separator
, 'h|help|?' => \$help
, 'r|resolve' => \$resolve
, 't|tail=i' => \$tail);
&usage if ($help);
############################ Begin Main
# Check for STDIN - is someone piping data into this script?
# If we are reading piped data then spool it up in the %ips hash.
$piped = 0;
$pipe = IO::Select->new();
$pipe->add(\*STDIN);
if ($pipe->can_read(.5)){
    # Read STDIN explicitly; a bare <> would try leftover arguments first.
    while(<STDIN>){
        chomp;
        @line = split($separator, $_);
        $piped = 1;
        next if (!defined $line[$field]);
        $ips{$line[$field]}+=1;
    }
}
# Open and parse the file only if we didn't read piped input.
# Grab the IP addresses and count them in %ips.
if (!$piped){
    open (LOG, "<", $file) || die "Can't open $file: $!\n";
    # Read one line at a time rather than slurping the whole log.
    while (defined($line = <LOG>)) {
        chomp $line;
        @line = split($separator, $line);
        next if (!defined $line[$field]);
        $ips{$line[$field]}+=1;
    }
    close (LOG);
}
# We stuck all of the IP addresses into a hash and gave the IPs a value
# that represents how many times that IP showed up in the input.
# Below we sort the keys by their value.
@sorted_ips = sort { $ips{$a} <=> $ips{$b} } keys %ips;
# Check for the tail option and modify the sorted_ips array accordingly
if ($tail){
    $array_length = @sorted_ips;
    if ($array_length > $tail){
        $begin = ($array_length - $tail);
        splice(@sorted_ips, 0, $begin);
    }
}
# Now we read through the sorted IP keys and print them out one at a time.
foreach $ip (@sorted_ips){
    next if (!$ip);
    # Check for the resolve flag, call the gethost sub to resolve.
    if ($resolve){
        # Check for the count flag, count if we're counting.
        if ($count){
            printf ("%-10s%s\n", "$ips{$ip}", &gethost($ip));
        } else {
            print &gethost($ip) . "\n";
        }
    } else {
        # We're not resolving, are we counting?
        if ($count){
            printf ("%-10s%s\n", "$ips{$ip}", "$ip");
        } else {
            print "$ip\n";
        }
    }
}
First let me say that this perl script isn’t “clean”. It’s something that I hacked together and it works perfectly well for me and my purposes. It could be cleaned up by adding a few more tests to make sure you’re operating with the expected data, etc. The point of this post isn’t to demonstrate my masterful skills at coding. If you do happen to take this script and industrialize it I would love to have a copy of your cleaned up version.
The perl script above weighs in at 1996 meaningful characters. I wasn’t going for concise, I was going for readable and commented; I’m sure it could be produced in half the size if that were the goal. The bash script comes in at 784 meaningful characters and it really couldn’t be written much more compactly. Had I attempted to make the bash script offer the functionality that the perl script now offers, it would have taken far more effort on my part, and I’d expect the result to weigh in at more than double the length of the perl script. It also wouldn’t be portable unless you added endless case statements to support other Unix/Linux/GNU systems. The final reason not to extend bash scripts into the world of processing: the shell script takes about 4 times as long to parse a file.
Behold, the perl is mightier than the bash when scripting:
[ryan@www www]# ls -l data/log
-rw-r--r-- 1 ryan ryan 455689591 2009-10-08 22:10 data/log
[ryan@www www]# wc -l data/log
1995175 data/log
[ryan@www www]# time ./ipmolest -c -f data/log > output
real 0m34.518s
user 0m33.375s
sys 0m0.636s
[ryan@www www]# time ./unique_addresses -c -f data/log > outputtoo
real 2m21.330s
user 2m23.667s
sys 0m1.613s
[ryan@www www]#
As you can see, the perl script took about 34.5 seconds to process the ~2 million line, ~456 MB file, while the bash script took 2 minutes 21 seconds, a little more than 4 times as long.
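The "more than 4 times" figure comes straight from the two "real" times reported by time. As a quick check of the arithmetic, here is a small shell sketch. The function names to_seconds and speedup are invented for this example, and it assumes durations in the "XmY.YYYs" form that bash's time prints.

```shell
# to_seconds DURATION: convert a `time`-style value such as "2m21.330s"
# into plain seconds. Hypothetical helper for illustration.
to_seconds() {
    printf '%s\n' "$1" |
        awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# speedup SLOW FAST: how many times longer SLOW took than FAST.
speedup() {
    awk -v slow="$(to_seconds "$1")" -v fast="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", slow / fast }'
}
```

With the numbers above, speedup 2m21.330s 0m34.518s prints 4.09.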
Okay, on to other interesting things. The perl script has some portions that are comment worthy, let’s go over those now.
Here is a handy way to have your perl script accept data that is piped to it:
# Check for STDIN - is someone piping data into this script?
$pipe = IO::Select->new();
$pipe->add(\*STDIN);
if ($pipe->can_read(.5)){
    while(<STDIN>){
        # process each piped line here
    }
}
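For comparison, a shell script can make a similar decision by asking whether standard input is a terminal. This is a different technique from the IO::Select check: it does not wait to see whether data arrives, it only looks at what file descriptor 0 is attached to. The function name input_source is made up for this sketch.

```shell
# input_source: report where a script should read from.
# Prints "stdin" when standard input is a pipe or a redirected file, and
# "file" when it is an interactive terminal. Hypothetical helper.
input_source() {
    if [ -t 0 ]; then
        echo "file"
    else
        echo "stdin"
    fi
}
```

A script could branch on the result, reading its default log file when run by hand and its standard input when something is piped into it.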
Another handy snippet is this bit, which creates an array containing the keys of a hash sorted by their values:
@sorted_ips = sort { $ips{$a} <=> $ips{$b} } keys %ips;
That’s all for this post.
Ryan