Bash script to find the most common 404s

Following Kevin Hankens' post on why you shouldn't ignore Drupal 404 errors, I decided to go through yesterday's error_log on our live Apache server (the one which hosts www.pponline.co.uk, www.sportbusiness.com and www.mychild.co.uk alongside around 40 other Drupal sites).

It turns out there were almost 5,000 404 (page not found) errors. How to find the most "popular" ones though? This called for a Bash script…


gawk '{ print $13 }' error_log.1 | grep '^/var' | sort | uniq -c | sort -n

This uses gawk to parse yesterday's error log (hence the .1) and print column 13, relying on gawk's default of splitting fields on whitespace. Note: it turns out single quotes and double quotes mean different things here! Inside double quotes the shell expands $13 itself before gawk runs, so the program has to be in single quotes. Next I keep only the lines beginning with "/var", because column 13 also picks up values from memory and PHP errors. Then I sort them and count how many times each unique line appears. Finally, I sort that result numerically so the most common entries come last.
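As a rough sketch of the same idea, the filter can match on the "File does not exist:" message itself rather than on the "/var" prefix. This assumes the Apache 2.2-style error log layout shown in the comment, where the missing path is field 13 and is sometimes followed by ", referer: ..."; the function name top_404s and the sample lines are made up for illustration, and plain awk is used so it runs where gawk is not installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read an Apache error log on stdin and count missing paths.
# Assumed line format (Apache 2.2 style):
#   [Mon Jan 04 10:00:01 2010] [error] [client 10.0.0.1] File does not exist: /var/www/example/favicon.ico
top_404s() {
    # Keep only "File does not exist:" lines, strip the trailing comma that
    # appears when a referer follows the path, then count and sort ascending.
    awk '$9 == "File" && $12 == "exist:" { sub(/,$/, "", $13); print $13 }' \
        | sort | uniq -c | sort -n
}

# Made-up sample data, not taken from a real log.
sample_log() {
    cat <<'EOF'
[Mon Jan 04 10:00:01 2010] [error] [client 10.0.0.1] File does not exist: /var/www/example/favicon.ico
[Mon Jan 04 10:00:02 2010] [error] [client 10.0.0.2] File does not exist: /var/www/example/favicon.ico, referer: http://example.com/
[Mon Jan 04 10:00:03 2010] [error] [client 10.0.0.3] PHP Notice:  Undefined index: q in /var/www/example/index.php on line 3
[Mon Jan 04 10:00:04 2010] [error] [client 10.0.0.1] File does not exist: /var/www/example/robots.txt
[Mon Jan 04 10:00:05 2010] [error] [client 10.0.0.4] File does not exist: /var/www/example/favicon.ico
EOF
}

sample_log | top_404s
```

On a real log you would run something like top_404s < error_log.1 instead. Stripping the trailing comma means requests with and without a referer are counted as the same path.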

The result? SportBusiness REALLY needs a favicon in the default place - that alone accounted for 20% of the 404s!

UPDATE


gunzip -c /var/log/httpd/error_log.3.gz | gawk '{ print $13 }' | grep '^/var' | sort | uniq -c | sort -n

The command above is a slight alternative if you use compressed log files (it saves decompressing the file first).
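If your log directory holds a mix of plain and gzipped files, a small wrapper can pick the right reader for you. This is only a sketch: read_log is a hypothetical name, and it assumes compressed logs can be recognised by a .gz suffix.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a log file to stdout whether or not it is gzipped.
# Assumes compressed files end in .gz; anything else is treated as plain text.
read_log() {
    case "$1" in
        *.gz) gunzip -c "$1" ;;
        *)    cat "$1" ;;
    esac
}

# Example usage with the pipeline from this post:
#   read_log /var/log/httpd/error_log.3.gz | gawk '{ print $13 }' | grep '^/var' | sort | uniq -c | sort -n
#   read_log error_log.1                   | gawk '{ print $13 }' | grep '^/var' | sort | uniq -c | sort -n
```

Either way the rest of the pipeline stays the same; only the first command changes.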