Bash script to find the most common 404s

Following Kevin Hankens' post on why you shouldn't ignore Drupal 404 errors, I decided to go through yesterday's error_log on our live Apache server (the one which hosts www.pponline.co.uk, www.sportbusiness.com and www.mychild.co.uk alongside around 40 other Drupal sites).

It turns out there were almost 5,000 404 (page not found) errors. How to find the most "popular" ones though? This called for a Bash script…

gawk '{ print $13 }' error_log.1  | grep ^/var | sort | uniq -c | sort -n

This uses gawk to parse yesterday's error log (hence the .1) and print column 13, splitting on whitespace, which is gawk's default delimiter. Note: single and double quotes are not interchangeable here! Inside double quotes the shell would expand $13 itself before gawk ever saw it. Next, grep keeps only the lines beginning with "/var", because the gawk step also returns values from memory and PHP errors. Then sort them and count the unique lines. Finally, sort this result numerically so the most common entries come at the end.
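As a rough sketch, the same pipeline can be wrapped in a small function so it is easy to rerun against any log. The function name `count_missing_paths` is made up for illustration, and the filter is folded into awk instead of a separate grep. It assumes the same layout as above, where the missing file's path is field 13 and lives under /var; plain `awk` is used here, but gawk behaves the same for this program.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally "file does not exist" paths in an Apache error log.
# Assumption: the missing path is whitespace-separated field 13 and starts with /var.
count_missing_paths() {
    local log_file="$1"
    # Print field 13 only when it begins with /var, then count duplicates
    # and sort numerically so the most common paths end up last.
    awk '$13 ~ /^\/var/ { print $13 }' "$log_file" | sort | uniq -c | sort -n
}
```

Calling `count_missing_paths error_log.1` should give the same listing as the one-liner above, as long as the log really does put the path in field 13.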

The result? SportBusiness REALLY needs a favicon in the default place - that alone accounted for 20% of the 404s!

UPDATE

If you use compressed log files, a slight alternative is the following, which saves decompressing the file first:

gunzip -c /var/log/httpd/error_log.3.gz | gawk '{ print $13 }' | grep ^/var | sort | uniq -c | sort -n
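Building on the compressed-log variant, here is a minimal sketch that accepts a mix of plain and gzipped logs and tallies them together. The helper names `read_log` and `count_missing_paths_all` are invented for this example, and it assumes compressed logs always end in .gz and that the path is still field 13.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a log to stdout, decompressing on the fly
# when the file name ends in .gz (assumption: that suffix is reliable).
read_log() {
    case "$1" in
        *.gz) gunzip -c "$1" ;;
        *)    cat "$1" ;;
    esac
}

# Hypothetical helper: tally field-13 paths under /var across several logs.
count_missing_paths_all() {
    local f
    for f in "$@"; do
        read_log "$f"
    done | awk '$13 ~ /^\/var/ { print $13 }' | sort | uniq -c | sort -n
}
```

For example, `count_missing_paths_all error_log.1 error_log.3.gz` would count both days in one go without leaving decompressed copies on disk.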


6 Comments

The most recent comment was on Fri, 11th Mar 2011 - 19:10

This one will definitely become part of my site status routine.

BTW, do you use any stats software to parse through your logs? I'm trying to find the best way to get a quick glimpse into the site. I've been doing some playing with Splunk and had fairly good luck so far, but haven't tried anything else.

Thanks for sharing!!

There's a pretty good chance that your one-liner above may catch more than just 404 errors, and if the user has a site not stored in /var, it won't pick up any entries. The only way to really get 404s is to pull them out of the access log. The following will pull all 404s out of the access log, count them, sort them in descending order of occurrences, and print out the Top 10:

gawk '{ print $7,$9 }' access.log  | grep " 404" \
| cut -d' ' -f1 | sort | uniq -c | sort -rn \
| head -n 10
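A possible variation on the access-log approach, offered only as a sketch: let awk compare the status field directly instead of grepping for " 404", and accept several log files at once. The function name `top_404s` is hypothetical, and it assumes the common/combined log format, where field 7 is the request path and field 9 the status code.

```shell
#!/usr/bin/env bash
# Hypothetical helper: top N URLs that returned 404 across one or more access logs.
# Assumption: common/combined log format (field 7 = path, field 9 = status).
top_404s() {
    local limit="$1"
    shift
    # Keep the path only when the status is exactly 404, then count,
    # sort with the most frequent first, and keep the top $limit rows.
    awk '$9 == 404 { print $7 }' "$@" | sort | uniq -c | sort -rn | head -n "$limit"
}
```

Something like `top_404s 10 access.log` would mirror the Top 10 above; passing several per-site access logs would give a combined ranking.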

Or in Drupal:
/admin/reports/page-not-found

And if you are running the latest versions of Drupal then the favicon 404s will be prevented by .htaccess

@Kevin Hankens: I only have AWStats... Will take a look at Splunk.

@Justin Ellison: Yup - but in my case, I have separate access_logs for all the sites, but one single error log. And all my sites are in /var/www/html... Good point though!

@Anonymous & @dalin: You're right... Every individual Drupal site does this for you. What if you have a dedicated server running 40 Drupal sites? I'll be buggered if I'm going through all those sites individually ;-)
