Drupal, ImageCache & Performance

A while ago we started having issues running a website. The problem was bandwidth. Upon investigation, we saw that the majority of bandwidth was being used by image files and, of those, a lot came from the ImageCache folder. It was also causing a lot of excess server load by consuming Apache processes. Something needed to be done.

Our first step was to farm all ImageCache handling off to a cheap server with lots of bandwidth. We rented a VPS from A2B2 and set up a sub-domain, files.example.com, to point at this server. I installed Lighttpd as, from experience, it is VERY good at serving files quickly and "cheaply". After the basic configuration (such as setting up the VHost) we needed to figure out how to get the files over there. I toyed with setting up an NFS share over the internet and mapping the files folder directly to the VPS, but there were some risks with it (dropped connections, VPS rebooting, etc) so I ended up settling on using RSync over SSH.

So I set up a user on the VPS, mapped a files folder from the home directory to the website folder and enabled the user's access in /etc/ssh/sshd_config by adding them to the AllowUsers line with something like:

AllowUsers joebloggs@123.123.123.123

This line restricts the user joebloggs to only connect from 123.123.123.123. We then set up some Key Authentication so that we didn't need to hard code any passwords. The advantage of doing it this way is that the host (the VPS) only needs the public key, the client (the main web server, which initiates the connection) keeps the private key, and you can configure access so that the user can only use this key between the two machines.
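As a rough sketch of the key setup (not necessarily the exact steps we used): the key pair is generated on the main web server, and the public half is added to the user's authorized_keys file on the VPS with a from= restriction. The helper name restricted_key_line is made up for illustration, and the key type and size in the comments are assumptions; the key path is the one used in the RSync command further down.

```shell
#!/bin/sh
# Sketch only: key type, key size and the helper name are assumptions.

# Print an authorized_keys entry that only accepts the key from one source IP.
# $1 = source IP allowed to use the key, $2 = path to the public key file
restricted_key_line() {
  printf 'from="%s" %s\n' "$1" "$(cat "$2")"
}

# On the main web server (the RSync client) the pair could be generated with:
#   ssh-keygen -t rsa -b 4096 -N '' -f /root/.ssh/files.example.com_id_rsa
# The private key stays there. The output of
#   restricted_key_line 123.123.123.123 /root/.ssh/files.example.com_id_rsa.pub
# is then appended to ~joebloggs/.ssh/authorized_keys on the VPS.
```

The from= option means the key is refused from any other address, which complements the AllowUsers restriction in sshd_config.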

Once this was done, we needed to set up RSync. With a little research I found the right set of options to sync the files folder over to the VPS.

/usr/bin/rsync -cavz  -e 'ssh -v -i /root/.ssh/files.example.com_id_rsa' --progress  --delete /var/www/html/www.example.com/drupal/sites/all/files/imagecache/ joebloggs@files.example.com:~/files/imagecache/

So this might seem like a long and complicated command... but it's not really. The RSync args basically tell it to:

  • Checksum (c) - compares files by a checksum instead of by size and modification time.
  • Archive Mode (a) - basically enables several other features such as recursing into directories and preserving permissions & timestamps.
  • Verbosity (v) - handy for debugging later if it goes wrong.
  • Compression (z) - compresses the data during transfer.

The next option (e) lets RSync know what remote shell to use. In this case, we're telling it to use SSH. The SSH command also has a few parameters. The first (v) is to enable Verbosity - again, good for debugging - and the second (i) is for the identity file - in this case the private key that matches the public key installed on the VPS.

We have a couple more parameters going over to RSync; progress and delete. Progress simply tells us (in an interactive environment) how the transfers are doing; we don't really NEED this for the background tasks. The delete rule tells RSync to remove files from the VPS which are no longer on the main server; this helps to keep things tidy.

The next RSync parameter is the "source" path (ie what do we want to copy?). In this case, the ImageCache folder inside our sites files folder. The final parameter is where to copy it to, in this case the joebloggs account at the files server in the home folder.

After running this, we had a folder on the VPS which was a "mirror" of the main site's ImageCache. Things were looking good!

I then finished mapping the files folder in "joebloggs" home folder to the VPS's files VHost DocumentRoot. It was very important to keep the URL structure the same, so I created the following path...

/var/www/html/files.example.com/sites/all/files

I then used /etc/fstab to create a bind entry so I could mount the website's files folder to the user's home files folder. The reason I did this was because the joebloggs account also had an FTP account which was "Root Jailed" and I wanted people to be able to get to the files folder over FTP…
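For reference, a bind mount in /etc/fstab is a single line. The sketch below is a small shell helper that prints such a line so it can be checked before it is appended to /etc/fstab. The /home/joebloggs/files mount point is an assumption (the home folder's real path may differ), and the helper name is made up for illustration.

```shell
#!/bin/sh
# Print an /etc/fstab entry that bind-mounts one directory onto another.
# $1 = existing directory to expose, $2 = mount point
fstab_bind_line() {
  printf '%s %s none bind 0 0\n' "$1" "$2"
}

# Assumed paths: the VHost files folder exposed inside the jailed home folder.
fstab_bind_line /var/www/html/files.example.com/sites/all/files /home/joebloggs/files
```

Once the printed line is in /etc/fstab, running mount -a as root should activate it without a reboot.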

Ok… so now we could access the same file in two ways:

  • http://www.example.com/sites/all/files/imagecache/myprofile/myimage.png
  • http://files.example.com/sites/all/files/imagecache/myprofile/myimage.png

Everything was looking good, so I then set up RSync to run every 5 minutes to sync files over between the two servers.
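A minimal sketch of how the 5 minute job could be wired up, assuming it runs from root's crontab. The script location, the lock path and the function names are assumptions for illustration; the RSync arguments are the ones from earlier, minus the verbose and progress flags. The lock is there because a checksum sync of a large folder can take longer than 5 minutes, and two overlapping runs would only fight each other.

```shell
#!/bin/sh
# Sketch of a cron-driven sync script; lock path and script location are
# assumptions, the rsync arguments are the ones used earlier in the post.

# Run a command only if no other run currently holds the lock directory.
# $1 = lock directory, remaining arguments = command to run
run_locked() {
  lock=$1
  shift
  # mkdir is atomic, so it fails if a previous run has not finished yet.
  if ! mkdir "$lock" 2>/dev/null; then
    echo "previous sync still running, skipping this run" >&2
    return 1
  fi
  "$@"
  status=$?
  rmdir "$lock"
  return "$status"
}

# The sync itself, minus -v and --progress since nobody watches cron output.
sync_imagecache() {
  /usr/bin/rsync -caz -e 'ssh -i /root/.ssh/files.example.com_id_rsa' --delete \
    /var/www/html/www.example.com/drupal/sites/all/files/imagecache/ \
    joebloggs@files.example.com:~/files/imagecache/
}

# In the real script the last line would be:
#   run_locked /var/lock/imagecache-sync.lock sync_imagecache
# and root's crontab (crontab -e) would call the script every 5 minutes:
#   */5 * * * * /usr/local/bin/sync-imagecache.sh
```

One caveat with this simple approach: if the script is killed mid-run the lock directory is left behind and has to be removed by hand before syncing resumes.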

The next step was to tell our website to point ImageCache URLs to the files server instead of the webserver. Drupal's theme overrides make this VERY easy. The first step is to copy the theme_imagecache function from imagecache.module (line 826) to your theme's template.php file. For reference, this is the function as of time of writing (it may change in future releases, so it is always worth checking your installed version).

function theme_imagecache($presetname, $path, $alt = '', $title = '', $attributes = NULL, $getsize = TRUE) {
  // Check is_null() so people can intentionally pass an empty array of
  // to override the defaults completely.
  if (is_null($attributes)) {
    $attributes = array('class' => 'imagecache imagecache-'. $presetname);
  }
  if ($getsize && ($image = image_get_info(imagecache_create_path($presetname, $path)))) {
    $attributes['width'] = $image['width'];
    $attributes['height'] = $image['height'];
  }

  $attributes = drupal_attributes($attributes);
  $imagecache_url = imagecache_create_url($presetname, $path);
  return '<img src="'. $imagecache_url .'" alt="'. check_plain($alt) .'" title="'. check_plain($title) .'" '. $attributes .' />';
}

Now we need to make a few tweaks. Firstly, rename the function to match your theme (eg, garland_imagecache). Secondly, add a line just before the return:

$imagecache_url = 'http://files.example.com'. $imagecache_url;

This adds the absolute URL prefix to the ImageCache URL ($imagecache_url starts with the base path which is usually a slash (/)).

Now you should have something like this:

function garland_imagecache($presetname, $path, $alt = '', $title = '', $attributes = NULL, $getsize = TRUE) {
  // Check is_null() so people can intentionally pass an empty array of
  // to override the defaults completely.
  if (is_null($attributes)) {
    $attributes = array('class' => 'imagecache imagecache-'. $presetname);
  }
  if ($getsize && ($image = image_get_info(imagecache_create_path($presetname, $path)))) {
    $attributes['width'] = $image['width'];
    $attributes['height'] = $image['height'];
  }

  $attributes = drupal_attributes($attributes);
  $imagecache_url = imagecache_create_url($presetname, $path);
  $imagecache_url = 'http://files.example.com'. $imagecache_url;
  return '<img src="'. $imagecache_url .'" alt="'. check_plain($alt) .'" title="'. check_plain($title) .'" '. $attributes .' />';
}

It was soon after this I realised a slight problem. What happens if you request a file which hasn't been sync'd over to the VPS yet? You will start getting 404 errors until it gets sync'd… And everything was going so well!

I did a little research and stumbled into the Lua scripting system in Lighttpd. This allowed me to create a very simple script which acts very much like Drupal's Clean URL Apache Rewrite rule, which converts requests for non-existing files into an index.php?q=blah format.

attr = lighty.stat(lighty.env['physical.path'])
if (not attr) then
  lighty.header["Location"] = "http://www.example.com"  .. lighty.env["request.uri"]
  return 302
end

What does this do?! Well, it takes the physical path of the request (ie the internal filesystem path) and does a "stat" on it (which basically checks if it exists by getting some info on the file). If the attributes for the file don't exist then we need to redirect the request back to the main server where the file is more likely to exist. We send the redirect as a 302 rather than a 301 (as the redirect isn't permanent). Once the file gets transferred over via the next due RSync, Lighttpd will be able to "stat" the file and therefore won't redirect it anymore.
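Lighttpd runs Lua scripts like this through mod_magnet, so the script has to be attached in the server config before it does anything. The sketch below notes that wiring in its comments (the script path is an assumption) and adds a small, made-up helper for checking from the command line that a not-yet-synced file really does bounce back to the main server.

```shell
#!/bin/sh
# Assumed lighttpd.conf wiring for the Lua script (the path is hypothetical):
#   server.modules += ( "mod_magnet" )
#   magnet.attract-physical-path-to = ( "/etc/lighttpd/imagecache-fallback.lua" )

# Read HTTP response headers on stdin. If the status is 302, print the
# Location header and succeed; otherwise print nothing and fail.
redirect_target() {
  tr -d '\r' | awk '
    NR == 1 { if ($2 != "302") exit 1; next }
    tolower($1) == "location:" { print $2; found = 1; exit }
    END { if (!found) exit 1 }
  '
}

# Example check against a file that has not been synced yet:
#   curl -sI http://files.example.com/sites/all/files/imagecache/myprofile/new.png | redirect_target
```

If the fallback is working, the helper prints the matching www.example.com URL; once the file has been synced the VPS answers 200 and the helper prints nothing.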

So, we did this for one of our big sites and our bandwidth usage dropped significantly as our VPS started handling the files. The main webserver also had less to do and started responding better too!

Our VPS gave us 400GB for £15/month. 400GB with our host would have cost us somewhere in the region of £600/month. So that's a pretty big saving!