Fighting Back After Hacker News Took Down My Site

Last week I had a post make it to the Hacker News front page, and my site immediately went down. After fighting with it for a while, I got it limping along well enough to last the day, and since then I’ve made several simple changes that add up to a much more robust setup. I imagine there are a ton of self-hosters out there with setups similar to mine, so hopefully the details of my comeuppance will help others head off the same fate.

The Problem

I run this blog and a few other sites on a Linode with 768MB of RAM. Apache serves 1) this PHP blog, 2) a Rails app, 3) a Sinatra app, 4) another Sinatra app, and 5) a proxy to a local Node.js app. All of my sites serve dynamic content, and none gets enough traffic for me to have made caching or performance tuning a priority. I was using the default Ubuntu Apache config.

When my blog post made it to the Hacker News front page, it shot up to #8, and my site went down immediately.

The Immediate Solution #1

I suspected the main problem was that WordPress wasn’t doing much (if any) caching by default. I found a page caching plugin, opened an SSH connection to my box, and waited several minutes for a prompt. My load average was around 60, and pkill -9 -f apache took several minutes to take effect. Once Apache was dead, I downloaded the plugin, restarted Apache, and used the WordPress admin interface to enable basic page caching. About 30 seconds later, my site went down again.
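If you find yourself in the same spot, a few quick checks tell you whether worker count and memory are the culprit. Something along these lines (the process name apache2 is the Debian/Ubuntu default; on Red Hat derivatives it’s httpd):

```shell
# Load average (the first number after "load average:" is the 1-minute figure)
uptime

# How many Apache workers are running right now
workers=$(ps -C apache2 -o pid= | wc -l)
echo "apache workers: $workers"

# Rough total resident memory used by those workers, in MB
ps -C apache2 -o rss= | awk '{sum += $1} END {print sum / 1024 " MB"}'
```

If the worker count is high and their combined RSS approaches your RAM, the box is swapping, and that 60 load average follows.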

The Immediate Solution #2

The page caching definitely helped, but I was still running out of memory fast. I set up a cron job to restart Apache every minute, and had a chat with the resident Braintree sysadmin. He told me to look at the number of workers Apache had running, and ps aux | grep apache showed tons. I jumped into the config file (the default Ubuntu one), and it was set to run up to 150 workers, which my little server couldn’t come close to handling. I bumped that down to 50, then again to 25. The site was slow, but stable enough to last the day without the restart cron job.
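A sane worker ceiling falls out of simple arithmetic: RAM you can give Apache divided by the average per-worker resident size. A sketch with illustrative numbers (these are assumptions, not measurements from my box; measure your own worker RSS with ps):

```shell
total_ram_mb=768        # the Linode's total RAM
reserved_mb=200         # headroom for the database, OS, etc. (assumption)
per_worker_mb=25        # ballpark RSS for a mod_php worker (assumption)

max_clients=$(( (total_ram_mb - reserved_mb) / per_worker_mb ))
echo "reasonable MaxClients: $max_clients"
```

At ~25MB each, 150 workers could try to use well over 3GB on a 768MB box, which is exactly the swap-death I was seeing; something in the low twenties is far closer to what this server can sustain.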

The Long-Term Solution

The page caching and worker count fixes were both good long-term solutions, but there was another, more subtle problem. Because I serve a lot of dynamic content on my other sites, the synchronous nature of Apache is a big liability. If I have a popular post about Power Hungry, it’s sure to send a fair amount of traffic to that link. While the blog is (now) cached, Power Hungry is still going to be rather slow. Because each Apache worker can only handle one request at a time, slow requests to one site have the side effect of slowing down traffic to all of my sites. I addressed this by switching to nginx, whose event-driven architecture handles requests asynchronously. Slow requests no longer block subsequent ones, far fewer workers are needed, and memory is no longer the bottleneck it once was.
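The nginx side of that switch is just a handful of server blocks. A minimal sketch of the idea (paths, ports, and server names here are assumptions for illustration, not my actual config):

```nginx
# Hypothetical /etc/nginx/sites-available/plainlystated
server {
    listen 80;
    server_name www.plainlystated.com;
    root /var/www/plainlystated;

    location / {
        # Serve the plugin's static page cache when it exists, and only
        # fall through to PHP on a cache miss (path is an assumption)
        try_files /wp-content/cache/page_enhanced/$host/$uri/_index.html
                  $uri $uri/ /index.php?$args;
    }

    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass unix:/var/run/php-fpm.sock;  # assumes php-fpm
    }
}

# A proxied dynamic app; because nginx multiplexes connections in an
# event loop, this app being slow doesn't tie up workers needed by the
# other sites
server {
    listen 80;
    server_name powerhungry.example;       # hypothetical
    location / {
        proxy_pass http://127.0.0.1:3000;  # assumed app port
        proxy_set_header Host $host;
    }
}
```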

I experimented with Varnish (in-memory page caching), but didn’t see significant performance improvements. I imagine I was just missing something with that, but the numbers I was seeing from just switching to nginx were impressive enough that I think I’m pretty good for now. We’ll see :)

Benchmarks

It’s important to restart everything and run benchmarks after every significant change. Benchmarks should be run from another machine with adequate resources (I use a box I have at Slicehost). Here are a few of the interesting ones:

Apache, no caching, no concurrency (normal non-spike traffic):

```
$ ab -c 1 -n 50 http://www.plainlystated.com/

Percentage of the requests served within a certain time (ms)
  50%    684
  66%    707
  75%    725
  80%    742
  90%    758
  95%    788
  98%   1388
  99%   1388
 100%   1388 (longest request)
```

Apache, no caching, high concurrency:

```
$ ab -c 50 -n 500 http://www.plainlystated.com/
^C  (killed; system thrashing from running out of RAM)

Percentage of the requests served within a certain time (ms)
  50%   57670
  66%  139696
  75%  211840
  80%  239120
  90%  348515
  95%  394327
  98%  406878
  99%  418990
 100%  418990 (longest request)
```

nginx, W3 Total Cache WordPress plugin, high concurrency:

```
$ ab -c 50 -n 1000 http://www.plainlystated.com/

Percentage of the requests served within a certain time (ms)
  50%    300
  66%    307
  75%    309
  80%    310
  90%    328
  95%    357
  98%    360
  99%    559
 100%   3304 (longest request)
```

More Info

I got a lot of tips from Daniel Miessler.