System Administration

Rebuilt AWS Infrastructure

I was hired as a contractor in early 2017 for some web development work. At the time, the company I was contracted to was experiencing extreme performance issues with its AWS environment.

  • They were having to reboot their primary RDS instance every Monday morning to stop it collapsing under load.
  • Websites were painfully slow, with load times often stretching past a minute. Content managers were unable to access the CMS, website visitors couldn’t reach the front end, and any increase in traffic could crash the entire stack.
  • The stack was split across 3 very large web servers running Debian Linux with Apache, the file system was a shared NFS disk, and the database was a single massive RDS instance. All sites shared the same file system, including development sites, so an issue with one site could impact every other site.
  • The servers were old and unmaintained; even simple package updates were impossible.
  • An attempt had been made to migrate everything to a “highly available and elastic” server configuration, but the initial setup wasn’t properly thought through, so it actually made things a lot worse and blew costs out tenfold.

I had experience with Linux and AWS, so even though I was contracted as a developer, I was asked to help out, and eventually the project became mine.

The solution

My initial priority was to stop things from crashing. The sites were performing so poorly that the company was losing a lot of money on top of the ballooning AWS costs. Clients were losing faith and the company was panicking.

One by one, I migrated each website to its own tiny EC2 instance. They weren’t massive sites, but they had a lot of content and media, and collectively they generated a lot of traffic. Separating them minimised the impact any one site could have on the rest.

Around half of the websites were fairly similar “business” type sites: half a dozen pages, contact forms and a similar theme design. To keep things easier to maintain, I migrated each of those into a WordPress multisite. As of 2020, there are 3 multisites: 2 with around 10 – 12 sites each, and a 3rd with 5 sub-sites.
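For anyone who hasn’t set one up before, a sub-directory multisite is switched on with a handful of constants in wp-config.php. Roughly like this (the domain here is a placeholder, not one of the real sites):

```php
/* Enables the Network Setup screen under Tools. */
define( 'WP_ALLOW_MULTISITE', true );

/* Added after running the network installer. */
define( 'MULTISITE', true );
define( 'SUBDOMAIN_INSTALL', false );            // sub-directory sites, e.g. example.com/site-a/
define( 'DOMAIN_CURRENT_SITE', 'example.com' );  // placeholder domain
define( 'PATH_CURRENT_SITE', '/' );
define( 'SITE_ID_CURRENT_SITE', 1 );
define( 'BLOG_ID_CURRENT_SITE', 1 );
```

From there, each migrated site becomes a sub-site sharing one codebase, one set of plugins and one database, which is what makes the maintenance cheaper.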

The rest of the websites were larger and would benefit more from being separate, so I left those as standalone sites on their own instances.

I upgraded all of the PHP versions to 7.x+, as most of the sites were running on PHP 5.3. This took a fair bit of work because, even though the sites were mostly WordPress, there was a lot of custom code relying on functionality that has since been removed from PHP.
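A quick way to scope that kind of upgrade is to grep the codebase for APIs that PHP 7 removed, such as the old mysql_* extension, the ereg* functions and split(). A minimal audit sketch (the scan path is illustrative):

```shell
#!/bin/sh
# List files still calling APIs removed in PHP 7 (mysql_* extension,
# ereg*, split()). Pass the web root as the first argument.
SCAN_DIR="${1:-wp-content}"
grep -rnE 'mysql_[a-z_]+\(|\beregi?(_replace)?\(|\bsplit\(' \
    --include='*.php' "$SCAN_DIR" \
    || echo "no legacy calls found in $SCAN_DIR"
```

Anything it flags needs porting (mysqli/PDO in place of mysql_*, preg_* in place of ereg*) before the version flip.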

I also wanted to take some pressure off server storage. With so much media (up to 80GB for a single site in some cases), moving sites was a pain and the disks were filling up fast, even on standalone servers. Using the WP Offload Media plugin from Delicious Brains, I moved all of the media files into Amazon S3.
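The plugin needs AWS credentials that can read and write the media bucket. A minimal IAM policy along these lines covers it (the bucket name is made up; scope it to your real bucket):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObjectAcl",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::example-wp-media",
        "arn:aws:s3:::example-wp-media/*"
      ]
    }
  ]
}
```

Keeping the policy limited to one bucket per site (or per multisite) means a leaked key can’t touch the backups or anyone else’s media.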

For load balancing, I went with Cloudflare, as you can start on the free tier and still benefit from a solid CDN and built-in SSL certificates. I realise I could have just gone with AWS ELB, but I was trying to keep things simple and cheap, and in my opinion the benefits of Cloudflare were too good to pass up. It’s definitely been one of the best decisions I’ve made.
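One detail worth noting when putting Apache behind Cloudflare: without mod_remoteip, every request appears to come from Cloudflare’s proxies, which breaks logs and any IP-based rate limiting. A sketch of the fix (the ranges shown are only a sample of Cloudflare’s published IPv4 list, which should be used in full):

```apache
# Restore the real visitor IP from the header Cloudflare adds.
# Requires mod_remoteip to be enabled.
RemoteIPHeader CF-Connecting-IP
RemoteIPTrustedProxy 173.245.48.0/20
RemoteIPTrustedProxy 103.21.244.0/22
```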

Automated backups, also to S3, were put in place so that if anything did go wrong we could recover quickly. All the theme and custom plugin code was moved into Bitbucket so we could track changes and keep a central branch for each project.
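The backup job itself doesn’t need to be anything clever: dump the database, bundle it into a dated archive, push the archive to S3 and let a cron entry run it nightly. A minimal sketch, where the site name, database name and bucket are all hypothetical, and each external tool is skipped gracefully when it isn’t installed:

```shell
#!/bin/sh
# Nightly backup sketch: dump the DB, build a dated archive, ship to S3.
SITE="example.com"
STAMP=$(date +%Y-%m-%d)
WORK=$(mktemp -d)
ARCHIVE="/tmp/$SITE-$STAMP.tar.gz"

# Dump the WordPress database if the MySQL client tools are present.
if command -v mysqldump >/dev/null 2>&1; then
    mysqldump --single-transaction wordpress > "$WORK/db.sql" 2>/dev/null || true
fi

# Bundle whatever was collected into one dated archive.
tar -czf "$ARCHIVE" -C "$WORK" .

# Ship it to S3 when the AWS CLI is available (bucket name is made up).
if command -v aws >/dev/null 2>&1; then
    aws s3 cp "$ARCHIVE" "s3://example-wp-backups/$SITE/" || true
fi
echo "backup written to $ARCHIVE"
```

In practice the tar step would also pull in wp-content, and a lifecycle rule on the bucket keeps old archives from piling up.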

I know this isn’t an example of an “ideal” elastic cloud configuration, but in all honesty, that was never the goal. This wasn’t an experiment in Cloud Infrastructure, it was to fix a critical business issue and save the company money.

  • We cut costs from a peak of $17k AUD a month to less than $2k a month.
  • Performance improved dramatically, with millisecond load times in many cases.
  • 2.5 years later (2020), the entire network averages 99.87% uptime.

The servers are still manually updated and new instances need to be spun up by hand, but for the time being that’s ok. The environment isn’t so large that it’s unmanageable and I do want to keep some aspects under manual control.

There’s a mixture of Ubuntu, Debian and CentOS Linux for a variety of reasons, though most of the servers run Debian.

With everything in place, on the rare occasion that there is an outage, it takes minutes to recover. An outage or a traffic spike on one site no longer affects any of the others, and Cloudflare absorbs much of the load anyway.