How did you make the jump to high availability?

Incertitude · December 31, 2016, 8:56pm

My SaaS business is starting to get traction (it’s generating enough to live on comfortably), and I’ve also had a few availability incidents caused by my old VPS provider (Backbone AG - I ditched them a few weeks ago, and I suppose everyone else did too, because they sent around emails last week saying they were quitting the VPS business).

I’m using python + postgres (so there isn’t really a “works straight out of the box” middleware solution to pool connections and handle failover). I’m also having trouble finding a good infrastructure engineer who has set up HA postgres before. The one who had promised to do it flaked on me just before Christmas, and I’ve had a couple of others take a serious look at it and tell me “nah, too hard” … not to mention the dozens who have said “haven’t ever done it but I once did this thing with Galera…”.

I could of course go with a managed DB, but then for network latency I’ve generally got to be in the same datacenter and the big ones (AWS, Azure…) cost a mint for commodity VPSes. My current server bill at Linode is 15% of what it would be at Azure. Call me cheap, but a managed DB also seems to cost $3-5k per annum more than knowing how to do it yourself on a few commodity VPSes, plus you don’t have the knowledge either. Not to mention lock in - call me paranoid, but my experience with Azure and Backbone so far makes me really like the idea of being able to reprovision the application quickly at another provider.

So for those of you who have implemented HA on your SaaS:

What stack did you use?
Was there open source middleware + configuration recipes already available?
Did you get outside help, and was it useful?
What were the unforeseen pitfalls?
How did it all work out?
Any HA-related advice at this stage in the journey?

anthonyfranco · December 31, 2016, 10:20pm

I think postgres just came out with a native replication solution, so you might want to look into that.

For me, I have a Rails + MySQL app. Using master-master replication I always have a hot spare server available.

I then use the failover service that my DNS provider has. When it detects my main server is down it automatically switches the DNS to the backup server.

garrettdimon · December 31, 2016, 11:42pm

The biggest thing I’d advise is to be really sure it’s worth it. Downtime sucks, but introducing a more complicated architecture can bring its own growing pains and lead to more downtime until you get everything tweaked just right.

We were running Rails, but we tried Rackspace and Linode. In both cases we were pretty happy. This article kind of outlines how we progressed. https://startingandsustaining.com/chapter-11-servers-architecture-48a01ae1fd7d#.r6zk5gaf9 Ultimately, Linode was significantly cheaper and easier to work with. Their node balancers were also better for us.

The biggest problem we found was that the standard load balancers through Rackspace and Linode were both less than perfect in terms of configuration and upgrades. With Rackspace, we ended up spinning up our own load balancers. Linode’s were good enough for us, but they weren’t stellar.

We got outside help. We found a good sys admin and paid him up front for the migration work and then kept him on retainer for $200/mo for a couple of hours of work whether we used him or not. That was priceless. Any time there was a significant vulnerability or upgrade, he was on top of it, and we didn’t even have to worry.

The unforeseen pitfalls were mainly that moving to a higher availability architecture meant a lot of configuration changes, and several of those required extensive testing and debugging to get right. So it was painful, to be sure.

The article above pretty well details how I’d do it over again. It worked well, but it wasn’t without its challenges. Maintaining a parallel staging environment that mimics production as much as possible was a huge help. It cost a little more, but it made it drastically easier to test and make architectural changes. We even maintained a load balancer in front of a single web server for staging so that it matched the production configuration more closely.

Our biggest lesson was that even if you make the application servers redundant, the load balancer and DB can still be a single point of failure. Ultimately, it was a lot better and made it easier to update the web and application servers with minimal downtime, but we didn’t do any kind of auto-failover for the database. About once or twice a year, we’d be down for 5 minutes to update and reboot the database server, but that was more than acceptable in our case since usage was very low on the weekends.

My biggest advice would be to make sure it’s really necessary and not to do it just to aim for 5 9’s. Unless it’s a chronic problem that’s affecting customers, you can spend a lot of effort chasing only moderate improvements.

There’s a slightly dated but still relevant article from Signal vs. Noise that articulates this a little better:
https://signalvnoise.com/archives2/dont_scale_99999_uptime_is_for_walmart.php

Hope that helps.

aeden · January 1, 2017, 8:27am

I’ll second what @garrettdimon said: make sure it is really necessary before making the leap to HA. You may be better off focusing on Mean Time To Recovery (MTTR) first. This could mean a hot standby so that if one VPS fails you can fail over to another quickly. Alternatively it could mean having the means to stand up a new VPS somewhere quickly (with reliable configuration management scripts ready to handle the stand up with minimal effort).

We ended up spending a lot of time over the past year working on HA for our web systems, and we’re still not at the point where it can happen automatically with Postgres.

For the web we use Anycast inside our data center and blue/green deploys. Thus when a node is being deployed it stops broadcasting BGP during the deploy, and then announces itself again once the deploy is complete. The benefit of doing this is that we don’t need any sort of load balancing outside of BGP routing.

For Postgres, we tried out several different tools, from pgpool, to carp, and now we are trying keepalived. HA at the database layer is by far the most difficult part of the HA stack. At this point we can perform manual failovers, which is much better than where we were in the past, but we are still aiming to get to automatic failovers.

If you integrate with any third-party services, especially if you are dependent on them, make sure to consider that as well. All the HA in the world doesn’t do a bit of good if some third-party service you depend on goes down. Also keep in mind that true HA means distributing across multiple data centers as well, which adds yet another layer of complexity.

Anyhow, back to the original point, as Garrett said: don’t do it if you don’t really need it.

craigvn · January 1, 2017, 11:14am

I run my app in Azure. I have about 4 web servers, one SQL Azure database, Redis caching and a few other services. Price is around $800 per month. I could go cheap and install it on some cheap VPS, but I value the reliability. And the cost is not exponential: if I double or triple in size, the hosting bill won’t do the same.

But it depends a lot on the product you are offering. If I am down for 30 minutes that is a big deal; down for a few hours would be terrible. It needs to be self-managed so I can sleep knowing my overseas users will always be up when I am down.

fideloper · January 1, 2017, 4:33pm

I’ve always avoided PG because its high availability options always seemed so poor (or rather, overly complex). I’ve yet to find a cogent guide on it (whereas with MySQL you can be replicating data to a replica really quickly). Maybe @aeden can point us in the direction of some resources?

That being said, regular maintenance and automation around any database tech is harder. If it comes down to paying a person for their time or paying for a hosted database, I’d likely go with the hosted database (up to you and your needs of course).

The Mean Time To Recovery idea is pretty appealing. Generally I at least have some Ansible roles or similar set up so I can duplicate an application server quickly even if I don’t have a hot standby around (this has saved my bacon before).

rfctr · January 2, 2017, 6:24pm

+1

I work for large corps, and even they, with all their money and resources, prefer not to do master-master HA for databases. Instead, they have a three-nines SLA (about 8.8 hours of downtime a year). Applications and middleware have measures to degrade gracefully when the DB is not available - usually keeping mostly-static data (e.g. the product catalogue) in a distributed cache and having cache-for-availability records (i.e. hit the DB, but if it is not available, return the last known record from the cache - with a warning that it may be stale).
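
A minimal Python sketch of that cache-for-availability idea, for illustration only. The names `AvailabilityCache`, `DatabaseUnavailable` and `load_from_db` are made up for this example, and a plain in-process dict stands in for the distributed cache mentioned above.

```python
import time
from typing import Any, Callable, Dict, Tuple


class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) loader when the DB cannot be reached."""


class AvailabilityCache:
    """Serve fresh data from the DB, falling back to the last known value."""

    def __init__(self, load_from_db: Callable[[str], Any]) -> None:
        self._load_from_db = load_from_db
        # key -> (value, unix timestamp of the last successful DB read)
        self._last_known: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Dict[str, Any]:
        try:
            value = self._load_from_db(key)
        except DatabaseUnavailable:
            if key not in self._last_known:
                raise  # nothing cached, so there is nothing to degrade to
            value, fetched_at = self._last_known[key]
            # Flag the record so the caller can warn that it may be stale.
            return {"value": value, "stale": True, "fetched_at": fetched_at}
        fetched_at = time.time()
        self._last_known[key] = (value, fetched_at)
        return {"value": value, "stale": False, "fetched_at": fetched_at}
```

The DB is always tried first, so the cache only matters during an outage. The `stale` flag is what lets the application show the "may be out of date" warning.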

The only place I’ve seen a HA DB was a large bank, its central customer records store. And it had to use HP NonStop – a huge, expensive, hardware-supported high availability platform, with a specially designed DB.

I’m not a DBA, but as far as I understand it, master-master is not really a complete solution for HA unless both masters conform to a distributed transactions protocol, which I doubt is the case for MySQL or even Postgres. MySQL replication seems to be based on binary log transfer, so what happens if a record committed on Master A gets rejected on Master B due to other updates made recently? I expect the servers are now out of sync - hopelessly.

A severed link between two masters can also create an out-of-sync configuration which may or may not be corrected when the link is back up.

Really, master-master is more for performance than for HA, I think.

I believe Master-Hot Standby is a more reliable (because it is simpler) design – all requests are handled by Master A, and replicated to Master B, but only when Master A has an issue the requests flow is redirected fully to Master B. Master A then should be stopped, investigated, possibly wiped out, and become a standby Master for B. That all can be automated, tho the initial state transfer from the new Master to the new standby can take time.
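
The "redirect only when Master A has an issue" decision could be sketched roughly like this in Python. `StandbyPromoter`, `check_primary` and `promote_standby` are hypothetical names; the real health check and promotion step depend entirely on your DB and tooling, and the three-failure threshold is just an assumed default to avoid flapping on a single missed check.

```python
from typing import Callable


class StandbyPromoter:
    """Promote the standby after several consecutive failed health checks."""

    def __init__(self, check_primary: Callable[[], bool],
                 promote_standby: Callable[[], None],
                 failures_needed: int = 3) -> None:
        self._check_primary = check_primary
        self._promote_standby = promote_standby
        self._failures_needed = failures_needed
        self._consecutive_failures = 0
        self.promoted = False

    def tick(self) -> bool:
        """Run one health check; return True if this call triggered promotion."""
        if self.promoted:
            # Promotion is one-way: the old primary has to be rebuilt as a
            # standby before it takes traffic again.
            return False
        if self._check_primary():
            self._consecutive_failures = 0
            return False
        self._consecutive_failures += 1
        if self._consecutive_failures >= self._failures_needed:
            self._promote_standby()
            self.promoted = True
            return True
        return False
```

You would call `tick()` on a timer. Note that the watchdog running this is itself something that can fail or get partitioned from the primary, which is part of why this is harder than it looks.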

The HA DB for a bank mentioned above was master-standby, BTW. It says something, eh? (It uses hardware interruption to stop and resume the clients’ requests, so the downtime is sub-second and clients won’t notice the failures, but still.)

rfctr · January 2, 2017, 5:33pm

+1 I believe this is pretty much the best design reasonable money can buy.

Only one comment - a load balancer (LB) can be a device or DNS records. If it is a device (just as you said in your post, but not in the article) it is a single point of failure.

For HA purposes a set of DNS records with short TTL and a watchdog (2+) that updates DNS when an app server goes down may be a better solution.

A device tho has a faster reaction time when a server stops responding - the traffic is routed to healthy nodes almost instantly vs 60s at least for DNS.
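
To make the DNS option a bit more concrete, here is a rough Python sketch of the two pieces of logic a watchdog needs: which records to publish, and how long clients may keep hitting a dead server. Function names and IPs are invented for the example, the actual DNS update call depends on your provider and is left out, and the estimate assumes resolvers honour the TTL (not all do).

```python
from typing import Dict, List


def records_to_publish(health: Dict[str, bool]) -> List[str]:
    """Return the IPs that should be in the DNS answer, given health checks."""
    healthy = sorted(ip for ip, ok in health.items() if ok)
    # If every check fails, the watchdog itself may be the broken part, so
    # keep publishing all servers rather than an empty record set.
    return healthy if healthy else sorted(health)


def worst_case_failover_seconds(check_interval: int, failures_needed: int,
                                ttl: int) -> int:
    """Rough estimate of how long clients may keep resolving a dead server."""
    detection = check_interval * failures_needed
    return detection + ttl
```

For example, checking every 10 seconds, requiring 3 failures and using a 60-second TTL gives `worst_case_failover_seconds(10, 3, 60) == 90`, which is where the gap versus a device-based LB comes from.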

stympy · January 2, 2017, 5:47pm

Postgres replication isn’t too bad, actually. Here’s the documentation for it:

https://www.postgresql.org/docs/9.6/static/warm-standby.html

Basically what you do is set up the leader to log the WALs (same as the binlog in mysql) and grant permission to a replicator user. You then set up one or more followers with a backup from the leader (using pg_basebackup) and a recovery.conf (which can be created by pg_basebackup) that points the follower to the leader. The final step is to set up archive_command on the leader and a restore_command on the follower to archive the WAL segments so that they can be accessed by the follower in case it gets too far behind the leader. At Honeybadger we use wal-e for this, as it saves the WAL segments to S3, so we can recover in another datacenter if necessary.
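
Since the OP is on Python: a small sketch for checking how far a follower is behind the leader, by comparing WAL positions (LSNs). This is only an illustration, not something from our setup. It assumes you have already fetched the two LSN strings yourself, e.g. with `pg_current_xlog_location()` on the leader and `pg_last_xlog_replay_location()` on the follower in 9.6; the helper names are made up.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to an absolute byte offset.

    An LSN is printed as two hex numbers: the high 32 bits, a slash, and the
    low 32 bits of a 64-bit position in the WAL stream.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def replication_lag_bytes(leader_lsn: str, follower_lsn: str) -> int:
    """Bytes of WAL the follower still has to replay (never negative)."""
    return max(0, lsn_to_bytes(leader_lsn) - lsn_to_bytes(follower_lsn))
```

Alerting when this number keeps growing is a cheap way to notice a follower that is about to need the archived WAL segments.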

As for postgres HA, that is a bit more complicated. We are moving to Patroni for this. It runs postgres instances for you, using etcd/consul/zookeeper to store information about the leader and followers, and updating individual postgres instances as necessary when the leader fails. To handle failover for the clients talking to the leader, we use consul-template to rewrite the configuration file for pgbouncer which is running along-side the application. Each application points to the pgbouncer on localhost rather than to the leader itself.
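
To show what the config rewrite amounts to, here is a hedged Python stand-in for the templating step (in our case consul-template does this, not Python). The function names, the database name and the IPs are invented for the example, and only the `[databases]` section is rendered; a real pgbouncer.ini also needs a `[pgbouncer]` section.

```python
import os
import tempfile


def render_pgbouncer_databases(leader_host: str, dbname: str,
                               leader_port: int = 5432) -> str:
    """Render a [databases] section that routes `dbname` to the leader."""
    return (
        "[databases]\n"
        f"{dbname} = host={leader_host} port={leader_port} dbname={dbname}\n"
    )


def write_if_changed(path: str, content: str) -> bool:
    """Atomically replace `path` if the content differs; True if rewritten."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            if f.read() == content:
                return False  # leader unchanged, no reload needed
    except FileNotFoundError:
        pass
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(content)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem
    return True
```

When `write_if_changed` returns True, pgbouncer has to be told to reload its config (SIGHUP or `RELOAD` on its admin console) before clients are sent to the new leader.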

Regarding HA in general: as others have said, it’s a tough nut to truly crack. Getting to a place where you can recover from downtime within minutes with human intervention is a great milestone to achieve in your pursuit of minimal downtime. Once you get there, you’ll have a good idea of what you’ll need to have in place to be able to more fully automate recovery from various types of failure.

rfctr · January 2, 2017, 6:23pm

I believe it has been there since version 9. Before that, replication was really quite cumbersome and there were a number of third-party tools, each with their own quirks.

stympy · January 2, 2017, 5:37pm

Ah, yes, before version 9 it was a nightmare.

Incertitude · January 3, 2017, 9:05am

I guess I should make it clear that I’m not looking for five 9s, but rather the ability to get on a plane or drive somewhere without phone reception as a one-man operation. If something goes down during that time, I would want automated failover. This is easy enough for the filesystem and application servers, but much less easy for the DB compared to manual intervention.

maximus · January 7, 2017, 9:05pm

If you really need HA and want to sleep at night, nothing beats managed AWS (or Azure) services.

The absolute minimum config would be:

Two VPS instances in different zones, behind the load balancer. You can also choose pre-built managed stacks such as Elastic Beanstalk.
Managed load balancer, such as ELB. HAProxy is better than ELB but it becomes a single point of failure.
Managed database service, such as RDS. Preferably with automatic multi zone master-slave replication.
Managed cache service with persistent cache store, such as ElastiCache Redis. Preferably with automatic multi zone master-slave replication.
Managed DNS service, such as Route53.

Expect to pay between $500 and $2000 / month depending on instance type and additional services.

stympy · January 7, 2017, 9:45pm

HAProxy doesn’t have to be a SPOF if you run it on two instances in different AZs, each with its own EIP, Route 53 health check, and Route 53 record. You can then create an autoscaling group with an AMI that has code to attempt to assume the EIPs at boot.

intelligentcoder · January 8, 2017, 4:02am

I’m in a similar situation where I feel like I can’t go out without my phone because I’m the only one responsible for keeping my app online. It was fine to have outages when I had 20 clients, but now that there are hundreds of clients using the app every day, I can’t have outages without pissing some people off.

I’ve been trying to solve this for a couple of years, and I’ve learned that the only real solution is to hire someone that can cover for you when you are not available. Unfortunately, that is not always an option for bootstrappers like us. I could hire someone, but at the moment I’m okay with being on call 24/7 in order to maximize my profits. My plan is to hire someone when I start to feel burnout symptoms.

In the meantime, I’ve been relatively successful with HA without spending lots of money (under $400). I have 2 dedicated servers in different data centers in different regions and a cheaper VPS. Each of the dedicated servers is powerful enough to host the application and database on its own and would be able to handle at least 5x the current load.

One dedicated server is the primary and the other one is a hot backup. I use totaluptime.com to fail over the application if the primary is unreachable for 30 seconds. This has worked flawlessly at the application level.

For the database, I use a MongoDB replica set (3 members), one on each server. The automatic failover works well most of the time. There were 2 or 3 times where it didn’t work as expected and I had to intervene. The downside is that I had to get really comfortable with managing and securing a MongoDB replica set; this is time I should have spent on the product or on marketing. By the end of this year I want to move it to Mongo Atlas, which is not expensive IMO.

As I said, even with the relative success, I would only feel comfortable going completely offline if I had a dev ops person that could cover for me.

Incertitude · January 8, 2017, 6:41am

Thanks for that; apart from the tech stack it closely describes my situation … except that I’d like to hire someone before I start feeling burned out.

SteveMcLeod · January 8, 2017, 12:40pm

Two stories:

I pay a sys admin for a few hours work per month (on a retainer). I found him via Upwork and pay US$30/hour. It took a while to find the right person, but well worth the money.
A friend of mine runs a 1-person SaaS site that does website monitoring. It needs to have very good uptime. The same friend drove a car from Germany to Mongolia. He could do that because outages are so infrequent. He achieved great uptime by relentlessly improving the infrastructure after every outage so that the cause of the outage was very unlikely to happen again.

craigvn · January 8, 2017, 9:34pm

This is pretty much where I am at. You are not going to get HA for $30, or even $100, a month.

rfctr · January 9, 2017, 7:48pm

That’s cheating: document-oriented DBs are so much easier to clusterize.

jitbit · January 11, 2017, 5:58pm

A bit late to the show, but here’s my 2 cents.

(mostly aws-related and pretty trivial, sorry)

Back up from the outside. For example, backing up a database using the actual DB-server affects your server’s performance. OTOH, taking a snapshot of your EBS-drive uses Amazon’s resources. Snapshotting even uses a separate physical network-card on the storage-server (the one that hosts your EBS disks), and does not exhaust your own disk-speed/iops-credits/bandwidth/cpu at all; it literally does not touch your server. The procedure is smart enough to back up only the diff-ed data, so it’s ridiculously fast (100 GB drive in 5 seconds, not bad huh). Taking a snapshot is a free operation and Amazon offers an API for this, so you can basically take a snapshot every minute (!), who cares! Just make sure to clean them up once in a while, because $$$

Keep a stopped copy of all servers - in AWS, a “server” is just a line in some config file somewhere (when it’s not running), so it’s free to keep one. If something goes wrong with the hardware - you don’t have to shut down or “force-stop” the frozen machine, and wait until it responds… Just cut the EBS disks, throw them at a new server, launch. 10 seconds. We even keep a script for that.
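
For the snapshot cleanup mentioned above, here is a rough Python sketch of a retention rule (this is not our actual script). It is a pure function on purpose: the listing and deleting calls against EC2 are left out, `snapshots_to_delete` is a made-up name, and the `SnapshotId`/`StartTime` keys are just assumed to mirror what the EC2 API returns for each snapshot.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def snapshots_to_delete(snapshots: List[Dict], now: datetime,
                        max_age: timedelta,
                        keep_at_least: int = 3) -> List[str]:
    """Return IDs of snapshots older than `max_age`.

    The newest `keep_at_least` snapshots are never returned, so a stalled
    backup job cannot lead to every snapshot being cleaned up.
    """
    newest_first = sorted(snapshots, key=lambda s: s["StartTime"], reverse=True)
    candidates = newest_first[keep_at_least:]
    return [s["SnapshotId"] for s in candidates
            if now - s["StartTime"] > max_age]
```

Run it on a schedule and delete whatever IDs come back. The `keep_at_least` floor is the design choice worth keeping even if you change everything else.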

PS. handy stuff: https://github.com/colinbjohnson/aws-missing-tools (especially the “ec2-automate-backup” bit)