How did you make the jump to high availability?

I’d have to echo those who mention AWS in this thread.

I have “high enough” availability with their almost-default settings for Multi-AZ RDS instances. Meaning, I have 2 master databases (1 replicating to the other) in 2 physically separate Availability Zones/networks.

If 1 goes down, they (meaning AWS) automatically switch over to the db in the other Availability Zone.

Read more here: http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZ.html
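
(If you’d rather script it than click through the console, here’s a rough boto3 sketch of creating one of these Multi-AZ instances; the identifier, credentials and storage size are made-up placeholders, not my actual setup:)

```python
# A sketch of spinning up a Multi-AZ MySQL instance on RDS with boto3.
# Every identifier/credential/size here is a made-up placeholder.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="myapp-db",     # placeholder name
    DBInstanceClass="db.t2.small",       # the size behind the price I quote below
    Engine="mysql",
    MasterUsername="admin",              # placeholder
    MasterUserPassword="change-me",      # placeholder
    AllocatedStorage=20,                 # GB, placeholder
    MultiAZ=True,                        # the "high enough availability" switch
)
```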

For this, with a moderate DB server on one of my products I paid $65.47 last month:

“$0.088 per RDS db.t2.small Multi-AZ instance hour (or partial hour) running MySQL / 744 Hrs / $65.47”

This by no means “cost me a mint” and I could even go cheaper (i.e. a smaller RDS instance). Also, at $65.47/month that’s about $786 per annum, a far cry from your $3-$5k estimate.

Sure, you could scream “lock-in!” about using AWS and their HA systems, whatever…

When I get big enough and have a large tech staff, I can have them spin their wheels and solve that problem for me. For now, I want to focus on Product + Customers, not trivial things like this that have been solved for a reasonable price by others.

…just my $0.02

Love this, with the one caveat that backing up a database by backing up the disk can (if anyone knows better, let me know!) give you a corrupt or incomplete DB backup, depending on what operations were taking place in the DB at the time of the backup.
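
(One way I know of to sidestep that, at least on MySQL, is a logical dump instead of a raw disk copy; a rough sketch, where the host, credentials and output path are placeholders:)

```python
# A sketch of a consistent logical backup with mysqldump, as an alternative
# to copying the disk: --single-transaction gives a consistent snapshot of
# InnoDB tables without blocking writers. Host/credentials/path are placeholders.
import subprocess

with open("/backups/myapp.sql", "wb") as out:
    subprocess.run(
        [
            "mysqldump",
            "--single-transaction",
            "--all-databases",
            "--host", "db.example.internal",   # placeholder host
            "--user", "backup",                # placeholder user
            "--password=change-me",            # placeholder password
        ],
        stdout=out,
        check=True,
    )
```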

Yeah. But most DB engines have some kind of “check-db” command that restores integrity after a disaster. You might lose a record or two, but it’s mostly ok. It also depends on the DB engine: MySQL can be sensitive to this, Postgres is better, and SQL Server is practically invincible (it handles physical corruption, transaction-log deletion, sudden power-offs, etc.).
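
(To be concrete, the kind of commands I mean are mysqlcheck on MySQL and DBCC CHECKDB on SQL Server; a minimal sketch of the MySQL one, with placeholder credentials:)

```python
# A sketch of running MySQL's integrity check (the "check-db" idea above).
# SQL Server's equivalent would be `DBCC CHECKDB`, run from a SQL client.
# Credentials are placeholders.
import subprocess

subprocess.run(
    [
        "mysqlcheck",
        "--all-databases",
        "--check",
        "--auto-repair",           # attempt to repair any corrupted tables it finds
        "--user", "admin",         # placeholder
        "--password=change-me",    # placeholder
    ],
    check=True,
)
```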

You have to weigh the options of course. I meant backing up for huge disasters here, like a “server blew up” disaster, or a “Hurricane Sandy destroyed the datacenter” disaster… For regular annoyances, like “user X accidentally deleted his own data”, we have DB transaction logs and “point-in-time” recovery.
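
(On RDS that point-in-time recovery is basically one API call; a rough sketch, where the instance names and timestamp are placeholders:)

```python
# A sketch of an RDS point-in-time restore (the "user X deleted his own data" case).
# It restores into a NEW instance; you then repoint the app or copy rows back.
# Instance identifiers and the timestamp are placeholders.
import datetime
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="myapp-db",
    TargetDBInstanceIdentifier="myapp-db-restored",
    RestoreTime=datetime.datetime(2017, 2, 1, 14, 30, tzinfo=datetime.timezone.utc),
)
```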

You mean “master-standby” then, right? I.e. updates only go to one DB.

Yeah, not bad.

Exactly, thanks for the clarification.

“Amazon RDS automatically provisions and maintains a synchronous standby replica in a different Availability Zone.”

What happens with the old master when it fails and the standby becomes the new master?

Do you need to intervene and rebuild it manually, or will Amazon create a new instance for you? (I doubt it; there would be a performance hit while transferring the state to a new standby.)


So at that point it’s up to you to handle how you’d like to recover and set the old master back up. For example, the old master’s Availability Zone could still be down. But they make it pretty trivial to get a new DB instance back up and replicating with your new master DB’s data. A few button clicks on their dashboard and it’s building/replicating.
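
(And if you’d rather script those clicks, the API equivalent is roughly the sketch below; the instance name is a placeholder, and whether you re-enable Multi-AZ on the surviving instance or build a separate replica depends on your setup:)

```python
# A sketch of the API equivalent of those dashboard clicks: ask RDS to
# (re)provision a standby by turning Multi-AZ back on. The instance name is
# a placeholder; ApplyImmediately skips waiting for the maintenance window.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.modify_db_instance(
    DBInstanceIdentifier="myapp-db",
    MultiAZ=True,
    ApplyImmediately=True,
)
```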

+1 for everyone who said “don’t do HA, do ‘high-enough availability’”; totally agree.

(I’m a bootstrapper currently doing contract DevOps for a rather large multinational. True HA on master-master databases is a hard problem, and expensive. You either lose out on speed to ensure consistency, or risk data conflicts when mutually incompatible data is written in two different places at the same time.)

Use AWS or similar where possible; let them solve the hard problems.

The other comment I’d add is: whatever system you go with, don’t forget to test it. Make sure you’ve run through the disaster recovery scenario, following a documented recovery plan, at least once and preferably multiple times before you need it in a real situation.
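
(One cheap way to make that a habit: script a restore of the newest automated snapshot into a throwaway instance and check it comes up. A rough boto3 sketch, with placeholder identifiers:)

```python
# A sketch of a restore drill: take the newest automated snapshot and restore it
# into a throwaway instance, so you know the backups actually restore.
# Instance identifiers are placeholders; remember to delete the test instance after.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

snapshots = rds.describe_db_snapshots(
    DBInstanceIdentifier="myapp-db",
    SnapshotType="automated",
)["DBSnapshots"]
latest = max(snapshots, key=lambda s: s["SnapshotCreateTime"])

rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="myapp-db-dr-test",
    DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
)
```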

Don’t be like GitLab; they recently lost 6 hours of production data affecting 5k projects, with this immortal quote as part of the post-mortem: “So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place”.

This post-mortem is well worth reading, if you’re interested in this stuff.


Oh, this is a ready-to-use horror movie scenario!

I could feel in my own skin how the nameless engineer first broke into a sweat on realizing that the files being deleted were from the PROD database, and how that state turned into a hopeless haze as each successive safety measure turned out to be a dud. Progressive steps into despair and madness.

Too bad so few people can truly understand the events; it could never become a blockbuster.

(Of course, I may just not be suited for this line of work, and the engineers could have been cold, calculating and efficient in their fight for recovery.)

They were surprisingly calm about getting the problem fixed. They chose to do it with full transparency, including a live stream of their discussions and choices, and a public Google doc. Kudos to them for doing that!

I did something similar in my first job many many years ago, and never forgot that “oh hell what did I just do?” feeling. https://www.quora.com/What-is-a-coders-worst-nightmare/answer/Rachel-Willmer

My first such disaster: I was 21 and I deleted the company’s CRM database. The boss was out for the day. A co-worker let me fret for hours before he revealed that the database was safely backed up.

I crashed the Siebel (sales) system of a major telecom, blocking their sales nationwide. Two times in a row within 30 minutes.

However, I was not to blame, because a) someone else provided me with the PROD URL instead of SFT, and b) PROD Siebel was not fortified against wide ad-hoc searches and went OOM.

I actually did not know about the accidents until much later in the day.

Really… they had a backup that failed and no one checked? Lol… I would fire his ass in 5 seconds flat…

Their original plan was sound, but the execution was poor… People who do this stuff should give a shit… If you have a backup process and you don’t check it end to end in a staging environment on a regular basis, you deserve to be handed your own ass in a little box…

Most of those plans fail because people get complacent … hell, the stuff works most of the time, at least enough to put you to sleep and give you this false sense of confidence… it is living on borrowed time. Paranoia should be a daily thing even when all is well…

They lost me when I read that their S3 bucket was empty and it was a surprise to them… really… Thank god for GitHub…

And I don’t think it’s rocket science… you just need to execute a reasonable plan with care… the “with care” here is the most important part… lol

That (the 5 seconds) wouldn’t be wise. You’re in the middle of a crisis and you need every pair of hands, especially those familiar with the backups.

When the crisis is over, then you fire them. And maybe that is what happened; we don’t know.