MongoDB Elections

On Monday of this week, Amazon’s EC2 service suffered a major outage, which they call “performance issues”, which we all know is simply not true.

This is not a post about how Amazon has failed us. Everyone goes down. We use AWS because it’s flexible, and we need the flexibility. This is a post about how Gimme Bar went down due to this outage, despite our intentions of making everything resilient to these types of failures. It is a post about how I accidentally misconfigured our MongoDB Replica Set (“RS”).

When one of the us-east availability zones died (aside: this was us-east-1c on the Fictive Kin AWS account, but I’ve learned that the letter is assigned on a per-account basis, so you might have lost 1a, 1e etc.), I knew what was wrong with the RS right away. In talking this over with a few friends, it became clear that the way MongoDB elections take place can be confusing. I’ll describe our scenario, and hopefully that will serve as an example of how to not do things. I’ll also share how we fixed the problem.

Gimme Bar is powered by a three-node MongoDB replica set. A primary and a secondary, plus a voting-but-zero-prority delayed secondary. The two main nodes are nearly-identical, puppeted, and are in different Amazon AWS/EC2 Availability Zones (“AZ”). The delayed secondary actually runs on one of our web nodes. It serves as a mostly-hot “oops, we totally screwed up the data” failsafe, and is allowed to vote in RS elections, but it is not allowed to become primary, and the clients (API nodes) are configured to not read from it.

In the past, we did not have the delayed secondary. In fact, at one point, we had three main nodes in the cluster (a primary and two secondaries, all configured for reads (and writes to the primary) by the API nodes).

In order for MongoDB elections to work at all, you need at least three votes. Those votes need to be on separate networks in order for the election to work properly. I’ll get back to our specific configuration below, but first, let’s look at why you need at least three votes in three locations.

To examine the two-node, two-vote scenario, let’s say we have two hypothetical, identical (for practical values of “identical”) nodes in the RS, in two separate locations: Castle Black and Winterfell. Now, let’s say that there’s a network connection failure between these two cities. Because the nodes can’t see each other, they each think that the other node is down. This makes both nodes attempt an election, but they both destroy their own votes because there is not a majority. (A majority is ((“number of nodes” ÷ 2) + 1), or in this scenario: 2 nodes. The election fails, the nodes demote themselves to secondary, and your app goes down (because there’s no primary).

To solve this problem, you really need a third voting node in a third location: King’s Landing. Then, let’s say that Castle Black loses network connectivity. This means that King’s Landing and Winterfell can both vote, and they do because they have a majority. They come to a consensus and nominate Winterfell (or King’s Landing; it doesn’t matter) to be Primary, and you stay up. When Castle Black comes back online, it syncs, becomes a secondary, and the subjects rejoice.

MongoDB has non-data nodes (called arbiters). These can be helpful if you’re only running two MongoDB nodes, and don’t want to replicate your data to a third location. Imagine it’s really expensive to get data over the wall into King’s Landing, but you still want to use it to vote. You could place an arbiter there, and in the scenario above where Castle Rock loses connectivity, King’s Landing and Winterfell both vote. Since King’s Landing can’t become primary (it has no data), they both vote for Winterfell, and you stay up. When Castle Rock rejoins the continent, it syncs and becomes secondary… and the subjects rejoice.

So, back to Gimme Bar. In our old configuration, we had three (nearly) identical nodes in three AZs. When one went down, the other two would elect a primary, and our users never noticed (this is far better than rejoicing). At one point, we upgraded the memory on our database nodes, and realized that we really only needed one secondary (two nodes). As discussed above, we can’t run a RS with just two nodes, so we added an arbiter on one of our ops boxes, which was in a third AZ. We were still AZ-failure tolerant.

Then, at some point, we thought about the “Sean accidentally types db.users.remove() into a late-night console and the users do the opposite of rejoicing” scenario. Thus, we set up one of our web nodes to act as a delayed secondary, as described above. When we did this, we removed the now-redundant arbiter from the RS. We still had three votes in the RS, so all was good… right? Not exactly.

What we neglected to notice is that gbweb01 (where we set up the delayed secondary) was in the same AZ as gbdb03 (our priority Primary). This was, unfortunately, the same AZ that suffered performance issues on Monday. So, a majority of our voters (two of the three) were knocked out, and gbdb04 (normally our wired secondary) was unable to elect itself primary, so we went down. Luckily, so did about half of the Internet, so we were just noise in an otherwise-noisy Monday afternoon.

To solve the problem, after Amazon had mopped up its mess, I simply moved the delayed secondary to gbweb03 which is not in the same AZ as gbdb03 or gbdb04 and reconfigured the RS. Sync, secondary, three votes, and our cluster is happily redundant and AZ-fault-tolerant again. During the outage, I could also have just reconfigured the RS to give gbdb04 the only vote, thus forcing it to become primary, but we were already under pretty heavy load from the API nodes screaming “where did the DB go?!” so we just waited it out at that point.

In discussing this whole thing with Paul, he mentioned that he was setting up a Mongo RS for his most-excellent Where’s It Up service and asked me to take a look at his RS config.

Paul has lots of servers in lots of places, so he set up MongoDB nodes on three of them: Washington, San Antonio and Montreal. He wanted Washington to be the primary whenever possible, though, so he set up an additional arbiter on the same box (but different port) in Washington. So, now, his RS had 4 votes: two in Washington, one in San Antonio, and one in Montreal. This is not immediately obvious, but let’s say that Washington were to go down. San Antonio and Montreal would say “we each have one vote. That’s two votes. Out of four. We’re not a majority!” and they would demote themselves to secondaries, waiting for Washington to be restored. The solution is to remove the arbiter. It’s one less vote, and Washington doesn’t hold two. Now if any node goes down, the other two each get a vote (2/3, a majority), and the election can proceed as intended.

Hopefully this was easy to follow without illustrations or other, specific configuration data. If not, please comment, and I’ll help however I can. Obviously, this is not meant as a guide to configuring RS elections, but more of an anecdotal guide to not-configuring-your-RS-improperly. Don’t make my mistakes. (-: