Measure Twice, Cut (over) Once
This past weekend we did a deploy at Yieldbot unlike any other we’ve done before.
At its completion we had:
- upgraded from using Python 2.6 to 2.7.3;
- reorganized how our realtime matching index is distributed across systems;
- split off monitoring resources to separate servers;
- moved out git repos that were submodules to be sibling repos;
- changed servers to deploy code from the Chef server instead of directly from GitHub;
- completely transitioned to a new set of servers;
- suffered no service interruption to our customers.
The last two points deserve some emphasis. At the end of the deploy, every single instance in production was new – including the Chef server itself. Everything in production was replaced, across our multiple geographic regions.
Like many outfits, we do several deploys a week, sometimes several a day. Avoiding service disruption is always critical, and in most deploys it is also fairly straightforward. This one was big.
The procedures we had in place for carrying it out were robust enough, though, that we didn’t even notify anyone on the business side of the company when the transition was happening. The only notification was getting sign-off from Jonathan (CEO) on Friday morning that the cut-over would probably take place soon. In fact, we didn’t notify anyone *after* the transition took place either, unless you count this tweet:
I suppose we cheated a little by doing it late on a Saturday night though.
We had a few kinds of data to consider: realtime streaming, analytics results, and configuration data.
Realtime Streaming and Archiving
For archiving of realtime stats, the challenge was going to be the window of time when old systems were still receiving requests while new servers were starting to take their place. In addition to zero customer impact, we demanded zero data loss.
This was solved mostly by preparation. By having each archive include the name of the source doing the archiving, the old and new servers could both archive data to the same place without overwriting each other.
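As a minimal sketch of that naming idea, assuming archives are written under string keys: the `archive_key` helper, the key layout, the host names, and the timestamp below are all hypothetical, chosen only for illustration, and are not our actual archiving code.

```python
import socket
from datetime import datetime


def archive_key(stream, when, source=None):
    """Build an archive key that embeds the name of the server writing it."""
    # Default to this machine's hostname so every server writes under its own name.
    source = source or socket.gethostname()
    stamp = when.strftime("%Y/%m/%d/%H")
    return "{}/{}/{}.log".format(stream, stamp, source)


# Arbitrary example hour; an old and a new server archive the same window.
hour = datetime(2020, 1, 4, 23)
old_key = archive_key("realtime-stats", hour, source="edge-old-01")
new_key = archive_key("realtime-stats", hour, source="edge-new-01")
```

Because the source name is part of the key, two servers archiving the same hour produce distinct keys, so neither can clobber the other's data.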
We currently have a number of MongoDB servers that hold the results of our analytics processes, and store the massive amounts of data backing the UI and the calculation of our realtime matching index.
Transitioning this relied mostly on MongoDB’s master-slave capabilities. We brought up the new instances as slave instances pointing to the old instances as their master. When it was time to go live on the new servers, a re-sync with Chef reverted them to acting as masters.
There was a little bump here: an old collection ran into a replication problem and was growing much larger on the new instance than on the old instance. Luckily it was an older collection that was no longer needed, and dropping it altogether on the old instance got us past that.
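A size mismatch like that is the kind of thing a simple comparison can surface. The sketch below is an assumption about how such a check might look, not something we ran: `divergent_collections`, the 10% tolerance, and the collection names and sizes are all made up, and the per-collection sizes are assumed to have been gathered from each instance beforehand.

```python
def divergent_collections(old_sizes, new_sizes, tolerance=0.10):
    """Return names of collections whose size differs by more than `tolerance`."""
    flagged = []
    for name, old_size in sorted(old_sizes.items()):
        new_size = new_sizes.get(name, 0)
        # Relative difference, measured against the old (master) size.
        baseline = float(max(old_size, 1))
        if abs(new_size - old_size) / baseline > tolerance:
            flagged.append(name)
    return flagged


# Hypothetical sizes: one collection replicated cleanly, one ballooned.
old = {"events": 1000, "legacy_stats": 200}
new = {"events": 1010, "legacy_stats": 900}
flagged = divergent_collections(old, new)
```

Here only `legacy_stats` is flagged, since `events` differs by 1% while `legacy_stats` is several times its original size.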
Transitioning the config data was made easy by the fact that it uses a database technology that we created here at Yieldbot called HeroDB (which we’ll have much more to say about in the future).
The beneficial properties of this database in this case are that it is portable and can be easily reconciled against a secondary active version. So we copied these databases over and had peace of mind that we could reconcile later as necessary with ease.
We tested the transition in a couple different ways.
As we talked about in an earlier blog post, we use individual AWS accounts for developers with Chef config analogous to the production environment. In this case we were able to bring up clusters in test environments along the way before even trying to bring up the replacement clusters in production.
We also have test mechanisms in place already to test proper functioning of data collection, ad serving, real time event processing, and data archiving. These test mechanisms can be used in individual developer environments, test environments, and production. These proved invaluable in validating proper functioning of the new production clusters as we made the transition.
The Big Switch – DNS
DNS was the big switch to flip to take the servers from “ready” to “live”. To be conservative, we placed one of our new edge servers (which would serve a fraction of the real production traffic in a single geographic region) into the DNS pool and verified everything worked as expected.
Once verified, we put the rest of the new edge servers across all geographic regions into the DNS pools and removed all of the old edge servers from the DNS pools.
The switch had been flipped.
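The sequence of canary first, then swap, can be sketched roughly as follows. This is an illustration under assumptions, not our tooling: a plain Python set stands in for one region's DNS pool, `cut_over` and the server names are hypothetical, and `verify` is a placeholder for whatever checks confirm a server is handling traffic correctly.

```python
def cut_over(pool, old_servers, new_servers, verify):
    """Canary one new server, then swap the rest of the pool.

    `pool` is a set of server names standing in for a DNS pool; `verify`
    is a callable returning True when a server handles traffic correctly.
    """
    new_servers = list(new_servers)
    canary = new_servers[0]
    pool.add(canary)  # step 1: one new server takes a fraction of traffic
    if not verify(canary):
        pool.discard(canary)  # roll back; the old servers never left the pool
        return False
    pool.update(new_servers)  # step 2: add the remaining new servers
    pool.difference_update(old_servers)  # step 3: retire the old servers
    return True


pool = {"old-1", "old-2"}
ok = cut_over(pool, ["old-1", "old-2"], ["new-1", "new-2"], verify=lambda s: True)
```

The point of the ordering is that the old servers stay in the pool until the canary has been verified, so a failed check leaves production exactly as it was. A real DNS change also takes effect gradually as cached records expire, which this sketch does not model.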
There Were Bumps (but no bruises)
There were bumps along the way. Just none that got in our way. Testing as we went, we were confident that functionality was working properly and could quickly debug anything unexpected. As any fine craftsman knows, you cut to spec as precisely as possible, and there’s always finish work to get the fit perfect.
The star of the show, other than the team at Yieldbot that planned, coded, and executed the transition, was Chef.
We continue to be extremely pleased with the capabilities of Chef and the way that we are making use of it. No doubt there are places where it is tricky to get what we want. And of course there’s a learning curve in understanding how the roles, cookbooks, and recipes all work together, but when it all snaps into place, it’s like devops magic.