
Why Yieldbot Built Its Own Database

About a year ago we made the decision to switch over all of our configuration to a new database technology that we would develop in-house, which we ended up calling HeroDB.

The motivating principle behind what we wanted in our configuration database was a reliable concept of versioning. We had previously tried to manually implement some concept of versioning in the context of available database technology. This would keep around older versions of objects in a separate area with a version number identifying them, and application logic would move copies of whole objects around as changes were made to them. Data events would contain versions of the objects that they were related to. While this did some of what we wanted, it was clear that this was not the solution we were looking for.

Enter Git

at the core of Git is a simple key-value data store. You can insert
any kind of content into it, and it will give you back a key that you
can use to retrieve the content again at any time.
    – Pro Git, Ch 9.2
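
To make the key-value idea concrete, here is a minimal Python sketch of how Git derives the key for a blob: the SHA-1 of a short header followed by the raw content. The function name `git_blob_key` is ours, chosen for illustration; it only computes the key and does not write anything to a repository.

```python
import hashlib


def git_blob_key(content: bytes) -> str:
    """Return the key Git assigns to `content` when it is stored as a blob."""
    # Git hashes a header ("blob <size>" plus a NUL byte) followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # The same content always yields the same key; different content, a different key.
    print(git_blob_key(b"test content\n"))
    print(git_blob_key(b"test content 2\n"))
```

Because the key is derived purely from the content, two identical values share one key, which is what makes a specific version of the data safe to cache.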

While we had these challenges thinking about the management of data in our application, we were already managing one type of data with perfect version semantics: our source code. Some simple searches told us we weren’t the first to think about leveraging Git to manage data, but there also wasn’t anything serious out there for what we were looking for.

So we thought hard about what Git would be able to provide us as a database (there are definitely both pros and cons) and how it intersected with what we were looking for in versioning our configuration data. A few of the things we liked:

  • every change versioned
  • annotated changes (comments) and history (log)
  • every past state of the database inspectable and reproducible
  • reverting changes
  • caching (by commit sha) – a specific version of the data doesn’t change

There were definitely cons, but we decided they were worth accepting given the strength of the benefits we’d be getting. A few of the cons we decided to live with:

  • comparatively slow, both reads and writes
  • size concerns, would shard
  • no native foreign keys

Some of these can be mitigated. For instance, read performance can be improved (with caveats) by having cloned repos for read that are updated as changes are made to the “golden” repo. To mitigate size concerns, and because there is no native concept of foreign keys, data can be sharded with no penalty to what can be expressed in the data.
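
The caching point from the list of benefits can be sketched in a few lines. Because a commit sha names an immutable snapshot, a cache keyed by (sha, path) never needs invalidation. This is an illustrative sketch under our own assumptions, not HeroDB's implementation: `ShaCache` and the `load` callback are hypothetical names, and the callback stands in for whatever actually reads from a repo.

```python
from typing import Any, Callable, Dict, Tuple


class ShaCache:
    """Read-through cache keyed by (commit sha, path)."""

    def __init__(self, load: Callable[[str, str], Any]) -> None:
        self._load = load  # hypothetical reader: (sha, path) -> value
        self._entries: Dict[Tuple[str, str], Any] = {}
        self.misses = 0

    def get(self, sha: str, path: str) -> Any:
        key = (sha, path)
        if key not in self._entries:
            # Data at a given sha never changes, so one load is enough.
            self.misses += 1
            self._entries[key] = self._load(sha, path)
        return self._entries[key]
```

A reader that always asks for a value at an explicit sha can be served from any clone, or from this cache, without worrying about staleness; only the lookup of "which sha is current" has to go to the golden repo.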

What We Did

Once we decided we liked the sound of having Git-like semantics on our data, we went looking for existing projects that provided what we wanted. Not satisfied with what we found, we next looked for a suitable programmatic interface to Git. In the end we found a good solution in a project named Dulwich (https://github.com/jelmer/dulwich), which is a pure Python implementation of Git interfaces.

With Dulwich as the core, we implemented a REST API providing the data semantics that we wished to expose.

In terms of modeling data, we took Git’s concept of Trees and Blobs, conceiving of Trees as objects and Blobs as attributes, with the content of the Blobs being the values of the attributes. The database itself exists as a Git “bare” repo. Data is expressed in JSON, where ultimately all values are encoded in Blobs and where Trees represent nesting levels of objects.

The following simple example illustrates the JSON representation of a portion of our database realized in Git and what that looks like in a working tree.

Example JSON:

{"alice@example.com": {"name": "Alice", "age": 18}, "bob@example.com": {"name": "Bob", "age": 22}}

In cloned repo:

$ find .
alice@example.com
alice@example.com/age
alice@example.com/name
bob@example.com
bob@example.com/age
bob@example.com/name

$ cat alice@example.com/name
"Alice"
$ cat alice@example.com/age
18
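
The mapping between the JSON document and the working tree shown above can be sketched in plain Python. This is not HeroDB's code and does not use Dulwich: `flatten` and `unflatten` are hypothetical helpers that assume keys contain no `/` and that every leaf is stored as its JSON encoding. Unlike `find`, the sketch lists only the leaf files, not the directories.

```python
import json
from typing import Any, Dict


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, str]:
    """Map nested JSON objects to {path: JSON-encoded leaf value}."""
    files: Dict[str, str] = {}
    for name, value in doc.items():
        path = prefix + name
        if isinstance(value, dict):
            # A nested object becomes a directory (a Git tree).
            files.update(flatten(value, path + "/"))
        else:
            # A leaf value becomes a file (a Git blob) holding its JSON encoding.
            files[path] = json.dumps(value)
    return files


def unflatten(files: Dict[str, str]) -> Dict[str, Any]:
    """Rebuild the nested JSON document from the path mapping."""
    doc: Dict[str, Any] = {}
    for path, text in files.items():
        *dirs, leaf = path.split("/")
        node = doc
        for d in dirs:
            node = node.setdefault(d, {})
        node[leaf] = json.loads(text)
    return doc


if __name__ == "__main__":
    example = {
        "alice@example.com": {"name": "Alice", "age": 18},
        "bob@example.com": {"name": "Bob", "age": 22},
    }
    for path, text in sorted(flatten(example).items()):
        print(path, "->", text)
```

Storing each attribute as its own file is what lets a diff or a revert touch a single value rather than a whole object.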

What’s magical about representing your data this way is that it takes a very understandable, easy-to-work-with form when realized as a filesystem view of the repo (i.e., the working tree). The database can literally be interacted with using the same Git and system tools that developers use in their everyday work. The database is copied locally by cloning the repo. Navigating through the data is done via `cd` and `ls`, data can be searched using tools like `find` and `grep`, and so on. Best of all, data can be modified, committed with an appropriate comment, and pushed back to production if need be.

Interesting Directions to Evolve

Thinking about managing your data the same way you manage your source code leads to some interesting thoughts about new things that you can do easily or layer on top of these capabilities. A few that we’ve thought of or done in the last year:

  • Reproduce exactly in a test environment the state of the database at some point in time in production in order to reproduce a problem.
  • Discover the exact state of the database correlating to data events (by including commit sha).
  • Analyze the effect of specific changes to configuration on behavior of the platform over time.
  • Targeted undo (revert) of a specific configuration change that caused some ill effect.
  • Expose semantics in a product UI that map to familiar source control semantics: pull/branch/commit/push
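
The idea of correlating data events with the exact database state can be sketched as follows. This is a toy, in-memory stand-in under our own assumptions, not HeroDB or Git: `ConfigHistory`, `record_event`, and the `bid` field are hypothetical names, and the ids are only loosely modeled on commit shas.

```python
import hashlib
import json
from typing import Any, Dict, Optional


class ConfigHistory:
    """Toy versioned config store: every commit keeps a full snapshot."""

    def __init__(self) -> None:
        self._snapshots: Dict[str, Dict[str, Any]] = {}
        self.head: Optional[str] = None

    def commit(self, config: Dict[str, Any], message: str) -> str:
        # Derive an id from parent, message, and content, loosely like a commit sha.
        payload = json.dumps([self.head, message, config], sort_keys=True)
        sha = hashlib.sha1(payload.encode("utf-8")).hexdigest()
        # Store a deep copy so later mutation of `config` cannot alter history.
        self._snapshots[sha] = json.loads(json.dumps(config))
        self.head = sha
        return sha

    def at(self, sha: str) -> Dict[str, Any]:
        """Return the configuration exactly as it was at `sha`."""
        return self._snapshots[sha]


def record_event(history: ConfigHistory, kind: str) -> Dict[str, Any]:
    """Stamp an event with the config version that was live when it occurred."""
    return {"kind": kind, "config_sha": history.head}


if __name__ == "__main__":
    history = ConfigHistory()
    history.commit({"bid": 0.10}, "initial bid")
    impression = record_event(history, "impression")
    history.commit({"bid": 0.25}, "raise bid")
    # The impression still resolves to the configuration that produced it.
    print(history.at(impression["config_sha"]))
```

Since each event carries the sha rather than a copy of the configuration, the event stream stays small while any past state remains recoverable.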

Why We Did It

The short answer is that we considered history such an important aspect of the database. Building a notion of history into the database itself was the best way to ensure that we could correlate data events, like clicks on ads, back to the configuration of monetary settings that drove the original impression, in an auditable fashion. Not finding a solution to our problem anywhere, we followed one of Yieldbot’s core principles and JFDI’ed.

The simplicity of the approach hit two strong notes for us. First, the simplicity brought with it a kind of elegance: it is easy to understand how the database handles history and to reason about the form of the data in the database. Second, we immediately got functionality like audit logging built into the database for free. And ultimately, for a technical team that at the time was four developers with our hands full building rich custom intent analytics, performance-optimized ad serving, a rich JavaScript front-end to manage, explore, and visualize our custom intent analytics, and a platform that scales out to a worldwide footprint, it meant we could focus on our core mission of building that product.

 

We’re discussing our work with HeroDB later this week:

Yieldbot Tech Talks Meetup (Feb 7, 2013 @ 7PM): www.meetup.com/Yieldbot-Tech-Talks/events/101101302/

 


Interested in joining the team? We’re hiring! www.yieldbot.com/jobs

by Rich Shea
