About a year ago we decided to move all of our configuration to a new database technology that we would develop in-house, which we ended up calling HeroDB.
The motivating principle behind our configuration database was a reliable concept of versioning. We had previously tried to implement versioning by hand on top of existing database technology. That approach kept older versions of objects in a separate area, each identified by a version number, and application logic moved copies of whole objects around as changes were made to them. Data events contained the versions of the objects they related to. While this did some of what we wanted, it was clearly not the solution we were looking for.
“at the core of Git is a simple key-value data store. You can insert
any kind of content into it, and it will give you back a key that you
can use to retrieve the content again at any time.”
– Pro Git, Ch 9.2
While we were wrestling with these challenges in managing our application’s data, we were already managing one type of data with perfect version semantics: our source code. A few simple searches told us we weren’t the first to think about leveraging Git to manage data, but there also wasn’t anything serious out there that did what we were looking for.
So we thought hard about what Git would be able to provide us as a database (there are definitely both pros and cons) and how it intersected with what we were looking for in versioning our configuration data. A few of the things we liked:
- every change versioned
- annotated changes (comments) and history (log)
- every past state of the database inspectable and reproducible
- reverting changes
- caching (by commit sha) – a specific version of the data doesn’t change
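The caching point follows from immutability: a commit sha names one fixed state of the data, so anything read at that sha can be cached without ever being invalidated. A minimal Python sketch of the idea, assuming a hypothetical `load` callable that reads a path at a given commit (the class name, the callable, and the sample sha are placeholders for illustration, not part of HeroDB):

```python
from typing import Callable, Dict, Tuple


class CommitCache:
    """Cache reads keyed by (commit sha, path).

    Entries never need invalidation, because the data behind a given
    commit sha cannot change.
    """

    def __init__(self, load: Callable[[str, str], str]) -> None:
        self._load = load  # hypothetical reader: (commit_sha, path) -> value
        self._entries: Dict[Tuple[str, str], str] = {}

    def get(self, commit_sha: str, path: str) -> str:
        key = (commit_sha, path)
        if key not in self._entries:
            # First read at this commit: go to the repo and remember the result.
            self._entries[key] = self._load(commit_sha, path)
        return self._entries[key]


if __name__ == "__main__":
    calls = []

    def fake_load(commit_sha: str, path: str) -> str:
        # Stand-in for a real repository read (an assumption of this sketch).
        calls.append((commit_sha, path))
        return f"value of {path} at {commit_sha}"

    cache = CommitCache(fake_load)
    cache.get("abc123", "some/path")
    cache.get("abc123", "some/path")  # served from the cache
    print(len(calls))
```

A new commit produces a new sha and therefore new cache keys, so stale reads are not possible as long as callers always ask for an explicit commit.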
There were definitely cons, which we decided would be worth it for the strength of the benefits we’d be getting. A few of the cons we decided to live with:
- comparatively slow, both reads and writes
- size concerns (we would need to shard)
- no native foreign keys
Some of these can be mitigated. For instance, read performance can be improved (with caveats) by serving reads from cloned repos that are updated as changes are made to the “golden” repo. To mitigate size concerns, data can be sharded; because there is no native concept of foreign keys anyway, sharding carries no penalty to what can be expressed in the data.
What We Did
Once we decided we liked the sound of having Git-like semantics on our data, we went looking for existing projects that provided what we wanted. Not satisfied with what we found, our next step was to look for a suitable programmatic interface to Git. In the end we found a good solution in a project named Dulwich (https://github.com/jelmer/dulwich), which is a pure Python implementation of Git interfaces.
With Dulwich as the core, we implemented a REST API providing the data semantics that we wished to expose.
In terms of modeling data, we took Git’s concept of Trees and Blobs, conceiving of Trees as objects and Blobs as attributes, with the content of the Blobs being the values of the attributes. The database itself exists as a Git “bare” repo. Data is expressed in JSON, where ultimately all values are encoded in Blobs and where Trees represent nesting levels of objects.
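To make the Tree/Blob mapping concrete, here is a rough Python sketch that walks a nested JSON-style object, turns each nested object into a Git tree and each attribute value into a Git blob, and computes the object ids the way Git does. It uses only `hashlib` rather than Dulwich, so it is an illustration of the idea and not HeroDB code. The sample data, the function names, and the choice to JSON-encode each value inside its blob are assumptions, and attribute names are assumed not to contain `/` or NUL.

```python
import hashlib
import json

TREE_MODE = b"40000"   # mode Git records for a subtree entry
BLOB_MODE = b"100644"  # mode Git records for a regular file entry


def git_object_id(kind, body):
    """Raw SHA-1 that Git assigns to an object of the given kind and body."""
    header = kind.encode() + b" " + str(len(body)).encode() + b"\x00"
    return hashlib.sha1(header + body).digest()


def store(value, objects):
    """Store `value` into the `objects` dict and return (mode, raw_sha).

    Dicts become trees; anything else becomes a blob holding the
    JSON-encoded value (the encoding is an assumption of this sketch).
    """
    if isinstance(value, dict):
        entries = []
        for name, child in value.items():
            mode, sha = store(child, objects)
            entries.append((name.encode(), mode, sha))
        # Git orders tree entries by name, comparing subtree names as if
        # they ended in "/".
        entries.sort(key=lambda e: e[0] + (b"/" if e[1] == TREE_MODE else b""))
        body = b"".join(m + b" " + n + b"\x00" + s for n, m, s in entries)
        kind, mode = "tree", TREE_MODE
    else:
        body = json.dumps(value).encode()
        kind, mode = "blob", BLOB_MODE
    sha = git_object_id(kind, body)
    objects[sha.hex()] = (kind, body)
    return mode, sha


if __name__ == "__main__":
    objects = {}
    config = {"campaign": {"name": "example", "budget": 100}}  # made-up data
    _, root = store(config, objects)
    print(root.hex(), len(objects))
```

Because every id is derived from content, changing a single attribute changes its blob id, the id of every tree above it, and therefore the root. Unchanged subtrees keep their ids and are shared between versions.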
The following simple example illustrates the JSON representation of a portion of our database realized in Git and what that looks like in a working tree.
In cloned repo:
What’s magical about representing your data this way is that it is very understandable and easy to work with when realized as a filesystem view of the repo (i.e., the working tree). The database can literally be interacted with using the same Git and system tools that developers already use in their everyday work. The database is copied locally by cloning the repo. Navigating through the data is done via `cd` and `ls`, data can be searched using tools like `find` and `grep`, etc. Best of all, data can be modified, committed with an appropriate comment, and pushed back to production if need be.
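As a rough illustration of that filesystem view, the Python sketch below writes a small, made-up object into a directory layout where nested objects become directories and attributes become files, then reads it back. The data, the function names, and the choice to store each value as JSON text in its file are assumptions for illustration, not HeroDB's actual schema or encoding.

```python
import json
import tempfile
from pathlib import Path


def write_tree(value, path):
    """Write nested dicts as directories and every other value as a file."""
    if isinstance(value, dict):
        path.mkdir(parents=True, exist_ok=True)
        for name, child in value.items():
            write_tree(child, path / name)
    else:
        path.write_text(json.dumps(value))


def read_tree(path):
    """Inverse of write_tree: rebuild the nested object from the directory."""
    if path.is_dir():
        return {p.name: read_tree(p) for p in sorted(path.iterdir())}
    return json.loads(path.read_text())


if __name__ == "__main__":
    config = {"campaign": {"name": "example", "budget": 100}}  # made-up data
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "db"
        write_tree(config, root)
        # Each attribute is now an ordinary file that grep, find, and git can see.
        for p in sorted(root.rglob("*")):
            print(p.relative_to(root))
        assert read_tree(root) == config
```

With a layout like this, a one-attribute change shows up as a one-file diff, which is what makes ordinary Git tooling a workable interface to the data.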
Interesting Directions to Evolve
Thinking about managing your data the same way you manage your source code leads to some interesting thoughts about new things that you can do easily or layer on top of these capabilities. A few that we’ve thought of or done in the last year:
- Reproduce exactly in a test environment the state of the database at some point in time in production in order to reproduce a problem.
- Discover the exact state of the database correlating to data events (by including commit sha).
- Analyze the effect of specific changes to configuration on behavior of the platform over time.
- Targeted undo (revert) of a specific configuration change that caused some ill effect.
- Expose semantics in a product UI that map to familiar source control semantics: pull/branch/commit/push
Why We Did It
The short answer is that we considered history such an important aspect of the database. Building a notion of history into the database itself was the best way to ensure we could correlate data events, such as clicks on ads, back to the configuration of monetary settings that drove the original impression, in an auditable fashion. Not finding a solution to our problem anywhere, we followed one of Yieldbot’s core principles and JFDI’ed.
We’re discussing our work with HeroDB later this week:
Yieldbot Tech Talks Meetup (Feb 7, 2013 @ 7PM): www.meetup.com/Yieldbot-Tech-Talks/events/101101302/
Interested in joining the team? We’re hiring! www.yieldbot.com/jobs