The core of Mathspace is a full stack Django monolith that powers our step-by-step math engine, the main user facing app including the student and teacher dashboards, task management and many other features and components. To help speed up testing of changes to the monolith (for devs and stake holders alike), we made a few attempts in the past to allow spinning up complete stacks, on-demand and quickly. The recurring challenge towards this goal has been provisioning the database for new stacks in a quick, cost effective and robust way. In this post, I'm going to explain how we finally solved this challenge using our own in-house solution.
Our monolith app talks to a Postgres database. We use a trimmed down anonymized version of the production data for development and this is the version that we wanted to use in our on-demand stacks. The size of the dev database is about 15GB as of this writing.
We use Django's built-in schema and data migration facilities and it has served us well to date. There are occasions however where we resort to out of band migrations when a migration is too intensive to run synchronously as a part of a deployment (for example, updating a column on all rows of a billion row table, in preparation of adding a constraint to that column). Because of these out of band migrations, it's not always possible to take an arbitrary dump of the database, and apply migrations to it to bring it up to date. We have experimented with various techniques over time to unify all migrations, but they've fallen short in one way or another. So instead, we run daily database dumps and occasionally throw away our dev database and pull in a fresh one when migrations fail.
With the background out of the way, following are our requirements for "PR Stacks" — fully working on-demand instances of the monolith each with their own dedicated database:
- Ability to spin up a stack for any git commit for which we have a compatible database dump (more on what is deemed compatible later).
- Must shut itself down when not used — to save cost.
- Must start up quickly when accessed.
- Must retain its data for a month — in case we need to refer to it.
- Must allow resetting database — to clear any test data.
- Must be accessible from public Internet but only by staff — to make sharing access to it low friction.
- Cost effective for moderate number of active and large number of inactive stacks.
Among these requirements, 1, 4, 5 and 7 are of interest in this post. Let's look at each.
(PR stands for Pull Request as these stacks are exclusively used in our code review process at the moment.)
The Right Dump
A dump is defined as compatible with a specific git revision X if there isn't any migrations applied in the dump that isn't defined in X. Intuitively, it makes sense not to use a database that has future migrations applied to it.
In order to track what migrations are defined in code and what is applied in a dump, we look at:
django_migrationstable provided by Django which has a row for every migration applied.
- A one migration per line listing of all migrations stored next to and kept up to date with the code and checked by CI.
The above definition of "compatible" allows for multiple dumps to be compatible with a single code revision. In particular, if dump A is compatible with git revision X, then all dumps older than A are also compatible with revision X. We do however need to decide on a single dump to use for a code revision. The obvious choice is the right one here: the latest compatible dump. That is because, if you recall, out of band migrations may break the ability to migrate the database from a point in time forward. Therefore, by picking the latest compatible dump, we reduce the likelihood of this breakage (but not eliminate it). In practice, we have yet to come across a dump that cannot be migrated forward.
It's worth noting that a stack will run migrations on startup in order to bring the database up to date. This is usually a no-op as the dump has all migrations applied.
Cost & Retention
Cost and data retention go hand in hand. Keeping data around usually costs more as does having more copies of it. We merge a lot of pull requests a month and while not all make use of PR Stacks, our usage will only ever increase. At 15GB a copy, it won't be long before we are spending non-trivial amounts on storage alone.
Our solution here was to use Copy-on-Write storage, observing that although the base data size is large, the actual changes to the data is often very small. We chose XFS with reflink and background de-duplication to great effect. In particular, we can have a new instance ready in under a second. Going down this path meant however, that all database instances had to reside on a single VM. This is good trade-off, because:
- instances are shut down aggressively when they become idle (more on this later);
- stacks don't often run database heavy operations.
Hence, we can run the VM with a low spec CPU. We also have Kernel Shared Memory (KSM) as a potential future improvement should the memory usage grow higher over time.
The last requirement, the ability to reset the database, is essentially the same as creating a new instance. We realized that an on-demand database instance that's quick to boot and can start from any daily dump, has more potential uses than just for the purpose of PR Stacks. Some examples include:
- Applying migrations and transformations between multiple instances and diffing them.
- Using as local database, allowing devs to switch between as many independent databases as needed while developing and testing.
- For on-demand light weight analytics where an older version of the database is needed. Alternative would be restoring from a complete backup which is a multi-hour endeavor.
So we decided to run the on-demand database as its own dedicated service. Keeping the two components separate brings some other benefits as well:
- PR Stacks without database is stateless. It is therefore easy to operate and migrate.
- Each service can run on an entirely different infrastructure.
- General advantages of having decoupled components with well defined interfaces.
In its simplest form, an on-demand database service would provide an interface to create, connect to and destroy an instance. The requirement to keep data for a maximum of 30 days lets us skip the destroy part. What if we could also skip the creation part and create instances on the fly when the client first connected? This was the light bulb moment! (Recall that it takes under a second to spin up a new instance, so the client's experience isn't significantly affected.)
In order to create Postgres instances on the fly, we must implement a proxy that sits between the client and the instance we will be creating. It may seem intimidating to implement Postgres wire protocol from scratch. However, with an engineering mindset, we can carefully assess how much of the protocol is required to accomplish our specific goal and ignore the rest. This is a healthy trade-off because in exchange for simplicity and shorter development and maintenance effort, we give up flexibility that we are unlikely to need in the foreseeable future. Furthermore, nothing stops us from implementing more of the protocol as needed to enable new features.
In our case, we only needed to implement the initial handshake which we implemented in a few dozen lines of Go code.
Here is an example of
psql invocation that transparently creates a database while connecting to it:
psql -U instance:testdb,gitref:abcde12345 mathspace_default
We have effectively overloaded the
username to define what database snapshot we want to create the instance from.
instance defines a unique name that on subsequent connections, will connect to the same instance.
gitref defines a revision of code for which we want the best compatible snapshot as basis of the instance. Once Multidb has determined the snapshot, it CoW copies the snapshot to a new directory and uses it as the data directory for the new instance. It then runs Postgres in docker and proxies the traffic to it.
Because the traffic is proxied, idle connections can be trivially detected and terminated, allowing us to stop and remove and Postgres containers that are not actively used, thereby making efficient use of CPU and RAM.
Multidb has proved quite robust during the several months that it's been running, requiring maintenance only when the snapshots it consumes break (which is rare). Multidb has enabled a reliable PR Stacks which has given our dev team increasing confidence to rely on it for feature demonstrations and testing.