Roll Your Own Database Migrations

I once worked at a startup where database migration scripts were distributed via email. The engineering team was very small, and I don’t remember a time when it caused a major issue (especially when compared to the bug that charged customers 10x the purchase price). But if you want to make your database migrations a little more formal than email, read on to see how I do it.

There are many frameworks that manage database migrations for you. I’ve used some of these frameworks and they’re great. Many of the techniques that I use in my day-to-day development come from things I’ve seen in other pieces of software – Ruby on Rails, Drupal, and Spring are a few examples. If you’re using one of these frameworks and are curious about how they might work, read on.

The Problem

Over the life of a software project that uses a relational database, the schema will need to change. How should these schema changes be tracked and applied in all environments (dev, qa, prod, etc.)?

The simple answer is that we’d want to track the changes just like we track code changes in any SCMS. This gives us the ability to see what the schema was at any given time, and to replay schema updates in an isolated environment.

Also, we’d want the updates applied automatically, or at least with just a few keypresses.

One Solution

When faced with this problem about four years ago (I was building an e-commerce platform for a now defunct startup), I came up with a simple solution that has served me well since.

The idea is simple:

  • Store SQL migration files in a /sql-migrations directory
  • Store the current database version in the database
  • Use a shell script to apply necessary migrations

There are some limitations:

  • Migrations are limited to pure SQL – there is no (easy) access to application code
  • I have not implemented a rollback mechanism (I’ll talk more about this below)

Show Me the Code

First, your database must have a table that stores the current version. It looks like this:

create table database_versions (
  version varchar(32) not null,
  is_active boolean not null default false,
  creation_date timestamp not null default current_timestamp
);

Logic to automatically create this table could be added to the migration script below; instead, I have a db-init script that creates this table for me.
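Such a db-init step can be tiny. Here’s a sketch (hypothetical; it mirrors the migration script’s use of the PSQL_OPTS and DB_NAME variables, and uses create table if not exists so it is safe to re-run):

```shell
#!/bin/bash
set -e

# Create the version-tracking table only if it is missing,
# so this function can be run repeatedly without error.
db_init() {
  psql $PSQL_OPTS -v ON_ERROR_STOP=1 $DB_NAME <<'SQL'
create table if not exists database_versions (
  version varchar(32) not null,
  is_active boolean not null default false,
  creation_date timestamp not null default current_timestamp
);
SQL
}
```

Keeping this in a separate db-init script leaves the migration script itself free of setup logic.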

The is_active field is optional. You could infer the active version by assuming that it’s the greatest version. I use it because it makes the code simpler and I have not had a problem with it being out of sync.

Now let’s look at the SQL migration files. There are two requirements for these files:

  • The schema version must be available via the file name
  • The files must be sortable by name, e.g. with sort -V or sort -t. -k1,1n -k2,2n -k3,3n, or by just making the names lexically ordered

I use file names like 000.000, 000.001, 001.000, etc. because it makes the migration shell script easier to code.
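A quick sanity check of the naming scheme; with zero-padded names, plain lexical sort order is also version order:

```shell
#!/bin/bash
# Zero-padded major.minor names sort correctly with a plain lexical sort.
# (Unpadded names like 2.0 and 10.0 would not: '10.0' sorts before '2.0'.)
printf '%s\n' 001.000 000.001 010.000 000.000 | sort
```

This prints 000.000, 000.001, 001.000, 010.000, which is the correct migration order.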

The files in the /sql-migrations directory are changes to the schema, not the full schema. As an example, the first file (000.000) may look like this:

create table users (
  name varchar(32) not null,
  primary key (name)
);

The second file (000.001) may look like this:

alter table users
add email varchar(256) not null;

Now for the meat. Here’s the script I use to apply the migrations:

#!/bin/bash
set -e
source $(dirname $0)/../.include

function add-version() {
  psql $PSQL_OPTS -c "update database_versions set is_active = false where is_active = true" $DB_NAME
  psql $PSQL_OPTS -c "insert into database_versions (version, is_active) values ('$1', TRUE)" $DB_NAME
}

current_version=$(psql $PSQL_OPTS -Atc "select version from database_versions where is_active = true" $DB_NAME)

for f in $(ls -1 $BASEDIR/sql-migrations/*.sql | sort); do
  filev=$(basename $f .sql)
  if [[ $filev > $current_version ]]; then
    # -1 wraps the file in a single transaction; ON_ERROR_STOP=1 aborts on the first error
    psql $PSQL_OPTS -v ON_ERROR_STOP=1 -1f $f $DB_NAME
    # No need to check return code because of the 'set -e'
    add-version $filev
  fi
done

This script:

  1. Retrieves the current database version
  2. Iterates through the migration files (in order) and applies versions that are greater than the current version

This script is pretty simple. When I first wrote it, I was surprised at how short it was.

This script is for a Postgres database, but I’ve used similar versions for MySQL.
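For reference, a MySQL variant might be sketched like this (hypothetical; it assumes MYSQL_OPTS and DB_NAME variables analogous to the Postgres ones, and wraps the logic in a function):

```shell
#!/bin/bash
set -e

# Apply pending migrations using the mysql client.
# In batch mode (no --force), mysql aborts on the first error,
# so the 'set -e' catches failed migrations.
# usage: BASEDIR=... MYSQL_OPTS=... DB_NAME=... apply_migrations
apply_migrations() {
  local client="mysql $MYSQL_OPTS $DB_NAME"
  local current_version
  # -N -B: skip column headers and use raw batch output
  current_version=$($client -N -B -e "select version from database_versions where is_active = true")

  local f filev
  for f in "$BASEDIR"/sql-migrations/*.sql; do
    filev=$(basename "$f" .sql)
    if [[ $filev > $current_version ]]; then
      $client < "$f"
      $client -e "update database_versions set is_active = false where is_active = true"
      $client -e "insert into database_versions (version, is_active) values ('$filev', TRUE)"
    fi
  done
}
```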

Applying the Updates

In development environments, I apply these updates manually. This means that a dev would have to see that a migration file was added and run the migration script by hand. This could be handled a little better with a check at application startup time that does one of the following if the database is not up-to-date:

  • Fails to start
  • Automatically applies the migrations
  • Prints a warning

In production, I apply the migrations as part of my deploy process.

What About Rollbacks?

Like I said above, this database migration technique does not handle rollbacks. It could. All one would have to do is include a rollback SQL file for each migration file and write a script to rollback to a particular version. In my experience, I’ve never had a need for this. Maybe I’m just lucky, but I’d like to think that it’s more that I’ve tested and planned properly.
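For the record, a rollback script along those lines might look like this (hypothetical; it assumes each NNN.sql migration ships with a matching NNN.rollback.sql that undoes it):

```shell
#!/bin/bash
set -e

# Roll the schema back to the target version $1 by applying the
# rollback files for every later version, newest first.
rollback_to() {
  local target=$1 f filev
  for f in $(ls -1 "$BASEDIR"/sql-migrations/*.rollback.sql | sort -r); do
    filev=$(basename "$f" .rollback.sql)
    if [[ $filev > $target ]]; then
      psql $PSQL_OPTS -v ON_ERROR_STOP=1 -1f "$f" $DB_NAME
      psql $PSQL_OPTS -c "delete from database_versions where version = '$filev'" $DB_NAME
    fi
  done
  # reactivate the target version's row
  psql $PSQL_OPTS -c "update database_versions set is_active = true where version = '$target'" $DB_NAME
}
```

Even with this in place, I’d be wary of automated rollbacks for destructive changes – a dropped column’s data is gone regardless.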

Configuration, Part 2

In my previous post I talked about general configuration best practices. In this post, I’ll detail the specifics of how I’ve configured my latest project.

First, here are some relevant details about the project:

  • It’s a webapp (Anthology)
  • It’s also a REST API for our iOS app
  • JVM/Scala
  • Spring IoC

Processes are always kicked off from a shell script that lives in the $APP_HOME/bin directory. All of my scripts in the $APP_HOME/bin directory start with the following lines:

#!/bin/bash
source $(dirname $0)/../.include

Line 1 tells the shell to use /bin/bash to execute the script. Line 2 sources .include. Here’s .include:

#!/bin/bash

cd $(dirname $0)/..
BASEDIR=$PWD
cd - >/dev/null

source $BASEDIR/.env
source $BASEDIR/.functions

version=$(cat $BASEDIR/.version)

if test -r $BASEDIR/.source-version-hash; then
  source_version_hash=$(cat $BASEDIR/.source-version-hash)
else
  source_version_hash=NONE
fi

if test -e $BASEDIR/etc/override.properties; then
  if test -n "$APP_HOME" -a "$APP_HOME" != "$BASEDIR"; then
    fatal "Your APP_HOME is misconfigured"
  fi
fi

if test -n "$APP_HOME" -a "$APP_HOME" != "$BASEDIR"; then
  if test -e $BASEDIR/etc/override.properties; then
    fatal "Your APP_HOME is misconfigured"
  fi
fi

if test -z "$APP_HOME"; then
  if test -e $BASEDIR/etc/override.properties; then
    info "Setting APP_HOME to $BASEDIR"
    export APP_HOME=$BASEDIR
  elif test -e $HOME/opt/$APP_NAME/etc/override.properties; then
    info "Setting APP_HOME to $HOME/opt/$APP_NAME"
    export APP_HOME=$HOME/opt/$APP_NAME
  else
    fatal "Please set APP_HOME"
  fi
fi

if test -r $APP_HOME/etc/env; then
  source $APP_HOME/etc/env
fi

This file’s main purpose is to set up configuration. It checks that required variables are set (or that there is a sensible default) and it loads additional configuration.

Line 7 loads configuration from a .env file. This is default configuration (suitable for a development environment) for shell scripts. Configuration loaded in this step can be overridden by line 43 (e.g. to load production configuration values).

Lines 18-40 attempt to set and/or check the APP_HOME environment variable. Configuration override files are stored in the $APP_HOME/etc directory. This script has to deal with both development and deployed (e.g. qa or production) environments.

In development, the expectation is that there is an APP_HOME environment variable set to a directory that is not part of the codebase. This way, a developer’s custom configuration will not be checked in. The default location for APP_HOME is $HOME/opt/$APP_NAME. With this, a developer can pull down the codebase and mkdir -p $HOME/opt/$APP_NAME/etc to get started. (I actually have a /dev-bin/bootstrap script that takes care of this.)
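That bootstrap script can be tiny. Here’s a sketch (hypothetical; APP_NAME is assumed to be set, and the empty override.properties is what .include looks for when choosing APP_HOME):

```shell
#!/bin/bash
set -e

# Create the developer's APP_HOME directory with an (initially empty)
# override.properties, which .include uses to recognize the directory.
bootstrap() {
  local app_home=$HOME/opt/$APP_NAME
  mkdir -p "$app_home/etc"
  touch "$app_home/etc/override.properties"
  echo "bootstrapped $app_home"
}
```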

In a deployed environment, $APP_HOME and $BASEDIR must be equal. I typically deploy to /opt/$APP_NAME, so this means that there must be an /opt/$APP_NAME/etc directory.

After .include is sourced, APP_HOME is set and we have verified that we have a configuration directory. We’ve also loaded environment-specific shell variable overrides (line 43 in the file above).

Now we can start up our app. I won’t show my complete startup script here (that can be the topic of another post) because the part that relates to configuration is actually pretty short.

java -classpath $CP -Dapp.home=$APP_HOME co.anthology.common.spring.Main "$@"

You can ignore the $CP variable. The main thing to notice here is that $APP_HOME is passed to the JVM as a system property using a -D option.

The final part of configuring is handled by Spring. Here’s a snippet from a Spring context file:

<bean
    class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer"
    p:system-properties-mode-name="SYSTEM_PROPERTIES_MODE_OVERRIDE"
    p:ignore-unresolvable-placeholders="true"
    p:ignore-resource-not-found="true"
    >
  <property name="locations">
    <list>
      <value>classpath:common.properties</value>
      <value>file://${app.home}/etc/override.properties</value>
      <value>file://${app.home}/etc/local-override.properties</value>
    </list>
  </property>
</bean>

Configuration values are applied in these Spring context files like this:

<bean
    id="my-id"
    class="my.class.Name"
    p:property-1="${common.property-1}"
    />

In the above example, property-1 will get the configured value of common.property-1. (Have a look at the Spring docs for more about this.)

The value of common.property-1 is set in the common.properties file (line 9 above). It can be overridden by override.properties (line 10), local-override.properties (line 11), and finally by a system property (line 3).

common.properties is a source controlled default configuration file (all defaults should be dev friendly). It is deployed in a jar file.

override.properties is a deployed configuration file that is also under source control. There is one of these for each deployed environment (e.g. qa and prod). I typically store these files under /config/{qa,prod} in my codebase.

local-override.properties is an optional file (technically they’re all optional because of the ignore-resource-not-found="true") that lives on the deployment hosts. I typically don’t need this, but if I ever needed a way to configure a host differently than the others, this is how I would do it.

Finally, I can set a system property at startup time e.g. -Dcommon.property-1=xyz to override all previous configuration values.

Okay, that’s a short overview of how I’m configuring my latest project. Configuration may not be the most sexy topic, but it’s definitely an important piece of any project.

Configuration

Just about any code that is deployed or shared needs to be configured. This post will discuss configuration best practices.

Configuration can tell your program:

  • Where to find your database
  • What port to bind to
  • How long to wait until timing out
  • Who to email when there’s a problem
  • What remote API server to connect to
  • What code to run when a certain event happens
  • Much more…

Early and Often

Configuration should be one of the first things you address in a new software project. If you don’t address this early, you will attempt to solve problems that could easily be solved with configuration in ways that are not optimal.

A classic consequence of putting off configuration is leaving configuration in the code, e.g. hardcoding a database connection string. When this code is shared and the connection string (for example) must be modified to work in a new environment, there is a chance that the connection string change will be committed and will break others’ development environments. This database connection string is the first of many values that should be configured. If you wait to employ configuration, it will cost you and your team valuable time.

Another cost of not setting up configuration early is that you will be missing a valuable tool. Configuration can be used to solve many important problems (e.g. deploying to QA); if this tool is missing, you will have a hard time solving certain types of problems.

Configuration in Development

Your codebase should come with sensible defaults. An engineer should be able to pull down the code, build it, and run it quickly with little to no custom configuration. In other words, the default configuration should always be set for a developer’s environment. I recommend this for a few reasons:

  • Your deployment process should take care of configuring your program in remote environments. In a developer’s environment, there is no deployment, and therefore, no chance to set configuration
  • Money. Engineering talent is expensive and we don’t want to waste time with needless configuration
  • It would not be good for a program running in a dev environment to interfere with production systems (e.g. a dev program connecting to a production database)

Engineers will need the ability to modify configuration in their development environments. The approach to modifying configuration in development will be the same for all other environments. These approaches will be discussed later.

First Things

There are two main considerations when thinking about configuration:

  • Where to store it
  • In what format

Where to Store Configuration Data

Code Repository

Store configuration files in your code repository. You get all the benefits of source control, and your configuration is easy to find, examine, and change. When I do this, I usually make /config/{qa,prod} directories to hold my configuration files.
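As a sketch of how this plays out at deploy time (hypothetical names; it follows the /config/{qa,prod} layout, and the file name override.properties is just an example):

```shell
#!/bin/bash
set -e

# Copy the chosen environment's config file into the deploy staging area.
# usage: stage_config qa /path/to/staging
stage_config() {
  local env=$1 staging=$2
  local src="config/$env/override.properties"
  if [ ! -f "$src" ]; then
    echo "no configuration for environment: $env" >&2
    return 1
  fi
  mkdir -p "$staging/etc"
  cp "$src" "$staging/etc/"
}
```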

The main drawback with this approach is that updating configuration values will require a full deployment. This can be mitigated by

  • creating a patch deploy process that only deploys configuration
  • using a hybrid approach that also takes advantage of host-based configuration (described below)

Finally, if your configuration varies per deployment/installation, this can be difficult, as you will have a separate file in source control for each deployment (not to mention having to determine which configuration file to use during deployment).

Deployment Hosts

The approach here is to have a configuration file or directory that is searched at startup time e.g. /etc/$APP_NAME or /opt/$APP_NAME/etc.
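That startup-time search can be a few lines of shell. A sketch (the candidate directories are passed in as arguments, so the two example locations above are just one possible call):

```shell
#!/bin/bash
set -e

# Print the first existing directory from the candidates, searched in order.
# usage: find_config_dir /etc/$APP_NAME /opt/$APP_NAME/etc
find_config_dir() {
  local d
  for d in "$@"; do
    if [ -d "$d" ]; then
      echo "$d"
      return 0
    fi
  done
  return 1
}
```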

This is a great option if you have host/deployment-specific configuration e.g. your path to ImageMagick’s convert executable varies on a per-host basis. The drawback here is that your configuration is now spread out over multiple hosts and is, therefore, harder to maintain.

Managing these configuration files can be difficult, as they will need to be maintained by some dev-ops process.

Database

Using a database for configuration is a great way to share configuration across all instances of the app. This is similar to the Code Repository option in that configuration is easily maintained across all deployments. However, there’s the chicken-and-egg problem of how the app knows which database to connect to. Also, unlike the code repository solution, there is no version control.

The big advantages of database configuration are:

  • configuration can vary independent of the software deployment
  • configuration can be maintained from a central location

What Format

Common options for configuration file formats are:

  • JSON
  • XML
  • YAML
  • Java Properties and other language specific formats
  • Scripting language

The first four (JSON, XML, YAML, Java Properties) are all static data formats. JSON, XML, and YAML allow for hierarchical data; Java Properties are strictly name/value pairs – they do not offer a way to store data hierarchically. On the JVM, Java Properties are a great choice because of their easy serialization/deserialization. JSON is great because of its simplicity and well-known spec. For more structured data, JSON has a big leg up on Java Properties. YAML is similar to JSON. I would not recommend XML, as it’s most likely overkill.

Using a scripting language for configuration allows for dynamic and calculated configuration values. I have never done this, and my guess is that there aren’t many cases where this would be a great choice. But there can be great power in having a programming language available in one’s configuration.

When to Apply Configuration

Build Time

In this technique, build artifacts are built with the configuration specified at build time. The main advantage of build-time configuration is that the build artifacts can stand on their own, i.e. they do not require any specific environment to be deployed into. The disadvantage is that any configuration change requires a new build and deployment.

Build-time configuration all but forces one to store configuration along with the code.

Runtime

Using the runtime approach, all configuration is loaded at runtime. Configuration is not part of the deployed artifacts, but lives on the runtime hosts or a database.

Configuration updates can be applied easily by restarting the process (or one could even write some code to dynamically reload the configuration without a restart).

Hybrid (Build Time/Runtime)

This is a mix of the two. The advantage of this is that you can choose what values are configured in what way. The disadvantage is that you have to maintain multiple configuration systems.

Code in Configuration

Storing code in configuration can give your programs a new level of power and flexibility. This technique only makes sense if you are not storing configuration in your code repository. It comes at a cost:

  • It’s hard to verify the correctness of code in configuration
  • An infrastructure must be built around the execution of the configured code
  • Debugging can be difficult as it can be difficult to know what deployed code executed with what configured code

Examples of storing code in configuration are email templates and event hooks. Email templates can be stored as configuration and allow for updates independent of deployments. Event hooks could allow for system updates that quickly address changing requirements.

Code in configuration is also a solution to handling changing or unknown inputs into your program. E.g. if you’re receiving data from a third party that is not well defined (the data, that is), you may want to store the code that processes the data in configuration so it can be updated without a new deployment.

This kind of configuration is never easy and should be considered carefully before employing. I would not use this unless there was a very good reason.

Sometimes, there can be external factors that might lead to using this technique:

  • Your deployment process is slow and you need to see changes fast
  • System requirements are not well defined or are constantly changing

Command Line Overrides

The ability to override configuration from the command line can be valuable as a way to quickly change configuration in both production and development environments. Configuration values should be applied in the following order:

  1. Hard-coded values in code
  2. Config files/database values
  3. Command line options
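In shell terms, that ordering is just successive overrides where the last writer wins. A sketch (the timeout value and --timeout flag are made-up examples):

```shell
#!/bin/bash
set -e

# Resolve a single value with the precedence above:
# hard-coded default, then config file, then command line.
# usage: resolve_timeout [config-file] [--timeout=N]
resolve_timeout() {
  local timeout=30                     # 1. hard-coded default
  local config_file=$1
  shift || true
  if [ -n "$config_file" ] && [ -r "$config_file" ]; then
    local from_file
    from_file=$(sed -n 's/^timeout=//p' "$config_file")   # 2. config file
    if [ -n "$from_file" ]; then
      timeout=$from_file
    fi
  fi
  local arg
  for arg in "$@"; do                  # 3. command line wins last
    case $arg in
      --timeout=*) timeout=${arg#--timeout=} ;;
    esac
  done
  echo "$timeout"
}
```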

Next

In my next post, I’ll show the details of how I’ve configured my latest application.