02 Feb 2016
I once worked at a startup where database migration scripts were distributed via email. The engineering team was very small and I don’t remember a time when it caused a major issue (especially when compared to the bug that charged customers 10x the purchase price). But, if you want to make your database migrations a little more formal than email, read on to see how I do this.
There are many frameworks that manage database migrations for you. I’ve used some of these frameworks and they’re great. Many of the techniques that I use in my day-to-day development come from things I’ve seen in other pieces of software – Ruby on Rails, Drupal, and Spring are a few examples. If you’re using one of these frameworks and are curious about how they might work, read on.
The Problem
Over the life of a software project that uses a relational database, the schema will need to change. How should these schema changes be tracked and applied in all environments (dev, qa, prod, etc.)?
The simple answer is that we want to track schema changes just like we track code changes in any SCM. This gives us the ability to see what the schema was at any given time, and it gives us the ability to replay schema updates in an isolated environment.
Also, we’d want the updates applied automatically, or at least with just a few keypresses.
One Solution
When faced with this problem about four years ago (I was building an e-commerce platform for a now defunct startup), I came up with a simple solution that has served me well since.
The idea is simple:
- Store SQL migration files in a /sql-migrations directory
- Store the current database version in the database
- Use a shell script to apply necessary migrations
There are some limitations:
- Migrations are limited to pure SQL – there is not (easy) access to application code
- I have not implemented a rollback mechanism (I’ll talk more about this below)
Show Me the Code
First, your database must have a table that stores the current version. It looks like this:
create table database_versions (
    version varchar(32) not null,
    is_active boolean not null default false,
    creation_date timestamp not null default current_timestamp
);
Logic to automatically create this table could be added to my migration script below. I have a db-init script that creates this table for me.
The is_active field is optional. You could infer the active version by assuming that it’s the greatest version. I use it because it makes the code simpler and I have not had a problem with it being out of sync.
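I won’t show my full db-init script, but a minimal sketch might look something like this (it assumes the same PSQL_OPTS and DB_NAME variables that the migration script below gets from .include):

#!/bin/bash
set -e
source $(dirname $0)/../.include

# Create the version-tracking table if it does not already exist.
# A fresh database has no rows in it, so the migration script below
# will simply apply every migration file.
psql $PSQL_OPTS -v ON_ERROR_STOP=1 $DB_NAME <<'SQL'
create table if not exists database_versions (
    version varchar(32) not null,
    is_active boolean not null default false,
    creation_date timestamp not null default current_timestamp
);
SQL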
Now let’s look at the SQL migration files. There are two requirements for these files:
- The schema version must be available via the file name
- The files must be sortable by name, e.g. with sort -V, or sort -t. -k 1,1n -k 2,2n -k 3,3n, or just make the names lexically ordered
I use file names like 000.000, 000.001, 001.000, etc. because it makes the migration shell script easier to code.
The files in the /sql-migrations directory are changes to the schema, not the full schema. As an example, the first file (000.000) may look like this:
create table users (
    name varchar(32) not null,
    primary key (name)
);
The second file (000.001) may look like this:
alter table users
    add email varchar(256) not null;
Now for the meat. Here’s the script I use to apply the migrations:
#!/bin/bash
set -e

source $(dirname $0)/../.include

function add-version() {
    psql $PSQL_OPTS -c "update database_versions set is_active = false where is_active = true" $DB_NAME
    psql $PSQL_OPTS -c "insert into database_versions (version, is_active) values ('$1', TRUE)" $DB_NAME
}

current_version=$(psql $PSQL_OPTS -Atc "select version from database_versions where is_active = true" $DB_NAME)

for f in $(ls -1 $BASEDIR/migration/*.sql | sort); do
    filev=$(basename $f .sql)
    if [[ $filev > $current_version ]]; then
        psql $PSQL_OPTS -v ON_ERROR_STOP=1 -1 -f $f $DB_NAME
        # No need to check the return code because of 'set -e'
        add-version $filev
    fi
done
This script
- Retrieves the current database version
- Iterates through the migration files (in order) and applies versions that are greater than the current version
This script is pretty simple. When I first wrote it, I was surprised at how short it was.
This script is for a Postgres database, but I’ve used similar versions for MySQL.
Applying the Updates
In development environments, I apply these updates manually. This means that a dev has to notice that a migration file was added and run the migration script by hand. This could be handled a little better with a check at application startup time that does one of the following if the database is not up-to-date (a minimal sketch of the first option follows the list).
- Fails to start
- Automatically applies the migrations
- Prints a warning
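Here’s a minimal sketch of the first option, done as a pre-flight check in a startup shell script rather than in application code (it assumes the fatal helper from .functions and the same migration directory used by the script above):

#!/bin/bash
set -e
source $(dirname $0)/../.include

# Compare the newest migration file against the version recorded in the
# database and refuse to start if they don't match.
latest_version=$(basename $(ls -1 $BASEDIR/migration/*.sql | sort | tail -1) .sql)
db_version=$(psql $PSQL_OPTS -Atc "select version from database_versions where is_active = true" $DB_NAME)

if [[ "$db_version" != "$latest_version" ]]; then
    fatal "Database is at version '$db_version' but migrations go up to '$latest_version'; run the migration script first"
fi

# ...start the app as usual from here...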
In production, I apply the migrations as part of my deploy process.
What About Rollbacks?
Like I said above, this database migration technique does not handle rollbacks. It could. All one would have to do is include a rollback SQL file for each migration file and write a script to roll back to a particular version. In my experience, I’ve never had a need for this. Maybe I’m just lucky, but I’d like to think that it’s more that I’ve tested and planned properly.
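If I ever did need it, a sketch might look like this: keep a rollback file for each migration in a (hypothetical) migration-rollback directory, using the same version-based names, and walk back down to a target version.

#!/bin/bash
set -e
source $(dirname $0)/../.include

target_version=$1   # e.g. ./db-rollback 000.003

current_version=$(psql $PSQL_OPTS -Atc "select version from database_versions where is_active = true" $DB_NAME)

# Walk the migrations from newest to oldest, applying the matching rollback
# file for every applied version greater than the target.
for f in $(ls -1 $BASEDIR/migration/*.sql | sort -r); do
    filev=$(basename $f .sql)
    if [[ $filev > $target_version ]] && ! [[ $filev > $current_version ]]; then
        psql $PSQL_OPTS -v ON_ERROR_STOP=1 -1 -f $BASEDIR/migration-rollback/$filev.sql $DB_NAME
        psql $PSQL_OPTS -c "delete from database_versions where version = '$filev'" $DB_NAME
    fi
done

psql $PSQL_OPTS -c "update database_versions set is_active = true where version = '$target_version'" $DB_NAME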
26 Jan 2016
In my previous post I talked about general configuration best practices. In this post, I’ll detail the specifics of how I’ve configured my latest project.
First, here are some relevant details about the project:
- It’s a webapp (Anthology)
- It’s also a REST API for our iOS app
- JVM/Scala
- Spring IoC
Processes are always kicked off from a shell script that lives in the $APP_HOME/bin directory. All of my scripts in the $APP_HOME/bin directory start with the following lines:
#!/bin/bash
source $(dirname $0)/../.include
Line 1 tells the shell to use /bin/bash to execute the script. Line 2 sources .include. Here’s .include:
#!/bin/bash

cd $(dirname $0)/..
BASEDIR=$PWD
cd - >/dev/null

source $BASEDIR/.env
source $BASEDIR/.functions

version=$(cat $BASEDIR/.version)

if test -r $BASEDIR/.source-version-hash; then
    source_version_hash=$(cat $BASEDIR/.source-version-hash)
else
    source_version_hash=NONE
fi

if test -e $BASEDIR/etc/override.properties; then
    if test -n "$APP_HOME" -a "$APP_HOME" != "$BASEDIR"; then
        fatal "Your APP_HOME is misconfigured"
    fi
fi

if test -n "$APP_HOME" -a "$APP_HOME" != "$BASEDIR"; then
    if test -e $BASEDIR/etc/override.properties; then
        fatal "Your APP_HOME is misconfigured"
    fi
fi

if test -z "$APP_HOME"; then
    if test -e $BASEDIR/etc/override.properties; then
        info "Setting APP_HOME to $BASEDIR"
        export APP_HOME=$BASEDIR
    elif test -e $HOME/opt/$APP_NAME/etc/override.properties; then
        info "Setting APP_HOME to $HOME/opt/$APP_NAME"
        export APP_HOME=$HOME/opt/$APP_NAME
    else
        fatal "Please set APP_HOME"
    fi
fi

if test -r $APP_HOME/etc/env; then
    source $APP_HOME/etc/env
fi
This file’s main purpose is to set up configuration. It checks that required variables are set (or that there is a sensible default) and it loads additional configuration.
Line 7 loads configuration from a .env file. This is default configuration (suitable for a development environment) for shell scripts. Configuration loaded in this step can be overridden by line 43 (e.g. to load production configuration values).
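I don’t show .env here, but as an illustration it might contain dev-friendly defaults like these (the variable names and values are hypothetical):

#!/bin/bash
# Development defaults. Deployed environments override these via
# $APP_HOME/etc/env, which .include sources last.
export APP_NAME=anthology
export DB_NAME=anthology_dev
export PSQL_OPTS="-h localhost -U anthology"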
Lines 18-40 attempt to set and/or check the APP_HOME environment variable. Configuration override files are stored in the $APP_HOME/etc directory. This script has to deal with both development and deployed (e.g. qa or production) environments.
In development, the expectation is that there is an APP_HOME environment variable set to a directory that is not part of the codebase. This way, a developer’s custom configuration will not be checked in. The default location for APP_HOME is $HOME/opt/$APP_NAME. With this, a developer can pull down the codebase and mkdir -p $HOME/opt/$APP_NAME/etc to get started. (I actually have a /dev-bin/bootstrap script that takes care of this.)
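The bootstrap script isn’t shown in this post; a sketch of what it might do (the app name is hard-coded here for illustration):

#!/bin/bash
set -e

# Create a per-developer APP_HOME outside the codebase and drop an empty
# override.properties into it so that .include can auto-detect APP_HOME.
APP_NAME=anthology   # hypothetical; .include normally gets this from .env
APP_HOME=$HOME/opt/$APP_NAME

mkdir -p $APP_HOME/etc
touch $APP_HOME/etc/override.properties

echo "Created $APP_HOME/etc; put developer-specific overrides in override.properties"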
In a deployed environment, $APP_HOME and $BASEDIR must be equal. I typically deploy to /opt/$APP_NAME, so this means that there must be an /opt/$APP_NAME/etc directory.
After .include is sourced, APP_HOME is set and we have verified that we have a configuration directory. We’ve also loaded environment-specific shell variable overrides (line 43 in the file above).
Now we can start up our app. I won’t show my complete startup script here (that can be the topic of another post) because the part that relates to configuration is actually pretty short.
java -classpath $CP -Dapp.home=$APP_HOME co.anthology.common.spring.Main "$@"
You can ignore the $CP variable. The main thing to notice here is that $APP_HOME is passed to the JVM as a system property using a -D option.
The final part of configuring is handled by Spring. Here’s a snippet from a Spring context file:
<bean
    class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer"
    p:system-properties-mode-name="SYSTEM_PROPERTIES_MODE_OVERRIDE"
    p:ignore-unresolvable-placeholders="true"
    p:ignore-resource-not-found="true"
>
    <property name="locations">
        <list>
            <value>classpath:common.properties</value>
            <value>file://${app.home}/etc/override.properties</value>
            <value>file://${app.home}/etc/local-override.properties</value>
        </list>
    </property>
</bean>
Configuration values are applied in these Spring context files like this:
<bean
    id="my-id"
    class="my.class.Name"
    p:property-1="${common.property-1}"
/>
In the above example, property-1 will get the configured value of common.property-1. (Have a look at the Spring docs for more about this.)
The value of common.property-1 is set in the common.properties file (line 9 above). It can be overridden by override.properties (line 10), local-override.properties (line 11), and finally by a system property (line 3).
common.properties is a source-controlled default configuration file (all defaults should be dev friendly). It is deployed in a jar file.
override.properties is a deployed configuration file that is also under source control. There is one of these for each deployed environment (e.g. qa and prod). I typically store these files under /config/{qa,prod} in my codebase.
local-override.properties is an optional file (technically they’re all optional because of ignore-resource-not-found="true") that lives on the deployment hosts. I typically don’t need this, but if I ever needed a way to configure a host differently from the others, this is how I would do it.
Finally, I can set a system property at startup time, e.g. -Dcommon.property-1=xyz, to override all previous configuration values.
Okay, that’s a short overview of how I’m configuring my latest project. Configuration may not be the sexiest topic, but it’s definitely an important piece of any project.
25 Jan 2016
Just about any code that is deployed or shared needs to be configured. This post
will discuss configuration best practices.
Configuration can tell your program
- Where to find your database
- What port to bind to
- How long to wait until timing out
- Who to email when there’s a problem
- What remote API server to connect to
- What code to run when a certain event happens
- Much more…
Early and Often
Configuration should be one of the first things you address in a new software
project. If you don’t address this early, you will end up solving problems
in suboptimal ways that configuration could have handled easily.
A classic consequence of putting off configuration is leaving configuration in
the code, e.g. hardcoding a database connection string. When this code is shared
and the connection string (for example) must be modified to work in a new
environment, there is a chance that the connection string change will be
committed, and will break others’ development environments. This database
connection string is the first of many values that should be configured. If you
wait to employ configuration, it will cost you and your team valuable time.
Another cost of not setting up configuration early is that you will be missing a
valuable tool. Configuration can be used to solve many important problems (e.g.
deploying to QA); if this tool is missing, you will have a hard time solving
certain types of problems.
Configuration in Development
Your codebase should come with sensible defaults. An engineer should be able to
pull down the code, build it, and run it quickly with little to no custom
configuration. In other words, the default configuration should always be set
for a developer’s environment. I recommend this for a few reasons
- Your deployment process should take care of configuring your program in remote environments. In a developer’s environment, there is no deployment, and therefore, no chance to set configuration
- Money. Engineering talent is expensive and we don’t want to waste time with needless configuration
- It would not be good for a program running in a dev environment to interfere with production systems (e.g. a dev program connecting to a production database)
Engineers will need the ability to modify configuration in their development
environments. The approach to modifying configuration in development will be the
same as in all other environments. These approaches will be discussed later.
First Things
There are two main considerations when thinking about configuration:
- Where to store it
- In what format
Where to Store Configuration Data
Code Repository
Store configuration files in your code repository. You get all the benefits of
source control, and your configuration is easy to find, examine, and
change. When I do this, I usually make /config/{qa,prod} directories to hold my configuration files.
The main drawback with this approach is that updating configuration values will
require a full deployment. This can be mitigated by
- creating a patch deploy process that only deploys configuration
- using a hybrid approach that also takes advantage of host-based configuration (described below)
Finally, if your configuration varies per deployment/installation this can be
difficult as you will have a separate file in source control for each deployment
(not to mention how to determine which configuration file to use during
deployment).
Deployment Hosts
The approach here is to have a configuration file or directory that is searched at startup time, e.g. /etc/$APP_NAME or /opt/$APP_NAME/etc.
This is a great option if you have host/deployment-specific configuration, e.g. your path to ImageMagick’s convert executable varies on a per-host basis. The drawback here is that your configuration is now spread out over multiple hosts and is, therefore, harder to maintain.
Managing these configuration files can be difficult, as they will need to be
maintained by some dev-ops process.
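As a sketch of the search-at-startup idea (using the example paths above, and assuming APP_NAME is already set), a launcher could take the first configuration directory it finds:

#!/bin/bash
set -e

# Search the conventional locations and use the first one that exists.
CONFIG_DIR=""
for dir in /etc/$APP_NAME /opt/$APP_NAME/etc; do
    if [ -d "$dir" ]; then
        CONFIG_DIR=$dir
        break
    fi
done

if [ -z "$CONFIG_DIR" ]; then
    echo "No configuration directory found for $APP_NAME" >&2
    exit 1
fi

echo "Using configuration from $CONFIG_DIR"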
Database
Using a database for configuration is a great way to share configuration across
all instances of the app. This is similar to the Code Repository option in that
configuration is easily maintained across all deployments. However, there’s the
chicken-and-egg problem of how the app knows which database to connect to. Also,
unlike the code repository solution, there is no version control.
The big advantages with database configuration are
- configuration can vary independent of the software deployment
- configuration can be maintained from a central location
Configuration File Formats
Common options for configuration file formats are
- JSON
- XML
- YAML
- Java Properties and other language specific formats
- Scripting language
The first four (JSON, XML, YAML, Java Properties) are all of the same type:
static data formats. JSON, XML, and YAML all allow for hierarchical data. Java
Properties are strictly name/value pairs – they do not offer a way to store
data hierarchically. On the JVM, Java Properties are a great choice because of
their easy serialization/deserialization. JSON is great because of its
simplicity and well-known spec. For more structured data, JSON has a big leg up
over Java Properties. YAML is similar to JSON. I would not recommend XML as
it’s most likely overkill.
Using a scripting language for configuration allows for dynamic and calculated
configuration values. I have never done this, and my guess is that there aren’t
many cases where this would be a great choice. But there can be great power in
having a programming language available in one’s configuration.
When to Apply Configuration
Build Time
In this technique, build artifacts are built with the configuration specified at
build time. The main advantage with build time configuration is that the build
artifacts can stand on their own i.e. they do not require any specific
environment to be deployed into. The disadvantage is that any configuration
change requires a new build and deployment.
Build-time configuration all but forces one to store configuration along with
the code.
Runtime
Using the runtime approach, all configuration is loaded at runtime.
Configuration is not part of the deployed artifacts, but lives on the runtime
hosts or a database.
Configuration updates can be applied easily by restarting the process (or one
could even write some code to dynamically reload the configuration without a
restart).
Hybrid (Build Time/Runtime)
This is a mix of the two. The advantage of this is that you can choose what
values are configured in what way. The disadvantage is that you have to maintain
multiple configuration systems.
Code in Configuration
Storing code in configuration can give your programs a new level of power and
flexibility. This technique only makes sense if you are not storing
configuration in your code repository. This technique comes at a cost
- It’s hard to verify the correctness of code in configuration
- An infrastructure must be built around the execution of the configured code
- Debugging can be difficult as it can be difficult to know what deployed code executed with what configured code
Examples of storing code in configuration are email templates and event hooks.
Email templates can be stored as configuration and allow for updates independent
of deployments. Event hooks could allow for system updates that quickly address
changing requirements.
Code in configuration is also a solution to handling
changing or unknown inputs into your program. E.g. if you’re receiving data from
a third party that is not well defined (the data, that is), you may want to
store the code that processes the data in configuration so it can be updated
without a new deployment.
This kind of configuration is never easy and should be considered carefully
before you employ it. I would not use it unless there was a very good reason.
Sometimes, there can be external factors that might lead to using this technique
- Your deployment process is slow and you need to see changes fast
- System requirements are not well defined or are constantly changing
Command Line Overrides
The ability to override configuration from the command line can be valuable as a
way to quickly change configuration in both production and development
environments. Configuration values should be applied in the following order,
with later sources overriding earlier ones (a short sketch follows the list):
- Hard-coded values in code
- Config files/database values
- Command line options
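Here’s a small sketch of that ordering in a launcher script (the option names and file path are hypothetical): hard-coded defaults first, then a config file, then command line options.

#!/bin/bash

# 1. Hard-coded defaults
HTTP_PORT=8080
DB_HOST=localhost

# 2. Config file values override the defaults (if the file exists)
if [ -r /opt/myapp/etc/env ]; then
    source /opt/myapp/etc/env
fi

# 3. Command line options override everything, e.g. ./run --port 9090
while [ $# -gt 0 ]; do
    case "$1" in
        --port)    HTTP_PORT=$2; shift ;;
        --db-host) DB_HOST=$2; shift ;;
    esac
    shift
done

echo "Starting on port $HTTP_PORT against database host $DB_HOST"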
Next
In my next post, I’ll show the details of how I’ve configured my latest
application.