Heroku and Git submodules

April 21, 2017 / Swen Kooij

TL;DR: We developed a new buildpack which fetches the Git object tree and uses Git natively to fetch submodules: https://github.com/SectorLabs/heroku-buildpack-git-submodule

We currently host our websites on Heroku. Heroku is a great platform for getting started: it takes the pain out of managing infrastructure and gets you up and running quickly. It might not be a good long-term solution, though, and the experience isn't completely pain-free. From time to time, you run into a problem that requires some engineering to solve. We ran into one.

We host all our code on GitHub, in a private organisation. Heroku deploys straight from our GitHub repository, automatically, every time we push to a certain branch. This is very convenient and gives us continuous deployment without a lot of effort. Recently, we started modularizing our code base to allow code re-use. This also meant that we moved some code into a separate repository that is now included as a Git submodule. Git submodules aren't perfect, but they are good enough for us for now.

Git submodules

Our repository looks like this now:

- duck
    - @paddle
        - index.html
        - somegreatfile.txt
- somefile.txt
- .gitmodules

Here, duck is the main repository that gets deployed and paddle is another repository that is included in duck as a Git submodule. A Git submodule is tied to a specific commit. This allows you to pin it to a specific version (versioning it). When you check out the parent repository, you'll get exactly the version you committed. If you want to update it, you have to make a commit that stores a new reference point (a commit hash). This commit hash is stored in the Git object tree.

On top of the commit hash of the referenced repository being stored in your Git object tree, you'll also have a .gitmodules file. This file describes where the referenced repository is stored:

[submodule "paddle"]
    path = paddle
    url = https://github.com/SectorLabs/paddle.git
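
To see where the pinned commit actually lives, the following sketch builds two throwaway repositories in a temporary directory and records paddle in duck as a "gitlink" entry. The temporary paths and the demo identity are assumptions for illustration only; in a real project you would use git submodule add, which also writes the .gitmodules file for you.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway sandbox; the identity below is a placeholder for the demo commits.
work="$(mktemp -d)"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for the paddle repository, with a single commit.
git init -q "$work/paddle"
git -C "$work/paddle" commit -q --allow-empty -m "paddle v1"
pinned="$(git -C "$work/paddle" rev-parse HEAD)"

# Stand-in for duck: record paddle as a gitlink (mode 160000) at that commit.
git init -q "$work/duck"
git -C "$work/duck" update-index --add --cacheinfo "160000,$pinned,paddle"
git -C "$work/duck" commit -q -m "Pin paddle"

# The tree of duck's HEAD now holds the pinned commit hash for the path "paddle".
entry="$(git -C "$work/duck" ls-tree HEAD paddle)"
echo "$entry"
```

The printed entry has mode 160000 and type commit, followed by the hash of the paddle commit. That hash is exactly what a buildpack has to get hold of in order to check out the right version of the submodule.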

When you clone a Git repository that has submodules, you won't automatically see the contents of those. You have to either instruct git clone to also clone out the submodules, or check them out manually after cloning:

git clone --recursive [repo url]
OR
git submodule update --init --recursive

You could use Git subtree to truly embed a Git repository into another, but this can dramatically increase your repository's size, as this would store a copy of the other repository. It would also complicate actively working on the submodule.

Read more about Git submodules and Git subtree here:

The problem

When we thought about introducing Git submodules into our project we also researched how Heroku would deal with this. We found: https://devcenter.heroku.com/articles/git-submodules and life was good. It was already taken care of. The joy!

Our joy was short-lived when we saw the following warning:

warning

That's a problem. We're using GitHub Sync to do our deploys. This means that Heroku pulls from our repository when deploying (as a result of some event, like a push) rather than us pushing the changes to them. Heroku only supports Git submodules natively when you deploy with a Git push, not when it pulls from your repository.

We figured that somebody else must have encountered this problem. Google to the rescue! Of course, somebody else did encounter this problem and came up with a solution. The solution is a custom buildpack that takes care of checking out the submodule. Brilliant!

Again, our joy was short-lived (a bit longer-lived this time). Although the custom buildpack worked fine, it came with two major disadvantages:

1. Does not support SSH authentication.

   This means you have to hard-code credentials in your repository. Obviously, this doesn't apply if your submodule is a public repository which can be cloned without credentials.

2. Does not actually read the Git object tree to check out the submodule at the stored commit.

   It instead looks at the branch setting specified in the .gitmodules file and simply checks out that branch, completely ignoring the hash stored in the Git object tree.

The second point is a big disadvantage. The great advantage of storing the commit hash to which a submodule is tied is that it allows versioning the dependency. When you upgrade the submodule to a newer version, you make a commit. By simply tying it to a branch, you'd have to make sure that branch always contains the right version. Such a system easily breaks and makes our deployment model more complicated.

At first, we wondered why the buildpack we found doesn't simply do git submodule update --init --recursive. That would have been much easier. Here's why it doesn't:

Buildpacks don't have access to the Git tree. Heroku simply clones your repository at the start of the build, zips up the result (without the .git directory) and starts executing the configured buildpacks.

Ouch. This makes things a bit more complicated. In order to solve this problem, we'd have to somehow develop a buildpack that can access the Git object tree and check out the submodules properly.

The solution

We developed a new buildpack to solve our problem: https://github.com/SectorLabs/heroku-buildpack-git-submodule

This new buildpack should be secure and simple to use. On top of that, it fetches Git submodules exactly like Git does instead of relying on a hack or trick.

Making a simple buildpack

Developing a Heroku buildpack is quite easy. You simply create a public Git repository with the following structure:

- bin/
    - detect
    - compile

Where detect and compile are marked as executable (chmod +x). This means we can simply write a Bash script to do all the magic.
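
As a rough sketch of what bin/detect could look like for a submodule buildpack: Heroku calls the script with the build directory as its first argument and treats exit code 0 as "this buildpack applies". Checking for a .gitmodules file and the "Git Submodules" label are assumptions made for illustration, not a copy of the buildpack's source.

```shell
#!/usr/bin/env bash
# Hypothetical bin/detect logic; Heroku passes the build directory as $1.

detect() {
  local build_dir="$1"
  if [ -f "$build_dir/.gitmodules" ]; then
    # The printed name identifies the buildpack in the build output; this label is made up.
    echo "Git Submodules"
    return 0
  fi
  # A non-zero status tells Heroku that this buildpack does not apply.
  return 1
}

# A real bin/detect would end with:  detect "$1"
```

Keeping the logic in a function makes it easy to try out locally against any directory before pushing the buildpack.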

Read more about creating buildpacks for Heroku here: https://devcenter.heroku.com/articles/buildpack-api

Getting the object tree

The first step is to get the Git object tree back, since Heroku strips it. However, we do not want to check out the whole repository. Not only would this be time-consuming for large repositories, it would also overwrite the files Heroku already packed up. We also don't need the entire Git history, just the version that is being deployed.

Git is a very powerful version control system. If you know where to look, you can pull off quite some magic. We can take advantage of the following two Git features to achieve what we want:

- Sparse checkouts

    A sparse checkout will allow us to only get the files we're interested in. This feature is usually used to save bandwidth and time when you don't actually need everything in a repository.

- Shallow clones

    A shallow clone (--depth 1) will prevent pulling in the entire Git history and get only the last commit.

In order to clone the Git repository (or at least part of it), we need to know where it is located. The Heroku app is connected to our GitHub repository, but inside the buildpack we don't have access to the Git URL. Therefore, we're forced to ask the user of the buildpack to add a setting to their Heroku app:

heroku config:set GIT_REPO_URL=https://github.com/SectorLabs/duck.git

This allows the buildpack to figure out where to clone the repository from. Just cloning the repository is not enough. We also need to checkout the right version. When Heroku deploys, it checks out from a specific branch (that you configured). We have to replicate this behavior. Luckily, Heroku sets an environment variable for buildpacks: SOURCE_VERSION, which contains the commit hash of the version that is being deployed. This allows us to check out the right version.
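
One detail worth spelling out: according to Heroku's buildpack API, bin/compile receives three arguments (build directory, cache directory and an environment directory), and each config var is provided as a file in that third directory rather than as a regular environment variable. A minimal sketch of reading GIT_REPO_URL that way follows. The helper name read_config_var is hypothetical, and the buildpack's real code may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper for bin/compile, which Heroku invokes as:
#   bin/compile BUILD_DIR CACHE_DIR ENV_DIR

read_config_var() {
  local env_dir="$1" name="$2"
  # Each config var is a file named after the variable; its content is the value.
  if [ -f "$env_dir/$name" ]; then
    cat "$env_dir/$name"
  else
    echo "Config var $name is not set on this app" >&2
    return 1
  fi
}

# Demo with a fake ENV_DIR standing in for the one Heroku provides.
env_dir="$(mktemp -d)"
printf '%s' "https://github.com/SectorLabs/duck.git" > "$env_dir/GIT_REPO_URL"
GIT_REPO_URL="$(read_config_var "$env_dir" GIT_REPO_URL)"
echo "Cloning from $GIT_REPO_URL"
```

Failing loudly when the setting is missing gives the user of the buildpack a readable message in the build log instead of a confusing Git error later on.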

The sparse checkout is a little more complicated than it seems. It allows you to check out part of the repository, but we actually want none of it. Unfortunately, when you try to do that, Git refuses:

error: Sparse checkout leaves no entry on working directory

Our work-around is simple:

rm .gitmodules
echo ".gitmodules" > .git/info/sparse-checkout

This instructs Git to only check out the .gitmodules file. We have to remove the original before we do that. This is not a problem because the version that we're checking out should be exactly the same.

Putting it all together, we get something like this:

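git init
# Only materialize .gitmodules in the working directory
git config core.sparseCheckout true
echo ".gitmodules" > .git/info/sparse-checkout
# Remove the copy Heroku packed up so the checkout can write it again
rm -f .gitmodules
git remote add origin "$GIT_REPO_URL"
# Shallow fetch: skip the history behind the fetched commits
git fetch -q --depth 1 origin -a
# Check out the exact commit that Heroku is deploying
git checkout -q "$SOURCE_VERSION"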

That leaves us with a .git directory containing our Git object tree. We can now use normal Git commands to accomplish the rest.

Dealing with authentication

As mentioned, another problem was that the solutions we found required hard-coding the credentials as part of the Git URL:

[submodule "paddle"]
    path = paddle
    url = https://username:password@github.com/SectorLabs/paddle.git

This is a bit of a security risk. It would be much nicer if we could specify this as part of the Heroku app. Not only would this be more secure, it would also make it easier to use different credentials per environment.

We could simply get the username/password from an environment variable set on the Heroku app. However, we'd rather use SSH keys for authentication, not least because this doesn't require us to pay an additional $9 a month for a user on GitHub. SSH keys can be added to GitHub as deploy keys.

For this, our buildpack requires the GIT_SSH_KEY setting to be set on the Heroku app, specifying a private SSH key to use to authenticate to both the repository and its submodules:

heroku config:set GIT_SSH_KEY="$(cat ~/.ssh/id_rsa)"
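
How a buildpack might put GIT_SSH_KEY to use is sketched below. This is an illustration under assumptions, not the buildpack's actual code: the function name, the key file location and the choice of ssh options are all hypothetical. It relies on Git's GIT_SSH_COMMAND environment variable (available since Git 2.3) so that later git fetch and git submodule update calls authenticate with the key.

```shell
#!/usr/bin/env bash
# Hypothetical helper: store the private key in a file and point Git at it.

setup_ssh_key() {
  local key="$1" key_dir="$2"
  mkdir -p "$key_dir"
  local key_file="$key_dir/deploy_key"
  # Write the file with owner-only permissions; ssh rejects keys readable by others.
  (umask 077 && printf '%s\n' "$key" > "$key_file")
  # IdentitiesOnly stops ssh from offering other keys it may find first.
  export GIT_SSH_COMMAND="ssh -i $key_file -o IdentitiesOnly=yes"
}

# Demo with a placeholder instead of a real private key.
key_dir="$(mktemp -d)"
setup_ssh_key "not-a-real-key" "$key_dir"
echo "$GIT_SSH_COMMAND"
```

A real buildpack would also need to make github.com a known host for ssh, since a build cannot answer an interactive host key prompt. That part is left out of the sketch.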

Open-source

As usual, we release our solution to the community under the liberal MIT license:

https://github.com/SectorLabs/heroku-buildpack-git-submodule

We look forward to hearing your feedback, positive or negative.

Come work for us

Does all of this excite you and would you like to work for us? Don't wait, check out our job listings here: https://www.sectorlabs.ro/jobs/

Yes, we do have rubber ducks.

🦆

categories / Tech
tags / heroku, git, submodules