Git Workflows, Part 3: Refactoring Large Branches and Pull Requests

Posted in Git

Summary

  • If a feature branch or pull request gets too complicated and should be refactored into simpler pieces:
    • Create a new feature branch from the original destination branch
    • Turn commits into patches, or cherry-pick commits (leaving the changes uncommitted)
    • Apply patches or cherry-picks to the feature branch
    • Use git add --patch or git add --edit to selectively split out changes into separate commits

This post contains many common patterns applied to different workflows.

Managing Complexity

When collaborating on software, especially large software with people who are not the primary developers, it is important to limit the complexity of features and proposed changes. Why is it bad practice to propose large, complex changes?

  • It is harder to review the proposed changes
  • Bugs become more likely, and their likelihood grows far faster than the amount of code
  • More complex changes usually combine

Refactoring Large Branches

Consider the case of a large feature branch that is suffering from feature creep (trying to cram too many changes into one branch). For example, in the process of implementing a feature, you may also implement significant fixups, refactoring of functions, and code cleanup that is in the same file but not entirely related. While writing tests for the new feature, you may also refactor tests to be cleaner, or to use context manager foobar, and so on.

To illustrate: suppose you are on a branch called feature (created off of master) that consists of three sets of changes, DEF:

A - B - C (master)
    \
     D1 - E1 - D2 - F1 - E2 - F2 - D3 - F3 - E3 (feature)
  • D corresponds to implementing the new feature and writing tests for it
  • E corresponds to fixups to the same file that was changed to implement the feature
  • F corresponds to fixups to tests unrelated to the new feature

Now, if things were really so clean, or if you had a time machine or the patience to rebase commits one at a time and split them into atomic changes limited to a single scope (which would be super easy because of course your git logs are filled with helpful, concise commit messages), you could use git cherry-pick to replay commits D1, D2, D3 onto a new D branch, and so on.

But in reality, commit F1 contains a little bit of E1, and D2, and vice versa, and so on. It's much easier to navigate a diff and select pieces from it. That's where git add -e (or --edit) will help.

We also have to turn a set of commits into a single set of unstaged changes (meaning, replay the changes each commit made but not replay the commits themselves). There are a few ways to do this. We'll cover three: converting a set of commits into a set of patch files, cherry-picking commits without committing them, and rolling back a set of commits with a soft reset.

Once the commits have been rolled back and unstaged, particular changes can be staged for each split commit using git add -e and using the editor to select which changes to include or exclude from the commit. As each commit is created, branches can be created that are linked to the group of changes in each new commit.

Converting a Set of Commits to Unstaged Changes

We are trying to untangle a set of unrelated changes into separate commits that group related changes together. For the example, we want to convert this:

A - B - C (master)
    \
     D1 - E1 - D2 - F1 - E2 - F2 - D3 - F3 - E3 (feature)

to this:

A - B - C (master)
    \
     D - E - F (feature)

so that the changes in commits D, E, and F are simpler, more limited in scope, and easier to review.

We cover three strategies for turning a sequence of commits like D1-E1-...-E3 into a set of unstaged changes. Then, particular changes can be selectively added to commits using git add -e (--edit) or git add -p (--patch).

git format-patch

To create a set of patches, one per commit, so that you can edit them or apply them in various orders, use git format-patch with a commit range. A range excludes its starting commit, so start the range at the parent of D1 (D1^, which is commit B in the diagram), and use -o to choose the output directory:

git format-patch -o patches D1^..E3

This will create a series of numbered patches like

patches/0001-the-D1-commit-message.patch
patches/0002-the-E1-commit-message.patch
patches/0003-the-D2-commit-message.patch
patches/0004-the-F1-commit-message.patch
patches/0005-the-E2-commit-message.patch
patches/0006-the-F2-commit-message.patch
patches/0007-the-D3-commit-message.patch
patches/0008-the-F3-commit-message.patch
patches/0009-the-E3-commit-message.patch

Patches can be further split or modified, and can be applied in the desired order (although changes in line numbers happening out of order may confuse the program applying the patch).

Start by creating a branch from the desired commit (commit B in the diagram above):

git checkout B

(where B should be either the commit hash for commit B, or a tag or branch that is associated with commit B). Now create a branch that will start from that commit (we'll start with our branch for feature D here):

git checkout -b feature-d

Now apply patches to the new branch, which will start from commit B.

To apply a patch, use patch -p1:

patch -p1 < patches/0001-the-D1-commit-message.patch

The -p1 strips the first component from each file path, which is necessary because git prefixes the paths in its patches with a/ and b/. We use patch rather than git am to apply the patch, because git am would recreate the commit; we want to apply the changes to the working tree only, and stage just the changes we want into our next commit.

If you have a series of commits that you want to squash, that's also easy to do by applying each patch for those commits, then staging all the changes from those patches into a new commit.

As patches are applied, particular changes can be staged and commits can be crafted. Use the --edit or --patch flags of git add:

git add --edit <filename>
git add --patch <filename>

This allows selective filtering of particular edits into the next commit, so that one patch (or any number of patches) can be applied, and selective changes can be staged into a commit.

Once you are ready, just run

git commit

without specifying the filename. (If you specify the filename, git commit records the entire current contents of that file, and ignores the crafting you've done.)

As you create a commit or a set of commits specific to changeset D, you can work on the feature-d branch. When you finish all commits related to D, you can start a new branch with

git checkout -b feature-e

which starts a new branch from where the feature-d branch left off. Chaining your changes together into several small branches that build on each other will help keep pull requests simpler too.
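
The patch workflow above can be sketched end to end in a throwaway repository. This is only an illustration under stated assumptions: the file names feature.txt and cleanup.txt and the commit messages are hypothetical, the patches are written to a temporary directory outside the repository, and git apply stands in for patch -p1 (without --index or --cached it also touches only the working tree). Whole files are staged here for brevity; git add --patch would be used where one file mixes topics.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master regardless of local defaults
git config user.name "Example" && git config user.email "example@example.com"

# Commit B on master, with two hypothetical files.
printf 'feature v0\n' > feature.txt
printf 'cleanup v0\n' > cleanup.txt
git add . && git commit -q -m "B"
git tag B

# A tangled feature branch: every commit touches both topics.
git checkout -q -b feature
printf 'feature v1\n' > feature.txt
printf 'cleanup v1\n' > cleanup.txt
git commit -q -am "D1 plus a bit of E1"
printf 'feature v2\n' > feature.txt
printf 'cleanup v2\n' > cleanup.txt
git commit -q -am "D2 plus a bit of E2"

# One numbered patch file per commit after B.
patches=$(mktemp -d)
git format-patch -q -o "$patches" B..feature

# New branch starting at B; apply each patch to the working tree only.
git checkout -q -b feature-d B
for p in "$patches"/*.patch; do
    git apply "$p"                        # stand-in for: patch -p1 < "$p"
done

# Nothing is staged yet, so each topic gets its own commit and branch.
git add feature.txt
git commit -q -m "D: implement the feature"
git checkout -q -b feature-e              # continues from the tip of feature-d
git add cleanup.txt
git commit -q -m "E: cleanup"
git log --oneline B..feature-e
```

The two tangled commits end up as one commit per topic, and the final tree is identical to the tip of the original feature branch.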

The advantages of this approach include:

  • Commits can be split by applying the patch and staging particular edits
  • The ability to split single commits into more commits, or combine/squash commits together, means this approach has a lot of flexibility
  • Well suited to a long series of commits in which many small commits should be squashed and some large commits should be split

The disadvantages of this approach include:

  • Patches applied out of order can confuse the program applying the patches

cherry-pick and unstage

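An alternative to the above workflow is to use git cherry-pick to apply the changes from particular commits without committing them, using the --no-commit or -n flag. The changes are applied to both the working tree and the index, so they arrive staged; run git restore --staged <filename> to unstage them before selectively re-adding: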

git cherry-pick --no-commit <commit-hash>
git cherry-pick -n <commit-hash>

Alternatively, a range of commits can be used instead. The start of a range is excluded, so add ^ to the starting hash to include that commit:

git cherry-pick -n <commit-hash-start>^..<commit-hash-end>

This can help achieve a similar level of flexibility to the patch approach.
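
As a rough illustration of the cherry-pick approach, the following sketch replays two commits onto a new branch without committing them, then unstages the result. The file name notes.txt and the commit messages are hypothetical. Note that -n leaves the changes in the index as well as the working tree, so an explicit git restore --staged is needed before staging selectively.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master regardless of local defaults
git config user.name "Example" && git config user.email "example@example.com"

printf 'one\n' > notes.txt
git add notes.txt && git commit -q -m "B"
git tag B

# Two commits on the original feature branch.
git checkout -q -b feature
printf 'one\ntwo\n' > notes.txt && git commit -q -am "D1"
printf 'one\ntwo\nthree\n' > notes.txt && git commit -q -am "E1"

# New branch at B; replay both commits without creating any commit.
git checkout -q -b feature-d B
git cherry-pick -n B..feature             # with hashes this range is D1^..E1

staged=$(git diff --cached --name-only)   # the changes are in the index
git restore --staged .                    # move them back to unstaged
unstaged=$(git diff --name-only)
echo "staged after cherry-pick: $staged"
echo "unstaged after restore:   $unstaged"
```

From this point git add --patch or git add --edit can be used exactly as in the patch workflow.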

soft reset and commit

Suppose the commit history is simple enough that you can squash all of the commits together into a single diff set, and pick the changes to split into commits D, E, and F.

In that case, the easiest way might be to roll back all of the commits made, but preserve the changes that each commit made. This is precisely what a soft reset will do.

For the git commit history

A - B - C (master)
    \
     D1 - E1 - D2 - F1 - E2 - F2 - D3 - F3 - E3 (feature)

Run the command

git reset --soft B

to move the HEAD pointer to commit B, while also preserving all changes made from the start of the feature branch D1 to the tip of the feature branch E3, all added as staged changes (as though they had been git add-ed).

The changes will be staged, but changes to files can be unstaged using

git restore --staged <filename>

Now add changes selectively using the --edit or --patch flags:

git add --edit <filename>
git add --patch <filename>

If you stage something by mistake, unstage it again with git restore --staged and repeat the git add --edit or git add --patch step until the index holds exactly the changes for the next commit.

When done, run

git commit

with no arguments to commit the changes you made.

Refactoring Large Pull Requests

The approaches above can be useful for refactoring branches. The end result will look something like this:

A - B - C (master)
    \
     D (feature-d)
      \ 
       E (feature-e)
        \
         F (feature-f)

Now 3 pull requests can be made, one for each feature. Thanks to the refactoring above, each branch should be a more isolated set of changes that are all related, and therefore easier to review.

Chaining Pull Requests

The three D E F branches should be merged in together, since they are all related. But their changes should be kept separate to make reviewing each branch easier. To accomplish this, chain the pull requests together like so:

Pull Request 1: merge feature-d into master

Pull Request 2: merge feature-e into feature-d

Pull Request 3: merge feature-f into feature-e

In this way, each pull request only shows the changes specific to that branch.

(If each pull request were made against master, then later branches (F) would also incorporate changes from prior branches (D), resulting in messy and hard-to-review pull requests.)
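
The effect of the chosen base branch can be checked locally before opening any pull request. The sketch below assumes one hypothetical file per branch and uses git's three-dot diff, which lists what changed on the right-hand branch since it diverged from the left-hand one; this is only an approximation of what a review tool would display.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master regardless of local defaults
git config user.name "Example" && git config user.email "example@example.com"

printf 'base\n' > base.txt
git add . && git commit -q -m "C"

# Chain of branches, each adding one hypothetical file.
git checkout -q -b feature-d
printf 'd\n' > d.txt && git add d.txt && git commit -q -m "D"
git checkout -q -b feature-e
printf 'e\n' > e.txt && git add e.txt && git commit -q -m "E"
git checkout -q -b feature-f
printf 'f\n' > f.txt && git add f.txt && git commit -q -m "F"

# Files a pull request for feature-f would touch, for two choices of base.
against_parent=$(git diff --name-only feature-e...feature-f)
against_master=$(git diff --name-only master...feature-f)
echo "feature-f against feature-e: $against_parent"
echo "feature-f against master:    $(echo $against_master)"
```

Against feature-e only the F change appears; against master the D and E changes appear as well, which is the clutter that chaining avoids.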

Pull requests are reviewed and discussed, and new commits will probably be added to fix things or incorporate feedback:

A - B - C (master)
    \
     D - DA - DB (feature-d)
      \ 
       E - EA - EB (feature-e)
        \
         F - FA - FB (feature-f)

Preparing to Merge a Large Pull Request

All of your pull requests are approved and ready to merge. Now what?

Pull requests will need to be merged in reverse order (last PR is merged first - f into e, e into d, d into master). To test that things go smoothly with the first pull request (feature-f into feature-e), we should create a local E-F integration branch.

The local integration branch will have new commits if changes are needed to resolve merge conflicts or fix broken tests. Any changes made can be added to the feature-f branch and pushed to the remote, so that they are part of the pull request, making the merge into feature-e go smoothly.

To create a throwaway E-F integration branch, start from the tip of the feature-f branch, create the integration branch there, and then merge feature-e into it.

git checkout feature-f

Now we create a local E-F integration branch:

git checkout -b integration-e-f

Now we merge feature-e into integration-e-f, which at this point is at the same commit as feature-f:

git merge --no-ff feature-e

The --no-ff flag creates a separate merge commit, which is useful here to keep our commit history clean.

If merge conflicts are encountered, those can be resolved in the usual manner, and the (conflict-free) new versions of each file, reflecting changes from feature-f and feature-e, will all be present after the merge commit.

Further commits can also be made to make tests pass, with a resulting git diagram:

A - B - C (master)
    \
     D - DA - DB (feature-d)
      \ 
       E - EA - EB ----
        \              \
         F - FA - FB - EF1 - EF2 (integration-e-f)
                              ^
                             HEAD

Once the integration-e-f branch is polished and passing tests, we can re-label it as feature-f and push the new commits to the remote. To re-label integration-e-f as feature-f, assuming we're at the tip of the integration-e-f branch (where we left off above):

git branch -D feature-f
git checkout -b feature-f

and push the new commits to the remote's feature-f branch, before you merge in the pull request (feature-f into feature-e):

git push origin feature-f

Now you are ready to merge pull request 3 (F into E).
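
The integration steps for pull request 3 can be rehearsed in a scratch repository. In this sketch the file names and commit messages are hypothetical, --no-edit is added so the merge does not open an editor, and the push is left out because the scratch repository has no remote.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master regardless of local defaults
git config user.name "Example" && git config user.email "example@example.com"

printf 'base\n' > base.txt
git add . && git commit -q -m "C"
git checkout -q -b feature-e
printf 'e\n' > e.txt && git add e.txt && git commit -q -m "E"
git checkout -q -b feature-f
printf 'f\n' > f.txt && git add f.txt && git commit -q -m "F"

# Review feedback lands on feature-e after feature-f branched off.
git checkout -q feature-e
printf 'e, revised\n' > e.txt && git commit -q -am "EA"

# Throwaway integration branch from the tip of feature-f.
git checkout -q feature-f
git checkout -q -b integration-e-f
git merge -q --no-ff --no-edit feature-e  # creates merge commit EF1

# Once tests pass, re-label the integration branch as feature-f.
git branch -q -D feature-f
git checkout -q -b feature-f
# git push origin feature-f would follow here in a real repository.
git log --oneline --graph feature-f
```

Because the new feature-f tip descends from the old one, the push is an ordinary fast-forward and no force push is needed.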

Rinse and Repeat

Rinse and repeat for pull requests 2 and 1.

For Pull Request 2, we start by creating a new integration-d-e-f branch from the tip of the integration-e-f branch, like so:

git checkout integration-e-f
git checkout -b integration-d-e-f

and use the same approach of merging in the feature-d branch with an explicit merge commit:

git merge --no-ff feature-d

Work out any merge conflicts that result, and add any additional changes needed to get tests passing, and you should now have a git commit history like this:

A - B - C (master)
    \
     D - DA - DB ----------------
      \                          \
       E - EA - EB ----           \
        \              \           \
         F - FA - FB - EF1 - EF2 - DEF1 - DEF2 (integration-d-e-f)
                                            ^
                                           HEAD

Now re-label the integration-d-e-f branch as feature-e:

git branch -D feature-e && git checkout -b feature-e

Finally, push all new commits to the remote, including the new merge commit, which will make sure the pull request can be merged without any conflicts:

git push origin feature-e

Now PR 2 (E into D) can be merged.

Final Merge into Master

The last and final PR, D into master, will merge all combined feature branches into the master branch. We start with a feature-d branch that has several commits related to feature D, then several commits from merging the feature-e branch in (pull request 2, E into D), and the feature-e branch also had feature-f merged into it.

A - B - C (master)
     \
      D - D2 - DEF1 - DEF2 (feature-d)

Now we will create one more commit on the feature-d branch that is merging master into feature-d, which will help the merge happen smoothly for pull request 1 (D into master).

But first we create and switch to an integration branch at the tip of feature-d, in case things don't go smoothly and we want to throw away the merge commit:

git checkout -b integration-def-master feature-d

Create an explicit merge commit to merge master into integration-def-master:

git merge --no-ff master

Work out any merge conflicts that result, and add any additional changes needed to get tests passing, and you should now have a git commit history like this:

A - B - C (master)
     \   \---------------------
      \                        \
       D - D2 - DEF1 - DEF2 - DEF3 (integration-def-master)

where commit DEF3 is the merge commit created with the --no-ff flag.

The merge commit records the resolution of any conflicts. When you're satisfied with the merge commit, you can switch out the integration-def-master branch with the feature-d branch like so:

git branch -D feature-d
git checkout -b feature-d

Now you can push the merge commit to the remote:

git push origin feature-d

and you're now ready to merge your (conflict-free) pull request!

Tags:    git    rebase    cherry-pick    branching    version control