Managing Forks

Data.gov should avoid creating forks that it then has to own and maintain. These guidelines describe how we should manage our forks in order to:

Leverage the open source community and easily incorporate fixes and improvements from upstream
Focus on delivering value to customers and the Data.gov mission
Reduce our need for forks and contribute improvements back to the open source community
Reduce maintenance burden when forking is our only option

Avoid forking

Managing forks takes time and resources that could be spent on delivering value to customers and building toward the Data.gov mission. To manage a fork successfully, you need non-trivial knowledge of the internals of the upstream project. If you have many forks, there's a lot of code to understand and keep in your head, and every new upstream release can bring more to learn. If you don't keep up, you miss out on bug fixes, security improvements, and new features. Where possible, avoid forking.

That said, we have a lot of existing forks to maintain, so here is how we should manage them.

Push customizations upstream
Treat a fork as any other repo
Subscribe to release notifications
Use rebase instead of merge: manage the fork as a set of patches and rebase them onto new versions
Use verbose code comments to explain what was customized and why
Use clean interfaces
Track the fork in an issue

Example

TODO link to a real example to demonstrate how this works (and that it does work).

Suppose we have two features we are trying to upstream to ckanext-harvest. One feature allows custom templating for harvest email reports (feature/email-templates branch). The other adds timeouts and retry limits to job execution (bugfix/timeouts branch). Each of those should be submitted upstream as a separate PR. While we wait for those PRs to be merged, we open two PRs against our fork branch datagov/v1.2.0 and squash merge each one onto it.

Upstream requests some changes to feature/email-templates. If the changes are minor, we might make them on feature/email-templates but not on datagov/v1.2.0. We'll get those changes once the feature branch is merged upstream. If the changes are major, we might want to open another PR against our fork branch and squash merge it.

Now suppose upstream has accepted our bugfix/timeouts PR. A new version, v1.2.1, is released and we get a notification. We create a branch (datagov/v1.2.1) from this tag and then rebase datagov/v1.2.0 onto datagov/v1.2.1. Since bugfix/timeouts is already included in v1.2.1, that commit is dropped during the rebase. Congratulations! That's one less commit we have to maintain. During this rebase, we might also decide to combine or alter a few commits so that they make logical sense.
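The version bump described above could be scripted roughly as follows. This is a minimal sketch, not our actual tooling: the function name `rebase_plan` and the remote name `upstream` are assumptions. It also copies the old fork branch before rebasing so that datagov/v1.2.0 stays untouched, which is a small variation on the steps above. It only builds and prints the git commands; nothing is executed.

```python
from typing import List


def rebase_plan(old_version: str, new_version: str, remote: str = "upstream") -> List[List[str]]:
    """Return the git commands that move our patches onto a new upstream tag.

    Assumptions (hypothetical, not taken from existing tooling): upstream
    tags look like "v1.2.1", the upstream remote is named "upstream", and
    fork branches follow the datagov/{upstream-version} convention.
    """
    old_branch = f"datagov/{old_version}"
    new_branch = f"datagov/{new_version}"
    return [
        # Fetch the new release tag from upstream.
        ["git", "fetch", remote, "--tags"],
        # Start the new fork branch as a copy of the old one, so the old
        # shipping branch is left intact.
        ["git", "checkout", "-b", new_branch, old_branch],
        # Replay our patches on top of the new tag. Git skips a patch when
        # upstream already contains the same textual change; if upstream
        # altered the patch while merging it, expect to resolve conflicts
        # or drop the commit by hand.
        ["git", "rebase", new_version],
    ]


if __name__ == "__main__":
    for command in rebase_plan("v1.2.0", "v1.2.1"):
        print(" ".join(command))
```

Keeping the plan as data makes it easy to review the commands in a PR description before anyone runs them.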

Note: Rebasing is somewhat opaque, since a PR can't really show a diff between two rebases of the patch set across versions (although you could link to a separate diff that better summarizes the changes between versions). As an alternative, it might make sense to create individual PRs for each bugfix/feature onto the new fork branch for each upstream release. It's a lot of work but might be clearer. We will assess this process and re-evaluate as we try it out.

Once the new fork branch is passing tests, it's ready to be incorporated downstream.

Push customizations upstream

The goal of the fork is to get rid of the fork. We should be pushing all of our customizations upstream so that we don't have to maintain them. Ultimately it is at upstream's discretion which changes to accept, but there might be ways to get rid of our forks other than pushing upstream. For example, bugfixes are likely to be accepted upstream. New features might be more appropriate in a separate plugin. Maybe there is a different feature upstream that approximates what we need.

Submit bug reports and feature requests upstream and keep an open dialog about the best approach to getting your changes incorporated.

Treat a fork as any other repo

Forks are code we own and maintain. They should meet the same quality standards we hold for app repos that we wrote ourselves. That means forks should have automated tests with continuous integration.

TODO link to repo checklist.

We should use existing automated tests where possible. We must use a GSA-approved CI system (CircleCI or GitHub Actions). If no tests exist, we should open an issue with upstream to add them, and consider implementing them ourselves.

Subscribe to release notifications

We should be notified of new versions of upstream via #datagov-notifications. This allows us to keep up with security updates, bug fixes, and new features.

Use rebase instead of merge

With a fork, we are really maintaining a set of patches on top of a stable version of upstream. When a new version of upstream is released, we should be rebasing our patches on top of the new version rather than doing a traditional merge.

Branches we ship should be named datagov/{upstream-version} which allows us to easily switch to new versions downstream by specifying the new branch. This also keeps our fork branches separate from feature or bugfix branches which we should be submitting upstream. These branches are not for upstream.
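To illustrate the naming convention, here is a small sketch of helpers that build a shipping branch name from an upstream tag and pick the newest shipping branch from a list of branch names. The helper names are hypothetical, and the code assumes upstream tags are "v" followed by dot-separated integers (for example v1.2.1); a project with a different tag scheme would need a different pattern.

```python
import re
from typing import Iterable, Optional, Tuple

FORK_PREFIX = "datagov/"
# Assumption: upstream tags look like "v1.2.1".
_VERSION_RE = re.compile(r"^v(\d+(?:\.\d+)*)$")


def version_key(tag: str) -> Tuple[int, ...]:
    """Turn "v1.10.0" into (1, 10, 0) so versions compare numerically."""
    match = _VERSION_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized upstream tag: {tag!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def fork_branch(upstream_tag: str) -> str:
    """Name the shipping branch for an upstream tag."""
    version_key(upstream_tag)  # raises ValueError on an unexpected tag
    return FORK_PREFIX + upstream_tag


def latest_fork_branch(branches: Iterable[str]) -> Optional[str]:
    """Pick the newest datagov/ branch, ignoring feature and bugfix branches."""
    tags = []
    for name in branches:
        if name.startswith(FORK_PREFIX):
            tag = name[len(FORK_PREFIX):]
            if _VERSION_RE.match(tag):
                tags.append(tag)
    if not tags:
        return None
    return FORK_PREFIX + max(tags, key=version_key)


if __name__ == "__main__":
    print(fork_branch("v1.2.1"))
    print(latest_fork_branch(["feature/email-templates", "datagov/v1.2.0", "datagov/v1.2.1"]))
```

Comparing versions as integer tuples rather than strings means v1.10.0 sorts after v1.9.3.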

When there are Data.gov-specific bugfixes or features, they should be squash merged onto the shipping branch to keep the number of patches small and the related changes together. When rebasing onto upstream, it can be hard to resolve merge conflicts if related changes are spread across multiple commits. Using squash commits should help.

Use verbose code comments

When making changes, be verbose in your comments. Remember, this repo might require its own domain expertise and might be rarely touched. Assume the reader needs to be brought up to speed on how the upstream code works and how your change accomplishes its goal.
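As a sketch of the level of detail we have in mind, the comment below explains a retry limit like the one in the bugfix/timeouts example. Everything in it is illustrative: the function, the default of 3, and the "upstream behavior" it describes are invented for this example and are not a description of any real project.

```python
def should_retry(failed_attempts: int, max_attempts: int = 3) -> bool:
    """Decide whether a failed job goes back on the queue."""
    # DATAGOV CUSTOMIZATION (illustrative; the upstream behavior described
    # here is imaginary, not a claim about any real project).
    #
    # Background: upstream puts every failed job back on the queue with no
    # upper bound, so a permanently broken source is retried forever.
    #
    # Why we changed it: endless retries keep workers busy and delay the
    # jobs that could actually succeed.
    #
    # What we changed: a job is retried only while it has failed fewer than
    # max_attempts times. Upstream's code path is otherwise untouched.
    #
    # Tracking: link the fork's tracking issue and the upstream PR here so
    # the next reader can check whether this patch is still needed.
    return failed_attempts < max_attempts


if __name__ == "__main__":
    print(should_retry(2))
    print(should_retry(3))
```

The comment is longer than the code on purpose: it records the background, the reason, and the change, which is what someone rebasing this patch a year from now will need.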

Use clean interfaces

Where possible, use clean and appropriate interfaces when making changes. For example, use configuration options instead of hardcoding values. Move custom code to a helper module and import the helper where needed to reduce merge conflicts.
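For example, a job timeout could be read from configuration inside a small helper instead of being hardcoded in an upstream file. This is a minimal sketch under stated assumptions: the option name `datagov.harvest.job_timeout_seconds`, the default of 60 seconds, and the helper itself are hypothetical, and `config` is a plain mapping standing in for whatever configuration object the application provides.

```python
from typing import Mapping

# Hypothetical option name and default, chosen for illustration only.
TIMEOUT_OPTION = "datagov.harvest.job_timeout_seconds"
DEFAULT_TIMEOUT_SECONDS = 60


def job_timeout_seconds(config: Mapping[str, str]) -> int:
    """Read the job timeout from configuration, falling back to a default."""
    raw = config.get(TIMEOUT_OPTION)
    if raw is None or str(raw).strip() == "":
        return DEFAULT_TIMEOUT_SECONDS
    timeout = int(raw)
    if timeout <= 0:
        raise ValueError(f"{TIMEOUT_OPTION} must be positive, got {timeout}")
    return timeout


if __name__ == "__main__":
    print(job_timeout_seconds({}))
    print(job_timeout_seconds({TIMEOUT_OPTION: "120"}))
```

If this helper lives in its own module, the patch to the upstream file shrinks to one import and one call, which leaves less surface for conflicts during a rebase.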

Track the fork

Create a GitHub issue to track the fork so that we can prioritize the work to upstream it against our backlog. Be sure to document why the fork exists (what are the features/bugs you're addressing). It's easy to lose track of why something was forked in the first place. The original reason might not be valid anymore.
