In the past few years I’ve had several chances to set up CI/CD workflows for various projects, both for my employer and for clients. Each time I’ve iterated on my ideas, and I’ve developed a picture of what the ideal workflow for those projects would be.
I take for granted the following:
- The project is version controlled using Git.
- The Git repo is hosted on something like GitHub, and something like GitHub Actions is available (without stringent time/compute limits).
- You have two branches, main and develop; both are protected and require approved PRs to merge new code.
- Developers work on USER/... prefixed branches and make PRs to develop, which run tests/linting/etc.
- (Optional) A way to spin up an ephemeral test environment to validate changes and behavior manually.
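
To make the branch rules concrete, here is a minimal Python sketch of the PR policy described above. The function name `pr_allowed` and the regex for personal branches are assumptions made for illustration; in practice these rules are usually enforced by the Git host’s branch protection settings rather than by custom code.

```python
import re

# Assumed convention for illustration: personal branches look like "<user>/<topic>".
PERSONAL_BRANCH = re.compile(r"^[a-z0-9_-]+/.+$")


def pr_allowed(head: str, base: str) -> bool:
    """Return True if a PR from `head` into `base` fits the workflow."""
    if base == "develop":
        # Only user-prefixed working branches may target develop.
        return PERSONAL_BRANCH.match(head) is not None
    if base == "main":
        # Only develop is ever merged into main.
        return head == "develop"
    return False
```

For example, `pr_allowed("alice/fix-login", "develop")` is accepted, while a PR from the same branch straight into main is rejected.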
Once code is pushed/merged into develop:
- Full tests/linting/type-checking are run and must pass.
- An “artifact” is built (typically a Docker image).
- [Assuming Docker image] Docker images are tagged with the Git SHA and pushed to a repo configured to delete old versions automatically (or similar cleanup).
- The “artifact” is deployed to the dev environment.
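
A minimal sketch of the tagging step, assuming Docker and a full 40-character SHA-1 commit hash. The registry host `registry.example.com` and the image name `myapp` in the usage note are placeholders, and `image_ref` and `build_and_push_commands` are hypothetical helper names; the commands are only returned here, not executed.

```python
import re


def image_ref(registry: str, name: str, git_sha: str) -> str:
    """Build an image reference tagged with the full Git SHA."""
    # Assumption: SHA-1 repositories, so a commit hash is 40 hex characters.
    if not re.fullmatch(r"[0-9a-f]{40}", git_sha):
        raise ValueError(f"not a full Git SHA: {git_sha!r}")
    return f"{registry}/{name}:{git_sha}"


def build_and_push_commands(ref: str, context: str = ".") -> list[list[str]]:
    """Return the docker CLI invocations a CI job would run for this image."""
    return [
        ["docker", "build", "-t", ref, context],
        ["docker", "push", ref],
    ]
```

A CI job could call `image_ref("registry.example.com/dev", "myapp", sha)` with the commit it is building and pass each returned command to `subprocess.run`. Tagging by SHA means any running container can be traced back to the exact commit it was built from.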
A PR from develop to main is created:
- It’s expected that manual testing has already occurred, but if not this is when that’d be done.
- Final code review is done; easy-to-read Git commits make this much easier.
- (Optional) Integration tests are run pointed at the dev environment.
develop is merged into main:
- Full test suite + lint + type check are run.
- A new “production” build is made and pushed to the “prod” repo.
  - This is for images which are expected to go to production.
  - For now this is typically tagged with the new Git SHA.
- This new build is deployed to a “staging” environment.
  - Depending on the system/app this can be an “internal/small prod” or an isolated environment matching production.
- (Optional) Integration testing happens to check this deploy in staging.
  - If integration testing passes, there is the option to allow automatic deploys to production.
Once changes in staging are accepted (via manual or automated testing), a manual process to deploy can happen:
- (Optional) Create a new Git tag for the release.
  - e.g. semantic versioning, incremental version, date + number
  - The artifact in staging is tagged with this same tag, e.g. the Docker image.
- Production’s configuration is updated to use the same artifact used in staging (use the proper name if using the tag step).
- The changes to production configuration are deployed automatically.
- (Optional) A step monitors key metrics and deployment health and reverts changes if the deploy does not work.
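
Two pieces of this step are easy to sketch in Python: choosing a “date + number” release tag, and reverting when the health check fails. Everything here is illustrative: `next_release_tag`, `deploy_with_revert`, the tag format, and the `set_image` / `healthy` callbacks are assumptions, standing in for whatever your deploy tooling and monitoring actually provide.

```python
from datetime import date
from typing import Callable


def next_release_tag(existing: list[str], today: date) -> str:
    """Return a 'date + number' tag, e.g. 2024.05.17-1 for the first release that day.

    Assumes every existing tag for today is well formed (ends in '-<integer>').
    """
    prefix = today.strftime("%Y.%m.%d")
    used = [int(t.rsplit("-", 1)[1]) for t in existing if t.startswith(prefix + "-")]
    return f"{prefix}-{max(used, default=0) + 1}"


def deploy_with_revert(
    current: str,
    candidate: str,
    set_image: Callable[[str], None],
    healthy: Callable[[], bool],
) -> str:
    """Point production at `candidate`; switch back to `current` if unhealthy.

    Returns the artifact that is live afterwards.
    """
    set_image(candidate)
    if healthy():
        return candidate
    set_image(current)  # revert to the artifact that was live before
    return current
```

In a real pipeline `healthy` would poll metrics for some window before answering; here it is a single call to keep the sketch short.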
The last step is rebasing any changes made while deploying to staging/production back into the develop branch to keep it up to date.
“Ha, ok. Yes really simple.”
Okay, yes, it’s more complicated than copying the files over or logging in to run git pull, but in my opinion this is the simplest and most flexible workflow that follows best practices.
What practices are those?
- Run automated tests, linting, etc. as soon as possible to catch easy bugs and problems.
- Require/encourage code review of changes early in the process.
- Changes are staged (in develop) so they can be tested holistically as they’d be deployed in production.
- Changes are merged to main, which then builds a kind of “release candidate” artifact which is the exact set of bytes deployed to production.
- That set of bytes is tested in an environment that allows us to make a best effort at checking whether the deploy would work in production (it’s never a guarantee).
- Production deployment is simply changing which set of bytes is being used (typically which Docker image to use for the containers).
- Many of these tests and deployments happen in a CI/CD system which can be secured once, allowing developers to do deployments without having production access or credentials.
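
The “changing which set of bytes” idea can be sketched as a plain mapping from environment to artifact. This is only an illustration under assumptions: `promote`, `rollback_target`, and the shape of the config are hypothetical, and a real setup would store this in deployment manifests or IaC rather than a Python dict.

```python
def promote(config: dict[str, str], source: str, target: str) -> dict[str, str]:
    """Return a new config where `target` runs the exact artifact `source` runs."""
    updated = dict(config)  # leave the caller's config untouched
    updated[target] = config[source]
    return updated


def rollback_target(history: list[str]) -> str:
    """Return the artifact that was live before the current one.

    `history` lists production artifacts oldest first; the last entry is live.
    """
    if not history:
        raise LookupError("nothing has been deployed yet")
    current = history[-1]
    for artifact in reversed(history[:-1]):
        if artifact != current:
            return artifact
    raise LookupError("no earlier artifact to roll back to")
```

Nothing is rebuilt in either direction: promoting staging to production and rolling back are both just a change of pointer to an artifact that already exists.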
This process works for whatever kind of build process you have. A Docker Image is the most common these days, but it could easily be building a VM, static binary, or set of static files.
This template has some easy changes you can make depending on how you prefer things:
- You could merge to main and then do PRs to a production branch. You could even cut “version branches” like release/2.x to allow maintaining multiple versions (although this workflow is optimized for the “single rolling version” typical of a SaaS product/system).
- You could skip the staging step and simply promote artifacts from dev, but this makes it harder to know which artifacts were actually production ready and possibly deployed. It’s a little riskier, but probably fine early in the product’s life when deployments to production are more frequent.
- You may want a “hotfix” workflow that allows changes to merge directly into main for emergency fixes. The first “hotfix” workflow I’d build is a “rollback” one which switches to the previous artifact in production.
- It’s worth creating a “bot” user which can be used to identify commits made by the CI system, and which gives you a way to limit external systems’ access to only the company’s repositories. This also avoids tying anything to a specific employee’s GitHub account or similar.
- It’s nice/helpful to keep Terraform/Pulumi IaC code inside the same repo as the project (at least to manage resources specific to this project), since it allows making new changes/commits inside the same repo.
- Ensure that the concurrency features of your CI/CD system are properly configured to block parallel deployments and auto-cancel tasks which would duplicate work.
- Be particularly careful about how and when tasks are cancelled. Deployments should be executed serially and never cancelled once they’ve started.
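
The concurrency rule above boils down to a small decision, sketched here in Python. The `Job` type, the `"check"` / `"deploy"` kinds, and the returned action strings are all assumptions for illustration; a real CI system expresses this through its own configuration rather than code like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    kind: str   # "check" (tests/lint) or "deploy"
    group: str  # e.g. branch name for checks, environment name for deploys


def on_new_job(running: Job, new: Job) -> str:
    """Decide what to do with `new` while `running` is still in progress.

    Returns "run", "cancel-running", or "wait".
    """
    if (running.kind, running.group) != (new.kind, new.group):
        return "run"  # unrelated work can proceed in parallel
    if new.kind == "deploy":
        return "wait"  # deployments are serial and never cancelled
    return "cancel-running"  # a newer check run makes the older one redundant
```

In GitHub Actions this roughly corresponds to the `concurrency` setting, with `cancel-in-progress` enabled for check workflows and disabled for deployment workflows.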
If I joined a company doing something different from this workflow, that would not automatically make their approach bad. This is a suggestion of an ideal workflow that has not-too-many moving parts, but which makes it easy to ensure code is tested and reviewed. Once those reviews occur, the system can take over deploying to progressively more important environments with some degree of safety (via integration tests).
I consider this a template that I can start with and modify as needed.
Generally I think it’s important to remember that you’re making a trade-off between deploying as quickly as possible and performing checks to ensure the deployment will not cause an outage. There is also the question of how long it takes to deploy to production, independent of how long k8s or similar takes to do that.