My Ideal, simple CI/CD workflow

In the past few years I’ve had several chances to set up CI/CD workflows for various projects, both for my employer and for clients. Each time I’ve iterated on my ideas, and I’ve developed a picture of what would be ideal for those projects.

I take for granted the following:

The project is version controlled using Git.
The Git repo is hosted on something like GitHub, and something like GitHub Actions is available (without stringent time/compute limits).

The Workflow

You have two branches: main and develop
- main and develop are both protected and require approved PRs to merge new code.
- Developers work on USER/... prefixed branches and make PRs to develop.
PRs to develop run tests, linting, etc.
- (Optional) A way to spin up an ephemeral test environment to validate changes and behavior manually.
Once code is pushed/merged into develop:
1. The full test suite, linting, and type-checking are run and must pass
2. An “artifact” is built (typically a Docker Image)
3. [Assuming Docker image] The image is tagged with the Git SHA and pushed to a registry repo configured to delete old versions automatically (or with similar cleanup).
4. The “artifact” is deployed to the dev environment.
A PR from develop to main
- (Optional) Integration tests are run pointed at the dev environment
- It’s expected that manual testing has already occurred, but if not, this is when it would be done.
- Final code review is done; easy-to-read Git commits make this much easier.
develop is merged into main:
1. Full test suite + lint + type check are run
2. A new “production” build is made and pushed to the “prod” repo
  - This is for images which are expected to go to production
  - For now this is typically tagged with the new Git SHA of main
3. This new build is deployed into a “staging” environment automatically
  - Depending on the system/app this can be an “internal/small prod” or an isolated environment matching production.
4. (Optional) Integration testing happens to check this deploy in staging
  - If integration testing passes, you have the option of allowing automatic deploys to production
Once changes in staging are accepted (via manual or automated testing), a manual process to deploy can happen:
- (Optional) Create a new Git tag for the release
  - e.g. semantic versioning, incremental version, date + number
  - The artifact in staging is tagged with this same tag; e.g. Docker image Company/App:GITSHA becomes Company/App:TAG
- Production’s configuration is updated to use the same artifact used in staging (referenced by the release tag if you used the tagging step).
- The changes to production configuration are deployed automatically.
- (Optional) A step monitors key metrics and deployment health and reverts changes if the deploy does not work
The last step is bringing any changes made while deploying to staging/production back into the develop branch (by rebasing) so that it stays up to date.
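The branch rules above can be summarised as a small lookup from "what happened" to "what runs". This is only a sketch: the step names (`deploy:dev`, `push:prod-registry`, and so on), the `Event` type, and the `plan` and `image_ref` helpers are hypothetical and not tied to any particular CI system. The production deploy is left out on purpose, since in this workflow it is triggered manually.

```python
from dataclasses import dataclass

# Hypothetical step names; map them onto whatever your CI system calls its jobs.
CHECKS = ["test", "lint", "type-check"]


@dataclass(frozen=True)
class Event:
    kind: str    # "pull_request" or "push" (a merge counts as a push)
    target: str  # branch the PR targets, or the branch that received the push


def plan(event: Event) -> list[str]:
    """Return the ordered steps the workflow runs for a given event."""
    if event.kind == "pull_request" and event.target == "develop":
        return CHECKS
    if event.kind == "pull_request" and event.target == "main":
        # Optional in the workflow: integration tests pointed at the dev environment.
        return ["integration-test:dev"]
    if event.kind == "push" and event.target == "develop":
        return CHECKS + ["build", "push:dev-registry", "deploy:dev"]
    if event.kind == "push" and event.target == "main":
        return CHECKS + ["build", "push:prod-registry", "deploy:staging",
                         "integration-test:staging"]
    return []  # USER/... branches run nothing until a PR is opened


def image_ref(repo: str, git_sha: str) -> str:
    """Tag images with the commit SHA so each artifact traces back to one commit."""
    return f"{repo}:{git_sha}"
```

Writing it out like this makes it easy to see that the only difference between the develop and main pipelines is the registry and the target environment.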

Benefits of this workflow

“Ha, ok. Yes really simple.”

Okay, yes, it’s more complicated than rsyncing the files over or logging in to run git pull, but in my opinion this is the simplest and most flexible workflow that follows best practices.

What practices are those?

Run automated tests, linting, etc. as early as possible to catch easy bugs and problems.
Require/encourage code review of changes early in the process.
Changes are staged (in develop) so they can be tested holistically as they’d be deployed in production.
Changes are merged to main which then builds a kind of “release candidate” artifact which is the exact set of bytes deployed to production.
That set of bytes is tested in an environment that lets us make a best-effort check of whether the deploy would work in production (it’s never a guarantee).
Production deployment is simply changing which set of bytes is being used (typically which Docker image to use for the containers).
Many of these tests and deployments happen in a CI/CD system, which can be secured once and allows developers to do deployments without having production access or credentials.
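The "exact set of bytes" idea can be illustrated with a toy model: an artifact is identified by a hash of its contents, and promoting to production only copies that identifier from staging. This is a sketch under assumptions; the `Environments` class and the `digest` helper are hypothetical, and a real setup would use the image digest reported by your registry rather than hashing bytes yourself.

```python
import hashlib


def digest(artifact: bytes) -> str:
    """Content address for an artifact, similar in spirit to a Docker image digest."""
    return "sha256:" + hashlib.sha256(artifact).hexdigest()


class Environments:
    """Tracks which artifact digest each environment currently runs."""

    def __init__(self) -> None:
        self.current: dict[str, str] = {}

    def deploy(self, env: str, artifact_digest: str) -> None:
        self.current[env] = artifact_digest

    def promote(self, source: str, target: str) -> str:
        """Point `target` at the digest already running in `source`; nothing is rebuilt."""
        artifact_digest = self.current[source]
        self.deploy(target, artifact_digest)
        return artifact_digest


envs = Environments()
candidate = digest(b"contents of the release candidate build")
envs.deploy("staging", candidate)
envs.promote("staging", "production")
```

Because promotion never rebuilds, whatever was tested in staging is byte-for-byte what ends up in production.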

This process works for whatever kind of build process you have. A Docker Image is the most common these days, but it could easily be building a VM, static binary, or set of static files.

This template has some easy changes you can make depending on how you prefer things:

You could merge to main and then do PRs to a production branch. You could even cut “version branches” like release/2.x to allow maintaining multiple versions (although this workflow is optimized for a “single rolling version” typical of a SaaS product/system)
You could skip the staging step and simply promote artifacts from dev, but this makes it harder to know which artifacts were actually production ready and possibly deployed. It’s a little riskier, but probably fine early in the product’s life when deployments to production are more frequent.
You may want a “hotfix” workflow that allows changes to merge directly into main for emergency fixes. The first “hotfix” workflow I’d build is a “rollback” one which switches to the previous artifact in production.
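A rollback workflow can be as small as "remember what was deployed, and switch back one entry". Here is a minimal sketch, assuming a hypothetical `DeployHistory` record per environment and made-up image tags; where that history actually lives (a config repo, the deploy tool, the registry) depends on your setup.

```python
from typing import Optional


class DeployHistory:
    """Ordered record of artifacts deployed to one environment (newest last)."""

    def __init__(self) -> None:
        self._deployed: list[str] = []

    @property
    def current(self) -> Optional[str]:
        return self._deployed[-1] if self._deployed else None

    def deploy(self, artifact: str) -> None:
        self._deployed.append(artifact)

    def rollback(self) -> str:
        """Drop the current artifact; the previous one becomes current and is returned."""
        if len(self._deployed) < 2:
            raise RuntimeError("no previous artifact to roll back to")
        self._deployed.pop()
        return self._deployed[-1]


production = DeployHistory()
production.deploy("company/app:v1")  # hypothetical tags
production.deploy("company/app:v2")
rolled_back_to = production.rollback()
```

Since the previous artifact already ran in production, rolling back needs no build and no merge, which is what makes it a good first emergency tool.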

Some things to think about

It’s worth creating a “bot” user which can be used to identify commits made by the CI system, and which gives external systems an identity whose access can be limited to only the company’s repositories. This also avoids tying anything to a specific employee’s GitHub account or similar.
It’s nice/helpful to keep Terraform/Pulumi IaC code inside the same repo as the project (at least to manage specific resources for this project) since it allows making infrastructure changes/commits inside the same repo.
Ensure that the concurrency features of your CI/CD system are properly configured to block parallel deployments and auto-cancel tasks which would duplicate work.
- Be particularly careful about how and when tasks are cancelled. Deployments should be executed serially and never cancelled once they’ve started.
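The two concurrency rules pull in opposite directions, so it can help to see them side by side. The sketch below is a toy model, not a real CI API: the `Scheduler` class and its method names are hypothetical. Check runs for a branch are superseded by newer commits, while deployments are queued and run one at a time in request order. In GitHub Actions, for example, the corresponding knob is the workflow-level `concurrency` setting with its `cancel-in-progress` option.

```python
from collections import deque


class Scheduler:
    """Toy model: checks may be cancelled, deployments never are."""

    def __init__(self) -> None:
        self.active_checks: dict[str, str] = {}  # branch -> commit being checked
        self.cancelled: list[str] = []           # commits whose checks were superseded
        self.deploy_queue: deque[str] = deque()  # pending deployments, oldest first
        self.deployed: list[str] = []            # completed deployments, in order

    def start_checks(self, branch: str, commit: str) -> None:
        """A newer commit on the same branch supersedes the run in progress."""
        previous = self.active_checks.get(branch)
        if previous is not None:
            self.cancelled.append(previous)
        self.active_checks[branch] = commit

    def request_deploy(self, artifact: str) -> None:
        """Deployments are queued behind any earlier request instead of replacing it."""
        self.deploy_queue.append(artifact)

    def run_deploys(self) -> None:
        """Drain the queue one deployment at a time, in request order."""
        while self.deploy_queue:
            self.deployed.append(self.deploy_queue.popleft())


scheduler = Scheduler()
scheduler.start_checks("develop", "aaa111")
scheduler.start_checks("develop", "bbb222")  # supersedes the aaa111 run
scheduler.request_deploy("company/app:aaa111")
scheduler.request_deploy("company/app:bbb222")
scheduler.run_deploys()
```

The asymmetry is deliberate: cancelling a test run only wastes compute, while cancelling a deployment halfway can leave an environment in an unknown state.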

Conclusion

If I joined a company doing something different from this workflow, that would not automatically make it bad. This is a suggestion of an ideal workflow that has not too many moving parts, but which makes it easy to ensure code is tested and reviewed. Once those reviews occur, it allows the system to take over deploying to progressively more important environments with some degree of safety (via integration tests).

I consider this a template that I can start with and modify as needed.

Generally, I think it’s important to remember that you’re making a trade-off between deploying as quickly as possible and performing checks to ensure the deployment will not cause an outage. There is also the question of how long it takes to deploy to production, independent of how long k8s or similar takes to do the rollout.