3 Ways to Cut CI Costs in 2022
This article covers three ways to reduce the cost of your CI environments by optimizing the tests running in your CI pipeline.
- Dig into Test Error Root Causes
- Test Gap Analysis
- Prioritizing Tests
Some Data Points First
In The Cost of Poor Software Quality in the US: A 2020 Report, the Consortium for Information & Software Quality identifies the cost of finding and fixing bugs or defects as the largest spending item in the software lifecycle. In the 2018 report’s section on “Computer and IT occupations in the US today,” the Consortium estimated that $500 billion was being spent on finding and fixing defects.
In The Economics of Software Quality, Capers Jones and Olivier Bonsignour express the cost of low-quality software in another devastating way.
Almost 10 years ago, the population of software engineers in the United States was around 2.5M, and on any given day almost half of them spent their time finding and fixing bugs caused by poor software quality and poor quality control.
The above data points clearly show the direct effect of finding and fixing bugs on the cost of the software development life cycle.
Let’s look at the reasons for high CI costs before diving into the details.
What is the reason for high CI costs?
There are various variables that cost you money in a CI pipeline, such as infrastructure, but the essential reasons for high CI costs are test failures and latencies in build processes.
Let’s assume that there are 20-100 developers in your organization. They commit their code on their branches and open pull requests (PR) every day.
Let’s also assume that you have 100 tests in your CI pipeline consisting of unit and integration tests.
Another assumption is that you have a well-configured CI pipeline that is triggered at the push of a button to merge a PR into the base branch.
Here comes the nub of the story. A build process runs the 100 tests in 120 minutes on average. But this time, the build fails after 119 minutes, and a pile of logs lands in front of the developer who opened the PR.
All the other developers are lined up with multiple PRs and the engineering manager responsible for that CI pipeline starts to do a slow burn.
The developer who received the test error tries to reproduce the issue in her local environment, but everything seems to work fine! She puzzles over the code and opens another PR.
And the story repeats over and over again…
Every developer can feel how harsh this story is. As the experience clearly shows, test failures can lead to high CI pipeline costs. Not only does your bill bloat, but your development time is wasted as well.
There is also another scenario in this short story, where the tests take forever to execute before ending in success or failure. Your tests can behave differently every time they run because you do not control the infrastructure they run on.
Either way, your CI costs increase just because of undesired outcomes of your automated tests, such as failures and latencies.
Re-running the failed tests
What do we do when our build fails because we have failed tests? The common practice is to push the run button again! It is an ancient belief that when you restart or rerun malfunctioning software, it starts working like magic.
Reproducibility of the issue in the code is an important aspect. According to recent research, more than 91% of developers give up on rerunning failed tests and admit to having defects caused by those failures.
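Blind reruns rarely tell you much on their own, but a structured rerun can at least separate a flaky test from a consistently failing one. Here is a minimal sketch in Python, assuming your test is invocable as a plain callable (`run_test` is a hypothetical stand-in for your real test runner):

```python
def classify_failure(run_test, reruns=3):
    """Rerun a failed test to separate flaky failures from consistent ones.

    run_test: a zero-argument callable returning True on pass, False on fail.
    Returns "flaky" if any rerun passes, "consistent" if every rerun fails.
    """
    for _ in range(reruns):
        if run_test():
            return "flaky"   # passed on a rerun: likely an environment or timing issue
    return "consistent"      # failed every time: a reproducible defect

# A deterministic failure is classified as consistent...
print(classify_failure(lambda: False))  # -> consistent
# ...while a test that passes on rerun is flagged as flaky.
print(classify_failure(lambda: True))   # -> flaky
```

A "consistent" result is the useful signal here: it tells you the failure is worth debugging at its origin rather than rerunning yet again.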
Similarly, reproducing a bug in production is a daunting task. In order to solve a problem on code running in production, developers have to gather information relating to the problem and write test cases and/or reproduce the problem on their local machines to understand the root cause. This is definitely a time-consuming process. Moreover, this is an inconvenient and even annoying process for the end users.
Furthermore, when reproducing production deficiencies does not help, software vendors often send staff to the customer site to try to resolve the problem on-site. This certainly costs customers time, money, and resources.
Long CI process
Speed is the key to modern software development. A CI pipeline employs a variety of tools and techniques to automate builds and software testing. Test automation can consist of many types of tests, such as sniff or unit testing, along with more comprehensive integration or end-to-end testing.
A CI pipeline provides rapid and detailed feedback to the developer. At each iteration, bugs are identified, located, reported, and corrected without harming the production environment and naturally the end users.
Ultimately, even when a CI workflow ends with a successful build but takes longer than expected, it may lead to significant delays in the whole CI/CD process and software delivery lifecycle. The root cause of delays in CI workflows needs to be addressed and optimized to avoid high costs in all aspects.
What is the Impact of Test Failures?
Tests fail. Actually, tests are meant to fail, because failing is the best and easiest way to experiment with the code. But test failures have consequences. When the codebase of a software product or an organization is developed and maintained continuously, test failures slow down the development process. Releases of new functionality, fixes, or enhancements to existing functionality get delayed because of test failures whose root cause is hard to discover.
Production may fail
With the advent of continuous integration in software development, the bugs encountered in production environments are limited, but problems with building and running tests are compounded by many dependencies and configuration issues. This calls for a solution that addresses the visibility problem of CI and makes it possible to troubleshoot tests when a build fails in a CI pipeline.
The flow of shipping to production must be smooth, error-free, and fast in CI/CD pipelines. Every new PR has to be covered by tests in order to prevent bugs from falling off the radar.
If defects are detected in production instead of early in the release cycle, the likely result is end-user dissatisfaction, revenue loss, or even loss of reputation.
Reproducing test failures impedes velocity. Software engineers spend an average of 14 hours finding and fixing a single failure in their backlog. When a test fails, getting the bug to reproduce is the biggest time sink in finding and fixing it. Most of the time, developers report that reproducing the bug behind a test failure is pretty daunting.
In a recent study in 2021 among software developers, 82% of the developers say they could release software 1-2 days faster if reproducing the test failures wasn’t an issue and 33% say they could deliver their tasks at least 2 weeks faster.
How to Optimize CI Test Costs?
Software teams aim to release as fast as possible but there are various reasons that hamper velocity. Slow builds, failing and flaky tests, and clogged CI pipelines end up with slow release cycles. As a result, productivity falters, costs increase, and end users might be impacted.
The cost of unoptimized CI pipelines shows up in many ways, such as infrastructure costs, vendor costs, network costs, and, worst of all, the cost of decreased developer productivity. Here are some ways to optimize CI pipeline costs, and naturally performance as well.
- Monitor the CI workflows to identify failures fast and easily. A common challenge for software organizations is ensuring that all stakeholders like developers, DevOps engineers, SREs, and management have across-the-board visibility into the software delivery process. Having actionable analytics on your CI workflows helps to identify issues and enhance organization-wide visibility on your workflows.
- Trace CI processes to understand the root cause of failures at the kernel level. Tracing the “critical path,” meaning the time spent by the process itself but not by its child processes, helps to pinpoint the root cause of errors in CI workflows, jobs, steps, and builds.
- Monitor the CI workflow metrics such as CPU load, memory usage, network I/O, and disk I/O for each workflow run to be able to catch the peaks and valleys.
- Detect untested code changes and do not let them go to production without being tested or reviewed. Understand where the test gaps are and the impact of code changes to move on safely and fast to production.
- Do not run every single test for every pull request. Pinpoint what parts of the code are changing and what percentage is tested. As a result, just running the tests covering only the changed part of the code saves a lot of time. Additionally, prioritizing tests that are more likely to fail makes an impact on time and costs as well.
- Test early and test often to reduce the cost of failures that might otherwise appear late in the software development life cycle.
- Use ephemeral dockerized services or tools to automate container management for your tests. These services wake the containers before your tests run and stop them after your tests are completed. This way, you end up saving time and money.
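The idea of running only the tests that cover changed code can be sketched in a few lines. This assumes a hypothetical file-to-tests mapping (`test_map`), which in practice could be derived from a per-test coverage run:

```python
def select_tests(changed_files, test_map):
    """Pick only the tests that cover the changed files.

    test_map is a (hypothetical) mapping from source file to the tests
    that exercise it, e.g. built from a previous coverage run.
    """
    selected = set()
    for path in changed_files:
        selected.update(test_map.get(path, []))
    return sorted(selected)

# Illustrative mapping: which tests exercise which source files
test_map = {
    "app/billing.py": ["test_invoice", "test_tax"],
    "app/auth.py": ["test_login"],
}
print(select_tests(["app/billing.py"], test_map))
# -> ['test_invoice', 'test_tax']
```

The mapping is the hard part in practice; coverage tools that record which test executed which file can generate it automatically.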
Dig into Test Error Root Causes
Long-running test suites and frequently failing tests are the most common reasons for slow build times and, hence, reduced deployment frequency. You need visibility into test runs that enables teams to quickly debug test failures, detect flaky tests, identify slow tests, and visualize performance over time to spot trends.
The most important approach to debugging any software defect is to debug it at its origin. More specifically, if a bug is found or even suspected in the testing phase, it is best to find out the reason why and debug it right away without changing the environment or variables in the system.
In a cloud environment set up for a continuous integration pipeline, staging environments mimic production environments. When a build process fails because of erroneous tests, trying to reproduce the error in local environments is a costly and inefficient practice.
Slow build times end up reducing deployment frequency, mostly because of long-running test suites and frequently failing tests. Being able to identify which tests block the CI pipeline increases productivity and indirectly decreases costs. Identifying erroneous tests swiftly leads to spending less time fixing bugs.
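As a small illustration of spotting slow tests, most CI test runners can emit a JUnit-style XML report, which is easy to rank by duration. The report content below is made up for the example:

```python
import xml.etree.ElementTree as ET

# Invented JUnit-style report; in CI this would be the runner's output file
JUNIT_XML = """<testsuite>
  <testcase name="test_login" time="0.4"/>
  <testcase name="test_checkout" time="95.2"/>
  <testcase name="test_search" time="12.7"/>
</testsuite>"""

def slowest_tests(junit_xml, top=2):
    """Return the slowest test cases from a JUnit-style XML report."""
    root = ET.fromstring(junit_xml)
    cases = [(tc.get("name"), float(tc.get("time", 0)))
             for tc in root.iter("testcase")]
    return sorted(cases, key=lambda c: c[1], reverse=True)[:top]

print(slowest_tests(JUNIT_XML))
# -> [('test_checkout', 95.2), ('test_search', 12.7)]
```

Tracking this ranking over time is what turns a one-off measurement into a trend you can act on.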
Test Gap Analysis
It becomes tough to follow which PR impacts which parts of the application, whether those parts are well covered by tests, and how risky it is to move on. Code reviewers need to understand where the test gaps are and what the impact of code changes is in order to make healthy decisions.
Engineering leaders can ease their jobs by incorporating an automated code-change impact report into their change management process. This allows them to make risk-based decisions about which test gaps need to be closed before releases. Having an automated test gap analysis report that tells how risky it is to move on with missing coverage accelerates change management.
Moreover, developers, QAs, and testers need code-level visibility at the PR level on every workflow run, which makes it much easier to write tests for the gaps. An automated test gap analysis system correlates changes to the codebase with test coverage reports and determines how much of the change is covered by tests. Nothing will fall off the developers’ radar this way.
Prioritizing Tests
Slow releases breed underperforming developers. Waiting in long CI and testing queues, and being blocked on stability, uptime, and error-reduction work, decreases developers’ productivity. And the higher the development velocity, the greater the chance of downtime and errors.
In order to save time, money, and effort, you should prioritize the tests that are more likely to fail. This way, you can save a great amount of time compared to a long testing cycle. For instance, you should be able to choose not to execute irrelevant tests. With this in hand, you can boost your productivity, decrease time to production, and lessen the time spent reviewing code changes.
When you automate your testing process, prioritizing manually or by guesswork is not viable. You need a deep analysis of your tests to determine the best possible test execution order. A system that prioritizes tests according to their importance, flakiness, and failure rates eventually allows you to create a unique and reliable test order.
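As a rough sketch of such a system, tests can be ordered by their recent failure rate so the most failure-prone ones run first and a broken build is detected as early as possible. The history data below is invented for the example:

```python
def prioritize(history):
    """Order tests so the most failure-prone run first.

    history: {test name: list of recent outcomes, True = passed}
    Tests with higher recent failure rates are scheduled earlier,
    so a failing build is detected as soon as possible.
    """
    def failure_rate(outcomes):
        return outcomes.count(False) / len(outcomes) if outcomes else 0.0
    return sorted(history, key=lambda t: failure_rate(history[t]), reverse=True)

# Hypothetical outcome history from the last few CI runs
history = {
    "test_login": [True, True, True, True],
    "test_checkout": [True, False, False, True],  # failure-prone
    "test_search": [True, True, False, True],
}
print(prioritize(history))
# -> ['test_checkout', 'test_search', 'test_login']
```

A real system would also weigh flakiness and business importance, but even this simple ordering shortens the time to the first failure signal.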
We discussed three ways to optimize the costs of your CI environments in this article: digging into test error root causes, performing test gap analysis, and prioritizing tests in your CI pipelines.
In today’s DevOps practices, CI/CD pipelines are mostly automated and maintained by highly skilled engineers. Even so, bugs may slip through the cracks and show up in production out of nowhere.
Observability is widespread in continuous delivery, but there is a lack of observability in continuous integration pipelines. Optimizing CI pipelines for cost, workforce, time, or anything else is highly achievable with a good observability system dedicated to CI workflows. Tools like Foresight are made for CI observability and differ from APMs, error tracking systems, or code coverage tools.