Detecting Flaky Tests in CI Pipelines
Flaky tests reduce your confidence in your codebase. When tests fail intermittently, for no discernible reason, it casts doubt on your entire test suite, leaving you wondering which test will be the next to fail unexpectedly.
What’s worse is that due to the nature of these flaky failures, the offending change can be challenging—and expensive—to track down and fix. Instead of reacting to these failures after they’ve occurred, we need to find where the failure was introduced and, from there, identify the root cause.
In this article, we’ll detect flaky tests caused by intermittent failures in a testbed application. We will use a method of error diagnosis that helps us get to the problem changeset more quickly, identifying flaky tests in CI.
Testbed Application Details
To demonstrate the method used to detect flaky tests, I’ve built a small microservice application using the Serverless Framework and AWS Step Functions. This application has the following features:
- One Lambda function that fetches a list of currently playing movies from The Movie Database
- One Lambda function that accepts the response from the first function and extracts the title from each entry, returning a list of film titles.
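The second function can be sketched roughly like this. This is a hypothetical illustration, not the testbed’s actual source: the handler name `extractTitles` is assumed, and the `results`/`title` field names follow the shape of The Movie Database’s now-playing response.

```javascript
// Hypothetical sketch of the second Lambda: given the "now playing"
// payload from the first function, return just the film titles.
const extractTitles = async (event) => {
  // TMDb-style payloads keep the movie entries under `results`.
  const movies = event.results ?? [];
  return { titles: movies.map((movie) => movie.title) };
};

module.exports = { extractTitles };
```

Keeping the handler a pure transformation of its input makes it easy to unit test, which matters later when we start hunting for flaky behavior.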
Detecting Flaky Tests
To detect flaky tests in our CI pipeline, we’ll use Foresight for monitoring and analytics on the application’s CI pipeline. Foresight integrates with your application’s GitHub repository and other CI platforms, seamlessly consuming build events and tracking their success or failure.
To integrate with Foresight, follow these steps:
- Sign up for an account
- Connect your CI pipeline
- Install Foresight's GitHub app
- Choose repositories to watch
- Upload your test reports
The entire process can be completed in under 2 minutes, and it requires nothing more than an email to get started.
Once you’ve connected your GitHub repository to the Foresight platform, the Foresight dashboard will automatically populate with information from your configured CI/CD system. All you need to do is dig in and find the root cause of any problems you’re experiencing.
Intermittent failures can arise from multiple potential sources. The following list, while not exhaustive, represents a good portion of the failure classes you’re likely to see in your test suite:
- Time-order dependencies: The success of your tests may depend on the order in which they’re run. This leads to highly coupled test suites that are prone to random failures stemming from relatively minor changes.
- Resource dependencies: If your tests depend on a finite resource, such as a connection pool or disk space, that resource may be exhausted while the test suite runs. This can cause failures unrelated to any code changes being evaluated.
- Concurrency issues: If you’re working in a concurrent execution environment, your tests might interact in unexpected ways as each thread follows its own control flow path. Logical concurrency errors, such as objects that are not thread-safe, can lead to random failures due to an unpredictable order of execution on the processor.
- External resource dependencies: If the test depends on a third-party service, this could lead to seemingly random failures when the third party is having availability problems.
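To make the first of these classes concrete, here is a minimal sketch (all names are hypothetical, not from the testbed) of a time-order dependency: two tests share a mutable cache, so their outcome depends entirely on the order the runner executes them in.

```javascript
// Hypothetical sketch of a time-order dependency between two tests.
// "Test A" seeds a shared cache as a side effect of what it checks.
function testA(cache) {
  cache.push('now-playing');
  return cache.length > 0; // A's own assertion always passes
}

// "Test B" silently assumes test A has already run.
function testB(cache) {
  return cache[0] === 'now-playing';
}

const sharedCache = [];
// Original order: both pass, and the coupling goes unnoticed.
console.log(testA(sharedCache), testB(sharedCache)); // true true
// A reordered (or parallelized) run executes B against fresh state:
console.log(testB([])); // false -- the "random" failure appears
```

Nothing in either test changed between the passing and failing runs; only the execution order did, which is exactly why this class of flakiness is so hard to spot from the failing test alone.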
Given that the potential problem domain for flaky tests is very wide, there’s never going to be one fix that covers all test cases. Each failure needs to be evaluated and addressed on its own merits. Often, it’s not the flaky test that is problematic, but the influence of surrounding tests.
When tests start failing, it’s important to identify when the failure began to occur and not just ask “Why is this test failing?” While this is a valid question, in the instance of flaky tests it can lead you down a rabbit hole, where every step works perfectly but the end result is still a failed test suite.
If you’re able to identify where a failure started, you can narrow down the potential changes that led to the error being introduced. This vastly reduces the scope of root cause investigation, speeding the time to resolution for tests with consistency issues.
At the absolute minimum, this time-focused investigation leads you to a “last known good” version of your application, which you can redeploy in a pinch.
Introducing an Intermittent Failure to Our Testbed
To demonstrate the method to detect flaky tests, we’re going to deliberately add a flaky test. This test will check for the presence of the CI environment and, if found, it’ll have a 1-in-20 chance of failing. This is to simulate a live flaky test—in this situation we don’t care why the failure occurred, just that there is one.
Below is the basic Jest test to add to handler.test.js:
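A sketch along those lines follows. The helper name `rollForFailure` is hypothetical, and the `typeof test` guard is only there so the snippet also runs outside of Jest; the essential behavior is the one described above: detect the CI environment, then fail roughly 1 run in 20.

```javascript
// Deliberately flaky test: only misbehaves when the CI env var is set
// (most CI systems, including GitHub Actions, set CI=true), and then
// only on roughly 1 run in 20.
function rollForFailure(env = process.env, roll = Math.random()) {
  if (!env.CI) return false; // never flake on a developer machine
  return roll < 1 / 20;      // ~5% of CI runs fail
}

// Register the test when running under Jest; harmless under plain node.
if (typeof test === 'function') {
  test('intentionally flaky in CI', () => {
    if (rollForFailure()) {
      throw new Error('simulated intermittent failure');
    }
  });
}

module.exports = { rollForFailure };
```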
This test is a low-effort proxy for a more complex test. If we wanted to expand this into a more robust demonstration of the failure scenario, we could encapsulate a failure case behind a web service that we control and have our test call that instead. There are limitless possibilities, depending on your goals in debugging.
Now, we’ll run the build over and over again until we produce a failure. It’s one of the few times in software where breaking things is the point!
Tracking Down the Issue with Foresight
Navigating to our project in Foresight, we can immediately see that our build is having issues:
Click on the Test Runs view, which shows the results of the last several test runs.
Using the detail view, find the last known successful test.
As you can see, after integrating Foresight, we can view the performance history of our single test. The view above shows the latest 10 executions of the test: 5 out of 10 failed, even though we know we haven’t changed our code. From this, we can easily identify the test as flaky and see that it’s blocking our CI pipeline.
Intermittent failures add time and stress to a development team. Due to the many potential sources for a flaky test, there’s no one guaranteed pathway to resolving the issue, even when it’s finally discovered. In these situations, the speed of diagnosis is key to the resolution.
Sign up for Foresight and let it help you quickly debug flaky tests and regain confidence in your code.