CI/CD Insights and Analytics

Datadog CI Visibility - an alternative solution

What CI visibility is, why you should care about it, and how Foresight, an alternative to Datadog CI visibility, can fit into your development process
Ismail Egilmez
5 mins read

This article explains what CI (continuous integration) visibility and shift left observability are, why you should care about them, and how Foresight, an alternative to Datadog CI visibility, can be applied to your development process.

Here are the benefits that CI visibility brings:

  • Reduce lead time
  • Reduce mean time to resolution
  • Increase mean time between failures
  • Increase developer productivity

Let’s start with what CI observability is

Observability is the practice of making it easier to understand and monitor the behavior of systems, services, and applications. This can be achieved through the use of tools such as logs, metrics, and tracing to provide visibility into the inner workings of a system.

By incorporating observability principles into your CI workflows, you can gain a better understanding of how your workflows are functioning, identify bottlenecks and performance issues, and make informed decisions about how to optimize your workflows for faster builds.

For example, you can use observability data to understand what jobs are taking the longest to complete and use this information to optimize your workflow or split it into multiple jobs that can be run in parallel. You can also use this data to track the performance of your workflows over time and identify trends or patterns that may indicate a problem.
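As a rough illustration of that first step, the sketch below ranks jobs by their average duration. The `job_runs` records and their field names are hypothetical stand-ins for whatever your CI provider exports; this is not Foresight's or Datadog's data model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical job records, e.g. exported from your CI provider.
job_runs = [
    {"job": "build", "duration_s": 310},
    {"job": "unit-tests", "duration_s": 620},
    {"job": "integration-tests", "duration_s": 1450},
    {"job": "build", "duration_s": 290},
    {"job": "unit-tests", "duration_s": 580},
    {"job": "integration-tests", "duration_s": 1550},
]


def slowest_jobs(runs, top_n=3):
    """Return (job, average duration in seconds) pairs, slowest first."""
    durations = defaultdict(list)
    for run in runs:
        durations[run["job"]].append(run["duration_s"])
    averages = {job: mean(values) for job, values in durations.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:top_n]


ranking = slowest_jobs(job_runs)
for job, avg in ranking:
    print(f"{job}: {avg:.0f}s on average")
```

The job at the top of the ranking is usually the first candidate for splitting into parallel jobs.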

Overall, adding observability to your CI workflows can help you improve the performance and reliability of your workflows, which can ultimately lead to faster builds and more efficient development processes.

Observability is shifting left indeed

So what is shift left observability? It is really nothing more than the practice of detecting application problems earlier in the software development life cycle. In other words, shift left observability is a way to avoid shipping bugs, security issues, etc. into production. It helps us detect deficiencies earlier.

Looking at the continuous integration/continuous delivery (CI/CD) cycle, many things happen on the development side, and many things happen on the operational side. Traditionally, most of the observability (monitoring) we do happens in the latter half of that cycle. We tend to focus more on production.

However, the shift left observability concept is ultimately about bringing observability to the continuous integration side.

  • How can we enable developers to own observability responsibilities such that we can capture the bug earlier?
  • How can we minimize the cost of a production bug?
  • How can we prevent hundreds of customers from being affected by a deficiency?
  • How can we monitor our CI, the black-box part of CI/CD?

These are some of the questions that you can answer by applying observability to your CI processes.

Two real examples

Foresight is a CI visibility tool born out of these concerns at Thundra. It is an alternative to Datadog CI visibility. Shift left observability helped us detect some problems earlier. Let me give a couple of real-world examples of issues that our engineering team at Foresight was able to detect once we started using Foresight in our own workflows.

A small change can introduce a huge problem

My first real-world example is about a small change to a SQL query. A developer opened a small PR (pull request) that made a minor change to a SQL query, believing that it would not break or change anything in the system.

Then he noticed that a simple workflow running an integration test, which exercises the entry point that calls that SQL query, ran almost seven times slower than before. Its duration went up from 3 seconds to 20 seconds, but it went through the CI, and the test passed. So normally, we wouldn’t have detected it unless Foresight let us know.

Luckily, we had set up a latency alert on Foresight for some critical tests, and that is how we got notified.

The overall CI workflow run latency did not raise any flags, because an extra 17 seconds is negligible at that level. But this specific SQL query was 7 times slower. So what could have happened? The consequences could have been harsh. For instance, the response to a button on our app would have been 7 times slower. It would have affected the end users if we had shipped it to production unnoticed.
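To make the idea of a per-test latency alert concrete, here is a minimal sketch that compares the latest duration of a test against the median of its recent runs. The function name, the `max_ratio` threshold, and the sample durations are assumptions for illustration; this is not how Foresight implements its alerts.

```python
from statistics import median


def is_latency_regression(history_s, latest_s, max_ratio=2.0):
    """Flag the latest run if it exceeds max_ratio times the historical median."""
    if not history_s:
        return False  # no baseline to compare against yet
    baseline = median(history_s)
    return latest_s > baseline * max_ratio


# Hypothetical recent durations (seconds) of one critical integration test.
recent_durations = [3.1, 2.9, 3.0, 3.2, 2.8]

print(is_latency_regression(recent_durations, 20.0))  # True: about 7x the baseline
print(is_latency_regression(recent_durations, 3.4))   # False: within normal variation
```

Using the median rather than the mean keeps a single slow outlier in the history from inflating the baseline.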

You can’t notice if you don’t monitor

The second example is also from a real experience we had on our own team. Every developer makes small changes all the time. It was one of those cases. One of our developers changed a dependency and committed the PR.

It would have gone directly to production because it did not change or break anything. Without Foresight, everything would have looked as expected and as stable as ever, until we saw the ruinous bill. But this scenario didn’t happen. We got notified right after the PR was opened.

It turned out that the container image had grown from 103 MB to 1.9 GB after that change. The CPU, disk I/O, and network metrics went through the roof. That is why Foresight let us know about this anomaly.

Imagine not having an observable CI in this case. This would most likely have gone unnoticed. Maybe a few weeks later, somebody would accidentally see something wrong and spend a few hours trying to figure out which PR introduced this regression. And the consequences would be severe.

You should care about these!

Apart from making your CI workflows observable and being able to detect, troubleshoot, and debug issues like a breeze, there are some more topics that software teams should care about.

Reduce Lead Time

You can reduce the lead time with CI observability. 

Developers always complain: CI is slow, what's taking so long, and so on. And the DevOps, SRE, or platform teams keep adding more and more CPU and memory.

With CI visibility, you start monitoring these things. You can look into which job is slow. Is it the test, the build, or the deploy? You can pinpoint what is actually taking long, why it is taking long, and when it changed. Has it only recently started taking long, or did it become slow a month ago?

So by applying observability in our CI workflows, in other words, earlier in the development process, we can improve how our teams work and enable them to ship better software.

Reduce MTTR

By introducing observability into our CI workflows, we can significantly reduce the mean time to resolution. Let me show you how.

This is a PR in our production system. By looking at a single screen, we can have an idea about the workflow and test success/failure status, durations, costs, and if there are any untested code changes in that specific PR.

Before Foresight, as a developer, when a test failed, all I saw was that a test had failed and that I got an error 500 from the service I was talking to. If I wanted to fix that test, I had to try to reproduce the error. This meant fetching the PR I was trying to test, spinning up whatever dependencies I had, rerunning the test, and then figuring out what was actually happening.

With Foresight, we have an overview of our PRs to make sure everything is on track. Looking at the success/failure status, duration, cost, and test gap, we can make an initial decision on whether to deep dive into a PR.

Increase MTBF

Mean time between failures is one of the crucial metrics that most organizations closely watch. By using Foresight instead of Datadog CI visibility, you can have a good understanding of your CI workflow and have actionable insights into it.
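For readers who have not computed the metric before, the sketch below shows one simplified way to derive MTBF for a CI workflow: the average gap between consecutive failed runs. It ignores repair time, the timestamps are made up, and the function name is hypothetical; it is not a formula taken from Foresight or Datadog.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(failure_times):
    """Average gap between consecutive failures, or None with fewer than two."""
    if len(failure_times) < 2:
        return None
    ordered = sorted(failure_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical timestamps of failed workflow runs.
failures = [
    datetime(2023, 1, 2, 9, 0),
    datetime(2023, 1, 2, 15, 0),
    datetime(2023, 1, 3, 9, 0),
]

mtbf = mean_time_between_failures(failures)
print(mtbf)  # 12:00:00
```

Tracking this value per workflow over time shows whether reliability is improving or degrading.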

The biggest reason workflows fail is flaky tests. A developer pushes the PR, the build starts, and it fails. But he knows his code doesn’t break anything, so he hits the retry button. After a couple of failures, maybe on the third or fourth attempt, the test passes. It was just a flaky test.

So by applying observability in our CI workflow, we can actually start looking into which tests are flaky, and which PR introduced that flakiness. We can start investigating how flaky a test is. Does it flake once in every hundred executions or every fourth time it runs? How long does this test take to run? Is this a test that takes 30 minutes to run or just 3 seconds?
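One common working definition of flakiness is a test that both passes and fails on the same commit. The sketch below applies that definition to a handful of made-up results; the record layout and the `flaky_rate` helper are assumptions for illustration, not Foresight's detection logic.

```python
from collections import defaultdict

# Hypothetical records: (test name, commit SHA, passed?)
results = [
    ("test_login", "a1", False),
    ("test_login", "a1", True),   # retry on the same commit passed
    ("test_login", "b2", True),
    ("test_checkout", "a1", True),
    ("test_checkout", "b2", False),
    ("test_checkout", "b2", False),  # fails consistently: broken, not flaky
]


def flaky_rate(records):
    """Per test, the share of commits on which it both passed and failed."""
    outcomes = defaultdict(lambda: defaultdict(set))
    for test, commit, passed in records:
        outcomes[test][commit].add(passed)
    rates = {}
    for test, by_commit in outcomes.items():
        flaky_commits = sum(1 for seen in by_commit.values() if len(seen) == 2)
        rates[test] = flaky_commits / len(by_commit)
    return rates


rates = flaky_rate(results)
for test, rate in rates.items():
    print(f"{test}: flaky on {rate:.0%} of commits")
```

Separating "fails consistently" from "fails intermittently" matters because the first points at the code change and the second at the test itself.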

Increase developer productivity

That is efficiency. If we stop blocking our developers with CI latencies and spare them from spending countless hours trying to find a small root cause, developer efficiency will start increasing on its own. Instead of pushing HR to hire more and better developers and spending time onboarding them, wouldn’t it be better to equip the developers we have with the best tooling tailored to their needs and give them an incentive to perform better?

Shift Left Observability Leads to an Observable CI

Monitoring your CI workflows, tests, builds, stages, jobs, etc. brings confidence in your software development process. By reducing MTTR and lead time, and by increasing MTBF and developer productivity, you raise the quality of your software.

Preventing production regressions is possible by applying observability earlier in the development life cycle. Tools like Foresight and Datadog CI visibility let you have an observable CI. There is a lot you can achieve by applying observability to your CI workflows.

Why you should use Foresight over Datadog for CI observability

If you’ve read this far, I want to emphasize the differences between Foresight and Datadog CI visibility, since some of you might be considering one of these products for your own CI environment.

Dedicated solutions for open-source projects

Some well-known open-source projects choose to monitor their CI environments with Foresight. OpenTelemetry, Keycloak, PyPA, and Mathjs are some of them.

Foresight provides insightful comments on your PRs on GitHub. This provides a number of benefits, including helping you improve your workflows. PR comments can help automate the code review and deployment process. They can help engineering teams improve their code quality, enhance collaboration, and streamline their development workflow.

In these PR comments, Foresight also highlights changes that may have a major negative impact on your workflows. See the documentation for more information.

Detailed PR and cost analysis

Foresight aggregates the workflow, test, and test gap results of a pull request in a single dashboard. Foresight's PR analysis helps you maintain your contributions safely and keep them clean for your projects.

You can see whether the new code is tested enough and whether it affects your test performance. You can also detect the changes that slow down your time to merge. You can find answers to some of the most common questions, such as:

  • What is the most executed workflow in a PR?
  • What is the cost of the workflow runs in a PR?
  • What is the average execution duration of each workflow in a PR?
  • What is the performance of the latest workflow run?
  • Which tests block the pipeline the most?
  • What is the average execution duration of the most failed tests in a PR?

Apart from these, Foresight has a generous free plan and reasonable pricing for private projects. In addition, the support team responds quickly and can provide a custom SLA upon request.

Flexible pricing that scales with your team, and free for open-source!
