
Questions buildviz is trying to answer

This page adds some background to the graphs offered by buildviz. As they are linked from within the tool, please only change the titles of the following questions if you are also willing to make a matching pull request :) Do feel free, though, to extend the answers and share your own experience.

Runtime

When are we getting final feedback on changes?

Continuous Integration and Continuous Delivery both stress the importance of a rapid feedback loop. We only know if a change is any good once it has passed all our checks and gates. The faster your pipeline does that, the faster you can move on.

Has a job gotten considerably slower?

One job's runtime can have a big impact on the overall pipeline's speed (see When are we getting final feedback on changes?), especially if it is a bottleneck. Keeping an eye on which part is getting slower helps mitigate the creeping slowdown of the pipeline.
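
One way to notice such a creeping slowdown is to compare recent build runtimes against the job's earlier history. A minimal sketch, not buildviz's actual implementation; the window size and threshold are arbitrary assumptions:

```python
from statistics import median

def has_slowed_down(runtimes, window=10, threshold=1.25):
    """Flag a job whose recent builds run notably slower than its history.

    `runtimes` is a chronological list of build runtimes in seconds; the last
    `window` builds are compared against the earlier baseline.
    """
    if len(runtimes) <= window:
        return False  # not enough history to compare against
    baseline = median(runtimes[:-window])
    recent = median(runtimes[-window:])
    return recent > baseline * threshold

# has_slowed_down([300, 310, 295, 305, 290, 300, 310, 305, 295, 300, 400, 410,
#                  420, 405, 415, 395, 410, 400, 420, 405, 410])  # -> True
```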

Has a job gotten suspiciously fast?

A job suddenly becoming much faster might be a godsend, or just a broken configuration.

Where is time wasted in the pipeline?

Slow builds can be annoying when we are impatiently waiting for results. Worse, though, are those cases where a changeset sits between two chained builds: the first has already finished, but the change has yet to be picked up by the next. Such wait times can point to bottlenecks, and showing them brings this otherwise hidden waste to our attention. As most build tools support triggering steps multiple times in parallel, the reason for such waits more likely stems from a local resource that cannot handle more than one change at a time (e.g. code deployed to a fixed environment).
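
A rough way to make this wait visible, assuming we know when the upstream build finished and when the downstream build it triggered started (the timestamps below are only illustrative, not buildviz's data model):

```python
from datetime import datetime

def wait_between_builds(upstream_end, downstream_start):
    """Time a change sat idle between two chained jobs."""
    return downstream_start - upstream_end

# Example: the downstream job only picked the change up 25 minutes after the
# upstream job had finished.
print(wait_between_builds(datetime(2016, 4, 28, 10, 5),
                          datetime(2016, 4, 28, 10, 30)))  # 0:25:00
```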

Where are multiple changes possibly queuing up for processing?

A bottleneck in the pipeline might make more than one change queue up, so multiple changes pass through together once the job becomes free again. If the build subsequently fails, it becomes harder to reason about which change introduced the failure.

A simplified calculation: with one change, the cause of a failure is a 50/50 bet between the change itself and outside factors, whereas with two changes packed together you are less likely to pick the right cause on the first attempt (a 33/33/33 split).
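
Put as a formula, under this simplified model where each of the $n$ batched changes and one outside factor are equally likely culprits:

$$P(\text{a given candidate caused the failure}) = \frac{1}{n + 1}$$

so $n = 1$ gives the 50/50 bet above and $n = 2$ the 33/33/33 split.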

Where is most of the time spent?

Tests and checks provide benefits, but they come with costs. Adding more gates to a pipeline provides more points to guard against bad changes and, if done right, contributes a lot towards better quality. However, the time to feedback is probably as much of a concern as the quality of the feedback itself, so balancing the amount of checks is important. An overview of all the builds helps to understand where the time is spent.

Where is the time spent in testing?

Whether unit or integration tests, making sure that tests are fast helps maintain quick feedback cycles. An overview of the different packages/files and a detailed drill-down into where the time is spent help keep slower pieces in check and find good candidates for optimisation. As test setup, which can add considerable time to the overall runtime, is often shared between multiple tests, looking at classes of tests helps find quick wins for improvement.
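
As an illustration of such a drill-down, here is a minimal sketch that sums up runtime per test class from a JUnit-style XML report (a format many test runners emit; the file path and attribute handling are just assumptions for the example):

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

def runtime_per_class(junit_xml_path):
    """Sum up the reported runtime per test class from a JUnit XML report."""
    per_class = defaultdict(float)
    for testcase in ET.parse(junit_xml_path).getroot().iter("testcase"):
        per_class[testcase.get("classname", "unknown")] += float(testcase.get("time", 0))
    return sorted(per_class.items(), key=lambda entry: entry[1], reverse=True)

# Slowest classes first, e.g.
# [("integration.DatabaseTest", 182.4), ("unit.ParserTest", 3.1), ...]
```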

What could be the first place to look at to improve test runtime?

Having mentioned optimisation by classes/files in the previous question: sometimes there are those odd slow tests, and flagging them provides a quick way of improving the outliers. They also offer an exemplary insight into the state of the code base.

Failure

What is the general health of the build system?

A broken build means something is wrong. While broken production code will introduce a bug once it goes live, a broken test implementation means one sanity check less, and over time less trust in the build/test system ("Broken Window syndrome"). Either way, fix broken builds immediately.

How much are we stopping the pipeline?

Failing builds require our manual intervention, pausing and delaying the things we are currently working on. A breaking build commonly halts the pipeline, preventing further testing of any subsequent changes.

While failing builds and tests provide immensely important value, too many failures might indicate a deeper issue.

How quickly can we resume the pipeline after failure?

A failing test or build needs time to be investigated and understood. This is followed by time to fix the issue, and then by the re-run of the job and all previous stages, which adds to the overall time. If this process takes too long, it not only takes up much of the investigating party's time but also impedes all other users of the pipeline.
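
One way to keep an eye on this is to measure the span from the first build of a failure streak to the next passing build. A minimal sketch over hypothetical build records (chronological `(end_time, outcome)` tuples), not buildviz's internal model:

```python
def time_to_resume(builds):
    """Durations from the start of each failure streak until the next passing build.

    `builds` is a chronological list of (end_time, outcome) tuples,
    with outcome being "pass" or "fail".
    """
    durations, failing_since = [], None
    for end_time, outcome in builds:
        if outcome == "fail" and failing_since is None:
            failing_since = end_time
        elif outcome == "pass" and failing_since is not None:
            durations.append(end_time - failing_since)
            failing_since = None
    return durations

# time_to_resume([(t1, "pass"), (t2, "fail"), (t3, "fail"), (t4, "pass")])
# -> [t4 - t2]
```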

What needs most manual intervention?

Similar to How much are we stopping the pipeline?, most if not all failures need manual intervention, cutting into development time. The jobs that fail more often might be exposing underlying issues. One thing is for sure: if a job fails too often for no good reason, folks will start just re-running it in the hope that it passes next time. It only goes downhill from there.

Where are the biggest quality issues?

and

Where do we receive either not so valuable or actually very valuable feedback?

A failing test can be a false or a true positive. We want to minimise the former and maximise the latter. A job that fails often is either badly built itself, or just really effective at catching problems. Either way, it probably makes sense to understand the underlying source of the failures.

Stability

Where are implicit dependencies not made obvious?

A flaky job (same input, different output) is a direct result of an uncontrolled dependency. If the changing state of the dependency is a welcome one, the input should be declared to buildviz. Most likely though, the dependency is intended to be stable and yield reproducible results, but does not do so.

While there is often no easy path to the source of the flakiness, software and most hardware are pretty reliable, so seemingly nondeterministic behaviour most often stems from a fault in the design/implementation of the underlying system being built (in other words, it "is just a simple user error").

Also see Which tests provide questionable value and will probably be trusted the least?.
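
Following the definition above (same input, different output), flaky jobs can be flagged by grouping builds by their recorded inputs and looking for mixed outcomes. A minimal sketch over hypothetical build records, not buildviz's own implementation:

```python
from collections import defaultdict

def flaky_jobs(builds):
    """Return job names that produced different outcomes for identical inputs.

    `builds` is a list of dicts like
    {"job": "deploy", "inputs": "rev-abc123", "outcome": "pass"}.
    """
    outcomes_seen = defaultdict(set)
    for build in builds:
        outcomes_seen[(build["job"], build["inputs"])].add(build["outcome"])
    return {job for (job, _), outcomes in outcomes_seen.items() if len(outcomes) > 1}

# A job in the result passed and failed on the very same input revision,
# i.e. it is flaky by the definition above.
```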

Which jobs will probably be trusted the least?

Flaky jobs are quickly abandoned mentally by teams. Just like in the story of the Boy Who Cried Wolf, a build that screams "fail" two times without reason will not be believed the third time. Worst-case scenario: the smoke test after a live deployment has been red on and off, and is now ignored by the team.

A massively flaky job can even provide net-negative value, as it taints the perception of the whole pipeline and radiates to other jobs. There is not much choice: either remove the job or invest the time to fix it.

Which tests provide questionable value and will probably be trusted the least?

Flaky tests quickly undermine the credibility of your test setup ("Broken Window syndrome"), cut into development time and delay possibly important changes (e.g. hotfixes). In short: there's nothing good coming out of a flaky test.

Reasons for flakiness can be manifold: random test data, date/time issues, infrastructure availability, latency, manual intervention, uncontrollable third-party changes... If the test proves hard to fix, consider removing it.