TEST RELIABILITY

Flaky Tests: Why Your Green Dashboard Is Lying to You

QAShift Engineering | June 2, 2026 | 6 min read

Ask any engineer what they do when a known-flaky test fails and they will tell you the truth: they re-run it. Ask them what they do when a *new* test fails in a suite full of flaky ones, and they will tell you the same thing. That is the real cost of flakiness — not the wasted compute, but the learned reflex of distrusting red.

Industry studies consistently find that 15–30% of test failures in mature suites are caused by the test environment rather than by a real defect in the product. Left unmanaged, that noise floor rises until the suite is a formality everyone scrolls past.

The four root-cause classes

Nearly every flaky test falls into one of four buckets. Timing: the test asserts before the application settles — animation, debounce, network. Isolation: the test depends on state left behind by another test, and fails when execution order changes. Environment: staging data resets, third-party rate limits, clock-dependent logic. Selector fragility: the test finds elements by markup details that change with every redesign.
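The isolation class is the easiest to demonstrate in a few lines. The sketch below is illustrative only: the shared `store` dictionary stands in for a database or cache, and the names `case_create_user`, `case_read_user`, `run_in_order`, and `find_order_dependence` are hypothetical, not part of any test framework. It shows a test that passes in one order and fails in the other, plus a simple shuffle loop that surfaces the hidden dependency.

```python
import random

# Hypothetical shared state standing in for a database or cache.
store = {}


def case_create_user():
    store["user"] = "alice"
    assert store["user"] == "alice"


def case_read_user():
    # Hidden dependency: only passes if case_create_user already ran.
    assert store.get("user") == "alice"


def run_in_order(cases):
    """Run the cases against fresh shared state; return names of failures."""
    store.clear()
    failures = []
    for case in cases:
        try:
            case()
        except AssertionError:
            failures.append(case.__name__)
    return failures


def find_order_dependence(cases, attempts=20, seed=0):
    """Shuffle the order repeatedly and collect every case that ever fails."""
    rng = random.Random(seed)
    suspects = set()
    for _ in range(attempts):
        order = list(cases)
        rng.shuffle(order)
        suspects.update(run_in_order(order))
    return suspects


print(run_in_order([case_create_user, case_read_user]))  # passes in this order
print(run_in_order([case_read_user, case_create_user]))  # the reader fails
```

A test that only fails under some orderings is an isolation flake by definition, so randomizing order in CI is a cheap way to find these before they find you.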

The fix differs per class, which is why "just re-run it" is not a strategy. Timing flakes need explicit waits on application state, not sleeps. Isolation flakes need per-test data setup. Environment flakes need scheduling awareness — if your search index rebuilds at 08:30, don't assert latency at 08:31. Selector flakes need semantic selectors: roles, labels, test IDs.
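To make "explicit waits on application state, not sleeps" concrete, here is a minimal polling helper. It is a sketch under stated assumptions: `wait_until` and `FakeSearchBox` are hypothetical names invented for illustration, and most UI testing frameworks ship their own waiting mechanism that you should prefer where it exists.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError if it never appears.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


class FakeSearchBox:
    """Hypothetical stand-in for a UI whose results settle after a debounce."""

    def __init__(self, settle_after):
        self._ready_at = time.monotonic() + settle_after

    def results(self):
        return ["result"] if time.monotonic() >= self._ready_at else []


box = FakeSearchBox(settle_after=0.2)
# Wait on the state the assertion needs, instead of sleeping a fixed interval.
results = wait_until(box.results, timeout=2.0)
assert results == ["result"]
```

The difference from a fixed sleep is that the wait ends as soon as the application is ready and fails loudly, with a timeout, when it never gets there.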

Quarantine, with a name on it

The single highest-leverage practice is a public quarantine list: tests that are excluded from the release verdict, each with a documented reason, an owner, and a fix in progress. Quarantine is not deletion — quarantined tests keep running, and their pass history tells you when the fix worked.
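A quarantine list can be as simple as a typed record plus one rule for computing the release verdict. The following is a sketch, not a description of any particular tool: `QuarantineEntry`, `release_verdict`, the field names, and the sample test IDs are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuarantineEntry:
    test_id: str
    reason: str
    owner: str
    fix_eta: str  # free-form, for example an ISO date

    def __post_init__(self):
        # An entry without a reason or an owner is a silent skip, not a quarantine.
        if not self.reason or not self.owner:
            raise ValueError(f"{self.test_id}: quarantine needs a reason and an owner")


def release_verdict(results, quarantine):
    """Compute the verdict from `results` (test_id -> passed).

    Quarantined tests still run and still appear in `results`;
    their failures simply do not block the release.
    """
    held = {entry.test_id for entry in quarantine}
    blocking = sorted(
        test_id
        for test_id, passed in results.items()
        if not passed and test_id not in held
    )
    return ("red" if blocking else "green", blocking)


quarantine = [
    QuarantineEntry(
        test_id="test_search_latency",
        reason="search index rebuild overlaps the run",
        owner="dana",
        fix_eta="2026-06-09",
    )
]
results = {"test_login": True, "test_search_latency": False}
print(release_verdict(results, quarantine))
```

Rejecting entries that lack a reason or an owner is the design choice that matters here: it keeps the list from turning into a quiet skip list.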

What makes quarantine work is visibility. If flaky tests silently disappear from the suite, coverage erodes without anyone deciding it should. If they silently keep failing, the dashboard lies. A named quarantine section in the daily report — "these two tests are held, here's why, here's the fix ETA" — is the difference between a green dashboard and a trustworthy one.
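As a rough illustration of what a named quarantine section might look like in a daily report, the sketch below renders held tests alongside their recent pass history. The function name `render_quarantine_section`, the dictionary keys, and the sample data are hypothetical; the layout is one plausible choice, not a prescribed format.

```python
def render_quarantine_section(entries, history):
    """Render the quarantine section of a plain-text report.

    `entries` is a list of dicts with test_id, reason, owner, and fix_eta.
    `history` maps test_id to a list of recent results (True means passed).
    """
    lines = [f"QUARANTINED ({len(entries)} held out of the release verdict)"]
    for entry in entries:
        runs = history.get(entry["test_id"], [])
        lines.append(
            f"- {entry['test_id']}: {entry['reason']}"
            f" | owner: {entry['owner']}"
            f" | fix ETA: {entry['fix_eta']}"
            f" | last {len(runs)} runs: {sum(runs)} passed"
        )
    return "\n".join(lines)


entries = [
    {
        "test_id": "test_search_latency",
        "reason": "search index rebuild overlaps the run",
        "owner": "dana",
        "fix_eta": "2026-06-09",
    }
]
history = {"test_search_latency": [False, False, True, True, True]}
report = render_quarantine_section(entries, history)
print(report)
```

Showing the recent pass count next to each entry lets a reader see at a glance whether a fix is taking hold, without opening the test's full history.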

At QAShift, every morning report carries that quarantine section, and a human engineer owns each entry. Customers tell us it is the part of the report that built their trust fastest — because it is the part most vendors hide.
