Code Health Scorecard: Leadership Always Knows Code Quality

Published: 15 Jun 2026

Author: Akash Shakya

When the velocity chart hides a decaying codebase

A team-as-a-service engagement is in its eighteenth month. The burndown looks clean. Standups are short. Release notes go out fortnightly. From the leadership dashboard, the team is healthy.

Then a senior engineer resigns. The replacement takes five weeks to ship anything meaningful. A library the squad has been quietly pinning has a critical CVE and there is no clean upgrade path. A one-line change to a feature breaks two unrelated workflows. None of this surfaced in the weekly status report, because that report measures throughput, not the substrate underneath it.

This is the failure pattern this article addresses. Engineering leadership rarely loses control of code quality through a single bad decision. It loses control through the slow accumulation of choices that nobody is scoring. A codebase decays the way a building does: invisibly, until something falls.

The Code Health Scorecard is the instrument that prevents this drift. Scored across five dimensions every sprint, it gives engineering leads, CTOs, and non-technical stakeholders one honest view of the codebase. Same view for the people writing the code and the people paying for it. No vanity charts. No reassuring story dressed up as a metric.

It belongs inside Built to Last™ 2.0, the engineering framework we apply to every engagement at EB Pearls. The scorecard sits in the Right Code pillar alongside the technical debt register, peer review framework, and CI/CD pipeline. It is most valuable in long-running engagements — particularly the team-as-a-service model, where engineers rotate and the codebase has to keep its shape through those rotations.

What it costs when nobody is scoring it

Without a scorecard, you do not notice decay. You notice the consequences.

The first consequence is replacement cost. A new engineer who should be productive in three days takes three weeks because nothing is documented, the dependency tree has surprises, and tests cover only the happy paths. In an embedded squad model where engineers rotate, this is not a one-off; it is a tax paid every time the composition changes. Across an 18-month engagement, the cumulative cost is significant — quietly absorbed into project budgets that nobody connects back to documentation discipline.

The second consequence is fragility. A codebase nobody is scoring accumulates dependency risk that compounds. Packages drift behind security patches. Transitive dependencies introduce CVEs the squad does not know about. When something breaks in production, the team is reverse-engineering the cause from logs while customers wait.

The third consequence is the loss of the most valuable asset in any engagement: trust. Once leadership has been surprised by a hidden problem — a rewrite that was not on the roadmap, an incident traced to an unmaintained module, an onboarding that took two months — every future status update is read sceptically. The squad now spends cycles defending the work instead of doing it.

Stack Overflow's Developer Survey has reported consistently that working with poor documentation and unclear codebases is among the friction points working developers cite most often. The awareness exists in most teams. What is usually missing is the instrument that converts that awareness into something leadership can act on before the consequence arrives.

What the scorecard actually is

The Code Health Scorecard is a single sheet, scored every sprint, that reports five dimensions of the codebase. The same view goes to the engineering lead, the squad, and the client stakeholder. No filtered version. No "leadership cut" that hides the awkward numbers.

The five dimensions are deliberately chosen. They are not the only things you could measure — they are the ones that, in our experience across long-running engagements, actually predict whether a codebase will last.

Test coverage. What percentage of production code is exercised by automated tests, broken out by critical path versus utility code. Coverage alone is a vanity metric: you can hit a high number by testing trivial getters. The scorecard reports coverage of the modules that matter: payments, authentication, integration surfaces, the parts of the system where a regression costs real money. Our internal benchmark in Built to Last 2.0 (BTL 2.0) engagements is 85% or higher coverage on critical-path code, achieved through AI-generated tests with engineer review rather than manual writing.
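As a rough illustration of splitting coverage by critical path versus utility code, the sketch below aggregates per-module line counts into two buckets. The module paths, the line counts, and the `CRITICAL_PREFIXES` list are hypothetical; in practice the counts would come from whatever your test runner already reports.

```python
from dataclasses import dataclass

# Hypothetical: path prefixes the squad has agreed count as critical-path code.
CRITICAL_PREFIXES = ("payments/", "auth/", "integrations/")


@dataclass
class ModuleCoverage:
    path: str
    covered_lines: int
    total_lines: int


def coverage_by_bucket(modules):
    """Return coverage percentages for critical and utility code, weighted by line count."""
    totals = {"critical": [0, 0], "utility": [0, 0]}
    for m in modules:
        bucket = "critical" if m.path.startswith(CRITICAL_PREFIXES) else "utility"
        totals[bucket][0] += m.covered_lines
        totals[bucket][1] += m.total_lines
    return {
        bucket: (100.0 * covered / total if total else 0.0)
        for bucket, (covered, total) in totals.items()
    }


# Illustrative numbers only.
modules = [
    ModuleCoverage("payments/charge.py", 180, 200),
    ModuleCoverage("auth/session.py", 80, 100),
    ModuleCoverage("utils/strings.py", 30, 100),
]
report = coverage_by_bucket(modules)
```

Reporting the two buckets separately keeps a well-tested utility layer from masking a thinly tested payments module, or the reverse.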

Documentation completeness. Specifically: can a new engineer onboard from the documentation alone, without needing someone to walk them through it? Measured by recency, coverage of major modules, and the actual onboarding times observed when new engineers join. Documentation that nobody reads scores low even if it exists.

Technical debt volume. Drawn from the Technical Debt Register, which is its own discipline. The scorecard reports the total count of recorded debt items, the sum of estimated remediation cost, the number rated high-impact, and how many items have been paid down versus added this period. Tracked debt is a tool. Untracked debt is what eventually triggers the rewrite.
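The four debt figures can be derived mechanically once the register is structured data. The following is a minimal sketch, assuming each register entry records an estimate, an impact flag, and the sprints in which it was opened and closed; the `DebtItem` fields and the sample entries are hypothetical, not the format of any particular register.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DebtItem:
    title: str
    remediation_days: float               # estimated cost to fix
    high_impact: bool
    opened_sprint: int
    closed_sprint: Optional[int] = None   # None while the item is still open


def debt_summary(register, sprint):
    """Summarise the register as it stands at the end of the given sprint."""
    open_items = [
        d for d in register
        if d.opened_sprint <= sprint
        and (d.closed_sprint is None or d.closed_sprint > sprint)
    ]
    return {
        "open_count": len(open_items),
        "remediation_days": sum(d.remediation_days for d in open_items),
        "high_impact": sum(d.high_impact for d in open_items),
        "added": sum(d.opened_sprint == sprint for d in register),
        "paid_down": sum(d.closed_sprint == sprint for d in register),
    }


# Illustrative entries only.
register = [
    DebtItem("Pinned payment SDK", 5, True, opened_sprint=3),
    DebtItem("No retry on webhook", 2, False, opened_sprint=7, closed_sprint=9),
    DebtItem("Legacy auth middleware", 8, True, opened_sprint=9),
]
summary = debt_summary(register, sprint=9)
```

Here one item was added and one paid down in sprint 9, so the count is flat while the estimated remediation cost has risen, which is the kind of movement the scorecard is meant to expose.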

Dependency risk. Count of out-of-date direct dependencies, count of known CVEs in the dependency tree, and the oldest unsupported dependency in the system. The NIST National Vulnerability Database is a widely used reference for CVE data; scanners draw on it, or on advisory databases that track the same CVE identifiers, on every build. A green dependency dimension means there are no surprises waiting to be discovered.
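The three dependency figures can be sketched as a small aggregation over scanner output. This assumes the scanner results have already been normalised into one record per package; the `Dependency` fields and package names are hypothetical, and "oldest unsupported" is interpreted here as the package whose support ended earliest.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Dependency:
    name: str
    direct: bool                          # False for transitive dependencies
    installed: str
    latest: str
    known_cves: int
    support_ended: Optional[date] = None  # None while still supported


def dependency_risk(deps):
    """Return the three figures reported on the dependency risk dimension."""
    unsupported = [d for d in deps if d.support_ended is not None]
    oldest = min(unsupported, key=lambda d: d.support_ended, default=None)
    return {
        "outdated_direct": sum(d.direct and d.installed != d.latest for d in deps),
        "known_cves": sum(d.known_cves for d in deps),
        "oldest_unsupported": oldest.name if oldest else None,
    }


# Illustrative packages only.
deps = [
    Dependency("payment-sdk", True, "2.1.0", "4.0.2", known_cves=2),
    Dependency("http-client", True, "5.3.1", "5.3.1", known_cves=0),
    Dependency("xml-parser", False, "1.0.4", "1.2.0", known_cves=1,
               support_ended=date(2024, 6, 30)),
    Dependency("legacy-orm", True, "0.9.0", "0.9.0", known_cves=0,
               support_ended=date(2022, 1, 15)),
]
risk = dependency_risk(deps)
```

Note that the CVE count covers the whole tree, transitive packages included, while the out-of-date count is limited to direct dependencies, matching the definition above.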

Onboarding friction. The most leadership-relevant of the five. Measured by the time from a new squad member's first commit access to their first merged production change, and the time to their first solo on-call shift. This is the dimension that surfaces decay nothing else does. A codebase whose tests, debt, and dependencies look fine but whose onboarding has crept from four days to fifteen is a codebase that has accumulated tribal knowledge — the most expensive kind of debt, because it lives in heads and walks out of buildings.
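The first of the two onboarding measures reduces to date arithmetic. The sketch below assumes you record, for each new squad member, the date commit access was granted and the date of their first merged production change; the dates are invented for illustration, and calendar days are used rather than working days purely for simplicity.

```python
from datetime import date


def days_to_first_merge(access_granted: date, first_merge: date) -> int:
    """Calendar days from commit access to first merged production change."""
    return (first_merge - access_granted).days


def onboarding_drift(onboardings):
    """Return (baseline_days, latest_days, drift_days), oldest onboarding first."""
    durations = [days_to_first_merge(a, m) for a, m in onboardings]
    return durations[0], durations[-1], durations[-1] - durations[0]


# Illustrative onboarding events, oldest first: (access granted, first merge).
onboardings = [
    (date(2025, 3, 3), date(2025, 3, 7)),
    (date(2025, 9, 1), date(2025, 9, 10)),
    (date(2026, 2, 2), date(2026, 2, 17)),
]
baseline, latest, drift = onboarding_drift(onboardings)
```

Time to first solo on-call shift can be tracked the same way with a second pair of dates per engineer.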

Each dimension is scored 1–5 against published criteria. The five scores combine into a single headline number, but the individual scores are always visible alongside it. Leadership reads the headline; engineering acts on the components.
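One way to combine the five scores is sketched below. It assumes the headline is a plain unweighted sum out of 25; the article does not state the combining rule, so treat the sum, the dimension keys, and the sample scores as illustrative assumptions.

```python
DIMENSIONS = (
    "test_coverage",
    "documentation",
    "technical_debt",
    "dependency_risk",
    "onboarding_friction",
)


def headline_score(scores):
    """Sum five 1-5 dimension scores, refusing to hide a missing or invalid one."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"ungraded dimensions: {', '.join(missing)}")
    for d in DIMENSIONS:
        if not 1 <= scores[d] <= 5:
            raise ValueError(f"{d} must be between 1 and 5, got {scores[d]}")
    return sum(scores[d] for d in DIMENSIONS)


# Illustrative scores only.
scores = {
    "test_coverage": 4,
    "documentation": 3,
    "technical_debt": 4,
    "dependency_risk": 2,
    "onboarding_friction": 3,
}
headline = headline_score(scores)
```

Raising an error on a missing dimension, rather than silently summing four scores, keeps the headline comparable from sprint to sprint.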

The scorecard is reviewed in the sprint demo. Trends matter more than absolutes — a score moving from 18 to 16 over four sprints is a more important signal than a one-time 22. The squad lead presents the trend, names what is behind any movement, and proposes the work needed to correct it. That work goes into the next sprint backlog like any other commitment.
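Reading the trend rather than the latest value can be as simple as comparing the ends of a sliding window. This is a minimal sketch, assuming a list of headline scores ordered oldest to newest; the four-sprint window and the sample history are illustrative choices, not fixed rules.

```python
def trend(history, window=4):
    """Return (label, net_change) over the most recent `window` sprints."""
    recent = history[-window:]
    if len(recent) < 2:
        return "insufficient data", 0
    delta = recent[-1] - recent[0]
    if delta < 0:
        label = "declining"
    elif delta > 0:
        label = "improving"
    else:
        label = "flat"
    return label, delta


# Illustrative headline scores, oldest sprint first.
history = [22, 18, 18, 17, 16]
label, change = trend(history)
```

The one-time 22 falls outside the window, so the result reflects the steady slide from 18 to 16 that the squad lead would need to explain at the demo.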

What the scorecard is not: a performance review of individual engineers, a basis for billing, or a marketing artefact. It exists so that one question — is the codebase we are paying for healthier or sicker this sprint than last? — has an honest answer in front of the people who need it.

How to put one in place

The scorecard works best when it is introduced in week one of an engagement, but most teams come to it mid-flight. The implementation pattern is the same either way.

In the first week, identify the five dimensions in your context. Test coverage, documentation completeness, technical debt volume, and dependency risk apply universally. Onboarding friction needs a baseline — if you have recently onboarded a new engineer, the measured time is your starting point; if not, the next onboarding event captures it. Do not wait for perfect data; start with whatever exists and refine over the next two sprints.

In the first sprint, set up the data sources. Coverage from your existing test runner. Dependency risk from a scanner in CI — Snyk, Dependabot, npm audit, OWASP Dependency-Check; the tool matters less than that it runs on every commit, and a DevOps-led pipeline is where this discipline lives. Technical debt count from the register. Documentation completeness is the dimension most teams have to set up from scratch — agree the criteria and assign one engineer to maintain the assessment fortnightly.

In the first month, present the first scorecard in a sprint demo. Walk through each dimension. Name what is behind each score. Set targets for the next sprint and put the work to reach them in the backlog. The first scorecard is often uncomfortable — most engagements that did not have one have at least two dimensions in the red. That discomfort is the point. Better to see it now than in month nine when leadership starts losing patience.

What to avoid: tying the scorecard to individual engineer performance, hiding bad scores from clients, or letting dimensions go ungraded without a stated reason. The scorecard's value comes from honesty and continuity. A scorecard you only run when things look good is worse than no scorecard at all — it teaches leadership that the instrument cannot be trusted.

The scorecard depends on other Right Code components to work properly. The Technical Debt Register has to exist. The CI/CD pipeline has to run dependency scanning on every commit. A Developer Onboarding Guide needs to exist, or the onboarding friction dimension is just measuring chaos. If any of these are missing, implementing them is the prerequisite, not the scorecard itself. This is why the scorecard fits naturally inside the way we deliver custom software — the prerequisites are already on the board from sprint one.

A composite engagement: when the velocity chart looked fine

Consider an Australian fintech we worked with, around 18 months into an embedded squad relationship — three engineers and a squad lead. Roadmap delivery had been steady. The fortnightly status reports were uneventful. Velocity charts trended within a stable band.

When the squad first stood up the Code Health Scorecard, two of the five dimensions came back red.

The first was dependency risk. Three packages in the production dependency tree had high-severity CVEs. The team had been pinning at older versions because an earlier upgrade had broken a payment flow eight months prior, and nobody had revisited the lockfile since. The CVEs had appeared in the intervening months. None of the standard sprint reporting surfaced this — patch upgrades are not features, and they do not appear on burndowns.

The second was onboarding friction. The most recent squad rotation had taken thirteen days for the new engineer to ship a first solo change, up from a baseline of four. Nobody had connected that drift to anything actionable. With the scorecard in place, the cause became visible: the documentation for the integrations module had not been touched since the original handover; the only person who knew it intimately had left two rotations earlier.

The remediation was unglamorous. Two sprints of dependency upgrades and integration test work. One sprint on integrations documentation, including a paired walkthrough recorded for future onboarding. The velocity chart looked slightly worse during the remediation period — fewer features shipped. The scorecard trend looked dramatically better, and leadership had something honest to look at while feature velocity dipped. That made the conversation about the dip a productive one.

The point is not that the scorecard fixed anything. It made fixable things visible.

When it matters most, and when it can wait

The Code Health Scorecard earns its overhead in long-running engagements. The longer the codebase will live and the more the squad composition will change, the more value the scorecard adds. A 12-month embedded squad with planned rotations needs it. An 18-month custom build with a stable team benefits from it. A two-month MVP build with a single engineer probably does not — the data points are too sparse, and the scorecard becomes process for its own sake.

It also matters more when leadership is at a distance from the code. A CTO who reads code daily already has the signal. For a non-technical founder, or a client stakeholder whose only window into the codebase is what the vendor shows them, the scorecard is the only honest window available.

You can defer it if the engagement is genuinely short, if the codebase is fully owned by a single engineer who will hand it off in one piece, or if a planned rewrite is six months away. Anything longer or more distributed than that, and the cost of not having it compounds quietly.

What to do next

Pick one dimension and start scoring it this sprint. Test coverage is the easiest first one — the data already exists in your CI output, and it forces the question of which code matters most to cover. The other four dimensions follow naturally once the team is in the habit of scoring something.

If you want to see how the scorecard sits alongside the other Right Code disciplines in an embedded engagement, our team-as-a-service engagement model shows how the framework operates across a long-running squad. For the broader rhythm the scorecard plugs into, the project delivery framework overview is the next read, and the Nuvei / Till Payments engagement is a longstanding example of the disciplines applied in practice.

Frequently Asked Questions

How do we measure code health in practice?

Score the codebase every sprint against five dimensions: test coverage on critical-path code, documentation completeness, technical debt volume, dependency risk, and onboarding friction. Each dimension scores 1–5 against published criteria. Combine into a single headline number, but always show the individual scores alongside it. The combined number lets leadership track the trend; the components tell engineering where to act.

What should we report to leadership about code health?

Report the scorecard itself, unfiltered, every sprint. The five dimensions, the trend over the last four to six sprints, and the work scheduled in the current sprint to address any dimension in the red. Avoid the temptation to summarise or "translate" — leadership is more capable of reading five numbers than most engineering teams give them credit for, and the summarised version is where dishonesty creeps in.

What counts as a vanity metric here?

Raw lines of code shipped. Story points completed per sprint. Total test count without coverage context. Build success rate without a corresponding failure-cause breakdown. These move with effort but do not predict whether the codebase will hold up. The dimensions the scorecard includes were chosen because they predict the failure modes that arrive in month 18 — fragility, drift, onboarding cost.

How often should we score?

Every sprint, at the demo. More often than that is process overhead; less often loses the early-warning signal. Sprint cadence also lets the scorecard plug into existing rituals — the trend gets reviewed in front of the same people who see the working software, and the remediation work goes into the next sprint backlog without needing a separate event.

Who owns the scorecard?

The squad lead is accountable for producing it and presenting the trend. The engineers are accountable for the underlying scores: doing the work on each dimension and reporting it honestly. Leadership is accountable for reading the scorecard, acting on what it says (including approving remediation work that displaces features), and not penalising the squad for reporting red dimensions honestly. If any of those three accountabilities slip, the scorecard stops being useful within two or three sprints.

Will scoring code health slow our delivery?

Some, at first. The first scorecard usually surfaces work that needs scheduling, and that work displaces feature work in the short term. In our experience across long-running engagements, the time recovered later — in faster onboarding, fewer production surprises, less rework — is consistently greater than the time spent on remediation. The question to ask is whether you would rather pay the cost in known sprints now or in surprise sprints later.

Akash Shakya, Chief Operating Officer and Co-Founder, EB Pearls