Most teams discover their performance ceiling on the day they need it not to exist. The marketing campaign drops, the email goes out, traffic climbs through the first multiple of normal, then the second, and somewhere around the third the database connection pool exhausts and the checkout page starts returning errors. The engineering team gets on a call to diagnose; by the time the fix lands, the revenue window the campaign was meant to capture has already closed.
Performance and load testing — done seriously — is the discipline that closes this gap before users open it. It models the load you expect, runs the system against escalating multiples of that load, captures the breaking points, and feeds the results into a capacity plan and a pre-launch readiness gate. Not "we ran a quick test in staging the day before launch", but a documented exercise that the Production Readiness Review™ depends on and the Production Readiness Score™ reflects.
This is not a story about engineers who didn't care. It is a story about a discipline that quietly stops being a pre-launch deliverable and becomes a post-launch firefight. Load testing is one of those activities that everyone agrees is important, that most teams plan to do "before launch", and that gets squeezed the moment the calendar tightens. The result is a launch where the peak load is unknown and the system's first encounter with production traffic is also its first encounter with production failure conditions.
The discipline sits inside the Built to Last™ 2.0 framework's P02 pillar, The Right Infrastructure, alongside observability and pre-launch readiness checks. All three answer the same underlying question: do we find out about production problems before customers do, or after?
Why missing this hurts more than it looks like it should
The cost of skipping load testing is rarely a single dramatic event. It is a series of smaller costs that compound. The first is revenue lost during peak events — promotional campaigns, product launches, seasonal traffic, press coverage. These are exactly the moments an unprepared system buckles. The conversion you spent six months building toward arrives, and the checkout page returns errors.
The second is reputational. A site that fails under load gets remembered. Customers who hit an error during checkout are less likely to return, and the trust hit is asymmetric: it is lost in one failed purchase and rebuilt slowly. For B2B products the same dynamic plays out as escalations, account-team scrambles, and renewal risk.
The third is the engineering cost of tuning under pressure. Performance work done before launch can be deliberate: a profile here, an index there, a connection pool resized, a query rewritten. Done while production is on fire, it is the same engineering with less time, and the cost of the outage compounds while it happens. Fixes that would have cost half a sprint pre-launch cost a weekend of incident response and a Monday-morning post-mortem.
There is a fourth cost, hardest to quantify: decisions made on missing information. Without baseline performance numbers and known breaking points, capacity planning becomes guesswork. Infrastructure spend grows to reassure the team rather than to match modelled demand. Performance testing is not only about avoiding outages; it gives the team the numbers it needs to make the next twelve months of decisions sanely.
What performance and load testing actually covers
Performance and load testing is the structured assessment of how a system behaves under specified levels of demand. The deliverable is not a green tick. It is a set of curves, thresholds, and breaking points, captured against an agreed model of expected use, that the team can compare future system changes against. Four activities sit inside it, and skipping any one of them leaves a gap:
Performance baselining
Load modelling
Scalability testing
Capacity planning
Performance testing without capacity planning produces interesting charts. Capacity planning without performance testing produces unfunded promises. The two only earn their cost when they are paired and fed into the Production Readiness Review as required inputs, not optional ones.
Who is in the room
A load testing exercise that runs well has at minimum the engineer who owns the system under test, the engineer who owns the infrastructure it runs on, and someone responsible for the business case behind the expected load — typically the product owner or growth lead, because they hold the assumption about what "10x" actually means in user-journey terms. For systems touching payments, regulated data, or third-party APIs, the relevant integration owner needs to be in the room as well. Performance issues at scale are often integration issues at scale, and third-party rate limits are usually the first ceiling a team discovers.
What gets documented
A complete load test artefact contains the load model used, the test scenarios run, the baseline numbers, the breaking points observed, the failure modes at each breaking point, and the capacity plan agreed against the results. This document lives in the same repository as the Architecture Decision Records and the runbook library — somewhere any on-call engineer can find it during an incident. The next person who runs the test in twelve months needs to inherit the model, not rebuild it.
Failure modes even when the test is present
Load testing fails most often in three ways. The first is testing against an unrealistic model: synthetic traffic that hits the cheapest endpoints and misses the expensive user journeys, leading to a confident but unfounded conclusion that the system scales. The second is testing in a non-representative environment: a staging instance with a tenth of production's data, none of its third-party integrations, and a CDN configuration that quietly absorbs half the load. The third is testing once. A baseline established in week six of a build is interesting; a baseline that is never re-run before launch, and not maintained afterwards, is decorative. Each of these failure modes is preventable. None of them is uncommon.
A concrete example
Picture an eCommerce site preparing for a promotional event. Normal traffic is around 5,000 sessions per hour. The campaign brief projects 50,000 at peak, ten times normal, concentrated in a two-hour window. The load model captures the expected distribution: 70% homepage, 20% product detail, 8% cart, 2% checkout. The team runs against today's load, then 2x, 5x, 8x, and 10x. At 5x, response times climb. At 8x, the database connection pool saturates. At 10x, the checkout API begins timing out. The team is now armed with three numbers: the safe peak, the degraded peak, and the failure point. They resize the pool, add a circuit breaker around the third-party payment provider, tune the cache, and re-run the test to confirm the ceiling has moved. Traffic peaks at around 6x normal on the day, well below the now-known ceiling. That is the deliverable.
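The arithmetic behind that example can be sketched in a few lines. The figures (5,000 sessions per hour, the 70/20/8/2 split, and the load multiples) come from the example; the names are hypothetical, and treating the split as the share of sessions per page type is an assumption made for illustration.

```python
# Illustrative arithmetic only. The numbers are taken from the example;
# the constant and function names are hypothetical.
BASELINE_SESSIONS_PER_HOUR = 5_000
TRAFFIC_MIX = {
    "homepage": 0.70,
    "product_detail": 0.20,
    "cart": 0.08,
    "checkout": 0.02,
}
MULTIPLES = [1, 2, 5, 8, 10]


def sessions_per_hour(multiple: int) -> dict[str, int]:
    """Split the hourly session count at a given load multiple across page types."""
    total = BASELINE_SESSIONS_PER_HOUR * multiple
    return {page: round(total * share) for page, share in TRAFFIC_MIX.items()}


if __name__ == "__main__":
    for multiple in MULTIPLES:
        print(f"{multiple}x", sessions_per_hour(multiple))
```

Under these assumptions the checkout journey goes from about 100 sessions per hour at normal load to about 1,000 at 10x, which is the volume the test has to generate against the most expensive path.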
How to make load testing a pre-launch habit
The implementation path is more about discipline than technology. The tooling space — k6, Locust, Gatling, JMeter, the cloud providers' own load generation services — is well-served and stable. The harder problem is making the test a non-negotiable part of pre-launch, owned, scheduled, and resourced like any other engineering deliverable.
Start with the load model, not the tool. Pick the five to ten user journeys that account for the majority of traffic and revenue. Document them, with realistic input data. Get the product owner to sign off that this is what production will look like. Everything that follows rests on this; a sophisticated tool running an unrealistic scenario is worse than a simple tool running a realistic one.
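One way to keep the model honest is to hold it as data and check it before any tool runs. The sketch below is a minimal illustration, not a prescribed format: the `Journey` fields, the five-to-ten rule, and the "majority of traffic" check are assumptions drawn from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Journey:
    name: str             # hypothetical label, e.g. "browse_to_checkout"
    traffic_share: float  # fraction of all sessions that follow this journey
    signed_off: bool      # product owner has confirmed it reflects production


def validate_load_model(journeys: list[Journey]) -> list[str]:
    """Return a list of problems; an empty list means the model is usable."""
    problems = []
    if not 5 <= len(journeys) <= 10:
        problems.append(f"expected 5-10 journeys, got {len(journeys)}")
    covered = sum(j.traffic_share for j in journeys)
    # The journeys should cover the majority of traffic, and never more than all of it.
    if not 0.5 < covered <= 1.0 + 1e-9:
        problems.append(f"journeys cover {covered:.0%} of traffic")
    problems += [
        f"{j.name} lacks product-owner sign-off" for j in journeys if not j.signed_off
    ]
    return problems
```

An empty result means the model meets these assumed rules; it says nothing about whether the journeys themselves are realistic, which is still the product owner's call.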
Pick a target environment that genuinely mirrors production: same data volume, same third-party integrations or representative stubs, same CDN configuration, same database engine and version. Where a gap is unavoidable — a payment provider's sandbox, for example — document the deviation and the assumption you are making.
Establish a baseline at current production load, then test in escalating multiples: 2x, 5x, 10x. Stop at the breaking point. Note the symptom — was it CPU, memory, database connections, third-party rate limits, internal queue depth, thread pool exhaustion? Each symptom points to a different fix. Each fix needs to be re-tested to confirm the breaking point has actually moved.
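The escalate-and-stop loop can be sketched independently of any particular tool. In this hypothetical example, `measure` stands in for whatever actually drives the load (k6, Locust, or similar) and returns one result per step; the 1% error threshold is an assumed value, not a recommendation.

```python
from typing import Callable, NamedTuple, Optional


class StepResult(NamedTuple):
    multiple: int           # load as a multiple of the baseline
    p95_ms: float           # 95th-percentile response time for the step
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    symptom: Optional[str]  # e.g. "db connections"; None if the step was healthy


def run_step_test(
    measure: Callable[[int], StepResult],
    multiples: tuple[int, ...] = (1, 2, 5, 10),
    max_error_rate: float = 0.01,  # assumed error budget
) -> tuple[list[StepResult], Optional[StepResult]]:
    """Run steps in ascending order and stop at the first one that breaks the error budget."""
    results = []
    for multiple in sorted(multiples):
        result = measure(multiple)
        results.append(result)
        if result.error_rate > max_error_rate:
            return results, result  # breaking point found; do not escalate further
    return results, None


def fake_measure(multiple: int) -> StepResult:
    """Stand-in for a real load run; the numbers are invented for the demo."""
    if multiple >= 10:
        return StepResult(multiple, 4000.0, 0.20, "db connections")
    return StepResult(multiple, 120.0 * multiple, 0.0, None)


if __name__ == "__main__":
    results, breaking = run_step_test(fake_measure)
    for r in results:
        print(r)
    print("breaking point:", breaking)
```

Recording the symptom alongside the numbers is the part that matters: after a fix, re-running the same steps shows whether the breaking point moved or simply changed its symptom.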
Wire the result into the Production Readiness Review. The review treats the load test as a required input, not an optional one. A system that has not been tested against its expected peak does not pass review, and the Production Readiness Score reflects that. The score is a gate, not an aspiration.
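As a gate, the rule reduces to a single check. How the Production Readiness Review actually scores this is not described here, so the function below is a hypothetical sketch: it assumes the review knows the expected peak as a multiple of normal load and the highest multiple the system held without breaching its thresholds.

```python
from typing import Optional


def passes_load_gate(expected_peak: float, highest_healthy: Optional[float]) -> bool:
    """True only if a load test exists and the system held at or above the expected peak.

    Both arguments are multiples of normal load. `highest_healthy` is None
    when no load test has been run at all.
    """
    if highest_healthy is None:  # never tested: fail the gate
        return False
    return highest_healthy >= expected_peak


if __name__ == "__main__":
    print(passes_load_gate(expected_peak=10, highest_healthy=None))  # untested system
    print(passes_load_gate(expected_peak=10, highest_healthy=5))     # tested, but short of peak
    print(passes_load_gate(expected_peak=6, highest_healthy=8))      # tested past the peak
```

The untested case fails by design, which is what makes the check a gate: the absence of a result is treated the same as a bad result.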
What to avoid: testing on launch day, testing in production without traffic shaping, testing without an agreed model, and treating one test run as evidence the system will hold. Performance regresses with every release. A load test from six weeks ago is a historical artefact, not a current assessment. Wire the test into the CI/CD pipeline as a scheduled job — daily on the staging build, weekly against a higher-fidelity environment — so the performance curves are maintained rather than re-discovered each quarter. Our custom software delivery approach treats this scheduled scalability testing as a property of the delivery pipeline rather than an event.
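A scheduled job is only useful if something compares its output with the last known-good numbers. The sketch below shows one simple way to do that, assuming each run produces a p95 latency in milliseconds per journey; the 10% tolerance and the journey names are illustrative assumptions.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.10,  # assumed: flag anything more than 10% slower
) -> dict[str, float]:
    """Return journeys whose p95 latency grew by more than `tolerance`, mapped to the growth."""
    regressions = {}
    for journey, base_p95 in baseline.items():
        now = current.get(journey)
        if now is None or base_p95 <= 0:
            continue  # journey not measured in this run, or baseline unusable
        growth = (now - base_p95) / base_p95
        if growth > tolerance:
            regressions[journey] = growth
    return regressions


if __name__ == "__main__":
    baseline = {"checkout": 200.0, "cart": 100.0}
    current = {"checkout": 250.0, "cart": 105.0}
    for journey, growth in find_regressions(baseline, current).items():
        print(f"{journey}: p95 up {growth:.0%}")
```

A pipeline step could fail the build when the returned dictionary is non-empty, so that a regression surfaces in the release that caused it and not in the next quarterly test.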
Implementation does not require every other Built to Last component to be in place, but it works best alongside observability, environment architecture, and a working CI/CD pipeline. Without observability, the test gives outcome data with no causation. Without staging that mirrors production, the numbers don't transfer. Without CI/CD, the test becomes a quarterly event rather than a continuous discipline.
A tale of two launches
An Australian eCommerce business at the Scale stage was preparing its biggest seasonal campaign of the year. Marketing had projected peak traffic at roughly five times the previous year's seasonal peak. The site had been running fine at normal load for months. Load testing was on the pre-launch list but kept getting deprioritised in favour of more visible features. Two days before campaign launch, a quick smoke test was run against a staging environment with a fraction of production's data and none of the live payment integrations. The test passed.
The campaign launched. Traffic climbed past the previous year's peak around the second hour. Somewhere between three and four times normal load, the database connection pool — sized for steady-state traffic months earlier and never revisited — saturated. Checkout requests started timing out. The team spent two hours diagnosing while the campaign's most valuable window slipped by. The fix, once found, was a fifteen-minute configuration change.
Contrast that with a comparable retailer that ran the full discipline. Load model built against expected campaign behaviour, scalability testing in escalating multiples to ten times projected peak, two breaking points identified — the same connection pool issue and a third-party fraud-check API that rate-limited under load. Both fixed before launch. Traffic on the day hit roughly six times normal load, well below the now-known ceiling. Nothing broke. Similar codebases, similar traffic profiles, similar budgets. The discipline of load testing before launch was the only meaningful difference.
When this matters most — and when it can wait
This matters most when you have any combination of three conditions: a known traffic spike ahead of you (campaign, launch, seasonal peak), revenue tightly coupled to availability (eCommerce, payments, anything where downtime is dollars), or scale at which a percentage-point error rate translates into a real number of broken customer journeys. If you check any of those boxes, the discipline is non-negotiable. P02 — The Right Infrastructure — treats load testing as a Production Readiness Review deliverable for a reason. For products built on the custom software and mobile delivery approaches, performance benchmarks are part of the standard launch package.
When can it wait? Early-stage products with small, predictable user bases, internal tools where peak load is bounded by headcount, and pre-product-market-fit builds where the system will be rewritten before traffic matters anyway can defer the full discipline. Even there, a simple baseline is cheap insurance — knowing how the system behaves at current load makes the moment you do need scale a planned event rather than an emergency.
The honest-broker answer: most teams treat load testing as deferrable for longer than they should. The first time it becomes urgent is usually the first time it is too late to do well. Build the baseline early, even if you don't yet need the breaking point, because the baseline is what tells you when the next release has regressed. Our project delivery framework treats this kind of baseline as a standard pre-launch artefact rather than an optional extra.
What to do next
If you have a launch in the next quarter and no load test against a realistic model, book the model-building conversation this week. The model is the prerequisite; everything else follows from it. Get the product owner, the system owner, and the infrastructure owner in the same room for an hour, agree the five most important user journeys and the load profile, and book the test against a representative environment before the launch milestone. For the broader infrastructure discipline this sits inside, our DevOps and infrastructure approach covers how performance testing is wired into the wider pre-launch process.
Discover app development insights and AI trends with Akash Shakya, COO of EB Pearls. Learn how we build successful digital products.