The Cloud Audit Session: Map What You Have Before You Change Anything

Published: 19 Jun 2026
Author: Gorakh Shrestha

A company's engineering team had been battling production outages for months: services crashing under load, response times spiking during business hours, and a growing backlog of reliability-related tickets that never seemed to shrink. The proposed solution, brought to them by a vendor, was a full Kubernetes migration: repackage everything into containers, deploy to a managed cluster, and implement a service mesh for traffic management. The estimate was twelve weeks, a significant budget, and a team that would need to stop feature work to manage the transition.

Before signing, they agreed to run a cloud audit session. The findings landed within a week. The main problems were misconfigured auto-scaling rules that stopped instances from scaling during peak traffic, and missing application-level health checks that left failing instances in the load balancer rotation, serving errors to users. The auto-scaling fix took a day. The health check fix took two hours. Total cost was a fraction of the Kubernetes quote, and the reliability problems stopped. The migration wasn't wrong as a future direction, but it was the wrong response to the immediate problem. Across 900+ projects delivered, we've seen this pattern consistently: teams invest in architectural overhauls when the real issues are configuration gaps that a structured audit would have surfaced in days.

Why Auditing Comes Before Changing

The instinct to fix cloud infrastructure by replacing it is deeply embedded in how engineering teams think. If the servers are unreliable, move to containers. If costs are climbing, switch providers. If deployments are slow, rebuild the pipeline. Each of these responses assumes the problem is the platform. The audit asks whether the problem is actually how the platform is being used.

This distinction matters because cloud infrastructure accumulates configuration decisions over time — decisions made under deadlines, by engineers who have since left, for requirements that have since changed. No single decision is catastrophic. But collectively, they create an environment where the infrastructure's actual state diverges significantly from its intended state. Security groups that were opened temporarily and never closed. Logging configurations that capture everything or nothing. Instance types chosen during initial setup and never revisited as workloads changed.

Changing platforms without understanding this accumulated state doesn't resolve it. It transfers it. The misconfigured security group becomes a misconfigured network policy. The oversized instance becomes an oversized container resource allocation. The missing health check remains missing, just in a different orchestration layer. According to Gartner's research on cloud spending, organisations routinely overspend on cloud services — and a significant portion of that waste traces back to infrastructure that was never audited after initial deployment.

The cloud environment you need to build starts with understanding the cloud environment you actually have. Skip the audit and you carry every undocumented decision, every configuration drift, and every accumulated risk into whatever comes next.

What a Cloud Audit Session Actually Covers

A cloud audit session is a structured assessment of your current cloud infrastructure — its architecture, its configuration, its costs, its security posture, and its operational readiness. It's not a vendor evaluation or a migration plan. It's a map of what exists, what's working, what isn't, and what's quietly becoming a problem.

Infrastructure Inventory and Architecture Review

The first step is establishing what's actually running. Not what the architecture diagram shows — what's live, right now, consuming resources and serving traffic. This means cataloguing every compute instance, database, storage bucket, load balancer, CDN configuration, queue, cache layer, and managed service across every region and every account.

Teams are consistently surprised by what the inventory reveals. Development environments left running from projects that ended months ago. Redundant load balancers from a migration that was never fully completed. Storage buckets accumulating data with no lifecycle policy and no clear owner. Services running in regions that made sense for a previous customer base but not the current one. Each of these represents cost, complexity, and potential risk — and none of them appear in the architecture diagram anyone reviews in meetings.
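
The inventory comparison described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a provider integration: the `Resource` record, its field names, and the 90-day staleness threshold are all hypothetical, and the live list would in practice come from your cloud provider's APIs.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Resource:
    """Hypothetical inventory record; the field names are illustrative."""
    resource_id: str
    kind: str
    region: str
    owner: Optional[str]
    last_activity: date


def find_inventory_gaps(live, declared_ids, today, stale_after_days=90):
    """Sort live resources into undeclared, unowned, and stale buckets.

    live          -- resources actually running
    declared_ids  -- IDs that appear in the architecture diagram or IaC state
    """
    cutoff = today - timedelta(days=stale_after_days)
    gaps = {"undeclared": [], "unowned": [], "stale": []}
    for resource in live:
        if resource.resource_id not in declared_ids:
            gaps["undeclared"].append(resource.resource_id)
        if resource.owner is None:
            gaps["unowned"].append(resource.resource_id)
        if resource.last_activity < cutoff:
            gaps["stale"].append(resource.resource_id)
    return gaps
```

A resource can land in more than one bucket, which is usually the point: a bucket with no owner, no recent activity, and no entry in the diagram is a strong candidate for review.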

Security and Compliance Posture

The audit examines how the infrastructure is configured from a security perspective — not theoretically, but practically. This covers identity and access management (who can access what, and are those permissions still appropriate), network configuration (what's exposed, what's segmented, what's open that shouldn't be), encryption (at rest and in transit, and whether it's actually enabled everywhere it should be), and secrets management (how credentials are stored, rotated, and accessed).
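
As one example of a practical network check, the sketch below flags ingress rules that expose a sensitive port to the whole internet. The rule dictionaries and their keys (`group`, `cidr`, `from_port`, `to_port`) are a hypothetical export format, and the port list is an assumption you would tailor to your own services.

```python
import ipaddress

# Assumption: SSH, RDP, MySQL, and PostgreSQL are the ports we care about.
SENSITIVE_PORTS = {22, 3389, 3306, 5432}


def is_world_open(cidr):
    """True when the CIDR covers every address (0.0.0.0/0 or ::/0)."""
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


def flag_open_ingress(rules):
    """Return the rules that expose a sensitive port to any source address.

    Each rule is a dict with hypothetical keys:
    group, cidr, from_port, to_port.
    """
    flagged = []
    for rule in rules:
        ports = range(rule["from_port"], rule["to_port"] + 1)
        exposes_sensitive = any(port in ports for port in SENSITIVE_PORTS)
        if is_world_open(rule["cidr"]) and exposes_sensitive:
            flagged.append(rule)
    return flagged
```

A world-open rule on port 443 passes this check by design, since public web traffic is expected; the goal is to catch the temporary exceptions that were never closed.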

For teams operating under compliance requirements — and increasingly, that's most teams — the audit maps current configuration against relevant standards. ISO 27001, SOC 2, the Australian Privacy Act, industry-specific regulations. The gap between "we believe we're compliant" and "our configuration demonstrates compliance" is where audit findings concentrate. This is especially critical for software development projects handling sensitive customer data.

Cost Analysis and Resource Optimisation

Cloud costs drift upward through a combination of inattention and accumulation. The audit examines spending patterns across services, identifies resources that are over-provisioned relative to actual usage, flags resources with no utilisation at all, and evaluates whether reserved capacity or commitment discounts align with actual consumption.

The goal isn't to produce a cost-cutting target. It's to produce a clear picture of where money is going, whether that spend is justified by the workload it supports, and where optimisation opportunities exist without affecting performance or reliability. The difference between cost reduction and cost optimisation matters: cutting spend by degrading service is easy and destructive. Reducing spend by eliminating waste is sustainable and typically invisible to end users.
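
A minimal sketch of that picture: group monthly spend by how heavily each resource is actually used. The thresholds (peak CPU below 2% counts as idle, below 40% as over-provisioned) and the record keys are assumptions for illustration; real right-sizing would also look at memory, I/O, and network.

```python
def classify_utilisation(peak_cpu_pct, idle_below=2.0, oversized_below=40.0):
    """Label a resource by its peak CPU utilisation over the review window."""
    if peak_cpu_pct < idle_below:
        return "idle"
    if peak_cpu_pct < oversized_below:
        return "over-provisioned"
    return "ok"


def summarise_spend(resources):
    """Total monthly cost per utilisation class.

    Each resource is a dict with hypothetical keys:
    name, peak_cpu_pct, monthly_cost.
    """
    totals = {"idle": 0.0, "over-provisioned": 0.0, "ok": 0.0}
    for resource in resources:
        label = classify_utilisation(resource["peak_cpu_pct"])
        totals[label] += resource["monthly_cost"]
    return totals
```

The output is a description of where money goes, not a savings target: the "over-provisioned" total still needs a human decision about whether the headroom is deliberate.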

Operational Readiness and Reliability

The audit evaluates whether the infrastructure can handle what's coming — not just what's happening today. This covers auto-scaling configuration (is it actually configured, and does it work under real load patterns), backup and recovery (do backups exist, have they been tested, and can they actually restore to a functional state), monitoring and alerting (what's measured, what triggers alerts, and do those alerts reach someone who can act on them), and disaster recovery (if a region goes down, what happens).

Operational readiness gaps are the most dangerous audit findings because they're invisible until they're catastrophic. An auto-scaling policy that caps at the wrong threshold doesn't cause problems until traffic exceeds that threshold. A backup that's never been tested doesn't reveal its corruption until a restore is needed. Missing health checks don't matter until they do — and then they matter enormously.

How to Run a Cloud Audit

A cloud audit follows a structured process designed to surface findings systematically rather than relying on intuition about where problems might exist. This approach works consistently when integrated into a broader project delivery framework.

Step 1: Establish the scope. Define which accounts, regions, and services are in scope. For a single-product company, this may be everything. For a larger organisation with multiple products and teams, scope the audit to the infrastructure supporting a specific workload or business unit. Trying to audit everything simultaneously produces breadth without depth.

Step 2: Collect the data. Pull configuration data programmatically — cloud provider APIs, infrastructure-as-code state files, cost and billing exports, access logs, and monitoring data. Manual review catches things automation misses, but automation catches things at scale that manual review cannot. Both are necessary.

Step 3: Map the architecture. Produce a current-state architecture diagram based on what the data reveals, not what existing documentation claims. Overlay network connectivity, data flows, and dependency relationships. This diagram becomes the foundation for every finding and recommendation.

Step 4: Assess against benchmarks. Evaluate the mapped architecture against security benchmarks (CIS benchmarks for the relevant cloud provider), cost benchmarks (right-sizing guidelines based on actual utilisation), and operational benchmarks (reliability patterns appropriate for the workload's criticality). Each gap between current state and benchmark becomes a finding.

Step 5: Prioritise and report. Rank findings by risk severity and remediation effort. Critical security exposures and reliability gaps rank highest. Cost optimisation opportunities rank by potential savings relative to implementation effort. The output is a prioritised remediation roadmap — not a tool recommendation, not a migration proposal, but a specific, actionable list of what to fix and in what order.
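
The ranking rule in Step 5 can be sketched as a sort key. This is one possible interpretation, not a standard: it assumes every security or reliability finding outranks every cost finding, and the `Finding` fields and severity labels are hypothetical.

```python
from dataclasses import dataclass

SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Finding:
    """Hypothetical audit finding; field names are illustrative."""
    title: str
    category: str          # "security", "reliability", or "cost"
    severity: str          # key of SEVERITY_RANK
    effort_days: float
    monthly_saving: float = 0.0


def prioritise(findings):
    """Order findings into a remediation roadmap.

    Security and reliability findings come first, by severity and then by
    effort. Cost findings follow, ranked by saving per day of effort.
    """
    def sort_key(finding):
        if finding.category == "cost":
            # Floor the effort so near-zero estimates don't divide by zero.
            saving_per_day = finding.monthly_saving / max(finding.effort_days, 0.25)
            return (1, -saving_per_day, finding.effort_days)
        return (0, SEVERITY_RANK[finding.severity], finding.effort_days)

    return sorted(findings, key=sort_key)
```

Teams with hard budget pressure may prefer to interleave high-value cost findings with lower-severity risk findings; the sort key is the place to encode that choice.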

The Kubernetes Quote That Wasn't Needed

A product team running a customer-facing SaaS application had been experiencing increasing reliability problems over six months. The application ran on a standard cloud setup — compute instances behind a load balancer, a managed database, a caching layer, and a CDN for static assets. As the user base grew, the team noticed response times degrading during peak hours, occasional service unavailability, and a pattern of instances becoming unresponsive under load.

The team engaged a consultancy that recommended a containerised architecture with Kubernetes orchestration. The proposal included refactoring the application for containerisation, building a CI/CD pipeline with Helm charts, provisioning and configuring a managed Kubernetes cluster, migrating the database to a cloud-native option, and implementing comprehensive monitoring across the new stack. Timeline: three months. The team would need to pause feature development to manage the transition.

Before committing, they ran a cloud audit session. The findings were specific and, in retrospect, predictable. First, the auto-scaling configuration had a maximum instance count set during initial deployment that was now well below what peak traffic required. The instances would scale up to the cap, and then additional traffic would degrade performance across all instances. Raising the cap and adjusting the scaling thresholds took one day to implement and test.
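
A check for this kind of finding can be approximated with simple arithmetic: compare the configured cap with the number of instances peak traffic would need. The sketch below is a rough model with assumed inputs (requests per second at peak, sustainable requests per second per instance, and a 70% target utilisation); the team's actual figures are not stated in this article.

```python
def instances_needed(peak_rps, rps_per_instance, target_utilisation_pct=70):
    """Instances required to serve peak traffic at the target utilisation.

    Uses integer ceiling division to avoid float rounding surprises.
    """
    capacity_x100 = rps_per_instance * target_utilisation_pct
    return -(-(peak_rps * 100) // capacity_x100)


def scaling_cap_shortfall(max_instances, peak_rps, rps_per_instance):
    """How many instances short the configured cap is at peak (0 if adequate)."""
    needed = instances_needed(peak_rps, rps_per_instance)
    return max(0, needed - max_instances)
```

With made-up numbers, `scaling_cap_shortfall(20, 4000, 150)` reports that a cap of 20 instances is 19 short of the 39 needed for a 4,000 requests-per-second peak.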

Second, the load balancer's health checks were using a basic TCP connection test rather than an application-level endpoint. Instances where the application process had crashed but the operating system was still running passed the health check and continued receiving traffic — serving errors to every request routed to them. Configuring an HTTP health check against the application's existing status endpoint took two hours.
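
The difference between the two kinds of check is easy to see in code. The sketch below uses only the Python standard library and is an illustration of the idea, not how any particular load balancer implements health checks; the URL passed to `http_check` is whatever status endpoint your application exposes.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_check(host, port, timeout=2.0):
    """Passes if anything accepts a TCP connection on the port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(url, timeout=2.0):
    """Passes only if the application answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts, and 4xx/5xx responses.
        return False
```

A host whose application is returning errors can still pass `tcp_check` as long as something is listening on the port, while `http_check` fails it and lets the load balancer take it out of rotation.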

Third, the caching layer's eviction policy was misconfigured, causing cache misses under load that cascaded into database queries the application wasn't optimised to handle in volume. Correcting the eviction policy and adding appropriate cache headers resolved the database pressure.

Total remediation time: four days. The reliability problems stopped. The team resumed feature development the following week. The Kubernetes migration remained a viable option for future architectural evolution, but as a scaling strategy — not as a reliability fix for problems that had nothing to do with the orchestration layer.

When a Cloud Audit Is Critical — and When It Can Wait

Invest now if you're experiencing reliability issues you can't explain, your cloud costs have been climbing without a corresponding increase in usage, you're preparing for a major architectural change or migration, you haven't reviewed your infrastructure configuration in twelve months or more, or your team has turned over significantly since the infrastructure was originally built. Any of these conditions increases the likelihood that your infrastructure's actual state has diverged from its intended state in ways that are costing you money, creating risk, or both. This is particularly relevant for teams navigating current development trends that demand higher infrastructure reliability.

It can wait if you've recently built and deployed your infrastructure with current best practices, your team has strong visibility into your cloud environment through well-maintained infrastructure-as-code and monitoring, or you're still in the concept-to-launch phase with no production traffic. An audit requires a production environment with history — configuration decisions that have accumulated over time, usage patterns that have evolved, and drift that has had time to develop.

The transition point is clear: the moment your infrastructure becomes complex enough that no single engineer can hold the complete picture in their head. Once you're running across multiple services, multiple environments, and multiple engineers making configuration changes, the cumulative drift begins — and the audit becomes the only reliable way to understand what you actually have before deciding what you need.

What to Do Next

Pull your cloud provider's cost report for the last three months. Identify the five most expensive line items. For each one, write down what workload it supports, whether that workload justifies the spend, and when the configuration was last reviewed. If you can't answer those questions for even one of them — and most teams can't for at least three — you have your starting point.
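
If your cost report exports to CSV, finding the top line items takes only a few lines. The column names used here (`line_item`, `cost`) are assumptions; real billing exports use provider-specific headers that you would substitute.

```python
import csv
import io
from collections import defaultdict


def top_line_items(csv_text, n=5):
    """Sum cost per line item across all rows and return the n largest.

    Assumes hypothetical columns named 'line_item' and 'cost'.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["line_item"]] += float(row["cost"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]
```

Feed it three months of rows and the result is the list to annotate with workload, justification, and last review date.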

The infrastructure you need starts with understanding the infrastructure you have. When you're ready to run a structured cloud audit that identifies what to fix before deciding what to change, talk to our DevOps team. Across 1400+ businesses served, with ISO 9001 and ISO 27001 certification backing our processes, we build cloud infrastructure that's reliable because it was assessed first — not because the architecture was expensive.

Frequently Asked Questions

What is a cloud audit session?

A cloud audit session is a structured assessment of your existing cloud infrastructure — covering architecture, configuration, security posture, cost efficiency, and operational readiness. It produces a detailed map of what you have, identifies gaps between your current state and best practices, and delivers a prioritised remediation plan. The goal is to understand your infrastructure thoroughly before making decisions about changes, migrations, or new tooling. It's the difference between fixing the problems you actually have and rebuilding to avoid problems you've assumed.

How long does a cloud audit take?

For a single-product environment running on one cloud provider, a thorough audit typically takes five to ten days. This includes data collection, architecture mapping, security and cost analysis, and the production of a prioritised findings report. More complex environments (multi-cloud, multi-region, or multiple teams managing separate infrastructure) may require two to three weeks. The investment is modest relative to the cost of the architectural decisions the audit informs. Teams following a structured mobile app development methodology find that auditing the supporting cloud infrastructure is a natural extension of their quality process.

What are the most common cloud audit findings?

The findings are remarkably consistent across organisations. Over-provisioned resources running well below capacity but billing at full rate. Security groups or firewall rules that were opened temporarily and never tightened. Missing or untested backup and recovery procedures. Auto-scaling configurations that don't match actual traffic patterns. Orphaned resources from previous projects still consuming budget. Logging and monitoring gaps where critical services have no alerting configured. According to the AWS Well-Architected Framework, regular infrastructure review against established pillars — operational excellence, security, reliability, performance efficiency, and cost optimisation — is fundamental to maintaining a healthy cloud environment.

Should we audit before or after a cloud migration?

Both — but the pre-migration audit is non-negotiable. Auditing before migration ensures you understand what you're moving, why you're moving it, and which problems will follow you regardless of the destination. Teams that skip the pre-migration audit consistently report that their new environment inherits the same issues their old environment had, just in different packaging. A post-migration audit then validates that the migration achieved its goals and that the new environment is configured correctly from the start.

How do we know if we're overspending on cloud infrastructure?

If you can't explain what every significant line item on your cloud bill supports, you're likely overspending. The audit produces a resource-by-resource analysis that maps spend to workload. Common sources of overspend include instances sized for peak load that could use auto-scaling instead, storage accumulating without lifecycle policies, data transfer costs from architecture decisions that route traffic inefficiently, and development or staging environments running production-grade resources. Most audits identify immediate savings opportunities equivalent to fifteen to thirty per cent of current spend — not through service degradation, but through eliminating genuine waste.

What's the risk of skipping the audit and going straight to re-architecture?

The risk is solving the wrong problem expensively. Re-architecture is a significant investment in time, budget, and engineering attention. Without an audit, the decision about what to re-architect is based on assumptions about where the problems are — and those assumptions are frequently wrong. The team in our opening example was ready to invest months in a Kubernetes migration to solve problems that took days to fix once properly diagnosed. Skipping the audit doesn't save time. It spends time on the wrong things and delays the fixes that would have made the immediate difference.
