Architecting Isolation
Why We Run Staging in Production
The safest environment in the world is an empty one. It has zero noise, infinite resources, and perfect latency. It is also completely useless.
In the lifecycle of a technology startup, there is often a tension between fidelity (how real is the test?) and safety (what happens if it breaks?). Most teams solve this by building a “staging” environment … but that can result in a low-fidelity ghost town, a sterile lab that bears no resemblance to the chaotic and noisy reality of the live “production” environment.
For years, we followed the industry playbook and stuck to absolute separation, with distinct GCP or AWS projects and no connectivity between the two. This looks good on paper: there’s no way for a staging release to affect the production environment, even one undergoing load testing. The environments are configured identically, so as long as our deployment artifacts are immutable we can be confident that seeing it work in staging means it will run the same in production, right?
In reality, it’s a nightmare of duplication. We aren’t just mirroring code or artifacts; we’re also mirroring infrastructure (and operational costs). We pay for Cloud SQL instances that sit idle for 20 hours a day. We spin up redundant VMs, Apache Spark clusters, and Cloud Run deployments that get little to no traffic. We create “prod-like” datasets in BigQuery, leaving a sprawling estate of resources that need to be patched, upgraded, and monitored. And we STILL can’t be sure that the code running in staging will actually work with the amount and type of data we have in production!
So we try to mitigate the mismatch with elaborate workarounds: exporting and anonymising production data on a schedule, tweaking the configuration in ways we deem “safe enough”, and/or implementing “scale-down” rules to save money during off-hours. It’s rarely optimal, and requires significant engineering effort to build and maintain. It can also introduce new vectors for bugs (data sanitization failures, startup race conditions) and ultimately distract the team from shipping features that actually provide customer value.
We’re taking a different approach and architecting isolation: running staging and production in the same Kubernetes cluster and giving staging a clone of production data.
The Submarine Philosophy
Naval architects don’t design submarines assuming the hull will never breach. They design them with bulkheads: watertight compartments that ensure a flood in one section doesn’t drag the entire vessel to the bottom.
In our architecture, the GKE cluster is the hull, providing shared infrastructure (nodes, ingress, monitoring, etc). And namespaces are the bulkheads.
By sharing the hull, we gain operational leverage. We patch one cluster. We monitor one control plane. But by strictly enforcing the bulkhead boundaries, we ensure that a memory leak in a staging experiment cannot “flood” the engine room of our production database. It also unlocks powerful data workflows and operational efficiencies:
- BigQuery Zero-Copy Clones: We can now instantly provision staging datasets that are byte-for-byte identical to production. We can test complex transformations on “real” data using BigQuery’s time-travel and clone features without duplicating storage costs or risking the live dataset.
- Logical Isolation Efficiency: We stop paying for idle infrastructure. Instead of spinning up a dedicated Cloud SQL instance for a staging environment that sees minimal traffic, we use logical isolation, i.e. separate schemas and users within the shared production instance. The same applies to Redis and other managed services.
- The “Shared Service” Reality: We realized we were already relying on shared infrastructure. Our authentication provider (Clerk), our payment gateways, and our email providers don’t offer physically air-gapped instances for every environment. If we trust external vendors to segregate our data logically, we should build the internal competence to do the same.
- Forced Discipline: Finally, this architecture forces us to be better operators. When a bad staging deployment could theoretically impact production, you stop treating configuration as an afterthought. It forces the team to master Resource Quotas, NetworkPolicies, and admission controllers.
This isn’t just a cost-saving measure either, but a strategy for high-fidelity simulation. We want our staging workloads to fight for resources. We want them to experience the same network topology and noisy neighbors as our live services. But to do this without sinking the ship, we rely on logical isolation and bulkheads, not duplicates.
Isolated Neighbours
We treat the staging namespace as a hostile environment. We assume that any pod within it could be compromised or behaving erratically. To handle this, we implement a “Default Deny” posture.
- The Silence Protocol: By default, no pod in the cluster can talk to any other pod.
- The One-Way Mirror: We define NetworkPolicy rules that allow specific internal communication within `staging`, but explicitly block any egress to the `prod` namespace. Staging can look at itself, but it is blind to Production.
This simple manifest is the cornerstone of our bulkhead strategy. It ensures that unless traffic is explicitly allowed, it is forbidden.
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: staging
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```
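Layered on top of the default deny, a second policy can open up intra-namespace traffic only. A sketch of what that might look like (the policy name is illustrative): because every selector here matches only pods inside `staging`, the `prod` namespace remains unreachable.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-intra-staging
  namespace: staging
spec:
  podSelector: {}            # applies to every pod in the staging namespace
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector: {}    # only peers in this same namespace may connect
  egress:
    - to:
        - podSelector: {}    # likewise, egress stays inside staging
```

In practice you would also allow egress to the cluster’s DNS service (kube-dns), since a blanket egress deny would otherwise break service discovery.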
Identity Firewall
The most critical bulkhead is the data layer. In Kubernetes, it’s easy to accidentally mount a “god mode” secret. We avoid this by using Workload Identity to tie permissions to the pod, not the node.
- Production Identity: A pod in `prod` possesses a Google Cloud IAM identity with `Cloud SQL Client` permissions on the live database.
- Staging Identity: A pod in `staging`, running the exact same binary, possesses a different IAM identity. This identity has no permissions for the production instance. Even if a developer hardcodes the production DB connection string, the connection will be rejected at the IAM level.
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: backend-service
  namespace: staging
  annotations:
    # This binds the K8s ServiceAccount to a specific Google Cloud IAM service account
    iam.gke.io/gcp-service-account: staging-sa@my-project.iam.gserviceaccount.com
```
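A workload then opts into this identity via `serviceAccountName` in its pod spec. A minimal sketch (the Deployment name and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
  namespace: staging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      # Pods run as the staging IAM identity, not the node's identity
      serviceAccountName: backend-service
      containers:
        - name: backend
          image: gcr.io/my-project/backend:latest
```

On the Google Cloud side, the Kubernetes ServiceAccount also needs a `roles/iam.workloadIdentityUser` binding on `staging-sa` before the impersonation works.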
Resource Quotas & Priority Classes
In a shared hull, the danger isn’t just security; it’s weight. A runaway process in staging could consume all available CPU, starving production services (i.e. the Noisy Neighbor problem).
- Priority Classes: We assign a high `PriorityClass` to production workloads. If the cluster runs out of resources, the Kubernetes scheduler acts as the captain: it will mercilessly evict (kill) staging pods to free up space for production.
- Resource Quotas: We set hard limits on the `staging` namespace. It is capped at a specific amount of CPU and memory. It cannot balloon beyond its bulkhead, no matter how much load it tries to pull. (We can also implement per-pod limits with a `LimitRange` if required.)
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for production service pods only."
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: staging-quota
  namespace: staging
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```
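Production workloads then opt into the class explicitly. A sketch with illustrative names and sizes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: prod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      # Under resource pressure, the scheduler preempts lower-priority pods
      priorityClassName: production-high-priority
      containers:
        - name: api
          image: gcr.io/my-project/api:latest
          resources:
            requests:
              cpu: "500m"
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
```

One useful side effect of the quota: once a ResourceQuota constrains `requests` and `limits`, the API server rejects any staging pod that fails to declare them, so nothing sneaks in unbounded.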
A Safer Place to Fail … and to Succeed
The biggest killer of developer velocity isn’t bad code; it’s deployment anxiety.
When an engineer isn’t sure if their “next version” experiment might accidentally wipe a production table, they hesitate. They double-check. They ask for permission. This fear creates a culture of gatekeeping.
An architecture of isolation flips this. By codifying strict boundaries (“guardrails”) we stop relying on human vigilance and start relying on platform guarantees. When a developer knows that the network policy physically prevents their pod from talking to the production database, they stop worrying and start shipping.
We don’t build this architecture because we don’t trust our developers. We build it because we do trust them to experiment, to push boundaries, and to move fast. Isolation isn’t about restricting access; it’s about granting the freedom to break things safely.
Architecting isolation allows us to have our cake and eat it too: the high fidelity of a shared production environment with the safety of an air-gapped lab. It turns production from a “scary” destination into a resilient, compartmentalized vessel where innovation can happen without fear.
This article was written with drafting and grammar assistance provided by AI. I take full responsibility for the content, including any and all errors. And I’m human, I think!