Rolling Back a Bad Terraform Apply
Terraform has no rollback command. When an apply goes wrong — a misconfigured security group, a dropped database, a botched provider upgrade — you're on your own. The state file records what is, not what was, so there's no built-in way to rewind to a previous known-good state.
This guide walks through the options teams actually use to recover from a bad apply, explains why each one is fragile, and then looks at how to reduce the need for rollback in the first place.
Why rollback is hard
Terraform is a desired-state system. You declare what you want, and Terraform figures out how to get there. It doesn't maintain a changelog of what it did — it maintains a snapshot of what exists right now. That's great for convergence, but it means:
- There's no undo log to replay in reverse. A reversible refactoring capability has been proposed, but remains unimplemented.
- The state file after a failed apply may be partially updated — some resources changed, others didn't. Terraform doesn't clearly indicate which resources were successfully changed during a failed execution, making manual recovery harder.
- Some changes are inherently irreversible. A destroyed database is gone. A renamed resource has a new identity. Terraform's built-in prevent_destroy lifecycle flag, meant to guard against accidental destruction, has been widely considered broken since 2015 because a plan that would destroy a protected resource fails with an error instead of gracefully skipping the destroy, and the protection disappears entirely if the resource block is removed from the configuration.
With that in mind, here are the recovery strategies teams use.
Strategy 1: Revert the commit and re-apply
The simplest approach: git revert the offending commit, then run terraform apply again.
git revert HEAD
terraform plan
terraform apply
When it works: the bad apply was purely additive — it created new resources that you now want removed. Reverting the code removes the resource declarations, and the next apply destroys them cleanly.
When it doesn't: the bad apply modified or destroyed existing resources. Reverting the code puts the declarations back, but the actual infrastructure has already changed. Terraform will try to reconcile, and the plan may not do what you expect:
- If a resource was modified in-place (e.g. a security group rule was changed), reverting the code and re-applying should restore the original configuration. This usually works, but you're dependent on the provider implementing the update correctly.
- If a resource was destroyed and recreated with a different identity, reverting won't bring back the original. You'll get a new resource with a new ID, which may break references from other systems.
- If a resource was destroyed and the code revert re-declares it, Terraform will try to create it fresh — which may fail if the name or unique identifier is still held by a partially-deleted remnant.
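Because a revert can plan destroys or replacements you did not intend, one option is to save the plan and inspect it before applying. The following is a minimal sketch, assuming jq is installed. The helper name destructive_changes and the file name revert.tfplan are illustrative and not part of Terraform.

```shell
#!/bin/sh
# Sketch: list resources that a saved plan would delete or replace.
# Assumes jq is installed; destructive_changes is a hypothetical helper name.

# Reads `terraform show -json` plan output on stdin and prints the address
# of every resource whose planned actions include "delete".
# Replacements appear as ["delete","create"] or ["create","delete"].
destructive_changes() {
  jq -r '.resource_changes[]?
         | select(.change.actions | index("delete"))
         | .address'
}

# Usage after `git revert HEAD` (not run automatically):
#   terraform plan -out=revert.tfplan
#   terraform show -json revert.tfplan | destructive_changes
#   terraform apply revert.tfplan   # only once the list above looks right
```

Applying the saved plan file, rather than running a fresh apply, means Terraform executes exactly the plan that was reviewed.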
Strategy 2: Restore the state file
If you're using a remote backend with versioning enabled (S3 with versioning, for instance), you can restore a previous version of the state file.
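The version ID has to come from the bucket's version history. The sketch below shows one way to confirm that versioning is on and to list the stored versions with the AWS CLI. The function names are hypothetical, and the bucket and key in the usage lines are the same placeholders used in this guide.

```shell
#!/bin/sh
# Sketch: confirm bucket versioning and list the stored versions of a state
# object. Function names are hypothetical; bucket and key come from the caller.

# Prints "Enabled" when versioning is on for the bucket.
versioning_status() {
  aws s3api get-bucket-versioning \
    --bucket "$1" \
    --query 'Status' \
    --output text
}

# Prints one line per stored version: version ID, timestamp, latest flag.
list_state_versions() {
  aws s3api list-object-versions \
    --bucket "$1" \
    --prefix "$2" \
    --query 'Versions[].[VersionId,LastModified,IsLatest]' \
    --output text
}

# Usage (not run automatically):
#   versioning_status my-terraform-state
#   list_state_versions my-terraform-state env/prod/terraform.tfstate
```

Matching the timestamps against the time of the bad apply is usually the quickest way to pick the last known-good version.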
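# Keep a copy of the current state before overwriting it
terraform state pull > pre-restore.tfstate
# Download the previous state version from S3
aws s3api get-object \
  --bucket my-terraform-state \
  --key env/prod/terraform.tfstate \
  --version-id "abc123previousversion" \
  previous.tfstate
# Push it as the current state. The older file has a lower serial than the
# remote state, so Terraform rejects the push unless -force is given.
terraform state push -force previous.tfstate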
When it works: the infrastructure was not actually changed — for example, the apply failed partway through and no resources were modified. Restoring the state file simply corrects Terraform's view of the world.
When it doesn't: the infrastructure was changed but you restored an old state. Now Terraform's state says resource X has configuration A, but the real resource has configuration B. The next terraform plan will show unexpected diffs, and applying them may cause further damage. You've introduced state drift on purpose.
This strategy is also dangerous because it can silently drop resources from state. If the bad apply created new resources, the old state file doesn't know about them. They become orphaned — still running, still costing money, but invisible to Terraform.
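Before restoring, it can help to list which resources the older state does not know about. The sketch below compares the resource addresses in two state files. It assumes jq is installed and that both files use state format version 4, whose layout is not a stable public interface. The helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: find resources that restoring an older state file would orphan.
# Assumes jq is installed and both files use state format version 4;
# state_addresses and would_be_orphaned are hypothetical helper names.

# Print sorted addresses of managed resources recorded in a state file.
state_addresses() {
  jq -r '.resources[]?
         | select(.mode == "managed")
         | (if .module then .module + "." else "" end) + .type + "." + .name' "$1" | sort -u
}

# Print addresses present in the current state ($2) but absent from the
# older state ($1): these would keep running but become invisible.
would_be_orphaned() {
  old_list=$(mktemp) && new_list=$(mktemp) || return 1
  state_addresses "$1" > "$old_list"
  state_addresses "$2" > "$new_list"
  comm -13 "$old_list" "$new_list"
  rm -f "$old_list" "$new_list"
}

# Usage (not run automatically):
#   terraform state pull > current.tfstate
#   would_be_orphaned previous.tfstate current.tfstate
```

Anything this prints needs to be destroyed by hand or re-imported after the restore.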
Strategy 3: Targeted resource surgery
For surgical fixes, you can manipulate individual resources in the state file.
Import a resource back into state
If a resource was removed from state but still exists in the cloud:
terraform import aws_security_group.web sg-0abc123def456
You need the resource's cloud-side ID, which means digging through the cloud console or CLI to find it.
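For a security group, the ID can often be looked up by name with the AWS CLI rather than in the console. This is a minimal sketch. find_sg_id is a hypothetical helper, and the group name web is an assumption chosen to match the import example.

```shell
#!/bin/sh
# Sketch: look up a security group ID by name before importing it.
# find_sg_id is a hypothetical helper; "web" is an assumed group name.
find_sg_id() {
  aws ec2 describe-security-groups \
    --filters "Name=group-name,Values=$1" \
    --query 'SecurityGroups[].GroupId' \
    --output text
}

# Usage (not run automatically):
#   sg_id=$(find_sg_id web)
#   terraform import aws_security_group.web "$sg_id"
```

If several groups share the name (for example across VPCs), the command prints more than one ID, so check the output before passing it to terraform import.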
Remove and re-import
If a resource's state is corrupted or out of sync:
terraform state rm aws_instance.api_server
terraform import aws_instance.api_server i-0abc123def456
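Since a wrong state rm is hard to undo, one precaution is to snapshot the state before touching it. The sketch below wraps the two commands above with a backup step. reimport_resource is a hypothetical helper name, and the backup file naming is an assumption.

```shell
#!/bin/sh
# Sketch: snapshot the state before removing and re-importing a resource.
# reimport_resource is a hypothetical helper; its arguments are the resource
# address and the resource's cloud-side ID.
reimport_resource() {
  address=$1
  cloud_id=$2
  backup="state-backup-$(date +%Y%m%d%H%M%S).tfstate"

  # Keep a copy of the state as it is now, so a mistake can be undone.
  terraform state pull > "$backup" || return 1

  terraform state rm "$address" || return 1
  if ! terraform import "$address" "$cloud_id"; then
    echo "import failed; previous state saved in $backup" >&2
    return 1
  fi

  # Print the backup file name so the caller knows where the snapshot is.
  echo "$backup"
}

# Usage (not run automatically):
#   reimport_resource aws_instance.api_server i-0abc123def456
```

The helper deliberately does not restore the backup on failure; deciding whether to push the old state back is left to the operator.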
Taint and replace
If a resource is in a bad state and you want Terraform to destroy and recreate it:
terraform taint aws_instance.api_server
terraform apply
Or, on Terraform v0.15.2 and later, where terraform taint is deprecated in favor of the -replace option:
terraform apply -replace="aws_instance.api_server"
The risks: all of these operations require you to know exactly what went wrong, which resources are affected, and what their cloud-side identifiers are. One wrong state rm can orphan a resource. One wrong import can associate the wrong cloud resource with the wrong Terraform declaration. And none of these tools help you if the problem spans dozens of resources.
Strategy 4: Manual cloud-side fixes
Sometimes the fastest recovery path is to bypass Terraform entirely and fix things in the cloud console or CLI:
# Restore the security group rule directly
aws ec2 authorize-security-group-ingress \
  --group-id sg-0abc123def456 \
  --protocol tcp \
  --port 443 \
  --cidr 0.0.0.0/0
Then run terraform plan to verify the actual state matches the desired state. If it does, the plan shows no changes and you're back in sync.
The risk: you've fixed the immediate problem, but now your infrastructure's actual state diverges from what your team thinks Terraform is managing. If someone later runs terraform apply without checking the plan, the manual fix might get overwritten.
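The verification step can be scripted so it is harder to skip. The sketch below relies on the -detailed-exitcode option of terraform plan, which exits 0 when there are no changes, 1 on error, and 2 when the plan contains changes. check_in_sync is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report whether infrastructure matches the configuration.
# check_in_sync is a hypothetical helper name.
# `terraform plan -detailed-exitcode` exits 0 for no changes,
# 1 for an error, and 2 when the plan contains changes.
check_in_sync() {
  plan_rc=0
  terraform plan -detailed-exitcode -input=false > /dev/null || plan_rc=$?
  case $plan_rc in
    0) echo "in sync" ;;
    2) echo "changes pending"; return 2 ;;
    *) echo "plan failed" >&2; return 1 ;;
  esac
}

# Usage (not run automatically):
#   check_in_sync || echo "do not apply until the diff is understood"
```

The plan text is discarded here to keep the result easy to script against; run a normal terraform plan to see what the pending changes are.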
The real problem
All of these strategies share a common weakness: they're reactive. You're already in a bad state, and you're trying to get out. The effort required scales with:
- Blast radius: how many resources were affected. A monolithic state with 500 resources means a bad apply can touch anything.
- Complexity: how many cross-resource dependencies exist. Fixing the security group is easy; fixing the security group, the load balancer that references it, the target group that references the load balancer, and the DNS record that references all of them is not.
- Time pressure: production is down, people are paging you, and you're manually running terraform import commands.
Reducing the need for rollback
Rather than getting better at rollback, the more effective approach is to make bad applies less likely and less damaging.
Plan review and approval gates
The single most effective safeguard is requiring human review of every plan before it's applied. Most bad applies happen because someone ran terraform apply without carefully reading the plan, or because a CI pipeline auto-applied without any gate. Terraform's sensitive marking compounds this: it suppresses the entire local-exec command output when any input is sensitive, and one sensitive field marks the whole containing object as sensitive, which makes plan output harder to review. Operators who need to debug can't see sensitive values in plan output or in the human-readable terraform show output at all; the values are only visible in the raw state, where they are stored in plain text.
Snap CD enforces this with approval thresholds — you can require one or more approvals before an apply proceeds. The plan output is visible in the dashboard, so reviewers can see exactly what will change before it happens. The Snap CD Terraform provider makes this declarative:
resource "snapcd_module" "production_network" {
name = "networking"
namespace_id = snapcd_namespace.prod.id
source_url = "https://github.com/myorg/infra-networking.git"
runner_id = snapcd_runner.prod.id
apply_approval_threshold = 2
}
Two people have to approve before anything touches the production network. That alone prevents the majority of bad applies.
Smaller blast radius through modules
A monolithic state is a single point of failure. If your networking, compute, database, and application resources all share one state, a bad apply to any of them risks all of them.
Splitting into smaller, focused states (Snap CD Modules) means a bad apply to the networking Module can't touch the database. The blast radius is contained by design.
Drift detection
Sometimes the problem isn't a bad apply — it's a manual change that someone made in the console that Terraform doesn't know about. The next terraform apply overwrites the manual change, and now you need to "roll back" an apply that was technically correct but destructive.
Snap CD can run periodic plans per Module to detect drift between the declared state and the actual infrastructure, alerting you before the next apply blindsides someone's manual fix. Drift detection is one of the most requested features across the infrastructure tooling ecosystem — Atlantis's drift detection request is the single most requested feature in the project's history.
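Without a dedicated tool, a rough equivalent is a scheduled job that plans each state directory and reports the ones with pending changes. This is a sketch of that idea in plain shell, not a description of how Snap CD implements it. drifted_dirs is a hypothetical helper name and the directory layout in the usage line is an assumption.

```shell
#!/bin/sh
# Sketch: run a plan in each state directory and list the ones that report
# changes. drifted_dirs is a hypothetical helper name, and the directory
# layout in the usage line is an assumption.
drifted_dirs() {
  for dir in "$@"; do
    plan_rc=0
    terraform -chdir="$dir" plan -detailed-exitcode -input=false -lock=false \
      > /dev/null || plan_rc=$?
    case $plan_rc in
      0) ;;                                  # no changes
      2) echo "$dir" ;;                      # drift or unapplied changes
      *) echo "plan failed in $dir" >&2 ;;   # error
    esac
  done
}

# Usage from cron or CI (not run automatically):
#   drifted_dirs stacks/networking stacks/database
```

A directory is listed both when someone changed the infrastructure by hand and when merged code has not been applied yet, so the output is a prompt to investigate rather than a verdict.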
Version-pinned sources
A common cause of unexpected plan changes is an upstream module or provider updating without warning. Snap CD Modules can pin to specific Git tags or commits, so new versions only deploy when you explicitly update the reference and approve the resulting plan.
Tips
- Always read the plan. This sounds obvious, but most bad applies come from skipping plan review. If your workflow doesn't force you to read the plan, fix the workflow.
- Enable state file versioning. Even if restoring old state is risky, having the option is better than not. S3 versioning, GCS versioning, or Terraform Cloud's built-in state history all work.
- Keep blast radius small. If you're worried about rollback, you probably have too many resources in one state. Split them.
- Test in a non-production environment first. Obvious, but violated constantly. If your CI pipeline applies to production without first applying to staging, you're skipping the cheapest possible safety net.
- Document your recovery procedures. When production is down is not the time to figure out how terraform import works. Write runbooks for your most critical resources.
See also
- Modular Deployments — how smaller Modules reduce blast radius
- Detecting and Managing Terraform Drift — catching manual changes before the next apply overwrites them
- A Permission System Built for Infrastructure — approval gates that prevent bad applies
- The Problem with Large Terraform States — why monolithic states make rollback harder