
Detecting and Managing Terraform Drift

Terraform assumes it's the only thing managing your infrastructure. The moment something changes outside of Terraform — a manual console edit, an auto-scaler adjusting capacity, another tool modifying a resource, an emergency hotfix applied directly in the cloud — Terraform's state file no longer reflects reality.

That gap between what Terraform thinks exists and what actually exists is drift. Every team experiences it. Few have a reliable way to detect it.

What drift looks like

Drift isn't always obvious. Some common scenarios:

  • Emergency console changes. Production is down. An engineer opens the AWS console and widens a security group to restore traffic. The fix works. Nobody updates the Terraform code. Two weeks later, someone runs terraform apply on an unrelated change, and the apply quietly reverts the security group, taking production down again.

  • Auto-scaling and managed services. AWS auto-scaling changes the desired capacity on an ASG. Azure adjusts throughput on a Cosmos DB instance. GCP resizes a managed instance group. These are expected changes made by the cloud provider, but Terraform's state doesn't know about them. The next plan shows phantom diffs that confuse reviewers.

  • Cross-tool modifications. A Kubernetes operator creates a load balancer that Terraform also manages. A CI pipeline updates an IAM policy outside of Terraform. A different team uses Pulumi for their resources but shares a VPC that Terraform created. Any of these can modify resources that Terraform considers under its control.

  • Provider upgrades. A new version of the AWS provider reads a resource differently: normalising JSON policies, reordering security group rules, or adding new default attributes. The resource hasn't changed, but the plan shows a diff. This is one of the most common sources of noisy drift: differences that surface during refresh and look like drift, even though nothing changed in the cloud.

Why drift is dangerous

Silent overwrites

The most immediate danger: terraform apply will converge the real infrastructure to match the declared configuration. If someone made a manual fix that isn't reflected in the code, the next apply reverts it. There's no special warning. The plan just shows a diff, and if the reviewer doesn't recognise it as "that emergency fix from last Tuesday," it gets applied.

Misleading plans

When state and reality diverge, terraform plan output becomes unreliable. A plan that shows "3 to change" might actually represent 1 intentional change and 2 drift reversions. Reviewers can't tell which is which. Over time, teams stop trusting the plan output — which defeats the entire purpose of plan review.

Compliance drift

Security-sensitive resources are the highest-risk category. A security group opened to 0.0.0.0/0 during an incident, an IAM policy with overly broad permissions added manually, a database encryption setting changed in the console — all of these are compliance violations that persist silently until someone runs a plan and either catches the diff or blindly applies over it.

Cascading across states

When infrastructure is split across multiple states, drift in one state can cascade. If the networking state's actual VPC configuration has drifted from what Terraform believes, every downstream state that depends on networking outputs is making decisions based on stale data. The compute state thinks the VPC has three subnets; it actually has four. Nothing breaks until it does.
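To make the stale-data problem concrete, here is a minimal sketch of the comparison a downstream state would need: the outputs it last consumed versus the outputs the upstream state would report now. The function name and the dictionaries are hypothetical and chosen for illustration; nothing here is a Terraform API.

```python
def stale_outputs(consumed: dict, current: dict) -> dict:
    """Return outputs whose consumed value no longer matches the current one.

    Both arguments are plain {output_name: value} mappings (hypothetical shape).
    The result maps each stale name to a (consumed, current) pair.
    """
    stale = {}
    for name in sorted(set(consumed) | set(current)):
        if consumed.get(name) != current.get(name):
            stale[name] = (consumed.get(name), current.get(name))
    return stale


# The compute state consumed three subnets; networking now has four.
consumed = {"vpc_id": "vpc-123", "subnet_ids": ["a", "b", "c"]}
current = {"vpc_id": "vpc-123", "subnet_ids": ["a", "b", "c", "d"]}
print(stale_outputs(consumed, current))
```

Only subnet_ids is reported, because vpc_id is unchanged. The hard part in practice is getting the "current" side, which requires refreshing the upstream state first.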

How teams detect drift today

Manual terraform plan

The simplest approach: someone runs terraform plan and looks for unexpected diffs.

terraform plan -detailed-exitcode
# Exit code 0: no changes
# Exit code 1: error
# Exit code 2: changes detected
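If you want to script this check, the exit codes above can be wrapped in a few lines. This is a rough sketch, assuming terraform is on the PATH and the working directory is already initialised; the function names and status labels are made up for illustration.

```python
import subprocess

# Meaning of each exit code when -detailed-exitcode is set (from the list above).
EXIT_MEANINGS = {0: "clean", 1: "error", 2: "changes"}


def classify_exit_code(code: int) -> str:
    """Map a terraform plan -detailed-exitcode result to a status label."""
    return EXIT_MEANINGS.get(code, "unknown")


def run_drift_check(workdir: str) -> str:
    """Run a plan in workdir and return 'clean', 'error', 'changes' or 'unknown'."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_exit_code(proc.returncode)
```

Calling run_drift_check("./networking") would return "changes" when the plan is not empty. Note that "changes" still covers both drift and uncommitted code changes.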

Why it doesn't scale:

  • It requires someone to remember to run it. Under deadline pressure, drift checks are the first thing skipped.
  • It holds the state lock for the entire plan duration. On a large state, that's minutes of blocking other operations.
  • The output mixes intentional changes with drift. If someone has uncommitted code changes locally, the plan shows both — and distinguishing them takes expertise.
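  • There's no structured output by default. You're reading terminal text, not querying a system that knows "this resource drifted." Terraform can emit a saved plan as JSON (terraform show -json), but something still has to parse, store, and track it.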
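One way to get past terminal text is to save the plan and read its JSON form. The sketch below assumes a plan saved with terraform plan -out=tfplan and exported with terraform show -json tfplan, and it relies on the resource_changes list in that JSON. The function name and the sample data are illustrative only.

```python
import json


def changed_resources(plan: dict) -> list:
    """List (address, actions) for every resource the plan would touch."""
    changed = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if actions and actions != ["no-op"]:
            changed.append((rc["address"], actions))
    return changed


# Trimmed-down sample of the JSON plan structure (illustrative data).
sample = json.loads("""
{
  "resource_changes": [
    {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
    {"address": "aws_security_group.web", "change": {"actions": ["update"]}}
  ]
}
""")
print(changed_resources(sample))
```

This gives you a list you can store or alert on, but it still can't tell drift apart from an intentional code change.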

Scheduled CI plans

A step up: a cron-triggered CI pipeline that runs terraform plan on a schedule and alerts on non-zero exit codes.

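# .github/workflows/drift-check.yml
# Cloud credentials and backend configuration are omitted here.
name: drift-check
on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 06:00 UTC (GitHub Actions schedules run in UTC)

jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      # Exit code 2 (changes detected) fails this step, which is the drift signal.
      - run: terraform plan -detailed-exitcode -input=false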

Problems:

  • Lock contention. The drift-check plan holds the state lock. If an engineer tries to run terraform plan at the same time, they're blocked.
  • No structured alerting. The pipeline either passes or fails. There's no "these 3 resources drifted" — just a wall of plan text in a CI log. Drift detection is the most requested Atlantis feature, and how to even trigger it is a recurring question — because CI-based approaches are fundamentally awkward.
  • False positives. Provider version differences between CI and local, or expected changes from auto-managed attributes, generate noise that drowns out real drift.
  • CI costs. Running a full plan across 20 states daily burns CI minutes. Running it hourly burns more. Most of those runs find nothing.
  • Single state at a time. Each pipeline job checks one state. Cross-state drift — where one state's actual outputs don't match what a dependent state consumed — isn't detected at all.
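The false-positive problem in particular tends to get solved with an allow-list. Here is a minimal sketch of that idea, assuming you already have a list of findings with the resource type and the attributes that changed. The finding shape, the EXPECTED_DRIFT table, and the function name are all hypothetical.

```python
# Hypothetical allow-list: attributes that are expected to change outside Terraform.
EXPECTED_DRIFT = {
    "aws_autoscaling_group": {"desired_capacity"},
}


def unexpected_drift(findings: list, expected: dict = EXPECTED_DRIFT) -> list:
    """Keep only findings with at least one attribute not on the allow-list."""
    flagged = []
    for finding in findings:
        allowed = expected.get(finding["type"], set())
        surprising = [attr for attr in finding["changed"] if attr not in allowed]
        if surprising:
            flagged.append((finding["address"], surprising))
    return flagged


findings = [
    {"address": "aws_autoscaling_group.web", "type": "aws_autoscaling_group",
     "changed": ["desired_capacity"]},
    {"address": "aws_security_group.web", "type": "aws_security_group",
     "changed": ["ingress"]},
]
print(unexpected_drift(findings))
```

The auto-scaler's change is dropped and the security group change is kept. The allow-list is one more thing to maintain, which is part of why this approach gets awkward.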

terraform plan -refresh-only

Terraform 0.15.4 added the -refresh-only planning mode, which separates the refresh phase from the planning phase. It shows you what changed in the real world and proposes only to update state to match, without proposing any changes to infrastructure:

terraform plan -refresh-only

This is better for drift detection than a full plan because it doesn't conflate drift with intentional code changes. But it still requires manual execution, still holds the state lock, and still produces unstructured text output. It's also had reliability issues — OpenTofu's implementation returned false positives where --refresh-only --detailed-exitcode exited with code 2 even when there were no actual changes.
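If you save a refresh-only plan and export it with terraform show -json, the changes Terraform detected outside its control are listed under resource_drift. The sketch below pulls out which top-level attributes differ for each drifted resource. The function name and sample data are illustrative, and nested attribute diffs are deliberately ignored.

```python
def drifted_attributes(plan: dict) -> dict:
    """Map each drifted resource address to the top-level attributes that differ."""
    report = {}
    for entry in plan.get("resource_drift", []):
        change = entry.get("change", {})
        before = change.get("before") or {}  # None when the resource is new
        after = change.get("after") or {}    # None when the resource was deleted
        keys = sorted(
            key for key in set(before) | set(after)
            if before.get(key) != after.get(key)
        )
        report[entry["address"]] = keys
    return report


# Trimmed-down sample of the JSON plan structure (illustrative data).
sample_plan = {
    "resource_drift": [
        {
            "address": "aws_security_group.web",
            "change": {
                "actions": ["update"],
                "before": {"name": "web", "description": "Managed by Terraform"},
                "after": {"name": "web", "description": "hotfix"},
            },
        }
    ]
}
print(drifted_attributes(sample_plan))
```

This answers "which resources drifted, and on what" rather than just "did anything change", but you still have to run it, store the result, and route it to someone.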

Cloud-native tools

AWS Config, Azure Policy, and GCP Security Command Center can detect configuration changes at the cloud level. They're good at what they do — but they don't understand Terraform.

AWS Config can tell you that a security group rule changed. It can't tell you which Terraform resource manages that security group, which state file it lives in, or whether the change is intentional. Correlating a Config finding back to Terraform code is manual detective work.

These tools complement Terraform drift detection. They don't replace it.

Third-party tools

Tools like driftctl (now part of Snyk) scan cloud resources and compare them against Terraform state. They can find resources that exist in the cloud but aren't in any state file (unmanaged resources) — something terraform plan can't do.

The trade-off is another tool to maintain, another set of credentials to manage, and another source of truth to reconcile. Most of these tools work at a point-in-time snapshot level, not continuously.

How Snap CD handles drift

Snap CD treats drift detection as a first-class operation, not a bolted-on CI job.

Scheduled drift checks per Module

Each Snap CD Module can run periodic plans on a schedule, independent of code changes. The Server triggers a plan, the Runner executes it, and the result is stored with the same structured metadata as any other deployment.

resource "snapcd_module" "networking" {
  name         = "networking"
  namespace_id = snapcd_namespace.prod.id
  source_url   = "https://github.com/myorg/infra-networking.git"
  runner_id    = snapcd_runner.prod.id
}

Drift checks run as normal plans: they refresh state against reality and show any differences. The key distinction is that they're triggered by the scheduler, not by a code change. As long as the code hasn't changed since the last apply, the plan output represents pure drift: changes made outside of Terraform.

Structured results in the dashboard

When drift is detected, it's visible in the Snap CD dashboard as a plan with changes. Reviewers can see exactly which resources drifted and what changed — not a wall of CI log text, but a structured plan output with the same review interface used for normal deployments.

Approval before correction

A drift detection plan that shows changes doesn't automatically apply. It enters the same approval workflow as any other plan. If the drift is intentional (an emergency fix that needs to stay), someone can dismiss the plan. If it's unintentional (someone accidentally changed a setting in the console), the team can approve the corrective apply to bring reality back in line with code.

This is the critical difference from continuous reconciliation tools like Crossplane, which would silently revert the change. Snap CD surfaces drift and lets humans decide what to do about it.

RBAC for drift visibility

Not everyone needs to see drift in every Module. Snap CD's permission system controls who can view plans (including drift check results) and who can approve corrective applies. The security team can have Reader access to see drift across all Modules without the ability to approve changes. The networking team can approve corrections to their own Modules without needing access to application infrastructure.

Cross-state drift awareness

When drift is detected in a Module that produces outputs consumed by other Modules, Snap CD understands the dependency graph. If the networking Module's actual vpc_id has drifted, Snap CD knows that the compute and database Modules depend on that output. Correcting the drift in networking can trigger re-plans in dependent Modules, catching cascading effects that CI-based drift detection misses entirely.
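The general idea behind a dependency graph like this can be sketched as a breadth-first walk from the drifted Module to everything downstream of it. This is an illustration of the technique only, not Snap CD's implementation; the consumers mapping and the function name are hypothetical.

```python
from collections import deque


def downstream_modules(consumers: dict, drifted: str) -> list:
    """Return every module that directly or transitively consumes `drifted`.

    `consumers` maps a module name to the modules that read its outputs.
    Modules are returned in breadth-first order, each one once.
    """
    visited = {drifted}
    ordered = []
    queue = deque([drifted])
    while queue:
        current = queue.popleft()
        for dependent in consumers.get(current, []):
            if dependent not in visited:
                visited.add(dependent)
                ordered.append(dependent)
                queue.append(dependent)
    return ordered


consumers = {
    "networking": ["compute", "database"],
    "compute": ["application"],
    "database": ["application"],
}
print(downstream_modules(consumers, "networking"))
```

Here drift in networking marks compute, database, and application for a re-plan, with application listed once even though two Modules feed it.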

A practical setup

A team with five Modules across networking, compute, database, application, and DNS:

resource "snapcd_module" "networking" {
  name         = "networking"
  namespace_id = snapcd_namespace.prod.id
  source_url   = "https://github.com/myorg/infra-networking.git"
  runner_id    = snapcd_runner.prod.id
  apply_approval_threshold = 2
}

resource "snapcd_module" "compute" {
  name         = "compute"
  namespace_id = snapcd_namespace.prod.id
  source_url   = "https://github.com/myorg/infra-compute.git"
  runner_id    = snapcd_runner.prod.id
  apply_approval_threshold = 1
}

resource "snapcd_module_input_from_output" "vpc_to_compute" {
  module_id        = snapcd_module.compute.id
  input_kind       = "Param"
  name             = "vpc_id"
  output_module_id = snapcd_module.networking.id
  output_name      = "vpc_id"
}

When the scheduled drift check runs on the networking module and detects that a subnet was added manually in the console:

  1. The drift appears in the dashboard as a plan showing the unexpected subnet.
  2. Two approvers review the plan (networking requires 2 approvals).
  3. If they approve the corrective apply, Terraform removes the manually added subnet. If the team wants to keep it instead, they update the code first and the next plan shows no changes.
  4. If the corrective apply changes networking's outputs, compute automatically re-plans.

No CI cron job. No Slack message asking "did anyone change the VPC?" No terraform plan holding the lock while an engineer reads the output.

Comparison

                      | Manual plan        | CI cron        | Cloud-native tools            | Snap CD
Automation            | None               | Schedule-based | Continuous                    | Schedule-based
Lock contention       | Yes                | Yes            | No                            | Managed per-Module
Structured results    | No (terminal text) | No (CI logs)   | Yes (but not Terraform-aware) | Yes (dashboard)
Approval before fix   | Manual             | Manual         | N/A                           | Built-in
Cross-state awareness | None               | None           | None                          | Dependency graph
Drift vs. code change | Mixed              | Mixed          | N/A                           | Separated
Cost                  | Engineer time      | CI minutes     | Cloud service pricing         | Included

Tips

  • Check drift more often on high-risk resources. Security groups, IAM policies, and database configurations are the most common targets for manual changes. Schedule drift checks for modules containing these resources more frequently than stable infrastructure like DNS.
  • Don't auto-apply drift corrections. The whole point is to surface drift for human review. Automatic correction is just continuous reconciliation with extra steps — and it defeats the safety of the plan-then-approve workflow.
  • Investigate before correcting. When drift is detected, the first question is "why?" If someone made an emergency change, the fix is to update the Terraform code to match, not to revert the change. If the drift is from a provider bug or an auto-managed attribute, the fix might be ignore_changes, not a corrective apply.
  • Track your drift rate. If the same module drifts repeatedly, that's a signal — either the manual change is actually needed (and should be in code) or the team doesn't trust the Terraform workflow enough to use it for urgent changes.
  • Separate drift-prone resources. Resources that are frequently modified outside Terraform (auto-scaled groups, resources managed by operators) should be in their own Module, so their expected drift doesn't create noise in Modules that should never drift.
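Tracking a drift rate doesn't need much. As a minimal sketch, assuming you keep a history of drift check results as (module, drifted) pairs, the rate per module is just drifted checks over total checks. The history format and function name are assumptions for illustration.

```python
from collections import defaultdict


def drift_rates(history: list) -> dict:
    """Return the fraction of checks that found drift, per module."""
    totals = defaultdict(int)
    drifted = defaultdict(int)
    for module, had_drift in history:
        totals[module] += 1
        if had_drift:
            drifted[module] += 1
    return {module: drifted[module] / totals[module] for module in totals}


history = [
    ("networking", True),
    ("networking", False),
    ("dns", False),
    ("networking", True),
]
rates = drift_rates(history)
print(rates)
```

A module whose rate stays high across weeks is the one to investigate first, and the one to consider checking more often.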

© 2026 Snap CD. All rights reserved.
