Terraform State Locking and Concurrency

Terraform state is a single file. Every plan reads it, every apply writes it, and only one operation can hold the lock at a time. That's fine when one person runs Terraform on a laptop. It breaks down the moment a second engineer — or a second CI pipeline — tries to touch the same state.

This guide explains how Terraform's locking mechanism works, where it falls short, and what you can do about it.

How state locking works

When you run terraform plan or terraform apply, Terraform attempts to acquire a lock on the state file before doing anything. The lock is a mutual-exclusion primitive: if someone else holds it, your operation fails immediately by default, with an error like this:

Error: Error acquiring the state lock

Error message: ConditionalCheckFailedException: The conditional request failed
Lock Info:
  ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
  Path:      my-bucket/terraform.tfstate
  Operation: OperationTypeApply
  Who:       alice@laptop
  Version:   1.9.0
  Created:   2026-06-15 14:32:01.123456 +0000 UTC

The implementation depends on your backend:

Backend           Lock mechanism
S3                DynamoDB table with conditional writes
Azure Blob        Blob leases (60-second renewable)
GCS               Object versioning with generation match
Consul            Session-based locks
PostgreSQL        Advisory locks
Terraform Cloud   API-managed run queue

The lock is released automatically when the operation completes — successfully or not. If Terraform crashes mid-apply, the lock stays held until it times out or someone manually removes it.
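The acquire-or-fail behaviour can be modelled in a few lines. The sketch below is a toy, single-process stand-in for a backend lock store. The class and method names are hypothetical and do not correspond to any Terraform or cloud API. In a real backend, the "check and write" step is a single atomic conditional write.

```python
import uuid
from dataclasses import dataclass


class LockError(Exception):
    """Raised when a lock cannot be acquired or released."""


@dataclass(frozen=True)
class LockInfo:
    id: str
    who: str
    operation: str


class InMemoryLockStore:
    """Toy stand-in for a backend lock store. Hypothetical, not a real API."""

    def __init__(self):
        self._locks = {}  # state path -> LockInfo

    def acquire(self, path, who, operation):
        # Conditional write: only succeeds if no lock record exists yet.
        if path in self._locks:
            holder = self._locks[path]
            raise LockError(f"state locked by {holder.who} ({holder.operation})")
        info = LockInfo(id=str(uuid.uuid4()), who=who, operation=operation)
        self._locks[path] = info
        return info

    def release(self, path, lock_id):
        # Removal requires the matching lock ID, as force-unlock does.
        holder = self._locks.get(path)
        if holder is None or holder.id != lock_id:
            raise LockError("no lock with that ID on this state")
        del self._locks[path]


store = InMemoryLockStore()
alice = store.acquire("my-bucket/terraform.tfstate", "alice@laptop", "apply")
try:
    store.acquire("my-bucket/terraform.tfstate", "bob@ci", "plan")
except LockError as err:
    print(err)  # state locked by alice@laptop (apply)
store.release("my-bucket/terraform.tfstate", alice.id)
```

A holder that crashes never reaches the release call, which is how the record in this model would outlive the process that created it.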

The problems

1. No queuing

Terraform's lock is a mutex, not a queue. By default, if the lock is held, your operation doesn't wait: it fails. You get an error message, and you retry manually. In a CI pipeline, this means a failed build that someone has to re-trigger. In concurrent GCS-backend scenarios, multiple processes can even obtain the same lock simultaneously, leading to silent state corruption.

The -lock-timeout flag softens this: Terraform keeps retrying to acquire the lock for the given duration before returning an error. But that is polling, not queuing. Waiting operations have no guaranteed order, so whichever one happens to retry first after the release wins, and anything still waiting when the timeout expires fails anyway. Users have requested a more capable retry mechanism for years, but there is still no configurable retry count and no way to hold a place in line.
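The difference between failing fast and waiting for a bounded time can be sketched as a retry loop with a deadline. This is an illustration only, not Terraform's own code: the function name, the try_acquire callback, and the delay values are all assumptions chosen for the example.

```python
import time


def acquire_with_timeout(try_acquire, timeout, sleep=time.sleep,
                         clock=time.monotonic, initial_delay=1.0,
                         max_delay=16.0):
    """Call try_acquire() until it returns True or `timeout` seconds pass.

    Returns the number of attempts made. Raises TimeoutError if the lock
    is still held when the deadline is reached. The delay doubles after
    each failed attempt, capped at max_delay (illustrative values).
    """
    deadline = clock() + timeout
    delay = initial_delay
    attempts = 0
    while True:
        attempts += 1
        if try_acquire():
            return attempts
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"lock still held after {attempts} attempts")
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)
```

Note what the loop does not provide: two callers running it at the same time have no ordering between them, so a later arrival can win the lock ahead of an earlier one.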

2. Stale locks

If Terraform crashes, loses network connectivity, or if someone kills the process mid-apply, the lock can remain held with no process behind it. Different backends handle this differently:

  • DynamoDB: The lock record stays until you run terraform force-unlock <ID>.
  • Azure Blob: The lease expires after 60 seconds (if not renewed), but can also get stuck.
  • GCS: Similar to DynamoDB — manual intervention required.

terraform force-unlock removes the lock, but it's a dangerous command. If you force-unlock while an operation is genuinely still running (maybe on a different machine, or in a CI runner you can't see), you can end up with two concurrent applies writing to the same state. That's how you get corrupted state. When the unlock itself fails, the state can become completely stuck until someone intervenes manually. Users frequently report issues with locks that never release properly, error messages that don't include the lock ID needed to force-unlock, and inability to force-unlock certain backends like azurerm.

3. Contention scales with state size

The bigger the state, the longer each operation takes. A state with 500 resources might take 3–5 minutes just for terraform plan (because every resource gets refreshed). During that entire time, the lock is held. If three engineers are working on the same infrastructure, two of them are constantly bouncing off the lock.
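A rough way to put a number on this is to estimate what fraction of each hour the lock is held. The helper below is a back-of-the-envelope sketch. The function name and the workload figures in the example are assumptions, and it ignores how operations are distributed in time.

```python
def lock_utilisation(operations_per_hour, minutes_per_operation):
    """Fraction of each hour the lock is held, capped at 1.0."""
    busy_minutes = operations_per_hour * minutes_per_operation
    return min(busy_minutes / 60.0, 1.0)


# Assumed workload: three engineers, four operations each per hour,
# four minutes per operation (within the 3-5 minute range above).
print(lock_utilisation(3 * 4, 4))  # 0.8
```

At that level the lock is busy for 48 minutes of every hour, so most attempts to start an operation will find it held.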

This is the real problem. Locking is a solved problem for small states with infrequent changes. It becomes a daily friction point for large states with multiple contributors. The OpenTofu community has recognized this as a fundamental design limitation and proposed per-resource granular locking and non-locking plan operations, but neither has been implemented.

4. Plans go stale

Even when locking works correctly, there's a race condition in the human workflow. You run terraform plan, review the output, decide it looks good, then run terraform apply. Between your plan and your apply, someone else might have applied a different change. Your plan was based on a state that no longer exists.

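Terraform guards against this when you apply a saved plan file: the plan records the state it was created from, and if the state has changed since, the apply is rejected as stale. (Running terraform apply without a plan file avoids the mismatch by generating a fresh plan and asking for approval on the spot.) Either way, you've already spent time reviewing a plan that turned out to be stale, and now you have to start over.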
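The staleness check can be modelled as comparing the state version a plan was built from with the state version at apply time. Terraform state files do carry lineage and serial fields, but everything else in this sketch (class names, functions, the exact comparison) is a simplified assumption rather than Terraform's implementation.

```python
from dataclasses import dataclass, field


class StalePlanError(Exception):
    """Raised when a plan was built from a state that has since changed."""


@dataclass(frozen=True)
class State:
    lineage: str
    serial: int


@dataclass(frozen=True)
class SavedPlan:
    lineage: str
    serial: int
    changes: tuple = field(default_factory=tuple)


def make_plan(state, changes):
    # Record which state version the plan was computed against.
    return SavedPlan(state.lineage, state.serial, tuple(changes))


def apply_plan(state, plan):
    # Reject the plan if the state has moved on since it was created.
    if (plan.lineage, plan.serial) != (state.lineage, state.serial):
        raise StalePlanError(
            f"plan built from serial {plan.serial}, state is at {state.serial}"
        )
    # A successful apply writes a new state version.
    return State(state.lineage, state.serial + 1)
```

If two engineers plan against serial 7 and one applies first, the state moves to serial 8 and the second engineer's plan is rejected, however carefully it was reviewed.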

Common workarounds

"Just coordinate"

The simplest approach: tell people to announce in Slack when they're about to run Terraform. This works for a team of three. It collapses at ten. It's also impossible to enforce — someone will forget, or be in a different timezone, or not check the channel.

CI serialisation

Configure your CI pipeline to run Terraform jobs sequentially — one at a time, in a queue. GitHub Actions can do this with concurrency groups:

concurrency:
  group: terraform-production
  cancel-in-progress: false

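This eliminates lock contention but introduces a different problem: every Terraform change across your entire infrastructure is serialised. A DNS record update waits behind a database migration waits behind a security group change. Your CI queue becomes a bottleneck. It is also not a true FIFO queue: a concurrency group holds at most one pending run, so a newly queued run cancels any run that was already waiting in the same group.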

Workspace-per-branch

Some teams create a separate Terraform workspace for each feature branch, so engineers never contend on the same state. The intent is right — isolate concurrent work — but the execution is messy:

  • Workspaces share the same backend configuration, so you still need the same credentials everywhere.
  • You need cleanup automation to destroy workspaces after branches merge.
  • The workspace state diverges from production immediately, so your plan output doesn't reflect what will actually happen when you merge.
  • If two branches change the same resource, you discover the conflict at merge time, not at plan time.

Smaller states

The most effective workaround is splitting your monolithic state into smaller ones. If the networking team has their own state and the application team has theirs, they never contend on the same lock. This is the right direction, but it introduces a new problem: managing the dependencies between states. (See Splitting a Terraform Monolith into Smaller States for a detailed walkthrough.)

How Snap CD handles concurrency

Snap CD eliminates state lock contention by design. Instead of engineers and CI pipelines racing to acquire a lock, deployments flow through an orchestrated queue.

One queue per Module

Each Snap CD Module, which corresponds to a single Terraform root and state file, has its own deployment queue. When a deployment is triggered (by a source change, a dependency update, or a manual action), it enters the queue. The Snap CD Server ensures that only one operation runs against a given Module at a time. Lock contention disappears because operations are serialised at the orchestration layer, before they reach Terraform.

If a second deployment is triggered while the first is still running, it waits. No error, no manual retry, no Slack coordination. It just runs next.
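The general pattern of one FIFO queue with a single worker per state can be sketched as follows. This illustrates the technique only and says nothing about how Snap CD is built internally. The ModuleQueue class and its methods are hypothetical names.

```python
import queue
import threading


class ModuleQueue:
    """Runs submitted jobs one at a time, in submission order."""

    def __init__(self, name):
        self.name = name
        self.failures = []  # exceptions raised by jobs, in order
        self._jobs = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def submit(self, job):
        # Never blocks and never fails: the job simply waits its turn.
        self._jobs.put(job)

    def wait_idle(self):
        # Block until every submitted job has finished.
        self._jobs.join()

    def _run(self):
        while True:
            job = self._jobs.get()
            try:
                job()
            except Exception as exc:
                # A failed job is recorded; the next one still runs.
                self.failures.append(exc)
            finally:
                self._jobs.task_done()


log = []
networking = ModuleQueue("networking")
for step in ("deploy-1", "deploy-2", "deploy-3"):
    networking.submit(lambda step=step: log.append(step))
networking.wait_idle()
print(log)  # ['deploy-1', 'deploy-2', 'deploy-3']
```

Because each queue has exactly one worker, two jobs for the same module can never overlap, while separate ModuleQueue instances run independently of each other.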

Cascading without races

When a Module's outputs change, Snap CD automatically queues re-plans for all dependent Modules. These cascading deployments are also serialised per Module. If Module A's apply triggers re-plans for Modules B and C, those re-plans enter B's queue and C's queue respectively. B and C can plan and apply in parallel (they're different states), but two operations never hit the same state simultaneously.

Module A (networking) applies
  ├── Module B (compute) queued → plans → awaits approval → applies
  └── Module C (database) queued → plans → awaits approval → applies
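The fan-out in the diagram can be modelled as a lookup in a dependency map followed by one enqueue per dependent. The sketch below is a simplified illustration: the map, the function name, and the queue entries are hypothetical, and it does not describe Snap CD's actual data model.

```python
from collections import deque

# Hypothetical dependency map: module -> modules that consume its outputs.
DEPENDENTS = {"networking": ["compute", "database"]}


def queue_replans(dependents, queues, applied_module):
    """Append a re-plan to the queue of each direct dependent.

    `queues` maps module name -> deque of pending operations and is
    updated in place. Returns the modules that were queued.
    """
    queued = []
    for module in dependents.get(applied_module, []):
        queues.setdefault(module, deque()).append(("plan", applied_module))
        queued.append(module)
    return queued


queues = {}
print(queue_replans(DEPENDENTS, queues, "networking"))  # ['compute', 'database']
```

Each dependent gets the re-plan in its own queue, so compute and database can proceed in parallel while each remains serialised internally.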

Plans are always fresh

Because the orchestrator controls when plans run, there's no gap between "someone reviews a plan" and "the state changes underneath." The plan is generated from the current state at the time of execution, and approval gates hold the apply until a human (or a policy) signs off. If the Module's inputs change while waiting for approval, Snap CD re-plans automatically.

No force-unlock

Since Terraform is executed by Runners under the orchestrator's control, there are no orphaned locks. If a Runner crashes, the Server detects the lost connection and marks the deployment as failed. The next deployment in the queue starts clean — no manual force-unlock required.

When locking is enough

Not every project needs an orchestrator. If your team is small (two or three people), your state is modest (under a hundred resources), and you deploy infrequently (a few times a week), Terraform's built-in locking is fine. The overhead of coordinating is low, and contention is rare.

The signals that locking alone is no longer enough:

  • Engineers regularly hit lock errors and have to retry.
  • CI pipelines fail due to lock contention, and no one notices for hours.
  • You've resorted to Slack-based coordination or "deployment windows."
  • Someone has had to run force-unlock more than once.
  • Plan times exceed a minute or two, extending the window during which the lock is held.

At that point, the problem isn't locking — it's that you're asking a mutex to do the job of a scheduler.

Tips

  • Never run force-unlock without verifying that no operation is running. Check your CI dashboards, ask your team. A premature force-unlock can corrupt state.
  • If you're hitting lock contention regularly, the answer is smaller states, not longer timeouts. Increasing -lock-timeout just shifts the wait from "instant failure" to "slow failure."
  • Serialised CI queues are a stop-gap, not a solution. They eliminate contention at the cost of throughput. If your infrastructure is large enough to have contention problems, it's large enough to need parallel deployments across independent states.
  • Monitor your lock duration. If the average lock hold time is growing, your state is growing — and a split is overdue.
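For the last tip, one simple signal is whether recent lock holds are longer than earlier ones. The sketch below assumes you already record how long each operation held the lock, for example from CI job timings. The function name and the window size are illustrative choices.

```python
from statistics import mean


def hold_time_trend(durations_seconds, window=5):
    """Mean of the latest `window` lock holds minus the earliest `window`.

    A positive result means lock holds are getting longer.
    """
    if len(durations_seconds) < 2 * window:
        raise ValueError("need at least two full windows of samples")
    early = mean(durations_seconds[:window])
    recent = mean(durations_seconds[-window:])
    return recent - early
```

For example, five early holds of 60 seconds followed by five recent holds of 120 seconds give a trend of 60 seconds.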

© 2026 Snap CD. All rights reserved.