Terraform State Locking and Concurrency
Terraform state is a single file. Every plan reads it, every apply writes it, and only one operation can hold the lock at a time. That's fine when one person runs Terraform on a laptop. It breaks down the moment a second engineer — or a second CI pipeline — tries to touch the same state.
This guide explains how Terraform's locking mechanism works, where it falls short, and what you can do about it.
How state locking works
When you run terraform plan or terraform apply, Terraform attempts to acquire a lock on the state file before doing anything. The lock is a mutual-exclusion primitive — if someone else holds it, your operation fails immediately with an error:
Error: Error acquiring the state lock
Error message: ConditionalCheckFailedException: The conditional request failed
Lock Info:
  ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
  Path:      my-bucket/terraform.tfstate
  Operation: OperationTypeApply
  Who:       alice@laptop
  Version:   1.9.0
  Created:   2026-06-15 14:32:01.123456 +0000 UTC
The implementation depends on your backend:
| Backend | Lock mechanism |
|---|---|
| S3 | DynamoDB table with conditional writes |
| Azure Blob | Blob lease on the state blob (held until released or broken) |
| GCS | Lock object (.tflock) created with a does-not-exist precondition |
| Consul | Session-based locks |
| PostgreSQL | Advisory locks |
| Terraform Cloud | API-managed run queue |
The lock is released automatically when the operation completes — successfully or not. If Terraform crashes mid-apply, the lock stays held until it times out or someone manually removes it.
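Most of these mechanisms come down to the same idea: a write that succeeds only if no lock record exists yet. The sketch below illustrates that idea with an in-memory stand-in for the backend. ConditionalStore, acquire, and release are hypothetical names chosen for this example; none of this is Terraform's or any backend's actual code.

```python
import uuid


class LockHeldError(Exception):
    """Raised when another operation already holds the lock."""


class ConditionalStore:
    """In-memory stand-in for a backend that supports conditional writes."""

    def __init__(self):
        self._items = {}

    def put_if_absent(self, key, value):
        # The write succeeds only if no item exists for this key.
        if key in self._items:
            raise LockHeldError(f"lock held by {self._items[key]['who']}")
        self._items[key] = value

    def delete_if_id_matches(self, key, lock_id):
        # The lock is removed only when the caller presents its ID.
        item = self._items.get(key)
        if item is None or item["id"] != lock_id:
            return False
        del self._items[key]
        return True


def acquire(store, path, who):
    """Take the lock for `path` or raise LockHeldError; returns the lock ID."""
    lock_id = str(uuid.uuid4())
    store.put_if_absent(path, {"id": lock_id, "who": who})
    return lock_id


def release(store, path, lock_id):
    """Release the lock; returns False if the ID does not match."""
    return store.delete_if_id_matches(path, lock_id)
```

If the process that called acquire dies before calling release, the record simply stays in the store. That is the stale-lock case.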
The problems
1. No queuing
Terraform's lock is a mutex, not a queue. By default, if the lock is held, your operation doesn't wait: it fails. You get an error message, and you retry manually. In a CI pipeline, this means a failed build that someone has to re-trigger. Some users of the GCS backend have also reported concurrent processes obtaining the same lock, which risks silent state corruption.
The closest thing to "wait until the lock is free, then run" is the -lock-timeout flag, which makes Terraform keep retrying the lock until the given duration elapses. That is a bounded wait, not a queue: there's no ordering between waiters, no configurable retry count, and when the timeout expires the operation fails just as it would have without the flag.
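The difference between a bounded wait and a queue is easier to see in code. Below is a minimal sketch of a retry loop with a deadline. The function name, the doubling delay, and the 16-second cap are assumptions made for illustration, not a copy of Terraform's implementation.

```python
import time


def acquire_with_timeout(try_lock, timeout, sleep=time.sleep,
                         clock=time.monotonic, first_delay=1.0, max_delay=16.0):
    """Call try_lock() until it returns True or `timeout` seconds have passed.

    Returns True if the lock was acquired, False if the deadline expired.
    """
    deadline = clock() + timeout
    delay = first_delay
    while True:
        if try_lock():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            # Deadline reached: give up, exactly like an immediate failure.
            return False
        # Wait before the next attempt, never sleeping past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)
```

With timeout set to 0 the loop tries once and gives up, which corresponds to the fail-fast default. Nothing in the loop records who started waiting first, so two waiters have no guaranteed order.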
2. Stale locks
If Terraform crashes, loses network connectivity, or if someone kills the process mid-apply, the lock can remain held with no process behind it. Different backends handle this differently:
- DynamoDB: The lock record stays until you run terraform force-unlock <ID>.
- Azure Blob: The lease on the state blob does not expire on its own, so a crashed run leaves it held until someone force-unlocks or breaks the lease.
- GCS: Similar to DynamoDB. The lock object stays until someone removes it manually.
terraform force-unlock removes the lock, but it's a dangerous command. If you force-unlock while an operation is genuinely still running (maybe on a different machine, or in a CI runner you can't see), you can end up with two concurrent applies writing to the same state. That's how you get corrupted state. When the unlock itself fails, the state can become completely stuck until someone intervenes manually. Users frequently report issues with locks that never release properly, error messages that don't include the lock ID needed to force-unlock, and inability to force-unlock certain backends like azurerm.
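When the error output does include the lock details, it helps to extract them mechanically instead of copying an ID by hand. The sketch below parses the "Lock Info" fields from output shaped like the error shown earlier and builds the force-unlock command only after an explicit confirmation. parse_lock_info and force_unlock_command are hypothetical helpers, and the parser assumes the field layout from that example.

```python
import re

LOCK_FIELDS = ("ID", "Path", "Operation", "Who", "Version", "Created")


def parse_lock_info(error_text):
    """Extract the 'Lock Info' fields from lock error output.

    Returns a dict; fields missing from the output are simply absent.
    """
    info = {}
    for line in error_text.splitlines():
        # Tolerate leading whitespace and the box-drawing prefix some
        # Terraform versions print in front of error lines.
        match = re.match(r"[\s│]*(\w+):\s*(.*)$", line)
        if match and match.group(1) in LOCK_FIELDS:
            info[match.group(1)] = match.group(2).strip()
    return info


def force_unlock_command(info, confirmed_no_running_apply):
    """Build the force-unlock command, refusing without explicit confirmation."""
    if not confirmed_no_running_apply:
        raise RuntimeError("refusing: verify that no operation is still running")
    if "ID" not in info:
        raise ValueError("lock ID not found in error output")
    return ["terraform", "force-unlock", info["ID"]]
```

The Who and Created fields are the useful ones for the human check: they tell you whom to ask and how long the lock has been held before you decide it is orphaned.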
3. Contention scales with state size
The bigger the state, the longer each operation takes. A state with 500 resources might take 3–5 minutes just for terraform plan (because every resource gets refreshed). During that entire time, the lock is held. If three engineers are working on the same infrastructure, two of them are constantly bouncing off the lock.
This is the real problem. Locking is a solved problem for small states with infrequent changes. It becomes a daily friction point for large states with multiple contributors. The OpenTofu community has recognized this as a fundamental design limitation and proposed per-resource granular locking and non-locking plan operations, but neither has been implemented.
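A rough way to put a number on this is lock utilisation: the fraction of each hour during which the lock is held. The sketch below is a back-of-the-envelope model, not a measurement. It assumes operations start independently of each other and ignores retries, and the function name is made up for this example.

```python
def lock_utilisation(runs_per_hour, hold_minutes):
    """Fraction of each hour the lock is held, capped at 1.0.

    runs_per_hour: total plan/apply operations against one state per hour.
    hold_minutes:  average time one operation holds the lock.
    """
    return min(1.0, runs_per_hour * hold_minutes / 60.0)
```

Three engineers each running two operations an hour, with a 4-minute hold time, keep the lock busy for 24 minutes of every hour: a utilisation of 0.4. Under the stated assumptions, roughly four in ten attempts would find the lock already taken.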
4. Plans go stale
Even when locking works correctly, there's a race condition in the human workflow. You run terraform plan, review the output, decide it looks good, then run terraform apply. Between your plan and your apply, someone else might have applied a different change. Your plan was based on a state that no longer exists.
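Terraform guards against this. If you apply a saved plan file, it checks that the state hasn't changed since the plan was created and rejects the apply with a "Saved plan is stale" error. If you run terraform apply without a plan file, it plans again and asks you to confirm the new plan. Either way, you've already spent time reviewing a plan that turned out to be stale, and now you have to start over.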
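The staleness check can be pictured as a comparison of state versions. Terraform state files carry a lineage (an ID assigned when the state is created) and a serial (a counter that increases as the state is updated). The sketch assumes the plan recorded both values when it was made. StateVersion and plan_is_stale are hypothetical names, not Terraform internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateVersion:
    lineage: str  # ID assigned when the state was first created
    serial: int   # counter that increases as the state is updated


def plan_is_stale(planned_against, current):
    """A saved plan is stale if the state changed after the plan was made."""
    return (planned_against.lineage != current.lineage
            or planned_against.serial != current.serial)
```

For example, a plan made against serial 41 becomes stale as soon as a colleague's apply writes serial 42, however carefully the plan was reviewed in between.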
Common workarounds
"Just coordinate"
The simplest approach: tell people to announce in Slack when they're about to run Terraform. This works for a team of three. It collapses at ten. It's also impossible to enforce — someone will forget, or be in a different timezone, or not check the channel.
CI serialisation
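Configure your CI pipeline to run Terraform jobs sequentially, one at a time. GitHub Actions can approximate this with concurrency groups. Note that a group holds at most one running and one pending run, so a newly queued run replaces an older pending one instead of lining up behind it: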
concurrency:
  group: terraform-production
  cancel-in-progress: false
This eliminates lock contention but introduces a different problem: every Terraform change across your entire infrastructure is serialised. A DNS record update waits behind a database migration waits behind a security group change. Your CI queue becomes a bottleneck.
Workspace-per-branch
Some teams create a separate Terraform workspace for each feature branch, so engineers never contend on the same state. The intent is right — isolate concurrent work — but the execution is messy:
- Workspaces share the same backend configuration, so you still need the same credentials everywhere.
- You need cleanup automation to destroy workspaces after branches merge.
- The workspace state diverges from production immediately, so your plan output doesn't reflect what will actually happen when you merge.
- If two branches change the same resource, you discover the conflict at merge time, not at plan time.
Smaller states
The most effective workaround is splitting your monolithic state into smaller ones. If the networking team has their own state and the application team has theirs, they never contend on the same lock. This is the right direction, but it introduces a new problem: managing the dependencies between states. (See Splitting a Terraform Monolith into Smaller States for a detailed walkthrough.)
How Snap CD handles concurrency
Snap CD eliminates state lock contention by design. Instead of engineers and CI pipelines racing to acquire a lock, deployments flow through an orchestrated queue.
One queue per Module
Each Snap CD Module, which corresponds to a single Terraform root and state file, has its own deployment queue. When a deployment is triggered (by a source change, a dependency update, or a manual action), it enters the queue. The Snap CD Server ensures that only one operation runs against a given Module at a time. There's no lock contention because operations are serialised at the orchestration layer, before they reach Terraform.
If a second deployment is triggered while the first is still running, it waits. No error, no manual retry, no Slack coordination. It just runs next.
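The queue-per-Module idea can be sketched in a few lines. This is an illustration of the scheduling behaviour described above, not Snap CD's implementation. ModuleQueues and its methods are hypothetical names.

```python
from collections import defaultdict, deque


class ModuleQueues:
    """One FIFO queue per module; at most one running deployment per module."""

    def __init__(self):
        self._pending = defaultdict(deque)
        self._running = {}

    def submit(self, module, deployment):
        """Queue a deployment; returns it if it starts immediately, else None."""
        self._pending[module].append(deployment)
        return self._start_next(module)

    def finish(self, module):
        """Mark the running deployment as done (or failed); start the next one."""
        self._running.pop(module, None)
        return self._start_next(module)

    def running(self, module):
        """The deployment currently running for this module, or None."""
        return self._running.get(module)

    def _start_next(self, module):
        # Start the oldest pending deployment only if the module is idle.
        if module in self._running or not self._pending[module]:
            return None
        self._running[module] = self._pending[module].popleft()
        return self._running[module]
```

A second submit for the same module returns None and waits its turn, while a submit for a different module starts right away. That is the contrast with a mutex: the caller never sees an error, only a delay.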
Cascading without races
When a Module's outputs change, Snap CD automatically queues re-plans for all dependent Modules. These cascading deployments are also serialised per Module. If Module A's apply triggers re-plans for Modules B and C, those re-plans enter B's queue and C's queue respectively. B and C can plan and apply in parallel (they're different states), but two operations never hit the same state simultaneously.
Module A (networking) applies
├── Module B (compute) queued → plans → awaits approval → applies
└── Module C (database) queued → plans → awaits approval → applies
Plans are always fresh
Because the orchestrator controls when plans run, there's no gap between "someone reviews a plan" and "the state changes underneath." The plan is generated from the current state at the time of execution, and approval gates hold the apply until a human (or a policy) signs off. If the Module's inputs change while waiting for approval, Snap CD re-plans automatically.
No force-unlock
Since Terraform is executed by Runners under the orchestrator's control, there are no orphaned locks. If a Runner crashes, the Server detects the lost connection and marks the deployment as failed. The next deployment in the queue starts clean — no manual force-unlock required.
When locking is enough
Not every project needs an orchestrator. If your team is small (two or three people), your state is modest (under a hundred resources), and you deploy infrequently (a few times a week), Terraform's built-in locking is fine. The overhead of coordinating is low, and contention is rare.
The signals that locking alone is no longer enough:
- Engineers regularly hit lock errors and have to retry.
- CI pipelines fail due to lock contention, and no one notices for hours.
- You've resorted to Slack-based coordination or "deployment windows."
- Someone has had to run force-unlock more than once.
- Plan times exceed a minute or two, extending the window during which the lock is held.
At that point, the problem isn't locking — it's that you're asking a mutex to do the job of a scheduler.
Tips
- Never run force-unlock without verifying that no operation is running. Check your CI dashboards, ask your team. A premature force-unlock can corrupt state.
- If you're hitting lock contention regularly, the answer is smaller states, not longer timeouts. Increasing -lock-timeout just shifts the wait from "instant failure" to "slow failure."
- Serialised CI queues are a stop-gap, not a solution. They eliminate contention at the cost of throughput. If your infrastructure is large enough to have contention problems, it's large enough to need parallel deployments across independent states.
- Monitor your lock duration. If the average lock hold time is growing, your state is growing — and a split is overdue.
See also
- The Problem with Large Terraform States — why large states make locking worse
- Splitting a Terraform Monolith — reducing contention by splitting into smaller states
- Modular Deployments — how Snap CD's Module system manages parallel deployments
- Event-driven Continuous Deployment — how cascading changes flow through the deployment queue