Why Snap CD: AI on a Leash

Karl Schriek · March 22, 2026

AI coding agents are showing up in infrastructure workflows. They can diagnose a failed terraform apply, summarise what changed across a dozen modules overnight, draft a fix for a misconfigured security group, and recommend whether a plan is safe to approve. The potential to eliminate toil is real.

But so is the potential to break things. A bad terraform apply can delete a production database. An agent that auto-approves plans without understanding blast radius is not a productivity tool — it's a liability. The question isn't whether to use AI in infrastructure management. It's how to use it without handing over the keys.

The problem with unrestricted agents

Most AI agent frameworks assume broad access. Give the agent credentials, point it at your infrastructure, and let it figure things out. This works fine for generating code in a branch. It's a terrible model for infrastructure, where the gap between "run this command" and "destroy this resource" is one flag.

The usual mitigations are crude:

Read-only API keys. The agent can observe but not act. You get diagnostics but no automation — the human still has to do everything.
Wrapper scripts with allow-lists. You write a shell script that only permits certain Terraform commands. Fragile, hard to maintain, and easy to outgrow.
Separate CI pipelines. The agent commits to a branch, CI runs the plan, a human reviews. This works but adds latency and doesn't let the agent participate in approval or deployment at all.

None of these gives you a spectrum of trust. The choice stays all-or-nothing: either the agent can do everything, or it is limited to generating text that a human has to act on manually.

What you actually want

A useful model looks more like how you'd onboard a new team member:

Start them with read access so they can learn the system and diagnose issues.
Give them deploy access to test so they can move fast without risk.
Let them approve low-risk changes in staging once they've proven reliable.
Grant production access only when trust is established — and even then, scoped to the systems they own.

The same progression makes sense for an AI agent. The challenge is finding a permission system that supports this without building a separate authorization layer just for AI.

How Snap CD handles this

In Snap CD, an AI agent is just another principal — a service principal with role assignments, exactly like a human user or a CI service account. There is no special "AI mode" or separate agent permission system. The same RBAC that governs human access governs agent access.

Scoped role assignments

You grant an agent roles at whatever granularity makes sense — scoped to a Stack, Namespace, or Module:

# The agent can read everything in prod — diagnose issues, view plans, inspect state
resource "snapcd_role_assignment" "agent_prod_reader" {
  principal_id = snapcd_service_principal.ai_agent.id
  role         = "Reader"
  scope_id     = snapcd_stack.prod.id
}

# The agent can deploy freely in test — run plans, approve, apply
resource "snapcd_role_assignment" "agent_test_contributor" {
  principal_id = snapcd_service_principal.ai_agent.id
  role         = "Contributor"
  scope_id     = snapcd_stack.test.id
}

# The agent can approve plans in staging, but only for the networking namespace
resource "snapcd_role_assignment" "agent_staging_approver" {
  principal_id = snapcd_service_principal.ai_agent.id
  role         = "Approver"
  scope_id     = snapcd_namespace.staging_networking.id
}

If the agent tries to approve a production deploy, the request is denied, exactly as it would be for any user without the Approver role on that scope. No special-case logic, no wrapper scripts.

Approval gates as natural checkpoints

Snap CD's approval system works the same regardless of who (or what) created the plan. A Module can require a minimum number of approvals before an apply proceeds. This means:

An agent can trigger a plan and recommend approval.
A human reviews the plan output and approves or rejects.
The apply only proceeds once the required approval count is met.

You can also set up a workflow where the agent itself is one of multiple required approvers. Two humans and one agent, or two agents and one human — whatever quorum makes sense for the risk level. The approval system doesn't care whether the approver is biological.
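To make the quorum idea concrete, here is a minimal sketch of the counting rule. It is an illustrative model, not Snap CD's implementation: the `Approval` type, the `apply_allowed` function, and the rule that each principal counts once are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    principal_id: str
    kind: str  # "human" or "agent"; informational only, never used in the check


def apply_allowed(approvals, required_count):
    """Return True once enough distinct principals have approved.

    Assumption: a principal approving twice still counts once.
    """
    distinct_approvers = {a.principal_id for a in approvals}
    return len(distinct_approvers) >= required_count


approvals = [
    Approval("alice", "human"),
    Approval("plan-review-agent", "agent"),
]
print(apply_allowed(approvals, 3))  # False: two of three approvals so far
approvals.append(Approval("bob", "human"))
print(apply_allowed(approvals, 3))  # True: quorum reached
```

The check never looks at `kind`, which mirrors the point above: an approval is an approval, whoever gives it.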

Full audit trail

Every action an agent takes — triggering a plan, approving a deployment, reading state — is logged and attributed to its service principal. You can answer "what did the agent do last Tuesday?" the same way you'd answer it for any user: check the audit log.

Practical scenarios

Drift detection and remediation

An agent with Reader access across your Stacks can periodically inspect plan outputs for drift — resources that have changed outside of Terraform. When it detects drift:

It opens a PR in the source repo with the fix (a code change, not a Terraform command).
The PR triggers a Snap CD plan via the normal GitOps flow.
A human reviews the plan and approves.

The agent never runs terraform apply directly. It participates in the workflow at the level you've authorised: detect, propose, then step back.
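The detection step can be sketched as below, assuming the agent already has a plan in Terraform's JSON plan representation (the format produced by `terraform show -json`), where drifted resources are listed under `resource_drift`. How the agent retrieves that JSON from Snap CD is not covered here, and the function name `drifted_addresses` is hypothetical.

```python
import json


def drifted_addresses(plan_json: str) -> list[str]:
    """List the addresses of resources that changed outside of Terraform.

    Reads the `resource_drift` array of a Terraform JSON plan and skips
    entries whose only action is "no-op".
    """
    plan = json.loads(plan_json)
    return sorted(
        entry["address"]
        for entry in plan.get("resource_drift", [])
        if entry.get("change", {}).get("actions") != ["no-op"]
    )
```

The result is read-only input for the agent: it decides which code change to propose in a PR, and everything after that goes through the normal plan and approval flow.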

Plan review and selective approval

An agent with Approver access on a staging Namespace can review plans and make approval decisions based on risk:

No resource deletions, no provider changes, only tag updates? The agent approves automatically.
A security group rule change? The agent flags it for human review and withholds approval.
A database deletion? The agent rejects and notifies the team.

The rules are in the agent's logic. The enforcement is in Snap CD's permission system. If the agent's logic has a bug and it tries to approve a plan it shouldn't, the damage is bounded: it can only approve where it has the Approver role, so a bad decision in staging can never reach a scope the agent was not granted.
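The agent-side rules might look like the following sketch. It works on Terraform's JSON plan representation, using `resource_changes` and each entry's `change.actions`, `before`, and `after`. The rule set, the `SENSITIVE_TYPES` list, and the function names are assumptions for illustration. It is also stricter than the examples above: it rejects any deletion, not just database deletions, and it does not try to detect provider changes.

```python
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    ESCALATE = "escalate"  # withhold approval and flag for human review
    REJECT = "reject"


# Hypothetical list of resource types that always need a human.
SENSITIVE_TYPES = {"aws_security_group", "aws_security_group_rule"}
TAG_KEYS = {"tags", "tags_all"}


def changed_keys(before, after):
    """Top-level attributes whose values differ between before and after."""
    before, after = before or {}, after or {}
    return {k for k in before.keys() | after.keys() if before.get(k) != after.get(k)}


def review_plan(plan):
    decision = Decision.APPROVE
    for rc in plan.get("resource_changes", []):
        change = rc["change"]
        actions = change["actions"]
        if actions in (["no-op"], ["read"]):
            continue
        if "delete" in actions:  # covers delete and replace
            return Decision.REJECT
        tag_only = actions == ["update"] and changed_keys(
            change.get("before"), change.get("after")
        ) <= TAG_KEYS
        if rc.get("type") in SENSITIVE_TYPES or not tag_only:
            decision = Decision.ESCALATE
    return decision


plan = {
    "resource_changes": [
        {
            "address": "aws_instance.web",
            "type": "aws_instance",
            "change": {
                "actions": ["update"],
                "before": {"instance_type": "t3.micro", "tags": {"env": "stg"}},
                "after": {"instance_type": "t3.micro", "tags": {"env": "staging"}},
            },
        }
    ]
}
print(review_plan(plan))  # Decision.APPROVE
```

Anything the rules do not recognise falls through to `ESCALATE`, so an unexpected plan shape results in a human review instead of an automatic approval.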

Environment-scoped autonomy

A common pattern: give the agent broad permissions in test, narrow permissions in prod.

test/          → Contributor (full deploy access)
staging/       → Approver (can approve, scoped to specific namespaces)
prod/          → Reader (observe only)

In test, the agent can deploy freely — run plans, approve them, apply them. It iterates fast, catches issues early, and doesn't need human intervention for routine changes. In staging, it participates in the approval process but can't act unilaterally on sensitive namespaces. In prod, it can diagnose and report but never modify.

This isn't a rigid hierarchy. You can adjust per Namespace or per Module. Maybe the agent gets Contributor on prod/monitoring because deploying a new dashboard is low-risk, while prod/database stays human-only. The permission system is granular enough to express whatever trust model you need.
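As a way to reason about such a layout, here is a toy model of scoped role checks. It is not Snap CD's implementation. The role names come from this article, but the capability sets, the path-style scope strings, and the rule that assignments are additive and inherited by child scopes are assumptions made for the sketch.

```python
# Assumed capabilities per role; the article only describes them informally.
ROLE_CAPABILITIES = {
    "Reader": {"read"},
    "Approver": {"read", "approve"},
    "Contributor": {"read", "plan", "approve", "apply"},
}

# (scope, role) pairs for the agent, written as stack/namespace/module paths.
ASSIGNMENTS = [
    ("test", "Contributor"),
    ("staging/networking", "Approver"),
    ("prod", "Reader"),
    ("prod/monitoring", "Contributor"),
]


def covers(scope, target):
    """A scope covers itself and everything nested beneath it."""
    return target == scope or target.startswith(scope + "/")


def is_allowed(action, target, assignments=ASSIGNMENTS):
    """Allowed if any assignment covering the target grants the action."""
    return any(
        covers(scope, target) and action in ROLE_CAPABILITIES[role]
        for scope, role in assignments
    )


print(is_allowed("apply", "prod/monitoring/dashboards"))  # True
print(is_allowed("apply", "prod/database/primary"))       # False
print(is_allowed("read", "prod/database/primary"))        # True
```

Writing the layout down as data like this makes it easy to ask "what can the agent do here?" for any scope before the corresponding role assignments are applied.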

Why this works

The key insight is that there's nothing AI-specific about any of this. Snap CD doesn't have an "AI agent" feature. It has a permission system that treats every principal — human, service account, AI agent — the same way.

This is a deliberate design choice. A separate AI permission layer would mean:

Two systems to reason about. "Can the agent do X?" becomes a question about both the AI policy and the RBAC policy.
Inconsistent enforcement. If the AI layer permits something the RBAC layer denies (or vice versa), you get confusing failures.
Maintenance burden. Every new feature needs to consider both human and AI access paths.

By using the same system, you get one mental model, one audit log, and one set of roles to manage. An agent that's granted Contributor on a Namespace behaves exactly like a human Contributor on that Namespace — same capabilities, same restrictions, same audit trail.

Getting started

Setting up an AI agent as a Snap CD principal is the same as setting up any service account:

Create a service principal for the agent in your Snap CD organization.
Assign roles scoped to the Stacks, Namespaces, or Modules where the agent should operate.
Generate an API key for the agent to authenticate with.
Configure approval gates on Modules where you want human oversight before apply.

Start conservative. Give the agent Reader everywhere and Contributor on a test Stack. Watch what it does, review the audit log, and expand access as trust develops. Granting more later is easy; revoking is harder to do gracefully once workflows depend on the access.

Tips

Start with diagnostics. An agent with read-only access that summarises failed plans, detects drift, and reports on deployment status is immediately useful with zero risk.
Use approval gates as training wheels. Even if you trust the agent to approve, require at least one human approval initially. Remove the requirement later once you've validated the agent's judgement.
Scope credentials on the Runner too. Snap CD permissions control what the agent can do within Snap CD. Runner-level cloud credentials control what Terraform can do in your cloud. Both layers matter.
Treat agent permissions like code. Define role assignments in Terraform via the Snap CD provider. Review changes in PRs. Don't hand-configure permissions through the UI for automated principals.
Log everything, review regularly. The audit trail is there — use it. A monthly review of what the agent approved and deployed builds confidence (or reveals problems) faster than assumptions.