Why Snap CD: Self-Hosted Terraform Runners with Credential Isolation
Most infrastructure teams run Terraform from a CI pipeline. That pipeline has credentials — cloud provider keys, state backend tokens, maybe a vault token to fetch more secrets. Early on, one pipeline with one set of credentials works fine. But as the infrastructure grows and more environments come online, the shared-runner model starts creating problems that are hard to fix without rethinking the architecture.
The shared-runner problem
When a single CI runner (or pool of identical runners) handles all Terraform work, several things go wrong at the same time.
Credential sprawl
Your CI runner needs to deploy networking in production, spin up a dev Kubernetes cluster, manage DNS records, and provision a staging database. That means it holds credentials for all of those things — often across multiple cloud providers and accounts.
Every credential on the runner is accessible to every job that runs on it. A misconfigured pipeline step for the dev environment can reach production AWS keys. The blast radius of a compromised runner is everything it has access to.
Blast radius
The damage from a bad Terraform run should be limited to the infrastructure that run manages. But when the runner has broad access, a bug in one pipeline (or a malicious commit) can reach resources it was never intended to touch. The runner doesn't know that a dev pipeline shouldn't be able to destroy production resources. It just runs whatever Terraform tells it to, with whatever credentials it has.
Compliance and auditability
Auditors want to know who (or what) can access production, and they want that list to be short and verifiable. "Our CI runner can access everything" is not a satisfying answer. Showing that only a specific, dedicated runner with a specific identity can reach production — and that it can only be invoked by specific modules with specific approval gates — is a much stronger story.
Team boundaries
Different teams own different parts of the infrastructure. The networking team shouldn't need to care about the application team's deployment pipeline, and vice versa. But when they share a runner, they share the pipeline configuration, the credential setup, the job queue, and the failure modes.
How teams typically cope
These are real patterns that work, up to a point.
Separate CI projects
Create one CI project per environment or per team. The prod project has prod credentials; the dev project has dev credentials. This solves credential scoping but multiplies the number of CI configurations you maintain. Pipeline logic gets duplicated or abstracted into shared templates that become their own maintenance burden.
Vault-based credential injection
Use HashiCorp Vault (or a cloud-native equivalent) to issue short-lived credentials at job time. The runner itself has minimal standing access — it authenticates to Vault, gets scoped credentials, and uses them for one job.
This is architecturally sound but adds operational complexity: you need a Vault cluster (or managed service), policies for every credential path, rotation logic, and monitoring for lease expiry. The runner still executes all jobs — you've scoped the credentials, but the execution environment is shared.
Environment-specific pipelines
Separate pipeline definitions for each environment, each with their own credential configuration. Similar to separate CI projects but within a single CI system. You get some isolation but the runner infrastructure is still shared, and the pipeline definitions tend to diverge over time.
Self-hosted runner groups
CI systems like GitHub Actions and GitLab CI support runner groups or tags. You deploy dedicated runner machines for production and different ones for development, then use labels to route jobs to the right group.
This works well for compute isolation but you're now managing runner infrastructure yourself — provisioning machines, keeping them patched, scaling them, and managing the credential distribution to each group. The CI system orchestrates which job goes where, but the operational burden is on you.
Snap CD's approach: separate orchestration from execution
Snap CD organizes infrastructure in a three-level hierarchy: Stacks, Namespaces, and Modules. Stacks typically represent environments (production, staging), Namespaces group by team or infrastructure layer (networking, data-platform), and Modules are individual Terraform roots. Permissions, secrets, and Runner access can be scoped at any level and inherited downward. For a full walkthrough of this hierarchy and the input system, see Modular Deployments, and for the permission system that supports it, see A Permission System Built for Infrastructure.
Snap CD was designed around the idea that the system coordinating deployments should not be the same system executing them.
Two distinct roles
The Snap CD Server (hosted at snapcd.io, or self-hosted) handles:
- Module definitions — what to deploy, from which source, with which inputs
- Dependency tracking — which modules depend on which outputs
- Change detection — watching Git repos and upstream outputs for changes
- Plan review and approval gates
- Logging and audit trails
The server never touches your cloud provider. It never holds your AWS keys or Azure credentials. It doesn't run terraform plan or terraform apply.
Runners handle execution. A Runner is a lightweight, self-hosted worker that you deploy wherever makes sense: a Kubernetes pod in your cluster, a VM in your cloud account, or a container on a developer machine. The Runner:
- Connects to the Snap CD server via a long-lived, authenticated, bi-directional WebSocket connection.
- Picks up jobs assigned to it (plan, apply).
- Downloads the module source code.
- Executes standard Terraform/OpenTofu commands in a local shell session.
- Reports results (plan output, apply output, state changes) back to the server.
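The job cycle in the list above can be sketched roughly as follows. This is an illustrative Python outline, not Snap CD's actual implementation: the `Job` fields, `commands_for`, `handle_job`, and the injected `fetch`, `run`, and `report` callables are all hypothetical names, and the Terraform flags a real Runner passes may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Job:
    """Hypothetical job description sent by the server (it carries no credentials)."""
    module: str
    source_url: str
    action: str  # "plan" or "apply"


def commands_for(action: str) -> List[List[str]]:
    """Map a job action to standard Terraform CLI invocations (assumed flags)."""
    init = ["terraform", "init", "-input=false"]
    if action == "plan":
        return [init, ["terraform", "plan", "-input=false"]]
    if action == "apply":
        return [init, ["terraform", "apply", "-input=false", "-auto-approve"]]
    raise ValueError(f"unsupported action: {action}")


def handle_job(
    job: Job,
    workdir: str,
    fetch: Callable[[str, str], None],
    run: Callable[[Sequence[str], str], str],
    report: Callable[[str, str], None],
) -> None:
    """Download the module source, run Terraform, and report each command's output."""
    fetch(job.source_url, workdir)        # e.g. a git clone into workdir
    for command in commands_for(job.action):
        output = run(command, workdir)    # e.g. subprocess.run(command, cwd=workdir)
        report(job.module, output)        # sent back over the server connection
```

Passing `fetch`, `run`, and `report` in as callables keeps the sketch runnable without Terraform installed. In a real Runner, `run` would inherit the Runner's own environment, which is where the cloud credentials live.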
The Runner only has the credentials you give it. A Runner deployed into your production Azure subscription with a managed identity has access to production Azure — and nothing else. A Runner on a dev machine with dev AWS keys can only reach dev AWS.
No credential forwarding
The Snap CD server never sees, stores, or forwards cloud credentials. Credentials live on the Runner, configured the same way you'd configure them for any local Terraform run — environment variables, cloud provider metadata services, credential files. The server tells the Runner what to do; the Runner uses its own credentials to do it.
This means a compromise of the Snap CD server does not expose your cloud credentials. The server knows your module definitions and deployment history, but it cannot execute infrastructure changes on its own.
Permission-controlled runner access
Snap CD's permission system extends to Runners. Supply resources let you control which Modules are allowed to use which Runners.
The HCL examples below use the Snap CD Terraform provider, the canonical way to configure Snap CD with Terraform. For more on the provider, see An Extensive Supporting Toolset.
First, define the Runners:
resource "snapcd_runner" "prod_azure" {
  name            = "prod-azure"
  organization_id = snapcd_organization.main.id
}

resource "snapcd_runner" "dev_azure" {
  name            = "dev-azure"
  organization_id = snapcd_organization.main.id
}
Then use supply resources to declare which Runners are available to which scopes. A Runner "supplies" itself to a Stack, Namespace, or individual Module. The most common pattern is supplying a Runner to a Stack — since Stacks typically represent environments (production, staging, dev), this gives you per-environment credential isolation:
resource "snapcd_runner_stack_supply" "prod" {
  runner_id = snapcd_runner.prod_azure.id
  stack_id  = snapcd_stack.production.id
}

resource "snapcd_runner_stack_supply" "dev" {
  runner_id = snapcd_runner.dev_azure.id
  stack_id  = snapcd_stack.dev.id
}
Every Module in the production Stack — across all its Namespaces (networking, compute, data, etc.) — can only execute on prod-azure. Every Module in dev can only execute on dev-azure.
Supply resources also work at the Namespace and Module level, so you can drill down when a team or a specific piece of infrastructure needs its own isolated Runner.
Namespace-level supply. Useful when a team manages critical resources that require dedicated credentials — for example, a data-platform Namespace where only a Runner with access to production databases should execute:
resource "snapcd_runner" "prod_data" {
  name            = "prod-data-platform"
  organization_id = snapcd_organization.main.id
}

resource "snapcd_runner_namespace_supply" "data_platform" {
  runner_id    = snapcd_runner.prod_data.id
  namespace_id = snapcd_namespace.data_platform.id
}
Modules in the data_platform Namespace execute on prod-data-platform instead of the Stack-level Runner. Other Namespaces in the same Stack continue using the Stack-level Runner.
Module-level supply. Rarely needed, but available for cases where a single Module requires its own isolated Runner, for example a Module that manages a key vault or certificate authority with uniquely sensitive credentials. The snippet assumes a snapcd_runner.prod_keyvault resource defined in the same way as the Runners above:
resource "snapcd_runner_module_supply" "key_vault" {
  runner_id = snapcd_runner.prod_keyvault.id
  module_id = snapcd_module.prod_key_vault.id
}
The boundary is enforced by the server before a job is dispatched: a Module without a matching supply will not execute, regardless of what credentials are available elsewhere.
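To make the scoping rules concrete, here is a minimal Python sketch of how a server could pick a Runner for a Module and refuse to dispatch when no supply matches. It is not Snap CD's code: the `Supplies` structure, `resolve_runner`, and `dispatch` are hypothetical, and the sketch assumes at most one Runner per scope, with the most specific supply (Module, then Namespace, then Stack) taking precedence.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Supplies:
    """Hypothetical in-memory view of supply resources: scope name -> Runner name."""
    by_stack: Dict[str, str] = field(default_factory=dict)
    by_namespace: Dict[str, str] = field(default_factory=dict)
    by_module: Dict[str, str] = field(default_factory=dict)


def resolve_runner(
    supplies: Supplies, stack: str, namespace: str, module: str
) -> Optional[str]:
    """Return the Runner for a Module, preferring the most specific supply."""
    lookups = (
        (supplies.by_module, module),
        (supplies.by_namespace, namespace),
        (supplies.by_stack, stack),
    )
    for table, key in lookups:
        if key in table:
            return table[key]
    return None


def dispatch(supplies: Supplies, stack: str, namespace: str, module: str) -> str:
    """Refuse to dispatch a job when no supply covers the Module."""
    runner = resolve_runner(supplies, stack, namespace, module)
    if runner is None:
        raise PermissionError(f"no Runner supplied for {stack}/{namespace}/{module}")
    return runner


supplies = Supplies(
    by_stack={"production": "prod-azure"},
    by_namespace={"data_platform": "prod-data-platform"},
)
```

With these supplies, a Module in the `networking` Namespace of `production` resolves to `prod-azure`, a Module in `data_platform` resolves to `prod-data-platform`, and a Module in a `dev` Stack with no supply is rejected before any job is sent.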
Deployment patterns
One Runner per environment
The most common pattern. Deploy a Runner into each environment (dev, staging, production), each with credentials scoped to that environment.
snapcd.io
│
├── Runner-dev (dev AWS credentials)
├── Runner-staging (staging AWS credentials)
└── Runner-prod (prod AWS credentials, approval gates required)
Clean credential boundaries. A compromised dev Runner cannot reach production. Simple to reason about.
One Runner per cloud provider
When your infrastructure spans multiple clouds, deploy Runners with provider-specific credentials:
snapcd.io
│
├── Runner-azure (Azure managed identity)
├── Runner-aws (AWS IAM role)
└── Runner-gcp (GCP service account)
Useful when environment boundaries are less important than provider boundaries — for example, if your Azure and AWS infrastructure are managed by different teams with different credential policies.
Combined: environment × provider
For larger organizations, combine both dimensions:
snapcd.io
│
├── Runner-azure-dev
├── Runner-azure-prod
├── Runner-aws-dev
└── Runner-aws-prod
Each Runner has exactly the credentials it needs — nothing more.
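The Runner count in this pattern is the product of the two dimensions, which is worth keeping in mind before adding a third provider or environment. A small Python sketch, where the naming scheme simply mirrors the diagram and `runner_names` is a hypothetical helper:

```python
from itertools import product
from typing import List, Sequence


def runner_names(providers: Sequence[str], environments: Sequence[str]) -> List[str]:
    """Enumerate one Runner name per (provider, environment) pair."""
    return [f"Runner-{p}-{e}" for p, e in product(providers, environments)]


names = runner_names(["azure", "aws"], ["dev", "prod"])
# names == ["Runner-azure-dev", "Runner-azure-prod", "Runner-aws-dev", "Runner-aws-prod"]
```

Adding GCP and a staging environment to the same call would yield nine Runners instead of four.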
Shared Runner with scoped permissions
For smaller teams that don't need strict environment isolation, a single Runner with broad credentials can work. Use Snap CD's permission system to control which users and service principals can trigger deployments on that Runner, and rely on approval gates for production changes.
This trades some isolation for operational simplicity. It's a reasonable starting point that you can tighten as the team and infrastructure grow.
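As a rough illustration of what the permission system and approval gates decide in this pattern, the check could look like the Python sketch below. The `DeployRequest` type, the `allowed` mapping, and `may_deploy` are hypothetical and only model the idea; Snap CD's real permission model is described in the permission article referenced earlier.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class DeployRequest:
    principal: str          # user or service principal asking to deploy
    stack: str              # target Stack, e.g. "production"
    approved: bool = False  # whether an approval gate has been passed


def may_deploy(
    request: DeployRequest,
    allowed: Dict[str, FrozenSet[str]],
    gated_stacks: FrozenSet[str],
) -> bool:
    """Allow a deployment only for permitted principals, with approval where gated."""
    if request.principal not in allowed.get(request.stack, frozenset()):
        return False
    if request.stack in gated_stacks and not request.approved:
        return False
    return True
```

Because the Runner itself holds broad credentials here, these two checks are the only thing standing between a dev change and production, which is why the pattern suits small teams better than large ones.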
Fully source-available
Snap CD is source-available — the entire codebase, including the Server, Runner, and Terraform provider, is maintained in a single monorepo. You can inspect every component, understand exactly what runs in your environment, and modify it if you need to.
The Runner is designed to be stateless between jobs. It downloads Module source code, runs Terraform, reports results, and cleans up. No job data persists on the Runner after completion, which simplifies security reviews and makes Runners easy to replace or scale.
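One common way to get this kind of statelessness is to give every job a throwaway working directory. The Python sketch below shows the idea using the standard library; it is an assumption about how such cleanup could be done, not a description of the Runner's internals, and `run_in_ephemeral_workspace` is a hypothetical name.

```python
import os
import tempfile
from typing import Callable


def run_in_ephemeral_workspace(job_fn: Callable[[str], str]) -> str:
    """Run job_fn inside a temporary directory that is deleted afterwards.

    The directory is removed when the with-block exits, including when
    job_fn raises, so no source code or plan files outlive the job.
    """
    with tempfile.TemporaryDirectory(prefix="runner-job-") as workdir:
        return job_fn(workdir)


def example_job(workdir: str) -> str:
    # Stand-in for "download the Module source and run Terraform".
    with open(os.path.join(workdir, "main.tf"), "w") as handle:
        handle.write("# module source would be downloaded here\n")
    return "done"


result = run_in_ephemeral_workspace(example_job)
```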
Compared to the alternatives
| Concern | Separate CI projects | Vault + shared runner | Self-hosted runner groups | Snap CD Runners |
|---|---|---|---|---|
| Credential scoping | Per-project | Per-job (dynamic) | Per-group | Per-Runner |
| Execution isolation | Separate machines | Shared machine | Separate machines | Separate machines |
| Orchestration burden | Duplicated pipelines | Single pipeline + Vault | Labels/tags in CI | Managed by Snap CD Server |
| Dependency awareness | None (manual ordering) | None (manual ordering) | None (manual ordering) | Built-in (Module dependencies) |
| Audit trail | CI logs per project | CI logs + Vault audit | CI logs per group | Centralized in Snap CD |
The key difference is that Snap CD Runners are not general-purpose CI machines repurposed for Terraform. They're purpose-built for infrastructure deployment, integrated with a system that understands Module dependencies, approval gates, and deployment ordering.
Tips
- Start with one Runner and split later. You don't need per-environment Runners on day one. Start with a single Runner and add more as your isolation requirements become clear.
- Use managed identities where possible. A Runner in Azure with a managed identity, or in AWS with an IAM role attached to its instance profile, avoids storing long-lived credentials entirely.
- Keep Runners stateless. Don't store Terraform state on the Runner. Use remote backends (which Snap CD manages) so Runners can be replaced without losing state.
- Monitor Runner connectivity. The WebSocket connection is long-lived but not immortal. The Runner reconnects automatically, but monitoring connection status helps you catch network issues before they block deployments.
- Scope permissions early. It's easier to set up Runner permissions correctly from the start than to tighten them later when teams are already used to a permissive setup.
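For the connectivity tip, a simple staleness check is often enough to start with. The Python sketch below assumes you already record a last-seen timestamp per Runner somewhere (logs, metrics, or a status endpoint); the `last_seen` mapping, the five-minute threshold, and `stale_runners` are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_runners(
    last_seen: Dict[str, datetime],
    now: datetime,
    max_silence: timedelta = timedelta(minutes=5),
) -> List[str]:
    """Return the names of Runners that have been silent longer than max_silence."""
    return sorted(
        name for name, seen in last_seen.items() if now - seen > max_silence
    )


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "prod-azure": now - timedelta(minutes=1),
    "dev-azure": now - timedelta(minutes=30),
}
silent = stale_runners(last_seen, now)  # only "dev-azure" exceeds the threshold
```

Alerting on a non-empty result for production Runners catches network problems before someone is waiting on a blocked deployment.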
See also
- Modular Deployments — how Runner isolation fits into the broader Module and dependency system
- Managing Secrets in Terraform — how secrets are scoped and injected per Module
- A Permission System Built for Infrastructure — granular RBAC for Stacks, Namespaces, Modules, and Runners
- The Problem with Large Terraform States — why splitting states matters, and how Runner isolation supports it