POV: Designing Infrastructure Deployments with GitHub Actions
It started simple — I had some Terraform code to set up my Google Cloud project. A few buckets, a Cloud SQL instance, maybe a Kubernetes cluster.
Running it locally was easy at first: just `terraform plan` and `terraform apply`. Until it wasn't.
It became harder when I needed to test before production. I had to manage multiple environments (dev, staging, prod). Sometimes I'd forget to switch credentials and almost deploy to the wrong project. Teammates started doing the same, and everyone's setup was slightly different. We were spending more time preparing the environment than deploying infrastructure.
And to make matters worse, I had to run `terraform plan` manually every so often, just to check whether someone had changed something in the console and the state had drifted.
That's when I realized — it's not just about running Terraform. It's about making the process reproducible and safe.
At some point I got tired of re-configuring my local environment.
Every time I wanted to deploy, I had to check:
- Terraform version
- Google Cloud SDK version
- Proper authentication
That's when the idea came — what if the deployment environment itself was versioned and shared?
So I built a Docker image that could do everything: run Terraform and `gcloud` commands, because not everything in Google Cloud is available in Terraform. For example, enabling APIs or configuring Pub/Sub push endpoints often still needs `gcloud`.
The image became a reliable, reproducible environment that anyone on the team could use. You could clone the repo, run a single command, and deploy without worrying about installing or configuring anything manually.
At first, it was built just for local use — our "deployment tool in a box." But then I realized: if it works locally, why not use it in CI?
That's when it evolved into something more — a universal deploy environment, shared between local and automated runs.
Example of a deployment command:
docker run --rm \
  -v $(pwd)/cd:/workspace/cd \
  -v ~/.secrets/gcp-service-account.json:/workspace/creds/gcp-service-account.json \
  -e GCP_CREDS_FILE="/workspace/creds/gcp-service-account.json" \
  -e ACTION="plan" \
  usabilitydynamics/udx-worker-tooling:latest
It also needed to handle authentication cleanly.
At first, we all used service account keys; it's the simplest way to get Terraform or `gcloud` talking to Google Cloud. You download a JSON key, mount it into the container, and the Docker image picks it up through an environment variable. It works, but it's not great for the long term.
The problem is, those keys don't expire. They sit on laptops, in environment variables, sometimes even in repos. And after a few months, nobody remembers which key belongs to what. Keys don't have descriptions, so it's hard to tell if one is still in use or safe to delete.
So I decided to switch to short-lived tokens for GitHub Actions deployments. They behave just like regular credentials, except they expire automatically.
In CI, GitHub Actions can request one dynamically using Workload Identity. That identity is linked to the same service account we use locally — so permissions stay consistent across both workflows.
Example of an authorization step with Workload Identity:
- name: Authorize Google
  uses: google-github-actions/auth@v3
  with:
    workload_identity_provider: ${{ env.DEPLOY_AUTH_PROVIDER }}
    service_account: ${{ env.DEPLOY_CLOUD_ACCOUNT }}
Output
Run google-github-actions/auth@v3
Created credentials file at "/home/runner/work/aws-cache-invalidation/aws-cache-invalidation/gha-creds-93969fc5c6d1c48e.json"
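One detail worth noting: the job has to be allowed to request an OIDC token, or the auth step fails. That means declaring it in the workflow:

```yaml
permissions:
  id-token: write   # lets the job request a GitHub OIDC token for Workload Identity
  contents: read    # keep normal read access for checkout
```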
The best part: once you move CI to Workload Identity, you can safely delete all static keys from the service account. Even if your security team rotates or removes every key, CI deployments keep running — because they don't rely on keys at all.
Locally, we still use service account keys when needed, but they're temporary and can be recreated anytime. During security reviews, we can confidently remove all keys, knowing that production deployments stay safe and keyless.
That's what I wanted from the start: same service account, same permissions, no leftover secrets.
Once the container and authentication were stable, I started designing the workflow itself.
Most configuration now lives inside GitHub environments, using a combination of shared variables and secrets.
- Shared values (like regions, image names) come from repository-level vars/secrets.
- Environment-specific ones (like project IDs or service accounts) are defined per GitHub environment.
- Some values are dynamic — using prefixes or suffixes based on the branch or environment name.
It's clean, auditable, and flexible. You can see exactly what each environment runs with, without digging through YAML.
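As a rough sketch of how that split looks in a job (the variable names here are illustrative, not the exact ones we use):

```yaml
jobs:
  terraform:
    runs-on: ubuntu-latest
    environment: staging                                       # pulls in this environment's vars and secrets
    env:
      DEPLOY_REGION: ${{ vars.DEPLOY_REGION }}                 # shared, repository-level variable
      DEPLOY_CLOUD_PROJECT: ${{ vars.DEPLOY_CLOUD_PROJECT }}   # defined per GitHub environment
      DEPLOY_CLOUD_ACCOUNT: ${{ vars.DEPLOY_CLOUD_ACCOUNT }}   # defined per GitHub environment
      STATE_PREFIX: infra-${{ github.ref_name }}               # dynamic, derived from the branch name
```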
I didn't want GitHub Actions to just "run Terraform." I wanted it to act like a smart operator — a system that knows when to plan, when to apply, and when to stop.
So I started with a config job. It doesn't deploy anything — it just understands the situation.
It checks:
- Which branch triggered the run
- Which environment that branch maps to
- Whether this is a PR (plan) or a push to main (apply)
- What credentials and variables are available
- If any configuration files are missing
If something is missing, it tries to generate defaults — and if that's not possible, it fails early with a clear explanation. No more guessing why Terraform broke halfway through.
Configuration Summary Output:
📋 Infrastructure Deployment Configuration Summary
──────────────────────────────────────────────
🏷️ Version: 0.9.12 (source: GitVersion)
📦 Image: udx-worker-tooling:latest
🗂️ Terraform Path: ./cd/terraform
🔐 Auth Mode: Workload Identity (short-lived token)
🌿 Branch: main
🌍 Target Environment: production
🚀 Trigger: push
Execution Plan
──────────────────────────────────────────────
🧩 Config Job: ✅ Completed
📄 Terraform Plan: ✅ Will run
⚙️ Terraform Apply: ✅ Will run (auto-approved)
📬 Slack Notifications: ✅ Enabled
🔑 Service Account Key: ⛔ Not used (Workload Identity)
☁️ Cloud APIs Check: ✅ Enabled
Summary
──────────────────────────────────────────────
🔒 Security Upload: ✅ Enabled
🧠 Environment Detected: production (matched)
👤 Triggered by: github-actions[bot]
──────────────────────────────────────────────
The config job exports outputs that define what happens next. That's the key — later jobs don't have to decide anything; they just follow the plan.
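A trimmed sketch of that config job (the detection logic is simplified and the output names are illustrative):

```yaml
jobs:
  config:
    runs-on: ubuntu-latest
    outputs:
      environment: ${{ steps.detect.outputs.environment }}
      action: ${{ steps.detect.outputs.action }}
    steps:
      - uses: actions/checkout@v4
      - id: detect
        run: |
          # Decide plan vs apply from the trigger, and map the branch to an environment
          if [ "$GITHUB_EVENT_NAME" = "pull_request" ]; then
            echo "action=plan" >> "$GITHUB_OUTPUT"
          else
            echo "action=apply" >> "$GITHUB_OUTPUT"
          fi
          case "$GITHUB_REF_NAME" in
            main) echo "environment=production" >> "$GITHUB_OUTPUT" ;;
            *)    echo "environment=staging"    >> "$GITHUB_OUTPUT" ;;
          esac
```

Downstream jobs read `needs.config.outputs.*` and simply do what they're told.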
Once config finishes, the workflow moves to the terraform job. That's where the deploy image does the heavy lifting.
This job:
- Authenticates to Google Cloud
- Verifies the backend bucket for Terraform state
- Runs both Terraform and `gcloud` where needed
- Executes `terraform init`, `plan`, and, if allowed, `apply`
Everything runs inside the same Docker image. If it works locally, it'll work in CI exactly the same way.
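In workflow terms, the job is mostly a thin wrapper around that image. A simplified sketch, assuming `DEPLOY_AUTH_PROVIDER` and `DEPLOY_CLOUD_ACCOUNT` are set as workflow-level variables (paths are illustrative):

```yaml
  terraform:
    needs: config
    runs-on: ubuntu-latest
    environment: ${{ needs.config.outputs.environment }}
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4

      - name: Authorize Google
        uses: google-github-actions/auth@v3
        with:
          workload_identity_provider: ${{ env.DEPLOY_AUTH_PROVIDER }}
          service_account: ${{ env.DEPLOY_CLOUD_ACCOUNT }}

      - name: Run Terraform in the deploy image
        run: |
          # Same image as local runs; the credentials file created by the auth step
          # sits inside the workspace, so mounting the workspace is enough.
          docker run --rm \
            -v "$PWD:/workspace" \
            -e GCP_CREDS_FILE="/workspace/$(basename "$GOOGLE_APPLICATION_CREDENTIALS")" \
            -e ACTION="${{ needs.config.outputs.action }}" \
            usabilitydynamics/udx-worker-tooling:latest
```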
Example of terraform job output:
──────────────────────────────────────────────
▶ Preparing environment...
📋 Plan-only mode enabled
✅ Terraform initialized successfully
🌍 Project: client-udx
📦 Environment: staging
📁 Config loaded: /workspace/cd/configs/worker.yaml
🔑 Authenticated via service account key
🏗️ Running terraform plan...
──────────────────────────────────────────────
Plan: 0 to add, 2 to change, 0 to destroy
✔️ Plan complete, no errors detected
──────────────────────────────────────────────
Logging and communication became the next focus. Terraform and `gcloud` outputs can be noisy, so I added structured checkpoints and summaries.
Example: Log Output (Plan-Only)
▶ Determine Environment
🌿 Branch: feature/update-storage → main
✅ Environment detected: staging
⚙️ Mode: plan-only
▶ Authenticate to Google Cloud
✅ Authenticated via Workload Identity
🌍 Project: client-udx
▶ Terraform Plan
✅ Initialized successfully
🔧 Plan complete — Add: 0 | Change: 2 | Destroy: 0
▶ Notify
✅ Slack message sent (plan success - staging)
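The section headers in logs like the one above can be produced with GitHub's log-grouping commands around each phase. A minimal sketch (in the real workflow the command runs inside the deploy image, but the grouping works the same way):

```yaml
      - name: Terraform Plan
        run: |
          echo "::group::▶ Terraform Plan"
          terraform -chdir=./cd/terraform plan -input=false
          echo "::endgroup::"
```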
Readable logs aren't decoration — they build trust. When something fails, I want to know what step failed, why, and what it means — not just "Exit 1."
When the deployment finishes, the workflow communicates.
Inside GitHub, I use annotations — short summaries like:
- Environment detected
- Plan-only mode
- Configuration merged
- Plan results
Example: Workflow Annotations
ℹ️ infrastructure / terraform
Plan-only mode enabled — infrastructure will NOT be modified
ℹ️ infrastructure / terraform
Environment-specific config file found for environment: production
ℹ️ infrastructure / terraform
Successfully merged 3 files into `.tmp/merged-production-infra.yaml`
ℹ️ infrastructure / terraform
Found environment-specific files for 'production'
ℹ️ infrastructure / terraform
Environment files:
./infra/configs/production/gcp-storage.yaml
./infra/configs/production/gcp-monitoring.yaml
./infra/configs/production/sql-instance.yaml
ℹ️ infrastructure / terraform
Using 3 files from environment directory `./infra/configs/production` for 'production'
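These annotations are emitted with GitHub's workflow commands; any step can produce one with a single `echo`. For example:

```yaml
      - name: Summarize plan mode
        run: |
          # Each line shows up as an annotation on the workflow summary page
          echo "::notice::Plan-only mode enabled - infrastructure will NOT be modified"
          echo "::notice::Environment-specific config file found for environment: production"
```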
Outside GitHub, the workflow sends Slack notifications — and I standardized them to make every message clear and predictable.
Success Example:
✅ Infrastructure Deployment Succeeded
client-udx
Environment: production
Changes: Add: 0 | Change: 2 | Destroy: 0
Status: Success
View Workflow Logs
Failure Example:
❌ Infrastructure Deployment Failed
client-udx
Environment: production
Changes: Add: 0 | Change: 2 | Destroy: 0
Status: Failed
Reason: Missing configuration files or invalid variables
View Workflow Logs
Same structure, same context — only the outcome changes. That consistency makes notifications useful, not noisy.
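Under the hood, the notification step can be as simple as posting a templated payload to a Slack incoming webhook. A sketch, with an illustrative secret name and a much simpler message than the real one:

```yaml
      - name: Notify Slack
        if: always()   # report success and failure alike
        run: |
          curl -sf -X POST "${{ secrets.SLACK_WEBHOOK_URL }}" \
            -H 'Content-Type: application/json' \
            -d "{\"text\": \"Infrastructure Deployment ${{ job.status }} | Environment: ${{ needs.config.outputs.environment }}\"}"
```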
Over time, I added more structure and safety.
For example, control gates — rules that decide when jobs can run. Production deployments can't be triggered manually; they only run when a PR is reviewed, approved, and merged.
That way, I can still plan from a feature branch, but only the reviewed code can apply changes to production. It's not just automation — it's governance built in.
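Mechanically, that gate is just an `if:` condition on the apply job plus a protected GitHub environment; the required-reviewer and branch rules live on the environment itself, not in YAML. A sketch:

```yaml
  apply:
    needs: config
    # Only a push to main (a reviewed, merged PR) reaches this job
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production   # protection rules (required reviewers, branch filters) are configured here
    steps:
      - run: echo "apply allowed for ${{ github.sha }}"
```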
Another improvement was reusability. When you write enough workflows, you start repeating the same patterns.
So I began building composite actions — step templates that bundle complex logic into a single, reusable block. For example, one handles Slack notifications, another prepares Terraform environments, and another manages Google Cloud authentication.
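A composite action is just an `action.yml` that bundles steps behind a small interface. A minimal sketch of the authentication one (inputs and file path are illustrative):

```yaml
# .github/actions/gcp-auth/action.yml
name: Authorize Google
description: Authenticate to Google Cloud via Workload Identity
inputs:
  workload_identity_provider:
    description: Workload Identity provider resource name
    required: true
  service_account:
    description: Service account email to impersonate
    required: true
  project_id:
    description: Default project for gcloud commands
    required: true
runs:
  using: composite
  steps:
    - uses: google-github-actions/auth@v3
      with:
        workload_identity_provider: ${{ inputs.workload_identity_provider }}
        service_account: ${{ inputs.service_account }}
    # gcloud is preinstalled on GitHub-hosted runners
    - shell: bash
      run: gcloud config set project "${{ inputs.project_id }}"
```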
Then I moved to workflow templates: complete workflows that define the standard pipeline structure. Every Terraform repo can inherit the same pattern: config → plan → apply → notify. You just drop one YAML file in a repo and it works.
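With reusable workflows, "dropping one YAML file in a repo" looks roughly like this; the central repo name and the input are illustrative:

```yaml
# .github/workflows/infrastructure.yml in each Terraform repo
name: Infrastructure
on:
  pull_request:
  push:
    branches: [main]

jobs:
  pipeline:
    # Hypothetical central repo holding the shared config → plan → apply → notify pipeline
    uses: usabilitydynamics/github-workflows/.github/workflows/terraform.yml@main
    with:
      terraform_path: ./cd/terraform
    secrets: inherit
```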
The key idea was simplicity — making complex delivery logic easy to reuse, understand, and maintain. Hard stuff, wrapped in a light interface.
Eventually, the workflow became more than automation — it became orchestration.
When I push code now, I see a complete story:
- Config detects the environment and validates inputs.
- Terraform and `gcloud` run in a clean, shared environment.
- Control gates decide what's allowed.
- Notifications summarize the result.
It's not just code execution — it's controlled delivery. No manual setup, no hidden state, no confusion.
And it all started because I got tired of switching credentials and Terraform versions locally. That small decision — to package the deploy logic into a single reproducible container — ended up defining how my team delivers infrastructure today.
I don't just automate Terraform anymore. I design workflows that make infrastructure delivery predictable, explainable, and reusable.