The prettiest workflows are useless if they don’t work in production. I’ve read countless GitHub Actions tutorials that show you a pristine 15-line YAML file and call it a day. That’s not how real workflows look. Real workflows have weird conditionals, retry logic, and comments that say things like “DO NOT REMOVE - fixes race condition on Windows.”
This post is different. I’m going to walk through complete, production-grade workflows from real projects—the good, the bad, and the lessons learned along the way. These are the kind of workflows that have survived contact with actual users, flaky networks, and that one coworker who keeps force-pushing to main.
If you’re new to the series, check out the intro post first. But if you want to see how Actions work in the trenches, you’re in the right place.
Case Study 1: Open Source Library Release Pipeline
Let’s start with something many developers need: releasing a library to a package registry. This example is for a JavaScript library published to npm, but the patterns apply to PyPI, RubyGems, Maven—wherever you’re publishing.
The Problem
We had a popular open-source library with a manual release process that went something like this:
- Update version in `package.json`
- Update CHANGELOG.md
- Run tests locally
- Build the dist
- Run `npm publish`
- Create a GitHub Release
- Write release notes
- Announce on social media
Steps got skipped. Versions got mismatched. Once we published a release with failing tests because someone ran `npm publish` from a dirty working directory. Not our finest moment.
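For what it's worth, npm itself has a guard that would have caught that dirty-publish incident even before any CI existed: the `prepublishOnly` lifecycle script, which runs automatically before `npm publish`. A minimal sketch, reusing the project's own `test` and `build` scripts:

```json
{
  "scripts": {
    "prepublishOnly": "npm test && npm run build"
  }
}
```

With that in `package.json`, even a bare `npm publish` from someone's laptop refuses to ship if the tests fail.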
The Evolution
Version 1: The Naive Approach
Our first workflow was embarrassingly simple:
```yaml
# .github/workflows/release.yml - Version 1
# Spoiler: this didn't last long
name: Release

on:
  push:
    tags:
      - 'v*'

jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          registry-url: 'https://registry.npmjs.org'
      - run: npm ci
      - run: npm test
      - run: npm run build
      - run: npm publish
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
```
It worked! Until it didn’t.
What went wrong:
- No matrix testing—we broke Node 18 support without realizing it
- No changelog generation—we were still writing release notes by hand
- No GitHub Release—just a tag floating in the void
- Published even when tests passed on Ubuntu but failed on Windows
Version 2: Adding Matrix Testing
```yaml
# .github/workflows/release.yml - Version 2
# Now with matrix builds!
name: Release

on:
  push:
    tags:
      - 'v*'

jobs:
  # Test across Node versions and OSes before releasing
  test:
    strategy:
      matrix:
        node: [18, 20, 22]
        os: [ubuntu-latest, macos-latest, windows-latest]
      fail-fast: false  # Run all tests even if one fails
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm ci
      - run: npm test

  release:
    needs: test  # Only release if ALL tests pass
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          registry-url: 'https://registry.npmjs.org'
      - run: npm ci
      - run: npm run build
      - run: npm publish
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
      # Now we create a GitHub Release too
      - name: Create GitHub Release
        uses: softprops/action-gh-release@v2
        with:
          generate_release_notes: true
```
This was better. The `needs: test` line means the release job waits for all matrix tests to pass. We caught Windows-specific bugs before they hit npm.
But we still had problems:

- `generate_release_notes: true` produces… okay release notes. Not great.
- We still had to manually manage the changelog
- Version in package.json didn't always match the tag
- We once pushed a tag, realized we forgot something, deleted the tag, fixed the issue, and pushed the tag again. The workflow ran twice.
Version 3: The Production Version
Here’s what we actually run today:
```yaml
# .github/workflows/release.yml - Version 3
# The battle-hardened version
name: Release

on:
  push:
    tags:
      - 'v*'

# Prevent duplicate releases from re-pushed tags
concurrency:
  group: release-${{ github.ref }}
  cancel-in-progress: true

jobs:
  # Verify the tag matches package.json version
  validate:
    runs-on: ubuntu-latest
    outputs:
      version: ${{ steps.check.outputs.version }}
    steps:
      - uses: actions/checkout@v4
      - name: Verify version match
        id: check
        run: |
          TAG_VERSION="${GITHUB_REF#refs/tags/v}"
          PKG_VERSION=$(node -p "require('./package.json').version")
          if [ "$TAG_VERSION" != "$PKG_VERSION" ]; then
            echo "::error::Tag version ($TAG_VERSION) doesn't match package.json ($PKG_VERSION)"
            exit 1
          fi
          echo "version=$PKG_VERSION" >> "$GITHUB_OUTPUT"
          echo "Releasing version $PKG_VERSION"

  test:
    needs: validate
    strategy:
      matrix:
        node: [18, 20, 22]
        os: [ubuntu-latest, macos-latest, windows-latest]
      fail-fast: false
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
          cache: 'npm'  # Cache dependencies for faster builds
      - run: npm ci
      - run: npm test
      # Run additional checks that only make sense on one OS
      - name: Lint
        if: matrix.os == 'ubuntu-latest' && matrix.node == '20'
        run: npm run lint
      - name: Type check
        if: matrix.os == 'ubuntu-latest' && matrix.node == '20'
        run: npm run typecheck

  release:
    needs: [validate, test]
    runs-on: ubuntu-latest
    permissions:
      contents: write  # Needed for creating releases
      id-token: write  # Needed for npm provenance
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Need full history for changelog
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          registry-url: 'https://registry.npmjs.org'
          cache: 'npm'
      - run: npm ci
      - run: npm run build
      # Generate changelog from conventional commits
      - name: Generate changelog
        id: changelog
        uses: orhun/git-cliff-action@v3
        with:
          config: cliff.toml
          args: --latest --strip header
        env:
          OUTPUT: CHANGELOG.md
      # Publish with provenance for supply chain security
      - name: Publish to npm
        run: npm publish --provenance --access public
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
      - name: Create GitHub Release
        uses: softprops/action-gh-release@v2
        with:
          body: ${{ steps.changelog.outputs.content }}
          files: |
            dist/*.js
            dist/*.d.ts
      # Notify on Discord (optional, but nice)
      - name: Announce release
        if: success()
        uses: sarisia/actions-status-discord@v1
        with:
          webhook: ${{ secrets.DISCORD_WEBHOOK }}
          title: "Released v${{ needs.validate.outputs.version }}"
          description: ${{ steps.changelog.outputs.content }}

  # Alert a human if the release fails halfway through
  rollback:
    needs: release
    if: failure()
    runs-on: ubuntu-latest
    steps:
      - name: Notify about failure
        uses: sarisia/actions-status-discord@v1
        with:
          webhook: ${{ secrets.DISCORD_WEBHOOK }}
          status: failure
          title: "Release failed!"
          description: "Manual intervention may be required"
```
Lessons Learned
**Version validation is essential.** The number of times we pushed a tag that didn't match package.json… validate early, fail fast.

**`concurrency` prevents chaos.** Without it, deleting and re-pushing a tag runs two workflows simultaneously. One will fail in confusing ways.

**npm provenance is free trust.** With `--provenance`, npm cryptographically links your package to your GitHub workflow. Users can verify the package came from your repo.

**`fail-fast: false` is usually what you want.** If Node 18 on Ubuntu fails, you still want to know if Node 22 on Windows also fails. Don't stop at the first failure.

**Automated changelogs require commit discipline.** git-cliff is amazing, but only if you use conventional commits. We had to retrofit this onto an existing project—it was painful but worth it.
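The grouping itself lives in `cliff.toml`. A stripped-down sketch of the kind of config involved (the parser regexes here are illustrative, not our exact file):

```toml
# cliff.toml (excerpt) - map conventional commit types to changelog sections
[git]
conventional_commits = true
filter_unconventional = true
commit_parsers = [
  { message = "^feat", group = "Features" },
  { message = "^fix", group = "Bug Fixes" },
  { message = "^docs", group = "Documentation" },
  { message = "^chore", skip = true },
]
```

Any commit that doesn't match a parser (or matches one with `skip = true`) simply never appears in the release notes, which is a strong incentive for the team to write parseable messages.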
Case Study 2: Mobile App CI/CD (iOS + Android)
Mobile CI/CD is a special kind of pain. Code signing, provisioning profiles, build times measured in geological eras—it’s enough to make you nostalgic for “works on my machine.”
The Problem
We had an app that shipped to both iOS and Android. The release process looked like this:
- Build Android APK (relatively easy)
- Build iOS IPA (requires macOS, signing, provisioning profiles, sacrificing a goat)
- Upload to Play Store internal testing
- Upload to TestFlight
- Wait for both to process
- Promote to production (manually, from a specific person’s laptop)
The iOS build was the killer. It only worked on our lead developer’s MacBook because that’s where the signing certificates lived. When he went on vacation, nobody could release.
The Painful Evolution
Version 1: Android Only (The Easy Part)
```yaml
# .github/workflows/android.yml
# Android is almost pleasant
name: Android Build

on:
  push:
    branches: [main]
  pull_request:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: 'temurin'
          java-version: '17'
          cache: 'gradle'
      - name: Build debug APK
        run: ./gradlew assembleDebug
      - name: Run tests
        run: ./gradlew test
      - name: Upload APK
        uses: actions/upload-artifact@v4
        with:
          name: debug-apk
          path: app/build/outputs/apk/debug/*.apk
```
This worked immediately. Android’s tooling on Linux is solid.
Version 2: iOS Attempt #1 (Failure)
```yaml
# .github/workflows/ios.yml
# Attempt 1: The naive approach
# Spoiler: this did not work
name: iOS Build

on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build
        run: xcodebuild -scheme MyApp -configuration Debug
        # Error: Signing requires a development team
```
Yeah, that didn’t work. Code signing in CI is its own circle of hell.
Version 3: iOS With Code Signing (The Real Deal)
After much suffering (and a lot of Stack Overflow), here’s what actually works:
```yaml
# .github/workflows/ios.yml
# The version that actually works
name: iOS Build

on:
  push:
    branches: [main]
  pull_request:
  workflow_dispatch:
    inputs:
      deploy:
        description: 'Deploy to TestFlight'
        type: boolean
        default: false

jobs:
  build:
    # macos-latest was macos-12 for ages, then suddenly wasn't.
    # Pin to a specific version to avoid surprises.
    runs-on: macos-14  # Specific version, not macos-latest
    env:
      # Pin the Xcode version too. Different versions have different behaviors.
      DEVELOPER_DIR: /Applications/Xcode_15.2.app/Contents/Developer
    steps:
      - uses: actions/checkout@v4

      # Install the Apple certificate and provisioning profile.
      # This is the secret sauce.
      - name: Install Apple Certificate
        env:
          BUILD_CERTIFICATE_BASE64: ${{ secrets.BUILD_CERTIFICATE_BASE64 }}
          P12_PASSWORD: ${{ secrets.P12_PASSWORD }}
          KEYCHAIN_PASSWORD: ${{ secrets.KEYCHAIN_PASSWORD }}
        run: |
          # Create variables
          CERTIFICATE_PATH=$RUNNER_TEMP/build_certificate.p12
          KEYCHAIN_PATH=$RUNNER_TEMP/app-signing.keychain-db

          # Decode certificate from base64
          echo -n "$BUILD_CERTIFICATE_BASE64" | base64 --decode -o $CERTIFICATE_PATH

          # Create temporary keychain
          security create-keychain -p "$KEYCHAIN_PASSWORD" $KEYCHAIN_PATH
          security set-keychain-settings -lut 21600 $KEYCHAIN_PATH
          security unlock-keychain -p "$KEYCHAIN_PASSWORD" $KEYCHAIN_PATH

          # Import certificate to keychain
          security import $CERTIFICATE_PATH -P "$P12_PASSWORD" -A -t cert -f pkcs12 -k $KEYCHAIN_PATH
          security list-keychain -d user -s $KEYCHAIN_PATH

      - name: Install Provisioning Profile
        env:
          PROVISIONING_PROFILE_BASE64: ${{ secrets.PROVISIONING_PROFILE_BASE64 }}
        run: |
          PP_PATH=$RUNNER_TEMP/build_pp.mobileprovision
          echo -n "$PROVISIONING_PROFILE_BASE64" | base64 --decode -o $PP_PATH
          mkdir -p ~/Library/MobileDevice/Provisioning\ Profiles
          cp $PP_PATH ~/Library/MobileDevice/Provisioning\ Profiles/

      # Now we can actually build
      - name: Build archive
        run: |
          xcodebuild -scheme MyApp \
            -archivePath $RUNNER_TEMP/MyApp.xcarchive \
            -sdk iphoneos \
            -configuration Release \
            -destination 'generic/platform=iOS' \
            clean archive \
            CODE_SIGN_IDENTITY="Apple Distribution" \
            PROVISIONING_PROFILE_SPECIFIER="${{ secrets.PROVISIONING_PROFILE_NAME }}"

      - name: Export IPA
        run: |
          xcodebuild -exportArchive \
            -archivePath $RUNNER_TEMP/MyApp.xcarchive \
            -exportOptionsPlist ExportOptions.plist \
            -exportPath $RUNNER_TEMP/build

      - name: Upload IPA
        uses: actions/upload-artifact@v4
        with:
          name: ios-ipa
          path: ${{ runner.temp }}/build/*.ipa

      # Only deploy to TestFlight on manual trigger
      - name: Upload to TestFlight
        if: github.event.inputs.deploy == 'true'
        env:
          APPLE_API_KEY: ${{ secrets.APPLE_API_KEY }}
          APPLE_API_ISSUER: ${{ secrets.APPLE_API_ISSUER }}
        run: |
          xcrun altool --upload-app \
            -f $RUNNER_TEMP/build/*.ipa \
            -t ios \
            --apiKey $APPLE_API_KEY \
            --apiIssuer $APPLE_API_ISSUER

      # Clean up keychain and signing material, even on failure
      - name: Cleanup
        if: always()
        run: |
          security delete-keychain $RUNNER_TEMP/app-signing.keychain-db || true
          rm -f $RUNNER_TEMP/build_certificate.p12 || true
          rm -f $RUNNER_TEMP/build_pp.mobileprovision || true
```
The Combined Workflow
Eventually we unified Android and iOS into a single release workflow:
```yaml
# .github/workflows/release.yml
# Mobile release workflow - the complete picture
name: Mobile Release

on:
  push:
    tags:
      - 'v*'

jobs:
  android:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: 'temurin'
          java-version: '17'
          cache: 'gradle'
      - name: Decode keystore
        env:
          KEYSTORE_BASE64: ${{ secrets.ANDROID_KEYSTORE_BASE64 }}
        run: |
          echo "$KEYSTORE_BASE64" | base64 --decode > app/release.keystore
      - name: Build release APK
        env:
          KEYSTORE_PASSWORD: ${{ secrets.ANDROID_KEYSTORE_PASSWORD }}
          KEY_ALIAS: ${{ secrets.ANDROID_KEY_ALIAS }}
          KEY_PASSWORD: ${{ secrets.ANDROID_KEY_PASSWORD }}
        run: |
          ./gradlew assembleRelease \
            -Pandroid.injected.signing.store.file=$PWD/app/release.keystore \
            -Pandroid.injected.signing.store.password=$KEYSTORE_PASSWORD \
            -Pandroid.injected.signing.key.alias=$KEY_ALIAS \
            -Pandroid.injected.signing.key.password=$KEY_PASSWORD
      - name: Upload to Play Store
        uses: r0adkll/upload-google-play@v1
        with:
          serviceAccountJsonPlainText: ${{ secrets.GOOGLE_PLAY_SERVICE_ACCOUNT }}
          packageName: com.example.myapp
          releaseFiles: app/build/outputs/apk/release/*.apk
          track: internal
      - uses: actions/upload-artifact@v4
        with:
          name: android-release
          path: app/build/outputs/apk/release/*.apk

  ios:
    runs-on: macos-14
    env:
      DEVELOPER_DIR: /Applications/Xcode_15.2.app/Contents/Developer
    steps:
      # ... (same iOS steps as above)

  create-release:
    needs: [android, ios]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          path: artifacts
      - name: Create Release
        uses: softprops/action-gh-release@v2
        with:
          files: |
            artifacts/android-release/*.apk
            artifacts/ios-ipa/*.ipa
          generate_release_notes: true
```
Lessons Learned
**Pin your macOS and Xcode versions.** `macos-latest` is a trap. It changes without warning, and suddenly your build breaks because Apple deprecated an API between Xcode versions.

**Base64-encode your certificates.** GitHub secrets can't handle binary files directly. Encode them: `base64 -i Certificate.p12 | pbcopy`, then store the result as a secret.

**Always clean up your keychain.** The `if: always()` cleanup step prevents secrets from leaking if the build fails partway through.

**macOS runners are expensive.** At the time of writing, macOS runners cost 10x what Linux runners cost. Cache aggressively, and consider running quick checks on Linux first before spinning up macOS.

**TestFlight processing takes forever.** Don't wait for it in your workflow. Upload and move on. Apple will email you when it's ready (usually 10-30 minutes later).

**The initial setup is brutal, but it's worth it.** Getting code signing working in CI took us about two days. But now anyone on the team can release, and we haven't had a "works on Dave's laptop" incident since.
Case Study 3: Monorepo with Multiple Services
Monorepos are great until your CI takes 45 minutes because you changed a typo in the README and it rebuilt everything.
The Problem
We had a monorepo with:
- A React frontend
- A Node.js API
- A Python data processing service
- Shared TypeScript types
- Infrastructure as Code (Terraform)
- Documentation
Every push triggered every workflow. Changed the frontend? Full backend test suite. Fixed a typo in docs? Terraform plan. It was miserable.
The Solution: Path-Based Filtering and Composite Actions
Composite Action for Common Setup
First, we extracted common setup into a composite action:
```yaml
# .github/actions/setup-node/action.yml
# Reusable Node.js setup
name: 'Setup Node.js'
description: 'Sets up Node.js with caching'

inputs:
  working-directory:
    description: 'Working directory'
    default: '.'

runs:
  using: 'composite'
  steps:
    - uses: actions/setup-node@v4
      with:
        node-version-file: '${{ inputs.working-directory }}/.nvmrc'
        cache: 'npm'
        cache-dependency-path: '${{ inputs.working-directory }}/package-lock.json'
    - name: Install dependencies
      shell: bash
      working-directory: ${{ inputs.working-directory }}
      run: npm ci
```
Path-Filtered Workflows
```yaml
# .github/workflows/frontend.yml
# Only runs when frontend code changes
name: Frontend

on:
  push:
    branches: [main]
    paths:
      - 'packages/frontend/**'
      - 'packages/shared-types/**'  # Frontend depends on shared types
      - '.github/workflows/frontend.yml'
  pull_request:
    paths:
      - 'packages/frontend/**'
      - 'packages/shared-types/**'
      - '.github/workflows/frontend.yml'

defaults:
  run:
    working-directory: packages/frontend

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/setup-node
        with:
          working-directory: packages/frontend
      - run: npm run lint

  typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/setup-node
        with:
          working-directory: packages/frontend
      - run: npm run typecheck

  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/setup-node
        with:
          working-directory: packages/frontend
      - run: npm test -- --coverage
      - uses: codecov/codecov-action@v4
        with:
          directory: packages/frontend/coverage

  build:
    needs: [lint, typecheck, test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/setup-node
        with:
          working-directory: packages/frontend
      - run: npm run build
      - uses: actions/upload-artifact@v4
        with:
          name: frontend-build
          path: packages/frontend/dist
```
Backend with Database Testing
```yaml
# .github/workflows/api.yml
name: API

on:
  push:
    branches: [main]
    paths:
      - 'packages/api/**'
      - 'packages/shared-types/**'
      - '.github/workflows/api.yml'
  pull_request:
    paths:
      - 'packages/api/**'
      - 'packages/shared-types/**'
      - '.github/workflows/api.yml'

defaults:
  run:
    working-directory: packages/api

jobs:
  test:
    runs-on: ubuntu-latest
    # Spin up real databases for integration tests
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
          POSTGRES_DB: test
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7
        ports:
          - 6379:6379
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/setup-node
        with:
          working-directory: packages/api
      - name: Run migrations
        run: npm run db:migrate
        env:
          DATABASE_URL: postgres://test:test@localhost:5432/test
      - name: Run tests
        run: npm test
        env:
          DATABASE_URL: postgres://test:test@localhost:5432/test
          REDIS_URL: redis://localhost:6379
```
Coordinated Deployment
```yaml
# .github/workflows/deploy.yml
# Deploys whatever changed
name: Deploy

on:
  push:
    branches: [main]

jobs:
  # Figure out what changed
  changes:
    runs-on: ubuntu-latest
    outputs:
      frontend: ${{ steps.filter.outputs.frontend }}
      api: ${{ steps.filter.outputs.api }}
      data: ${{ steps.filter.outputs.data }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            frontend:
              - 'packages/frontend/**'
            api:
              - 'packages/api/**'
            data:
              - 'packages/data-processing/**'

  deploy-frontend:
    needs: changes
    if: needs.changes.outputs.frontend == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/setup-node
        with:
          working-directory: packages/frontend
      - run: npm run build
      - name: Deploy to Cloudflare Pages
        uses: cloudflare/wrangler-action@v3
        with:
          apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }}
          command: pages deploy packages/frontend/dist --project-name=myapp

  deploy-api:
    needs: changes
    if: needs.changes.outputs.api == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Fly.io
        uses: superfly/flyctl-actions/setup-flyctl@master
      - run: flyctl deploy --config packages/api/fly.toml
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}

  # Make sure frontend and API are compatible
  integration-test:
    needs: [deploy-frontend, deploy-api]
    if: always() && (needs.deploy-frontend.result == 'success' || needs.deploy-api.result == 'success')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/setup-node
        with:
          working-directory: packages/e2e
      - run: npm test
        env:
          BASE_URL: https://staging.myapp.com
```
Lessons Learned
**Composite actions are a game-changer.** We went from copying 15 lines of setup code everywhere to `uses: ./.github/actions/setup-node`. Way easier to maintain.

**Path filters need to include workflow files.** If you change `.github/workflows/frontend.yml`, you want the frontend workflow to run so you can test your changes.

**`dorny/paths-filter` is more flexible than native path filtering.** The native `paths:` filter is all-or-nothing per workflow. The action gives you outputs you can use in conditionals.

**Services are real containers.** When you add a `services:` section, you get actual Postgres, Redis, etc. No mocking required for integration tests.

**Dependency graphs are tricky.** Our frontend depends on shared-types, so changes to shared-types should trigger frontend CI. Getting these dependencies right took iteration.

**Coordinated deployments need integration tests.** If you deploy frontend and API separately, you need to verify they work together. We learned this the hard way when a breaking API change got deployed without the corresponding frontend update.
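One concrete consequence of the dependency-graph lesson: the deploy filter should mirror the CI workflows and list shared packages under every dependent. A sketch of the adjusted `dorny/paths-filter` block (same paths as the CI workflows above):

```yaml
filters: |
  frontend:
    - 'packages/frontend/**'
    - 'packages/shared-types/**'  # redeploy frontend when shared types change
  api:
    - 'packages/api/**'
    - 'packages/shared-types/**'
  data:
    - 'packages/data-processing/**'
```

Without this, a change to shared-types sails through CI on both packages but deploys neither.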
Case Study 4: Infrastructure as Code Pipeline
“We should apply this Terraform change” is a sentence that has ruined many a Friday afternoon. IaC pipelines need to be paranoid.
The Problem
We had Terraform managing our entire cloud infrastructure: VPCs, databases, Kubernetes clusters—everything. Changes were applied manually from developer laptops with varying degrees of care. Sometimes the state file got corrupted. Sometimes changes were applied to production when someone thought they were in staging.
The Solution: PR-Based Infrastructure Changes
```yaml
# .github/workflows/terraform.yml
# Infrastructure as Code pipeline
# Every change requires a PR and approval
name: Terraform

on:
  push:
    branches: [main]
    paths:
      - 'infrastructure/**'
      - '.github/workflows/terraform.yml'
  pull_request:
    paths:
      - 'infrastructure/**'
      - '.github/workflows/terraform.yml'

env:
  TF_VERSION: '1.6.0'
  AWS_REGION: us-east-1

jobs:
  # Run on all environments to catch issues early
  plan:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [dev, staging, prod]
      fail-fast: false
    permissions:
      contents: read
      pull-requests: write  # To post plan comments
    defaults:
      run:
        working-directory: infrastructure/${{ matrix.environment }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      # Different credentials per environment
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ secrets[format('{0}_AWS_ACCOUNT_ID', matrix.environment)] }}:role/TerraformRole
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform Init
        run: terraform init -backend-config="bucket=terraform-state-${{ matrix.environment }}"

      - name: Terraform Validate
        run: terraform validate

      - name: Terraform Plan
        id: plan
        continue-on-error: true
        run: |
          set -o pipefail  # without this, tee would mask terraform's exit code
          terraform plan -no-color -out=tfplan 2>&1 | tee plan.txt
          # Check if the plan contains any resource changes
          if terraform show -json tfplan | jq -e '.resource_changes | length > 0' > /dev/null; then
            echo "has_changes=true" >> "$GITHUB_OUTPUT"
          else
            echo "has_changes=false" >> "$GITHUB_OUTPUT"
          fi

      # Post plan as PR comment
      - name: Comment Plan on PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('infrastructure/${{ matrix.environment }}/plan.txt', 'utf8');
            const output = `### Terraform Plan - ${{ matrix.environment }}

            <details>
            <summary>Click to expand</summary>

            \`\`\`hcl
            ${plan.substring(0, 65000)}
            \`\`\`

            </details>`;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });

      - name: Plan Status
        if: steps.plan.outcome == 'failure'
        run: exit 1

      # Save plan for apply job
      - name: Upload Plan
        if: github.ref == 'refs/heads/main'
        uses: actions/upload-artifact@v4
        with:
          name: tfplan-${{ matrix.environment }}
          path: infrastructure/${{ matrix.environment }}/tfplan
          retention-days: 5

  # Apply to dev automatically
  apply-dev:
    needs: plan
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: dev  # Requires environment protection rules
    defaults:
      run:
        working-directory: infrastructure/dev
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ secrets.DEV_AWS_ACCOUNT_ID }}:role/TerraformRole
          aws-region: ${{ env.AWS_REGION }}
      - name: Terraform Init
        run: terraform init -backend-config="bucket=terraform-state-dev"
      - uses: actions/download-artifact@v4
        with:
          name: tfplan-dev
          path: infrastructure/dev
      - name: Terraform Apply
        run: terraform apply -auto-approve tfplan

  # Staging requires manual approval
  apply-staging:
    needs: apply-dev
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: staging  # Has required reviewers
    defaults:
      run:
        working-directory: infrastructure/staging
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ secrets.STAGING_AWS_ACCOUNT_ID }}:role/TerraformRole
          aws-region: ${{ env.AWS_REGION }}
      - run: terraform init -backend-config="bucket=terraform-state-staging"
      - uses: actions/download-artifact@v4
        with:
          name: tfplan-staging
          path: infrastructure/staging
      - run: terraform apply -auto-approve tfplan

  # Production requires multiple approvals
  apply-prod:
    needs: apply-staging
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production  # Has required reviewers and wait timer
    defaults:
      run:
        working-directory: infrastructure/prod
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ secrets.PROD_AWS_ACCOUNT_ID }}:role/TerraformRole
          aws-region: ${{ env.AWS_REGION }}
      - run: terraform init -backend-config="bucket=terraform-state-prod"
      - uses: actions/download-artifact@v4
        with:
          name: tfplan-prod
          path: infrastructure/prod
      - run: terraform apply -auto-approve tfplan
```
Drift Detection (Scheduled)
```yaml
# .github/workflows/terraform-drift.yml
# Detect when reality doesn't match our config
name: Terraform Drift Detection

on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 6 AM
  workflow_dispatch:

jobs:
  drift-check:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [dev, staging, prod]
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: '1.6.0'
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ secrets[format('{0}_AWS_ACCOUNT_ID', matrix.environment)] }}:role/TerraformRole
          aws-region: us-east-1

      - name: Check for drift
        id: drift
        working-directory: infrastructure/${{ matrix.environment }}
        run: |
          terraform init -backend-config="bucket=terraform-state-${{ matrix.environment }}"
          # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes pending
          if terraform plan -detailed-exitcode; then
            echo "No drift detected"
            echo "drift=false" >> "$GITHUB_OUTPUT"
          else
            exit_code=$?
            if [ $exit_code -eq 2 ]; then
              echo "Drift detected!"
              echo "drift=true" >> "$GITHUB_OUTPUT"
            else
              echo "Error running plan"
              exit 1
            fi
          fi

      - name: Create issue for drift
        if: steps.drift.outputs.drift == 'true'
        uses: actions/github-script@v7
        with:
          script: |
            const title = `Infrastructure drift detected in ${{ matrix.environment }}`;
            const body = `Terraform detected configuration drift in the ${{ matrix.environment }} environment.

            This usually means someone made a manual change in the AWS console, or an automated process modified resources.

            Please investigate and either:
            1. Update the Terraform config to match reality
            2. Run \`terraform apply\` to restore the expected state

            Run the Terraform workflow manually to see the full diff.`;
            github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: title,
              body: body,
              labels: ['infrastructure', 'drift', '${{ matrix.environment }}']
            });
```
Lessons Learned
**OIDC is better than long-lived credentials.** We use `role-to-assume` instead of storing AWS keys in secrets. GitHub's OIDC integration means no permanent credentials to rotate.

**Plan artifacts are essential.** The plan you reviewed in the PR should be the exact plan that gets applied. Without artifacts, the apply might include changes that happened between the plan and apply.

**Environments add guardrails.** GitHub Environments let you require approvals, add wait timers, and restrict deployments to specific branches. Production requires two approvals and a 10-minute wait.

**Drift detection catches shadow IT.** Someone always tweaks something in the console "just temporarily." Daily drift checks catch it before it causes problems.

**Posting plans to PRs is magical.** Reviewers can see exactly what will change without running Terraform locally. It's like a really good code diff for infrastructure.

**State file access is the security boundary.** Anyone who can access the Terraform state file can see all your secrets (database passwords, API keys). Treat state bucket access as carefully as production database access.
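For context on the OIDC point: the thing that scopes which repository may assume the role is the IAM role's trust policy. A sketch of its general shape (the account ID and `my-org/my-repo` are placeholders, and you'd usually tighten the `sub` condition to specific branches or environments):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:my-org/my-repo:*"
        }
      }
    }
  ]
}
```

If this policy is too loose, any workflow in any repo that can request a GitHub OIDC token with a matching `sub` claim can assume your Terraform role, so it deserves the same scrutiny as the state bucket.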
Case Study 5: Documentation Site with Preview Deployments
Documentation might not be as exciting as production code, but it’s just as important to keep working. This case study is about a docs site built with Docusaurus, deployed to Vercel.
The Problem
Our docs were always stale. Engineers would update code, ship it, and “update docs later” (which never happened). Docs PRs sat for days because nobody could easily see how the changes looked. Broken links accumulated until someone noticed the 404s in analytics.
The Solution: Preview Deploys and Automated Checks
```yaml
# .github/workflows/docs.yml
# Documentation CI/CD with preview deployments
name: Documentation

on:
  push:
    branches: [main]
    paths:
      - 'docs/**'
      - '.github/workflows/docs.yml'
  pull_request:
    paths:
      - 'docs/**'
      - '.github/workflows/docs.yml'

defaults:
  run:
    working-directory: docs

jobs:
  # Quick checks first
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
          cache-dependency-path: docs/package-lock.json
      - run: npm ci
      # Check for broken Markdown links
      - name: Check internal links
        run: npm run lint:links
      # Spell check
      - name: Spell check
        run: npm run lint:spelling
      # Check that code examples compile
      - name: Validate code blocks
        run: npm run lint:code-blocks

  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
          cache-dependency-path: docs/package-lock.json
      - run: npm ci
      - run: npm run build
      - uses: actions/upload-artifact@v4
        with:
          name: docs-build
          path: docs/build

  # Preview deployments for PRs
  preview:
    needs: [lint, build]
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      deployments: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: docs-build
          path: docs/build

      # Deploy to Vercel preview
      - name: Deploy to Vercel
        id: deploy
        uses: amondnet/vercel-action@v25
        with:
          vercel-token: ${{ secrets.VERCEL_TOKEN }}
          vercel-org-id: ${{ secrets.VERCEL_ORG_ID }}
          vercel-project-id: ${{ secrets.VERCEL_DOCS_PROJECT_ID }}
          working-directory: docs/build

      # Comment with preview URL
      - name: Comment Preview URL
        uses: actions/github-script@v7
        env:
          PREVIEW_URL: ${{ steps.deploy.outputs.preview-url }}
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `Preview deployment ready!\n\n${process.env.PREVIEW_URL}\n\nThis preview will be available until the PR is closed.`
            });

      # Run Lighthouse on preview
      - name: Lighthouse CI
        uses: treosh/lighthouse-ci-action@v11
        with:
          urls: ${{ steps.deploy.outputs.preview-url }}
          configPath: docs/lighthouserc.json
          uploadArtifacts: true

  # Full link validation after preview is up
  validate-links:
    needs: preview
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Check all links including external ones
      - name: Check all links
        uses: lycheeverse/lychee-action@v1
        with:
          args: --verbose --no-progress --exclude-mail './docs/**/*.md'
          fail: true

  # Production deployment
  deploy:
    needs: [lint, build]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: docs-build
          path: docs/build
      - name: Deploy to Vercel (Production)
        uses: amondnet/vercel-action@v25
        with:
          vercel-token: ${{ secrets.VERCEL_TOKEN }}
          vercel-org-id: ${{ secrets.VERCEL_ORG_ID }}
          vercel-project-id: ${{ secrets.VERCEL_DOCS_PROJECT_ID }}
          vercel-args: '--prod'
          working-directory: docs/build

      # Purge CDN cache
      - name: Purge Cloudflare Cache
        run: |
          curl -X POST "https://api.cloudflare.com/client/v4/zones/${{ secrets.CF_ZONE_ID }}/purge_cache" \
            -H "Authorization: Bearer ${{ secrets.CF_API_TOKEN }}" \
            -H "Content-Type: application/json" \
            --data '{"purge_everything":true}'

      # Check production site
      - name: Smoke test
        run: |
          sleep 30  # Wait for deployment to propagate
          curl -f https://docs.example.com || exit 1
          curl -f https://docs.example.com/getting-started || exit 1
```
Lighthouse Configuration
{
  "ci": {
    "assert": {
      "assertions": {
        "categories:performance": ["error", { "minScore": 0.9 }],
        "categories:accessibility": ["error", { "minScore": 0.95 }],
        "categories:best-practices": ["error", { "minScore": 0.9 }],
        "categories:seo": ["error", { "minScore": 0.9 }]
      }
    }
  }
}
Lessons Learned
Preview deployments transform docs PRs. Reviewers actually review when they can click a link and see the result. Our docs PR review time dropped from days to hours.
Link checking catches so many problems. Internal links break when pages move. External links break when third parties change their URLs. Automated checking catches these before users do.
Lighthouse catches performance regressions. Someone added an unoptimized 5MB image to the docs. Lighthouse caught it before merge.
Code block validation is underrated. Nothing’s worse than docs with syntax errors in the code examples. We extract code blocks and run them through compilers/linters.
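The extraction-and-checking step can be surprisingly small. Here's a hypothetical sketch of what a `lint:code-blocks` script might do for JavaScript examples: pull fenced blocks out of the Markdown and compile (not execute) each one with `new Function()` to catch syntax errors. The helper names are illustrative, not our actual tooling.

```javascript
// Sketch of a code-block validator: extract fenced js blocks from Markdown
// and syntax-check each one by compiling it without running it.
const FENCE = "`".repeat(3); // build the fence string to avoid embedding one literally

function extractCodeBlocks(markdown, lang = "js") {
  const pattern = new RegExp(FENCE + lang + "\\n([\\s\\S]*?)" + FENCE, "g");
  const blocks = [];
  let match;
  while ((match = pattern.exec(markdown)) !== null) {
    blocks.push(match[1]);
  }
  return blocks;
}

function syntaxError(code) {
  try {
    new Function(code); // compiles the block body, never executes it
    return null;
  } catch (err) {
    return err.message;
  }
}

// A tiny doc with one valid and one broken example
const doc = [
  "# Example",
  FENCE + "js",
  "const x = 1;",
  FENCE,
  FENCE + "js",
  "const broken = ;",
  FENCE,
].join("\n");

const errors = extractCodeBlocks(doc)
  .map((code, index) => ({ index, error: syntaxError(code) }))
  .filter((result) => result.error !== null);

console.log(errors); // only the second (broken) block is reported
```

For typed languages you'd shell out to the real compiler instead, but the shape is the same: extract, check, fail the build on any error.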
Cache purging is essential. CDNs love to cache aggressively. Without explicit purging, users might see stale docs for hours after a deployment.
Smoke tests catch deployment failures. Just because the deploy succeeded doesn’t mean the site works. A quick curl against key pages catches the obvious problems.
Common Patterns Across All Case Studies
Looking back at these workflows, some patterns keep showing up:
1. Fail Fast, Then Fail Completely
Run quick checks first (linting, validation), then slower checks (tests, builds). If the quick checks fail, don’t waste time on the slow ones.
But when you do run comprehensive checks, use fail-fast: false in matrix builds. You want to know all the failures, not just the first one.
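A minimal sketch of what that looks like in a matrix job (the job and script names here are placeholders):

```yaml
# Run the full matrix even when one combination fails, so a Windows
# failure doesn't hide an unrelated macOS failure.
test:
  runs-on: ${{ matrix.os }}
  strategy:
    fail-fast: false   # report every failing combination, not just the first
    matrix:
      os: [ubuntu-latest, windows-latest, macos-latest]
      node: [18, 20]
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-node@v4
      with:
        node-version: ${{ matrix.node }}
    - run: npm ci
    - run: npm test
```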
2. Artifact Everything
Plans, builds, test results—upload them all. You’ll want them for debugging, and you’ll want them for later jobs in the workflow.
3. Post to the PR
Put information where developers already are. Post Terraform plans as PR comments. Post preview URLs as PR comments. Post Lighthouse scores as PR comments. Make it impossible to ignore.
4. Environments Add Guardrails
GitHub Environments let you require approvals, add wait timers, restrict to specific branches, and provide environment-specific secrets. Use them.
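Wiring a job to an environment is one key; the guardrails themselves live in the repo's Settings. A hedged sketch (the environment name and deploy script are placeholders):

```yaml
# Jobs that reference an environment inherit its protection rules:
# required reviewers, wait timers, branch restrictions, scoped secrets.
deploy:
  runs-on: ubuntu-latest
  environment:
    name: production
    url: https://docs.example.com   # shown as "View deployment" in the UI
  steps:
    - uses: actions/checkout@v4
    - run: ./deploy.sh              # environment-scoped secrets are available here
```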
5. Clean Up After Yourself
Temporary files, keychains, credentials—delete them in a step with if: always(). Don’t assume the workflow will complete normally.
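As a sketch, a cleanup step like this runs whether the job succeeded, failed, or was cancelled (the keychain name and file path are illustrative):

```yaml
# if: always() overrides the default "skip remaining steps on failure"
# behavior, so credentials get removed on every exit path.
- name: Clean up signing keychain
  if: always()
  run: |
    security delete-keychain build.keychain || true
    rm -f "$RUNNER_TEMP/signing-cert.p12"
```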
6. Concurrency Controls Prevent Chaos
Without concurrency controls, pushing twice in quick succession runs two workflows simultaneously. One will fail in confusing ways, or worse, both will succeed in incompatible ways.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
The Reality Check
I want to be honest: getting to these workflows took time. Lots of failed runs, lots of debugging, lots of reading documentation. The first version of each workflow was much simpler—and much less reliable.
The workflows you see in tutorials are usually version 1. The workflows you see in production are version 5 (or 15). Don’t be discouraged if your first attempt doesn’t handle every edge case. Start simple, add complexity as you hit real problems.
And when you do hit problems, capture them in the workflow. That comment that says “DO NOT REMOVE - fixes race condition on Windows”? That’s institutional knowledge. Future you (or future teammates) will thank you.
GitHub Actions isn’t just CI/CD. It’s a platform for encoding your team’s operational knowledge into code. Every weird conditional, every retry loop, every cleanup step—they’re all lessons learned the hard way.
Your workflows are documentation of how things actually work, not how they’re supposed to work in theory.
Next up in this series: niche and unexpected uses—the weird, wonderful, and surprisingly practical things people build with Actions.