GitHub Actions: Real-World Case Studies

Complete, battle-tested GitHub Actions workflows from real projects. The full YAML, the bugs we hit, the lessons learned, and the evolution from first commit to production.

The prettiest workflows are useless if they don’t work in production. I’ve read countless GitHub Actions tutorials that show you a pristine 15-line YAML file and call it a day. That’s not how real workflows look. Real workflows have weird conditionals, retry logic, and comments that say things like “DO NOT REMOVE - fixes race condition on Windows.”

This post is different. I’m going to walk through complete, production-grade workflows from real projects—the good, the bad, and the lessons learned along the way. These are the kind of workflows that have survived contact with actual users, flaky networks, and that one coworker who keeps force-pushing to main.

If you’re new to the series, check out the intro post first. But if you want to see how Actions work in the trenches, you’re in the right place.

Case Study 1: Open Source Library Release Pipeline

Let’s start with something many developers need: releasing a library to a package registry. This example is for a JavaScript library published to npm, but the patterns apply to PyPI, RubyGems, Maven—wherever you’re publishing.

The Problem

We had a popular open-source library with a manual release process that went something like this:

  1. Update version in package.json
  2. Update CHANGELOG.md
  3. Run tests locally
  4. Build the dist
  5. npm publish
  6. Create a GitHub Release
  7. Write release notes
  8. Announce on social media

Steps got skipped. Versions got mismatched. Once we published a release with failing tests because someone ran npm publish from a dirty working directory. Not our finest moment.

The Evolution

Version 1: The Naive Approach

Our first workflow was embarrassingly simple:

# .github/workflows/release.yml - Version 1
# Spoiler: this didn't last long

name: Release

on:
  push:
    tags:
      - 'v*'

jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          registry-url: 'https://registry.npmjs.org'

      - run: npm ci
      - run: npm test
      - run: npm run build
      - run: npm publish
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}

It worked! Until it didn’t.

What went wrong:

  • No matrix testing—we broke Node 18 support without realizing it
  • No changelog generation—we were still writing release notes by hand
  • No GitHub Release—just a tag floating in the void
  • Published whenever tests passed on Ubuntu, even if they would have failed on Windows

Version 2: Adding Matrix Testing

# .github/workflows/release.yml - Version 2
# Now with matrix builds!

name: Release

on:
  push:
    tags:
      - 'v*'

jobs:
  # Test across Node versions and OSes before releasing
  test:
    strategy:
      matrix:
        node: [18, 20, 22]
        os: [ubuntu-latest, macos-latest, windows-latest]
      fail-fast: false  # Run all tests even if one fails
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}

      - run: npm ci
      - run: npm test

  release:
    needs: test  # Only release if ALL tests pass
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          registry-url: 'https://registry.npmjs.org'

      - run: npm ci
      - run: npm run build

      - run: npm publish
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}

      # Now we create a GitHub Release too
      - name: Create GitHub Release
        uses: softprops/action-gh-release@v2
        with:
          generate_release_notes: true

This was better. The needs: test line means the release job waits for all matrix tests to pass. We caught Windows-specific bugs before they hit npm.

But we still had problems:

  • generate_release_notes: true produces… okay release notes. Not great
  • We still had to manually manage the changelog
  • Version in package.json didn’t always match the tag
  • We once pushed a tag, realized we forgot something, deleted the tag, fixed the issue, and pushed the tag again. The workflow ran twice.

Version 3: The Production Version

Here’s what we actually run today:

# .github/workflows/release.yml - Version 3
# The battle-hardened version

name: Release

on:
  push:
    tags:
      - 'v*'

# Prevent duplicate releases from re-pushed tags
concurrency:
  group: release-${{ github.ref }}
  cancel-in-progress: true

jobs:
  # Verify the tag matches package.json version
  validate:
    runs-on: ubuntu-latest
    outputs:
      version: ${{ steps.check.outputs.version }}
    steps:
      - uses: actions/checkout@v4

      - name: Verify version match
        id: check
        run: |
          TAG_VERSION="${GITHUB_REF#refs/tags/v}"
          PKG_VERSION=$(node -p "require('./package.json').version")

          if [ "$TAG_VERSION" != "$PKG_VERSION" ]; then
            echo "::error::Tag version ($TAG_VERSION) doesn't match package.json ($PKG_VERSION)"
            exit 1
          fi

          echo "version=$PKG_VERSION" >> $GITHUB_OUTPUT
          echo "Releasing version $PKG_VERSION"

  test:
    needs: validate
    strategy:
      matrix:
        node: [18, 20, 22]
        os: [ubuntu-latest, macos-latest, windows-latest]
      fail-fast: false
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
          cache: 'npm'  # Cache dependencies for faster builds

      - run: npm ci
      - run: npm test

      # Run additional checks that only make sense on one OS
      - name: Lint
        if: matrix.os == 'ubuntu-latest' && matrix.node == '20'
        run: npm run lint

      - name: Type check
        if: matrix.os == 'ubuntu-latest' && matrix.node == '20'
        run: npm run typecheck

  release:
    needs: [validate, test]
    runs-on: ubuntu-latest
    permissions:
      contents: write  # Needed for creating releases
      id-token: write  # Needed for npm provenance
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Need full history for changelog

      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          registry-url: 'https://registry.npmjs.org'
          cache: 'npm'

      - run: npm ci
      - run: npm run build

      # Generate changelog from conventional commits
      - name: Generate changelog
        id: changelog
        uses: orhun/git-cliff-action@v3
        with:
          config: cliff.toml
          args: --latest --strip header
        env:
          OUTPUT: CHANGELOG.md

      # Publish with provenance for supply chain security
      - name: Publish to npm
        run: npm publish --provenance --access public
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}

      - name: Create GitHub Release
        uses: softprops/action-gh-release@v2
        with:
          body: ${{ steps.changelog.outputs.content }}
          files: |
            dist/*.js
            dist/*.d.ts

      # Notify on Discord (optional, but nice)
      - name: Announce release
        if: success()
        uses: sarisia/actions-status-discord@v1
        with:
          webhook: ${{ secrets.DISCORD_WEBHOOK }}
          title: "Released v${{ needs.validate.outputs.version }}"
          description: ${{ steps.changelog.outputs.content }}

  # Alert the team if the release fails partway through
  rollback:
    needs: release
    if: failure()
    runs-on: ubuntu-latest
    steps:
      - name: Notify about failure
        uses: sarisia/actions-status-discord@v1
        with:
          webhook: ${{ secrets.DISCORD_WEBHOOK }}
          status: failure
          title: "Release failed!"
          description: "Manual intervention may be required"
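
The changelog step above references a cliff.toml in the repo. Ours is longer, but a minimal sketch looks something like this (the groups and template here are illustrative, based on git-cliff's documented defaults, not our exact config):

```toml
# cliff.toml - minimal git-cliff configuration (illustrative)
[changelog]
header = "# Changelog\n"
# Tera template: group commits by type, newest release first
body = """
## {{ version }} - {{ timestamp | date(format="%Y-%m-%d") }}
{% for group, commits in commits | group_by(attribute="group") %}
### {{ group }}
{% for commit in commits %}- {{ commit.message | upper_first }}
{% endfor %}{% endfor %}
"""

[git]
conventional_commits = true
commit_parsers = [
  { message = "^feat", group = "Features" },
  { message = "^fix", group = "Bug Fixes" },
  { message = "^docs", group = "Documentation" },
]
```

The commit_parsers regexes are what turn `feat:` and `fix:` prefixes into changelog sections.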

Lessons Learned

  1. Version validation is essential. The number of times we pushed a tag that didn’t match package.json… validate early, fail fast.

  2. concurrency prevents chaos. Without it, deleting and re-pushing a tag runs two workflows simultaneously. One will fail in confusing ways.

  3. npm provenance is free trust. With --provenance, npm cryptographically links your package to your GitHub workflow. Users can verify the package came from your repo.

  4. fail-fast: false is usually what you want. If Node 18 on Ubuntu fails, you still want to know if Node 22 on Windows also fails. Don’t stop at the first failure.

  5. Automated changelogs require commit discipline. git-cliff is amazing, but only if you use conventional commits. We had to retrofit this onto an existing project—it was painful but worth it.
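
If conventional commits are new to you, the idea is simple: commit messages follow a `type(scope): description` structure that a tool can parse. A quick demo in a scratch repo (the messages are made up):

```shell
set -e
# Scratch repo to show the commit style git-cliff expects
cd "$(mktemp -d)"
git init -q
git config user.email "ci@example.com"
git config user.name "CI"

echo a > file.txt && git add file.txt
git commit -qm "feat(parser): support YAML anchors"       # grouped under Features
echo b >> file.txt && git add file.txt
git commit -qm "fix(windows): normalize path separators"  # grouped under Bug Fixes

git log --format='%s'
```

A breaking change gets a `!` (e.g. `feat!: drop Node 16 support`), which changelog tools can surface prominently.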

Case Study 2: Mobile App CI/CD (iOS + Android)

Mobile CI/CD is a special kind of pain. Code signing, provisioning profiles, build times measured in geological eras—it’s enough to make you nostalgic for “works on my machine.”

The Problem

We had an app that shipped to both iOS and Android. The release process looked like this:

  1. Build Android APK (relatively easy)
  2. Build iOS IPA (requires macOS, signing, provisioning profiles, sacrificing a goat)
  3. Upload to Play Store internal testing
  4. Upload to TestFlight
  5. Wait for both to process
  6. Promote to production (manually, from a specific person’s laptop)

The iOS build was the killer. It only worked on our lead developer’s MacBook because that’s where the signing certificates lived. When he went on vacation, nobody could release.

The Painful Evolution

Version 1: Android Only (The Easy Part)

# .github/workflows/android.yml
# Android is almost pleasant

name: Android Build

on:
  push:
    branches: [main]
  pull_request:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-java@v4
        with:
          distribution: 'temurin'
          java-version: '17'
          cache: 'gradle'

      - name: Build debug APK
        run: ./gradlew assembleDebug

      - name: Run tests
        run: ./gradlew test

      - name: Upload APK
        uses: actions/upload-artifact@v4
        with:
          name: debug-apk
          path: app/build/outputs/apk/debug/*.apk

This worked immediately. Android’s tooling on Linux is solid.

Version 2: iOS Attempt #1 (Failure)

# .github/workflows/ios.yml
# Attempt 1: The naive approach
# Spoiler: this did not work

name: iOS Build

on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build
        run: xcodebuild -scheme MyApp -configuration Debug
        # Error: Signing requires a development team

Yeah, that didn’t work. Code signing in CI is its own circle of hell.

Version 3: iOS With Code Signing (The Real Deal)

After much suffering (and a lot of Stack Overflow), here’s what actually works:

# .github/workflows/ios.yml
# The version that actually works

name: iOS Build

on:
  push:
    branches: [main]
  pull_request:
  workflow_dispatch:
    inputs:
      deploy:
        description: 'Deploy to TestFlight'
        type: boolean
        default: false

jobs:
  build:
    runs-on: macos-14  # Specific version, not macos-latest
    # macos-latest was macos-12 for ages, then suddenly wasn't
    # Pin to a specific version to avoid surprises

    env:
      DEVELOPER_DIR: /Applications/Xcode_15.2.app/Contents/Developer
      # Pin Xcode version too. Different versions have different behaviors.

    steps:
      - uses: actions/checkout@v4

      # Install the Apple certificate and provisioning profile
      # This is the secret sauce
      - name: Install Apple Certificate
        env:
          BUILD_CERTIFICATE_BASE64: ${{ secrets.BUILD_CERTIFICATE_BASE64 }}
          P12_PASSWORD: ${{ secrets.P12_PASSWORD }}
          KEYCHAIN_PASSWORD: ${{ secrets.KEYCHAIN_PASSWORD }}
        run: |
          # Create variables
          CERTIFICATE_PATH=$RUNNER_TEMP/build_certificate.p12
          KEYCHAIN_PATH=$RUNNER_TEMP/app-signing.keychain-db

          # Decode certificate from base64
          echo -n "$BUILD_CERTIFICATE_BASE64" | base64 --decode -o $CERTIFICATE_PATH

          # Create temporary keychain
          security create-keychain -p "$KEYCHAIN_PASSWORD" $KEYCHAIN_PATH
          security set-keychain-settings -lut 21600 $KEYCHAIN_PATH
          security unlock-keychain -p "$KEYCHAIN_PASSWORD" $KEYCHAIN_PATH

          # Import certificate to keychain
          security import $CERTIFICATE_PATH -P "$P12_PASSWORD" -A -t cert -f pkcs12 -k $KEYCHAIN_PATH
          security list-keychains -d user -s $KEYCHAIN_PATH

      - name: Install Provisioning Profile
        env:
          PROVISIONING_PROFILE_BASE64: ${{ secrets.PROVISIONING_PROFILE_BASE64 }}
        run: |
          PP_PATH=$RUNNER_TEMP/build_pp.mobileprovision
          echo -n "$PROVISIONING_PROFILE_BASE64" | base64 --decode -o $PP_PATH
          mkdir -p ~/Library/MobileDevice/Provisioning\ Profiles
          cp $PP_PATH ~/Library/MobileDevice/Provisioning\ Profiles/

      # Now we can actually build
      - name: Build archive
        run: |
          xcodebuild -scheme MyApp \
            -archivePath $RUNNER_TEMP/MyApp.xcarchive \
            -sdk iphoneos \
            -configuration Release \
            -destination 'generic/platform=iOS' \
            clean archive \
            CODE_SIGN_IDENTITY="Apple Distribution" \
            PROVISIONING_PROFILE_SPECIFIER="${{ secrets.PROVISIONING_PROFILE_NAME }}"

      - name: Export IPA
        run: |
          xcodebuild -exportArchive \
            -archivePath $RUNNER_TEMP/MyApp.xcarchive \
            -exportOptionsPlist ExportOptions.plist \
            -exportPath $RUNNER_TEMP/build

      - name: Upload IPA
        uses: actions/upload-artifact@v4
        with:
          name: ios-ipa
          path: ${{ runner.temp }}/build/*.ipa

      # Only deploy to TestFlight on manual trigger
      - name: Upload to TestFlight
        if: github.event.inputs.deploy == 'true'
        env:
          APPLE_API_KEY: ${{ secrets.APPLE_API_KEY }}
          APPLE_API_ISSUER: ${{ secrets.APPLE_API_ISSUER }}
        run: |
          xcrun altool --upload-app \
            -f $RUNNER_TEMP/build/*.ipa \
            -t ios \
            --apiKey $APPLE_API_KEY \
            --apiIssuer $APPLE_API_ISSUER

      # Clean up keychain
      - name: Cleanup
        if: always()
        run: |
          security delete-keychain $RUNNER_TEMP/app-signing.keychain-db || true
          rm -f $RUNNER_TEMP/build_certificate.p12 || true
          rm -f $RUNNER_TEMP/build_pp.mobileprovision || true
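
The export step references an ExportOptions.plist that lives in the repo. A minimal sketch of one (the team ID is a placeholder, and depending on your Xcode version the method may be `app-store` rather than `app-store-connect`):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <!-- app-store-connect for TestFlight/App Store; ad-hoc for direct installs -->
    <key>method</key>
    <string>app-store-connect</string>
    <key>teamID</key>
    <string>ABCDE12345</string>
    <key>signingStyle</key>
    <string>manual</string>
</dict>
</plist>
```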

The Combined Workflow

Eventually we unified Android and iOS into a single release workflow:

# .github/workflows/release.yml
# Mobile release workflow - the complete picture

name: Mobile Release

on:
  push:
    tags:
      - 'v*'

jobs:
  android:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-java@v4
        with:
          distribution: 'temurin'
          java-version: '17'
          cache: 'gradle'

      - name: Decode keystore
        env:
          KEYSTORE_BASE64: ${{ secrets.ANDROID_KEYSTORE_BASE64 }}
        run: |
          echo "$KEYSTORE_BASE64" | base64 --decode > app/release.keystore

      - name: Build release APK
        env:
          KEYSTORE_PASSWORD: ${{ secrets.ANDROID_KEYSTORE_PASSWORD }}
          KEY_ALIAS: ${{ secrets.ANDROID_KEY_ALIAS }}
          KEY_PASSWORD: ${{ secrets.ANDROID_KEY_PASSWORD }}
        run: |
          ./gradlew assembleRelease \
            -Pandroid.injected.signing.store.file=$PWD/app/release.keystore \
            -Pandroid.injected.signing.store.password=$KEYSTORE_PASSWORD \
            -Pandroid.injected.signing.key.alias=$KEY_ALIAS \
            -Pandroid.injected.signing.key.password=$KEY_PASSWORD

      - name: Upload to Play Store
        uses: r0adkll/upload-google-play@v1
        with:
          serviceAccountJsonPlainText: ${{ secrets.GOOGLE_PLAY_SERVICE_ACCOUNT }}
          packageName: com.example.myapp
          releaseFiles: app/build/outputs/apk/release/*.apk
          track: internal

      - uses: actions/upload-artifact@v4
        with:
          name: android-release
          path: app/build/outputs/apk/release/*.apk

  ios:
    runs-on: macos-14
    env:
      DEVELOPER_DIR: /Applications/Xcode_15.2.app/Contents/Developer
    steps:
      # ... (same iOS steps as above)

  create-release:
    needs: [android, ios]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/download-artifact@v4
        with:
          path: artifacts

      - name: Create Release
        uses: softprops/action-gh-release@v2
        with:
          files: |
            artifacts/android-release/*.apk
            artifacts/ios-ipa/*.ipa
          generate_release_notes: true

Lessons Learned

  1. Pin your macOS and Xcode versions. macos-latest is a trap. It changes without warning, and suddenly your build breaks because Apple deprecated an API between Xcode versions.

  2. Base64-encode your certificates. GitHub secrets can’t handle binary files directly. Encode them: base64 -i Certificate.p12 | pbcopy, then store the result as a secret.

  3. Always clean up your keychain. The if: always() cleanup step prevents secrets from leaking if the build fails partway through.

  4. macOS runners are expensive. At the time of writing, macOS runners cost 10x what Linux runners cost. Cache aggressively, and consider running quick checks on Linux first before spinning up macOS.

  5. TestFlight processing takes forever. Don’t wait for it in your workflow. Upload and move on. Apple will email you when it’s ready (usually 10-30 minutes later).

  6. The initial setup is brutal, but it’s worth it. Getting code signing working in CI took us about two days. But now anyone on the team can release, and we haven’t had a “works on Dave’s laptop” incident since.
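
The encode/decode dance from lesson 2 is worth sanity-checking locally before trusting it with a real certificate. A quick round-trip with a stand-in binary file:

```shell
set -e
cd "$(mktemp -d)"
# Stand-in for Certificate.p12 (any binary file works for the test)
head -c 64 /dev/urandom > Certificate.p12

# Encode for pasting into a GitHub secret (strip newlines to be safe)
CERT_B64=$(base64 < Certificate.p12 | tr -d '\n')

# Decode in the workflow back into a binary file
printf '%s' "$CERT_B64" | base64 --decode > decoded.p12

cmp Certificate.p12 decoded.p12 && echo "round-trip OK"
```

If the files compare equal, the same pattern will survive the trip through GitHub secrets.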

Case Study 3: Monorepo with Multiple Services

Monorepos are great until your CI takes 45 minutes because you fixed a typo in the README and it rebuilt everything.

The Problem

We had a monorepo with:

  • A React frontend
  • A Node.js API
  • A Python data processing service
  • Shared TypeScript types
  • Infrastructure as Code (Terraform)
  • Documentation

Every push triggered every workflow. Changed the frontend? Full backend test suite. Fixed a typo in docs? Terraform plan. It was miserable.

The Solution: Path-Based Filtering and Composite Actions

Composite Action for Common Setup

First, we extracted common setup into a composite action:

# .github/actions/setup-node/action.yml
# Reusable Node.js setup

name: 'Setup Node.js'
description: 'Sets up Node.js with caching'

inputs:
  working-directory:
    description: 'Working directory'
    default: '.'

runs:
  using: 'composite'
  steps:
    - uses: actions/setup-node@v4
      with:
        node-version-file: '${{ inputs.working-directory }}/.nvmrc'
        cache: 'npm'
        cache-dependency-path: '${{ inputs.working-directory }}/package-lock.json'

    - name: Install dependencies
      shell: bash
      working-directory: ${{ inputs.working-directory }}
      run: npm ci

Path-Filtered Workflows

# .github/workflows/frontend.yml
# Only runs when frontend code changes

name: Frontend

on:
  push:
    branches: [main]
    paths:
      - 'packages/frontend/**'
      - 'packages/shared-types/**'  # Frontend depends on shared types
      - '.github/workflows/frontend.yml'
  pull_request:
    paths:
      - 'packages/frontend/**'
      - 'packages/shared-types/**'
      - '.github/workflows/frontend.yml'

defaults:
  run:
    working-directory: packages/frontend

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/setup-node
        with:
          working-directory: packages/frontend
      - run: npm run lint

  typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/setup-node
        with:
          working-directory: packages/frontend
      - run: npm run typecheck

  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/setup-node
        with:
          working-directory: packages/frontend
      - run: npm test -- --coverage
      - uses: codecov/codecov-action@v4
        with:
          directory: packages/frontend/coverage

  build:
    needs: [lint, typecheck, test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/setup-node
        with:
          working-directory: packages/frontend
      - run: npm run build
      - uses: actions/upload-artifact@v4
        with:
          name: frontend-build
          path: packages/frontend/dist

Backend with Database Testing

# .github/workflows/api.yml

name: API

on:
  push:
    branches: [main]
    paths:
      - 'packages/api/**'
      - 'packages/shared-types/**'
      - '.github/workflows/api.yml'
  pull_request:
    paths:
      - 'packages/api/**'
      - 'packages/shared-types/**'
      - '.github/workflows/api.yml'

defaults:
  run:
    working-directory: packages/api

jobs:
  test:
    runs-on: ubuntu-latest

    # Spin up real databases for integration tests
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
          POSTGRES_DB: test
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

      redis:
        image: redis:7
        ports:
          - 6379:6379
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/setup-node
        with:
          working-directory: packages/api

      - name: Run migrations
        run: npm run db:migrate
        env:
          DATABASE_URL: postgres://test:test@localhost:5432/test

      - name: Run tests
        run: npm test
        env:
          DATABASE_URL: postgres://test:test@localhost:5432/test
          REDIS_URL: redis://localhost:6379

Coordinated Deployment

# .github/workflows/deploy.yml
# Deploys whatever changed

name: Deploy

on:
  push:
    branches: [main]

jobs:
  # Figure out what changed
  changes:
    runs-on: ubuntu-latest
    outputs:
      frontend: ${{ steps.filter.outputs.frontend }}
      api: ${{ steps.filter.outputs.api }}
      data: ${{ steps.filter.outputs.data }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            frontend:
              - 'packages/frontend/**'
            api:
              - 'packages/api/**'
            data:
              - 'packages/data-processing/**'

  deploy-frontend:
    needs: changes
    if: needs.changes.outputs.frontend == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/setup-node
        with:
          working-directory: packages/frontend
      - run: npm run build
      - name: Deploy to Cloudflare Pages
        uses: cloudflare/wrangler-action@v3
        with:
          apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }}
          command: pages deploy packages/frontend/dist --project-name=myapp

  deploy-api:
    needs: changes
    if: needs.changes.outputs.api == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Fly.io
        uses: superfly/flyctl-actions/setup-flyctl@master
      - run: flyctl deploy --config packages/api/fly.toml
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}

  # Make sure frontend and API are compatible
  integration-test:
    needs: [deploy-frontend, deploy-api]
    if: always() && (needs.deploy-frontend.result == 'success' || needs.deploy-api.result == 'success')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/setup-node
        with:
          working-directory: packages/e2e
      - run: npm test
        env:
          BASE_URL: https://staging.myapp.com

Lessons Learned

  1. Composite actions are a game-changer. We went from copying 15 lines of setup code everywhere to uses: ./.github/actions/setup-node. Way easier to maintain.

  2. Path filters need to include workflow files. If you change .github/workflows/frontend.yml, you want the frontend workflow to run so you can test your changes.

  3. dorny/paths-filter is more flexible than native path filtering. The native paths: filter is all-or-nothing per workflow. The action gives you outputs you can use in conditionals.

  4. Services are real containers. When you add a services: section, you get actual Postgres, Redis, etc. No mocking required for integration tests.

  5. Dependency graphs are tricky. Our frontend depends on shared-types, so changes to shared-types should trigger frontend CI. Getting these dependencies right took iteration.

  6. Coordinated deployments need integration tests. If you deploy frontend and API separately, you need to verify they work together. We learned this the hard way when a breaking API change got deployed without the corresponding frontend update.

Case Study 4: Infrastructure as Code Pipeline

“We should apply this Terraform change” is a sentence that has ruined many a Friday afternoon. IaC pipelines need to be paranoid.

The Problem

We had Terraform managing our entire cloud infrastructure: VPCs, databases, Kubernetes clusters—everything. Changes were applied manually from developer laptops with varying degrees of care. Sometimes the state file got corrupted. Sometimes changes were applied to production when someone thought they were in staging.

The Solution: PR-Based Infrastructure Changes

# .github/workflows/terraform.yml
# Infrastructure as Code pipeline
# Every change requires a PR and approval

name: Terraform

on:
  push:
    branches: [main]
    paths:
      - 'infrastructure/**'
      - '.github/workflows/terraform.yml'
  pull_request:
    paths:
      - 'infrastructure/**'
      - '.github/workflows/terraform.yml'

env:
  TF_VERSION: '1.6.0'
  AWS_REGION: us-east-1

jobs:
  # Run on all environments to catch issues early
  plan:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [dev, staging, prod]
      fail-fast: false

    permissions:
      contents: read
      pull-requests: write  # To post plan comments

    defaults:
      run:
        working-directory: infrastructure/${{ matrix.environment }}

    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      # Different credentials per environment
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ secrets[format('{0}_AWS_ACCOUNT_ID', matrix.environment)] }}:role/TerraformRole
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform Init
        run: terraform init -backend-config="bucket=terraform-state-${{ matrix.environment }}"

      - name: Terraform Validate
        run: terraform validate

      - name: Terraform Plan
        id: plan
        run: |
          set -o pipefail  # Without this, tee's exit code masks a failed plan
          terraform plan -no-color -out=tfplan 2>&1 | tee plan.txt

          # Check if there are changes
          if terraform show -json tfplan | jq -e '.resource_changes | length > 0' > /dev/null; then
            echo "has_changes=true" >> $GITHUB_OUTPUT
          else
            echo "has_changes=false" >> $GITHUB_OUTPUT
          fi
        continue-on-error: true

      # Post plan as PR comment
      - name: Comment Plan on PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('infrastructure/${{ matrix.environment }}/plan.txt', 'utf8');

            const output = `### Terraform Plan - ${{ matrix.environment }}

            <details>
            <summary>Click to expand</summary>

            \`\`\`hcl
            ${plan.substring(0, 65000)}
            \`\`\`

            </details>`;

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });

      - name: Plan Status
        if: steps.plan.outcome == 'failure'
        run: exit 1

      # Save plan for apply job
      - name: Upload Plan
        if: github.ref == 'refs/heads/main'
        uses: actions/upload-artifact@v4
        with:
          name: tfplan-${{ matrix.environment }}
          path: infrastructure/${{ matrix.environment }}/tfplan
          retention-days: 5

  # Apply to dev automatically
  apply-dev:
    needs: plan
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: dev  # Requires environment protection rules

    defaults:
      run:
        working-directory: infrastructure/dev

    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ secrets.DEV_AWS_ACCOUNT_ID }}:role/TerraformRole
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform Init
        run: terraform init -backend-config="bucket=terraform-state-dev"

      - uses: actions/download-artifact@v4
        with:
          name: tfplan-dev
          path: infrastructure/dev

      - name: Terraform Apply
        run: terraform apply -auto-approve tfplan

  # Staging requires manual approval
  apply-staging:
    needs: apply-dev
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: staging  # Has required reviewers

    defaults:
      run:
        working-directory: infrastructure/staging

    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ secrets.STAGING_AWS_ACCOUNT_ID }}:role/TerraformRole
          aws-region: ${{ env.AWS_REGION }}

      - run: terraform init -backend-config="bucket=terraform-state-staging"

      - uses: actions/download-artifact@v4
        with:
          name: tfplan-staging
          path: infrastructure/staging

      - run: terraform apply -auto-approve tfplan

  # Production requires multiple approvals
  apply-prod:
    needs: apply-staging
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production  # Has required reviewers and wait timer

    defaults:
      run:
        working-directory: infrastructure/prod

    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ secrets.PROD_AWS_ACCOUNT_ID }}:role/TerraformRole
          aws-region: ${{ env.AWS_REGION }}

      - run: terraform init -backend-config="bucket=terraform-state-prod"

      - uses: actions/download-artifact@v4
        with:
          name: tfplan-prod
          path: infrastructure/prod

      - run: terraform apply tfplan

Drift Detection (Scheduled)

# .github/workflows/terraform-drift.yml
# Detect when reality doesn't match our config

name: Terraform Drift Detection

on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 06:00 UTC (Actions cron schedules are always UTC)
  workflow_dispatch:

jobs:
  drift-check:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [dev, staging, prod]

    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: '1.6.0'

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ secrets[format('{0}_AWS_ACCOUNT_ID', matrix.environment)] }}:role/TerraformRole
          aws-region: us-east-1

      - name: Check for drift
        id: drift
        working-directory: infrastructure/${{ matrix.environment }}
        run: |
          terraform init -backend-config="bucket=terraform-state-${{ matrix.environment }}"

          # Plan and check for changes
          if terraform plan -detailed-exitcode; then
            echo "No drift detected"
            echo "drift=false" >> $GITHUB_OUTPUT
          else
            exit_code=$?
            if [ $exit_code -eq 2 ]; then
              echo "Drift detected!"
              echo "drift=true" >> $GITHUB_OUTPUT
            else
              echo "Error running plan"
              exit 1
            fi
          fi

      - name: Create issue for drift
        if: steps.drift.outputs.drift == 'true'
        uses: actions/github-script@v7
        with:
          script: |
            const title = `Infrastructure drift detected in ${{ matrix.environment }}`;
            const body = `Terraform detected configuration drift in the ${{ matrix.environment }} environment.

            This usually means someone made a manual change in the AWS console, or an automated process modified resources.

            Please investigate and either:
            1. Update the Terraform config to match reality
            2. Run \`terraform apply\` to restore the expected state

            Run the Terraform workflow manually to see the full diff.`;

            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: title,
              body: body,
              labels: ['infrastructure', 'drift', '${{ matrix.environment }}']
            });

Lessons Learned

  1. OIDC is better than long-lived credentials. We use role-to-assume instead of storing AWS keys in secrets. GitHub’s OIDC integration means no permanent credentials to rotate.

  2. Plan artifacts are essential. The plan you reviewed in the PR should be the exact plan that gets applied. Without artifacts, the apply might include changes that happened between the plan and apply.

  3. Environments add guardrails. GitHub Environments let you require approvals, add wait timers, and restrict deployments to specific branches. Production requires two approvals and a 10-minute wait.

  4. Drift detection catches shadow IT. Someone always tweaks something in the console “just temporarily.” Daily drift checks catch it before it causes problems.

  5. Posting plans to PRs is magical. Reviewers can see exactly what will change without running Terraform locally. It’s like a really good code diff for infrastructure.

  6. State file access is the security boundary. Anyone who can access the Terraform state file can see all your secrets (database passwords, API keys). Treat state bucket access as carefully as production database access.
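One requirement the apply jobs above don't show: GitHub only issues the OIDC token to jobs that explicitly request it via the `id-token` permission. A minimal sketch of what a job needs before `configure-aws-credentials` can assume a role (the account ID and role name are placeholders):

```yaml
jobs:
  apply-dev:
    runs-on: ubuntu-latest
    # Declaring any permissions block drops everything else to "none",
    # so contents: read must be restated for actions/checkout to work.
    permissions:
      id-token: write   # lets the job request an OIDC token to exchange for AWS credentials
      contents: read    # still needed for actions/checkout
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/TerraformRole  # placeholder account ID
          aws-region: us-east-1
```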

Case Study 5: Documentation Site with Preview Deployments

Documentation might not be as exciting as production code, but it’s just as important to keep working. This case study is about a docs site built with Docusaurus, deployed to Vercel.

The Problem

Our docs were always stale. Engineers would update code, ship it, and “update docs later” (which never happened). Docs PRs sat for days because nobody could easily see how the changes looked. Broken links accumulated until someone noticed the 404s in analytics.

The Solution: Preview Deploys and Automated Checks

# .github/workflows/docs.yml
# Documentation CI/CD with preview deployments

name: Documentation

on:
  push:
    branches: [main]
    paths:
      - 'docs/**'
      - '.github/workflows/docs.yml'
  pull_request:
    paths:
      - 'docs/**'
      - '.github/workflows/docs.yml'

defaults:
  run:
    working-directory: docs

jobs:
  # Quick checks first
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
          cache-dependency-path: docs/package-lock.json

      - run: npm ci

      # Check for broken Markdown links
      - name: Check internal links
        run: npm run lint:links

      # Spell check
      - name: Spell check
        run: npm run lint:spelling

      # Check that code examples compile
      - name: Validate code blocks
        run: npm run lint:code-blocks

  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
          cache-dependency-path: docs/package-lock.json

      - run: npm ci
      - run: npm run build

      - uses: actions/upload-artifact@v4
        with:
          name: docs-build
          path: docs/build

  # Preview deployments for PRs
  preview:
    needs: [lint, build]
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest

    permissions:
      pull-requests: write
      deployments: write

    steps:
      - uses: actions/checkout@v4

      - uses: actions/download-artifact@v4
        with:
          name: docs-build
          path: docs/build

      # Deploy to Vercel preview
      - name: Deploy to Vercel
        id: deploy
        uses: amondnet/vercel-action@v25
        with:
          vercel-token: ${{ secrets.VERCEL_TOKEN }}
          vercel-org-id: ${{ secrets.VERCEL_ORG_ID }}
          vercel-project-id: ${{ secrets.VERCEL_DOCS_PROJECT_ID }}
          working-directory: docs/build

      # Comment with preview URL
      - name: Comment Preview URL
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `Preview deployment ready!\n\n${process.env.PREVIEW_URL}\n\nThis preview will be available until the PR is closed.`
            });
        env:
          PREVIEW_URL: ${{ steps.deploy.outputs.preview-url }}

      # Run Lighthouse on preview
      - name: Lighthouse CI
        uses: treosh/lighthouse-ci-action@v11
        with:
          urls: ${{ steps.deploy.outputs.preview-url }}
          configPath: docs/lighthouserc.json
          uploadArtifacts: true

  # Full link validation after preview is up
  validate-links:
    needs: preview
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Check all links including external ones
      - name: Check all links
        uses: lycheeverse/lychee-action@v1
        with:
          args: --verbose --no-progress --exclude-mail './docs/**/*.md'
          fail: true

  # Production deployment
  deploy:
    needs: [lint, build]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - uses: actions/download-artifact@v4
        with:
          name: docs-build
          path: docs/build

      - name: Deploy to Vercel (Production)
        uses: amondnet/vercel-action@v25
        with:
          vercel-token: ${{ secrets.VERCEL_TOKEN }}
          vercel-org-id: ${{ secrets.VERCEL_ORG_ID }}
          vercel-project-id: ${{ secrets.VERCEL_DOCS_PROJECT_ID }}
          vercel-args: '--prod'
          working-directory: docs/build

      # Purge CDN cache
      - name: Purge Cloudflare Cache
        run: |
          curl -X POST "https://api.cloudflare.com/client/v4/zones/${{ secrets.CF_ZONE_ID }}/purge_cache" \
            -H "Authorization: Bearer ${{ secrets.CF_API_TOKEN }}" \
            -H "Content-Type: application/json" \
            --data '{"purge_everything":true}'

      # Check production site
      - name: Smoke test
        run: |
          sleep 30  # Wait for deployment to propagate
          curl -f https://docs.example.com || exit 1
          curl -f https://docs.example.com/getting-started || exit 1

Lighthouse Configuration (docs/lighthouserc.json)

{
  "ci": {
    "assert": {
      "assertions": {
        "categories:performance": ["error", { "minScore": 0.9 }],
        "categories:accessibility": ["error", { "minScore": 0.95 }],
        "categories:best-practices": ["error", { "minScore": 0.9 }],
        "categories:seo": ["error", { "minScore": 0.9 }]
      }
    }
  }
}

Lessons Learned

  1. Preview deployments transform docs PRs. Reviewers actually review when they can click a link and see the result. Our docs PR review time dropped from days to hours.

  2. Link checking catches so many problems. Internal links break when pages move. External links break when third parties change their URLs. Automated checking catches these before users do.

  3. Lighthouse catches performance regressions. Someone added an unoptimized 5MB image to the docs. Lighthouse caught it before merge.

  4. Code block validation is underrated. Nothing’s worse than docs with syntax errors in the code examples. We extract code blocks and run them through compilers/linters.

  5. Cache purging is essential. CDNs love to cache aggressively. Without explicit purging, users might see stale docs for hours after a deployment.

  6. Smoke tests catch deployment failures. Just because the deploy succeeded doesn’t mean the site works. A quick curl against key pages catches the obvious problems.
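The fixed `sleep 30` in lesson 6's smoke test works until the day propagation takes 31 seconds. A polling loop is more robust; here's a sketch using the same placeholder URLs:

```yaml
      - name: Smoke test with retry
        run: |
          # Give the CDN up to ~2 minutes per page instead of a fixed sleep
          for path in "" "getting-started"; do
            for attempt in 1 2 3 4 5 6; do
              if curl -fsS "https://docs.example.com/$path" > /dev/null; then
                break
              elif [ "$attempt" -eq 6 ]; then
                echo "https://docs.example.com/$path never came up"
                exit 1
              fi
              sleep 20
            done
          done
```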

Common Patterns Across All Case Studies

Looking back at these workflows, some patterns keep showing up:

1. Fail Fast, Then Fail Completely

Run quick checks first (linting, validation), then slower checks (tests, builds). If the quick checks fail, don’t waste time on the slow ones.

But when you do run comprehensive checks, use fail-fast: false in matrix builds. You want to know all the failures, not just the first one.
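A sketch of both halves of the pattern (job names and commands are illustrative):

```yaml
jobs:
  lint:                 # cheap gate: fails in seconds, not minutes
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run lint

  test:
    needs: lint         # the expensive matrix never starts if lint fails
    strategy:
      fail-fast: false  # let every combination finish so you see all failures
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test
```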

2. Artifact Everything

Plans, builds, test results—upload them all. You’ll want them for debugging, and you’ll want them for later jobs in the workflow.
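A sketch of the upload side; `if: always()` and an explicit `retention-days` (both real inputs, values here illustrative) keep failing runs debuggable without hoarding storage:

```yaml
      - uses: actions/upload-artifact@v4
        if: always()          # upload test reports even when the job failed
        with:
          name: test-results
          path: reports/
          retention-days: 14  # long enough to debug, short enough to stay cheap
```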

3. Post to the PR

Put information where developers already are. Post Terraform plans as PR comments. Post preview URLs as PR comments. Post Lighthouse scores as PR comments. Make it impossible to ignore.

4. Environments Add Guardrails

GitHub Environments let you require approvals, add wait timers, restrict to specific branches, and provide environment-specific secrets. Use them.

5. Clean Up After Yourself

Temporary files, keychains, credentials—delete them in a step with if: always(). Don’t assume the workflow will complete normally.
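For example (the file path and keychain name are hypothetical):

```yaml
      - name: Clean up secrets on disk
        if: always()   # runs even after a failure or cancellation
        run: |
          rm -f "$RUNNER_TEMP/deploy_key"
          # On macOS signing jobs, also remove the temporary keychain
          security delete-keychain build.keychain 2>/dev/null || true
```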

6. Concurrency Controls Prevent Chaos

Without concurrency controls, pushing twice in quick succession runs two workflows simultaneously. One will fail in confusing ways, or worse, both will succeed in incompatible ways.

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

The Reality Check

I want to be honest: getting to these workflows took time. Lots of failed runs, lots of debugging, lots of reading documentation. The first version of each workflow was much simpler—and much less reliable.

The workflows you see in tutorials are usually version 1. The workflows you see in production are version 5 (or 15). Don’t be discouraged if your first attempt doesn’t handle every edge case. Start simple, add complexity as you hit real problems.

And when you do hit problems, capture them in the workflow. That comment that says “DO NOT REMOVE - fixes race condition on Windows”? That’s institutional knowledge. Future you (or future teammates) will thank you.

GitHub Actions isn’t just CI/CD. It’s a platform for encoding your team’s operational knowledge into code. Every weird conditional, every retry loop, every cleanup step—they’re all lessons learned the hard way.

Your workflows are documentation of how things actually work, not how they’re supposed to work in theory.

Next up in this series: niche and unexpected uses—the weird, wonderful, and surprisingly practical things people build with Actions.