Auto-triage pipeline setup runbook

This runbook covers the manual, one-off provisioning that the auto-triage pipeline needs but cannot do for itself: the GitHub App, the secrets, the Terraform apply, the OTEL export, an end-to-end smoke test, and the rollback procedure.

Do these in order. Start everything in staging and only promote to production once the smoke test passes.

This is the human-only part

Everything here grants the pipeline its capabilities. The pipeline cannot create its own GitHub App, write its own secrets, or apply its own Terraform. Treat this page as the checklist for standing the pipeline up and for auditing it later.

1. Create the two GitHub credentials

The pipeline talks to GitHub in two different places, and they need different credentials. Do not try to reuse one for both:

Perci Auto-fix GitHub App (opening the PR). error-autofix.yml mints a fresh App installation token at runtime (actions/create-github-app-token) and opens the draft PR with it. PRs must be opened with this App token, not the default GITHUB_TOKEN, because GITHUB_TOKEN-opened PRs do not trigger downstream CI (backend-pr-checks.yml and the other PR checks would never run).
Dispatch PAT (firing the trigger). The Datadog webhook calls repository_dispatch, which needs a long-lived, fine-grained personal access token on a machine user — not an App installation token. Installation tokens expire after ~1 hour, but the webhook needs a credential that stays valid between alerts.

1a. Perci Auto-fix GitHub App

In the percihealth organisation, create a new GitHub App named Perci Auto-fix.
Grant it exactly these least-privilege repository permissions:

Permission	Access	Why
Contents	Read and write	Create branches and commits for the fix.
Pull requests	Read and write	Open and update the draft PR.
Checks	Read	See whether CI passed before reporting back.
Metadata	Read	Mandatory baseline permission for any App.

Grant nothing else. The App has no access to Actions secrets, deployments, admin, or organisation settings.
Disable Webhook (the App is used only for token minting, not event delivery).
Generate a private key and download the .pem. Note the App ID.
Install the App on percihealth/perci-platform-monorepo only, scoped to that single repository.

1b. Dispatch PAT (machine user)

On a dedicated machine user (not a personal account), create a fine-grained PAT scoped to percihealth/perci-platform-monorepo with Contents: Read and write (the permission repository_dispatch requires). Grant nothing else.
Set an expiry and add a calendar reminder to rotate it — nothing rotates it for you.
Store it as the GitHub Actions secret DATADOG_AUTOFIX_GITHUB_DISPATCH_PAT. The deploy workflow passes it to Terraform as datadog_autofix_github_dispatch_pat, which stores it as the Datadog webhook custom variable AUTOFIX_GITHUB_DISPATCH_TOKEN used in the webhook's Authorization: Bearer header. (For a manual apply, pass the same value as TF_VAR_datadog_autofix_github_dispatch_pat.)
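Before wiring the PAT into Datadog, it can help to confirm that GitHub accepts it for repository_dispatch. The sketch below assumes the token is exported as DISPATCH_PAT and that the caller supplies an event type; both names, and the DRY_RUN switch, are hypothetical and not part of the pipeline. The endpoint and headers are GitHub's REST call for creating a repository dispatch event, which answers 204 when the token is accepted.

```shell
#!/usr/bin/env bash
# Sketch: send one test repository_dispatch with the machine-user PAT.
# Assumptions: DISPATCH_PAT, DRY_RUN and the event type are illustrative names.

REPO="percihealth/perci-platform-monorepo"

fire_test_dispatch() {
  local event_type="${1:-}" body
  if [ -z "$event_type" ] || [ -z "${DISPATCH_PAT:-}" ]; then
    echo "usage: DISPATCH_PAT=<pat> fire_test_dispatch <event-type>" >&2
    return 2
  fi
  body=$(printf '{"event_type":"%s","client_payload":{"test":true}}' "$event_type")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Show the request without sending it and without printing the token.
    printf 'POST https://api.github.com/repos/%s/dispatches %s\n' "$REPO" "$body"
    return 0
  fi
  # Prints the HTTP status code; --fail makes curl exit non-zero on 4xx/5xx.
  curl --fail --silent --show-error \
    -X POST "https://api.github.com/repos/${REPO}/dispatches" \
    -H "Accept: application/vnd.github+json" \
    -H "Authorization: Bearer ${DISPATCH_PAT}" \
    -H "X-GitHub-Api-Version: 2022-11-28" \
    -d "$body" \
    -o /dev/null -w '%{http_code}\n'
}
```

If the triage workflow filters on specific event types, an event type that no workflow listens for should exercise authentication only, without starting a triage run.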

2. Set GitHub Actions secrets and the kill switch

Add these as repository secrets under Settings -> Secrets and variables -> Actions:

Secret	Purpose
`ANTHROPIC_API_KEY`	The key the headless Claude Code agent uses to run the model in our runner.
`JIRA_API_TOKEN`	Token for creating and commenting on PPL tickets.
`JIRA_BASE_URL`	`https://percihealth.atlassian.net`.
`JIRA_USER_EMAIL`	The service account email the Jira token belongs to.
`AUTOFIX_APP_ID`	The Perci Auto-fix App ID from step 1a.
`AUTOFIX_APP_PRIVATE_KEY`	The contents of the App's `.pem` private key.
`DATADOG_AUTOFIX_GITHUB_DISPATCH_PAT`	The machine-user dispatch PAT from step 1b. The deploy workflow passes it to terraform apply as `TF_VAR_datadog_autofix_github_dispatch_pat`.

Add these as repository variables (not secrets):

Variable	Purpose
`AUTOFIX_ENABLED`	Master switch for opening fix PRs. `true` enables the auto-fix hand-off; anything else disables it. Triage and ticketing still run regardless — only the PR-opening stage is gated.
`DATADOG_AUTOFIX_STAGING_ENABLED`	`true` makes the staging deploy create the staging Error Tracking monitors + dispatch webhook, to validate the listen -> triage path. Defaults off.
`STAGING_FIX_ENABLED`	`true` lets non-production-origin (e.g. staging) errors auto-fix — the staging fix-path test. Leave `false` normally; staging errors then triage to tickets only.

Kill switch first

Set AUTOFIX_ENABLED to false until the smoke test in step 5 passes. Triage and ticket creation are safe to run early; opening PRs is not.
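A quick audit of the secrets table can be sketched as a comparison between the required names and the names GitHub reports. The helper missing_names is a hypothetical name for this sketch. The usage comment assumes an authenticated GitHub CLI (gh), and the default page size of the list endpoint may need raising on repositories with many secrets.

```shell
#!/usr/bin/env bash
# Sketch: list which required secret names are absent from the repository.

REQUIRED_SECRETS="ANTHROPIC_API_KEY JIRA_API_TOKEN JIRA_BASE_URL JIRA_USER_EMAIL AUTOFIX_APP_ID AUTOFIX_APP_PRIVATE_KEY DATADOG_AUTOFIX_GITHUB_DISPATCH_PAT"

# missing_names "<required, space-separated>" "<present, whitespace-separated>"
# Prints each required name that does not appear in the present list.
missing_names() {
  local required="$1" present name
  # Collapse newlines and tabs so every present name is space-delimited.
  present=" $(echo $2) "
  for name in $required; do
    case "$present" in
      *" $name "*) ;;
      *) echo "$name" ;;
    esac
  done
}

# Usage (assumes gh is installed and authenticated):
#   present=$(gh api repos/percihealth/perci-platform-monorepo/actions/secrets --jq '.secrets[].name')
#   missing_names "$REQUIRED_SECRETS" "$present"
```

Empty output means every secret in the table exists. Secret values cannot be read back, so this checks presence only, not correctness.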

3. Apply the Terraform monitors and dispatch webhook

The Error Tracking monitors (New Issue + High Impact, per service), the observability dashboard, and the repository_dispatch webhook live in the datadog-observability module (autofix_monitors.tf, autofix_dashboard.tf, autofix.tf). They are gated on create_autofix_pipeline, wired to datadog_autofix_enabled (default false in staging, intended true in production).

Supply the dispatch PAT from step 1b as datadog_autofix_github_dispatch_pat (via a .tfvars, a TF_VAR_… env var, or your secret manager — never commit it), and set datadog_autofix_enabled = true for the environment you are enabling.

cd infrastructure/terraform/environments/staging   # production once verified
export TF_VAR_datadog_autofix_enabled=true
export TF_VAR_datadog_autofix_github_dispatch_pat=…   # the machine-user PAT from step 1b
terraform init
terraform plan -out tfplan
terraform apply tfplan

Confirm in Datadog (EU site, datadoghq.eu) that the monitors exist and that each monitor message references the dispatch webhook as @webhook-perci-autofix-dispatch. Promote to infrastructure/terraform/environments/production only once staging is verified.

The PAT passes through Terraform state

datadog_autofix_github_dispatch_pat is marked sensitive, but its value is written to Terraform state (the gcs backend). Keep that bucket access-controlled and encrypted, and rotate the PAT if state access is ever in doubt.

Datadog credentials

Terraform authenticates to Datadog with DATADOG_API_KEY and DATADOG_APP_KEY against DATADOG_SITE=datadoghq.eu. These already exist for the platform; reuse them rather than minting new ones.

4. Configure Claude Code OTEL export to Datadog EU

So the pipeline reports its own cost and reliability, export Claude Code's OpenTelemetry to the Datadog EU OTLP endpoint. Set these in the auto-fix workflow environment:

Variable	Value
`CLAUDE_CODE_ENABLE_TELEMETRY`	`1`
`OTEL_METRICS_EXPORTER`	`otlp`
`OTEL_TRACES_EXPORTER`	`otlp`
`OTEL_EXPORTER_OTLP_PROTOCOL`	`http/protobuf`
`OTEL_EXPORTER_OTLP_ENDPOINT`	The Datadog EU OTLP intake endpoint.
`OTEL_EXPORTER_OTLP_HEADERS`	`dd-api-key=<DATADOG_API_KEY>` (from the existing secret).

Verify a run shows up in Datadog with per-run token usage and latency before relying on the budget guard.
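For a local check of the same settings, the table can be expressed as shell exports. This is a sketch only: in the workflow these values would normally sit in the job's environment rather than in a script. DD_OTLP_ENDPOINT is a hypothetical variable standing in for the Datadog EU OTLP intake URL, which this page does not spell out.

```shell
#!/usr/bin/env bash
# Sketch: export the telemetry variables from the table above.
# Assumptions: DD_OTLP_ENDPOINT is an illustrative name for the intake URL;
# DATADOG_API_KEY is the existing platform secret.

export_claude_otel_env() {
  if [ -z "${DD_OTLP_ENDPOINT:-}" ] || [ -z "${DATADOG_API_KEY:-}" ]; then
    echo "DD_OTLP_ENDPOINT and DATADOG_API_KEY must both be set" >&2
    return 1
  fi
  export CLAUDE_CODE_ENABLE_TELEMETRY=1
  export OTEL_METRICS_EXPORTER=otlp
  export OTEL_TRACES_EXPORTER=otlp
  export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
  export OTEL_EXPORTER_OTLP_ENDPOINT="$DD_OTLP_ENDPOINT"
  export OTEL_EXPORTER_OTLP_HEADERS="dd-api-key=${DATADOG_API_KEY}"
}
```

Failing early when either input is missing avoids a run that silently exports nothing, which would leave the budget guard without data.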

5. End-to-end smoke test

Prove the whole path with a seeded error before enabling auto-fix for real.

With AUTOFIX_ENABLED still false, emit a seeded test error into the staging service so Datadog Error Tracking opens a new issue. Tag it so it is obviously a test.
Confirm the New Issue monitor fires and the Datadog webhook calls repository_dispatch.
Confirm error-triage.yml runs, the scoring rubric scores the seeded error, and a PPL ticket is created and routed to the expected team.
Re-emit the same error and confirm the dedupe step comments on the existing ticket instead of opening a second one.
Set AUTOFIX_ENABLED to true. Trigger one auto-fixable seeded error and confirm error-autofix.yml opens a draft PR into develop using the App token, and that backend-pr-checks.yml runs on that PR (this is the proof the App token, not GITHUB_TOKEN, opened it).
Confirm the OTEL run appears in Datadog.

Close the seeded ticket and the test PR when done.
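The PR checks in the smoke test can be partly scripted. The sketch below covers only two of the properties named above, that the PR is a draft and that it targets develop. check_autofix_pr and PR_NUMBER are hypothetical names, and the usage comment assumes an authenticated GitHub CLI.

```shell
#!/usr/bin/env bash
# Sketch: assert that a smoke-test PR is a draft into develop.

# check_autofix_pr <isDraft> <baseRefName>
# Returns 0 when both properties hold, 1 otherwise, and explains failures on stderr.
check_autofix_pr() {
  local is_draft="${1:-}" base="${2:-}" status=0
  if [ "$is_draft" != "true" ]; then
    echo "PR is not a draft" >&2
    status=1
  fi
  if [ "$base" != "develop" ]; then
    echo "PR targets '$base', expected develop" >&2
    status=1
  fi
  return "$status"
}

# Usage (assumes gh is installed and authenticated; PR_NUMBER is a placeholder):
#   read -r draft base < <(gh pr view "$PR_NUMBER" \
#     --repo percihealth/perci-platform-monorepo \
#     --json isDraft,baseRefName --jq '[.isDraft, .baseRefName] | @tsv')
#   check_autofix_pr "$draft" "$base"
#   gh pr checks "$PR_NUMBER" --repo percihealth/perci-platform-monorepo
```

The final gh pr checks call lists the checks that ran on the PR. Confirming that backend-pr-checks.yml appears there is still a manual read of that output.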

5b. Validate in staging (recommended before prod)

Staging is the safe place to exercise the pipeline. Two flags scope it:

Set DATADOG_AUTOFIX_STAGING_ENABLED=true and redeploy staging to create the staging monitors and dispatch webhook. Staging errors then triage into PPL tickets but open no PRs, because auto-fix is production-only by default.
To test the fix path on demand, use the workflow_dispatch entry point on the Datadog Error Triage workflow (Actions tab): supply a real staging Error Tracking issue_id and set force_fix=true. With AUTOFIX_ENABLED=true, that single run goes end-to-end: triage -> branch -> agent -> verify gate -> draft PR into develop.

No reproduction or monitor threshold is needed: you point the run at an existing staging issue id and triage enriches it from the real Datadog issue. Dedupe still applies, so re-running the same id comments on the existing ticket; clear the datadog-issue-<id> label or use a fresh id to re-test. Close the draft PR and ticket afterwards. Backend errors validate most reliably because their traces symbolicate; confirm staging Flutter source maps are uploaded before expecting a Flutter fix.
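The on-demand run can also be started from a terminal instead of the Actions tab. This sketch assumes an authenticated GitHub CLI and that the workflow can be addressed by the display name Datadog Error Triage. The input names issue_id and force_fix come from the description above, while run_forced_fix and DRY_RUN are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: start one forced fix-path run for an existing staging issue.

run_forced_fix() {
  local issue_id="${1:-}"
  if [ -z "$issue_id" ]; then
    echo "usage: run_forced_fix <staging-issue-id>" >&2
    return 2
  fi
  # Build the command once so the dry run prints what would be executed.
  set -- gh workflow run "Datadog Error Triage" \
    --repo percihealth/perci-platform-monorepo \
    -f "issue_id=${issue_id}" \
    -f force_fix=true
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

Running it with DRY_RUN=1 first shows the exact command, which is a cheap way to double-check the issue id before a run that may open a draft PR.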

6. Rollback and kill switch

If the pipeline misbehaves, in increasing order of severity:

Stop new auto-fix PRs. Set the AUTOFIX_ENABLED repository variable to false. Triage and ticketing keep running; no PRs open. This is the fast, reversible stop.
Stop triage as well. Mute or disable the Datadog monitors (or revert the Terraform monitor changes with terraform apply) so no repository_dispatch events fire.
Revoke capability. If the App is implicated (PR opening), rotate or remove AUTOFIX_APP_PRIVATE_KEY or uninstall the Perci Auto-fix App. If the dispatch PAT is implicated (the trigger), revoke the machine-user PAT — the webhook then fails to authenticate and no repository_dispatch events fire.

Any auto-fix PR already open is harmless: it is a draft, it cannot merge itself, and it goes through the same PR review gate as any other PR. Close it if it is unwanted.
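The first and fastest stop, flipping AUTOFIX_ENABLED, can be sketched as a one-line GitHub CLI call wrapped in a guard against typos. This assumes an authenticated gh with permission to edit repository variables; set_autofix_enabled and DRY_RUN are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: set the AUTOFIX_ENABLED repository variable to true or false.

set_autofix_enabled() {
  local value="${1:-}"
  case "$value" in
    true|false) ;;
    *)
      echo "usage: set_autofix_enabled true|false" >&2
      return 2
      ;;
  esac
  set -- gh variable set AUTOFIX_ENABLED --body "$value" \
    --repo percihealth/perci-platform-monorepo
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

Restricting the value to true or false is a design choice for this sketch: the pipeline treats anything other than true as disabled, so a misspelt value would still stop PRs but would be harder to audit later.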

Auto-triage pipeline design: what this runbook provisions.
PR reviews: the review gate every auto-fix PR passes through.
Branching & releases: auto-fix PRs target develop, never main.