Auto-triage pipeline setup runbook
The manual, one-off provisioning the auto-triage pipeline needs but cannot do for itself: the GitHub App, the secrets, the Terraform apply, the OTEL export, an end-to-end smoke test, and the rollback procedure.
Do these in order. Start everything in staging and only promote to production once the smoke test passes.
This is the human-only part
Everything here grants the pipeline its capabilities. The pipeline cannot create its own GitHub App, write its own secrets, or apply its own Terraform. Treat this page as the checklist for standing the pipeline up and for auditing it later.
1. Create the two GitHub credentials
The pipeline talks to GitHub in two different places, and they need different credentials. Do not try to reuse one for both:
- Perci Auto-fix GitHub App (opening the PR). error-autofix.yml mints a fresh App installation token at runtime (actions/create-github-app-token) and opens the draft PR with it. PRs must be opened with this App token, not the default GITHUB_TOKEN, because GITHUB_TOKEN-opened PRs do not trigger downstream CI (backend-pr-checks.yml and the other PR checks would never run).
- Dispatch PAT (firing the trigger). The Datadog webhook calls repository_dispatch, which needs a long-lived, fine-grained personal access token on a machine user, not an App installation token. Installation tokens expire after ~1 hour, but the webhook needs a credential that stays valid between alerts.
1a. Perci Auto-fix GitHub App
1. In the percihealth organisation, create a new GitHub App named Perci Auto-fix.
2. Grant it exactly these least-privilege repository permissions:
| Permission | Access | Why |
|---|---|---|
| Contents | Read and write | Create branches and commits for the fix. |
| Pull requests | Read and write | Open and update the draft PR. |
| Checks | Read | See whether CI passed before reporting back. |
| Metadata | Read | Mandatory baseline permission for any App. |
Grant nothing else. The App has no access to Actions secrets, deployments, admin, or
organisation settings.
3. Disable Webhook (the App is used only for token minting, not event delivery).
4. Generate a private key and download the .pem. Note the App ID.
5. Install the App on percihealth/perci-platform-monorepo only, scoped to that
single repository.
1b. Dispatch PAT (machine user)
- On a dedicated machine user (not a personal account), create a fine-grained PAT scoped to percihealth/perci-platform-monorepo with Contents: Read and write (the permission repository_dispatch requires). Grant nothing else.
- Set an expiry and add a calendar reminder to rotate it; nothing rotates it for you.
- Store it as the GitHub Actions secret DATADOG_AUTOFIX_GITHUB_DISPATCH_PAT. The deploy workflow passes it to Terraform as datadog_autofix_github_dispatch_pat, which stores it as the Datadog webhook custom variable AUTOFIX_GITHUB_DISPATCH_TOKEN used in the webhook's Authorization: Bearer header. (For a manual apply, pass the same value as TF_VAR_datadog_autofix_github_dispatch_pat.)
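To check the dispatch PAT by hand before wiring it into Datadog, the request the webhook makes can be approximated with curl. This is a minimal sketch: the endpoint and headers follow GitHub's documented repository_dispatch API, but the event type and the issue_id payload field used here are placeholders, not necessarily the names the real webhook and error-triage.yml use.

```shell
# Sketch: fire a repository_dispatch event by hand to confirm the PAT works.
# ASSUMPTION: the event type and the client_payload field name are
# hypothetical; substitute whatever the real webhook sends.

REPO="percihealth/perci-platform-monorepo"

# Build the JSON body for POST /repos/{owner}/{repo}/dispatches.
# $1 = event type, $2 = issue id. No JSON escaping is done, so the
# arguments must not contain double quotes or backslashes.
build_dispatch_body() {
  printf '{"event_type":"%s","client_payload":{"issue_id":"%s"}}' "$1" "$2"
}

# Send the event and print only the HTTP status code.
# $1 = PAT, $2 = event type, $3 = issue id.
# GitHub answers 204 on success, 401 for a bad token, and 403 or 404
# when the token lacks access to the repository.
send_dispatch() {
  curl -sS -o /dev/null -w '%{http_code}\n' -X POST \
    -H "Accept: application/vnd.github+json" \
    -H "Authorization: Bearer $1" \
    -H "X-GitHub-Api-Version: 2022-11-28" \
    -d "$(build_dispatch_body "$2" "$3")" \
    "https://api.github.com/repos/$REPO/dispatches"
}
```

Calling send_dispatch with the PAT, an event type, and a test issue id prints the status code. Bear in mind that a successful call is a real dispatch: if the event type matches what the triage workflow listens for, the workflow will run.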
2. Set GitHub Actions secrets and the kill switch
Add these as repository secrets under Settings -> Secrets and variables -> Actions:
| Secret | Purpose |
|---|---|
| ANTHROPIC_API_KEY | The key the headless Claude Code agent uses to run the model in our runner. |
| JIRA_API_TOKEN | Token for creating and commenting on PPL tickets. |
| JIRA_BASE_URL | https://percihealth.atlassian.net. |
| JIRA_USER_EMAIL | The service account email the Jira token belongs to. |
| AUTOFIX_APP_ID | The Perci Auto-fix App ID from step 1a. |
| AUTOFIX_APP_PRIVATE_KEY | The contents of the App's .pem private key. |
| DATADOG_AUTOFIX_GITHUB_DISPATCH_PAT | The machine-user dispatch PAT from step 1b. The deploy workflow passes it to terraform apply as TF_VAR_datadog_autofix_github_dispatch_pat. |
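A quick preflight can catch a forgotten secret before the first run. The sketch below compares a list of secret names against the seven listed in the table; the function name is illustrative and the check only looks at names, never values.

```shell
# Sketch: report which required secrets are not yet set on the repository.
# The list mirrors the secrets table in this section.
REQUIRED_SECRETS="ANTHROPIC_API_KEY JIRA_API_TOKEN JIRA_BASE_URL JIRA_USER_EMAIL AUTOFIX_APP_ID AUTOFIX_APP_PRIVATE_KEY DATADOG_AUTOFIX_GITHUB_DISPATCH_PAT"

# Reads existing secret names on stdin (one per line) and prints each
# required name that is absent. Prints nothing when all are present.
missing_secrets() {
  present=$(cat)
  for name in $REQUIRED_SECRETS; do
    # -x: whole-line match, -F: literal string, -q: no output
    printf '%s\n' "$present" | grep -qxF "$name" || printf '%s\n' "$name"
  done
}
```

One way to feed it, assuming the GitHub CLI is authenticated and that the first column of its piped output is the secret name: gh secret list --repo percihealth/perci-platform-monorepo | awk '{print $1}' | missing_secrets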
Add these as repository variables (not secrets):
| Variable | Purpose |
|---|---|
| AUTOFIX_ENABLED | Master switch for opening fix PRs. true enables the auto-fix hand-off; anything else disables it. Triage and ticketing still run regardless; only the PR-opening stage is gated. |
| DATADOG_AUTOFIX_STAGING_ENABLED | true makes the staging deploy create the staging Error Tracking monitors + dispatch webhook, to validate the listen -> triage path. Defaults off. |
| STAGING_FIX_ENABLED | true lets non-production-origin (e.g. staging) errors auto-fix; this is the staging fix-path test. Leave false normally; staging errors then triage to tickets only. |
Kill switch first
Set AUTOFIX_ENABLED to false until the smoke test in step 5 passes. Triage and
ticket creation are safe to run early; opening PRs is not.
3. Apply the Terraform monitors and dispatch webhook
The Error Tracking monitors (New Issue + High Impact, per service), the observability
dashboard, and the repository_dispatch webhook live in the datadog-observability
module (autofix_monitors.tf, autofix_dashboard.tf, autofix.tf). They are gated on
create_autofix_pipeline, wired to datadog_autofix_enabled (default false in
staging, intended true in production).
Supply the dispatch PAT from step 1b as datadog_autofix_github_dispatch_pat (via a
.tfvars, a TF_VAR_… env var, or your secret manager — never commit it), and set
datadog_autofix_enabled = true for the environment you are enabling.
    cd infrastructure/terraform/environments/staging # production once verified
    export TF_VAR_datadog_autofix_enabled=true
    export TF_VAR_datadog_autofix_github_dispatch_pat=… # the machine-user PAT from step 1b
    terraform init
    terraform plan -out tfplan
    terraform apply tfplan
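Because an empty PAT would still plan cleanly and only fail later at dispatch time, it can help to check the two inputs before running terraform plan. The following is a hedged sketch; the function name is made up for illustration and it checks only that the variables are set, not that the PAT is valid.

```shell
# Sketch: verify the two TF_VAR_ inputs before planning.
# Returns 0 when both look right; otherwise prints what is wrong to
# stderr and returns 1.
check_tf_inputs() {
  status=0
  if [ "${TF_VAR_datadog_autofix_enabled:-}" != "true" ]; then
    echo "TF_VAR_datadog_autofix_enabled is not 'true'" >&2
    status=1
  fi
  if [ -z "${TF_VAR_datadog_autofix_github_dispatch_pat:-}" ]; then
    echo "TF_VAR_datadog_autofix_github_dispatch_pat is empty" >&2
    status=1
  fi
  return "$status"
}
```

Used as a guard, for example: check_tf_inputs && terraform plan -out tfplan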
Confirm in Datadog (EU site, datadoghq.eu) that the monitors exist and that each monitor
message references the dispatch webhook as @webhook-perci-autofix-dispatch. Promote to
infrastructure/terraform/environments/production only once staging is verified.
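The Datadog check can also be scripted. This sketch assumes jq is installed and that DATADOG_API_KEY and DATADOG_APP_KEY are exported in the shell; it uses the Datadog v1 monitors endpoint on the EU site and lists every monitor whose message mentions the webhook handle, so confirm the names it prints are the Error Tracking monitors you expect.

```shell
WEBHOOK_HANDLE="@webhook-perci-autofix-dispatch"

# Prints "ok" when a monitor message mentions the dispatch webhook,
# "missing" otherwise. $1 = the monitor message text.
check_monitor_message() {
  case "$1" in
    *"$WEBHOOK_HANDLE"*) echo ok ;;
    *) echo missing ;;
  esac
}

# Lists the names of all monitors whose message mentions the webhook.
# Requires curl, jq, and the two Datadog keys in the environment.
monitors_referencing_webhook() {
  curl -sS \
    -H "DD-API-KEY: $DATADOG_API_KEY" \
    -H "DD-APPLICATION-KEY: $DATADOG_APP_KEY" \
    "https://api.datadoghq.eu/api/v1/monitor" |
    jq -r --arg handle "$WEBHOOK_HANDLE" \
      '.[] | select((.message // "") | contains($handle)) | .name'
}
```

check_monitor_message is useful for a message pasted from the UI; monitors_referencing_webhook gives the full list in one call.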
The PAT passes through Terraform state
datadog_autofix_github_dispatch_pat is marked sensitive, but its value is written
to Terraform state (the gcs backend). Keep that bucket access-controlled and encrypted,
and rotate the PAT if state access is ever in doubt.
Datadog credentials
Terraform authenticates to Datadog with DATADOG_API_KEY and DATADOG_APP_KEY
against DATADOG_SITE=datadoghq.eu. These already exist for the platform; reuse them
rather than minting new ones.
4. Configure Claude Code OTEL export to Datadog EU
So the pipeline reports its own cost and reliability, export Claude Code's OpenTelemetry to the Datadog EU OTLP endpoint. Set these in the auto-fix workflow environment:
| Variable | Value |
|---|---|
| CLAUDE_CODE_ENABLE_TELEMETRY | 1 |
| OTEL_METRICS_EXPORTER | otlp |
| OTEL_TRACES_EXPORTER | otlp |
| OTEL_EXPORTER_OTLP_PROTOCOL | http/protobuf |
| OTEL_EXPORTER_OTLP_ENDPOINT | The Datadog EU OTLP intake endpoint. |
| OTEL_EXPORTER_OTLP_HEADERS | dd-api-key=<DATADOG_API_KEY> (from the existing secret). |
Verify a run shows up in Datadog with per-run token usage and latency before relying on the budget guard.
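For a local dry run of the export before committing the workflow change, the same variables can be set in a shell. This is a sketch only: the function name is illustrative, and the endpoint is taken as an argument rather than hard-coded, so look up the current EU OTLP intake URL in Datadog's documentation.

```shell
# Sketch: export the Claude Code telemetry variables from the table above.
# $1 = Datadog EU OTLP intake endpoint URL (not hard-coded on purpose)
# $2 = Datadog API key
export_claude_otel_env() {
  if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
    echo "usage: export_claude_otel_env <otlp_endpoint> <datadog_api_key>" >&2
    return 1
  fi
  export CLAUDE_CODE_ENABLE_TELEMETRY=1
  export OTEL_METRICS_EXPORTER=otlp
  export OTEL_TRACES_EXPORTER=otlp
  export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
  export OTEL_EXPORTER_OTLP_ENDPOINT="$1"
  export OTEL_EXPORTER_OTLP_HEADERS="dd-api-key=$2"
}
```

In the workflow itself these belong in the job's environment, with the API key coming from the existing secret rather than a shell argument.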
5. End-to-end smoke test
Prove the whole path with a seeded error before enabling auto-fix for real.
- With AUTOFIX_ENABLED still false, emit a seeded test error into the staging service so Datadog Error Tracking opens a new issue. Tag it so it is obviously a test.
- Confirm the New Issue monitor fires and the Datadog webhook calls repository_dispatch.
- Confirm error-triage.yml runs, the scoring rubric scores the seeded error, and a PPL ticket is created and routed to the expected team.
- Re-emit the same error and confirm the dedupe step comments on the existing ticket instead of opening a second one.
- Set AUTOFIX_ENABLED to true. Trigger one auto-fixable seeded error and confirm error-autofix.yml opens a draft PR into develop using the App token, and that backend-pr-checks.yml runs on that PR (this is the proof the App token, not GITHUB_TOKEN, opened it).
- Confirm the OTEL run appears in Datadog.
Close the seeded ticket and the test PR when done.
5b. Validate in staging (recommended before prod)
Staging is the safe place to exercise the pipeline. Two flags scope it:
- DATADOG_AUTOFIX_STAGING_ENABLED=true (then redeploy staging) creates the staging monitors + dispatch webhook. Staging errors now triage into PPL tickets, with no PRs, because auto-fix is production-only by default.
- To test the fix path on demand, use the workflow_dispatch entry point on the Datadog Error Triage workflow (Actions tab): supply a real staging Error Tracking issue_id and set force_fix=true. With AUTOFIX_ENABLED=true, that single run goes end-to-end: triage -> branch -> agent -> verify gate -> draft PR into develop.
No reproduction or monitor threshold is needed — you point it at an existing staging issue
id and triage enriches from the real Datadog issue. Dedupe: re-running the same id comments
on the existing ticket; clear the datadog-issue-<id> label or use a fresh id to re-test.
Close the draft PR and ticket afterwards. Backend errors validate most reliably (their
traces symbolicate); confirm staging Flutter source maps are uploaded before expecting a
Flutter fix.
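The on-demand run can also be started from a terminal instead of the Actions tab. This sketch uses the GitHub CLI and assumes the workflow's display name is exactly "Datadog Error Triage" and that its inputs are named issue_id and force_fix as described above; the function name is illustrative. It prints the command by default so it can be reviewed first.

```shell
# Sketch: trigger the staging fix-path test via workflow_dispatch.
# $1 = a real staging Error Tracking issue id.
# Prints the command by default (quoting is not preserved in the printed
# form); set RUN=1 in the environment to execute it.
trigger_staging_fix() {
  if [ -z "${1:-}" ]; then
    echo "usage: trigger_staging_fix <issue_id>" >&2
    return 1
  fi
  set -- gh workflow run "Datadog Error Triage" \
    --repo percihealth/perci-platform-monorepo \
    -f "issue_id=$1" -f force_fix=true
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}
```

For example, RUN=1 trigger_staging_fix followed by the issue id starts the run; the workflow file name could be passed in place of the display name if that is easier to pin down.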
6. Rollback and kill switch
If the pipeline misbehaves, in increasing order of severity:
- Stop new auto-fix PRs. Set the AUTOFIX_ENABLED repository variable to false. Triage and ticketing keep running; no PRs open. This is the fast, reversible stop.
- Stop triage as well. Mute or disable the Datadog monitors (or revert the Terraform monitor changes with terraform apply) so no repository_dispatch events fire.
- Revoke capability. If the App is implicated (PR opening), rotate or remove AUTOFIX_APP_PRIVATE_KEY or uninstall the Perci Auto-fix App. If the dispatch PAT is implicated (the trigger), revoke the machine-user PAT; the webhook then fails to authenticate and no repository_dispatch events fire.
Any auto-fix PR already open is harmless: it is a draft, it cannot merge itself, and it follows PR reviews like any other PR. Close it if it is unwanted.
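The first and fastest stop can be kept as a one-liner for incidents. A minimal sketch using the GitHub CLI, assuming it is authenticated with permission to edit repository variables; the function name is illustrative, and it prints the command unless RUN=1 is set.

```shell
REPO="percihealth/perci-platform-monorepo"

# Sketch: flip the kill switch so no new auto-fix PRs open.
# Prints the command by default; set RUN=1 to execute it.
kill_switch() {
  set -- gh variable set AUTOFIX_ENABLED --body false --repo "$REPO"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}
```

Running gh variable list --repo percihealth/perci-platform-monorepo afterwards shows whether the change took effect.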
Related pages
- Auto-triage pipeline design: what this runbook provisions.
- PR reviews: the review gate every auto-fix PR passes through.
- Branching & releases: auto-fix PRs target develop, never main.