Datadog auto-triage and auto-fix pipeline
How a production error reported by Datadog becomes a triaged Jira ticket and, where safe, an open pull request, with a human always reviewing before anything merges.
This page is the canonical design. The manual provisioning steps (GitHub App, secrets, Terraform apply, OTEL export, smoke test) live in the setup runbook.
Goal
Datadog already tells us when something breaks. The gap is the work between an alert firing and a fix landing: someone has to notice it, decide whether it matters, find the right ticket or open a new one, and then write the fix. This pipeline automates the deterministic parts of that work and leaves every judgement that needs a human to a human.
Specifically it aims to:
- Turn a high-signal Datadog Error Tracking issue into a Jira ticket on the PPL project, deduplicated against issues we already know about.
- Route the ticket to the owning team using team ownership and a deterministic scoring rubric, not a model's opinion.
- For a bounded class of low-risk errors, open a draft pull request that targets develop and follows the PR review and branching rules like any other change.
- Never merge without human review, and stay inside an explicit cost and blast-radius budget with a single kill switch.
Flow
flowchart TD
dd["Datadog Error Tracking<br/>(New Issue / High Impact monitor)"] -->|"webhook (machine-user PAT)"| dispatch["repository_dispatch<br/>(error-triage)"]
dispatch --> triage[".github/workflows/error-triage.yml"]
triage --> score["scripts/triage/ scoring rubric"]
score -->|"below threshold"| drop["log and stop<br/>(no ticket)"]
score -->|"at or above threshold"| dedupe["scripts/triage/ dedupe<br/>against open PPL issues"]
dedupe -->|"duplicate"| comment["comment on existing ticket"]
dedupe -->|"new"| jira["create PPL ticket<br/>routed by team ownership"]
jira -->|"auto-fixable class<br/>and AUTOFIX_ENABLED"| fix[".github/workflows/error-autofix.yml"]
fix --> action["Claude Code headless<br/>(edit-only, self-hosted runner)"]
action -->|"workflow commits + verifies,<br/>App token opens"| pr["draft PR -> develop"]
pr --> human["human review<br/>(PR reviews + CODEOWNERS)"]
jira -->|"not auto-fixable"| human
The three stages
1. Listen
Datadog Error Tracking is the source of truth for what broke. An issue is the dedup
unit: Datadog groups events that share a root cause into one issue, which moves through
the states FOR REVIEW, REVIEWED, RESOLVED, IGNORED, and EXCLUDED.
Two kinds of Terraform-managed monitor watch for the conditions worth acting on, one pair per service:
- New Issue: an issue we have not seen before appears.
- High Impact: an existing issue crosses an impact threshold (volume, affected users, or a service we care about).
These monitors live in the datadog-observability Terraform module
(infrastructure/terraform/modules/datadog-observability/autofix_monitors.tf). When a
monitor fires, its message routes to a Datadog webhook
(@webhook-perci-autofix-dispatch) that calls the GitHub repository_dispatch API with an
error-triage event type. The webhook authenticates with a long-lived machine-user PAT —
a GitHub App installation token cannot be used here because it expires hourly. See the
setup runbook.
repository_dispatch payload limits
A repository_dispatch client_payload is capped at 10 top-level keys and
65535 characters total. The webhook therefore sends only identifiers (issue id, service,
monitor id, impact counts) and lets the triage stage fetch the full detail from the
Datadog API, rather than trying to inline a stack trace into the payload.
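The limits above can be checked before a payload is ever sent. The following is a minimal sketch, assuming a hypothetical payload shape; the field names and both function names are illustrative and are not taken from the actual webhook template or from scripts/triage/.

```typescript
// Hypothetical identifier-only payload; the field names are assumptions.
interface TriagePayload {
  issueId: string;
  service: string;
  monitorId: string;
  impactedUsers: number;
  eventCount: number;
}

// Limits quoted in this page for repository_dispatch client_payload.
const MAX_TOP_LEVEL_KEYS = 10;
const MAX_PAYLOAD_CHARS = 65535;

// Returns a list of limit violations; an empty list means the payload fits.
function checkClientPayload(payload: object): string[] {
  const problems: string[] = [];
  const keyCount = Object.keys(payload).length;
  if (keyCount > MAX_TOP_LEVEL_KEYS) {
    problems.push(`too many top-level keys: ${keyCount} > ${MAX_TOP_LEVEL_KEYS}`);
  }
  const size = JSON.stringify(payload).length;
  if (size > MAX_PAYLOAD_CHARS) {
    problems.push(`payload too large: ${size} > ${MAX_PAYLOAD_CHARS} characters`);
  }
  return problems;
}

// Builds the request body for the repository_dispatch call with the
// error-triage event type named in this page.
function buildDispatchBody(payload: TriagePayload): {
  event_type: string;
  client_payload: TriagePayload;
} {
  return { event_type: "error-triage", client_payload: payload };
}
```

Keeping the payload to a handful of identifiers leaves a wide margin under both limits, which is why the stack trace is fetched later rather than inlined.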
2. Triage
.github/workflows/error-triage.yml runs on the repository_dispatch event and drives
the deterministic TypeScript in scripts/triage/:
- Fetch detail. The issue object does not carry a stack trace. The triage script reads the issue from the Error Tracking API (https://api.datadoghq.eu/api/v2/error-tracking/issues/{id}) and then fetches a linked event from the Logs or RUM API to get the stack trace and request context.
- Score. Apply the scoring rubric below. The score is a pure function of the fetched fields, so the same issue always scores the same way.
- Decide. Below the threshold, log and stop. At or above it, continue.
- Deduplicate. Search open PPL issues for a match (by Datadog issue id stored on the ticket, then by fingerprint). A match gets a comment with the new occurrence; no match gets a new ticket.
- Route. Map the affected service to an owning team via team ownership and set the ticket's component and assignee group accordingly.
The ticket records the Datadog issue id so future runs can find it, closing the loop between Datadog issue states and Jira ticket lifecycle.
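The deduplication step in the list above can be sketched as a two-pass lookup: first by the Datadog issue id stored on the ticket, then by fingerprint. This is an illustration under assumptions. The ticket and issue shapes are hypothetical, and how the fingerprint is computed is not specified on this page.

```typescript
// Hypothetical view of an open PPL ticket as returned by a Jira search.
interface OpenTicket {
  key: string;
  datadogIssueId?: string;
  fingerprint?: string;
}

// Hypothetical view of the incoming Datadog issue.
interface IncomingIssue {
  datadogIssueId: string;
  fingerprint: string;
}

type DedupeResult =
  | { kind: "duplicate"; ticketKey: string; matchedBy: "issueId" | "fingerprint" }
  | { kind: "new" };

// Prefers an exact Datadog issue id match and falls back to the fingerprint.
function dedupe(issue: IncomingIssue, openTickets: readonly OpenTicket[]): DedupeResult {
  const byId = openTickets.find((t) => t.datadogIssueId === issue.datadogIssueId);
  if (byId) {
    return { kind: "duplicate", ticketKey: byId.key, matchedBy: "issueId" };
  }
  const byFingerprint = openTickets.find((t) => t.fingerprint === issue.fingerprint);
  if (byFingerprint) {
    return { kind: "duplicate", ticketKey: byFingerprint.key, matchedBy: "fingerprint" };
  }
  return { kind: "new" };
}
```

A "duplicate" result would lead to a comment on the existing ticket, and a "new" result to a fresh ticket that records the Datadog issue id for later runs.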
3. Fix
For the bounded class of errors we trust to fix automatically, the triage stage triggers
.github/workflows/error-autofix.yml. That workflow creates a PPL-<key>/… branch off
develop, then runs Claude Code headless (claude -p) on our self-hosted runner in
edit-only mode: the agent reproduces the failure, writes a failing test, and applies a
minimal fix, but performs no git operations of its own. The workflow then runs the verify
gate, commits, and opens a draft pull request against develop.
The PR is then an ordinary PR: it follows PR reviews, needs CODEOWNERS approval for the areas it touches, and must pass CI before a human merges it. Nothing the pipeline produces merges itself.
PRs must be opened with the GitHub App token
A PR opened with the default GITHUB_TOKEN does not trigger downstream workflows
(this is a deliberate GitHub safeguard against recursive runs). The auto-fix workflow
therefore opens its PR with the Perci Auto-fix GitHub App token so that
backend-pr-checks.yml and the other PR checks actually run. See the
setup runbook for the App and its scopes.
Scoring rubric
The rubric is a deterministic points total. An issue is triaged into a ticket only when it reaches the threshold, and it is only eligible for auto-fix when it additionally falls into an auto-fixable class. Keeping this as a table (not a prompt) means the decision is reproducible and reviewable.
| Signal | Source field | Points |
|---|---|---|
| Issue state is FOR REVIEW | Error Tracking issue state | +3 |
| Affects a production service (members, clinicians, backend) | service tag | +3 |
| High Impact monitor fired (not just New Issue) | monitor id | +3 |
| Affected users above the impact threshold | impact user count | +2 |
| Error rate rising over the window | issue trend | +2 |
| First seen within the last 24 hours | issue first-seen timestamp | +1 |
| Issue state is IGNORED or EXCLUDED | Error Tracking issue state | disqualify |
| Already linked to an open PPL ticket | dedupe match | route to comment, not new ticket |
Thresholds (tunable, see PPL-2653):
- Ticket when the total is 5 or more.
- Auto-fix eligible when the total is 8 or more and the error matches an auto-fixable class (see scope guards).
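The rubric and thresholds above translate directly into a pure function. The sketch below is a minimal illustration and not the code in scripts/triage/. The input shape is a hypothetical, already-normalised view of the fetched fields, and the dedupe row is left to the deduplication step because it routes rather than scores.

```typescript
// Issue states listed in this page; the string spellings are an assumption.
type IssueState = "FOR_REVIEW" | "REVIEWED" | "RESOLVED" | "IGNORED" | "EXCLUDED";

// Hypothetical normalised signals derived from the fetched fields.
interface ScoringInput {
  state: IssueState;
  isProductionService: boolean; // service tag is members, clinicians, or backend
  highImpactMonitorFired: boolean;
  affectedUsersAboveThreshold: boolean;
  errorRateRising: boolean;
  firstSeenWithin24h: boolean;
}

const TICKET_THRESHOLD = 5;
const AUTOFIX_THRESHOLD = 8;

type ScoreResult =
  | { disqualified: true }
  | { disqualified: false; total: number; createTicket: boolean; meetsAutofixScore: boolean };

// Pure function: the same input always produces the same result.
function scoreIssue(input: ScoringInput): ScoreResult {
  if (input.state === "IGNORED" || input.state === "EXCLUDED") {
    return { disqualified: true };
  }
  let total = 0;
  if (input.state === "FOR_REVIEW") total += 3;
  if (input.isProductionService) total += 3;
  if (input.highImpactMonitorFired) total += 3;
  if (input.affectedUsersAboveThreshold) total += 2;
  if (input.errorRateRising) total += 2;
  if (input.firstSeenWithin24h) total += 1;
  return {
    disqualified: false,
    total,
    createTicket: total >= TICKET_THRESHOLD,
    meetsAutofixScore: total >= AUTOFIX_THRESHOLD,
  };
}
```

For example, a FOR REVIEW issue on a production service scores 3 + 3 = 6, which is enough for a ticket but not for auto-fix. Adding the affected-users signal brings it to 8, which meets the auto-fix score, although the scope guards still apply.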
Scope guards
The pipeline is deliberately narrow. Auto-fix only runs when all of these hold:
- The AUTOFIX_ENABLED repository variable is true (the kill switch).
- The error falls into an allow-listed, low-risk class (for example a null-safety guard, a missing-field default, or a known dependency bump), never schema, auth, payments, or data-migration code.
- The change stays within a bounded diff size and touches only paths owned by a single team, so CODEOWNERS review is unambiguous.
- The score reaches the auto-fix threshold.
Anything outside these guards stops at a Jira ticket for a human to pick up. The model
never merges, never force-pushes, and never targets main.
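The guards above combine as a logical AND, so one way to picture them is a function that returns every guard that failed. This is a sketch under stated assumptions: the class identifiers, the 50-line diff bound, and the input shape are all hypothetical, and only the score threshold of 8 comes from this page.

```typescript
// Hypothetical summary of one auto-fix candidate.
interface AutofixCandidate {
  autofixEnabled: boolean; // value of the AUTOFIX_ENABLED repository variable
  errorClass: string;
  changedLines: number;
  owningTeams: readonly string[]; // teams that own the touched paths
  score: number;
}

// Assumed identifiers for the example classes named in this page.
const ALLOWED_ERROR_CLASSES: ReadonlySet<string> = new Set([
  "null-safety-guard",
  "missing-field-default",
  "known-dependency-bump",
]);
const MAX_CHANGED_LINES = 50; // assumed bound; the real limit is not stated here
const AUTOFIX_SCORE_THRESHOLD = 8;

// Returns the names of the guards that failed; an empty list means auto-fix may run.
function failedScopeGuards(c: AutofixCandidate): string[] {
  const failed: string[] = [];
  if (!c.autofixEnabled) failed.push("kill-switch");
  if (!ALLOWED_ERROR_CLASSES.has(c.errorClass)) failed.push("error-class");
  if (c.changedLines > MAX_CHANGED_LINES) failed.push("diff-size");
  if (new Set(c.owningTeams).size !== 1) failed.push("single-owner");
  if (c.score < AUTOFIX_SCORE_THRESHOLD) failed.push("score");
  return failed;
}
```

Returning the failed guard names rather than a bare boolean makes it easy to record on the Jira ticket why a given error was left for a human.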
Engine choice
The fix stage runs Claude Code in headless mode (claude -p) on our self-hosted
blacksmith runner, in edit-only mode: the agent modifies the working tree on a branch
the workflow already created, while the workflow owns every git operation (commit, push,
draft PR) and the verify gate. We drive the agent with the prompt from
.github/agents/datadog-autofix.agent.md plus the injected run context, a bounded
--max-turns, --permission-mode acceptEdits, and an explicit --allowedTools allowlist
that excludes git mutation so the agent cannot commit, push, or open a PR itself.
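The invocation described above can be sketched as a small argument builder that refuses an allowlist containing version-control tools. The flags are the ones named on this page. The helper names, the option shape, and the comma-joined form of the allowlist are assumptions, and any concrete tool entries passed in would need checking against the installed Claude Code version.

```typescript
// Hypothetical options for one headless run.
interface AgentRunOptions {
  prompt: string; // agent prompt plus injected run context
  maxTurns: number;
  allowedTools: readonly string[];
}

// Throws if any allowlist entry mentions git or the GitHub CLI, so the
// agent cannot commit, push, or open a PR itself.
function assertNoVersionControlTools(tools: readonly string[]): void {
  const offending = tools.filter((t) => /\b(git|gh)\b/i.test(t));
  if (offending.length > 0) {
    throw new Error(`version-control tools are not allowed: ${offending.join(", ")}`);
  }
}

// Builds the argument list for the claude executable in edit-only mode.
function buildClaudeArgs(options: AgentRunOptions): string[] {
  assertNoVersionControlTools(options.allowedTools);
  return [
    "-p",
    options.prompt,
    "--max-turns",
    String(options.maxTurns),
    "--permission-mode",
    "acceptEdits",
    "--allowedTools",
    options.allowedTools.join(","),
  ];
}
```

Building the arguments as an array instead of a shell string avoids quoting problems when the prompt contains stack-trace text.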
Self-hosted, not Anthropic-managed
The agent executes inside our own GitHub Actions runner, using our
ANTHROPIC_API_KEY. It is not an Anthropic-hosted service reaching into the repo. The
repository, secrets, and network egress stay under our control, which is what lets us
keep PHI-adjacent code inside our boundary.
Why headless over claude-code-action. The official action can also run unattended
(automation mode), but it manages branches, commits, and PR creation itself; combining that
with our own commit/verify/PR steps would risk two actors racing to open the PR. Headless
gives a single, clear owner of version control — the workflow — which keeps exactly one
draft PR per fix and is simpler to reason about. We can revisit the action later if we want
its built-in PR and structured-output handling.
Observability
The pipeline reports on itself so we can see whether it is helping or just spending money.
Claude Code emits OpenTelemetry; we export those traces and metrics to Datadog EU via
OTLP (the same EU site, datadoghq.eu, the rest of the platform uses). That gives us
per-run cost, token usage, latency, and success or failure for every triage and fix run,
alongside a budget guard that trips the kill switch if spend exceeds the configured
ceiling. The OTEL export configuration is in the setup runbook.
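The budget guard reduces to comparing accumulated spend with the ceiling. The sketch below assumes a hypothetical per-run cost record in US dollars and leaves out how the costs are read back from Datadog and how the AUTOFIX_ENABLED variable is actually flipped.

```typescript
// Hypothetical per-run cost record derived from the exported metrics.
interface RunCost {
  runId: string;
  costUsd: number;
}

interface BudgetDecision {
  spentUsd: number;
  tripKillSwitch: boolean;
}

// Sums the spend for the window and trips only when it exceeds the ceiling.
function evaluateBudget(runs: readonly RunCost[], ceilingUsd: number): BudgetDecision {
  const spentUsd = runs.reduce((sum, run) => sum + run.costUsd, 0);
  return { spentUsd, tripKillSwitch: spentUsd > ceilingUsd };
}
```

Spend exactly equal to the ceiling does not trip the switch in this sketch, which matches the "exceeds" wording above.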
Phasing
| Phase | Scope |
|---|---|
| P0 | Terraform monitors in staging, Datadog webhook to repository_dispatch, and the triage stage creating routed, deduplicated PPL tickets. No auto-fix. |
| P1 | GitHub App, secrets, and kill switch in place; auto-fix workflow opens draft PRs for one narrow error class behind AUTOFIX_ENABLED. |
| P2 | OTEL observability and budget guard live; widen the auto-fixable classes as confidence grows. |
| P3 | Flutter (members and clinicians) RUM error class added once source-map upload is reliable; tune thresholds (PPL-2653). |
Design to ticket map
| Design section | PPL ticket |
|---|---|
| Terraform monitors (New Issue / High Impact) | PPL-2644 |
| Datadog webhook to repository_dispatch | PPL-2645 |
| Scoring rubric and team routing | PPL-2646 |
| Dedupe and Datadog issue / Jira ticket lifecycle | PPL-2647 |
| GitHub App and secrets / kill switch | PPL-2648 |
| error-triage.yml workflow | PPL-2649 |
| error-autofix.yml workflow | PPL-2650 |
| Observability, kill switch, and budget guard | PPL-2651 |
| Flutter RUM error class | PPL-2652 |
| Threshold and rubric tuning | PPL-2653 |
All of the above sit under epic PPL-2643.
Related pages
- PR reviews: the rules every auto-fix PR must satisfy.
- Branching & releases: why auto-fix PRs target develop, never main.
- Team ownership: the routing source for triaged tickets.
- Setup runbook: the manual provisioning this design depends on.