
Datadog auto-triage and auto-fix pipeline

How a production error reported by Datadog becomes a triaged Jira ticket and, where safe, an open pull request, with a human always reviewing before anything merges.

This page is the canonical design. The manual provisioning steps (GitHub App, secrets, Terraform apply, OTEL export, smoke test) live in the setup runbook.

Goal

Datadog already tells us when something breaks. The gap is the work between an alert firing and a fix landing: someone has to notice it, decide whether it matters, find the right ticket or open a new one, and then write the fix. This pipeline automates the deterministic parts of that work and leaves every judgement that needs a human to a human.

Specifically it aims to:

  • Turn a high-signal Datadog Error Tracking issue into a Jira ticket on the PPL project, deduplicated against issues we already know about.
  • Route the ticket to the owning team using team ownership and a deterministic scoring rubric, not a model's opinion.
  • For a bounded class of low-risk errors, open a draft pull request that targets develop and follows the PR review and branching rules like any other change.
  • Never merge without human review, and stay inside an explicit cost and blast-radius budget with a single kill switch.

Flow

flowchart TD
  dd["Datadog Error Tracking<br/>(New Issue / High Impact monitor)"] -->|"webhook (machine-user PAT)"| dispatch["repository_dispatch<br/>(error-triage)"]
  dispatch --> triage[".github/workflows/error-triage.yml"]
  triage --> score["scripts/triage/ scoring rubric"]
  score -->|"below threshold"| drop["log and stop<br/>(no ticket)"]
  score -->|"at or above threshold"| dedupe["scripts/triage/ dedupe<br/>against open PPL issues"]
  dedupe -->|"duplicate"| comment["comment on existing ticket"]
  dedupe -->|"new"| jira["create PPL ticket<br/>routed by team ownership"]
  jira -->|"auto-fixable class<br/>and AUTOFIX_ENABLED"| fix[".github/workflows/error-autofix.yml"]
  fix --> action["Claude Code headless<br/>(edit-only, self-hosted runner)"]
  action -->|"workflow commits + verifies,<br/>App token opens"| pr["draft PR -> develop"]
  pr --> human["human review<br/>(PR reviews + CODEOWNERS)"]
  jira -->|"not auto-fixable"| human

The three stages

1. Listen

Datadog Error Tracking is the source of truth for what broke. An issue is the dedup unit: Datadog groups events that share a root cause into one issue, which moves through the states FOR REVIEW, REVIEWED, RESOLVED, IGNORED, and EXCLUDED.

Two kinds of Terraform-managed monitor watch for the conditions worth acting on, one pair per service:

  • New Issue: an issue we have not seen before appears.
  • High Impact: an existing issue crosses an impact threshold (volume, affected users, or a service we care about).

These monitors live in the datadog-observability Terraform module (infrastructure/terraform/modules/datadog-observability/autofix_monitors.tf). When a monitor fires, its message routes to a Datadog webhook (@webhook-perci-autofix-dispatch) that calls the GitHub repository_dispatch API with an error-triage event type. The webhook authenticates with a long-lived machine-user PAT. A GitHub App installation token cannot be used here: it expires after one hour, and the webhook stores a static credential with no step that could mint a fresh token. See the setup runbook.

repository_dispatch payload limits

A repository_dispatch client_payload is capped at 10 top-level keys and 65535 characters total. The webhook therefore sends only identifiers (issue id, service, monitor id, impact counts) and lets the triage stage fetch the full detail from the Datadog API, rather than inlining a stack trace into the payload.
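The limits above are easy to check mechanically, for example in a unit test for the webhook template or at the start of the triage script. The following is a minimal sketch, not the pipeline's actual code: the TriagePayload field names and both function names are assumptions made for illustration.

```typescript
// Hypothetical shape of the identifiers carried in client_payload.
// Field names are assumptions, not the pipeline's actual schema.
interface TriagePayload {
  issueId: string;
  service: string;
  monitorId: string;
  impactedUsers: number;
  eventCount: number;
}

// Limits as stated in this document.
const MAX_TOP_LEVEL_KEYS = 10;
const MAX_PAYLOAD_CHARS = 65535;

// Returns a list of violations; an empty list means the payload fits.
function checkPayloadLimits(payload: object): string[] {
  const problems: string[] = [];
  const keyCount = Object.keys(payload).length;
  if (keyCount > MAX_TOP_LEVEL_KEYS) {
    problems.push(`too many top-level keys: ${keyCount} > ${MAX_TOP_LEVEL_KEYS}`);
  }
  const size = JSON.stringify(payload).length;
  if (size > MAX_PAYLOAD_CHARS) {
    problems.push(`payload too large: ${size} > ${MAX_PAYLOAD_CHARS} characters`);
  }
  return problems;
}

// Request body for the repository_dispatch call, using the event type named above.
function buildDispatchBody(
  payload: TriagePayload
): { event_type: string; client_payload: TriagePayload } {
  return { event_type: "error-triage", client_payload: payload };
}
```

A payload carrying only identifiers stays far below both limits, which is the point of fetching the stack trace later instead of inlining it.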

2. Triage

.github/workflows/error-triage.yml runs on the repository_dispatch event and drives the deterministic TypeScript in scripts/triage/:

  1. Fetch detail. The issue object does not carry a stack trace. The triage script reads the issue from the Error Tracking API (https://api.datadoghq.eu/api/v2/error-tracking/issues/{id}) and then fetches a linked event from the Logs or RUM API to get the stack trace and request context.
  2. Score. Apply the scoring rubric below. The score is a pure function of the fetched fields, so the same issue always scores the same way.
  3. Decide. Below the threshold, log and stop. At or above it, continue.
  4. Deduplicate. Search open PPL issues for a match (by Datadog issue id stored on the ticket, then by fingerprint). A match gets a comment with the new occurrence; no match gets a new ticket.
  5. Route. Map the affected service to an owning team via team ownership and set the ticket's component and assignee group accordingly.

The ticket records the Datadog issue id so future runs can find it, closing the loop between Datadog issue states and Jira ticket lifecycle.
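The two-step match in the deduplication step can be sketched as below. This is an illustration under assumptions: the OpenTicket and IncomingIssue shapes are hypothetical, and how the fingerprint is computed is not specified here, so it is treated as an opaque string.

```typescript
// Minimal view of an open PPL ticket; field names are hypothetical.
interface OpenTicket {
  key: string;
  datadogIssueId?: string; // recorded on the ticket when it is created
  fingerprint?: string;    // assumed to be stored alongside the issue id
}

interface IncomingIssue {
  datadogIssueId: string;
  fingerprint: string;
}

type DedupeResult =
  | { kind: "duplicate"; ticketKey: string; matchedBy: "issueId" | "fingerprint" }
  | { kind: "new" };

// Match by Datadog issue id first across all open tickets,
// and only fall back to the fingerprint if no id matches.
function dedupe(issue: IncomingIssue, openTickets: readonly OpenTicket[]): DedupeResult {
  const byId = openTickets.find((t) => t.datadogIssueId === issue.datadogIssueId);
  if (byId !== undefined) {
    return { kind: "duplicate", ticketKey: byId.key, matchedBy: "issueId" };
  }
  const byFingerprint = openTickets.find((t) => t.fingerprint === issue.fingerprint);
  if (byFingerprint !== undefined) {
    return { kind: "duplicate", ticketKey: byFingerprint.key, matchedBy: "fingerprint" };
  }
  return { kind: "new" };
}
```

A "duplicate" result leads to a comment on the matched ticket; a "new" result leads to ticket creation and routing.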

3. Fix

For the bounded class of errors we trust to fix automatically, the triage stage triggers .github/workflows/error-autofix.yml. That workflow creates a PPL-<key>/… branch off develop, then runs Claude Code headless (claude -p) on our self-hosted runner in edit-only mode: the agent reproduces the failure, writes a failing test, and applies a minimal fix, but performs no git operations of its own. The workflow then runs the verify gate, commits, and opens a draft pull request against develop.

The PR is then an ordinary PR: it follows PR reviews, needs CODEOWNERS approval for the areas it touches, and must pass CI before a human merges it. Nothing the pipeline produces merges itself.

PRs must be opened with the GitHub App token

A PR opened with the default GITHUB_TOKEN does not trigger downstream workflows (this is a deliberate GitHub safeguard against recursive runs). The auto-fix workflow therefore opens its PR with the Perci Auto-fix GitHub App token so that backend-pr-checks.yml and the other PR checks actually run. See the setup runbook for the App and its scopes.
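As an illustration, the request that opens the draft PR might be assembled as follows. This is a sketch under assumptions: the function name, the option names, and the title and body wording are hypothetical. The endpoint and the draft flag are those of GitHub's REST "create a pull request" call, and the head branch is passed in as-is because the branch naming is owned by the workflow.

```typescript
interface DraftPrRequest {
  method: "POST";
  url: string;
  headers: { [name: string]: string };
  body: { title: string; head: string; base: string; body: string; draft: boolean };
}

// Builds (but does not send) the request. The token must be the GitHub App
// installation token, not GITHUB_TOKEN, so that PR checks are triggered.
function buildDraftPrRequest(opts: {
  owner: string;
  repo: string;
  appToken: string;
  headBranch: string;
  ticketKey: string;
  summary: string;
}): DraftPrRequest {
  if (opts.appToken.length === 0) {
    throw new Error("GitHub App installation token is required");
  }
  return {
    method: "POST",
    url: `https://api.github.com/repos/${opts.owner}/${opts.repo}/pulls`,
    headers: {
      Authorization: `Bearer ${opts.appToken}`,
      Accept: "application/vnd.github+json",
    },
    body: {
      // Title and body wording are illustrative assumptions.
      title: `${opts.ticketKey}: ${opts.summary}`,
      head: opts.headBranch,
      base: "develop", // never main
      body: `Automated fix proposal for ${opts.ticketKey}. Requires human review before merge.`,
      draft: true,     // always a draft
    },
  };
}
```

Hard-coding base and draft in one place keeps the "draft PR against develop" rule out of reach of any caller-supplied option.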

Scoring rubric

The rubric is a deterministic points total. An issue is triaged into a ticket only when it reaches the threshold, and it is only eligible for auto-fix when it additionally falls into an auto-fixable class. Keeping this as a table (not a prompt) means the decision is reproducible and reviewable.

| Signal | Source field | Points |
| --- | --- | --- |
| Issue state is FOR REVIEW | Error Tracking issue state | +3 |
| Affects a production service (members, clinicians, backend) | service tag | +3 |
| High Impact monitor fired (not just New Issue) | monitor id | +3 |
| Affected users above the impact threshold | impact user count | +2 |
| Error rate rising over the window | issue trend | +2 |
| First seen within the last 24 hours | issue first-seen timestamp | +1 |
| Issue state is IGNORED or EXCLUDED | Error Tracking issue state | disqualify |
| Already linked to an open PPL ticket | dedupe match | route to comment, not new ticket |

Thresholds (tunable, see PPL-2653):

  • Ticket when the total is 5 or more.
  • Auto-fix eligible when the total is 8 or more and the error matches an auto-fixable class (see scope guards).
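The rubric and thresholds can be expressed as a pure function. The sketch below uses the points and thresholds stated in this section; the ScoreInput field names and the state string values are assumptions (the Datadog API may serialise states differently), and the dedupe row is left out because it is handled by the deduplication step rather than by points.

```typescript
// State names mirror this document; the API's wire values may differ.
type IssueState = "FOR_REVIEW" | "REVIEWED" | "RESOLVED" | "IGNORED" | "EXCLUDED";

// Hypothetical input shape: every field is fetched before scoring.
interface ScoreInput {
  state: IssueState;
  service: string;
  highImpactMonitorFired: boolean;
  affectedUsers: number;
  impactUserThreshold: number;
  errorRateRising: boolean;
  firstSeenMs: number;
  nowMs: number; // passed in so the function stays pure
}

interface ScoreResult {
  disqualified: boolean;
  total: number;
  createTicket: boolean;
  meetsAutofixScore: boolean;
}

const PRODUCTION_SERVICES = ["members", "clinicians", "backend"];
const TICKET_THRESHOLD = 5;
const AUTOFIX_THRESHOLD = 8;
const DAY_MS = 24 * 60 * 60 * 1000;

function scoreIssue(input: ScoreInput): ScoreResult {
  // IGNORED and EXCLUDED issues are disqualified outright.
  if (input.state === "IGNORED" || input.state === "EXCLUDED") {
    return { disqualified: true, total: 0, createTicket: false, meetsAutofixScore: false };
  }
  let total = 0;
  if (input.state === "FOR_REVIEW") total += 3;
  if (PRODUCTION_SERVICES.indexOf(input.service) !== -1) total += 3;
  if (input.highImpactMonitorFired) total += 3;
  if (input.affectedUsers > input.impactUserThreshold) total += 2;
  if (input.errorRateRising) total += 2;
  const age = input.nowMs - input.firstSeenMs;
  if (age >= 0 && age <= DAY_MS) total += 1;
  return {
    disqualified: false,
    total,
    createTicket: total >= TICKET_THRESHOLD,
    // Score alone is not enough for auto-fix: the scope guards must also hold.
    meetsAutofixScore: total >= AUTOFIX_THRESHOLD,
  };
}
```

For example, a FOR REVIEW issue on backend with no other signals totals 6: enough for a ticket, not enough for auto-fix. The maximum possible total is 14.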

Scope guards

The pipeline is deliberately narrow. Auto-fix only runs when all of these hold:

  • The AUTOFIX_ENABLED repository variable is true (the kill switch).
  • The error falls into an allow-listed, low-risk class (for example a null-safety guard, a missing-field default, or a known dependency bump), never schema, auth, payments, or data-migration code.
  • The change stays within a bounded diff size and touches only paths owned by a single team, so CODEOWNERS review is unambiguous.
  • The score reaches the auto-fix threshold.

Anything outside these guards stops at a Jira ticket for a human to pick up. The model never merges, never force-pushes, and never targets main.
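The guards above amount to a conjunction of checks. A minimal sketch follows, in which the class identifiers, the diff-size limit of 50 lines, and the AutofixCandidate shape are all assumptions chosen for illustration; the document does not fix these values.

```typescript
// Hypothetical description of one auto-fix attempt.
interface AutofixCandidate {
  autofixEnabled: boolean; // value of the AUTOFIX_ENABLED repository variable
  errorClass: string;
  score: number;
  changedLines: number;    // only known after the agent has edited
  owningTeams: string[];   // teams owning the touched paths, per CODEOWNERS
}

// Illustrative identifiers for the allow-listed classes named above.
const ALLOWED_ERROR_CLASSES = ["null-safety-guard", "missing-field-default", "known-dependency-bump"];
const AUTOFIX_SCORE_THRESHOLD = 8;
const MAX_CHANGED_LINES = 50; // assumed bound; the real limit is not stated here

// Returns the guards that failed; auto-fix may proceed only if the list is empty.
function failedGuards(c: AutofixCandidate): string[] {
  const failed: string[] = [];
  if (!c.autofixEnabled) {
    failed.push("kill switch: AUTOFIX_ENABLED is not true");
  }
  if (ALLOWED_ERROR_CLASSES.indexOf(c.errorClass) === -1) {
    failed.push(`error class "${c.errorClass}" is not allow-listed`);
  }
  if (c.changedLines > MAX_CHANGED_LINES) {
    failed.push(`diff too large: ${c.changedLines} > ${MAX_CHANGED_LINES} lines`);
  }
  const distinctTeams = c.owningTeams.filter((t, i, all) => all.indexOf(t) === i);
  if (distinctTeams.length !== 1) {
    failed.push(`paths owned by ${distinctTeams.length} teams, expected exactly 1`);
  }
  if (c.score < AUTOFIX_SCORE_THRESHOLD) {
    failed.push(`score ${c.score} below auto-fix threshold ${AUTOFIX_SCORE_THRESHOLD}`);
  }
  return failed;
}
```

Returning the reasons rather than a boolean makes it straightforward to record on the Jira ticket why an issue was left for a human.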

Engine choice

The fix stage runs Claude Code in headless mode (claude -p) on our self-hosted blacksmith runner, in edit-only mode: the agent modifies the working tree on a branch the workflow already created, while the workflow owns every git operation (commit, push, draft PR) and the verify gate. We drive the agent with the prompt from .github/agents/datadog-autofix.agent.md plus the injected run context, a bounded --max-turns, --permission-mode acceptEdits, and an explicit --allowedTools allowlist that excludes git mutation so the agent cannot commit, push, or open a PR itself.
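A sketch of how the workflow might assemble the argument list for claude from the flags named above. The tool names in the usage example (Read, Edit, Grep) and the rule used to reject git-capable tools are assumptions; the allowlist the pipeline actually uses is defined with the workflow, not here.

```typescript
// Hypothetical description of one headless run.
interface HeadlessRun {
  prompt: string;         // agent prompt plus injected run context
  maxTurns: number;       // bound on agent turns
  allowedTools: string[]; // explicit allowlist
}

// Builds the argument list for the claude CLI. Throws if the allowlist
// would let the agent run git, since the workflow owns every git operation.
function buildClaudeArgs(run: HeadlessRun): string[] {
  if (!Number.isInteger(run.maxTurns) || run.maxTurns < 1) {
    throw new Error("maxTurns must be a positive integer");
  }
  // Unrestricted Bash, or any Bash pattern starting with git, is rejected.
  const gitCapable = run.allowedTools.find(
    (t) => t === "Bash" || /^Bash\(\s*git\b/.test(t)
  );
  if (gitCapable !== undefined) {
    throw new Error(`tool "${gitCapable}" would allow git mutation`);
  }
  return [
    "-p", run.prompt,
    "--max-turns", String(run.maxTurns),
    "--permission-mode", "acceptEdits",
    "--allowedTools", run.allowedTools.join(","),
  ];
}

// Example with an illustrative allowlist.
const exampleArgs = buildClaudeArgs({
  prompt: "Reproduce and fix the failure described in the run context.",
  maxTurns: 20,
  allowedTools: ["Read", "Edit", "Grep"],
});
```

Validating the allowlist in code, rather than trusting the workflow file alone, gives a second check that the agent cannot commit, push, or open a PR.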

Self-hosted, not Anthropic-managed

The agent executes inside our own GitHub Actions runner, using our ANTHROPIC_API_KEY. It is not an Anthropic-hosted service reaching into the repo. The repository, secrets, and network egress stay under our control, which is what lets us keep PHI-adjacent code inside our boundary.

Why headless over claude-code-action. The official action can also run unattended (automation mode), but it manages branches, commits, and PR creation itself; combining that with our own commit/verify/PR steps would risk two actors racing to open the PR. Headless gives a single, clear owner of version control — the workflow — which keeps exactly one draft PR per fix and is simpler to reason about. We can revisit the action later if we want its built-in PR and structured-output handling.

Observability

The pipeline reports on itself so we can see whether it is helping or just spending money. Claude Code emits OpenTelemetry; we export those traces and metrics to Datadog EU via OTLP (datadoghq.eu, the same EU site the rest of the platform uses). That gives us per-run cost, token usage, latency, and success or failure for every triage and fix run. A budget guard sits alongside this telemetry and trips the kill switch if spend exceeds the configured ceiling. The OTEL export configuration is in the setup runbook.
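The budget guard reduces to a sum and a comparison. The sketch below assumes per-run costs are already available as numbers (for example, read back from the exported metrics); the RunCost shape, the function name, and the use of US dollars are assumptions.

```typescript
// Hypothetical per-run cost record.
interface RunCost {
  runId: string;
  stage: "triage" | "fix";
  costUsd: number;
}

interface BudgetDecision {
  spentUsd: number;
  exceeded: boolean;
  autofixEnabled: boolean; // value AUTOFIX_ENABLED should have after the check
}

// Sums spend over the budget window and trips the kill switch when the
// ceiling is exceeded. It never re-enables a switch that is already off.
function evaluateBudget(
  runs: readonly RunCost[],
  ceilingUsd: number,
  currentlyEnabled: boolean
): BudgetDecision {
  const spentUsd = runs.reduce((sum, r) => sum + r.costUsd, 0);
  const exceeded = spentUsd > ceilingUsd;
  return { spentUsd, exceeded, autofixEnabled: currentlyEnabled && !exceeded };
}
```

Because the guard only ever turns the switch off, turning auto-fix back on after a budget trip remains a deliberate human action.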

Phasing

| Phase | Scope |
| --- | --- |
| P0 | Terraform monitors in staging, Datadog webhook to repository_dispatch, and the triage stage creating routed, deduplicated PPL tickets. No auto-fix. |
| P1 | GitHub App, secrets, and kill switch in place; auto-fix workflow opens draft PRs for one narrow error class behind AUTOFIX_ENABLED. |
| P2 | OTEL observability and budget guard live; widen the auto-fixable classes as confidence grows. |
| P3 | Flutter (members and clinicians) RUM error class added once source-map upload is reliable; tune thresholds (PPL-2653). |

Design to ticket map

| Design section | PPL ticket |
| --- | --- |
| Terraform monitors (New Issue / High Impact) | PPL-2644 |
| Datadog webhook to repository_dispatch | PPL-2645 |
| Scoring rubric and team routing | PPL-2646 |
| Dedupe and Datadog issue / Jira ticket lifecycle | PPL-2647 |
| GitHub App and secrets / kill switch | PPL-2648 |
| error-triage.yml workflow | PPL-2649 |
| error-autofix.yml workflow | PPL-2650 |
| Observability, kill switch, and budget guard | PPL-2651 |
| Flutter RUM error class | PPL-2652 |
| Threshold and rubric tuning | PPL-2653 |

All of the above sit under epic PPL-2643.