Patrol autofix

How a failed Patrol E2E run becomes an investigated, fixed draft PR — without a human sitting in the slow edit-and-rerun loop.

Most Patrol failures are minor test-side issues (a waitUntilVisible hitting a release animation, a dropdown overlay off-screen) rather than real product bugs. Fixing them is slow only because the feedback loop is: edit the test → push → wait minutes for the Patrol job → see if it's green → repeat. This pipeline removes the human from that loop: on any Patrol failure it investigates whether the failure is a flake or a real break, and either opens a draft PR with a test fix or reports the broken functionality (and attempts a minimal product fix).

It reuses the security model of the Datadog auto-fix pipeline: the agent runs edit-only, the workflow owns git, and the draft PR is opened with a GitHub App token so downstream CI runs. Nothing merges without human review.

Flow

flowchart TD
  patrol[".github/workflows/flutter-patrol-tests.yml<br/>(Patrol job, per app)"] -->|"job fails"| autofix[".github/workflows/patrol-autofix.yml<br/>(workflow_call, per app)"]
  autofix --> artifacts["download failing run's<br/>logs + Playwright report"]
  artifacts --> resolve["resolve failing titles -> Dart targets"]
  resolve -->|"no failures for this app"| noop["no-op (matrix leg for a passing app)"]
  resolve -->|"failures"| debug["launch debug app<br/>(marionette VM service)"]
  debug --> agent[".github/actions/run-claude-agent<br/>patrol-fix agent (edit-only)"]
  agent --> gate["verify gate:<br/>re-run failing target (workers=1)"]
  gate -->|"workflow commits + App token opens"| pr["draft PR -> develop"]
  pr --> human["human review"]

Components

  • Trigger: an autofix job in flutter-patrol-tests.yml, needs: [setup, patrol], runs when needs.patrol.result == 'failure'. It is matrixed over the same apps; the reusable workflow no-ops for an app whose run had no failures (matrix legs can't tell which app failed, so each filters itself).
  • Reusable workflow: .github/workflows/patrol-autofix.yml (workflow_call). Mints the GitHub App token, checks out the PR head, downloads the failing run's <app>-test-logs artifact, resolves failing test titles back to Dart target files, branches off develop, sets up the Patrol toolchain, launches a debug app for marionette, runs the agent, re-runs the failing target as the gate, and opens the draft PR.
  • Composite action: .github/actions/run-claude-agent runs Claude Code headless in edit-only mode. It strips the agent file's frontmatter, appends a run-context data block, and invokes claude -p --permission-mode acceptEdits. Shared with the Datadog pipeline so the "run the agent" step has one source of truth.
  • Agent: .github/agents/patrol-fix.agent.md (prompt) + .github/agents/patrol-fix.mcp.json (marionette MCP). Classifies flake vs real break, applies the smallest fix reusing integration_test_shared helpers and the existing page objects, validates with a scoped patrol re-run, and writes a JSON verdict the workflow reads.
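
The prompt preparation done by the composite action (strip frontmatter, append run context) can be pictured roughly as follows. This is a minimal sketch, not the action's actual code: the function name build_prompt, the <run-context> wrapper tag, and the context keys are all hypothetical, and it assumes the frontmatter is a leading block delimited by --- lines.

```python
import json


def build_prompt(agent_markdown: str, run_context: dict) -> str:
    """Drop a leading '---' frontmatter block and append run context as data.

    Hypothetical helper: the real composite action may do this differently.
    """
    lines = agent_markdown.splitlines()
    body = lines
    if lines and lines[0].strip() == "---":
        # Look for the closing delimiter; if it is missing, keep the file whole.
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                body = lines[i + 1:]
                break
    prompt = "\n".join(body).strip()
    # The wrapper tag is an assumption; it only marks the block as data.
    context = json.dumps(run_context, indent=2, sort_keys=True)
    return f"{prompt}\n\n<run-context>\n{context}\n</run-context>\n"
```

Keeping the run context in a clearly delimited data block, separate from the instructions, is one way to make it harder for log content to be read as part of the prompt.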

How the agent decides flake vs. real break

The agent reads the failing step from the log and Playwright report, then probes the live debug app with marionette (connect, get_interactive_elements, tap, …). If the flow works interactively, it's a flake and the fix belongs in the test; if the feature is genuinely broken, it's a real break and the agent attempts a minimal first-party product fix plus a written report. Known flake root causes and their fixes are encoded in the agent prompt (dashboard card waitUntilVisible, provider-read-during-build, off-screen dropdown overlay, double-dispose, un-awaited sign-out).

The authoritative gate is the workflow re-running the failing target with --web-workers=1. The agent's own re-run is just for fast iteration; the workflow's gate decides whether the PR is labelled needs-human.

Outcomes & labels

  • Flake, gate passes — draft PR to develop, labels patrol-autofix, auto-fix.
  • Real break — draft PR with the minimal fix + a report, additionally labelled needs-human; the report is also commented on the originating PR.
  • Gate fails / inconclusive / no change — draft PR (if any change) labelled needs-human; the job fails so the failure stays visible.
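
The three outcomes above amount to a small decision table, sketched here in Python. The classification strings, the Outcome type, and the decide function are hypothetical names for illustration. Two points are assumptions rather than facts from this page: that a needs-human PR in the third case also carries the two base labels, and that the job does not fail in the real-break case when the gate passes.

```python
from dataclasses import dataclass

BASE_LABELS = ("patrol-autofix", "auto-fix")


@dataclass(frozen=True)
class Outcome:
    open_draft_pr: bool
    labels: tuple
    job_fails: bool
    comment_on_origin_pr: bool


def decide(classification: str, gate_passed: bool, has_changes: bool) -> Outcome:
    """Map the agent verdict and the workflow gate result to an outcome."""
    needs_human = BASE_LABELS + ("needs-human",)
    known = classification in ("flake", "real_break")  # hypothetical values
    # Gate fails / inconclusive / no change: PR only if something changed,
    # and the job fails so the failure stays visible.
    if not gate_passed or not has_changes or not known:
        labels = needs_human if has_changes else ()  # base labels: assumption
        return Outcome(has_changes, labels, True, False)
    # Real break: minimal fix plus report, always routed to a human.
    if classification == "real_break":
        return Outcome(True, needs_human, False, True)  # job status: assumption
    # Flake with a passing gate.
    return Outcome(True, BASE_LABELS, False, False)
```

For example, decide("flake", False, True) yields a draft PR labelled needs-human with a failing job, which matches the third bullet.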

Configuration

No new secrets. Reuses the Datadog pipeline's ANTHROPIC_API_KEY, AUTOFIX_APP_ID, AUTOFIX_APP_PRIVATE_KEY, and the optional OTEL vars. The autofix job runs on every Patrol failure; to pause it, disable or remove the autofix job in flutter-patrol-tests.yml.

Boundaries

  • Edit-only agent; first-party files only (no generated clients, lockfiles, vendored code).
  • Fixes always target develop (consistent with the Datadog pipeline). A flake fix won't auto-unblock the originating release/* PR — the reviewer routes/cherry-picks it.
  • Never writes to main, release/*, or hotfix/*, and never merges.
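
As an illustration of how these boundaries could be checked mechanically before a PR is opened, here is a hedged sketch. The function check_change and the EXCLUDED_PATHS patterns are hypothetical; the real pipeline's notion of "first-party" may be defined differently. Only the branch names come from this page.

```python
from fnmatch import fnmatchcase

BASE_BRANCH = "develop"
PROTECTED_BRANCHES = ("main", "release/*", "hotfix/*")
# Hypothetical patterns standing in for generated clients, lockfiles, vendored code.
EXCLUDED_PATHS = ("*.g.dart", "*.lock", "vendor/*")


def check_change(base_branch: str, head_branch: str, changed_files: list) -> list:
    """Return a list of boundary violations; an empty list means the change is allowed."""
    problems = []
    if base_branch != BASE_BRANCH:
        problems.append(f"PR must target {BASE_BRANCH}, not {base_branch}")
    if any(fnmatchcase(head_branch, p) for p in PROTECTED_BRANCHES):
        problems.append(f"refusing to write to protected branch {head_branch}")
    for path in changed_files:
        # fnmatch's '*' also matches '/', so '*.g.dart' covers nested paths.
        if any(fnmatchcase(path, p) for p in EXCLUDED_PATHS):
            problems.append(f"non-first-party file edited: {path}")
    return problems
```

A call such as check_change("develop", "patrol-autofix/demo", ["integration_test/login_test.dart"]) returns an empty list, while a lockfile edit or a release/* head branch is reported as a violation.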