Testing & the CI/CD release gate

This page is the source of truth for how the Flutter apps (perci-platform-members and perci-platform-clinicians) are tested and what must be green before a change can be released. The goal is continuous delivery: a change merges and releases as soon as the gate is green, multiple times a day.

Where QA testing happens (feature-branch first)

We work in feature branches: a long-lived epic branch is cut at the epic level, and each piece of implementation branches off it and PRs back into the epic branch (see Epic (feature) branches). Opening a PR with the preview label spawns a preview environment (members, clinicians, backend) for that branch.

Test the feature in its own branch's preview, before it merges. This is the In Test stage of the Delivery workflow; do not wait until the change is on develop / staging to verify it. By the time work reaches On Staging, that step is for regression against everything else, not first-pass feature testing.

The only exception: infrastructure that can't run in a preview

If a change depends on infrastructure that genuinely cannot be stood up in the feature branch's preview (for example a new queue, cron, external webhook, or environment-specific config), test it on staging after merge. This is the exception, not the norm, so call it out on the ticket to explain why it skipped feature-branch testing.

Test types

Both apps use the same four layers, and the same tooling (patrol, golden_toolkit, mockito).

| Layer | Tool | Lives in | Runs in CI |
| --- | --- | --- | --- |
| Unit (domain/data) | `flutter_test` | `test/features/**/domain`, `**/data` | every PR (blocking) |
| Widget | `flutter_test` + `ProviderScope` overrides | `test/features/**/presentation` | every PR (blocking) |
| Golden | `golden_toolkit` (`@Tags(['golden'])`) | next to the widget, in `goldens/` | every PR (blocking) |
| E2E | `patrol` (Chrome/web) | `patrol_test/` | pre-release / on-label (separate) |

Mocking convention

Prefer hand-rolled fakes that implement the domain repository interface for unit/provider/widget tests - they are explicit, fast, and need no codegen.
Use mockito (@GenerateMocks + build_runner) for new tests where call verification or mocking a concrete SDK/generated client adds real value.
Providers are tested with a ProviderContainer (or ProviderScope) overriding the repository/datasource provider with a fake. For autoDispose providers, hold a listener before awaiting so the provider is not disposed mid-load.

Shared harness

packages/perci_platform_test_shared is the single source of truth for the fiddly, app-agnostic test setup: the silent network-image HTTP layer, the Firebase Analytics fake, the package_info / secure_storage / datadog channel mocks, golden_toolkit configuration, device presets and the Firebase core mocks. Each app keeps a thin test/.../golden_harness.dart that calls GoldenHarnessBase.baseGlobalSetUp() then wires its own Firebase init, auth manager and FFAppState. Patrol widget wrappers stay per-app (they embed each app's root widget).

The release gate (`.github/workflows/ci-flutter.yml`)

The two required PR status checks are Flutter checks passed (from ci-flutter.yml) and Backend checks passed (from ci-backend.yml). Both names are pinned in the branch rulesets: the workflow files can be renamed, but the aggregate job names must not be.

A PR to develop or main must pass:

Code generation - melos run build_runner (openapi + freezed + riverpod).
Analyze (errors block) - flutter analyze --no-fatal-infos --no-fatal-warnings. Errors fail the build. Warnings/infos are reported but not yet fatal - they are a ratcheting backlog (see below). Flip to fatal-warnings once the count hits zero.
Unit + widget tests - melos run test (excludes goldens), with coverage.
Coverage threshold - total line coverage must be >= FLUTTER_MIN_COVERAGE (a repo variable). Ratchet this up toward 80%; never lower it.
Goldens - a separate blocking job (see Goldens).

Patrol E2E runs in a separate workflow, not on every PR (see E2E).
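The coverage threshold step in the list above boils down to summing the `LF:` (lines found) and `LH:` (lines hit) records of the merged lcov report and comparing the ratio with `FLUTTER_MIN_COVERAGE`. The following is a minimal sketch of that arithmetic, not the actual step in ci-flutter.yml; the function names and the temporary report are illustrative assumptions, and the demo reuses the merged figures from the coverage table on this page.

```sh
#!/bin/sh
# Sketch only: function names are hypothetical, not taken from ci-flutter.yml.

# Print total line coverage of an lcov report as a percentage (two decimals).
lcov_line_coverage() {
  awk -F: '
    /^LF:/ { found += $2 }   # lines found in this file record
    /^LH:/ { hit += $2 }     # lines hit in this file record
    END {
      if (found == 0) { print "0.00"; exit }
      printf "%.2f\n", 100 * hit / found
    }
  ' "$1"
}

# Succeed (exit 0) when coverage $1 is at or above the floor $2.
coverage_meets_floor() {
  awk -v cov="$1" -v min="$2" 'BEGIN { exit (cov + 0 >= min + 0) ? 0 : 1 }'
}

# Demo: a one-record report carrying the merged totals (7695 of 66664 lines).
report=$(mktemp)
printf 'SF:lib/a.dart\nLF:66664\nLH:7695\nend_of_record\n' > "$report"
coverage=$(lcov_line_coverage "$report")
echo "coverage: ${coverage}%"
if coverage_meets_floor "$coverage" "${FLUTTER_MIN_COVERAGE:-11}"; then
  echo "gate: pass"
else
  echo "gate: fail"
fi
rm -f "$report"
```

Comparing in awk rather than with `[ ... -ge ... ]` keeps the check correct for fractional percentages, which the shell's integer tests cannot handle.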

Coverage baseline & ratchet

Set the repo variable FLUTTER_MIN_COVERAGE to the current measured floor, then raise it as the backlog is burned down. The immediate purpose of the gate is non-regression (coverage may not drop); the long-term target is 80% on hand-written code, reached by ratcheting.

The floor follows the coverage actually present on the target branch. The foundation lands at develop's current ~9.54%, so FLUTTER_MIN_COVERAGE starts at 9 and is ratcheted to 11 once the stacked test PRs (993 tests) are on develop, then upward toward 80%. Raise it after each coverage-improving PR; never lower it below the branch's real coverage.
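The ratchet rule can be read as "round the measured coverage down to a whole percent, and never move the floor backwards". The sketch below assumes that reading, which is consistent with the 9.54% to 9 and 11 steps described above but is not confirmed as the exact policy; `ratchet_floor` is a hypothetical helper name.

```sh
#!/bin/sh
# Sketch only: assumes the floor is the measured coverage rounded down,
# and that it is never lowered. ratchet_floor is a hypothetical name.

# $1 = current floor, $2 = measured line coverage (percent)
ratchet_floor() {
  awk -v cur="$1" -v cov="$2" 'BEGIN {
    candidate = int(cov)                              # drop the fraction
    result = (candidate > cur + 0) ? candidate : cur + 0
    print result
  }'
}

ratchet_floor 0 9.54     # prints 9
ratchet_floor 9 11.54    # prints 11
ratchet_floor 11 10.80   # prints 11 (a dip does not lower the floor)
```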

| Scope | Line coverage |
| --- | --- |
| Merged (what the gate checks) | 11.54% (7695/66664) |
| perci-platform-members | 13.85% raw / 13.26% testable |
| perci-platform-clinicians | 9.66% raw / 10.10% testable |

Well-covered feature areas already: members video_call 94%, scans_and_tests 47%, checkout 39%, code_signup 31%; clinicians payments, documents, screening, member_dashboard. The gap to 80% is almost entirely legacy FlutterFlow UI (onboarding, main_pages, appointments, new_a_p_p) - best covered by widget + golden + patrol rather than unit tests, and tracked in the backlog below.

Datadog coverage reporting

Both suites upload their merged lcov report to Datadog Code Coverage (EU site) as well as Codecov:

PRs - the datadog-ci coverage upload step in ci-flutter.yml and ci-backend.yml. On a pull_request event Datadog attributes the upload to GITHUB_HEAD_REF, so these only ever produce feature-branch coverage.
develop - .github/workflows/coverage-develop.yml re-runs both suites on push to develop and uploads, giving Datadog the default-branch baseline it diffs PR coverage against. Without it the develop branch view is empty.

The develop workflow deliberately runs tests only (no lint/build/threshold gate, no Codecov, no PR comment) and does not reuse the required check names, so it can never block a merge. Its env scaffolding mirrors the PR jobs; keep the two in sync or the baseline stops being comparable.

Every upload carries a flag, flutter or backend-functions, set on the DataDog/coverage-upload-github-action step. Flags are what code-coverage.datadog.yml scopes carryforward to, and what the coverage gates will be scoped to. Most PRs touch only one side of the repo, so without flags a backend-only PR reads as if all Flutter coverage had vanished, and unflagged uploads merge into one undifferentiated pile that cannot be gated per stack. If a new package starts uploading, give it its own flag rather than reusing one of these.

code-coverage.datadog.yml (repo root) also maps paths to the Datadog services they already report under, so coverage lines up with APM, RUM and DORA, and ignores generated Dart output. It carries no gates yet: thresholds have to be read off the per-flag figures, which only exist once flags are in use. Coverage is tracked but nothing in Datadog blocks a merge - the blocking gates today are the vitest coverage.thresholds and the FLUTTER_MIN_COVERAGE step above.

Coverage paths must be repo-relative. Datadog files a report by the paths in its SF: lines, so a package-relative report lands in a phantom top-level folder in the File Explorer. Flutter is handled by the merge step, which prefixes each package dir. Backend needs the same treatment: vitest runs inside apps/perci-platform-backend/functions and emits SF:src/..., so both backend jobs rewrite the paths into coverage/datadog/lcov.info before uploading (datadog-ci has no path-prefix flag). Codecov keeps reading the untouched report. Any new package that starts uploading coverage needs the same rewrite.
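The backend path rewrite described above amounts to prefixing every `SF:` record with the package directory. This is a minimal sketch of one way to do it, not the code in the backend jobs; `prefix_lcov_paths` is a hypothetical helper, and the input report location in the demo is an assumption.

```sh
#!/bin/sh
# Sketch only: prefix_lcov_paths is a hypothetical helper name.

# Prefix every relative SF: path with a package directory so that the
# report becomes repo-relative. Absolute paths are left untouched.
# $1 = package dir (relative to repo root), $2 = input lcov, $3 = output lcov
prefix_lcov_paths() {
  mkdir -p "$(dirname "$3")"
  awk -v prefix="$1" '
    /^SF:/ && substr($0, 4, 1) != "/" {
      print "SF:" prefix "/" substr($0, 4)
      next
    }
    { print }
  ' "$2" > "$3"
}

# Demo with a tiny package-relative report in a scratch directory.
work=$(mktemp -d)
printf 'SF:src/index.ts\nLF:10\nLH:7\nend_of_record\n' > "$work/lcov.info"
prefix_lcov_paths apps/perci-platform-backend/functions \
  "$work/lcov.info" "$work/coverage/datadog/lcov.info"
head -n 1 "$work/coverage/datadog/lcov.info"
# prints SF:apps/perci-platform-backend/functions/src/index.ts
rm -rf "$work"
```

Writing to a separate output file leaves the original report untouched, which matches the requirement that Codecov keeps reading the unmodified lcov.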

Analyze warning ratchet

melos analyze currently reports ~360 warnings/infos across the workspace, almost all pre-existing in legacy FlutterFlow code (perci_library_9rk85z) and a few in older test infra. There are no error-severity issues, so the error-only gate passes today. Burn the warning count down (a chunk is auto-fixable via dart fix --apply), then make warnings fatal in the gate.

Goldens

Golden tests run on every PR for both apps via flutter test --tags golden (the golden job in ci-flutter.yml). The job is blocking: a failing golden fails the required Flutter checks passed gate and prevents merge.

Cross-platform: render in boxes, not real fonts

Real fonts rasterise differently per OS (Windows DirectWrite, macOS CoreText, Linux FreeType), so golden PNGs made on one machine never match another - a mixed Win/Mac/Linux team plus Linux CI can't share real-font baselines. So our goldens do not load real fonts: the harness never calls loadAppFonts(), so Flutter's test environment renders all text in the Ahem font (every glyph a fixed square). Ahem output is identical on every platform, so a baseline generated on any dev machine matches CI exactly. (This is the same trick as Alchemist's "CI mode"; we do it directly rather than add the dependency.)

Consequence: goldens verify layout, sizing, colour and structure - not readable text (text shows as boxes). Text content is asserted by widget tests. A small 3% tolerance comparator (test/flutter_test_config.dart) absorbs residual sub-pixel anti-aliasing at box/shape edges.

Regenerate on any OS with flutter test <path> --tags golden --update-goldens - boxes are platform-independent, so the result matches CI. The Flutter - Update Goldens workflow does the same on CI and opens a PR.
Not golden-tested (non-deterministic regardless of fonts): live camera/video widgets (the old meeting_room / waiting_room goldens) and the animated WelcomePage were dropped; a couple of ultra-narrow scenarios are skipped where the wider Ahem glyphs tip a flex-less row into overflow (covered by their wider siblings + widget tests).

E2E (patrol)

Patrol E2E runs on three lanes, all selecting tests by tag:

| Lane | Workflow | When | Backend | Gate? |
| --- | --- | --- | --- | --- |
| PR preview (web) | `e2e-web.yml` | preview label, Patrol CI changes, `release/*` / `hotfix/*` PRs into `main` | the PR's own preview backend (falls back to staging) | fails the preview run |
| Nightly (web) | `e2e-web-nightly.yml` | 02:30 UTC nightly + every PR into `main` + manual | real staging | advisory, deliberately not a required check |
| Native (Android) | `e2e-native-manual.yml` | manual dispatch, any app, any branch | real staging via Firebase Test Lab | manual |

Tag vocabulary

Lane selection is tag-driven. The vocabulary lives in one place — TestTags in packages/integration_test_shared — and raw-string vocabulary tags are rejected by a lint step in ci-flutter.yml (a typo'd raw tag would silently drop a test out of its lane). Zephyr case ids ('PPL-TNN') stay raw.

| Tag | Meaning | Selected by |
| --- | --- | --- |
| `webSafe` | Honest-green, web-runnable, standalone (no seed step, no native device, non-destructive on shared staging) | `e2e-web.yml` and `e2e-web-nightly.yml` (`--tags webSafe`) |
| `native` | Needs a real Android device (Stripe CardField, WebRTC, permissions, file_picker) | `e2e-native-manual.yml` (via `target` or `tags`) |
| `smoke` / `regression` / `desktop` / `tablet` / `mobile` | Suite descriptors; selectable ad hoc via the manual lanes' `tags` inputs | - |

Add webSafe to a case only once it is confirmed green and standalone; seed-dependent cases join when a seed pre-step lands. patrol bakes --tags into the generated bundle at build time, so each lane's build compiles exactly the matching cases.
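A raw-string tag check of the kind mentioned above can be approximated with a single `grep`. This is a rough sketch, not the lint step in ci-flutter.yml: it assumes the vocabulary is exactly the tags listed in the table, that the constants are referenced as `TestTags.<name>` (the member names are an assumption), and it will also flag any unrelated string literal that happens to equal a vocabulary word. The Zephyr ids in the demo are made-up examples.

```sh
#!/bin/sh
# Sketch only: find_raw_vocab_tags is a hypothetical helper, and the
# pattern is a crude heuristic rather than a real Dart parser.

# Print every line under $1 that quotes a vocabulary tag as a raw string.
# Exits 0 when at least one such line is found.
find_raw_vocab_tags() {
  grep -rnE --include='*.dart' \
    "[\"'](webSafe|native|smoke|regression|desktop|tablet|mobile)[\"']" "$1"
}

# Demo: one file with a raw tag, one using the shared constant.
demo=$(mktemp -d)
cat > "$demo/raw_tag_test.dart" <<'EOF'
patrolTest('signs in', tags: ['webSafe', 'PPL-T12'], ($) async {});
EOF
cat > "$demo/constant_tag_test.dart" <<'EOF'
patrolTest('signs out', tags: [TestTags.webSafe, 'PPL-T13'], ($) async {});
EOF
if find_raw_vocab_tags "$demo"; then
  echo "raw vocabulary tags found; use the TestTags constants instead"
fi
rm -rf "$demo"
```

Zephyr case ids pass through untouched because they never match a vocabulary word; in a CI step the `echo` would be followed by a non-zero exit.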

The web lanes

The per-PR lane runs against Chrome (web) via Playwright and is slow + device-bound, so it does not run on every PR. Both apps share e2e-web.yml, which inspects the PR's changed files and runs a matrix job per affected app: a change under packages/ or the root pubspec.yaml runs both, an app-only change runs just that app, and the generated clinical BFF spec runs clinicians. It runs on release/* and hotfix/* PRs into main, whenever the Patrol CI definition itself changes, and on demand via the preview label.
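The changed-files routing described above can be sketched as a small filter over the PR's file list. This is an illustration under assumptions rather than the logic in e2e-web.yml: the `apps/perci-platform-clinicians` directory is assumed by analogy with the members app, the BFF spec path is passed in because its real location is not stated here, and `affected_apps` is a hypothetical name.

```sh
#!/bin/sh
# Sketch only: affected_apps is a hypothetical helper name.

# Read changed file paths on stdin and print the affected apps, one per line.
# $1 (optional) = path of the generated clinical BFF spec.
affected_apps() {
  awk -v bff_spec="${1:-}" '
    # Shared packages or the root pubspec affect both apps.
    $0 ~ /^packages\// || $0 == "pubspec.yaml" { members = 1; clinicians = 1 }
    # App-only changes affect just that app.
    $0 ~ /^apps\/perci-platform-members\//     { members = 1 }
    $0 ~ /^apps\/perci-platform-clinicians\//  { clinicians = 1 }
    # The generated clinical BFF spec affects clinicians.
    bff_spec != "" && $0 == bff_spec           { clinicians = 1 }
    END {
      if (members) print "members"
      if (clinicians) print "clinicians"
    }
  '
}

printf '%s\n' 'packages/perci_platform_test_shared/lib/a.dart' | affected_apps
# prints members, then clinicians
printf '%s\n' 'apps/perci-platform-members/lib/main.dart' 'README.md' | affected_apps
# prints members
```

The printed names could then feed a job matrix; a change that touches neither app prints nothing, so no matrix job would be scheduled.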

preview-pr.yml calls it (workflow_call) once its backend preview job has settled, and passes the resulting BFF URLs in — so a PR carrying backend changes is tested against its own preview backend rather than shared staging. A PR with no preview still runs Patrol on the gates above, against staging; a failed backend deploy skips Patrol, leaving the preview run red rather than testing against the previous revision that is still serving the preview host.

The nightly lane (e2e-web-nightly.yml) runs the same webSafe subset for both apps against real staging, and re-runs it on every release/hotfix PR into main as an advisory signal.

The native lane

Some flows cannot run in headless Chrome at all — Stripe's CardField is an Android platform view, WebRTC and GetStream need real media permissions, file_picker needs a real file system. Those cases are tagged native and run on real Android devices in Firebase Test Lab via e2e-native-manual.yml (manual dispatch: either app, any branch, a single target or a tags selector, with optional FTL sharding for parallelism). Many native cases need bespoke staging seeding before dispatch — see each app's patrol_test/tools/.

Every user story should land with an automated E2E that walks its flow, so the patrol suite grows into a full regression net. That suite is the release regression gate: because it runs on release/* / hotfix/* PRs into main, a green run is what lets us ship confident there are no regressions, without a manual re-test pass. Adding the E2E is part of the Definition of Done.

The shared harness lives in patrol_test/ (patrol_setup.dart, patrol_widget_wrapper.dart, helpers/clinician_session.dart, pages/). patrol test discovers patrol_test/ by default. Clinician flows covered: sign-in (+ forgot-password), sign-out, members-list search, members-list filter, member details (contact + demographic), member medical record (documents + screening sections), appointments (tabs), messages, payments, and the Learn hub (open + search). Section/tab flows are permission-gated and skip cleanly when a role lacks access. These are authored + analyze-verified; runtime execution is the patrol CI job.

Autofix on failure

When a Patrol run fails, an agentic follow-up (.github/workflows/e2e-web-autofix.yml) investigates instead of waiting for a human's slow edit-and-rerun loop. It decides whether the failure is a flaky test (e.g. a waitUntilVisible hitting a refresh animation) or a real functionality break — probing live UI state with marionette and re-running only the failing target as the authoritative gate — then opens a draft PR with the fix (test-side for a flake; a minimal product fix plus a report for a real break). See Patrol autofix for the design.

Running locally

melos bootstrap            # resolve deps (first time / after pubspec changes)
melos run build_runner     # generate openapi + freezed + riverpod
melos run analyze          # static analysis
melos run test             # unit + widget (excludes goldens)

# single app, with coverage
cd apps/perci-platform-members && fvm flutter test --coverage --exclude-tags golden
# goldens for one file
fvm flutter test <path> --tags golden            # compare
fvm flutter test <path> --tags golden --update-goldens   # regenerate

Coverage backlog (path to "all functionality covered")

Comprehensive coverage is delivered by ratcheting FLUTTER_MIN_COVERAGE, not in a single pass. Remaining gaps are tracked as tickets; priority order:

Release-critical flows: auth, booking, payments, care flows, chat, documents.
Per-feature domain + data layers (cheap, high-value unit coverage).
Presentation/provider logic.
Legacy FlutterFlow widgets - covered by golden + patrol where unit testing is impractical.