Skip to content

Datadog Backend Observability (Firebase Functions Gen2)

This document defines backend Datadog setup for: - Logs: Cloud Logging -> Pub/Sub -> Dataflow -> Datadog - Traces: post-deploy Cloud Run instrumentation via datadog-ci

Infrastructure (Terraform)

Terraform creates the GCP + Datadog log pipeline in: - infrastructure/terraform/modules/datadog-observability - infrastructure/terraform/environments/staging - infrastructure/terraform/environments/production

Required Terraform workspace variables: - datadog_logs_api_key (sensitive)

Required Terraform workspace environment variables for Datadog provider: - DATADOG_API_KEY - DATADOG_APP_KEY

CI/CD Instrumentation

Backend deploy workflows run Datadog instrumentation after Firebase Functions deploy: - .github/workflows/backend-deploy-staging.yml - .github/workflows/backend-deploy-production.yml

The command used is: - npx -y @datadog/datadog-ci cloud-run instrument ... --tracing true

Current services instrumented: - api - bff - bff_clinical - bff_clinical_media

If a new HTTP function is added, update both workflows to include its service name.

Verification Checklist

Run after staging deploy, then production deploy.

  1. Confirm Dataflow is healthy
  2. GCP Console -> Dataflow -> job datadog-export-job-staging / datadog-export-job-prod
  3. Status should be Running with no sustained errors.

  4. Confirm logs arrive in Datadog

  5. Query in Datadog Logs:
  6. service:perci-platform-backend env:staging
  7. service:perci-platform-backend env:production

  8. Confirm traces arrive in Datadog APM

  9. Service catalog should show perci-platform-backend.
  10. Check recent traces for endpoints under api, bff, bff_clinical, bff_clinical_media.

  11. Confirm log/trace correlation

  12. Open a trace span and verify related logs are linked.
  13. Open a log entry and verify dd.trace_id and dd.span_id exist in payload.

Rollout Gates

Use this sequence for safe rollout:

  1. Apply Terraform in staging.
  2. Deploy staging backend and verify checklist above.
  3. Keep staging stable for at least one deploy cycle.
  4. Apply Terraform in production.
  5. Deploy production backend and verify checklist above.

Failure Recovery

If instrumentation fails in CI: 1. Re-run backend deploy workflow. 2. Run manual command with dry-run first: - npx -y @datadog/datadog-ci cloud-run instrument --project <project> --region europe-west2 --service <service> --tracing true --env <env> --version <sha> --dry-run 3. If needed, temporarily disable Datadog instrumentation step and proceed with deploy, then remediate in a follow-up run.

If log forwarding fails: 1. Check Dataflow worker logs. 2. Validate Secret Manager access for Dataflow worker SA. 3. Validate sink writer has roles/pubsub.publisher on datadog-export-topic.