Tags: observability · logging · cloud-run · engineering

Tracing WhatsApp Onboarding Across Nine Waves

Jiwa AI Team

The Problem: Invisible Pipelines

When a user sends /daftar on WhatsApp, our system kicks off a multi-stage onboarding pipeline. It exchanges Instagram OAuth tokens, scrapes websites, analyzes brands, matches influencers, generates calendars, creates images, writes captions, and delivers everything back to WhatsApp. Nine waves of async work, spanning three separate HTTP requests.

Until now, we had almost no visibility into successful flows. Our logging consisted of error-only console.error() calls scattered across handlers. When something went wrong, we'd see a single error line with no context: which phone triggered it, which wave failed, how long earlier waves took. Successful onboardings were completely invisible.

Phone as Correlation Key

Traditional distributed tracing uses request IDs or trace tokens. Our pipeline has a simpler natural key: the phone number. Every stage of the /daftar flow knows the phone number, from the initial webhook through OAuth callback to the long-running onboarding pipeline.

We enhanced our logger with a child() method that binds context to every subsequent log call:

const log = logger.child({ phone, flow: "daftar" });
log.info("onboard-wa request received", { instagramHandle });
// Every log from this point includes phone and flow labels

When the business record is created mid-pipeline, we enrich the logger further:

log = log.child({ businessId: business.id });

Now every log entry from that point forward carries phone, flow, and businessId, queryable in GCP Logs Explorer with a single filter.
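A minimal sketch of what such a context-binding logger can look like (illustrative, not our exact module; the factory name createLogger is made up):

```javascript
// Sketch of a context-binding logger. child() returns a NEW logger whose
// entries carry the merged context; the parent is left untouched, so a
// request handler can safely fork per-request loggers from a shared base.
function createLogger(context = {}) {
  return {
    child(extra) {
      return createLogger({ ...context, ...extra });
    },
    info(message, meta = {}) {
      // One JSON line per entry on stdout; bound context becomes GCP labels.
      console.log(JSON.stringify({
        severity: "INFO",
        message,
        "logging.googleapis.com/labels": context,
        ...meta,
      }));
    },
  };
}

const log = createLogger().child({ phone: "628123456789", flow: "daftar" });
log.info("onboard-wa request received", { instagramHandle: "@example" });
```

Because child() copies rather than mutates, enriching the logger mid-pipeline (log = log.child({ businessId })) never leaks labels into sibling requests sharing the base logger.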

Wave-Level Timing

The nine waves in our pipeline have very different performance characteristics. Brand analysis takes 10-15 seconds. Image generation can take over a minute. We needed to see these timings without littering the code with manual Date.now() bookkeeping.

The timedWave() helper wraps any async function with start/end logging and duration tracking:

const [scraped, instagramData] = await log.timedWave("wave1:scrape+ig", () =>
  Promise.all([scrapeWithSubpages(url), igFetchPromise]),
);

This produces two log entries per wave: one when it starts and one when it completes (or fails), with durationMs in the metadata. In Logs Explorer, filtering by labels.wave != "" shows a clean timeline of wave performance for any onboarding.
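A sketch of the helper's shape (illustrative: ours is a method on the logger instance, so wave entries also inherit the bound phone and flow labels; here the output sink is a parameter for clarity):

```javascript
// Sketch of timedWave(): wraps an async fn with start/end logging and
// duration tracking. The sink parameter is illustrative; the real helper
// writes through the enriched logger.
async function timedWave(wave, fn, sink = console.log) {
  const emit = (severity, message, extra = {}) =>
    sink(JSON.stringify({
      severity,
      message,
      "logging.googleapis.com/labels": { wave },
      ...extra,
    }));
  const start = Date.now();
  emit("INFO", `${wave} started`);
  try {
    const result = await fn();
    emit("INFO", `${wave} completed`, { durationMs: Date.now() - start });
    return result;
  } catch (err) {
    emit("ERROR", `${wave} failed`, {
      durationMs: Date.now() - start,
      error: String(err),
    });
    throw err; // rethrow so the pipeline's own error handling still runs
  }
}
```

The catch branch logs and rethrows, so a failed wave both appears in the timeline with its duration and still aborts the pipeline as before.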

GCP-Compatible Structured JSON

Cloud Run captures stdout, and any line that parses as JSON becomes a structured log entry with certain fields treated specially. The key changes to make logs queryable:

  • severity instead of level: GCP recognizes DEBUG, INFO, WARNING, ERROR and color-codes them in the console
  • logging.googleapis.com/labels: fields placed here become filterable labels in the Logs Explorer sidebar
  • Always console.log(): using console.error() causes GCP to double-tag severity, since stderr is auto-classified as ERROR

One subtle gotcha: we initially used console.error() for error-level logs (matching the old pattern). GCP would tag these as ERROR twice, once from our severity field and once from the stderr stream. Routing everything through console.log() and letting the severity field control the level fixed this.
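Put together, a single stdout line ends up looking roughly like this (values illustrative):

```json
{
  "severity": "ERROR",
  "message": "wave4:imageGen failed",
  "logging.googleapis.com/labels": {
    "phone": "628123456789",
    "flow": "daftar",
    "wave": "wave4:imageGen"
  },
  "durationMs": 61234,
  "error": "timeout"
}
```

GCP lifts severity and the labels object into their special positions; everything else lands in jsonPayload, which is why jsonPayload.message works in queries below.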

Fixing Silent Delivery Failures

While adding logging, we discovered why some users only received one image preview instead of all their posts. Our WhatsApp delivery loop had a 500ms client-side delay between sends, but Fonnte (our WhatsApp API) was rate-limiting us, returning error responses that our code silently ignored.

The fix was twofold: use Fonnte's native delay parameter ("2-5" for 2-5 second random server-side spacing) and keep a small 150ms client-side delay to avoid overwhelming their inbound API. The server-side delay handles WhatsApp delivery pacing; the client-side delay prevents API-level rate limiting.
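A sketch of the corrected loop. The send function's shape and field names (target, url) are assumptions for illustration; only the delay value ("2-5") and the 150ms client-side gap come from the fix described above:

```javascript
// Sketch of the corrected delivery loop. send() stands in for the HTTP call
// to Fonnte; its request/response shape here is illustrative.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function deliverImages(phone, imageUrls, send) {
  for (const url of imageUrls) {
    // Server-side pacing: ask Fonnte to space WhatsApp sends 2-5s apart.
    const res = await send({ target: phone, url, delay: "2-5" });
    if (!res.ok) {
      // Previously this failure was swallowed; now it surfaces in the logs.
      throw new Error(`Fonnte send failed: ${res.status ?? "unknown"}`);
    }
    // Small client-side gap so we don't hammer Fonnte's inbound API.
    await sleep(150);
  }
}
```

Injecting send keeps the loop testable; in production it would wrap the actual Fonnte HTTP request.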

Querying in Practice

With these changes deployed, tracing a /daftar flow is a single Logs Explorer query:

labels.phone="628123456789"
labels.flow="daftar"

This returns every log entry from webhook receipt through OAuth callback through all nine waves to final delivery โ€” correlated by phone number, with wave-level timing and structured error context.

For performance analysis across all onboardings:

labels.flow="daftar" jsonPayload.message=~"completed"

shows wave completion times. For error monitoring:

labels.flow="daftar" severity>=ERROR

catches failures across the entire pipeline.

The best part: none of this required a logging library. The enhanced logger is under 100 lines, built on console.log() and JSON โ€” exactly what Cloud Run expects.