Your Async Job Isn't Async: When Cloud Run Jobs Exist Only in Code
The Architecture Looked Perfect
We had a clean async onboarding design:
- User connects Instagram via OAuth
- Create an OnboardingJob record (status: QUEUED)
- Trigger a Cloud Run Job via gRPC
- Job boots Next.js, hits its own API endpoint, runs 9-wave onboarding
- Progress messages stream to WhatsApp throughout
The code was solid. cloud-run-jobs.ts cached the JobsClient, constructed the full job name, passed env overrides. job-runner.sh waited for the server health check before triggering the endpoint. The OnboardingJob table tracked status, wave progress, errors, timestamps.
One problem: there was no Cloud Run Job resource in GCP.
The Fallback That Masked the Bug
The triggerOnboardingJob function had a sensible local development fallback:
if (!jobName) {
  // Local dev fallback: call the job endpoint directly
  fetch(`${getAppUrl()}/api/jobs/run-onboard`, { ... });
  return;
}
When CLOUD_RUN_JOB_NAME isn't set, it fires a fire-and-forget HTTP POST to itself. In local dev this works: the dev server handles the request in the background.
In production, this "fallback" became the primary path. The service called its own endpoint, which ran the full onboarding pipeline synchronously inside the original request's execution context. The OAuth callback returned immediately (the fetch is fire-and-forget), but the actual work ran inside the same Cloud Run instance, competing for resources with incoming requests and subject to the service's request timeout rather than the generous 1800-second timeout we had configured for the job.
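One way to keep a dev-only fallback from silently becoming the production path is to make it refuse to run outside local development. A hypothetical guard (the function name and the choice of NODE_ENV as the signal are illustrative, not from our codebase):

```typescript
// Hypothetical guard: refuse the in-process HTTP fallback in production.
// requireJobName and the NODE_ENV check are illustrative, not our actual code.
function requireJobName(): string | undefined {
  const jobName = process.env.CLOUD_RUN_JOB_NAME;
  if (!jobName && process.env.NODE_ENV === "production") {
    // Failing loudly here would have exposed the missing Cloud Run Job
    // resource on the very first production onboarding.
    throw new Error(
      "CLOUD_RUN_JOB_NAME is unset in production; refusing HTTP fallback"
    );
  }
  return jobName; // undefined => local dev, fallback is acceptable
}
```

The point is not the specific check but the asymmetry: a fallback that degrades silently hides misconfiguration, while one that fails fast turns a missing resource into a deploy-time incident instead of a timeout mystery.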
Three Things We Missed
1. The job resource was never created. deploy.sh ran gcloud run deploy for the service, but nobody ran gcloud run jobs create for the job. The Docker image, the entrypoint script, the job-runner: all present. The GCP resource: absent.
2. CLOUD_RUN_JOB_NAME was never set on the service. Even if the job existed, the service didn't know about it. The env var that switches from HTTP fallback to gRPC trigger was missing from the deploy command.
3. curl wasn't in the container. The job-runner.sh script uses curl for health checks and the job trigger, but node:22-slim doesn't include it. The job would have failed at the health check loop even if everything else was correct.
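The health check loop in point 3 is the standard wait-then-trigger pattern. A reconstruction of what such a loop typically looks like (this is a sketch, not the actual job-runner.sh; the URLs and retry counts are illustrative):

```shell
#!/usr/bin/env bash
# Sketch of a health-check loop like job-runner.sh's (reconstructed, not the
# actual script). Without curl in the image, this fails on the first attempt.
wait_for_health() {
  local url="$1" attempts="${2:-30}" delay="${3:-1}"
  for ((i = 1; i <= attempts; i++)); do
    # -s silences progress output; -f makes HTTP errors return non-zero
    if curl -sf "$url" > /dev/null; then
      echo "server healthy after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "server never became healthy" >&2
  return 1
}

# Usage inside the job entrypoint (illustrative):
#   node server.js &
#   wait_for_health "http://localhost:3000/api/health" || exit 1
#   curl -sf -X POST "http://localhost:3000/api/jobs/run-onboard"
```

With curl missing from node:22-slim, the script dies before the server is ever polled, so the job would have failed at step one even with the resource and env var in place.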
The Fix
Three files, three lines of real change:
deploy.sh - tell the service about the job:
--update-env-vars CLOUD_RUN_JOB_NAME=jiwa-onboard
Dockerfile - add curl to the container:
RUN apt-get update -y && apt-get install -y openssl curl fonts-liberation ...
scripts/deploy-job.sh - create the actual Cloud Run Job resource, reusing the same Docker image as the service, with bash scripts/job-runner.sh as the entrypoint.
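A sketch of what that script might contain. The gcloud run jobs deploy command and its flags are real, but the region, argument handling, and retry settings here are assumptions, not our actual deploy-job.sh:

```shell
#!/usr/bin/env bash
# Sketch of scripts/deploy-job.sh (illustrative; region and arg handling
# are assumptions). `jobs deploy` creates the job or updates it if it exists.
set -euo pipefail

IMAGE="$1"  # same image tag that deploy.sh just pushed for the service

gcloud run jobs deploy jiwa-onboard \
  --image "$IMAGE" \
  --region us-central1 \
  --command bash \
  --args scripts/job-runner.sh \
  --task-timeout 1800 \
  --max-retries 0
```

Using `jobs deploy` rather than `jobs create` keeps the script idempotent, so it can run on every deploy alongside gcloud run deploy for the service.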
The Debugging Signal We Ignored
The GCP Console told us everything. The Cloud Run Jobs tab was empty: no job had ever been created. But we never checked, because the onboarding "worked" in production. Users connected Instagram, received progress messages, got their content. The fallback HTTP path handled it.
Until it didn't. Long onboardings with many products would silently die when the service timeout hit, leaving no failure record to investigate: the status update to RUNNING had happened, but the catch block that would have marked the job FAILED ran inside a request that had already been terminated.
Lesson
Infrastructure code needs the same deployment checklist as application code. We had four layers of async job infrastructure (schema, trigger, runner, progress tracking) but zero layers of "does the job actually exist in the cloud provider?"
The fix was embarrassingly small. The debugging was not, because a well-designed fallback made the system appear to work correctly while operating in a fundamentally different mode than intended.