
Why We Moved Onboarding to Cloud Run Jobs

Jiwa AI Team

The 300-Second Wall

When a user sends /daftar on WhatsApp, our pipeline scrapes their website, analyzes their brand with Claude, matches AI influencers, generates a content calendar, creates images with Flux via Fal AI, scores quality, and delivers previews back to WhatsApp. Nine waves of work, six external APIs, and dozens of parallel operations.

All of it ran inside a single HTTP request with a 300-second timeout.

Most of the time, it worked. The pipeline completed in 2-4 minutes, the user received their content previews, and everyone was happy. But "most of the time" isn't good enough when your product's first impression depends on it.

How Timeouts Kill Silently

The failure mode was insidious. A user sends /daftar, connects their Instagram, gets a "we're creating your content" message, and then... nothing. No error message. No preview images. No indication of what went wrong.

What happened behind the scenes: the pipeline reached Wave 8 (image generation) with 30 seconds left on the clock. Fal AI needs 60-120 seconds per image. The soft timeout fired, the pipeline threw an error, Cloud Run killed the request, and the WhatsApp notification about the failure raced against the container shutdown. Sometimes the failure message got through. Sometimes it didn't.

We had a 270-second soft timeout designed to handle this gracefully: bail out before Cloud Run's hard 300-second limit. But "gracefully bailing out of image generation" is an oxymoron. Either you generate the images or you don't. There's no partial credit for half a Flux render.
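
A minimal sketch of that old guard, in TypeScript (illustrative only; the actual implementation in onboard.ts is not reproduced in this post):

```typescript
// Sketch of the old soft-timeout guard (illustrative, not the exact code).
const SOFT_TIMEOUT_MS = 270_000; // bail before Cloud Run's hard 300s limit

function timeBudgetExceeded(startedAt: number, now: number = Date.now()): boolean {
  return now - startedAt >= SOFT_TIMEOUT_MS;
}

// Wave 8 had to check this before every image, and "bailing" meant abandoning
// the whole image-generation step, not finishing it gracefully.
```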

The Real Problem: HTTP Requests Aren't Jobs

The architecture was wrong at a fundamental level. Onboarding isn't a request-response interaction; it's a background job. The user doesn't sit at a loading screen waiting for a response. They get a WhatsApp message saying "we're working on it" and go about their day. The pipeline sends them progress updates as it runs.

Using an HTTP request as the execution context for a 5-minute background job meant we inherited constraints we didn't need: timeouts, concurrency limits shared with web traffic, and a process lifecycle tied to a network connection.

Cloud Run Jobs: Same Container, No Timeout

Cloud Run Jobs run the same Docker image as Cloud Run services, but without the HTTP request lifecycle. A job starts a container, runs it to completion, and exits. The timeout ceiling goes from 5 minutes to 24 hours.

The migration was simpler than expected because of one key decision: the job container boots the same Next.js server and calls the pipeline endpoint on localhost. This means every @/lib/* import, every Prisma query, every Fal AI call works exactly as before. Zero pipeline code was restructured.

Cloud Run Job starts
  -> node server.js (boots Next.js on :8080)
  -> curl POST localhost:8080/api/jobs/run-onboard
  -> pipeline runs with no timeout
  -> container exits

The job-runner.sh script is 30 lines of bash. It waits for the server to be healthy, sends the request, and exits with the HTTP status code. Cloud Run Jobs handles retries: if the job fails, it automatically re-runs once.
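
The script itself isn't reproduced here, but its core wait-for-health loop looks roughly like this when transliterated to TypeScript (the real thing is bash, and the retry count, delay, and health endpoint are assumptions):

```typescript
// Sketch of job-runner.sh's core loop, transliterated to TypeScript.
// Retry count and delay are assumptions, not the script's real values.
async function waitForHealthy(
  check: () => Promise<boolean>, // e.g. GET http://localhost:8080/api/health
  retries = 30,
  delayMs = 1000,
): Promise<boolean> {
  for (let i = 0; i < retries; i++) {
    if (await check()) return true;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}

// Once healthy, the script POSTs to localhost:8080/api/jobs/run-onboard and
// exits with the HTTP status code so Cloud Run Jobs knows whether to retry.
```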

The Job Table

To bridge the HTTP world (OAuth callback) and the job world (Cloud Run Job), we added an OnboardingJob table:

OnboardingJob
  id, phone, instagramToken, instagramHandle, language
  status: QUEUED -> RUNNING -> COMPLETED | FAILED
  currentWave: 0-9
  businessId, error
  startedAt, completedAt

The OAuth callback creates a job record and triggers the Cloud Run Job via the Jobs API. The job reads its parameters from the database, not from environment variable overrides or message payloads. This avoids Cloud Run Jobs' 100KB payload limit and keeps sensitive data (Instagram tokens) in the database where they belong.
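
The handoff can be sketched like this. It's a hedged illustration, not the actual callback: the Prisma model name (onboardingJob) and the env var names (GCP_PROJECT, GCP_REGION, CLOUD_RUN_JOB_NAME) are assumptions, and the database write is shown as a comment:

```typescript
// Hedged sketch of the OAuth callback's handoff. Model and env names are
// assumptions based on this post, not the actual code.

// Fully qualified resource name the Cloud Run Jobs API expects.
function jobName(project: string, region: string, job: string): string {
  return `projects/${project}/locations/${region}/jobs/${job}`;
}

// Build the runJob request; in production the actual call would be
// `new JobsClient().runJob(request)` from @google-cloud/run.
function buildRunJobRequest(project: string, region: string, job: string) {
  // Note: no payload. The job reads its parameters (phone, Instagram token,
  // handle, language) from the OnboardingJob row it was created alongside:
  // await prisma.onboardingJob.create({ data: { ...params, status: "QUEUED" } });
  return { name: jobName(project, region, job) };
}
```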

Per-Image Progress: The Unexpected Win

With the timeout pressure gone, we could add something we'd always wanted: per-image WhatsApp progress updates.

Before, the user saw three vague messages:

  1. "Analyzing your brand..."
  2. "Creating content calendar..."
  3. "Generating images... almost done!"

That third message was a lie. "Almost done" meant "we're starting the slowest part and praying it finishes before the timeout."

Now the user sees:

  1. "Learning about your business..."
  2. "Analyzing your brand..." + analysis summary
  3. "Preparing your products..."
  4. "Calendar ready!" + calendar summary
  5. "Creating image 1 of 6..."
  6. "Creating image 2 of 6..."
  7. "Creating image 3 of 6..."
  8. ...
  9. "All 6 images ready! Sending previews..."

Each wave updates currentWave in the job table, and each image completion fires a WhatsApp message. The user knows exactly what's happening at every moment. If the pipeline takes 8 minutes instead of 4, they can see it's on image 5 of 6 โ€” not wondering if it crashed.
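
A simplified, sequential version of that image loop (the real pipeline runs generation under p-limit concurrency, and these function names are illustrative, not the pipeline's):

```typescript
// Simplified per-image progress loop. Sequential for clarity; the real
// pipeline uses p-limit concurrency. Function names are illustrative.
async function generateWithProgress(
  prompts: string[],
  generate: (prompt: string) => Promise<string>, // e.g. the Fal AI Flux call
  notify: (message: string) => Promise<void>,    // e.g. the WhatsApp sender
): Promise<string[]> {
  const urls: string[] = [];
  for (let i = 0; i < prompts.length; i++) {
    await notify(`Creating image ${i + 1} of ${prompts.length}...`);
    urls.push(await generate(prompts[i]));
  }
  await notify(`All ${prompts.length} images ready! Sending previews...`);
  return urls;
}
```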

What Didn't Change

The pipeline itself (performOnboarding() in onboard.ts) is architecturally identical. Nine waves, the same parallel/sequential structure, the same p-limit concurrency controls. We removed three things:

  1. SOFT_TIMEOUT_MS = 270_000: no longer needed with a 30-minute budget
  2. timeBudgetExceeded(): the quality gate no longer skips retries under time pressure
  3. maxDuration = 300: the Next.js route config that enforced the Cloud Run timeout

Removing the time pressure had a subtle quality benefit: the quality gate now always runs full retries. Previously, if the pipeline was running slow, it would skip re-generating low-scoring images to save time. Now every image gets the full quality treatment.
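
In sketch form, the gate now looks something like this (the threshold, attempt count, and function names are illustrative assumptions, not the pipeline's actual values):

```typescript
// Sketch of a quality gate that always retries low-scoring images now that
// there is no time budget. Threshold and attempt count are illustrative.
async function qualityGate(
  generate: () => Promise<string>,            // e.g. one Flux render
  score: (image: string) => Promise<number>,  // e.g. the scoring model
  threshold = 0.7,
  maxAttempts = 3,
): Promise<string> {
  let best = "";
  let bestScore = -Infinity;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const image = await generate();
    const s = await score(image);
    if (s >= threshold) return image; // good enough, stop early
    if (s > bestScore) {
      best = image;
      bestScore = s;
    }
    // Previously: if (timeBudgetExceeded()) break; retries were skipped
    // under time pressure. That branch is gone.
  }
  return best; // ship the best attempt even if below threshold
}
```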

Cost and Performance

The concern with any "just give it more time" solution is cost. Cloud Run Jobs use the same per-second billing as Cloud Run services. The pipeline takes the same amount of wall-clock time; we didn't make it slower, we just removed the ceiling. The only added cost is ~15-30 seconds of container cold start per job, which costs roughly $0.001.

The job container gets its own 2 vCPU / 2GB allocation, which means onboarding no longer competes with web traffic for resources. During peak hours, this isolation actually improves reliability for both the onboarding pipeline and the dashboard.

Local Development

The Cloud Run Jobs client (@google-cloud/run) is only used in production. When the CLOUD_RUN_JOB_NAME environment variable isn't set, triggerOnboardingJob() falls back to a direct HTTP call to the local Next.js server, which is exactly the old behavior. Local development works without a Cloud Tasks emulator, a GCP project, or any cloud infrastructure.
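
The dispatch boils down to one branch on that env var. A sketch (the env var name comes from this post; the function shape is an illustration, not the actual triggerOnboardingJob() code):

```typescript
// Sketch of the environment-based dispatch. The env var name is from the
// post; this function shape is an illustration, not the actual code.
type Trigger = "cloud-run-job" | "local-http";

function chooseTrigger(env: Record<string, string | undefined>): Trigger {
  // In production the deploy sets CLOUD_RUN_JOB_NAME; locally it's absent,
  // so we fall back to a plain HTTP call against the local Next.js server.
  return env.CLOUD_RUN_JOB_NAME ? "cloud-run-job" : "local-http";
}
```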

The Pattern

If you have a Next.js API route that:

  • Takes longer than your platform's request timeout
  • Sends progress updates to the user through a side channel
  • Doesn't need to return data to the HTTP caller

...it's a job, not a request. Cloud Run Jobs let you use the same container, the same code, and the same deployment pipeline. The only new artifact is a shell script that boots your server and curls localhost.