How Concurrency Limits Reduced Our Pipeline Failures by 70%
The Symptom: Random Timeouts Under Load
Our content generation pipeline worked great with one user. Two users, fine. But the moment five or six businesses onboarded simultaneously, everything started breaking: random timeouts, rate limit errors from Claude and Fal AI, database connection exhaustion, and occasionally the entire Cloud Run instance would restart from memory pressure.
The confusing part: each individual API call was fast. Claude responded in 2-3 seconds. Fal AI generated images in 20-30 seconds. Database queries took milliseconds. So why was the pipeline timing out at 270 seconds?
The Root Cause: Unbounded Parallelism
We had optimized for speed by parallelizing everything. When analyzing 10 products, we'd fire 10 Claude vision calls simultaneously. When generating 6 post images, all 6 Fal AI calls launched at once. When saving 30 posts to the database, 30 concurrent inserts hit the connection pool.
For a single user, this was fast. For five concurrent users, this meant:
- 50 simultaneous Claude API calls (5 users × 10 products) → hitting rate limits
- 30 simultaneous Fal AI generations → exceeding account quotas
- 150 concurrent database inserts → against a pool of 10 connections
Each rate limit error triggered a retry. Each retry added more concurrent calls. The system was amplifying its own failures.
The Fix: Three Lines That Changed Everything
The solution was embarrassingly simple: p-limit, a tiny npm package that creates a concurrency semaphore.
import pLimit from "p-limit";
const limitClaude = pLimit(3); // Max 3 concurrent Claude calls
const limitFalAI = pLimit(3); // Max 3 concurrent image generations
const limitDb = pLimit(5); // Max 5 concurrent DB writes
Instead of Promise.all(products.map(async (p) => analyzeProduct(p))), we write Promise.all(products.map((p) => limitClaude(() => analyzeProduct(p)))). Same result, same parallelism, but capped at 3 concurrent calls instead of unbounded.
The work still happens in parallel โ it's just queued. With 10 products and a limit of 3, the first 3 start immediately, and each subsequent one starts as soon as a slot frees up. Total wall-clock time increases by maybe 20%, but reliability increases by an order of magnitude.
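For intuition, the semaphore behavior p-limit provides can be sketched in a few lines. This is an illustrative reimplementation only (the real package also handles edge cases like synchronous throws); `createLimit` is our name, not p-limit's API:

```javascript
// Minimal sketch of a concurrency semaphore: a counter plus a FIFO
// queue of waiting tasks. Illustrative only; use the real p-limit.
function createLimit(concurrency) {
  let active = 0;
  const queue = [];

  const next = () => {
    active--;
    if (queue.length > 0) queue.shift()(); // start the oldest waiting task
  };

  return (fn) =>
    new Promise((resolve, reject) => {
      const run = () => {
        active++;
        fn().then(resolve, reject).finally(next);
      };
      active < concurrency ? run() : queue.push(run);
    });
}

// Same shape as the pipeline code: still parallel, but capped at 3.
const limitClaude = createLimit(3);
// Promise.all(products.map((p) => limitClaude(() => analyzeProduct(p))));
```

Each task's promise resolves exactly as it would without the limiter; the only difference is when the underlying function is allowed to start.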
Where We Applied Limits
We audited every Promise.all and Promise.allSettled in the pipeline and categorized them:
Claude API calls (limit: 3) – Product visual analysis, quality scoring, caption generation, calendar planning. Claude's API has per-account rate limits, and exceeding them means 429 errors and exponential backoff delays that compound across the pipeline.
Fal AI image generation (limit: 3) – Post images, carousel covers, product composites. Each generation takes 20-40 seconds and consumes GPU resources. Unbounded parallelism doesn't make them faster; it just gets you queued on Fal's side instead of yours.
Database writes (limit: 5) – Post creation, product updates, business record saves. Our Postgres connection pool has 8 connections per instance. Leaving headroom for reads means capping writes at 5.
Storage uploads (limit: 5) – Supabase image uploads. These are fast, but each one buffers a full image in memory. 30 concurrent uploads of 2MB images means 60MB of memory pressure.
Web scraping (limit: 3) – Subpage fetches during business analysis. We were fetching all internal links in parallel, which could mean 20 simultaneous HTTP requests to a single domain, essentially DDoS-ing the business we're trying to analyze.
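The audit's numbers are easiest to reason about when they live in one place. A small config sketch (the constant name is ours, not from the real codebase):

```javascript
// Concurrency budget in one place. Values are the ones from the audit
// above; the constant name is illustrative.
const CONCURRENCY_LIMITS = {
  claudeApi: 3,      // vision analysis, scoring, captions, planning
  falAiImages: 3,    // 20-40s GPU-bound generations
  dbWrites: 5,       // pool of 8 per instance; headroom left for reads
  storageUploads: 5, // each upload buffers a full image in memory
  webScraping: 3,    // politeness cap per analyzed domain
};
```

Centralizing the numbers also makes it obvious when a new `Promise.all` slips in without a limiter attached.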
The Retry Budget Pattern
Concurrency limits solved the parallel explosion, but we had a second problem: cascading retries.
Our image generation pipeline has three quality gates: text detection, product fidelity, and realism scoring. Each gate can trigger a retry (regenerate the image and re-check). In the worst case, a single post could trigger 8 Fal AI calls and 10 Claude vision calls just through retry cascading.
We added a retry budget: a shared counter across all quality gates with a default cap of 4 retries per post.
let gateRetriesUsed = 0;
const maxRetries = options?.maxGateRetries ?? 4;

// Text detection gate
if (textDetected && gateRetriesUsed < maxRetries) {
  const retryUrl = await retryWithTimeout(context, 8000);
  gateRetriesUsed++;
}

// Product fidelity gate – skipped if budget exhausted
if (fidelityScore < 3 && gateRetriesUsed < maxRetries) {
  // ...
}
The budget forces the system to prioritize: if text detection used 2 retries, the fidelity and realism gates only get 2 retries between them. This prevents the combinatorial explosion where each gate independently decides to retry.
We also added an 8-second timeout on each retry attempt. If the image service is slow, we'd rather use the imperfect image than wait and risk timing out the entire pipeline.
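The implementation of `retryWithTimeout` isn't shown above; one plausible shape, with the regeneration task passed as a function rather than the context object used in the snippet, is a `Promise.race` against a timer that resolves to null, so the caller keeps the imperfect image instead of stalling:

```javascript
// One plausible shape for the retryWithTimeout helper (a sketch, not
// the real implementation): race the retry against a timer. On timeout
// or error, return null so the caller keeps the imperfect image.
async function retryWithTimeout(task, timeoutMs) {
  const timer = new Promise((resolve) =>
    setTimeout(() => resolve(null), timeoutMs)
  );
  try {
    return await Promise.race([task(), timer]);
  } catch {
    return null; // Regeneration failed; fall back to the existing image.
  }
}
```

A null return is the caller's signal to stop retrying that gate and move on.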
Fetch Timeouts: The Silent Killer
While auditing for concurrency issues, we discovered a subtler problem: several fetch() calls in our storage layer had no timeout. The code would download an image URL, process it with Sharp, and upload to Supabase, but if the source server was slow, the fetch would hang indefinitely.
We added a 30-second AbortController timeout and a 10MB size guard to every fetch in the storage pipeline:
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 30_000);

try {
  const response = await fetch(imageUrl, { signal: controller.signal });
  const size = parseInt(response.headers.get("content-length") || "0");
  if (size > 10 * 1024 * 1024) return imageUrl; // Skip, return original
  // ... process
} finally {
  clearTimeout(timeout);
}
On failure, the function returns the original URL instead of crashing. The image won't be optimized or resized, but the post still gets created.
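That degrade-to-original contract is easy to capture in a wrapper. A minimal sketch with hypothetical names, where the optimize step (fetch + Sharp + upload in the real pipeline) is injected so the fallback logic stands alone:

```javascript
// Sketch of the fallback contract: any failure in image optimization
// returns the original URL so post creation never blocks on it.
// safeOptimize and the injected optimize step are illustrative names.
async function safeOptimize(imageUrl, optimize) {
  try {
    return await optimize(imageUrl);
  } catch {
    // Timeout, oversized image, or processing error: the post still
    // gets created, just with the unoptimized original.
    return imageUrl;
  }
}
```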
Database Connection Pool Tuning
The default pg.Pool in Node.js creates up to 10 connections. We were using the default, which sounds reasonable until you consider that Cloud Run can auto-scale to multiple instances, each with its own pool.
With 5 instances at 10 connections each, that's 50 connections to Supabase's Postgres. Under load with 10 instances, 100 connections, potentially hitting Supabase's connection limit.
We explicitly set max: 8 with sensible timeouts:
const pool = new pg.Pool({
  connectionString: process.env.DIRECT_URL,
  max: 8,
  idleTimeoutMillis: 30_000,
  connectionTimeoutMillis: 5_000,
});
The connectionTimeoutMillis is critical: without it, a query that can't get a connection waits forever. With it, the query fails fast, the caller can handle the error, and the pipeline continues with whatever data it has.
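Fail-fast only helps if the caller catches the error and moves on. A sketch of that caller-side handling, with a hypothetical savePost helper and the pool passed in so the pattern stands alone:

```javascript
// Fail fast, then keep going: with connectionTimeoutMillis set, a
// starved query rejects within 5s instead of hanging forever.
// savePost and the posts table shape are illustrative.
async function savePost(pool, post) {
  try {
    await pool.query(
      "INSERT INTO posts (id, body) VALUES ($1, $2)",
      [post.id, post.body]
    );
    return { saved: true };
  } catch (err) {
    // Connection timeout or pool exhaustion: record it and let the
    // pipeline continue with whatever data it has.
    console.error(`post ${post.id} not saved:`, err.message);
    return { saved: false };
  }
}
```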
Results
After deploying these changes, we measured the impact across 200 onboarding sessions:
- Pipeline timeout rate: 23% → 4% (down 83%)
- Claude API 429 errors: ~15/hour → 0-1/hour
- Average onboard time: 145s → 165s (~14% slower, but actually completes)
- Database connection errors: 8/day → 0
- Memory OOM restarts: 2-3/day → 0
The pipeline is slightly slower per-request but dramatically more reliable. Users get their content calendar instead of an error message.
The Counterintuitive Lesson
Speed and reliability are not opposites, but unbounded parallelism serves neither. Firing 30 API calls simultaneously doesn't make them 30x faster when the bottleneck is the external service's rate limit. It just converts an "each call takes 3 seconds" problem into a "the whole batch fails after 90 seconds of retries" problem.
Controlled concurrency is faster in practice because it eliminates retry storms, prevents rate limiting, and keeps memory usage predictable. The queue adds a few seconds of wall-clock time but removes minutes of failure recovery.
The right question isn't "how many things can we do in parallel?" It's "how many things can we do in parallel before the system starts fighting itself?"
For us, the answer was 3 Claude calls, 3 Fal AI generations, and 5 database writes. Your numbers will be different, but the pattern is universal: measure where your external dependencies start pushing back, and set your limits just below that threshold.