
The Six-Download Problem: Why We Cache What We Already Have

Jiwa AI Team

Fast and Wasteful

Our carousel pipeline generates six slides from a single base image. Each slide gets a different text overlay (a hook, credibility line, educational points, and a call to action) composited onto the same underlying photo. To maximize speed, we run all six overlay operations in parallel.

Parallelization felt like the right move. Six independent image processing tasks, no dependencies between them, perfect candidates for concurrent execution. And it worked. But it was slower than it should have been.

The Hidden Bottleneck

Each text overlay operation starts by fetching the source image from our storage service. That's a network request: download the image, buffer it in memory, then process it. When we ran six of these in parallel, we were making six identical network requests for the same image at the same time.

Six round-trips to the CDN. Six downloads of the same bytes. Six copies of the same image sitting in memory. For a one-megabyte image, that's five megabytes of redundant network transfer, with all six requests running concurrently and competing for the same bandwidth.
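The shape of the problem can be sketched with a toy asyncio pipeline. The names here (`fetch_image`, `apply_overlay`, `generate_carousel`) are hypothetical stand-ins, not our actual code, and the "download" and "compositing" steps are simulated, but the fan-out structure is the same:

```python
import asyncio

FETCH_COUNT = 0  # counts simulated CDN hits

async def fetch_image(url: str) -> bytes:
    """Stand-in for downloading the base image from the CDN."""
    global FETCH_COUNT
    FETCH_COUNT += 1
    await asyncio.sleep(0)  # placeholder for 200-800ms of network latency
    return b"base-image-bytes"

async def apply_overlay(url: str, text: str) -> bytes:
    # Each overlay task independently re-downloads the same source image.
    image = await fetch_image(url)
    return image + text.encode()  # placeholder for SVG render + composite

async def generate_carousel(url: str, overlays: list[str]) -> list[bytes]:
    # Six independent tasks run concurrently -- and make six identical requests.
    return await asyncio.gather(*(apply_overlay(url, t) for t in overlays))

overlays = ["hook", "credibility", "edu-1", "edu-2", "edu-3", "cta"]
slides = asyncio.run(generate_carousel("https://cdn.example.com/base.jpg", overlays))
print(FETCH_COUNT)  # -> 6: one download per slide, all for the same bytes
```

Every task looks self-contained and correct on its own; the redundancy only shows up when you count the fetches across the whole fan-out.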

The overlay processing itself (analyzing the image for text placement, rendering the SVG, compositing) takes around fifty milliseconds per slide. The image download could take anywhere from two hundred to eight hundred milliseconds depending on CDN latency. The actual work was fast; the setup was slow.

Fetch Once, Process Many

The fix was almost embarrassingly simple: download the image once before the parallel step, then pass the buffer to each overlay operation. Instead of each task independently fetching its own copy, they all share a reference to the same pre-loaded data.

We added an optional parameter to the overlay function that accepts a pre-fetched image buffer. When provided, it skips the download entirely and works directly with the buffer. The carousel pipeline fetches once, then fans out the six overlay operations with the shared buffer.

Total network requests went from six to one. Memory usage dropped by roughly five-sixths for the base image. And the wall-clock time for carousel generation dropped noticeably, in some cases shaving several seconds off the total.

The General Principle

This is a pattern that shows up everywhere in parallel processing: shared inputs to parallel tasks should be resolved before the fan-out, not inside each branch. It's the same reason you'd load a configuration file once at startup rather than reading it from disk in every request handler.

The trap is that parallelization makes the problem invisible at first. When tasks run concurrently, the redundant work happens simultaneously, so it doesn't look like sequential waste. You see "six tasks took 800ms each in parallel" and think "that's just how long they take." You don't immediately realize that 700ms of each task is the same download.

When Parallel Isn't Parallel Enough

There's an irony here. We parallelized for performance, but the parallelization introduced redundant work that negated much of the benefit. The tasks were compute-independent but resource-dependent: they all needed the same input from the same external source.

True parallelization isn't just about running things concurrently. It's about ensuring that concurrent tasks aren't secretly serialized behind a shared bottleneck. In our case, that bottleneck was the CDN connection. Six concurrent downloads from the same origin can hit connection limits, rate limiting, or simply compete for the same network pipe.

A Checklist for Parallel Work

Before fanning out parallel tasks, it's worth asking: do these tasks share any inputs? If yes, resolve those inputs first. Do they share any outputs? If yes, coordinate writes. Do they share any external resources? If yes, consider whether the external resource can handle the concurrency.

The fastest parallel pipeline is one where each branch does only unique work, and all shared work happens exactly once before the split. It sounds obvious written down, but it's easy to miss when each individual task looks self-contained and correct. Ours certainly did: every slide rendered perfectly. It just downloaded the same image six times to get there.