Teaching AI Where to Put the Text
The Problem with Fixed Positions
Put text in the top-left corner and you cover someone's face. Put it at the bottom and you obscure the product. Center it and it clashes with the busiest part of the image. When you're generating hundreds of Instagram posts with AI-created backgrounds, every image is different, and a fixed text position fails on at least a third of them.
We tried the obvious solution first: always place text at the bottom-left with a dark background panel. It worked better than random placement, but it still produced awkward results. A beach sunset with all its drama at the bottom horizon would get covered. A product flat-lay with items in the lower-left would have text competing with the subject.
Reading the Image Before Writing On It
The insight was simple: before placing text, ask the image where it has room. Not with another AI model call, which would add cost and latency. Instead, we use basic image statistics that any image processing library can compute in milliseconds.
The idea borrows from photography composition. Photographers naturally place text in areas with low visual detail: solid color regions, smooth gradients, soft bokeh backgrounds. These areas have low variance in their pixel values. A busy, detailed region has high variance, with lots of different colors and edges packed together.
We test five candidate positions on every image: bottom-left, bottom-center, top-left, top-center, and center. For each position, we extract the rectangular region where text would be placed and compute the standard deviation of color values across all three color channels. The position with the lowest combined standard deviation wins: it's the calmest, most uniform area, and therefore the best place for readable text.
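A minimal sketch of this scoring, assuming Pillow and NumPy are available. The candidate region fractions below are illustrative, not the production values:

```python
import numpy as np
from PIL import Image

# Candidate text regions as (x0, y0, x1, y1) fractions of image size.
# These proportions are assumed for illustration.
CANDIDATES = {
    "bottom-left":   (0.05, 0.70, 0.50, 0.95),
    "bottom-center": (0.25, 0.70, 0.75, 0.95),
    "top-left":      (0.05, 0.05, 0.50, 0.30),
    "top-center":    (0.25, 0.05, 0.75, 0.30),
    "center":        (0.25, 0.375, 0.75, 0.625),
}

def region_score(pixels: np.ndarray) -> float:
    """Combined standard deviation across the R, G, B channels.
    Lower means a calmer, more uniform region."""
    return float(pixels.reshape(-1, 3).std(axis=0).sum())

def best_position(image: Image.Image) -> str:
    """Return the name of the candidate region with the lowest score."""
    arr = np.asarray(image.convert("RGB"), dtype=np.float32)
    h, w = arr.shape[:2]
    scores = {}
    for name, (x0, y0, x1, y1) in CANDIDATES.items():
        region = arr[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)]
        scores[name] = region_score(region)
    return min(scores, key=scores.get)
```

Because the whole computation is a handful of array slices and reductions, it stays well under the millisecond budget even on large images.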
A Slight Bias Toward Natural
Pure statistical optimization would sometimes choose the top of the image, which looks unnatural for Instagram content. Viewers expect captions and overlay text toward the bottom: it's where Instagram's own UI places text, and it's where the eye naturally finishes scanning an image.
So we added a subtle bias: bottom positions get a small bonus in the scoring. It's not enough to override a genuinely better region at the top, but when two positions are close in score, bottom wins. The result feels intentional rather than algorithmic.
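The bias can be folded into the scoring as a small multiplicative discount on bottom positions. The factor below is an assumed value for illustration, not the production number:

```python
# Assumed bias factor: bottom positions score 15% "calmer" than they are.
BOTTOM_BONUS = 0.85

def biased_score(name: str, raw_score: float) -> float:
    """Discount bottom positions so they win ties and near-ties,
    without overriding a genuinely calmer region elsewhere."""
    if name.startswith("bottom"):
        return raw_score * BOTTOM_BONUS
    return raw_score
```

With this factor, a bottom region beats a top region that is up to about 15% calmer, but a top region that is dramatically calmer still wins.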
Contrast That Adapts
Finding the right position is only half the problem. The text also needs to be readable against whatever colors are in that region. A white caption on a bright sky is invisible. Dark text on a shadowy background disappears.
After choosing the position, we sample the average brightness of that specific region. If the background is light, we render dark text with a subtle light stroke. If it's dark, we flip to white text with a dark stroke. The stroke itself uses a two-pass rendering technique โ the outline is drawn first with rounded joins for smoothness, then the fill is drawn on top for crisp edges.
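A sketch of the adaptive contrast step using Pillow, whose `stroke_width`/`stroke_fill` options render the outline-then-fill two-pass described above. The brightness threshold and the exact colors are assumptions:

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def pick_text_colors(image: Image.Image, box: tuple) -> tuple:
    """Sample the mean brightness of the chosen region and return
    (fill, stroke) colors. Threshold of 127 is an assumed midpoint."""
    x0, y0, x1, y1 = box
    region = np.asarray(image.convert("L"))[y0:y1, x0:x1]
    if region.mean() > 127:                      # light background
        return (20, 20, 20), (245, 245, 245)     # dark fill, light stroke
    return (255, 255, 255), (10, 10, 10)         # light fill, dark stroke

def draw_caption(image: Image.Image, box: tuple, text: str,
                 font: ImageFont.ImageFont) -> None:
    """Draw text at the top-left of `box` with adaptive contrast.
    Pillow draws the stroke first, then the fill on top."""
    fill, stroke = pick_text_colors(image, box)
    draw = ImageDraw.Draw(image)
    draw.text((box[0], box[1]), text, font=font, fill=fill,
              stroke_width=3, stroke_fill=stroke)
```

Sampling only the chosen region, rather than the whole image, matters: a dark photo with one bright patch still gets the right colors if the text lands on the bright patch.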
For cases where even adaptive color isn't enough, such as highly varied backgrounds with both light and dark patches, we fall back to a semi-transparent dark panel behind the text. This guarantees readability on any image, while still looking cleaner than a solid black bar.
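The panel fallback might look like this with Pillow. The opacity, corner radius, and panel color are illustrative choices, not the production values:

```python
from PIL import Image, ImageDraw

def add_text_panel(image: Image.Image, box: tuple,
                   opacity: int = 140) -> Image.Image:
    """Composite a semi-transparent dark panel behind the text region.
    `opacity` of 140/255 (~55%) is an assumed default."""
    overlay = Image.new("RGBA", image.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    draw.rounded_rectangle(box, radius=12, fill=(0, 0, 0, opacity))
    return Image.alpha_composite(image.convert("RGBA"), overlay)
```

In practice the fallback would trigger when the chosen region's standard deviation stays above some threshold at every candidate position, meaning no calm area exists anywhere.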
Why Not Use an AI Vision Model?
We considered it. Send each image to a vision model, ask it to identify the best text region, and use those coordinates. It would probably work well, but the economics don't make sense at scale. Each vision call costs five to twenty times more than the statistical approach, adds hundreds of milliseconds of latency, and requires a network round-trip that can fail.
The statistical method runs locally, takes under fifty milliseconds, never fails, and produces results that are good enough for ninety-five percent of images. For the remaining five percent, the background panel fallback catches them. Perfect is the enemy of shipped.
The Compound Effect
This matters more than it seems. When a brand reviews their generated content calendar and every single post has clean, readable text that doesn't cover the subject, they trust the system. When one post has invisible text and another has text over a face, they question everything.
Visual consistency is trust. And trust is what makes someone approve a post for publishing rather than requesting a manual revision. Every avoided revision is time saved for the business owner and compute saved for us. A fifty-millisecond image analysis step pays for itself many times over.