Safe Zones, Not Whole Images: Text Placement via Computer Vision
The Text Placement Problem Nobody Talks About
There is a step in AI-generated social content that looks trivial from the outside and turns into a real engineering problem the moment you try to automate it at scale: placing text on images.
You have a photorealistic image: an influencer holding a protein bar courtside, warm golden-hour light, perfectly composed. Now you need to overlay a short hook. Where do you put it? If you guess wrong, the text sits on the influencer's face, cuts across the product label, or lands on a patch of background so busy it becomes unreadable. The post looks amateur. Engagement drops.
This is the problem we set out to solve, not with manual cropping or hardcoded coordinates, but with computer vision.
Why Fixed Coordinates Don't Work
The naive approach is to always place text in the same spot: top-left corner, or a fixed lower-third band. It works for templated content where every image is designed around that rule. It fails completely for AI-generated images, where composition varies unpredictably.
One image might have the influencer's face at the top-left. Another might place the product there. A third might have a textured background across the entire upper half. A fixed position that works for one image silently breaks the next fifty.
We generate images across dozens of brand profiles, each with its own influencer, product, and visual theme. Manual review of every text placement was never an option.
How Safe Zone Detection Works
Before any text is placed on a generated image, we run a computer vision analysis pass on that specific image. The goal is to identify regions we call "safe zones": areas that are low in visual detail, relatively uniform in color, and free of important subjects like faces, products, and key props.
The analysis looks at the image as a whole before making any placement decision. It returns candidate zones with their coordinates and visual properties. From those candidates, we select the region that offers the highest contrast potential and the least subject matter conflict.
This means every piece of overlay text is positioned based on what is actually in that image, not based on a template that assumes what might be there.
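The approach above can be sketched as a block scan. This is a minimal illustration, not our production code: it uses local standard deviation as a stand-in for "visual detail," treats subject bounding boxes (which in practice would come from a detector) as given inputs, and ranks candidates by uniformity.

```python
import numpy as np

def find_safe_zones(img, subject_boxes, block=64, max_std=18.0):
    """Scan an RGB image in fixed-size blocks and return candidate
    safe zones: low-detail regions that avoid subject bounding boxes.

    img:           H x W x 3 uint8 array
    subject_boxes: list of (x0, y0, x1, y1) rectangles to avoid
    Returns a list of (x, y, block, std) sorted by ascending detail.
    """
    h, w, _ = img.shape
    gray = img.mean(axis=2)  # cheap luminance proxy
    candidates = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            # Skip any block that overlaps a detected subject.
            if any(x < bx1 and x + block > bx0 and
                   y < by1 and y + block > by0
                   for bx0, by0, bx1, by1 in subject_boxes):
                continue
            patch = gray[y:y + block, x:x + block]
            std = float(patch.std())  # low std = visually uniform
            if std <= max_std:
                candidates.append((x, y, block, std))
    return sorted(candidates, key=lambda c: c[3])
```

On a synthetic image with a flat "sky" in the upper half and a noisy texture below, the scan returns only the flat blocks that clear the subject boxes, which matches the behavior described above.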
Contrast Adaptation: Light vs. Dark Backgrounds
Finding the right zone is only half the problem. Once we know where to place text, we need to know how to render it.
A safe zone on a bright background needs dark text. A moody, low-key image needs white text. Get this wrong and you produce invisible text: technically placed correctly, visually useless.
Our text overlay engine samples the average luminance of the background region where text will land. Light backgrounds get dark text with a light stroke; dark backgrounds get white text with a dark stroke. This decision happens per image, per placement, automatically.
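The per-region decision can be illustrated in a few lines. This is a sketch under stated assumptions: Rec. 601 luma weights and a midpoint threshold of 128 are conventional choices, not values from our engine.

```python
import numpy as np

# Rec. 601 luma weights; the 128 threshold is an assumed midpoint.
LUMA = np.array([0.299, 0.587, 0.114])

def pick_text_style(img, zone, threshold=128):
    """Return (fill, stroke) colors for the region where text lands.

    img:  H x W x 3 uint8 array
    zone: (x, y, w, h) placement rectangle
    """
    x, y, w, h = zone
    patch = img[y:y + h, x:x + w].astype(float)
    luminance = float((patch @ LUMA).mean())
    if luminance >= threshold:
        return ("black", "white")   # light background: dark text, light stroke
    return ("white", "black")       # dark background: light text, dark stroke
```

Because the sample is taken from the placement rectangle rather than the whole image, a bright sky zone and a dark floor zone in the same photo get different treatments.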
For carousel slides that are heavily text-forward, we apply a semi-transparent background panel behind the text block, ensuring readability regardless of what the base image is doing.
The Six-Word Cap
Safe zone detection ensures text lands in a readable spot. But there is a second constraint we enforce that is equally important: the text itself must be short.
We cap all overlay text at six words.
This is not an arbitrary limit. On Instagram and TikTok, the thumb stops for visuals, not paragraphs. Overlay text that runs long competes with the image instead of amplifying it. It also shrinks the pool of valid safe zones: longer text needs more horizontal space, which pushes placement failures up.
Six words forces copy to be intentional. "Protein that moves with you." "Padel fuel. No excuses." These land. Three lines of marketing copy do not.
The constraint is enforced at the caption generation stage, before the image pipeline runs, so we never arrive at the overlay step with content that cannot be accommodated.
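Enforced as a gate at the caption stage, the check is trivial; a sketch (function and error message are illustrative, not our actual interface):

```python
MAX_OVERLAY_WORDS = 6

def validate_overlay_text(text):
    """Reject overlay copy longer than the six-word cap before the
    image pipeline runs. Returns normalized text or raises ValueError."""
    words = text.strip().split()
    if len(words) > MAX_OVERLAY_WORDS:
        raise ValueError(
            f"overlay text has {len(words)} words, max is {MAX_OVERLAY_WORDS}"
        )
    return " ".join(words)
```

Rejecting long copy here, rather than truncating it at the overlay step, keeps the failure visible to the caption generator instead of silently degrading the post.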
Why This Matters for Platform Performance
Instagram and TikTok algorithms are not neutral to image quality. Blurry images, low-contrast visuals, and cluttered compositions all correlate with lower organic reach. Text that obscures the hero product or the influencer's face creates a worse viewing experience, which drives lower save rates and shares: the signals that actually move distribution.
Getting text placement right is not just aesthetic; it is a distribution concern. A post where overlay text sits cleanly in a sky gradient above an influencer holding a product outperforms the same image with text draped across the product label. The underlying content is identical. The placement determines whether the post looks professional or looks like a mistake.
The Carousel Context
The problem compounds in carousel posts. Each slide in a six-slide carousel is a different image: different subject matter, different composition, different luminance profile. Safe zone analysis runs independently on each slide.
Hook slides use larger, more dominant typography that may span more of the frame. Content slides use smaller text with background panels to keep educational information legible. CTA slides return to clean, high-contrast placement centered on the product. Running computer vision per slide, rather than applying one carousel template, is what allows these different text treatments to each land correctly without manual intervention.
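The per-slide loop might look like the following sketch. The role names, treatment parameters, and the `analyze` callable are all hypothetical stand-ins: `analyze` represents whatever per-image safe zone pass the pipeline runs, and the font sizes are illustrative.

```python
# Hypothetical slide roles and text treatments; the real pipeline's
# names and parameters are not public, so these are illustrative.
TREATMENTS = {
    "hook":    {"font_size": 72, "panel": False},  # large, dominant type
    "content": {"font_size": 36, "panel": True},   # panel keeps copy legible
    "cta":     {"font_size": 48, "panel": False},  # clean, high-contrast
}

def plan_carousel(slides, analyze):
    """Plan text placement per slide instead of using one template.

    slides:  list of (role, image) pairs
    analyze: callable running the per-image safe zone pass,
             assumed to return a zone for the given image
    """
    plan = []
    for role, image in slides:
        zone = analyze(image)  # independent CV pass on each slide
        plan.append({"role": role, "zone": zone, **TREATMENTS[role]})
    return plan
```

The key design point is that `analyze` is called once per slide, so a hook slide with a busy top half and a CTA slide with a clean product shot each get their own placement decision.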
What We Are Building Toward
The current system solves placement and contrast reliably. What we are developing next is more nuanced: safe zone scoring that accounts for brand color data, so text placement actively harmonizes with a brand's palette rather than simply avoiding conflicts. We are also exploring depth-aware analysis, treating a bokeh background differently from a flat one, since blurred regions offer more reliable safe zones even when they contain some visual content.
AI-generated images will only get more compositionally complex as models improve. At Jiwa AI, the engineering investment in the post-generation layer is how we make sure the final post looks as deliberate as it was designed to be.
Jiwa AI builds AI-powered influencer marketing infrastructure for Southeast Asian businesses. Our image pipeline generates brand-aligned social content at scale, from photorealistic influencer images to platform-optimized text overlays.