JiwaAI
Blog
โ†All posts
cost-optimization
ai
architecture
prompt-engineering

Not Every AI Task Deserves Your Best Model

Jiwa AI Team

The Reflex to Use the Best

When you're building with large language models, there's a natural gravitational pull toward using the most capable model for everything. It feels safe. The reasoning is simple: why risk a worse result when the price difference per call is just a few cents?

But those cents compound. When your system makes six to ten AI calls per user interaction, and each user onboards a business that generates a batch of posts, you start to see the bill climb. We were running every single Claude call through Sonnet, our most capable model, and it was working, but we were paying for horsepower we didn't need on half the tasks.

Sorting Tasks by Intelligence Required

The breakthrough wasn't technical. It was organizational. We sat down and categorized every AI call in our pipeline by the type of thinking it required.

Some tasks are genuinely creative. Writing Instagram captions that sound natural in Bahasa Indonesia, that weave a product into an influencer's real lifestyle, that hit the right emotional tone: that's hard. The difference between a good caption and a great one is the difference between engagement and scrolling past. These tasks earn their keep on a capable model.

But other tasks are structured extraction. Pulling a color palette from CSS hex codes and a brand description? Scoring influencers against a checklist of criteria and returning a JSON array? Generating a content calendar with dates, times, and slot assignments? These are tasks with clear inputs, clear outputs, and well-defined rubrics. A smaller, faster model handles them just as well.

The Two-Tier Approach

We added a simple model selector to our AI wrapper: every call can now specify whether it needs the full Sonnet model or the lighter Haiku model. No complex routing logic, no A/B testing framework. Just a parameter.
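As a minimal sketch, that parameter can be as small as the snippet below; the model IDs and function name are illustrative placeholders, not our production values:

```typescript
// Minimal sketch of a two-tier model selector. Model IDs and names
// here are illustrative placeholders, not production values.
type ModelTier = "sonnet" | "haiku";

const MODEL_IDS: Record<ModelTier, string> = {
  sonnet: "claude-sonnet-latest", // placeholder model ID
  haiku: "claude-haiku-latest",   // placeholder model ID
};

// Call sites opt into the cheaper tier; defaulting to Sonnet means a
// forgotten parameter fails toward quality, not toward cost.
function resolveModel(tier: ModelTier = "sonnet"): string {
  return MODEL_IDS[tier];
}
```

Each call site either passes `resolveModel("haiku")` or omits the argument entirely; no routing layer, no heuristics.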

Six calls moved to Haiku: brand color analysis, influencer-brand matching, content calendar generation, mood board style analysis, post quality scoring, and carousel slide text generation. Three calls stayed on Sonnet: business profile analysis (the core intelligence extraction), caption writing (user-facing creative work), and image prompt engineering (where prompt quality directly determines visual output).

The decision framework was straightforward. If the output is user-facing creative text, keep it on Sonnet. If it's structured data extraction, scoring, or planning, switch to Haiku.

Here's exactly how every AI call in our pipeline maps out:

| AI Task | Model | Why |
|---|---|---|
| Business profile analysis | Sonnet | Core intelligence: extracts products, keywords, trends from raw website data |
| Caption + hashtag writing | Sonnet | User-facing creative text in Bahasa Indonesia / English |
| Image prompt engineering | Sonnet | Prompt quality directly determines visual output |
| Brand color analysis | Haiku | Structured extraction from CSS hex codes |
| Influencer-brand matching | Haiku | Scoring against a defined rubric |
| Content calendar generation | Haiku | Slot-filling with dates, times, categories |
| Mood board style analysis | Haiku | Style classification with clear criteria |
| Post quality scoring | Haiku | Numerical scoring against a checklist |
| Carousel slide text | Haiku | Short overlay text, structured 6-slide format |

The Full Cost Picture

Before optimizing, here's what every business onboarding actually costs us: every AI call, every image generation, broken down step by step.

Onboarding Cost (per business)

| Step | Model | Vision? | Before (Sonnet) | After (Haiku) |
|---|---|---|---|---|
| Business analysis | Claude Sonnet | Yes (if images) | ~$0.02–0.05 | ~$0.02–0.05 (kept on Sonnet) |
| Theme analysis | Claude Sonnet → Haiku | No | ~$0.01 | ~$0.003 |
| Mood board analysis | Claude Sonnet → Haiku | Yes (conditional) | ~$0.01–0.03 | ~$0.003–0.008 |
| Product analysis | Claude Sonnet | No | ~$0.01 | ~$0.01 (kept on Sonnet) |
| Influencer matching | Claude Sonnet → Haiku | No | ~$0.01 | ~$0.003 |
| Calendar generation | Claude Sonnet → Haiku | No | ~$0.01 | ~$0.003 |
| Vibe images (×3) | fal flux/dev | No | ~$0.075 | ~$0.075 (unchanged) |
| Total onboarding | | | ~$0.12–0.18 | ~$0.08–0.12 |

Post Generation Cost (per batch of ~6 posts)

| Step | Model | Before | After |
|---|---|---|---|
| Captions + hashtags (1 call) | Claude Sonnet | ~$0.02 | ~$0.02 (kept on Sonnet) |
| Quality scoring (1 call) | Claude Sonnet → Haiku | ~$0.01 | ~$0.003 |
| Image gen ×6 (flux/dev) | fal.ai | ~$0.15 | ~$0.15 (unchanged) |
| Face gen (flux-pulid, if UGC) | fal.ai | ~$0.04/each | ~$0.04/each (unchanged) |
| Product composite (flux-general) | fal.ai | ~$0.035/each | ~$0.035/each (unchanged) |
| Background removal (birefnet) | fal.ai | ~$0.01/each | ~$0.01/each (unchanged) |
| Total ~6 posts | | ~$0.20–0.40 | ~$0.19–0.39 |

Carousel Slides (per carousel post)

| Step | Model | Before | After |
|---|---|---|---|
| Slide content (1 call) | Claude Sonnet → Haiku | ~$0.01 | ~$0.003 |
| Text overlay (Sharp, local) | No AI | Free | Free |
| Total per carousel | | ~$0.01 | ~$0.003 |

Post Customization (per edit)

| Step | Model | Before | After |
|---|---|---|---|
| Caption rewrite | Claude Sonnet | ~$0.01 | ~$0.01 (kept on Sonnet) |
| Image regen (if needed) | fal.ai | ~$0.025–0.04 | ~$0.025–0.04 (unchanged) |
| Total per customization | | ~$0.01–0.05 | ~$0.01–0.05 |

Summary per Business (full onboarding + first batch)

| Component | Before (All Sonnet) | After (Mixed) | Savings |
|---|---|---|---|
| Onboarding analysis | ~$0.12–0.18 | ~$0.08–0.12 | ~33% |
| First 6 posts generation | ~$0.20–0.40 | ~$0.19–0.39 | ~5% |
| Total per business | ~$0.32–0.58 | ~$0.27–0.51 | ~15–25% |

The Claude-only savings are much more dramatic, roughly fifty percent, but image generation (fal.ai) dominates total cost and stays unchanged. The pricing difference between model tiers tells the story:

| Model | Input Cost | Output Cost |
|---|---|---|
| Claude Sonnet | $3.00 / 1M tokens | $15.00 / 1M tokens |
| Claude Haiku | $0.80 / 1M tokens | $4.00 / 1M tokens |

At scale, those fractions compound. A thousand onboardings per month saves fifty to seventy dollars on Claude alone, and that's before factoring in Haiku's faster response times.
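The per-call arithmetic behind those figures falls straight out of the price table; the token counts in this sketch are illustrative assumptions, not measured values:

```typescript
// Back-of-the-envelope cost per call, using the per-million-token
// prices from the table above. Token counts are illustrative.
const PRICES = {
  sonnet: { input: 3.0, output: 15.0 }, // $ per 1M tokens
  haiku: { input: 0.8, output: 4.0 },   // $ per 1M tokens
};

function callCost(
  tier: keyof typeof PRICES,
  inputTokens: number,
  outputTokens: number,
): number {
  const p = PRICES[tier];
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}

// A hypothetical extraction call: ~2,000 tokens in, ~500 tokens out.
const sonnetCost = callCost("sonnet", 2000, 500); // ≈ $0.0135
const haikuCost = callCost("haiku", 2000, 500);   // ≈ $0.0036
```

At these assumed token counts the tier switch cuts the call to roughly a quarter of its Sonnet cost, which matches the ~$0.01 → ~$0.003 rows in the tables above.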

Why This Isn't Just About Cost

The savings are real, but there's a subtler benefit: speed. Haiku responds faster than Sonnet. For tasks in the onboarding pipeline that run sequentially, shaving a second off each call adds up. Users see their content calendar and influencer matches faster.

There's also a resilience argument. During peak usage or rate limit scenarios, having some calls on a different model tier distributes load. If one model is experiencing latency, the other might not be.

The Tasks That Surprised Us

Quality scoring was the call we debated longest. It evaluates whether a generated caption feels authentic: is the product placement natural or forced? Does the influencer's voice come through? You'd think this requires sophisticated judgment.

It turns out, when you give a smaller model a well-structured rubric with clear criteria and examples, it scores almost identically to the larger model. The rubric does the heavy lifting, not the model size. This is a pattern worth remembering: a clear prompt can compensate for a smaller model.
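A rubric-driven scoring prompt along those lines might look like the sketch below; the criteria and JSON shape are illustrative, not our production rubric:

```typescript
// Sketch of a rubric-driven scoring prompt. The criteria and the JSON
// shape are illustrative examples, not the actual production rubric.
function buildScoringPrompt(caption: string): string {
  return [
    "Score this Instagram caption from 1-5 on each criterion.",
    "Criteria:",
    "1. Product placement feels natural, not forced.",
    "2. The influencer's voice and tone come through.",
    "3. The language reads as native (Bahasa Indonesia or English).",
    'Return ONLY a JSON object: {"placement": n, "voice": n, "language": n}',
    "",
    `Caption: ${caption}`,
  ].join("\n");
}
```

Because the criteria, scale, and output shape are all pinned down in the prompt, the model's job shrinks from "judge quality" to "apply a checklist", which is exactly where a smaller model holds its own.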

Calendar generation was the opposite surprise: we expected it to be trivial, but the first few Haiku attempts occasionally produced invalid product IDs or missed the content category distribution rules. We solved this by tightening the prompt constraints, and now it performs reliably.
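Beyond tightening the prompt, a cheap complementary guard is to validate the structured output before accepting it; the slot shape and the distribution rule below are illustrative assumptions, not our actual schema:

```typescript
// Sketch: validate a generated calendar before accepting it. The slot
// shape and the half-the-slots rule are illustrative assumptions.
interface CalendarSlot {
  date: string; // ISO date, e.g. "2025-01-01"
  productId: string;
  category: "promo" | "educational" | "lifestyle";
}

function validateCalendar(
  slots: CalendarSlot[],
  knownProductIds: Set<string>,
): string[] {
  const errors: string[] = [];
  // Catch hallucinated product IDs.
  for (const slot of slots) {
    if (!knownProductIds.has(slot.productId)) {
      errors.push(`unknown product ID: ${slot.productId}`);
    }
  }
  // Illustrative distribution rule: no category may fill more than
  // half the calendar.
  const counts: Record<string, number> = {};
  for (const slot of slots) {
    counts[slot.category] = (counts[slot.category] ?? 0) + 1;
  }
  for (const category of Object.keys(counts)) {
    if (counts[category] > slots.length / 2) {
      errors.push(`category "${category}" fills more than half the slots`);
    }
  }
  return errors; // empty array means accept; otherwise retry the call
}
```

An empty result accepts the calendar; anything else triggers a retry, which pairs naturally with tighter prompt constraints.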

The Principle

The lesson generalizes beyond our specific use case. If you're building AI-powered products, resist the default of routing everything through your most expensive model. Audit each call. Ask: what kind of thinking does this actually require? Creative generation, nuanced analysis, and open-ended reasoning deserve your best model. Structured extraction, scoring against criteria, and slot-filling work fine with a lighter one.

The best model for the job isn't always the most capable one. Sometimes it's the one that's fast, cheap, and exactly smart enough.