JiwaAI
Blog
โ†All posts
cost-optimization
ai
architecture
prompt-engineering

Not Every AI Task Deserves Your Best Model

Jiwa AI Team

The Reflex to Use the Best

When you're building with large language models, there's a natural gravitational pull toward using the most capable model for everything. It feels safe. The reasoning is simple: why risk a worse result when the price difference per call is just a few cents?

But those cents compound. When your system makes six to ten AI calls per user interaction, and each user onboards a business that generates a batch of posts, you start to see the bill climb. We were running every single Claude call through Sonnet, our most capable model, and it was working, but we were paying for horsepower we didn't need on half the tasks.

Sorting Tasks by Intelligence Required

The breakthrough wasn't technical. It was organizational. We sat down and categorized every AI call in our pipeline by the type of thinking it required.

Some tasks are genuinely creative. Writing Instagram captions that sound natural in Bahasa Indonesia, that weave a product into an influencer's real lifestyle, that hit the right emotional tone: that's hard. The difference between a good caption and a great one is the difference between engagement and scrolling past. These tasks earn their keep on a capable model.

But other tasks are structured extraction. Pulling a color palette from CSS hex codes and a brand description? Scoring influencers against a checklist of criteria and returning a JSON array? Generating a content calendar with dates, times, and slot assignments? These are tasks with clear inputs, clear outputs, and well-defined rubrics. A smaller, faster model handles them just as well.

The Two-Tier Approach

We added a simple model selector to our AI wrapper: every call can now specify whether it needs the full Sonnet model or the lighter Haiku model. No complex routing logic, no A/B testing framework. Just a parameter.
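As a minimal sketch, that parameter can be as small as the snippet below; the model IDs and function name are illustrative placeholders, not our production values:

```typescript
// Minimal sketch of a two-tier model selector. Model IDs and names
// here are illustrative placeholders, not production values.
type ModelTier = "sonnet" | "haiku";

const MODEL_IDS: Record<ModelTier, string> = {
  sonnet: "claude-sonnet-latest", // placeholder model ID
  haiku: "claude-haiku-latest",   // placeholder model ID
};

// Call sites opt into the cheaper tier; defaulting to Sonnet means a
// forgotten parameter fails toward quality, not toward cost.
function resolveModel(tier: ModelTier = "sonnet"): string {
  return MODEL_IDS[tier];
}
```

Each call site either passes `resolveModel("haiku")` or omits the argument entirely; no routing layer, no heuristics.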

Six calls moved to Haiku: brand color analysis, influencer-brand matching, content calendar generation, mood board style analysis, post quality scoring, and carousel slide text generation. Three calls stayed on Sonnet: business profile analysis (the core intelligence extraction), caption writing (user-facing creative work), and image prompt engineering (where prompt quality directly determines visual output).

The decision framework was straightforward. If the output is user-facing creative text, keep it on Sonnet. If it's structured data extraction, scoring, or planning, switch to Haiku.

Here's exactly how every AI call in our pipeline maps out:

| AI Task | Model | Why |
|---|---|---|
| Business profile analysis | Sonnet | Core intelligence: extracts products, keywords, trends from raw website data |
| Caption + hashtag writing | Sonnet | User-facing creative text in Bahasa Indonesia / English |
| Image prompt engineering | Sonnet | Prompt quality directly determines visual output |
| Brand color analysis | Haiku | Structured extraction from CSS hex codes |
| Influencer-brand matching | Haiku | Scoring against a defined rubric |
| Content calendar generation | Haiku | Slot-filling with dates, times, categories |
| Mood board style analysis | Haiku | Style classification with clear criteria |
| Post quality scoring | Haiku | Numerical scoring against a checklist |
| Carousel slide text | Haiku | Short overlay text, structured 6-slide format |

The Full Cost Picture

Before optimizing, here's what every business onboarding actually costs us: every AI call, every image generation, broken down step by step.

Onboarding Cost (per business)

| Step | Model | Vision? | Before (Sonnet) | After (Haiku) |
|---|---|---|---|---|
| Business analysis | Claude Sonnet | Yes (if images) | ~$0.02–0.05 | ~$0.02–0.05 (kept on Sonnet) |
| Theme analysis | Claude Sonnet → Haiku | No | ~$0.01 | ~$0.003 |
| Mood board analysis | Claude Sonnet → Haiku | Yes (conditional) | ~$0.01–0.03 | ~$0.003–0.008 |
| Product analysis | Claude Sonnet | No | ~$0.01 | ~$0.01 (kept on Sonnet) |
| Influencer matching | Claude Sonnet → Haiku | No | ~$0.01 | ~$0.003 |
| Calendar generation | Claude Sonnet → Haiku | No | ~$0.01 | ~$0.003 |
| Vibe images (×3) | fal flux/dev | No | ~$0.075 | ~$0.075 (unchanged) |
| Total onboarding | | | ~$0.12–0.18 | ~$0.08–0.12 |

Post Generation Cost (per batch of ~6 posts)

| Step | Model | Before | After |
|---|---|---|---|
| Captions + hashtags (1 call) | Claude Sonnet | ~$0.02 | ~$0.02 (kept on Sonnet) |
| Quality scoring (1 call) | Claude Sonnet → Haiku | ~$0.01 | ~$0.003 |
| Image gen ×6 (flux/dev) | fal.ai | ~$0.15 | ~$0.15 (unchanged) |
| Face gen (flux-pulid, if UGC) | fal.ai | ~$0.04/each | ~$0.04/each (unchanged) |
| Product composite (flux-general) | fal.ai | ~$0.035/each | ~$0.035/each (unchanged) |
| Background removal (birefnet) | fal.ai | ~$0.01/each | ~$0.01/each (unchanged) |
| Total ~6 posts | | ~$0.20–0.40 | ~$0.19–0.39 |

Carousel Slides (per carousel post)

| Step | Model | Before | After |
|---|---|---|---|
| Slide content (1 call) | Claude Sonnet → Haiku | ~$0.01 | ~$0.003 |
| Text overlay (Sharp, local) | No AI | Free | Free |
| Total per carousel | | ~$0.01 | ~$0.003 |

Post Customization (per edit)

| Step | Model | Before | After |
|---|---|---|---|
| Caption rewrite | Claude Sonnet | ~$0.01 | ~$0.01 (kept on Sonnet) |
| Image regen (if needed) | fal.ai | ~$0.025–0.04 | ~$0.025–0.04 (unchanged) |
| Total per customization | | ~$0.01–0.05 | ~$0.01–0.05 |

Summary per Business (full onboarding + first batch)

| Component | Before (All Sonnet) | After (Mixed) | Savings |
|---|---|---|---|
| Onboarding analysis | ~$0.12–0.18 | ~$0.08–0.12 | ~33% |
| First 6 posts generation | ~$0.20–0.40 | ~$0.19–0.39 | ~5% |
| Total per business | ~$0.32–0.58 | ~$0.27–0.51 | ~15–25% |

The Claude-only savings are much more dramatic, roughly fifty percent, but image generation (fal.ai) dominates total cost and stays unchanged. The pricing difference between model tiers tells the story:

| Model | Input Cost | Output Cost |
|---|---|---|
| Claude Sonnet | $3.00 / 1M tokens | $15.00 / 1M tokens |
| Claude Haiku | $0.80 / 1M tokens | $4.00 / 1M tokens |

At scale, those fractions compound. A thousand onboardings per month saves fifty to seventy dollars on Claude alone, and that's before factoring in Haiku's faster response times.
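The per-call arithmetic behind those figures falls straight out of the price table; the token counts in this sketch are illustrative assumptions, not measured values:

```typescript
// Back-of-the-envelope cost per call, using the per-million-token
// prices from the table above. Token counts are illustrative.
const PRICES = {
  sonnet: { input: 3.0, output: 15.0 }, // $ per 1M tokens
  haiku: { input: 0.8, output: 4.0 },   // $ per 1M tokens
};

function callCost(
  tier: keyof typeof PRICES,
  inputTokens: number,
  outputTokens: number,
): number {
  const p = PRICES[tier];
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}

// A hypothetical extraction call: ~2,000 tokens in, ~500 tokens out.
const sonnetCost = callCost("sonnet", 2000, 500); // ≈ $0.0135
const haikuCost = callCost("haiku", 2000, 500);   // ≈ $0.0036
```

At these assumed token counts the tier switch cuts the call to roughly a quarter of its Sonnet cost, which matches the ~$0.01 → ~$0.003 rows in the tables above.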

Why This Isn't Just About Cost

The savings are real, but there's a subtler benefit: speed. Haiku responds faster than Sonnet. For tasks in the onboarding pipeline that run sequentially, shaving a second off each call adds up. Users see their content calendar and influencer matches faster.

There's also a resilience argument. During peak usage or rate limit scenarios, having some calls on a different model tier distributes load. If one model is experiencing latency, the other might not be.

The Tasks That Surprised Us

Quality scoring was the call we debated longest. It evaluates whether a generated caption feels authentic: is the product placement natural or forced? Does the influencer's voice come through? You'd think this requires sophisticated judgment.

It turns out, when you give a smaller model a well-structured rubric with clear criteria and examples, it scores almost identically to the larger model. The rubric does the heavy lifting, not the model size. This is a pattern worth remembering: a clear prompt can compensate for a smaller model.
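A rubric-driven scoring prompt along those lines might look like the sketch below; the criteria and JSON shape are illustrative, not our production rubric:

```typescript
// Sketch of a rubric-driven scoring prompt. The criteria and the JSON
// shape are illustrative examples, not the actual production rubric.
function buildScoringPrompt(caption: string): string {
  return [
    "Score this Instagram caption from 1-5 on each criterion.",
    "Criteria:",
    "1. Product placement feels natural, not forced.",
    "2. The influencer's voice and tone come through.",
    "3. The language reads as native (Bahasa Indonesia or English).",
    'Return ONLY a JSON object: {"placement": n, "voice": n, "language": n}',
    "",
    `Caption: ${caption}`,
  ].join("\n");
}
```

Because the criteria, scale, and output shape are all pinned down in the prompt, the model's job shrinks from "judge quality" to "apply a checklist", which is exactly where a smaller model holds its own.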

Calendar generation was the opposite surprise: we expected it to be trivial, but the first few Haiku attempts occasionally produced invalid product IDs or missed the content category distribution rules. We solved this by tightening the prompt constraints, and now it performs reliably.
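Beyond tightening the prompt, a cheap complementary guard is to validate the structured output before accepting it; the slot shape and the distribution rule below are illustrative assumptions, not our actual schema:

```typescript
// Sketch: validate a generated calendar before accepting it. The slot
// shape and the half-the-slots rule are illustrative assumptions.
interface CalendarSlot {
  date: string; // ISO date, e.g. "2025-01-01"
  productId: string;
  category: "promo" | "educational" | "lifestyle";
}

function validateCalendar(
  slots: CalendarSlot[],
  knownProductIds: Set<string>,
): string[] {
  const errors: string[] = [];
  // Catch hallucinated product IDs.
  for (const slot of slots) {
    if (!knownProductIds.has(slot.productId)) {
      errors.push(`unknown product ID: ${slot.productId}`);
    }
  }
  // Illustrative distribution rule: no category may fill more than
  // half the calendar.
  const counts: Record<string, number> = {};
  for (const slot of slots) {
    counts[slot.category] = (counts[slot.category] ?? 0) + 1;
  }
  for (const category of Object.keys(counts)) {
    if (counts[category] > slots.length / 2) {
      errors.push(`category "${category}" fills more than half the slots`);
    }
  }
  return errors; // empty array means accept; otherwise retry the call
}
```

An empty result accepts the calendar; anything else triggers a retry, which pairs naturally with tighter prompt constraints.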

The Principle

The lesson generalizes beyond our specific use case. If you're building AI-powered products, resist the default of routing everything through your most expensive model. Audit each call. Ask: what kind of thinking does this actually require? Creative generation, nuanced analysis, and open-ended reasoning deserve your best model. Structured extraction, scoring against criteria, and slot-filling work fine with a lighter one.

The best model for the job isn't always the most capable one. Sometimes it's the one that's fast, cheap, and exactly smart enough.