Cutting LLM Cost 95% Without Losing Accuracy

When ApplyPass started scaling, the OpenAI bill started scaling with it — faster than revenue. We were running GPT-4 on every job listing we ingested, hundreds of thousands a month. The model was excellent. It was also the single largest line item in our infrastructure cost.

We moved the highest-volume workloads to a model roughly 20× cheaper and kept accuracy within noise. This is how — and, more importantly, the method, which outlives any particular pair of model names.

A note on model names: this write-up says "GPT-4" and "GPT-3.5" because that's what we actually ran. The names will age. The method — eval-gated downgrade — applies just as well to "Opus → Haiku" or "the frontier model → the cheap one" today. Read it as a pattern, not a recipe.

The instinct to resist

The obvious move when an LLM bill hurts is to just swap the model string and hope. Change gpt-4 to something cheaper, redeploy, watch the bill drop.

The bill does drop. The problem is you have no idea what else dropped with it. Cheaper models fail differently, not uniformly — they don't get 8% worse at everything; they fall off a cliff on specific inputs and stay fine on the rest. If you can't see which inputs, you're not optimizing cost, you're trading a known cost for an unknown quality regression. That trade has a way of coming due during an incident.

So the first thing we built wasn't a cheaper pipeline. It was a way to measure one.

Step 1: build the eval harness first

You cannot migrate what you cannot measure. Before touching the model, we built an evaluation dataset and a harness to score against it.

The dataset was a few hundred real job listings, each with a ground-truth classification we trusted — some hand-labeled, some from outputs we'd manually audited. The harness ran any model/prompt combination against that set and reported per-field accuracy.

async def evaluate(model: str, prompt_fn) -> EvalReport:
    results = []
    for case in dataset:
        prediction = await classify(model, prompt_fn, case.input)
        results.append(score(prediction, case.expected))
    return EvalReport(
        model=model,
        exact_match=mean(r.exact_match for r in results),
        per_field=aggregate_by_field(results),   # where does it actually fail?
        failures=[r for r in results if not r.exact_match],
    )

The per_field breakdown mattered more than the headline number. "GPT-3.5 is 86% accurate" tells you nothing actionable. "GPT-3.5 matches GPT-4 on title and location, but drops 14 points on seniority and mishandles compensation ranges" tells you exactly what to fix.

This harness became permanent infrastructure. Every prompt change, every model change, every fine-tune since has been gated by it.

Step 2: find the gap, don't guess it

With the harness, the migration stopped being a gamble and became a worklist.

We ran the cheaper model against the eval set with the existing GPT-4 prompt. It scored well overall — and the per-field report showed the damage was concentrated. A few specific fields carried almost all the regression. The rest were already at parity.

That reframes the whole project. You're no longer "making a worse model as good as a better one." You're closing three or four specific gaps. That's a tractable engineering task with a definition of done.

Step 3: close the gap — prompt first, then fine-tune

We closed the gap in two passes, cheapest lever first.

Prompt optimization. A prompt written for GPT-4 is often under-specified — GPT-4 fills the gaps with inference that a smaller model won't. So we made the implicit explicit for the failing fields: tighter definitions, an enum of allowed values instead of free text, one or two few-shot examples drawn from the failure cases themselves. Cheap to try, instant to measure, and it recovered a solid chunk of the regression on its own.

It also wasn't enough. Prompt work has a ceiling.

Fine-tuning for the rest. For the fields that prompting couldn't fix, we fine-tuned. The training set wasn't synthetic — it was our own audited GPT-4 outputs, which made fine-tuning effectively a distillation: teach the cheap model to imitate the expensive one, but only where it was actually failing.

This is also where structured outputs earned their keep. The classifier returns data through a function-call schema, not free-form text — so "accuracy" means field values, never parsing. (I wrote that pattern up separately in Structured Outputs > Prompting.) A fine-tuned smaller model holding a strict schema is a genuinely strong, cheap classifier.

The combination — tightened prompt plus a targeted fine-tune — brought the high-volume classifier from the low 80s to 95% field accuracy, at parity with the GPT-4 baseline on the fields that mattered.

Step 4: roll out behind the harness

We did not flip a flag and walk away. The rollout was staged:

Shadow. Run the new pipeline alongside the old one on live traffic, serve the old result, log both. Compare in aggregate — real inputs surface distribution skew an eval set never will.
Canary. Once shadow agreed with production, route a small slice of real traffic to the new pipeline and watch the downstream metrics, not just the model score.
Cutover. Widen the slice gradually. Keep the old path one config change away for as long as it's cheap to keep.

At every stage the eval harness was the gate. Nothing advanced without passing it.

Results

Metric	Before (GPT-4)	After (fine-tuned, cheaper)
Cost per 1K classifications	baseline	~5% of baseline
Field accuracy (high-volume classifier)	baseline	within noise / matched
Job-filter accuracy	~80%	~95%
Parse failures	rare	rare (schema-enforced)

The ~95% cost reduction was real and it held. But the number I'd actually point to is the eval harness, because it's the thing that made the cost number safe — and it's the thing still paying dividends on every change since.

The method, not the models

Strip out the model names and the procedure is fully general:

Build the eval harness before you optimize. Per-field, per-segment — not one headline number.
Measure the cheap model with the old prompt to localize the regression. The damage is almost always concentrated.
Close the gap cheapest-lever-first: prompt, then few-shot, then a targeted fine-tune. Distill from your own audited outputs.
Roll out staged — shadow, canary, cutover — with the harness as the gate at every step.

Frontier models keep getting better and a tier of "good enough, far cheaper" keeps forming underneath them. Eval-gated downgrade is how you harvest that gap on purpose, again and again, without betting the product on a hope.

Working on LLM cost at scale? I'd love to compare notes — find me on LinkedIn.