I reduced token usage for my app by 80%

So I built this thing called Ask Mandi, which basically lets you ask questions about mandi (Indian agricultural market) prices in plain English and get answers back. Super simple concept. Users ask stuff like "where can i find some potatoes in mumbai?" and it figures out the SQL, queries the database, and gives them an answer.

Everything felt great until I added a token usage badge to each answer showing how much the query cost.
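
The badge itself isn't clever. The chat API reports token usage on every response, so the badge is just that number summed across the calls behind one answer. A rough sketch with the openai Node package (the model name and wiring here are illustrative, not the app's exact code):

```ts
import OpenAI from "openai";

const client = new OpenAI();

// Every chat completion response carries a usage block;
// summing total_tokens across the calls behind one answer gives the badge value.
async function askWithUsage(question: string) {
  const res = await client.chat.completions.create({
    model: "gpt-4.1-mini",
    messages: [{ role: "user", content: question }],
  });
  return {
    text: res.choices[0].message.content ?? "",
    tokens: res.usage?.total_tokens ?? 0, // prompt + completion tokens for this call
  };
}
```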

That's when I realised...I fucked up!

The realisation

I asked "where are apples cheapest today?" and the badge casually showed me I just burned through 3000 tokens.

Three. Thousand. Tokens.

To find "cheap" apples. The irony wasn't lost on me.

My first thought was that vibe coding had finally caught up with me. I figured the coding agent must've created some infinite loop or something stupid.

But on further inspection I realised nope, this was all me. I did this.

What the hell was happening

Here's what I had built:

  • A massive SQL builder prompt more than 1500 tokens long that took user questions and generated SQL queries
  • So even for simple questions that returned only a few rows, the token usage would be >1500
  • I was also using gpt-5.1 at the time which...is not cheap
  • The SQL generator kept doing the laziest thing possible: SELECT * to fetch all rows
  • Then feeding all that data to the summariser model to make sense of it

So even when the final answer was just "Apples are cheapest at Patti APMC in Punjab at ₹45/kg", which should ideally need just one row, the model would write a query that returned a bunch of rows just in case. And this was just for a simple question.
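
To make the waste concrete, here's roughly the shape of that original pipeline. This is a hedged reconstruction in TypeScript, not the real Ask Mandi code: the prompt text, the runSql helper and the model wiring are stand-ins.

```ts
import OpenAI from "openai";

const client = new OpenAI();

// Stand-in for however the app actually runs SQL against the database.
declare function runSql(sql: string): Promise<Record<string, unknown>[]>;

// ~1500 tokens of schema descriptions, examples and rules -- paid on every question.
const HUGE_SQL_BUILDER_PROMPT = "...";

async function answerQuestion(question: string) {
  // Call 1: generate SQL. The system prompt alone dwarfs the user's question.
  const sqlRes = await client.chat.completions.create({
    model: "gpt-5.1",
    messages: [
      { role: "system", content: HUGE_SQL_BUILDER_PROMPT },
      { role: "user", content: question },
    ],
  });
  const sql = sqlRes.choices[0].message.content ?? "";

  // The "safest" query: SELECT * with no LIMIT, so this is far more data than needed.
  const rows = await runSql(sql);

  // Call 2: summarise. Every unnecessary row and column becomes input tokens here.
  const summaryRes = await client.chat.completions.create({
    model: "gpt-5.1",
    messages: [
      { role: "system", content: "Answer the user's question from this data." },
      { role: "user", content: `${question}\n\n${JSON.stringify(rows)}` },
    ],
  });
  return summaryRes.choices[0].message.content ?? "";
}
```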

My first idea: use a cheaper model

Easy fix, right? Downgrade the models.

My theory was I could use a dumber model with better prompts to get the same result. Like maybe gpt-5-mini isn't as smart but if I'm super specific about what I want maybe it'll work?

So I switched to gpt-5-mini for SQL and gpt-5-nano for summaries. The cost did go down but the latency... boy was it high. Like "Seth Rogen in Pineapple Express" high.

At first I thought it was Supabase or the MCP or network issues, but I finally figured out it was just the gpt-5 models being unusably slow for anything real-time (even with reasoning_effort: "low").

So I switched again, to models I thought I'd never have to use again:

  • gpt-4.1-mini for SQL
  • gpt-4.1-nano for summaries

And surprisingly the app felt fast for the first time. 5-7 seconds to first token (TTFT) instead of 30+. Much better.
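
The whole switch boiled down to swapping two model IDs, so it's worth keeping them in one place. A tiny sketch (the constant names are mine, not from the app):

```ts
// Cheapest models that stayed fast enough to feel interactive.
// gpt-5-mini / gpt-5-nano were cheaper per token but too slow for real-time use,
// even with reasoning_effort: "low".
const SQL_MODEL = "gpt-4.1-mini";     // writes the SQL query
const SUMMARY_MODEL = "gpt-4.1-nano"; // turns query results into the answer
```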

This was just a band-aid fix though. I was procrastinating on the real problem since it needed a lot of trial and error to get right.

The actual problem: I was feeding the models way too much shit

My SQL generator wasn't writing the "best" query. It was writing the "safest" query. So it was fetching a lot of data that wasn't actually needed. Then my summarizer had to chew through this entire mess to get to the answer.

So the real fix wasn't "find the magical cheap model." The fix was: stop fetching unnecessary data in the first place.

Time to kill SELECT *

I got way more aggressive with the SQL prompt rules. Like actually aggressive:

  • Default to latest date only
  • Return 1-3 rows unless user specifically asks for more
  • For trends, return daily aggregates (avg/min/max) instead of raw rows
  • Never SELECT *
  • Only grab columns that actually matter for the answer

I gave it one goal: return the smallest possible result set that still answers the question.
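
In practice those rules just live in the SQL generator's system prompt. Here's a condensed sketch of what that looks like; the exact wording, table name and columns below are illustrative, not the real Ask Mandi schema:

```ts
// System prompt for the SQL-writing model. The rules exist purely to shrink result sets.
const SQL_SYSTEM_PROMPT = `
You write a single PostgreSQL query against the table mandi_prices
(columns: market, state, commodity, modal_price, arrival_date).

Rules:
- Filter to the latest arrival_date unless the user asks about a date range.
- Return at most 3 rows (use LIMIT) unless the user explicitly asks for more.
- For trends, GROUP BY day and return avg/min/max instead of raw rows.
- Never use SELECT *; select only the columns needed to answer the question.
- Goal: the smallest result set that still answers the question.

Return only the SQL, no explanation.
`;
```

With rules like these, "where are apples cheapest today?" comes back as a single-row ORDER BY modal_price LIMIT 1 query instead of a dump of the whole table.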

Using TOON

One thing I did right from the start was use TOON. I had heard a lot about it on Twitter and finally had a project I could use it in. It's basically a compact, token-oriented format for structured data that's much smaller than JSON, especially for uniform rows like query results.

I converted the query results to TOON instead of sending them as raw JSON, and I was surprised to see a 50-55% token reduction on the summarization step. With literally zero difference in answer quality.

Without TOON my summarization costs would've been even more insane.
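
If you haven't seen TOON, the core trick is that a uniform array gets its field names written once as a header, followed by bare value rows. Below is a hand-rolled sketch of that tabular shape just to show why it's smaller than JSON; in the app you'd use the actual TOON library, not this simplified encoder:

```ts
// Simplified TOON-style encoding of uniform rows: field names once, then value rows.
// Illustration of the idea only, not a spec-complete TOON implementation.
function toToonTable(name: string, rows: Record<string, string | number>[]): string {
  if (rows.length === 0) return `${name}[0]:`;
  const fields = Object.keys(rows[0]);
  const header = `${name}[${rows.length}]{${fields.join(",")}}:`;
  const lines = rows.map((r) => "  " + fields.map((f) => String(r[f])).join(","));
  return [header, ...lines].join("\n");
}

const rows = [
  { market: "Patti APMC", state: "Punjab", modal_price: 45 },
  { market: "Azadpur", state: "Delhi", modal_price: 52 },
];

console.log(toToonTable("prices", rows));
// prices[2]{market,state,modal_price}:
//   Patti APMC,Punjab,45
//   Azadpur,Delhi,52
// JSON would repeat "market", "state" and "modal_price" for every single row.
```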

Caching: the best model is no model

Ask Mandi refreshes data every day at 3:30pm IST. So if you ask the same question before the next refresh the answer should be identical.

Which means I can just cache it.

I cache answers until the next data refresh. After that the cache expires naturally.

So repeat questions are basically free and the app got faster in ways no model/prompt upgrade can compete with.
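
The cache can be dead simple because the expiry is known in advance: every answer is valid until the next 3:30pm IST refresh. A minimal in-memory sketch (a Map here, but Redis or a DB table would work the same; the helper names are mine):

```ts
type CachedAnswer = { answer: string; expiresAt: number };

const cache = new Map<string, CachedAnswer>();

// Milliseconds until the next 3:30pm IST refresh. IST is UTC+5:30, so that's 10:00 UTC.
function msUntilNextRefresh(now = new Date()): number {
  const next = new Date(now);
  next.setUTCHours(10, 0, 0, 0);
  if (next <= now) next.setUTCDate(next.getUTCDate() + 1);
  return next.getTime() - now.getTime();
}

async function cachedAnswer(question: string, compute: () => Promise<string>) {
  const key = question.trim().toLowerCase(); // naive normalisation of the question
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.answer; // zero model calls, zero tokens

  const answer = await compute(); // the full SQL + summarise pipeline
  cache.set(key, { answer, expiresAt: Date.now() + msUntilNextRefresh() });
  return answer;
}
```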

TL;DR

I was spending ~3000 tokens to answer "where are apples cheapest today?" Which is absolutely insane.

The fix:

  • Use cheaper models with better prompts
  • Make the SQL queries only return the absolutely necessary data
  • Use TOON
  • Don't pay twice for the same question

If you're building an LLM app and your bills are scary I promise there's a decent chance you're just feeding the model too much data. Start there.

This post was last updated on Jan 15, 2026