I reduced token usage for my app by 80%
So I built this thing called Ask Mandi, which basically lets you ask questions about mandi (Indian agricultural market) prices in plain English and get answers back. Super simple concept. Users ask stuff like "where can i find some potatoes in mumbai?" and it figures out the SQL, queries the dataset, and gives them an answer.
Everything felt great until I added one tiny feature: a token usage badge that shows how much each query costs.
That's when I realised...I fucked up!
The realisation
I asked "where are apples cheapest today?" and the badge casually showed me I just burned through 3000 tokens.
Three. Thousand. Tokens.
To find cheap apples.
My first thought was that vibe coding finally caught up with me. I figured the coding agent must've created some infinite loop or something stupid.
But on further inspection I realised nope - this was all me. I did this.
What the hell was happening
Here's what I'd built:
- A massive SQL-builder prompt that took user questions and tried to generate queries
- gpt-5.1 powering it, which... is not cheap
- A SQL generator that kept doing the laziest thing possible: `SELECT *`
- Fetching tons of rows just in case
- Then feeding all that data to a summariser to "make sense of it"
So even when the final answer was just "apples are cheapest at Patti APMC in Punjab at ₹45/kg", the model had to read through mountains of data to get there.
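For context, the whole pipeline was roughly this shape (a simplified sketch with made-up helper names, not my actual code):

```ts
// Stand-ins for the real app's internals, declared so the sketch type-checks.
declare function generateSql(question: string): Promise<string>;              // gpt-5.1 call
declare function runQuery(sql: string): Promise<Record<string, unknown>[]>;   // Supabase query
declare function summarize(question: string, data: string): Promise<string>;  // second LLM call

async function askMandi(question: string): Promise<string> {
  const sql = await generateSql(question);           // usually a lazy SELECT *
  const rows = await runQuery(sql);                  // way more rows than needed
  return summarize(question, JSON.stringify(rows));  // model reads ALL of it
}
```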
My first idea: use a cheaper model
Obvious fix right? Downgrade the models.
My theory was I could use a dumber model with better prompts and get the same result. Like, gpt-5-mini isn't as smart, but if I'm super specific about what I want, maybe it'll work?
So I switched to gpt-5-mini for SQL and gpt-5-nano for summaries. The cost did go down but the latency... boy was it high. Like "Seth Rogen in Pineapple Express" high.
At first I thought it was Supabase or the MCP or network issues or Mercury in retrograde. Finally figured out it was just the gpt-5 models being unusably slow for anything real-time.
So I switched models again:
- gpt-4.1-mini for SQL
- gpt-4.1-nano for summaries
And suddenly the app felt alive again. 5-7 seconds instead of 30+. Much better.
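The change itself was boring - just a per-stage model config along these lines (illustrative, not my exact code):

```ts
// One model per pipeline stage. The summariser needs far less brainpower
// than the SQL generator, so it gets the cheapest option.
const MODELS = {
  sql: "gpt-4.1-mini",     // structured output, needs some reasoning
  summary: "gpt-4.1-nano", // just turns a few rows into a sentence
} as const;
```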
But here's the thing - this wasn't actually fixing the real problem. It was just a band-aid.
The actual problem: I was feeding the models way too much shit
My SQL generator wasn't writing the "best" query. It was writing the "safest" query. So it was fetching a lot of data that wasn't actually needed. Then my summariser had to chew through this entire mess to get to the answer.
So the real fix wasn't "find the magical cheap model." The fix was: stop fetching unnecessary data in the first place.
Time to kill SELECT *
I got way more aggressive with the SQL prompt rules. Like actually aggressive:
- Default to latest date only
- Return 1-3 rows unless user specifically asks for more
- For trends, return daily aggregates (avg/min/max) instead of raw rows
- Never `SELECT *`
- Only grab columns that actually matter for the answer
I gave it the goal of returning the smallest possible result set to get to the answer.
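To make it concrete, here's the before/after on the apples question (`mandi_prices` and the column names are stand-ins, not my real schema):

```ts
// Before: the "safest" query the model loved to write. Every column,
// every date, every market.
const before = `SELECT * FROM mandi_prices WHERE commodity = 'Apple';`;

// After: the rules force latest-date-only, explicit columns, and a tiny LIMIT.
const after = `
  SELECT market, state, modal_price
  FROM mandi_prices
  WHERE commodity = 'Apple'
    AND arrival_date = (SELECT MAX(arrival_date) FROM mandi_prices)
  ORDER BY modal_price ASC
  LIMIT 3;
`;
```

Three rows instead of thousands, three columns instead of all of them. The summariser's input shrinks by orders of magnitude before TOON even enters the picture.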
Using TOON
Even after tightening the SQL, a big chunk of tokens was still going to the summariser because I was sending it raw JSON.
JSON is... wordy. Every key repeats. Every quote matters. Every bracket is a tiny tax.
I'd heard a lot about TOON (Token-Oriented Object Notation, a compact format built specifically to cut LLM token counts) and finally tried it. I converted query results to TOON instead of raw JSON before handing them to the summariser.
~50-55% token reduction on the summarisation step. With literally zero difference in answer quality.
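Here's the gist of why it's smaller. This is a toy hand-rolled version of TOON's tabular layout for flat, uniform rows - the real spec handles a lot more (nesting, quoting, delimiters), so use an actual TOON library in production:

```ts
// TOON's tabular idea: declare the keys once in a header,
// then emit one compact comma-separated line per row.
function toToonTable(name: string, rows: Record<string, string | number>[]): string {
  const keys = Object.keys(rows[0]); // assumes at least one row, uniform keys
  const header = `${name}[${rows.length}]{${keys.join(",")}}:`;
  const lines = rows.map((r) => "  " + keys.map((k) => String(r[k])).join(","));
  return [header, ...lines].join("\n");
}

const rows = [
  { market: "Patti APMC", state: "Punjab", price: 45 },
  { market: "Azadpur", state: "Delhi", price: 52 },
];

console.log(toToonTable("prices", rows));
// prices[2]{market,state,price}:
//   Patti APMC,Punjab,45
//   Azadpur,Delhi,52
```

Every key name gets paid for once instead of once per row, and all the quotes and brackets disappear. That's where the ~50% savings on tabular data comes from.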
Caching: the best model is no model
Ask Mandi refreshes its data every day at 3:30pm IST. So if you ask the same question before the next refresh, the answer should be identical.
Which means I can just cache it.
I cache answers until the next data refresh. After that the cache expires naturally.
Repeat questions became basically free. And the app got faster in ways no model upgrade can compete with.
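The whole thing fits in a few lines. A minimal in-memory sketch, assuming the fixed 3:30pm IST (10:00 UTC) refresh - in production you'd want something like Redis plus smarter question normalisation:

```ts
// Cache answers until the next daily data refresh.
const cache = new Map<string, { answer: string; expiresAt: number }>();

function nextRefreshUtc(now = new Date()): number {
  const refresh = new Date(now);
  refresh.setUTCHours(10, 0, 0, 0); // 3:30pm IST == 10:00 UTC
  if (refresh <= now) refresh.setUTCDate(refresh.getUTCDate() + 1);
  return refresh.getTime();
}

async function cachedAnswer(
  question: string,
  compute: () => Promise<string>, // the full SQL + summarise pipeline
): Promise<string> {
  const key = question.trim().toLowerCase(); // crude normalisation
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.answer; // zero tokens spent

  const answer = await compute();
  cache.set(key, { answer, expiresAt: nextRefreshUtc() });
  return answer;
}
```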
TL;DR
I was spending ~3000 tokens to answer "where are apples cheapest today?" Which is absolutely insane.
The fix:
- Use cheaper models with better prompts
- Make the SQL queries actually reasonable
- Use TOON instead of JSON when feeding data to a model
- Don't pay twice for the same question
If you're building an LLM app and your bills are scary, I promise there's a decent chance you're just feeding the model too much data. Start there.
