When thinking is just a slower way to make the same decision
A few weeks after I built my job-search tool, I found myself staring at a very tempting switch.
The switch was called think.
That sounds like the sort of option you should turn on, right? I had a local AI model reading job postings against my career brief and deciding whether each one looked like a match, a maybe, or a no. If the model can think harder before answering, surely that should make the judge better.
Except I do not use this tool as a demo. I use it because job hunting is tiring, and because reading the same almost-relevant job ad under five different titles drains the energy I need for the few applications that actually matter.
So the question was not "is thinking better?" The better question I think is: does thinking change enough decisions to justify making the whole thing slower?
The job-search judge
The tool itself is simple in shape. Scrapers collect job postings. Cheap filters remove the obvious no's. The survivors go to a local model, running on my laptop, which reads the posting against a brief I wrote about what I am looking for.
The brief is the important part. It says what roles I want, what I cannot take, what is a stretch but plausible, and what is an instant no. The model does not decide on my behalf what I should pursue or not. It triages a pile of job ads so I can spend my time on the ones worth reading properly.
I had already made one design decision there: most of the pipeline should be boring. Deduplication does not need AI. Excluding jobs I have already rejected does not need AI. Checking whether a title contains obvious bad signals does not need AI.
The model only gets called when the task needs judgment.
That made the think switch harder to dismiss. The judge exists because some postings are genuinely ambiguous. Titles lie. "Technical Writer" can mean developer docs, aircraft manuals, or marketing collateral. "AI" in a title can mean adoption work, sales leadership, or a full engineering role with Kubernetes and production systems. A cheap keyword filter will not understand those differences.
Maybe a reasoning mode would.
What I tested
I did not want a grand benchmark. I wanted an answer for my tool.
So I took the same job-fit judge, the same career brief, and a small set of edge cases. Five were real posting-shaped examples from my own workflow, chosen because they were the kind of listings where a model might plausibly get confused:
- a clean, strong match;
- a title that looked relevant but was wrong once the duties and location were read properly;
- a content role with a hard geography blocker;
- an AI-flavored role that was really an engineering job;
- another AI-flavored role that was really senior sales leadership.
Then I added one synthetic trap: a fake posting that tried to override the model's instructions and force a perfect score.
I ran the same judge twice on the same inputs. Once with thinking off, once with thinking on. I recorded the verdict, the score, whether the judge detected injection, and how long each call took.
That was enough. Not enough to make a claim about all models or all hiring workflows, but enough to decide whether this switch deserved to be on by default in mine.
The result
Thinking changed zero of the five real keep/drop decisions.
It did change some scores. One good match became an even stronger match. One bad fit became a more emphatic no. But the action did not change. The jobs I would keep were still kept. The jobs I would reject were still rejected.
The synthetic injection trap was the only verdict flip. With thinking off, the judge still spotted the injection attempt, but returned a cautious maybe. With thinking on, it rejected the posting outright. That is useful. It also matters less than it sounds, because the injection flag itself is already a separate warning signal. If a posting is flagged as trying to override the instructions, I should treat it differently regardless of whether the verdict says maybe or no.
The cost was not subtle.
| Mode | Average time per posting | Real posting verdict flips | Injection detected |
|---|---|---|---|
| Thinking off | 4.5 seconds | baseline | yes |
| Thinking on | 30.7 seconds | 0 out of 5 | yes |
Thinking was about 6.8 times slower in this tiny run.
For a daily tool, that matters. A judging pass that takes a couple of minutes can become the slow part of the workflow very quickly. And if the extra time only changes wording or confidence, not the decision I actually take, it is hard to justify as the default.
The uncomfortable bit
I wanted the answer to be cleaner.
"Thinking is better" would have been easy to explain. So would "thinking is a waste of time." The actual answer was more annoying: thinking helped in one artificial safety case, made some reasoning sound better, and did not change the real decisions I cared about in this run.
That is still an answer.
It means my default should stay fast. It also means the switch should exist. If a job looks unusually important, unusually ambiguous, or suspicious in a way the first pass cannot resolve, I can spend the extra time. But I do not need to pay that cost for every posting.
I keep running into this with AI tools. Models keep getting knobs that sound like wisdom: more context, more reasoning, more retries, bigger models. It is easy to turn them all on and call the result "safer" or "smarter." Sometimes that is true. Sometimes you have built a slower machine that makes the same decision.
What I learned
The useful part of this benchmark was not the table. It was forcing myself to define what would count as better before looking at the result.
For this tool, a better judge is not one that writes a more persuasive explanation. A better judge changes the decision in the cases where the old judge was wrong, catches unsafe input, and does that at a cost I can live with.
On this sample, thinking did not clear that bar as the default. So the judge runs without it.
That sounds like a small technical decision, but it is the kind of decision I care about in AI-assisted workflows. I do not want to use AI because the option is there. I want to know where it actually buys me something.
In this case, thinking is not something I need to default to. It's a tool I keep in the drawer until the job is worth the time.