Confident Hallucinations in Small Language Models: Why They Happen...

An exploration of the mechanics behind confident hallucinations in smaller language models and the reasons larger models tend to be more reliable.

Small language models can sound astonishingly sure of themselves while being completely wrong. They will cite papers that don’t exist, assert historical “facts” that never happened, or invent API parameters with the poise of a seasoned engineer. This combination—fabricated content delivered with unwavering confidence—feels like a personality flaw. In reality it is an emergent property of how these systems are trained, what they are optimized to do at inference time, and what they lack: internal redundancy, broad coverage, and robust mechanisms for expressing uncertainty.

To understand confident hallucinations, it helps to strip away the illusion that a model “knows” things in the human sense. A language model is, at its core, a next-token predictor trained to continue text in a way that matches patterns in its training distribution. When it answers a question, it is not retrieving a fact from a database unless it has explicit retrieval tools; it is producing the most probable continuation given the prompt and its learned statistical associations. If the question is underspecified, uncommon, or outside the model’s capacity, the model still has to output something. And the decoding process we typically use to generate that output rewards the appearance of coherence far more than the virtue of silence.

The first ingredient: training teaches fluency more directly than truth

Pretraining is usually done with a simple objective: predict the next token. This objective is extremely effective at teaching grammar, style, and the “shape” of good answers. It is much less direct at teaching truthfulness. Truth is present in the data, of course, but the model is rewarded for producing text that looks like the text it has seen. If the training set contains many passages where an authoritative tone accompanies factual statements, the model learns that authoritative tone is a strong signal of “good continuation.”

That dynamic matters because the model is not trained with an explicit penalty for producing a plausible but incorrect completion in cases where the training data doesn’t provide a clear anchor. When the input prompt resembles contexts where an explanation, a citation, or a specific figure normally follows, the model will often produce one—even if it has no grounded basis for it in its internal representations. In other words, the training objective makes “sounding like an answer” easier to learn than “only answering when sure.”

Smaller models feel this more acutely because they compress the training distribution into fewer parameters. Compression forces them to store broader patterns—tone, rhetorical structure, common co-occurrences—at the expense of fine-grained distinctions and long-tail facts. When they are uncertain, they fall back on the patterns they do represent well: confident explanatory prose, canonical-sounding references, and familiar narrative arcs. The output is often internally consistent, which to human ears reads as confidence, but internal consistency is not the same thing as external accuracy.
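The objective described in this section can be made concrete with a minimal Python sketch, assuming the standard cross-entropy form of next-token prediction. The context string, tokens, and probabilities below are invented purely for illustration. The point to notice is that the loss depends only on the probability given to whatever token appeared in the training text; nothing in it checks whether that text was true.

```python
import math

def next_token_loss(predicted_probs, observed_token):
    """Cross-entropy for one position: -log p(observed token | context)."""
    return -math.log(predicted_probs[observed_token])

# Invented model output for the context "The institute was founded in".
# These numbers are placeholders, not measurements from any real model.
predicted_probs = {"1887": 0.40, "1902": 0.35, "an": 0.20, "unknown": 0.05}

# Suppose the training text continued with "1902". The loss rewards
# probability mass on that token and has no term for factual accuracy.
loss_if_text_says_1902 = next_token_loss(predicted_probs, "1902")

# A hedge such as "unknown" is cheap to predict only if hedges actually
# appear in the data at that position; here it is rare, so it is costly.
loss_if_text_says_unknown = next_token_loss(predicted_probs, "unknown")

print(round(loss_if_text_says_1902, 3))
print(round(loss_if_text_says_unknown, 3))
```

In this toy setup a specific year is a far cheaper prediction than a hedge, which is the sense in which “sounding like an answer” is the easier thing to learn.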

The second ingredient: inference turns uncertainty into a single decisive story

Even if a model has some uncertainty internally, generation typically requires choosing a specific token at each step. Common decoding methods (greedy decoding, beam search, or low-temperature sampling) tend to collapse that uncertainty into a single high-probability continuation. That continuation is not “what the model believes” so much as “what the model thinks is the most likely text to come next given this prompt.”

If you ask, “What year was X established?” and the model doesn’t have a stable representation of X, it may still have strong priors that years should look like four-digit numbers, that institutions often start in certain eras, and that an authoritative answer tends to state a year without hedging. Under low temperature, the model will select a crisp year rather than dithering among many possibilities. This is a key point: confidence in the style of the output can be an artifact of decoding, not a reflection of real epistemic certainty.
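The effect of temperature on a nearly flat distribution can be sketched in a few lines of Python. This is an illustration only: the candidate years and logits are made up, and a real model distributes probability over a vocabulary of many thousands of tokens rather than five.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; higher means more spread-out uncertainty."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical candidates and logits: nearly flat, i.e. the model is unsure.
candidate_years = ["1887", "1891", "1902", "1912", "1923"]
logits = [2.0, 1.9, 1.8, 1.7, 1.6]

results = {}
for t in (1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    results[t] = probs
    print(f"T={t}: top p={max(probs):.2f}, entropy={entropy_bits(probs):.2f} bits")

# Greedy decoding ignores the spread entirely and emits one crisp year.
greedy_choice = candidate_years[max(range(len(logits)), key=lambda i: logits[i])]
print("greedy:", greedy_choice)
```

At temperature 1.0 the five years are close to equally likely; at 0.1 the same logits put most of the mass on a single year. The underlying scores did not change, only the decoding did.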

Small models are especially vulnerable here because their probability distributions are often less well calibrated. Calibration, loosely, is the property that a model’s stated probabilities match its observed accuracy: among the answers it assigns 80% probability, about 80% should turn out to be correct. Smaller models can be simultaneously uncertain in a semantic sense while still producing sharp token probabilities due to simplified internal representations. They “snap” to a plausible completion because their space of alternatives is less richly structured. Bigger models, by contrast, can maintain more nuanced representations that keep multiple hypotheses alive longer, which makes it easier, though by no means guaranteed, for the output distribution to reflect genuine ambiguity.
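One common way to quantify calibration is expected calibration error (ECE), which bins predictions by confidence and averages the gap between confidence and accuracy in each bin. The sketch below is a minimal version with equal-width bins; the two toy datasets are invented for illustration and do not come from any real model.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# Toy data: both "models" claim 90% confidence on ten answers.
confidences = [0.9] * 10
overconfident_correct = [True] * 5 + [False] * 5   # right only half the time
calibrated_correct = [True] * 9 + [False] * 1      # right nine times in ten

ece_overconfident = expected_calibration_error(confidences, overconfident_correct)
ece_calibrated = expected_calibration_error(confidences, calibrated_correct)
print(round(ece_overconfident, 3), round(ece_calibrated, 3))
```

The two toy models sound identical to a reader, since both report 90% confidence. Only the comparison against outcomes separates them.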

The third ingredient: imitation of authority is an easy shortcut

Human writing has strong conventions for authority. We tend to write definitive statements when we are presenting knowledge, and we hedge only in certain genres (academic writing, careful journalism, legal language). A model trained on a broad web-scale corpus sees many contexts where questions are followed by decisive answers. It also sees that confident answers are rewarded socially in text: they conclude threads, they get quoted, they read “complete.”

When a model lacks the knowledge to answer, it can still satisfy the format of an answer: claim, justification, maybe a citation. This is not deception in the human sense; it is pattern completion. But it leads to a specific pathology: the model is more likely to invent a concrete detail than to admit uncertainty, because concrete details are often what the training distribution suggests should come next.

Small models take this shortcut more often because they have fewer resources to represent “I don’t know” as a robust, context-appropriate completion across many domains. They might have seen disclaimers in the data, but unless instruction tuning strongly reinforces abstention, the default completion for many question-like prompts is an answer-like passage. Without enough capacity, the model can’t reliably learn when to switch modes from “explain” to “decline.”

The fourth ingredient: instruction tuning and RLHF can amplify apparent confidence

Many deployed models are not just pretrained; they are instruction-tuned and often optimized with reinforcement learning from human feedback (RLHF) or related methods. These stages are designed to make outputs more helpful, more polite, and more aligned with user intent. But “helpful” is a dangerous word when the system lacks grounding. If human raters tend to prefer answers that are fluent and direct over answers that are cautious or uncertain, the optimization process can push the model toward confident-sounding responses.

This does not mean RLHF is bad; it often reduces harmful content and improves usability. The trade-off is that it can inadvertently teach the model that producing something is better than refusing. Smaller models, again, are at a disadvantage: they have less latent knowledge and weaker reasoning capabilities, so the pressure to be helpful can translate into more fabrication. If you can’t truly answer but you’re optimized to avoid blankness, you “fill in” the missing pieces.

Larger models can also be affected by this incentive, but they can more often meet the demand for helpfulness with real content: they have seen more patterns, retained more facts, and can reason through more steps. So the same “be helpful” pressure results in fewer outright inventions.

The fifth ingredient: lack of robust self-evaluation in smaller models

A subtle but crucial difference between small and large models is the ability to perform internal consistency checks that correlate with reality. Bigger models can do more than just produce text; they can often simulate verification behaviors: paraphrasing the question, checking whether a claim contradicts earlier parts of the response, or recognizing that a request requires a specific source. None of this is guaranteed, but the capacity enables it.

Small models often cannot afford those extra cognitive moves. They may struggle with multi-step reasoning, with keeping track of constraints, or with recognizing when a question requires specialized knowledge. As a result, once they start down a plausible-sounding path, they have fewer opportunities to notice that the path is ungrounded. The generated response gains momentum: each token conditions the next, and a fabricated premise becomes a scaffold for further fabrication. By the end, the model has built a coherent mini-world and narrates it confidently because, locally, each step followed naturally from the previous one.

This “momentum” effect is a hallmark of hallucination. It is not just that one fact is wrong; it’s that an early mistake becomes a seed that the model elaborates into an entire explanation. Larger models are better at resisting this because they can represent more constraints simultaneously and are more likely to detect that a premise is shaky, prompting them to hedge or redirect.

Why scaling reduces confident hallucinations (but doesn’t eliminate them)

The most straightforward reason larger models hallucinate less is that they simply know more. With more parameters and often more or better-curated training data, they memorize and generalize more factual associations. Many hallucinations in small models are not deep epistemic failures; they are coverage failures. The model was never reliably exposed to the fact, or it could not store it distinctly. Scaling improves coverage and reduces the number of prompts that push the model out of distribution.

But there is more going on than memory. Scaling changes the geometry of representations. Larger models learn richer latent spaces where subtle differences in prompts map to different internal states. That makes it easier to distinguish “this looks like a question about a real paper” from “this is asking for a citation I don’t have.” It also makes it easier to maintain uncertainty: if the model’s internal state can represent multiple plausible continuations, the output distribution can reflect that ambiguity more faithfully.

There is also an often-overlooked advantage: larger models tend to be trained and fine-tuned with more sophisticated pipelines. They receive more instruction data, more safety tuning, more rigorous evaluation, sometimes better decoding defaults, and—crucially—more engineering around refusal and uncertainty. Some of the reduction in confident hallucination is therefore not purely emergent from scale; it is a product of the ecosystem that grows around large flagship models.

Even so, big models still hallucinate. When asked for obscure citations, private information, rapidly changing facts, or information beyond their cutoff, they can revert to the same pattern-completion behavior. What changes with scale is the frequency and the brittleness: larger models are less often forced into pure guesswork, and when they are, they are somewhat more likely to signal uncertainty, ask clarifying questions, or provide a safer generic answer.

The hidden role of “answerability” and the cost of abstention

One way to reframe the problem is to ask: how does a model decide whether a question is answerable? Humans do this constantly. We notice when we’re outside our knowledge, and social norms allow us to say “I’m not sure.” For language models, abstention is not a natural outcome of next-token prediction. It must be learned as a pattern: the model must see many examples where declining is the appropriate continuation, and the fine-tuning must reward that behavior.

Small models face a paradox here. To abstain correctly, the model must recognize the boundaries of its competence—an ability that itself requires competence. If it cannot reliably detect that a topic is rare or specialized, it will not trigger refusal. Bigger models, with better representations, can learn a more accurate notion of “this prompt is asking for something I might not have.” That makes abstention more precise and less annoying, so developers are more willing to enable it. With small models, abstention can be overly broad—refusing too much—which leads teams to tune it down, inadvertently increasing confident hallucinations.
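The trade-off can be illustrated with a small Python sketch of threshold-based abstention: answer only when a confidence score clears a threshold, then measure coverage (the share of questions answered) and accuracy on what was answered. The records are invented, and the assumption that a usable per-answer confidence score exists at all is exactly what is in question for small models.

```python
def selective_metrics(records, threshold):
    """records: list of (confidence, is_correct). Answer iff confidence >= threshold."""
    answered = [ok for conf, ok in records if conf >= threshold]
    coverage = len(answered) / len(records)
    accuracy = sum(answered) / len(answered) if answered else None
    return coverage, accuracy

# Toy model A: confidence tracks correctness (it "knows what it knows").
informative = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, True),
    (0.75, True), (0.40, False), (0.35, False), (0.30, False),
]

# Toy model B: same confidences and same overall accuracy (5 of 8),
# but confidence is unrelated to which answers are right.
uninformative = [
    (0.95, False), (0.90, True), (0.85, False), (0.80, True),
    (0.75, True), (0.40, True), (0.35, False), (0.30, True),
]

for name, records in (("informative", informative), ("uninformative", uninformative)):
    base = selective_metrics(records, 0.0)
    gated = selective_metrics(records, 0.7)
    print(name, "no abstention:", base, "threshold 0.7:", gated)
```

For model A, abstaining below 0.7 removes every wrong answer at the cost of some coverage. For model B, the same threshold gives up the same coverage and buys no accuracy, which is the situation that tempts teams to turn abstention down.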

Why confidence feels “absolute” in small models

The absolutism often comes from a mismatch between two things: the model’s limited epistemic grounding and the language it has learned to produce. Declarative sentences assert by default: unless a hedge is written in explicitly, even a mild statement reads as decisive. Add to that the fact that many systems do not expose uncertainty estimates to users, and you get a binary impression: either it answered or it didn’t. When it answers, the prose is smooth, because fluency is exactly what the model is trained to optimize, so readers infer confidence.

Small models intensify this mismatch because their errors are less subtle. They may fabricate names, dates, or mechanisms that fit the shape of the requested information. These errors often come packaged in textbook-like exposition, because the model has learned that textbook-like exposition is a common continuation of explanatory prompts. The result is “absolute confidence” as a rhetorical artifact: not the internal emotion of a machine, but the outward style of a completion that resembles authoritative writing.

Mitigations: what actually helps (and why it’s harder for small models)

Reducing confident hallucinations is possible, but the levers matter. Retrieval-augmented generation (RAG) can ground answers in sources, but it requires robust retrieval and careful prompting so the model doesn’t invent citations when retrieval fails. Tool use (search, calculators, databases) reduces guessing, but small models may struggle to decide when to call tools or how to interpret results. Better instruction tuning with explicit “don’t guess” examples helps, yet too much refusal makes the model feel unhelpful, and small models can overshoot.
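The retrieval-failure case can be handled outside the model. The Python sketch below assumes a toy keyword-overlap retriever and a caller-supplied `generate` function standing in for the model call; every name in it (`retrieve`, `answer`, `ABSTAIN`, the sample corpus) is hypothetical, and a production system would use a real retriever and a tuned threshold.

```python
import re

STOPWORDS = {"the", "a", "an", "was", "is", "in", "of", "by", "when", "who", "what"}
ABSTAIN = "I could not find a source for this, so I will not guess."

def content_words(text):
    """Lowercase word set with a few common function words removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def retrieve(query, corpus, min_overlap=2):
    """Toy retriever: keep passages sharing at least min_overlap content words."""
    q = content_words(query)
    scored = []
    for doc_id, text in corpus.items():
        overlap = len(q & content_words(text))
        if overlap >= min_overlap:
            scored.append((overlap, doc_id, text))
    scored.sort(reverse=True)
    return [(doc_id, text) for _, doc_id, text in scored]

def answer(query, corpus, generate):
    passages = retrieve(query, corpus)
    if not passages:
        return ABSTAIN  # retrieval failed: the model is never asked to improvise
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    prompt = (
        "Answer using only the passages below and cite their ids. "
        "If they do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
    return generate(prompt)

# Made-up corpus and a stand-in "model" that echoes its prompt for inspection.
corpus = {
    "doc1": "The Acme Observatory was founded in 1901 by a local society.",
    "doc2": "Bread dough rises faster in a warm kitchen.",
}

def echo_model(prompt):
    return prompt

print(answer("When was the Acme Observatory founded?", corpus, echo_model))
print(answer("Who directed the Zephyr Institute in 1950?", corpus, echo_model))
```

Putting the guard in ordinary code rather than in the prompt means it does not depend on the model’s own judgment about when to decline, which is the judgment small models are weakest at.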

There is also a deceptively simple mitigation: ask for uncertainty explicitly and structure the response so that “I don’t know” is a valid completion. Yet this too works better with larger models, which can more reliably produce calibrated hedges and discriminate between what they know and what they’re inferring.
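One way to structure a response so that “I don’t know” is a first-class outcome is to request a small JSON object with an explicit status field and to treat anything malformed as an abstention. The sketch below is one possible shape, not an established format; the field names, the three status labels, and the prompt wording are all assumptions made for illustration.

```python
import json

ALLOWED_STATUS = {"known", "inferred", "unknown"}

def build_prompt(question):
    """Prompt that makes 'unknown' an explicitly valid, low-friction reply."""
    return (
        "Answer the question below. Reply with a JSON object with two keys:\n"
        '  "status": one of "known", "inferred", "unknown"\n'
        '  "answer": a short answer, or null when status is "unknown"\n'
        'Use "unknown" whenever you are not sure. Do not guess.\n\n'
        f"Question: {question}"
    )

def parse_reply(raw_reply):
    """Return (status, answer); anything malformed is treated as unknown."""
    try:
        data = json.loads(raw_reply)
    except (json.JSONDecodeError, TypeError):
        return "unknown", None
    if not isinstance(data, dict):
        return "unknown", None
    status = data.get("status")
    if not isinstance(status, str) or status not in ALLOWED_STATUS:
        return "unknown", None
    if status == "unknown":
        return "unknown", None
    answer = data.get("answer")
    if not isinstance(answer, str) or not answer.strip():
        return "unknown", None
    return status, answer.strip()

# Example replies a model might produce (hand-written here, not real output).
print(parse_reply('{"status": "known", "answer": "Paris"}'))
print(parse_reply('{"status": "unknown", "answer": null}'))
print(parse_reply("Sure! The answer is definitely 1887."))
```

Failing closed on unparseable replies is a deliberate choice: a small model that ignores the format is also likely to be one that ignores the instruction not to guess.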

Ultimately, confident hallucinations in small models are not a single bug; they are the intersection of an objective function that rewards plausible continuations, decoding strategies that force decisiveness, a training distribution rich in authoritative prose, and limited capacity to represent boundaries and perform self-checks. Bigger models reduce the problem because they have more knowledge, richer representations, and often better tuning and infrastructure around them. But the underlying lesson remains: language models are optimized to produce text that looks right. Getting them to produce text that is right, or to say plainly when they can’t, is an extra layer of engineering, and scale simply makes that engineering more likely to succeed.