An engineering-focused look at how data, objectives, architectures, evaluation, and product constraints now limit AI progress more than raw compute—and what to do about it.
AI has a well-earned reputation for riding hardware curves. For a decade, bigger clusters, faster interconnects, and clever kernels translated into measurable leaps. But if you’ve built or deployed modern AI systems—especially LLM-based products—you’ve likely felt a shift: you can throw more GPUs at the problem and still hit the same failure modes. Models still hallucinate, forget instructions, struggle with long-horizon planning, and behave unpredictably under distribution shift. Meanwhile, many real-world improvements come from system design: better data curation, better objectives, retrieval pipelines, tool use, evaluation harnesses, and guardrails.
This isn’t an argument that hardware no longer matters. It does. But in many domains, marginal gains from more compute are increasingly dominated by design bottlenecks—choices about what we train on, what we optimize, how we represent knowledge, and how we verify behavior.
This post unpacks where the bottlenecks actually are, why they persist, and what engineering teams can do to make progress when “just scale it” stops working.
The scaling era plateau: compute still works, but the ROI curve changed
Scaling laws taught us a simple story: increase compute, and loss improves predictably. That story remains broadly true for training loss, and even for many benchmark metrics. The issue is that real-world usefulness depends on more than cross-entropy on next-token prediction.
When organizations say they’re “compute-limited,” they often mean: “we can’t afford the next training run.” But when products fail, they’re often failing on design-limited dimensions:
- Faithfulness (groundedness to sources, non-hallucination)
- Robustness (prompt sensitivity, jailbreak resistance)
- Planning (multi-step tool use, long-horizon tasks)
- Consistency (stable policy, stable formatting, stable reasoning behavior)
- Domain correctness (specialized knowledge, procedural accuracy)
- Observability (knowing why the system did what it did)
More compute can improve some of these, but often inefficiently. You can reduce hallucinations by scaling, yet a well-designed retrieval step combined with a citation requirement may reduce them more, at lower cost, and with better debuggability.
So what’s actually bottlenecking progress? Let’s look at the biggest design constraints.
Bottleneck #1: Objective mismatch—next-token prediction isn’t the product
Most frontier models are trained primarily to predict the next token. That objective is extremely powerful as a general-purpose pretraining signal, but it’s a blunt instrument for downstream behaviors.
Why it matters
If your product requirement is “Answer questions accurately using my company’s documentation,” the objective you want is closer to:
- “Use retrieved context when available”
- “Cite sources”
- “Refuse when uncertain”
- “Prefer precision over eloquence”
- “Ask clarifying questions when ambiguous”
Next-token prediction doesn’t explicitly reward any of that. Reinforcement learning from human feedback (RLHF) and newer preference optimization methods push in that direction, but they are still imperfect proxies.
Common failure mode: the model optimizes plausibility
LLMs are very good at producing answers that look right. That’s not a moral failing; it’s a natural consequence of their training objective.
A design-limited system treats “truth” as a first-class constraint. That requires objectives and scaffolding that can represent truthfulness, uncertainty, and provenance.
What to do instead (practical)
- Train or fine-tune for grounded behavior: supervised fine-tuning on tasks that require quoting/citing, answering “not in context,” or extracting exact spans.
- Use constrained decoding or structured outputs: force JSON schemas, citations arrays, or tool-call-only modes where appropriate.
- Add a verifier: separate “proposer” and “checker” models, or rule-based verification for critical fields.
A simple example pattern is “generate + verify”:
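```python
# llm and verifier_llm are assumed clients exposing generate(prompt) -> str
answer = llm.generate(prompt_with_context)
verdict = verifier_llm.generate(
    f"Check if the answer is supported by context.\n\nCONTEXT:\n{ctx}\n\nANSWER:\n{answer}\n\nReturn: SUPPORTED/UNSUPPORTED and why."
)
# Test only the leading label; the explanation may mention either word
if verdict.strip().upper().startswith("UNSUPPORTED"):
    answer = llm.generate(prompt_with_stricter_instructions)
```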
This isn’t glamorous, but it directly targets the objective mismatch. Compute helps, but design determines whether compute translates into reliable behavior.
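For critical fields, the rule-based side of verification can be very small. The following is a minimal sketch, assuming the model is asked to return a JSON object with `answer` and `citations` fields; those field names and the `parse_grounded_answer` helper are illustrative assumptions, not part of any library.

```python
import json

# Hypothetical output contract: field name -> required Python type
REQUIRED_FIELDS = {"answer": str, "citations": list}


def parse_grounded_answer(raw: str, allowed_sources: set[str]) -> dict:
    """Parse a model response; reject it unless it is well-formed and cited.

    Raises ValueError with a reason the caller can log or use in a retry prompt.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            raise ValueError(f"field {name!r} missing or not {expected_type.__name__}")
    if not data["citations"]:
        raise ValueError("at least one citation is required")
    # Every citation must point at a source that was actually retrieved
    unknown = [
        c for c in data["citations"]
        if not isinstance(c, str) or c not in allowed_sources
    ]
    if unknown:
        raise ValueError(f"citations not in retrieved context: {unknown}")
    return data
```

A check like this does not prove the answer is true, but it cheaply rules out uncited answers and citations to documents the model never saw.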
Bottleneck #2: Data is the real scarce resource (not tokens, but signal)
Hardware scales throughput. It doesn’t automatically produce high-quality supervision, clean corpora, or domain-specific edge cases. As models get larger, they can also become more sensitive to subtle data issues:
- contamination (train/test leakage)
- duplicated or low-entropy text
- mislabeled preferences
- inconsistent formatting and style
- missing long-tail scenarios
Why bigger models can worsen data problems
Large models can memorize spurious patterns and amplify them. If your preference data rewards verbosity, the model learns verbosity. If your instruction data has inconsistent refusal patterns, you get inconsistent safety behavior.
For domain systems, the situation is sharper: your “data” might be proprietary policies, ticket histories, runbooks, codebases, and PDFs. If that corpus is messy (and it usually is), the model can’t “compute” its way to correctness.
Design bottleneck: data pipelines and governance
The differentiator becomes the ability to:
- curate
- deduplicate
- structure
- label
- version
- evaluate
This is a software and process problem, not a GPU problem.
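As one small illustration of the deduplication step, here is a sketch that drops records whose text is identical after normalization. It is deliberately simple: it catches exact duplicates and trivial whitespace or casing variants only, and near-duplicate detection (for example MinHash) would need a different technique. The function names are assumptions for this example.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe(records: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized text, preserving order."""
    seen: set[str] = set()
    kept: list[str] = []
    for text in records:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept
```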
Practical moves
- Data contracts: define schemas for “instruction,” “context,” “response,” “citations,” “tools used,” etc.
- Golden sets: maintain small, high-trust evaluation sets that match production tasks.
- Feedback loops: instrument your product so that failures become labeled data.
A useful mental model is: your system is only as smart as your worst high-frequency failure. Fixing that failure is usually a data and evaluation loop issue.
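A data contract can start as a typed record plus a validation method. The sketch below assumes a supervised example with the fields listed above; the `TrainingRecord` name and the choice to store citations as indexes into `context` are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TrainingRecord:
    """Hypothetical contract for one supervised example."""
    instruction: str
    context: list[str]
    response: str
    citations: list[int] = field(default_factory=list)  # indexes into context
    tools_used: list[str] = field(default_factory=list)

    def violations(self) -> list[str]:
        """Return human-readable contract violations; empty means valid."""
        problems = []
        if not self.instruction.strip():
            problems.append("empty instruction")
        if not self.response.strip():
            problems.append("empty response")
        for i in self.citations:
            if not 0 <= i < len(self.context):
                problems.append(f"citation {i} out of range")
        return problems
```

Running `violations()` at ingestion time means malformed examples are rejected before they reach a fine-tuning or evaluation set.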
Bottleneck #3: Evaluation is underpowered—what you can’t measure won’t improve
A lot of AI “progress” is benchmark progress. In production, success is “did it solve the user’s problem reliably, safely, and quickly?” The gap between those is often enormous.
Why evaluation becomes the limiting factor
When models are strong, differences show up in:
- rare edge cases
- adversarial prompts
- tool failures
- multi-turn state drift
- subtle policy compliance
- latency/cost constraints
If your evaluation suite can’t detect regressions in those areas, you can’t confidently ship changes. Teams get stuck: they know the system is flaky, but they can’t prove improvements, so they can’t iterate quickly.
What “good eval” looks like for LLM systems
You need multiple layers:
- Unit tests for prompt templates, JSON validity, tool-call formatting
- Scenario tests for end-to-end user flows (multi-turn)
- Regression sets for known failures (hallucination cases, jailbreaks)
- Offline metrics (faithfulness, citation correctness, policy compliance)
- Online metrics (task completion, user satisfaction, escalation rate)
In practice, this means building an evaluation harness like you would for any critical software system.
Here’s a minimal structure for an LLM eval case:
```yaml
id: "expense_policy_refund_window"
input:
  user: "Can I get reimbursed for a hotel I booked last month?"
  context_docs:
    - "Company Travel Policy v3.2 ... Reimbursements must be submitted within 30 days ..."
expected:
  must_include:
    - "30 days"
  must_cite:
    - "Company Travel Policy v3.2"
  must_not_include:
    - "guarantee"
scoring:
  faithfulness: llm_judge
  format_valid: json_schema
  refusal_behavior: rule_based
```
The system bottleneck isn’t that GPUs are too slow—it’s that most teams are trying to improve a complex stochastic system without a proper test suite.
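The rule-based part of a case like the one above can be checked with a few lines of code. This is a minimal sketch, assuming the system under test returns an answer string plus a list of cited source names; `Expected` and `check_case` are hypothetical names, and the LLM-judge and schema scorers are out of scope here.

```python
from dataclasses import dataclass


@dataclass
class Expected:
    must_include: list[str]
    must_cite: list[str]
    must_not_include: list[str]


def check_case(answer: str, citations: list[str], expected: Expected) -> list[str]:
    """Return failure messages; an empty list means the case passed."""
    failures = []
    lowered = answer.lower()
    for phrase in expected.must_include:
        if phrase.lower() not in lowered:
            failures.append(f"missing required phrase: {phrase!r}")
    for phrase in expected.must_not_include:
        if phrase.lower() in lowered:
            failures.append(f"contains forbidden phrase: {phrase!r}")
    for source in expected.must_cite:
        if source not in citations:
            failures.append(f"missing citation: {source!r}")
    return failures
```

Returning a list of failures rather than a boolean makes regressions easier to triage when a run in CI goes red.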
Bottleneck #4: Context and memory design—LLMs aren’t databases
A common misunderstanding is to treat bigger context windows as a substitute for knowledge management. Even with 200k+ token contexts, you still face:
- retrieval quality issues
- conflicting sources
- prompt injection risks
- attention dilution (“lost in the middle” effects)
- token budget trade-offs with cost and latency
The real constraint: information architecture
In enterprise settings, correctness often depends on which documents you show, how you chunk them, and how you rank them. The hardest part isn’t the model—it’s the system that decides what the model sees.
RAG is not a feature; it’s a product subsystem
A robust retrieval-augmented generation (RAG) design includes:
- ingestion (parsing PDFs/HTML, preserving tables, metadata)
- chunking strategy (semantic, structural, overlap)
- embeddings and indexing (vector DB + metadata filters)
- retrieval (hybrid search, reranking)
- context assembly (dedupe, diversify, order)
- grounding and citation enforcement
- defenses (prompt injection filtering, source allowlists)
Compute helps you run rerankers and bigger models, but the bottleneck is designing the pipeline so that the right context is retrieved and safe to use.
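To make one of those pipeline stages concrete, here is a sketch of fixed-size chunking with overlap. It splits on whitespace for simplicity, which is an assumption of this example; a real pipeline would more likely chunk by tokens or document structure, and the default sizes are placeholders.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows of `size`, each sharing `overlap` words
    with the previous window so sentences cut at a boundary stay retrievable."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```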
Example: hybrid retrieval with reranking
A strong baseline pattern:
- BM25 (keyword) retrieve top 200
- Vector retrieve top 200
- Union + metadata filter + dedupe
- Cross-encoder rerank of the top 50 candidates
- Select top 8–12 chunks for final prompt
That architecture often beats “bigger model, bigger context” while being more controllable and cheaper.
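The merge-and-rerank part of that baseline can be sketched as follows. Assumptions are stated up front: both retrievers return ranked lists of chunk IDs, the `score` callable stands in for a cross-encoder, the metadata filter is omitted, and interleaving the two lists before cutting the rerank pool is one reasonable choice among several.

```python
from typing import Callable


def hybrid_select(
    keyword_hits: list[str],
    vector_hits: list[str],
    score: Callable[[str], float],
    rerank_pool: int = 50,
    final_k: int = 10,
) -> list[str]:
    """Union two ranked ID lists, dedupe, rerank a bounded pool, keep the best."""
    merged: list[str] = []
    seen: set[str] = set()
    # Interleave by rank so neither retriever dominates the rerank pool
    for i in range(max(len(keyword_hits), len(vector_hits))):
        for hits in (keyword_hits, vector_hits):
            if i < len(hits) and hits[i] not in seen:
                seen.add(hits[i])
                merged.append(hits[i])
    pool = merged[:rerank_pool]
    # `score` is a placeholder for a cross-encoder relevance score (higher is better)
    return sorted(pool, key=score, reverse=True)[:final_k]
```

Bounding the pool keeps the expensive reranker cost fixed regardless of how many candidates the first-stage retrievers return.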
Bottleneck #5: Tool use and agency—planning is a system property, not just a model property
For tasks like “book travel,” “close a support ticket,” or “refactor a codebase,” the model must interact with external systems. Failures here aren’t mostly about raw intelligence; they’re about:
- bad tool APIs
- poor state management
- missing affordances (no “undo,” no dry-run)
- unclear tool selection policies
- lack of intermediate validation
- weak error handling and retries
Why “agents” fail in practice
Long-horizon tasks compound errors. If the model makes a small mistake early—wrong assumption, wrong tool parameter—it can spiral. More compute might make it “smarter,” but without scaffolding, the error accumulation remains.
Design patterns that move the needle
- Typed tool schemas: strict JSON schema, enums, required fields
- Tool result validation: verify outputs before continuing
- Plan/execute separation: create a plan, then execute step-by-step with checks
- Idempotent tools: safe retries without double-charging or double-writing
- Transaction boundaries: “preview changes” then “apply changes”
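As a sketch of the first and last patterns, here is a typed argument object for a hypothetical `book_flight` tool. The tool, its fields, and the `Cabin` enum are invented for illustration; the point is that model output is validated before any side effect, and that a preview (`dry_run`) is the default.

```python
from dataclasses import dataclass
from enum import Enum


class Cabin(str, Enum):
    ECONOMY = "economy"
    BUSINESS = "business"


@dataclass(frozen=True)
class BookFlightArgs:
    origin: str
    destination: str
    cabin: Cabin
    dry_run: bool = True  # preview by default; applying changes is opt-in

    @classmethod
    def from_model_output(cls, raw: dict) -> "BookFlightArgs":
        """Validate raw tool-call arguments; raise ValueError on any mismatch."""
        required = {"origin", "destination", "cabin"}
        missing = required - raw.keys()
        if missing:
            raise ValueError(f"missing required fields: {sorted(missing)}")
        extra = raw.keys() - (required | {"dry_run"})
        if extra:
            raise ValueError(f"unexpected fields: {sorted(extra)}")
        for name in ("origin", "destination"):
            if not isinstance(raw[name], str) or len(raw[name]) != 3:
                raise ValueError(f"{name} must be a 3-letter airport code")
        dry_run = raw.get("dry_run", True)
        if not isinstance(dry_run, bool):
            raise ValueError("dry_run must be a boolean")
        # Cabin(...) raises ValueError for values outside the enum
        cabin = Cabin(raw["cabin"])
        return cls(raw["origin"].upper(), raw["destination"].upper(), cabin, dry_run)
```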
A simple plan/execute template:
SYSTEM: You are an assistant that must use tools. First produce a PLAN with numbered steps. Then execute one step at a time. After each tool call, summarize results and confirm next step.
Even better is to enforce it in code: don’t let the model execute multiple steps without your orchestrator’s approval.
The bottleneck is orchestration design: state machines, tool contracts, and safety rails—software engineering.
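Enforcing step-by-step execution in code might look like the sketch below. It assumes the plan has already been parsed into a list of steps shaped like `{"tool": ..., "args": ...}`; that shape, the `run_plan` name, and the `validate` callback are assumptions for this example rather than any framework's API.

```python
from typing import Callable

Step = dict  # hypothetical shape: {"tool": str, "args": dict}


def run_plan(
    plan: list[Step],
    tools: dict[str, Callable[[dict], dict]],
    validate: Callable[[Step, dict], bool],
    max_steps: int = 10,
) -> list[dict]:
    """Execute a plan one step at a time, stopping at the first failed check."""
    if len(plan) > max_steps:
        raise ValueError("plan exceeds step budget")
    results = []
    for index, step in enumerate(plan):
        tool = tools.get(step["tool"])
        if tool is None:
            raise ValueError(f"step {index}: unknown tool {step['tool']!r}")
        result = tool(step["args"])
        # The orchestrator, not the model, decides whether to continue
        if not validate(step, result):
            raise RuntimeError(f"step {index}: result failed validation")
        results.append(result)
    return results
```

Stopping at the first failed validation is what keeps an early mistake from compounding across later steps.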
Bottleneck #6: Reliability, safety, and governance are product constraints, not model constraints
In many regulated or brand-sensitive environments, the biggest limiter isn’t “can the model answer?” but “can we ship it safely?”
That introduces constraints like:
- PII handling and redaction
- audit logs
- data residency
- access control
- policy compliance
- deterministic behavior in critical flows
Hardware doesn’t solve governance. Design does: permissioning, sandboxing, traceability, and human-in-the-loop workflows.
Practical controls that beat scaling
- Policy-as-code: explicit rules for disallowed content or actions
- Least-privilege tool permissions: the model can’t do what it shouldn’t
- Human approvals: for irreversible actions (payments, deletions, emails)
- Traceable citations: every claim tied to a source when needed
- Content provenance: track which documents influenced outputs
With these controls in place, a smaller model inside a well-designed system can be more acceptable to the business than a bigger model deployed without them.
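Least privilege and human approval can be combined in a single authorization check that runs before every tool call. This is a minimal sketch: the action names in `IRREVERSIBLE` and the `authorize` function are hypothetical, and a real deployment would load both the grants and the irreversible set from policy configuration.

```python
from dataclasses import dataclass

# Hypothetical action names that must never run without a human approving
IRREVERSIBLE = {"send_payment", "delete_record", "send_email"}


@dataclass(frozen=True)
class Decision:
    allowed: bool
    needs_human: bool
    reason: str


def authorize(action: str, granted: set[str]) -> Decision:
    """Decide whether an agent may take an action and whether a human must approve."""
    if action not in granted:
        return Decision(False, False, f"{action} not granted to this agent")
    if action in IRREVERSIBLE:
        return Decision(True, True, f"{action} is irreversible; require approval")
    return Decision(True, False, "within granted permissions")
```

Logging each `Decision` alongside the tool call also gives you the audit trail mentioned earlier.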
Bottleneck #7: Latency and cost shape what “progress” means
In production, a model that’s 3% better but 3× slower might be worse. If the UX requires sub-second responses, you’ll end up using:
- smaller models
- caching
- speculative decoding
- distilled models
- partial streaming
- retrieval and reranking trade-offs
This shifts the bottleneck to architecture decisions: when to call which model, how to cache, how to degrade gracefully.
A common design that improves both quality and cost:
- router model decides complexity
- small model handles most queries
- big model used for hard cases
- tools/RAG invoked only when necessary
This is design-led scaling: compute deployed strategically, not indiscriminately.
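The routing decision itself can be a very small function. In the sketch below, `length_heuristic` is a toy stand-in for a router model and the threshold is an arbitrary assumption; in practice you would tune both against your evaluation set.

```python
from typing import Callable


def length_heuristic(query: str) -> float:
    """Toy complexity score in [0, 1]: longer, multi-question queries score higher."""
    words = len(query.split())
    return min(1.0, words / 40 + 0.3 * query.count("?"))


def route(
    query: str,
    complexity: Callable[[str], float] = length_heuristic,
    threshold: float = 0.5,
) -> str:
    """Return which model tier should handle the query: 'small' or 'large'."""
    return "large" if complexity(query) >= threshold else "small"
```

Because the scorer is injected, you can swap the heuristic for a trained classifier without touching the calling code.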
Why hardware isn’t the limiting factor anymore (in many workflows)
Putting it together: hardware makes the model faster and bigger, but doesn’t automatically provide:
- the right objective
- the right data signal
- the right evaluation loop
- the right knowledge access
- the right tool scaffolding
- the right governance guarantees
- the right cost/latency envelope
In other words, the bottleneck is increasingly system design—the interfaces between model, data, tools, users, and the organization.
This matches what many teams observe: the biggest wins come from improving the surrounding system rather than swapping the model.
A design-first roadmap: how to unlock progress without waiting for the next GPU cycle
If you’re trying to move an AI product forward today, these steps tend to produce compounding gains.
1) Treat prompts, tools, and retrieval as versioned software artifacts
Prompts aren’t “text”; they’re part of your codebase. Put them under version control, code review them, and test them.
Do the same for:
- chunking configs
- retrieval parameters
- tool schemas
- routing rules
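One lightweight way to make these artifacts traceable is to derive a version identifier from their content, so every logged response can be tied to the exact prompt and configuration that produced it. The sketch below is an assumption about how you might do this, not an established convention; `artifact_version` is a hypothetical helper.

```python
import hashlib
import json


def artifact_version(prompt_template: str, config: dict) -> str:
    """Return a short, stable ID for a prompt template plus its configuration.

    Keys are sorted so that reordering the config does not change the ID.
    """
    payload = json.dumps(
        {"prompt": prompt_template, "config": config},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```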
2) Build an evaluation harness early (and keep expanding it)
Start with 50–200 scenarios that reflect real user tasks. Add regression cases weekly. Run them in CI.
If you do nothing else, do this. It turns AI iteration from guesswork into engineering.
3) Invest in data flywheels: turn production failures into training/eval data
Instrument the product:
- capture user corrections
- detect “I don’t know” moments
- log tool failures
- enable thumbs-down with categorical reasons
Then close the loop: new failures become new tests, and sometimes new fine-tuning data.
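Closing the loop can begin with a function that turns a negative feedback event into a draft regression case. The event fields, the reason categories, and the output shape below are all assumptions for illustration; the draft is marked unreviewed because a human should confirm it before it joins a golden set.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical categorical reasons offered with the thumbs-down control
REASONS = {"hallucination", "wrong_tool", "bad_format", "unhelpful"}


@dataclass
class FeedbackEvent:
    query: str
    answer: str
    thumbs_down: bool
    reason: Optional[str] = None
    user_correction: Optional[str] = None


def to_regression_case(event: FeedbackEvent) -> Optional[dict]:
    """Turn a categorized negative feedback event into a draft eval case."""
    if not event.thumbs_down or event.reason not in REASONS:
        return None  # positive or uncategorized feedback is not a test case
    case = {
        "input": event.query,
        "bad_output": event.answer,
        "category": event.reason,
        "reviewed": False,
    }
    if event.user_correction:
        case["expected_hint"] = event.user_correction
    return case
```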
4) Use multi-model systems intentionally
The best system is often not “one huge model,” but a composition:
- router
- retriever/reranker
- generator
- verifier
- safety classifier
This mirrors how reliable software is built: separation of concerns, defense in depth.
5) Constrain where it matters
If the output must be correct, constrain it:
- citations required
- JSON schema required
- tool-call-only mode
- deterministic templates for critical parts
Creativity is cheap; correctness is expensive. Design for correctness.
The deeper thesis: intelligence is not a scalar, it’s an interface problem
A lot of discourse treats “smarter models” as a single axis. In practice, deployed intelligence is the result of interfaces:
- between model and knowledge (retrieval)
- between model and action (tools)
- between model and humans (UX, clarifications)
- between model and accountability (evaluation, logging, governance)
Hardware mainly improves the model component. The next wave of progress is dominated by the interface layer—what software engineers would call the system architecture.
That’s why AI progress is increasingly bottlenecked by design, not hardware: we’ve reached the point where many hard problems are not about FLOPs, but about specifying, constraining, measuring, and integrating behavior in the real world.
If you want to push the frontier inside your product or organization, act like an engineer building a critical system: define contracts, add tests, instrument everything, and iterate on the parts you can actually control. Compute will keep improving, but design is what turns it into dependable capability.