How to Choose LLM Routing Strategies for Production AI

The theoretical case for LLM routing strategies is well understood at this point: send simple queries to cheap models, escalate complex ones to premium models, and capture the cost delta. RouteLLM’s 2025 paper reported cost reductions of up to 85% while maintaining 95% of GPT-4 quality.

The numbers are real. What’s harder to find written down is why routing implementations frequently underperform their design specs in production—and which failure mode is most dangerous precisely because it’s silent.

The Cost Problem Routing Is Solving

Before going into the mechanics, it helps to quantify what’s at stake. Premium models—GPT-4 class, Claude Opus, comparable tiers—run $30–60 per million tokens. Claude Haiku, GPT-4o-mini, and similar lightweight models sit at a fraction of that. In a production AI deployment handling customer service, internal tooling, or content workflows, the majority of queries are structurally simple: status lookups, templated summaries, short-form classification, yes/no eligibility checks. None of these require a frontier model to handle correctly.

Organizations routing all traffic to a single premium model are routinely overpaying by 40–85% compared to deployments using intelligent routing. That’s not a vendor benchmark—it’s the distribution seen consistently across APAC customer environments where SIRAYA has reviewed AI infrastructure spend. The savings are real. The implementation complexity is also real, and that’s where most teams run into trouble.

The Five Strategies and Their Actual Tradeoffs

Rule-based routing is the simplest form: keyword matching, intent pattern detection, or input length thresholds map requests to specific models. A routing rule that sends all requests under 200 tokens to a cheap model is cheap to execute (sub-millisecond overhead), easy to reason about, and brittle. It fails on short but semantically complex queries, on queries that happen to contain trigger keywords without needing the capability those keywords imply, and it requires ongoing manual maintenance as query patterns evolve.
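A minimal sketch of such a rule set, to make the brittleness concrete. The tier names, the keyword list, and the whitespace token estimate are all assumptions chosen for illustration; only the 200-token cutoff comes from the example above.

```python
import re

# Hypothetical tier names and rules, for illustration only.
CHEAP, PREMIUM = "cheap-model", "premium-model"
ESCALATE_PATTERN = re.compile(r"\b(contract|regulation|liability)\b", re.IGNORECASE)
MAX_CHEAP_TOKENS = 200


def approx_token_count(text: str) -> int:
    """Crude whitespace estimate; a real tokenizer would give different counts."""
    return len(text.split())


def route_by_rules(prompt: str) -> str:
    """Return a model tier using a keyword rule first, then a length rule."""
    if ESCALATE_PATTERN.search(prompt):
        return PREMIUM
    if approx_token_count(prompt) < MAX_CHEAP_TOKENS:
        return CHEAP
    return PREMIUM
```

With these rules, "Is my phone contract renewed yet?" goes to the premium tier purely because it contains a trigger keyword, which is the false-positive case described above. A short but subtle question with no trigger keyword goes to the cheap tier.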

Semantic/ML-based routing uses a classifier—typically a small BERT-style model or fine-tuned embedding classifier—to assess prompt complexity or domain and route accordingly. More accurate than rules, but adds 10–50ms classification overhead per request. For latency-tolerant workloads this is acceptable; for real-time conversational AI where end-to-end latency targets are under 500ms, that overhead is non-trivial.

Cascade routing is the most commonly implemented strategy and the subject of the rest of this article, because it’s also the one most likely to degrade silently in production.

Latency-based routing routes to whichever model endpoint is fastest given current load—useful for failover and geographic distribution, but provides no cost optimization on its own. It needs to be layered on top of quality-aware routing, not used as a substitute.

Consensus routing sends the same prompt to multiple models and aggregates responses. Genuinely useful for high-stakes decisions—research estimates 7–15 point accuracy improvement over single-model responses, with GPQA-diamond benchmarks jumping from 46.9% to 68.2%. Cost multiplies by the number of models invoked, which makes this appropriate for low-volume, high-value decisions rather than any workload where volume is significant.
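As a rough illustration, the simplest aggregation is a majority vote. The sketch below assumes each model is a plain callable returning a short string answer, and that lowercasing and trimming is enough to compare answers; real deployments usually need a more careful answer-matching step.

```python
from collections import Counter
from typing import Callable, List, Tuple


def consensus_answer(prompt: str, models: List[Callable[[str], str]]) -> Tuple[str, float]:
    """Majority vote over normalized answers.

    Returns the winning answer and the fraction of models that agreed with it.
    Every model is called once, so cost scales with len(models).
    """
    if not models:
        raise ValueError("at least one model is required")
    votes = Counter(model(prompt).strip().lower() for model in models)
    answer, count = votes.most_common(1)[0]
    return answer, count / len(models)


# Hypothetical stand-ins for three different model endpoints.
stub_models = [lambda p: "Yes", lambda p: "yes ", lambda p: "No"]
winner, agreement = consensus_answer("Is the applicant eligible?", stub_models)
```

The agreement fraction is worth logging alongside the answer: a narrow majority is a natural trigger for human review on the high-stakes decisions this strategy is meant for.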

Cascade Routing: The Threshold Calibration Problem

Cascade routing follows a simple loop: send the request to the cheapest capable model; if the model’s confidence in its response falls below a threshold, escalate to the next tier; repeat until you reach the premium model or a response clears the threshold.
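The loop can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes each tier is a callable that returns an answer together with a confidence score between 0 and 1, and how that score is obtained (token log-probabilities, a separate verifier, self-reported confidence) is left open. The stub models and their scores are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A tier is a name plus a callable returning (answer, confidence in [0, 1]).
ModelFn = Callable[[str], Tuple[str, float]]


@dataclass
class CascadeResult:
    answer: str
    tier: str
    calls: int  # number of model calls made, which is what you pay for


def cascade(prompt: str, tiers: List[Tuple[str, ModelFn]], threshold: float) -> CascadeResult:
    """Try tiers from cheapest to most expensive; stop once confidence clears the threshold."""
    if not tiers:
        raise ValueError("at least one tier is required")
    for calls, (name, model) in enumerate(tiers, start=1):
        answer, confidence = model(prompt)
        is_last_tier = calls == len(tiers)
        # The last tier's answer is returned even if it misses the threshold.
        if confidence >= threshold or is_last_tier:
            return CascadeResult(answer, name, calls)
    raise AssertionError("unreachable: the last tier always returns")


# Hypothetical stubs: the cheap model is only confident on short prompts.
def cheap_stub(prompt: str) -> Tuple[str, float]:
    return "cheap answer", 0.9 if len(prompt) < 40 else 0.3


def premium_stub(prompt: str) -> Tuple[str, float]:
    return "premium answer", 0.95


TIERS = [("cheap", cheap_stub), ("premium", premium_stub)]
```

Note that `calls` counts every model invoked on the way up, so an escalated request is billed for the cheap call as well as the premium one.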

The design is sound. The calibration is where production deployments diverge from design intent.

Confidence thresholds are almost always tuned against benchmark datasets—MMLU, HellaSwag, internal curated query sets. These datasets have clean, classifiable difficulty distributions. Production query distributions do not. Customers phrase the same underlying question ten different ways. A query that looks structurally simple (short, common vocabulary, no technical terminology) can be semantically complex in context. The reverse is equally common: a long, technically worded query that the lightweight model handles correctly because it matches a well-covered pattern in its training data.

The two failure modes split along opposite axes:

If the threshold is set too high, most queries escalate, sometimes 60–70% of them versus the 15–25% that a well-calibrated cascade would escalate. Almost no cheap-model response clears the confidence bar, so you’re paying premium-model prices for the majority of your traffic while also adding the latency cost of the initial cheap-model round trip to every request that escalates. Your cost dashboard shows worse numbers than single-model routing. Your latency p95 degrades. Teams typically catch this failure mode within days because both cost and latency signals are visible.

If the threshold is set too low, queries that genuinely require the premium model don’t escalate. The lightweight model returns a plausible-sounding response with errors, and its confidence still clears the bar. This failure mode is silent. No alerts fire. Response latency looks good. Cost metrics look excellent. The routing appears to be working exactly as intended, until a downstream audit, a user complaint, or a business outcome metric reveals that answer quality has quietly degraded by 20–30% on the subset of queries that needed escalation.

In production environments handling customer-facing queries, the silent failure is the dangerous one. A 30% degradation in answer quality for complex queries may not surface in generic satisfaction scores for weeks.

Why Benchmarks Don’t Calibrate Production Thresholds

Most routing papers and vendor documentation calibrate thresholds against benchmark datasets, then present the resulting cost reduction as a production expectation. This creates a systematic overestimate.

Benchmark datasets are curated to have clear correct answers and measurable difficulty levels. They do not capture the ambiguity, context-dependence, and domain-specificity of real production queries. A financial services application’s query distribution (contract interpretation, regulatory cross-referencing, nuanced scenario analysis) will show very different escalation behavior from a consumer support application’s distribution.

In practice, correct threshold calibration requires shadow mode deployment: run your cascade routing logic in parallel with single-model routing, compare output quality across tiers on your actual production traffic, and set thresholds based on observed escalation behavior rather than benchmark performance. This requires ground truth labels on a meaningful sample of your production queries—which is expensive but is the only way to calibrate a threshold that doesn’t silently degrade over time.
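One way the calibration step might look once shadow-mode data exists. The sketch assumes each labeled sample records the cheap model's confidence and whether its answer was judged correct, and that the acceptance criterion is a minimum accuracy on answers that are not escalated. The sample values and the 0.9 target are invented for illustration.

```python
from typing import List, NamedTuple, Optional


class ShadowSample(NamedTuple):
    confidence: float    # cheap model's confidence on a real production query
    cheap_correct: bool  # ground-truth label for the cheap model's answer


def pick_threshold(samples: List[ShadowSample], min_accuracy: float) -> Optional[float]:
    """Lowest threshold at which answers kept on the cheap tier meet min_accuracy.

    Returns None if no candidate threshold reaches the target.
    """
    for threshold in sorted({s.confidence for s in samples}):
        kept = [s for s in samples if s.confidence >= threshold]
        accuracy = sum(s.cheap_correct for s in kept) / len(kept)
        if accuracy >= min_accuracy:
            return threshold
    return None


def escalation_rate(samples: List[ShadowSample], threshold: float) -> float:
    """Fraction of samples that would escalate at this threshold."""
    return sum(s.confidence < threshold for s in samples) / len(samples)


# Invented shadow-mode data, for illustration only.
samples = [
    ShadowSample(0.20, False), ShadowSample(0.40, False),
    ShadowSample(0.50, True), ShadowSample(0.60, False),
    ShadowSample(0.70, True), ShadowSample(0.80, True),
    ShadowSample(0.90, True), ShadowSample(0.95, True),
]
threshold = pick_threshold(samples, min_accuracy=0.9)
```

Choosing the lowest threshold that passes keeps the escalation rate as small as the quality target allows. Re-running the same procedure on fresh labeled traffic is what a recalibration cycle amounts to.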

A common pattern SIRAYA sees in APAC deployments: teams implement cascade routing from a well-known open-source stack, apply the paper’s recommended threshold, run for 30 days, and report excellent cost reduction. Six months later, an unrelated product review finds that the model tier handling 75% of queries is incorrect at a rate that would have failed the original acceptance criteria. The threshold was never recalibrated after the initial deployment, and query distribution had drifted.

The Overhead Trap at Low Volume

There is a volume floor below which cascade and semantic routing strategies add cost rather than reduce it.

Semantic routing requires maintaining an embedding model (or calling an embedding API) to classify each prompt. If your deployment handles 10,000 requests per day and your embedding classification call costs $0.0001 per request, the classification overhead is $1/day. Trivial. At 10 million requests per day, that’s $1,000/day in classification overhead—which may still be well below the savings from routing, or may represent a meaningful fraction of your total LLM spend depending on your model tier mix.

For cascade routing, the overhead is the first-tier model call for every request that eventually escalates. If 30% of requests escalate, those requests pay for two model calls—cheap model plus premium model—with the cheap model call adding latency rather than value. At high escalation rates, cascade routing can cost more than direct routing to the premium model.

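The break-even point depends on escalation rate, volume, and the price differential between model tiers. At the numbers in the RouteLLM study (85% cost reduction, 26% of requests needing premium models), the math is compelling at virtually any production volume. At a miscalibrated 65% escalation rate with a 2x premium price differential, the cascade costs more than sending everything to the premium model: assuming similar token counts per call, each request pays one cheap call plus, 65% of the time, a premium call priced at two cheap calls, for an average of 2.3 cheap-call units against 2.0 for direct premium routing.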
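The break-even arithmetic is simple enough to write down. The sketch below assumes a two-tier cascade, comparable token counts per call on both tiers, and no separate classifier cost; the function names are illustrative.

```python
def cascade_cost_per_request(cheap_cost: float, premium_cost: float, escalation_rate: float) -> float:
    """Expected cost: every request pays the cheap call, escalated ones also pay premium."""
    return cheap_cost + escalation_rate * premium_cost


def break_even_escalation_rate(cheap_cost: float, premium_cost: float) -> float:
    """Escalation rate above which the cascade costs more than premium-only routing.

    Solves cheap_cost + e * premium_cost = premium_cost for e.
    """
    return 1 - cheap_cost / premium_cost


# 2x price differential, 65% escalation: the cascade is the more expensive option.
miscalibrated = cascade_cost_per_request(cheap_cost=1.0, premium_cost=2.0, escalation_rate=0.65)

# Hypothetical 20x differential, 26% escalation: the cascade is far cheaper.
calibrated = cascade_cost_per_request(cheap_cost=1.0, premium_cost=20.0, escalation_rate=0.26)
```

With a 2x differential the break-even escalation rate is 50%, so a 65% escalation rate puts the cascade above the premium-only cost. The wider the price gap between tiers, the closer the break-even rate gets to 100%.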

What to Actually Observe

Most teams monitoring LLM routing watch cost per request and response latency. These metrics are necessary but not sufficient.

The metric that catches silent quality degradation is escalation rate over time, tracked as a time series. A well-calibrated cascade will have a stable escalation rate—if your query distribution is reasonably consistent, roughly the same fraction of requests should escalate month over month. A rising escalation rate may indicate that query complexity is increasing or that threshold calibration is drifting. A falling escalation rate without an accompanying quality improvement is a warning: either your queries are genuinely getting simpler, or your threshold has shifted and complex queries are no longer escalating.
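A minimal sketch of that check, assuming you already log one escalation rate per day. The window sizes and the tolerance are placeholders to tune against your own traffic, not recommendations.

```python
from statistics import mean
from typing import List, Optional


def escalation_drift(
    daily_rates: List[float],
    baseline_days: int,
    recent_days: int,
    tolerance: float,
) -> Optional[str]:
    """Compare the mean of the most recent days with the mean of the earliest days.

    Returns "rising", "falling", or "stable", or None when there is too little history.
    """
    if len(daily_rates) < baseline_days + recent_days:
        return None
    baseline = mean(daily_rates[:baseline_days])
    recent = mean(daily_rates[-recent_days:])
    if recent > baseline + tolerance:
        return "rising"
    if recent < baseline - tolerance:
        return "falling"
    return "stable"


# Invented history: two weeks near 20% escalation, then a week near 12%.
history = [0.20] * 14 + [0.12] * 7
status = escalation_drift(history, baseline_days=14, recent_days=7, tolerance=0.03)
```

A "falling" result is the one to pair with a quality review, since on its own it cannot distinguish simpler queries from a threshold that has stopped escalating complex ones.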

The second metric is response quality sampling: periodic human review or automated LLM-based evaluation of a stratified sample of responses, specifically filtering for low-escalation queries that pattern-match to known-complex request types. This is operationally expensive, which is why most teams skip it—and why silent quality degradation persists undetected.
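A sketch of how such a review sample might be drawn. It assumes each logged response is a dict with `request_type` and `escalated` fields, and that you maintain a set of request types known to be complex; both the field names and the type labels are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def review_sample(
    records: Iterable[dict],
    complex_types: Set[str],
    per_type: int,
    seed: int = 0,
) -> List[dict]:
    """Pick up to per_type non-escalated responses from each known-complex request type."""
    buckets: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        if not record["escalated"] and record["request_type"] in complex_types:
            buckets[record["request_type"]].append(record)
    rng = random.Random(seed)  # seeded so a review batch can be reproduced
    sample: List[dict] = []
    for request_type in sorted(buckets):
        bucket = buckets[request_type]
        sample.extend(rng.sample(bucket, min(per_type, len(bucket))))
    return sample


# Invented log records, for illustration only.
records = [
    {"id": 1, "request_type": "contract", "escalated": False},
    {"id": 2, "request_type": "contract", "escalated": False},
    {"id": 3, "request_type": "contract", "escalated": True},
    {"id": 4, "request_type": "status", "escalated": False},
    {"id": 5, "request_type": "eligibility", "escalated": False},
]
batch = review_sample(records, {"contract", "eligibility"}, per_type=1)
```

Sampling per request type rather than uniformly keeps rare but complex categories in every review batch, which is where a cheap tier is most likely to be wrong without anyone noticing.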

At the architecture level: if you are routing across regions (which in APAC is common—Singapore, Tokyo, Sydney, Mumbai all have different model availability and latency profiles), your routing logic needs to account for regional endpoint health independently. A routing strategy that assumes a specific model tier is always available will degrade to the fallback tier silently during regional outages, with no indication that quality expectations are no longer being met for that region’s traffic.
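One way to keep a regional fallback from being silent is to make the downgrade part of the routing decision itself. The sketch below assumes a health map keyed by region and tier that some external health check keeps up to date; the region and tier names are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RouteDecision:
    region: str
    tier: Optional[str]  # None when no tier in the region is healthy
    degraded: bool       # True whenever the preferred tier was not used


def route_in_region(
    region: str,
    tier_preference: List[str],
    health: Dict[Tuple[str, str], bool],
) -> RouteDecision:
    """Pick the first healthy tier in preference order and flag any downgrade."""
    for index, tier in enumerate(tier_preference):
        # Unknown (region, tier) pairs are treated as unhealthy.
        if health.get((region, tier), False):
            return RouteDecision(region, tier, degraded=index > 0)
    return RouteDecision(region, None, degraded=True)


# Hypothetical health snapshot: the premium tier is down in Singapore.
health = {
    ("singapore", "premium"): False,
    ("singapore", "cheap"): True,
    ("tokyo", "premium"): True,
}
decision = route_in_region("singapore", ["premium", "cheap"], health)
```

Counting `degraded` decisions per region gives a direct signal that quality expectations are not being met during an outage, instead of relying on cost or latency metrics that may look normal.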

The Decision Most Teams Get Wrong

The most common deployment decision error is treating LLM routing as a configuration step rather than a calibration process. Teams pick a routing strategy, set a threshold, and move on. The routing then ages in place as query distributions shift, model providers update their APIs, and business requirements evolve.

Effective LLM routing strategies require ongoing calibration cycles—minimally quarterly, ideally monthly for high-volume or quality-sensitive workloads. The threshold that was correct at launch is probably not correct six months later. The model tier that handled your edge cases in Q1 may have been updated by the provider in Q3 in ways that shift its competency boundary.

For teams building this from scratch: start with rule-based routing as a baseline, not cascade. Rule-based gives you explicit, observable behavior that you can reason about and trace. Build quality measurement tooling before you build ML-based routing—you need ground truth quality data to calibrate any threshold-based strategy. Introduce cascade routing when you have sufficient production signal to set the threshold against your actual query distribution, not a benchmark’s.

The cost savings are real enough to justify the investment. The calibration discipline is what determines whether those savings arrive at the expense of quality you can’t see degrading.
