Scaling AURA from Sandbox to Live Servers
Sep 25, 2025
When you first looked at the AURA sandbox demo, the focus was on control and canon-strict behavior. Now the big question is: how do we make this feel instant for thousands of concurrent players?
The short answer: design for throughput, and keep each turn as lightweight as possible. Here’s how we’re approaching it:
1. Use the right model (speed > bragging rights)
We don’t need giant 70B models for moment-to-moment gameplay. A fast 7–9B instruct model (like Llama-3-8B or Mistral-7B) runs great on high-end GPUs with the right optimizations.
2. One LLM call per turn
The old pipeline called multiple models (topic → context → directive → final). We collapsed that into a single prompt:
Player line
Top 2–3 memory snippets
A profile/style hint
The model only generates the reply. Fewer calls = faster and cheaper.
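A minimal sketch of how that single prompt might be assembled (the template, field names, and helper are illustrative, not AURA’s exact format):

```python
def build_turn_prompt(player_line: str, memories: list[str], style_hint: str) -> str:
    """Assemble the one-and-only prompt for this turn: the player line,
    the top 2-3 memory snippets, and a profile/style hint."""
    memory_block = "\n".join(f"- {m}" for m in memories[:3])
    return (
        f"Style: {style_hint}\n"
        f"Relevant memories:\n{memory_block}\n"
        f"Player: {player_line}\n"
        f"NPC:"
    )

prompt = build_turn_prompt(
    "Where can I find the blacksmith?",
    ["The blacksmith moved to the east gate last week."],
    "Gruff but helpful innkeeper; short sentences.",
)
```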
3. Throughput math (per GPU)
Short replies (≤32 words, ~45 tokens)
Small prompts (≤500 tokens total)
A high-end GPU can push thousands of tokens/sec with batching
At ~545 tokens per turn (≤500 prompt + ~45 output), a GPU sustaining ~2,000+ tokens/sec works out to roughly 4 turns per second. With multi-GPU nodes and scale-out, thousands of players become realistic.
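To make that capacity math concrete, here’s a back-of-the-envelope sketch. The throughput figure and player cadence are assumptions you’d replace with your own benchmarks:

```python
# Back-of-the-envelope capacity planning. All inputs are assumptions;
# replace them with measured numbers from your own load tests.
PROMPT_TOKENS = 500        # upper bound on prompt size per turn
OUTPUT_TOKENS = 45         # ~32-word reply
GPU_TOKENS_PER_SEC = 2200  # assumed batched throughput for a quantized 7-9B model

tokens_per_turn = PROMPT_TOKENS + OUTPUT_TOKENS                # ~545
turns_per_sec_per_gpu = GPU_TOKENS_PER_SEC / tokens_per_turn   # ~4

# If the average player sends a line every ~20 seconds (assumption):
players_per_gpu = turns_per_sec_per_gpu * 20                   # ~80 concurrent players

print(f"{turns_per_sec_per_gpu:.1f} turns/s/GPU, ~{players_per_gpu:.0f} players/GPU")
print(f"An 8-GPU node: ~{players_per_gpu * 8:.0f} players")
```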
4. Engine tricks that matter
vLLM (CUDA or ROCm) for batching
Quantization (4-bit weights, lighter cache)
Speculative decoding (a tiny draft model speeds up the main one by 20–40%)
Streaming tokens so players see replies instantly
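For illustration, here’s roughly what a batched vLLM setup looks like. The model name and sampling values are placeholders, and quantized-weight and speculative-decoding options are passed via engine args that vary by vLLM version, so treat this as a sketch rather than our exact config:

```python
from vllm import LLM, SamplingParams

# Illustrative setup: model name and sampling values are placeholders,
# not our production config. Quantized checkpoints and speculative
# decoding are enabled via extra engine args that vary by vLLM version.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

params = SamplingParams(
    temperature=0.7,
    max_tokens=60,  # hard cap keeps replies short by construction
)

# vLLM continuously batches whatever is in flight; one generate() call
# here stands in for many players' turns arriving together.
prompts = [
    "Player: Where can I find the blacksmith?\nNPC:",
    "Player: Any rumors from the docks?\nNPC:",
]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```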
5. Request shaping = control
We hard-cap replies at ~50–60 tokens and keep memory snippets short. RAG is anchored by the current topic so we don’t waste cycles fetching irrelevant context. If nothing useful comes back, we skip RAG entirely.
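A minimal sketch of that anchor-gated retrieval step; the similarity threshold and the shape of the memory store are assumptions for illustration:

```python
import numpy as np

SIMILARITY_FLOOR = 0.35  # assumed threshold; tune against real queries

def retrieve_memories(topic_vec, memory_vecs, memory_texts, k=3):
    """Return up to k memory snippets anchored to the current topic,
    or an empty list so the caller can skip RAG entirely."""
    if len(memory_vecs) == 0:
        return []
    # Cosine similarity between the topic anchor and each stored memory.
    sims = memory_vecs @ topic_vec / (
        np.linalg.norm(memory_vecs, axis=1) * np.linalg.norm(topic_vec)
    )
    top = np.argsort(sims)[::-1][:k]
    # Gate: if nothing clears the floor, spend zero prompt tokens on RAG.
    return [memory_texts[i] for i in top if sims[i] >= SIMILARITY_FLOOR]
```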
6. Multi-region deployment
Regional inference clusters (NA/EU/APAC) keep RTT under 80 ms. Auto-scaling is driven by token budgets, not just request counts.
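As a sketch of what token-budget-driven scaling means in practice (the constants and function are illustrative, not our autoscaler):

```python
import math

# Sketch: scale on projected token load, not raw request counts.
# Both constants are assumptions for illustration.
GPU_TOKENS_PER_SEC = 2200   # assumed per-GPU batched throughput
TARGET_UTILIZATION = 0.7    # leave headroom for bursts

def gpus_needed(requests_per_sec: float, avg_tokens_per_turn: float) -> int:
    """Convert a request rate into a GPU count via token throughput."""
    tokens_per_sec = requests_per_sec * avg_tokens_per_turn
    return math.ceil(tokens_per_sec / (GPU_TOKENS_PER_SEC * TARGET_UTILIZATION))

# 100 turns/sec at ~545 tokens each: tokens, not requests, set the bill.
print(gpus_needed(100, 545))  # ~36 GPUs in this illustrative scenario
```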
7. Guardrails
Target P50 time-to-first-token <250 ms
P95 full turn under ~1.3 s
If queues build, shorten replies and skip optional lookups (see the sketch below)
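Graceful degradation under load might look like this sketch; the queue-depth thresholds are assumptions you’d calibrate from load testing:

```python
from dataclasses import dataclass

@dataclass
class TurnPlan:
    max_tokens: int
    use_rag: bool

# Illustrative thresholds; real values come from load testing.
def plan_turn(queue_depth: int) -> TurnPlan:
    """Trade reply length and optional lookups for latency as queues grow."""
    if queue_depth < 50:
        return TurnPlan(max_tokens=60, use_rag=True)   # normal operation
    if queue_depth < 200:
        return TurnPlan(max_tokens=40, use_rag=True)   # trim replies first
    return TurnPlan(max_tokens=30, use_rag=False)      # shed optional work
```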
Takeaway
For real-time games, smaller, faster models with smart prompt design beat bigger ones on perceived speed. AURA’s architecture lets us use the fast lane for everyday play, while still keeping a cinematic lane for those rare, unforgettable hero moments.
TL;DR: Quantized 7–9B models on high-end GPUs, one LLM call per turn, short outputs, anchor-gated memory, and streaming. That’s how AURA will scale from a demo into a live service that feels instant for thousands of concurrent players.