Business Problem & Solution · Enterprise RAG Support Platform

Support teams answer the same questions hundreds of times — phrased differently every time, scattered across five domains, and buried in documents that change under their feet. This platform was architected to answer those questions from approved documents only, say "I don't know" when the evidence is weak, and hand off to a human with full context when it matters.

The problem

ACSI operates at real scale: 550+ employees, 300+ field inspectors as the heaviest user group, and a 10,000+ campsite ecosystem. Operational questions span portal operations, promotions, finance, inspection and editorial workflows, and ICT. The goals were support deflection, answer consistency, 24/7 availability, response traceability, and structured handoff to humans — not a chatbot demo.

The hard constraint that shaped everything: enterprise answers must come from approved documents, not from model memory. Every downstream decision — retrieval design, validation, escalation — follows from that one sentence.

Why not the obvious answers

Three system-level alternatives were evaluated and rejected before this architecture existed.

The self-hosting call is the contrarian one, so it deserves its number: at roughly 5K–10K queries per day, self-hosted vLLM breaks even against managed API pricing — and past that point the cost curves diverge permanently.

Architecture

Seven layers, one bounded loop. A FastAPI orchestrator runs a custom state machine — the system can rewrite a query, retry retrieval, ask for clarification, or escalate, but only along explicit, auditable transitions. No autonomous tool use.
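A bounded loop with explicit transitions can be sketched as an allow-list of state changes plus a retry budget. This is a minimal illustration, not the platform's orchestrator: the state names, the `MAX_REWRITES` budget, and the `Run` class are all assumptions made for the example.

```python
from enum import Enum


class State(Enum):
    RETRIEVE = "retrieve"
    REWRITE = "rewrite"
    GENERATE = "generate"
    CLARIFY = "clarify"
    ESCALATE = "escalate"
    ANSWER = "answer"


# Explicit allow-list: any transition not listed here is rejected.
TRANSITIONS = {
    State.RETRIEVE: {State.GENERATE, State.REWRITE, State.CLARIFY, State.ESCALATE},
    State.REWRITE: {State.RETRIEVE},
    State.GENERATE: {State.ANSWER, State.REWRITE, State.CLARIFY, State.ESCALATE},
    State.CLARIFY: set(),   # terminal: wait for the user
    State.ESCALATE: set(),  # terminal: hand off to a human
    State.ANSWER: set(),    # terminal
}

MAX_REWRITES = 2  # hypothetical budget that keeps the loop bounded


class Run:
    """One request's path through the state machine, with an audit trail."""

    def __init__(self):
        self.state = State.RETRIEVE
        self.rewrites = 0
        self.audit = [State.RETRIEVE]

    def step(self, target):
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {target.name}")
        if target is State.REWRITE:
            if self.rewrites >= MAX_REWRITES:
                target = State.ESCALATE  # budget exhausted: stop looping
            else:
                self.rewrites += 1
        self.state = target
        self.audit.append(target)
        return target
```

Because every move is appended to `audit`, the path a request took can be logged alongside its trace.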

Session state lives in Redis (role, domain, last 3–5 turns, clarification and escalation state), which keeps the orchestrator stateless and horizontally scalable on the CPU node pool. Inference runs on a separate GPU node pool — the two scale independently because they have fundamentally different resource profiles.
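The turn window can be illustrated without a Redis client by treating the session as the JSON string that would be stored under a session key. This is a sketch under assumptions: the field names (`turns`, `speaker`, `text`) and the `append_turn` helper are hypothetical, and the cap of 5 is simply the upper end of the 3–5 turn range.

```python
import json
from collections import deque

MAX_TURNS = 5  # upper end of the 3-5 turn window


def append_turn(session_json, speaker, text):
    """Return updated session JSON with the turn history capped at MAX_TURNS."""
    session = json.loads(session_json)
    # deque(maxlen=...) silently drops the oldest turn once the cap is reached.
    turns = deque(session.get("turns", []), maxlen=MAX_TURNS)
    turns.append({"speaker": speaker, "text": text})
    session["turns"] = list(turns)
    return json.dumps(session)
```

Keeping the whole session in one serialized value is what lets any orchestrator replica pick up the next turn.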

The retrieval stack

Retrieval quality was treated as a measured property, not an assumption. Embedding models were selected on Recall@5, MRR, and nDCG@10 against an internal evaluation set that included Dutch-language variants and deliberately ambiguous queries.
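For reference, the three metrics can be computed per query as follows. This is a generic sketch of the standard definitions, not the internal evaluation harness; the function names are illustrative, and nDCG uses the linear-gain form with graded relevance passed as a dict.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, gains, k):
    """nDCG@k, where gains maps doc id -> graded relevance (missing = 0)."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(i + 1)
        for i, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def mean_reciprocal_rank(runs):
    """MRR over an iterable of (ranked, relevant) pairs, one per query."""
    values = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(values) / len(values) if values else 0.0
```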

Documents are parsed with structure detection (heading hierarchy, tables, policy sections) and chunked structure-aware: 300–700 token targets, 10–20% overlap at structural boundaries only, parent-child relationships preserved, and metadata enrichment (domain, language, version, effective date) on every chunk.
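One way to picture the chunking step is a greedy packer that never crosses a section boundary and stamps every chunk with its parent and metadata. This is a simplified sketch: it assumes parsing has already produced sections as lists of paragraphs, uses whitespace word counts as a stand-in for a real tokenizer, and omits the overlap logic entirely. `Chunk`, `count_tokens`, and `chunk_section` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str
    parent_id: str  # id of the section this chunk was cut from
    text: str
    metadata: dict = field(default_factory=dict)


def count_tokens(text):
    # Assumption: whitespace-separated words approximate tokenizer counts.
    return len(text.split())


def chunk_section(section_id, paragraphs, metadata, max_tokens=700):
    """Pack whole paragraphs of one section into chunks of at most max_tokens.

    A single paragraph longer than max_tokens becomes its own oversized chunk.
    """
    chunks, current, size = [], [], 0

    def flush():
        nonlocal current, size
        if current:
            chunk_id = f"{section_id}#{len(chunks)}"
            chunks.append(Chunk(chunk_id, section_id, "\n\n".join(current), dict(metadata)))
            current, size = [], 0

    for para in paragraphs:
        n = count_tokens(para)
        if current and size + n > max_tokens:
            flush()
        current.append(para)
        size += n
    flush()
    return chunks
```

Calling `chunk_section` once per parsed section keeps chunk boundaries aligned with the document's structure, and the `parent_id` lets retrieval expand a hit back to its full section.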

Search is hybrid by design. Dense vectors (Qwen3-Embedding-0.6B) handle semantics; sparse search catches the exact identifiers — portal module names, form codes, policy numbers — that embeddings blur together. Results merge via reciprocal rank fusion, then a cross-encoder reranker (Qwen3-Reranker-0.6B) scores the top 30–50 candidates down to the 5–8 chunks that actually enter the prompt. On internal benchmarks, hybrid search improved Recall@10 by 8–12% on keyword-heavy queries for negligible merge overhead — the cheapest quality win in the entire system.
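Reciprocal rank fusion itself is only a few lines. The sketch below merges any number of ranked id lists; the constant `k=60` is the value conventionally used for RRF and is an assumption here, since the setting used on this platform is not stated.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # hypothetical ids from vector search
sparse_hits = ["c", "a", "d"]  # hypothetical ids from keyword search
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
# fused == ["a", "c", "b", "d"]
```

Because only ranks are used, the dense and sparse scores never need to be normalized against each other, which is why the merge overhead stays small.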

The corpus itself is versioned: every ingestion run produces a tagged corpus version behind a Qdrant collection alias, so index swaps are zero-downtime and rollback is one alias flip. Documents past their effective date get flagged and can be excluded from retrieval by metadata filter.

Generation under guardrails

Inference is tiered across three open-source models served by vLLM: a small tier (Qwen3-4B) for FAQ-style queries, a medium tier (Llama-3.3-70B) for procedural answers, and a large MoE tier (Llama 4 Maverick, 17B active of 400B total parameters) for complex synthesis. Roughly 70–85% of traffic resolves on the small tier. The MoE tier delivers large-model quality at roughly the per-token compute of a 17B dense model, but all 400B parameters must stay resident in GPU memory: at int8 that is about 400 GB of weights, so it needs a multi-GPU node rather than a single A100 80GB.

Every generation is forced into a structured contract via guided decoding (Outlines) — the model cannot produce free-form output:

answer-schema.json

{
  "answer": "string — grounded in the supplied evidence only",
  "confidence": "number — 0.0 to 1.0",
  "citations": ["chunk references that must resolve to retrieved evidence"],
  "escalation_needed": "boolean"
}

A five-stage validation chain then runs before anything reaches a user: schema parse, groundedness check (claims vs. evidence), citation verification, confidence estimation, and the final answer/clarify/escalate decision. The original design had no checkpoints between retrieval and response; the validation chain is the single biggest reliability upgrade over it.
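The shape of the chain can be sketched with the standard library alone. This is an illustration under assumptions, not the production validator: the groundedness check (stage 2) is passed in as a boolean because it would be model-based, the model's self-reported `confidence` stands in for a separate confidence estimator (stage 4), and the thresholds `answer_at` and `clarify_at` are hypothetical.

```python
import json

REQUIRED = {
    "answer": str,
    "confidence": (int, float),
    "citations": list,
    "escalation_needed": bool,
}


def parse_contract(raw):
    """Stage 1: schema parse. Returns the payload or raises ValueError."""
    payload = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    for key, expected in REQUIRED.items():
        if key not in payload or not isinstance(payload[key], expected):
            raise ValueError(f"field {key!r} missing or wrong type")
    conf = payload["confidence"]
    if isinstance(conf, bool) or not 0.0 <= conf <= 1.0:
        raise ValueError("confidence must be a number in [0, 1]")
    if not all(isinstance(c, str) for c in payload["citations"]):
        raise ValueError("citations must be chunk id strings")
    return payload


def citations_resolve(payload, retrieved_ids):
    """Stage 3: at least one citation, and each one is a retrieved chunk."""
    cited = set(payload["citations"])
    return bool(cited) and cited <= set(retrieved_ids)


def decide(raw, retrieved_ids, grounded, answer_at=0.75, clarify_at=0.4):
    """Stage 5: map the validated payload to answer / clarify / escalate."""
    try:
        payload = parse_contract(raw)
    except ValueError:
        return "escalate"
    if payload["escalation_needed"] or not grounded:
        return "escalate"
    if not citations_resolve(payload, retrieved_ids):
        return "escalate"
    if payload["confidence"] >= answer_at:
        return "answer"
    if payload["confidence"] >= clarify_at:
        return "clarify"
    return "escalate"
```

Each stage fails closed: anything that cannot be parsed, grounded, or cited never reaches the answer branch.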

When the system says "I don't know"

Low confidence doesn't produce a worse answer — it produces a different action. The state machine can rewrite the query and retry retrieval with relaxed filters, ask the user a clarifying question, or escalate to a human with the full case file: original query, retrieved evidence, attempted answer, confidence score, conversation context, and retrieval quality signals. Domain-aware routing sends finance questions to finance, ICT to the helpdesk, inspection queries to inspector support.
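The case file and the routing table can be pictured as a small data structure. The sketch below mirrors the fields listed above; the queue identifiers, the `general-support` fallback, and the helper names are hypothetical.

```python
from dataclasses import dataclass, field, asdict

# Hypothetical queue ids for the domain-aware routing described above.
ROUTES = {
    "finance": "finance",
    "ict": "helpdesk",
    "inspection": "inspector-support",
}
DEFAULT_QUEUE = "general-support"  # assumption: fallback for unmapped domains


@dataclass
class CaseFile:
    original_query: str
    retrieved_evidence: list
    attempted_answer: str
    confidence: float
    conversation_context: list
    retrieval_signals: dict = field(default_factory=dict)


def route(domain):
    """Pick the human queue for a domain label, case-insensitively."""
    return ROUTES.get(domain.lower(), DEFAULT_QUEUE)


def build_handoff(case, domain):
    """Bundle the case file with its target queue as a plain dict."""
    return {"queue": route(domain), "case": asdict(case)}
```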

Running it in production

Every deployment pins three versions — model_version, prompt_version, corpus_version — in Helm values and logs all three on every request trace. Rollback reverts all three atomically. Model and prompt updates ship by canary: 10% of traffic for 30 minutes with automated quality-metric comparison against baseline, then promote or roll back.
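The pinning and the canary gate can be sketched as an immutable release triple plus a comparison function. This is an assumption-laden illustration: the metric names, the 2% tolerated relative drop, and the rule that every metric is higher-is-better are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """The three pinned versions; frozen so they always move together."""
    model_version: str
    prompt_version: str
    corpus_version: str


def canary_verdict(baseline, canary, max_relative_drop=0.02):
    """Promote only if no metric drops by more than max_relative_drop.

    Both arguments map metric name -> value, higher is better.
    """
    for name, base in baseline.items():
        if name not in canary:
            return "rollback"  # a missing metric is treated as a failure
        if base > 0 and (base - canary[name]) / base > max_relative_drop:
            return "rollback"
    return "promote"


def next_release(current, candidate, baseline, canary):
    """Return the release that should serve traffic after the canary window."""
    return candidate if canary_verdict(baseline, canary) == "promote" else current
```

Treating the release as one value means a rollback can never leave a new prompt paired with an old corpus.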

The cluster separates CPU pools (orchestration, retrieval API) from GPU pools (A10/A100, tainted for inference only), with PodDisruptionBudgets on critical services, anti-affinity across zones, and startup probes tuned for model-loading time. Non-production workloads scale to zero outside testing windows, which cuts non-production GPU spend by 60–80%. Observability combines per-request distributed tracing (OpenTelemetry) with ML KPIs (retrieval quality, confidence distribution, escalation rate) and business KPIs (deflection, satisfaction).

What changed over time

The architecture was designed model-agnostic from day one — vLLM serving makes a model upgrade a Helm values change, not an architecture change. That decision was stress-tested repeatedly as the open-source landscape moved:

Embeddings: multilingual-e5-large → BGE-M3 → Qwen3-Embedding-0.6B (instruction-aware, stronger multilingual scores, natural pairing with the Qwen3 reranker)
Reranker: ms-marco-MiniLM (English-only, immediately wrong for a Dutch/English corpus) → BGE-reranker-v2-m3 → Qwen3-Reranker-0.6B
Generators: Mistral-7B / Mixtral / Llama-3 era → Qwen3-4B + Llama-3.3-70B tiers → Llama 4 Maverick MoE on the large tier

Every upgrade went through the offline retrieval benchmark before shipping. The benchmark suite, not the announcement blog post, decided what shipped.

Failure modes designed for

Weak retrieval gets the agentic treatment: rewrite, relax filters, fall back to sparse keyword match, clarify, escalate. Outdated documents are caught by effective-date metadata and a stale-corpus dashboard rather than discovered by users. Irrelevant-but-similar chunks are filtered by the cross-encoder, and every embedding model change runs retrieval regression tests before it can ship. The system's defining property is that its failure path is explicit at every stage — nothing falls through to "the model will probably handle it."

Representative results

The hero figures above are design targets from the architecture work — 15–40 QPS at peak, p50 of 2–4 seconds, sub-300ms retrieval, grounding above 90% on benchmarked queries, 99.9% availability for the support API — stated as targets, not claimed as audited production measurements. The patterns proven here (hybrid retrieval, validation chains, tiered serving, confidence-based escalation) are being reimplemented live and measurable in the Hospitality Ecosystem.
