A busy restaurant floor generates service signals constantly — empty glasses, finished plates, tables ready to turn — and during peak hours, human eyes miss them. This platform watches instead: overhead cameras, a staged vision pipeline, per-table state machines, and a rule engine that tells the right staff member the right thing at the right moment. Not a classification demo — a continuous perception-to-action loop running across dozens of tables and venues at once.
The problem
Campsite hospitality venues run 20–50 tables with seasonal staff who cannot continuously observe every one. Missed signals during peak hours mean slower service, lower guest satisfaction, and reduced table turnover. Three alternatives were rejected: more floor staff (doesn't scale economically in a constrained seasonal labor market), guest-call buzzers (require guest action and never capture proactive moments like refills), and — the architecturally interesting one — a single end-to-end model predicting "table needs service" from raw frames.
Why not one big model
An end-to-end model collapses detection, state estimation, and business logic into one opaque prediction. You can't debug which stage failed, can't retrain one component independently, and operations staff can't tune a threshold without touching a neural network. It also can't distinguish "refill drinks" from "clear plates" — distinct actions with distinct staff workflows.
Architecture
Everything is event-driven: frame events and inference results flow over Redis Streams, decoupling capture, inference, state management, and notification delivery so each scales independently. Redis Streams over Kafka was deliberate — at sub-1000 events/second, an ordered log with consumer groups and replay covers every need without Kafka's operational overhead.
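As a rough illustration of the consumer-group pattern described above, the sketch below publishes a frame event and consumes it with an explicit acknowledgement. It assumes `client` is a redis-py `redis.Redis(decode_responses=True)` instance; the stream name, group name, field names, and helper functions are hypothetical and not taken from the platform.

```python
# Sketch only: `client` is assumed to be redis.Redis(decode_responses=True).
# Stream, group, and field names below are hypothetical.
STREAM = "frames"
GROUP = "inference-workers"


def ensure_group(client):
    """Create the consumer group once; tolerate it already existing."""
    try:
        # id="0" lets a new group replay the stream from the beginning.
        client.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
    except Exception as exc:
        # Redis reports an existing group with a BUSYGROUP error.
        if "BUSYGROUP" not in str(exc):
            raise


def publish_frame_event(client, venue_id, table_id, frame_uri, captured_at):
    """Append one frame event to the ordered log and return its entry ID."""
    fields = {
        "venue_id": venue_id,
        "table_id": table_id,
        "frame_uri": frame_uri,
        "captured_at": captured_at,
    }
    return client.xadd(STREAM, fields)


def consume_batch(client, consumer, handle, count=10, block_ms=1000):
    """Read new entries for this consumer, handle each, then acknowledge it."""
    batches = client.xreadgroup(
        GROUP, consumer, {STREAM: ">"}, count=count, block=block_ms
    )
    handled = 0
    for _stream, messages in batches:
        for message_id, fields in messages:
            handle(fields)
            # Acknowledge only after handling, so a crashed worker's
            # entries stay pending and can be claimed by another consumer.
            client.xack(STREAM, GROUP, message_id)
            handled += 1
    return handled
```

Because capture only appends and each downstream service reads through its own group, adding a second inference worker is a matter of starting another consumer with a different name.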
The perception pipeline
Frames are sampled every 10 seconds rather than streamed continuously, because table-service state changes over minutes, not milliseconds. A 50-table venue processes ~300 frames per minute (50 tables × 6 samples each) instead of the ~45,000 it would at a 15 fps video rate: a 150× compute reduction that lets a single T4 GPU carry a pilot venue where continuous inference would demand multiple A100s. Sampling adds at most 10 seconds before a change first appears in a frame, which is negligible when staff response is measured in minutes.
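The arithmetic behind those figures can be checked in a few lines. This sketch assumes one frame per table per sample and a 15 fps video rate; neither is stated outright, but both are what the ~300 and ~45,000 figures imply.

```python
# Back-of-envelope check of the sampling figures.
# Assumptions (implied by the figures, not stated): one frame per table
# per sample, and a 15 fps video rate for the continuous case.
TABLES = 50
SAMPLE_INTERVAL_S = 10
VIDEO_FPS = 15


def frames_per_minute(tables: int, frames_per_second: float) -> float:
    """Frames a venue produces per minute at a given per-table frame rate."""
    return tables * frames_per_second * 60


sampled = frames_per_minute(TABLES, 1 / SAMPLE_INTERVAL_S)
continuous = frames_per_minute(TABLES, VIDEO_FPS)
reduction = continuous / sampled

print(
    f"sampled: {sampled:.0f}/min, "
    f"continuous: {continuous:.0f}/min, "
    f"reduction: {reduction:.0f}x"
)
```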
Each frame runs a three-stage pipeline: YOLO26 (NMS-free, TensorRT FP16) detects tableware; BoT-SORT maintains object identity across frames; dedicated EfficientNetV2-Small classifiers score each glass and plate crop as full, half, or empty, running on CPU via ONNX Runtime at ~2.5 ms per crop. Each stage has independent metrics, independent failure modes, and an independent upgrade path.
Tracking failures were designed for rather than wished away: classifiers judge each crop on visual content, not track identity, so a BoT-SORT ID swap during a waiter's reach degrades nothing — the smoothing window absorbs it.
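One way to picture the staging is as three injected callables composed per frame. The sketch below is illustrative only: the types, function names, and stub stages are hypothetical stand-ins for the real detector, tracker, and classifiers. It shows why an ID swap is harmless when the classifier looks only at the crop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    kind: str   # "glass" or "plate"
    box: tuple  # (x1, y1, x2, y2) in pixels


@dataclass(frozen=True)
class Track:
    track_id: int
    detection: Detection


@dataclass(frozen=True)
class FillReading:
    kind: str
    label: str  # "full", "half", or "empty"
    confidence: float


def process_frame(frame, detect, track, classify):
    """Run detect -> track -> classify; each stage is swappable on its own."""
    detections = detect(frame)
    tracks = track(detections)
    readings = []
    for t in tracks:
        # The classifier sees only the frame and the box, never the track ID.
        label, confidence = classify(frame, t.detection)
        readings.append(FillReading(t.detection.kind, label, confidence))
    return readings


# Hypothetical stub stages, standing in for the real models.
def fake_detect(frame):
    return [Detection("glass", (0, 0, 10, 10)), Detection("glass", (20, 0, 30, 10))]


def stable_tracker(detections):
    return [Track(i, d) for i, d in enumerate(detections)]


def swapped_tracker(detections):
    # Simulates an ID swap: same boxes, identities exchanged.
    last = len(detections) - 1
    return [Track(last - i, d) for i, d in enumerate(detections)]


def fake_classify(frame, detection):
    return ("empty", 0.9) if detection.box[0] == 0 else ("half", 0.8)
```

With these stubs, `process_frame` returns identical readings whether `stable_tracker` or `swapped_tracker` is plugged in, which is the property the paragraph above relies on.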
State, smoothing, and hysteresis
Single-frame decisions are how vision systems cry wolf. Every table has a finite state machine over a rolling 5-frame window: a transition confirms only when 3 of 5 frames agree, or when a single frame is near-certain with all of its confidences above 0.95. Hysteresis is asymmetric by design: confirming "glasses empty" takes 3 of 5 frames, but reverting it takes 4 of 5 at confidence ≥0.85, so the state can't oscillate at a boundary. When confidence is genuinely low, the system abstains and routes the table to a manual-check queue instead of acting on uncertainty.
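A minimal sketch of that smoothing logic for a single condition ("glasses empty") might look like the following. The 5-frame window and the 3-of-5, 4-of-5, 0.85, and 0.95 thresholds come from the description above. The abstain floor of 0.50, the class names, and the choice to treat a frame's confidence as its lowest per-glass score are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field

WINDOW = 5
CONFIRM_VOTES = 3       # frames needed to enter "empty"
REVERT_VOTES = 4        # frames needed to leave "empty"
REVERT_MIN_CONF = 0.85  # each reverting frame must be at least this confident
FAST_PATH_CONF = 0.95   # a single near-certain frame confirms immediately
ABSTAIN_BELOW = 0.50    # hypothetical floor; not given in the text


@dataclass(frozen=True)
class Observation:
    empty: bool
    confidence: float  # assumed: lowest per-glass confidence in the frame


@dataclass
class GlassesState:
    empty: bool = False
    window: deque = field(default_factory=lambda: deque(maxlen=WINDOW))

    def update(self, obs: Observation) -> str:
        self.window.append(obs)
        usable = [o for o in self.window if o.confidence >= ABSTAIN_BELOW]

        # Abstain when a full window holds too few trustworthy frames.
        if len(self.window) == WINDOW and len(usable) < CONFIRM_VOTES:
            return "MANUAL_CHECK"

        if not self.empty:
            votes = sum(o.empty for o in usable)
            fast_path = obs.empty and obs.confidence > FAST_PATH_CONF
            if votes >= CONFIRM_VOTES or fast_path:
                self.empty = True
        else:
            # Reverting is deliberately harder than confirming.
            votes = sum(
                (not o.empty) and o.confidence >= REVERT_MIN_CONF
                for o in self.window
            )
            if votes >= REVERT_VOTES:
                self.empty = False

        return "EMPTY" if self.empty else "NOT_EMPTY"
```

Feeding three "empty" frames at 0.7 confidence flips the state to `EMPTY`, while it then takes four "not empty" frames at 0.85 or higher to flip it back. That asymmetry is what keeps a borderline table from toggling on every sample.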
Deterministic actions, governed notifications
The decision layer is deliberately not a model. Stable states map to five operational directives — refill drinks, clear plates, prompt reorder, table ready, escalate ambiguous — through version-controlled YAML rules that operations staff can read, audit, and tune:

    - action: REFILL_DRINKS
      when:
        glasses: all_empty
        guests_present: true
      priority: high
      cooldown_minutes: 10
      mutually_exclusive_with: [CLEAR_PLATES]
      escalate_if_unacknowledged_minutes: 3

Notification governance is what separates a useful system from a muted app: per-action cooldowns, deduplication, mutual exclusivity, and priority ranking prevent alert fatigue. Staff acknowledge or dismiss; no response in 3 minutes escalates to the venue manager; subsequent frames confirm the condition actually resolved before the loop closes.
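To make the governance concrete, here is a small sketch of how rules shaped like the YAML above could be evaluated with cooldowns, priority ranking, and mutual exclusivity. It is not the platform's rule engine: the `Notifier` class, the `CLEAR_PLATES` rule body, the `medium` and `low` priority levels, and the state keys are hypothetical, and acknowledgement and escalation are left out for brevity.

```python
from dataclasses import dataclass, field

# Lower rank wins. Only "high" appears in the source; the rest are assumed.
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}

# Rules as plain dicts, mirroring the YAML shape. CLEAR_PLATES is hypothetical.
RULES = [
    {
        "action": "REFILL_DRINKS",
        "when": {"glasses": "all_empty", "guests_present": True},
        "priority": "high",
        "cooldown_minutes": 10,
        "mutually_exclusive_with": ["CLEAR_PLATES"],
    },
    {
        "action": "CLEAR_PLATES",
        "when": {"plates": "all_empty", "guests_present": True},
        "priority": "medium",
        "cooldown_minutes": 10,
        "mutually_exclusive_with": ["REFILL_DRINKS"],
    },
]


@dataclass
class Notifier:
    rules: list
    # (table_id, action) -> minute at which the action was last sent
    last_sent: dict = field(default_factory=dict)

    def decide(self, table_id: str, state: dict, now_min: float) -> list:
        """Return the actions to notify for one table at time `now_min`."""
        matched = [
            r for r in self.rules
            if all(state.get(k) == v for k, v in r["when"].items())
        ]
        # Cooldown: drop actions sent too recently for this table.
        ready = [
            r for r in matched
            if now_min - self.last_sent.get((table_id, r["action"]), float("-inf"))
            >= r["cooldown_minutes"]
        ]
        # Priority ranking, then mutual exclusivity within this round.
        ready.sort(key=lambda r: PRIORITY_RANK[r["priority"]])
        chosen = []
        for r in ready:
            conflicts = any(
                r["action"] in c["mutually_exclusive_with"]
                or c["action"] in r["mutually_exclusive_with"]
                for c in chosen
            )
            if not conflicts:
                chosen.append(r)
        for r in chosen:
            self.last_sent[(table_id, r["action"])] = now_min
        return [r["action"] for r in chosen]
```

In this sketch exclusivity applies only among actions chosen in the same evaluation round; a production engine would likely also consider alerts that are still open.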
The feedback loop
A "Wrong Alert" tap is not a complaint — it's training data. Flagged frames flow to a review queue, corrected labels land in DVC-versioned datasets (venue-stratified splits so no venue leaks between train and test), and quarterly retraining cycles consume them. Resolution confirmations provide implicit positive labels for free. Drift is monitored statistically: weekly Kolmogorov–Smirnov tests on detection-confidence distributions trigger investigation before quality visibly degrades.
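The drift check can be sketched with a two-sample Kolmogorov–Smirnov statistic computed in plain Python. The 5% significance level, the function names, and the large-sample critical-value approximation are assumptions chosen for illustration; the text does not say which threshold or library the weekly job uses.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    # Both ECDFs are step functions, so the maximum gap occurs at a sample value.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def drift_detected(baseline, current, alpha=0.05):
    """Flag drift when the KS statistic exceeds the asymptotic critical value.

    alpha=0.05 is a hypothetical choice, not taken from the source.
    """
    n, m = len(baseline), len(current)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(baseline, current) > critical


# Example: last quarter's detection confidences vs. this week's.
baseline = [0.80 + 0.001 * i for i in range(100)]
this_week = [0.60 + 0.001 * i for i in range(100)]
print(drift_detected(baseline, baseline), drift_detected(baseline, this_week))
```

A flagged week here only means the confidence distribution has shifted; as the paragraph above notes, that triggers investigation rather than automatic retraining.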
Running it in production
Inference runs on AKS GPU node pools — T4 for pilot venues, H100 NVL for production — isolated from CPU control-plane services by taints and tolerations, autoscaled on inference queue depth. Every model artifact is a version-pinned OCI artifact in the container registry; an upgrade is retrain → offline eval → push artifact → canary to one replica → watch the false-trigger rate for an hour → promote. OpenTelemetry traces span the entire loop, frame ingestion to notification delivery, with per-stage latency histograms and a false-trigger dashboard fed by staff feedback.
Representative results
The hero figures are design targets from the architecture work: detection mAP ≥0.88, classifier accuracy ≥0.92 per class, p95 frame-to-notification under 5 seconds, false-trigger rate under 5%, staff acknowledgment above 80% within a 3-minute SLA, 99.5% availability during operating hours — targets, not audited production measurements. The perception-to-action pattern — staged inference, temporal smoothing, deterministic action rules, governed notifications — is being reimplemented live in the Hospitality Ecosystem.