Durable SMS Surface Design for Production AI Systems
How to build an SMS surface that survives retries, late events, and long-running turn orchestration without losing context.
Start with delivery guarantees, not UX polish
SMS feels simple, but production behavior is dominated by retries, delayed callbacks, and webhook race conditions. If the reliability model is weak, every customer-facing flow degrades under load.
A durable surface starts by guaranteeing that inbound payloads are validated, normalized, and written to storage before orchestration begins.
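The validate, normalize, then persist ordering can be sketched as follows. This is a minimal illustration, not a specific provider's API: the payload fields (`message_id`, `from`, `body`), the `inbound_messages` table, and the function names are all assumptions, and SQLite stands in for whatever durable store you use.

```python
import json
import re
import sqlite3

# E.164: a plus sign followed by up to 15 digits, first digit non-zero.
E164 = re.compile(r"^\+[1-9]\d{1,14}$")


def normalize_phone(raw: str) -> str:
    """Strip common separators and require E.164 form."""
    cleaned = re.sub(r"[\s\-().]", "", raw)
    if not E164.match(cleaned):
        raise ValueError(f"not an E.164 number: {raw!r}")
    return cleaned


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the (hypothetical) inbound store; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS inbound_messages (
               provider_message_id TEXT PRIMARY KEY,
               from_number TEXT NOT NULL,
               body TEXT NOT NULL,
               raw_payload TEXT NOT NULL
           )"""
    )
    return conn


def accept_inbound(conn: sqlite3.Connection, payload: dict) -> bool:
    """Validate, normalize, and persist one webhook payload.

    Returns True if the message was newly stored and False if it was a
    duplicate delivery. Orchestration should start only after this returns.
    """
    # Validate: field names here are hypothetical, not a real provider schema.
    for field in ("message_id", "from", "body"):
        value = payload.get(field)
        if not isinstance(value, str) or not value:
            raise ValueError(f"missing or invalid field: {field}")

    # Normalize before anything downstream sees the data.
    from_number = normalize_phone(payload["from"])
    body = payload["body"].strip()

    # Persist: the primary key makes a retried webhook a no-op.
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO inbound_messages VALUES (?, ?, ?, ?)",
            (payload["message_id"], from_number, body,
             json.dumps(payload, sort_keys=True)),
        )
    return cur.rowcount == 1
```

Keying the row on the provider's message ID means a retried webhook is acknowledged without starting a second turn.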
Map identity and context deterministically
Incoming phone numbers should resolve to stable identity records, then map to experience context with explicit rules. Avoid implicit defaults that hide routing mistakes.
When mapping is explicit, conversation continuity and downstream policy evaluation become predictable across retries and replays.
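One way to make the mapping explicit is to fail loudly when no rule matches instead of falling back to a default. The sketch below assumes in-memory lookups and hypothetical names (`Identity`, `ContextRule`, `RoutingError`, `route`); a real system would back these with its own identity store.

```python
from dataclasses import dataclass


class RoutingError(LookupError):
    """Raised when no explicit rule matches; there is deliberately no default."""


@dataclass(frozen=True)
class Identity:
    identity_id: str
    phone: str


@dataclass(frozen=True)
class ContextRule:
    to_number: str   # the number the user texted
    experience: str  # hypothetical experience name, e.g. "support"


def resolve_identity(identities: dict[str, Identity], phone: str) -> Identity:
    """Look up a stable identity record by normalized phone number."""
    try:
        return identities[phone]
    except KeyError:
        raise RoutingError(f"no identity for {phone}") from None


def resolve_experience(rules: list[ContextRule], to_number: str) -> str:
    """Require exactly one matching rule; zero or several is a routing bug."""
    matches = [r for r in rules if r.to_number == to_number]
    if len(matches) != 1:
        raise RoutingError(f"{len(matches)} rules match {to_number}")
    return matches[0].experience


def route(identities: dict[str, Identity], rules: list[ContextRule],
          from_number: str, to_number: str) -> str:
    """Return a conversation key that is identical on every retry or replay."""
    identity = resolve_identity(identities, from_number)
    experience = resolve_experience(rules, to_number)
    return f"{identity.identity_id}:{experience}"
```

Because the key is a pure function of the identity record and the matched rule, replaying the same inbound message resolves to the same conversation.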
Keep async work observable and replay-safe
Back async response handling with a durable outbox: write each outbound message as an event carrying an idempotency key, and let workers process those events. A retried turn produces the same key, so retries are safe and do not cause duplicate user-facing sends.
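A minimal outbox sketch, assuming SQLite as the durable store and an idempotency key derived from a turn ID plus a sequence number. The table layout, the key format, and the `send` callback signature are illustrative assumptions, not a provider API.

```python
import sqlite3
from typing import Callable


def open_outbox(path: str = ":memory:") -> sqlite3.Connection:
    """Open the (hypothetical) outbox table; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS outbox (
               idempotency_key TEXT PRIMARY KEY,
               to_number TEXT NOT NULL,
               body TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending'
           )"""
    )
    return conn


def enqueue(conn: sqlite3.Connection, turn_id: str, seq: int,
            to_number: str, body: str) -> str:
    """Queue an outbound message; re-running the same turn is a no-op."""
    key = f"{turn_id}:{seq}"  # assumed key format: deterministic per turn
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO outbox (idempotency_key, to_number, body) "
            "VALUES (?, ?, ?)",
            (key, to_number, body),
        )
    return key


def drain(conn: sqlite3.Connection,
          send: Callable[[str, str, str], None]) -> int:
    """Send every pending message once and mark it sent.

    If send() raises, the row stays pending and a later drain retries it.
    """
    rows = conn.execute(
        "SELECT idempotency_key, to_number, body FROM outbox "
        "WHERE status = 'pending' ORDER BY rowid"
    ).fetchall()
    sent = 0
    for key, to_number, body in rows:
        # The key travels with the send so the receiving side can dedupe.
        send(to_number, body, key)
        with conn:
            conn.execute(
                "UPDATE outbox SET status = 'sent' "
                "WHERE idempotency_key = ? AND status = 'pending'",
                (key,),
            )
        sent += 1
    return sent
```

The sketch does not close every gap: a crash between `send` and the status update would cause a resend on the next drain, which is why the key is passed along so the provider or a downstream layer can reject the duplicate if it supports that.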
Operationally, you need visibility into each step of a message's lifecycle: webhook accepted, turn processed, outbound message queued, and provider delivery status received.
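The four steps can be recorded as structured events keyed by message ID, which makes it straightforward to ask which step a given message never reached. The step names below mirror the list above; the event shape and the helper names are assumptions, and a plain list stands in for a real log sink.

```python
import json
import time
from enum import Enum


class Step(str, Enum):
    WEBHOOK_ACCEPTED = "webhook_accepted"
    TURN_PROCESSED = "turn_processed"
    OUTBOUND_QUEUED = "outbound_queued"
    DELIVERY_STATUS_RECEIVED = "delivery_status_received"


def record_step(log: list, message_id: str, step: Step, **fields) -> str:
    """Append one structured event and return it as a JSON line."""
    event = {
        "message_id": message_id,
        "step": step.value,
        "ts": time.time(),
        **fields,
    }
    log.append(event)
    return json.dumps(event, sort_keys=True)


def missing_steps(log: list, message_id: str) -> list:
    """List lifecycle steps not yet observed for a message, in order."""
    seen = {e["step"] for e in log if e["message_id"] == message_id}
    return [s.value for s in Step if s.value not in seen]
```

For example, a message that was accepted and processed but never queued would report `outbound_queued` and `delivery_status_received` as missing, which points directly at the stalled stage.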