Voice Stack
Definition
harmony-core’s proprietary voice pipeline. 70ms response latency vs competitor (e.g. elevenlabs-based) ~800ms. Local models for each conversation stage (closing, negotiation, objection-handling) rather than one large monolithic model — enables targeted optimization per stage.
Acquired team built this over ~3 years. This is the engineering asset Saar/Vitali point to when arguing the 14-engineer team can credibly compete against $100M+ funded competitors in 5 months (OQ-012) — the gap is productization, not engineering.
Related decisions
- DEC-005 — Self-learning is the differentiator; voice stack enables targeted improvement per stage
- DEC-006 — Voice stack is what enables the July 1 production target
Sources
- offsite-2026-04-14 — capability discussed
- vision-2026-04-15 — used to defend the timeline against Nadav’s skepticism
Related entities
- vitali — architect
- elevenlabs — the comparison baseline (800ms, customer-prompted, vulnerable to OSS disruption)
- oneai, retell, bland, vapi — competing voice-agent stacks (often ElevenLabs wrappers)
Related concepts
- self-learning-voice-agent — the productized expression
- self-learning-loop — voice stack’s per-stage models make targeted learning possible
- voice-as-feature — voice stack is the feature; brain is the core
Cost economics
- Per-call costs in single-digit cents (potentially eliminating Twilio dependency)
- Lower cost than 11Labs-based competitors