What’s one thing AI is raising even faster than workplace productivity? Customer expectations. Today, 90% of customers expect an instant response when they reach out with service requests, while 82% say they’re more loyal to brands that offer authentic personalization. It’s not hard to connect the dots: AI has made it easier than ever to create authentic, bespoke interactions at scale, cementing real-time “hyper-personalization” as the new status quo for customer engagement.
But like most tech revolutions, this CX renaissance isn’t evenly distributed… yet. While text-based AI customer service agents are becoming ubiquitous, most voice channels still rely on legacy IVR chatbots and other outdated systems. With an estimated 80% of inbound customer interactions still coming in through voice, that leaves a glaring gap for most companies.
So, what gives? If voice is such a valuable channel, why aren’t brands rushing to deploy AI there? The simple answer is that building AI voice agents is hard. From latency, to the inherent messiness of verbal conversation, to the mechanics of agent-to-human handoffs, delivering a great voice experience can be more challenging than it seems. To set your organization up for success, it’s essential to start with the right architecture.
When 600 milliseconds is an eternity
Imagine talking to a friend who pauses for a second or two between every response. No words, no body language, just awkward silence. It wouldn’t take you very long to suspect something might be wrong with your friend. In fact, humans are so attuned to the natural rhythm of spoken language that most of us insert filler words like “umm” and “like” to make conversations feel more fluid.
Of course, AI agents aren’t human, and accurately simulating the way we speak in real life presents numerous technical challenges. For one, LLMs are too large and slow to classify user intent and return a response without noticeable lag. Many ASR (automatic speech recognition) models also rely on pauses to decide whether a user has finished speaking, rather than semantically understanding when a statement is over. That alone can add 500-600 milliseconds of latency per turn, which may not sound like much, but is more than enough to frustrate users.
How it works in Agentforce: To power Agentforce Voice, we fine-tuned a specialized small language model (SLM) designed to classify topics as quickly as possible, vastly reducing response time. And thanks to new parallelization, Agentforce doesn’t have to wait for topic classification to finish before kicking off knowledge retrieval: that context is already being pulled while the topic is identified. Direct integration with Data Cloud for RAG further expedites knowledge lookups by returning raw, contextually relevant chunks rather than summaries, cutting latency from several seconds to about half a second.
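To make the parallelization idea concrete, here’s a minimal sketch of the pattern. The classifyTopic, retrieveChunks, and generateResponse helpers are hypothetical stand-ins for illustration, not actual Agentforce APIs:

```ts
// Hypothetical stand-ins for illustration; not real Agentforce APIs.
declare function classifyTopic(text: string): Promise<string>;    // fine-tuned SLM call
declare function retrieveChunks(text: string): Promise<string[]>; // RAG lookup, raw chunks
declare function generateResponse(topic: string, chunks: string[]): Promise<string>;

async function handleUtterance(transcript: string): Promise<string> {
  // Kick off classification and retrieval at the same time; neither blocks the other.
  const topicPromise = classifyTopic(transcript);
  const chunksPromise = retrieveChunks(transcript);

  // By the time the topic is known, relevant chunks are already in hand.
  const [topic, chunks] = await Promise.all([topicPromise, chunksPromise]);
  return generateResponse(topic, chunks);
}
```

The total wait becomes the slower of the two calls rather than their sum, which is where the latency savings come from.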
TTS caching is another powerful optimization we’re rolling out to avoid having agents regenerate the same string of text. Generated speech is now cached, so when an agent needs to repeat a string it has already spoken, it can simply reuse the previous audio, cutting down the time needed to produce a response. Semantic endpointing is another SLM-powered feature we’re adding to detect when a user is finished speaking, eliminating the need to wait for a fixed-duration pause. And remember those “umms” and “likes”? For situations where a bit of latency is unavoidable, we’re introducing filler noise and other audio cues to let users know the agent is still working on their task.
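Conceptually, a TTS cache is just a lookup keyed on the exact text (and voice) being spoken. A rough sketch, assuming a hypothetical synthesize function:

```ts
// Rough sketch of TTS caching; synthesize() is a hypothetical stand-in.
declare function synthesize(text: string, voiceId: string): Promise<ArrayBuffer>;

const ttsCache = new Map<string, ArrayBuffer>();

async function speak(text: string, voiceId: string): Promise<ArrayBuffer> {
  const key = `${voiceId}:${text}`;
  const cached = ttsCache.get(key);
  if (cached !== undefined) {
    return cached; // repeated phrases skip synthesis entirely
  }
  const audio = await synthesize(text, voiceId);
  ttsCache.set(key, audio);
  return audio;
}
```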
Pointless point solutions
In case all those latency optimizations didn’t hammer the point home: integration is a critical part of standing up a successful AI voice agent. But latency is just the beginning. Without tight integration with your company’s existing tech stack, your voice agent won’t be able to stitch together a complete transcript when calls are forwarded from another number, such as an existing IVR system. That prevents the agent from getting the context it needs to quickly and accurately identify the right topics and actions.
That same lack of context makes it nearly impossible to escalate calls to a human agent without forcing the customer to repeat everything they just told your AI agent. And without a full transcript, you also lose the ability to perform any in-depth analysis, such as sentiment detection. Even something as fundamental as two-factor authentication is difficult to bolt onto a siloed AI voice agent.
How it works in Agentforce: Agentforce Voice will offer first-class integration with Salesforce Voice, allowing your agent to seamlessly connect to partner telephony via PSTN or, in the future, SIP, as well as to leading CCaaS platforms. Customers can configure existing IVR flows to forward calls to a specific Agentforce phone number, ensuring that Service Cloud captures end-to-end conversational context. That isn’t just invaluable analytics data for driving evaluations, session tracing, and debugging; it’s also what gives human agents the visibility they need to seamlessly take over an escalation. When a customer calls in, Agentforce immediately starts a live transcript that can be monitored in real time directly through Service Console. Human agents can jump in at any point, or automated escalation triggers can be configured for situations like refund requests.
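As a rough illustration of what an automated escalation trigger might look like, here’s a hedged sketch. The event shape, patterns, and escalateToHuman call are invented for this example and aren’t the actual Service Cloud API:

```ts
// Hedged sketch of an escalation trigger watching a live transcript.
// All names here are hypothetical, not the real Agentforce/Service Cloud API.
interface TranscriptEvent {
  speaker: "customer" | "agent";
  text: string;
}

const ESCALATION_PATTERNS = [/refund/i, /speak to a (human|person|representative)/i];

declare function escalateToHuman(reason: string): Promise<void>;

function onTranscriptEvent(event: TranscriptEvent): void {
  if (event.speaker !== "customer") return;
  for (const pattern of ESCALATION_PATTERNS) {
    if (pattern.test(event.text)) {
      // Hand off with the full transcript already attached to the session.
      void escalateToHuman(`matched ${pattern}`);
      return;
    }
  }
}
```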
Agentforce Voice connects seamlessly to partner telephony thanks to first-class integration with Salesforce Voice.
What makes this all possible is a new WebSocket protocol that allows your voice agent to be listening at all times, replacing the turn-based HTTP requests used by text-based agents. WebSockets establish a persistent connection to Agentforce’s Atlas Reasoning Engine, allowing messages to flow in as they arrive and out as the agent generates them. That persistence not only ensures a rich data pipeline, but also powers numerous quality-of-life optimizations that we’ll cover in the next section.
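For a sense of what that looks like in practice, here’s a minimal sketch of the persistent-connection pattern using the standard WebSocket API. The endpoint URL and message shapes are placeholders, not the actual Agentforce protocol:

```ts
// Minimal sketch of a persistent voice connection; URL and message
// shapes are placeholders, not the real Agentforce wire protocol.
const socket = new WebSocket("wss://example.invalid/agentforce/voice");

socket.addEventListener("open", () => {
  // Announce the session; from here on, audio streams continuously
  // (a real client would pipe microphone PCM chunks through socket.send).
  socket.send(JSON.stringify({ type: "session_start" }));
});

socket.addEventListener("message", (event: MessageEvent) => {
  // Agent output (audio chunks, transcript updates, control events)
  // arrives as it is generated, rather than as one final response.
  const msg = JSON.parse(event.data as string);
  if (msg.type === "audio_chunk") {
    // enqueue audio for playback
  } else if (msg.type === "transcript_delta") {
    // update the live transcript in the console
  }
});
```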
Best of all? Agentforce Voice can reuse all of your existing text-based agent configurations, including action and context variables, eliminating the need to build from scratch. You can enable voice for any existing agent and make tweaks in the same builder experience you’re already accustomed to. Once voice mode is enabled, you can customize settings like the agent’s gender, tone, and accent, as well as advanced controls that define how fast your agent speaks, their emotional range, and how closely they stick to the underlying voice model and the original audio used to train that specific voice option.
Customize everything from the agent’s tone of voice to how wide their emotional range is.
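Purely as an illustration, a voice configuration along the lines described above might be represented like this. The field names are invented for the sketch and aren’t the actual Builder schema:

```ts
// Hypothetical voice-settings object mirroring the controls described
// above; field names are invented and not the real Agentforce schema.
interface VoiceSettings {
  voiceId: string;          // which trained voice option to use
  tone: "friendly" | "professional" | "neutral";
  accent: string;           // e.g. "en-US", "en-GB"
  speakingRate: number;     // 1.0 = default speed
  emotionalRange: number;   // 0 (flat) to 1 (expressive)
  voiceConsistency: number; // how closely output sticks to the training audio
}

const settings: VoiceSettings = {
  voiceId: "voice-01",
  tone: "friendly",
  accent: "en-US",
  speakingRate: 1.1,
  emotionalRange: 0.6,
  voiceConsistency: 0.9,
};
```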
Talking over ourselves
More often than not, the way we speak in everyday life is messier than the way we write. Our most misspelled, grammatically broken, emoji-laden texts are still easier for machines to parse than people talking over each other or veering off in new directions mid-thought. We tend not only to interrupt the person (or AI agent) we’re talking to, but to frequently interrupt ourselves, whether by asking too many questions in a row or simply blanking for a moment and biding time with a few “umms.”
While humans navigate these linguistic quirks intuitively, an AI agent takes nothing for granted. Every time a customer says something like “yeah, uh-huh,” the agent has to figure out, almost instantly, whether it’s a genuine interruption that needs to be addressed or merely the customer acknowledging it. And whenever a customer asks multiple questions in a row without waiting for a response, the agent not only has to decide what order to answer them in, but also store the context of the earlier queries so it can respond later.
How it works in Agentforce: Remember those handy WebSockets? It turns out they’re useful for much more than just listening for potential case escalations. If a customer starts asking a new question while the agent is in the middle of a response, Agentforce can dynamically shift gears and answer the latest question while continuing to work on the previous one in the background, turning multiple queries into subtasks and synthesizing a comprehensive final response. An HTTP-based text agent, by contrast, simply grays out the input field while it generates an answer. WebSockets let us sidestep that limitation.
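In spirit, the pattern looks something like the following sketch, where answer and synthesizeFinalResponse are hypothetical helpers: each incoming question becomes a background subtask, and nothing blocks the newest one.

```ts
// Sketch of the "new question mid-response" pattern; answer() and
// synthesizeFinalResponse() are hypothetical helpers, not real APIs.
declare function answer(question: string): Promise<string>;
declare function synthesizeFinalResponse(parts: string[]): Promise<string>;

const pending: Promise<string>[] = [];

function onNewQuestion(question: string): void {
  // Start answering immediately; earlier questions keep running in the background.
  pending.push(answer(question));
}

async function finalizeTurn(): Promise<string> {
  // Gather every subtask's result and merge them into one coherent reply.
  const parts = await Promise.all(pending);
  pending.length = 0;
  return synthesizeFinalResponse(parts);
}
```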
But how does Agentforce tell the difference between an interruption and an acknowledgement or filler? Since agents can’t rely on intuition, we implemented a specialized LLM-as-a-judge, along with a “short-circuit” that forces the agent to stop speaking immediately. If the LLM determines there’s a genuine interruption, the short-circuit kicks in to keep the agent from talking over the user. If it’s just filler words, the agent carries on with its response.
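Here’s a hedged sketch of that judge-plus-short-circuit flow, with classifyUtterance standing in for the LLM-as-a-judge call and stopPlayback for the short-circuit (both names are invented):

```ts
// Hypothetical names; classifyUtterance() stands in for the LLM-as-a-judge
// and stopPlayback() for the short-circuit that halts TTS output.
declare function classifyUtterance(text: string): Promise<"interruption" | "backchannel">;
declare function stopPlayback(): void;

async function onUserSpeech(text: string): Promise<void> {
  const verdict = await classifyUtterance(text);
  if (verdict === "interruption") {
    stopPlayback(); // stop talking over the user immediately
    // ...then route the new utterance through the normal response flow
  }
  // "backchannel" ("yeah, uh-huh") is ignored; the agent keeps speaking
}
```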
Last but not least, we’re rolling out new tools for entity confirmation and pronunciation. When a user provides their name, email address, or credit card number, Agentforce identifies these as special fields and reads them back to the customer to confirm the spelling. And a new pronunciation dictionary lets organizations manually define how agents pronounce specific words.
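One simple way to picture a pronunciation dictionary is as a lookup that rewrites known terms into phonetic spellings before text-to-speech runs. The map and applyPronunciations function below are inventions for this example, not the actual feature’s API:

```ts
// Illustrative pronunciation dictionary: rewrite known terms into
// phonetic spellings before synthesis. Entries are made up for the sketch.
const pronunciations = new Map<string, string>([
  ["Agentforce", "AY-jent-force"],
  ["CCaaS", "see-kass"],
]);

function applyPronunciations(text: string): string {
  let result = text;
  for (const [term, phonetic] of pronunciations) {
    // Replace whole-word occurrences with the preferred pronunciation hint.
    result = result.replace(new RegExp(`\\b${term}\\b`, "g"), phonetic);
  }
  return result;
}
```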
Together, these new tools and features will help power a new era of AI-first CX, enabling companies to deliver high-quality, hyper-personalized voice interactions at scale. We’re excited to roll out these and many other innovations at Dreamforce 2025.

