We're excited to announce a Salesforce AI Research and Berkeley collaboration: BFCL Audio, a brand-new benchmark that extends BFCL to the audio domain!
A little Berkeley lore: back in 2022, we couldn't find open-source models that handled zero-shot function calling reliably, so we trained our own. We released Gorilla OpenFunctions v1 (and later v2), then ran into the obvious next question: how do we measure whether models are actually good at function calling? That question became BFCL.
Since then, function-calling evaluation has turned out to be far richer, and far more full of open research questions, than anyone expected.
🧠 BFCL v1 introduced AST-based (Abstract Syntax Tree) evaluation, still the gold standard for zero-shot calls.
🤝 BFCL v2 became a community effort, with thousands of APIs and prompts contributed by hobbyists and enterprises.
🔁 BFCL v3 expanded into multi-turn, multi-step evaluation with state-based tracking.
🧭 BFCL v4 focuses on tool calling in real-world, agentic settings (web search, memory, format sensitivity).
🚀 Today, BFCL is a foundational benchmark used across major labs, with hundreds of contributors (and thousands more sharing APIs and data). We're deeply grateful for the community's trust; it has shaped BFCL's evolution.
As models begin to make their way into multimodal enterprise use cases, it's time to build a new benchmark that can provide insight into enterprise voice.
💻 Run the benchmark: https://github.com/ShishirPatil/gorilla
Real products don't live in pure text. Voice shows up wherever hands or eyes are busy: phone support, in-car assistants, smart homes, wearables, voice note apps, and accessibility workflows. In these settings, the agent must balance natural, low-latency conversation with reliable, precise action execution, which means evaluation must consider both.
From the enterprise perspective, businesses seek to automate their customer support and call center operations. This often involves handling a high volume of diverse customer inquiries, scheduling appointments, resolving issues, and providing information. The need for precise function calling is paramount in these scenarios, as errors can lead to frustrated customers, inefficient operations, and lost revenue. For example, a misheard account number or an incorrect appointment time can severely impact customer satisfaction and operational efficiency. Furthermore, the ability to integrate with existing CRM and backend systems is crucial for seamless automation, making robust and reliable audio-native function calling a significant advantage.
Architectural Paths for Voice Agents
There are two common architectures:
- End-to-End (E2E) speech ↔ speech
A single model consumes audio natively and can produce audio directly.
Strengths:
- Natural prosody and low latency (no cascaded hops).
- Unified reasoning over acoustics + semantics (can sometimes recover what ASR would miss).
Trade-offs:
- Tool-call precision can lag without extra structure.
- Fewer knobs for domain adaptation (custom lexicons, per-domain biasing).
- Very limited model availability (e.g., GPT-4o, Gemini 2.5).
- Cascaded (ASR → LLM → TTS)
Audio is transcribed to text (ASR), processed by a text LLM, then spoken via TTS.
Strengths:
- Reuses mature text LLM stacks and evaluation tooling.
- Components can be swapped independently (ASR/TTS/LLM).
- Easy to add guardrails (regex/constrained decoding/AST checks) on the text side.
Trade-offs:
- ASR errors become a bottleneck, often the critical failure point.
- Latency can add up across hops if not streaming.
- The LLM never "hears" the waveform, so it can't use acoustic cues to recover intent.
ASR is excellent, but not perfect. While a small percentage of errors may be acceptable for plain transcription, they can be catastrophic for function calling, where precision is paramount. The link between a minor ASR error and total task failure is direct. For instance, a user might be interacting with a financial application and state their Employer Identification Number. The ASR system might transcribe most of the number correctly but miss or substitute a single digit. Even if the overall transcription looks largely correct to a human observer, the resulting function call will pass an invalid EIN to the backend, causing the API call to fail. The rigidity of the API endpoint means there is no room for "close enough".
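This is exactly where the cascaded design's text-side guardrails help: malformed values can be caught before the API is ever hit. Below is a minimal sketch of such a check; `validate_ein` and the tool-call dict shape are hypothetical, and note that a format check alone cannot catch a misheard digit that still fits the format.

```python
import re

def validate_ein(ein: str) -> bool:
    """Check the surface format of a US EIN (NN-NNNNNNN).

    A format check catches truncated or garbled values, but a single
    misheard digit can still pass, so confirming high-stakes fields
    with the user remains the safer path.
    """
    return re.fullmatch(r"\d{2}-\d{7}", ein) is not None

def guard_tool_call(call: dict) -> dict:
    """Route a proposed tool call: execute it, or ask to clarify."""
    ein = call.get("arguments", {}).get("ein", "")
    if not validate_ein(ein):
        return {
            "action": "clarify",
            "message": "Could you confirm the EIN, two digits, "
                       "then seven digits?",
        }
    return {"action": "execute", "call": call}
```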
Compared with typed input, audio introduces systematic shifts:
- Conversational fillers: "uh", "hmm", "you know".
- Acoustic artifacts and issues absent from text corpora.
- Accents, background noise, and cross-talk degrade recognition.
- Homophones & named entities get misheard:
  - John vs Jon
  - final_report.pdf vs final report.pdf vs `finalReport.pdf`
- Even strong ASR systems still propagate non-trivial word error rates, and crucially, the text LLM never sees the raw audio to recover intent.
1) Natural Paraphrasing
We take existing BFCL queries (single-turn non-live and multi-turn) and rewrite them into conversational-style speech.
Original (text BFCL):
"I need to send a letter to Liam Neeson. Find his contact information for me."
Paraphrased (audio BFCL):
"Um, can you get Liam Neeson, that's L-I-A-M N-E-E-S-O-N, Liam Neeson's contact info, oh, so I can send him a letter?"
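This style of rewrite can be automated by prompting an LLM. A minimal sketch, assuming the OpenAI Python SDK and a hypothetical rewrite instruction (illustrative, not the exact pipeline used here):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical instruction; the actual paraphrasing prompt may differ.
REWRITE_PROMPT = (
    "Rewrite the request as casual spoken dialogue. Add natural "
    "fillers (um, oh), spell out proper nouns letter by letter, "
    "and keep every fact and entity unchanged."
)

def paraphrase(query: str, model: str = "gpt-4o") -> str:
    """Turn a written BFCL query into a speech-style utterance."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REWRITE_PROMPT},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content

print(paraphrase("I need to send a letter to Liam Neeson. "
                 "Find his contact information for me."))
```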
2) Synthetic Audio Generation
We then synthesize audio from the paraphrases using a variety of TTS engines (Qwen, OpenAI, Gemini, ElevenLabs, Cartesia). Each engine has its own style and prosody; we sample among them to diversify inputs (a minimal sketch of this step follows the list below).
Example of ElevenLabs TTS
- For E2E models, the audio snippet is the input.
- For cascaded models, we provide transcripts (below).
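Here is what the sampling step can look like, using the OpenAI TTS endpoint as one of the engines; the voice list and file handling are assumptions, and the other engines would be driven through their own SDKs:

```python
import random
from pathlib import Path

from openai import OpenAI

client = OpenAI()

# A few stock OpenAI voices; sampling voices (and engines) diversifies
# prosody across the generated audio set.
VOICES = ["alloy", "echo", "nova"]

def synthesize(text: str, out_path: str) -> None:
    """Render one paraphrase to audio with a randomly sampled voice."""
    resp = client.audio.speech.create(
        model="tts-1",
        voice=random.choice(VOICES),
        input=text,
    )
    Path(out_path).write_bytes(resp.content)

synthesize("Um, can you get Liam Neeson's contact info?", "sample.mp3")
```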
3) Three-Tier ASR Transcription (for Pipelined Setups)
Because pipelined systems can't access the waveform, we pre-transcribe every audio sample using three ASR systems (OpenAI, ElevenLabs, Deepgram) and evaluate models separately on each transcript to expose sensitivity to ASR choices (a sketch of this step follows the examples below).
OpenAI: "Um, can you get Liam Neeson - that's L-I-A-M N-E-E-S-O-N - Liam Neeson's contact info? Oh, so I can send him a letter?"
ElevenLabs: "Um, can you get Liam Neeson, that is L-I-A-M N-E-E-S-O-N, Liam Neeson's contact info? Oh, so I can send him a letter?"
Deepgram: "Can you get Liam Neeeson? That is l I a m n e e s o n, Liam Neeeson's contact info. Oh, so I can send him a letter?"
(Also, notice the extra "e" in "Neeeson" in the Deepgram output.)
> Note: Only user messages undergo these transformations. Any system messages remain in their original text form.
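A minimal sketch of the pre-transcription step for one tier, assuming OpenAI's Whisper endpoint (the specific ASR model name is an assumption; ElevenLabs and Deepgram would be called through their own SDKs analogously):

```python
from openai import OpenAI

client = OpenAI()

def transcribe_openai(audio_path: str) -> str:
    """Produce the OpenAI-tier transcript for one audio sample."""
    with open(audio_path, "rb") as f:
        resp = client.audio.transcriptions.create(model="whisper-1", file=f)
    return resp.text

# Each audio sample ends up with three transcripts; a pipelined model
# is then evaluated once per transcript to expose ASR sensitivity.
transcripts = {"openai": transcribe_openai("sample.mp3")}
```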
Evaluation Protocol & Metric Modifications
To inform models that they are in an audio setting, we prepend a short system prompt to each conversation:
You are a voice assistant that interacts with the user solely through spoken dialogue. You receive user utterances as text transcribed by an upstream ASR system, and your replies are delivered to the user by a TTS system. Follow the rules below at all times:
1. Language
* Mirror the user's language. Reply in the same language detected in the transcription.
2. Robustness to ASR Errors (Important)
* Although the upstream ASR system is designed to be robust, it may still make errors.
* Do not trust the transcription text blindly, especially for important information. You should assume the transcript may contain recognition errors.
* If the text appears garbled, double-check with the user instead of guessing.
3. Clarity for TTS
* When responding to the user, you should **spell out acronyms** as separate letters with spaces ("A I M L"), and **chunk long numbers** into 2- or 3-digit groups, separated by short pauses ("one-two-three, four-five-six").
* Prefer a spoken-language style: short sentences, everyday vocabulary, and natural contractions.
Turn Semantics (Why Audio Is Different)
In text BFCL, each turn continues as long as the model keeps emitting valid, non-empty tool calls (decoded by decode_exec). The turn ends the moment the model emits any non-tool message.
That's not ideal in a voice setting. Because of homophones and ASR issues, a good audio agent should proactively clarify spellings or key values before acting. Penalizing that behavior would encourage reckless tool calls.
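To make the contrast concrete, here is a schematic of the text-BFCL turn loop described above; only decode_exec comes from the BFCL harness, while model.generate and execute are hypothetical stand-ins:

```python
def run_turn(model, messages: list) -> str:
    """Text-BFCL turn semantics (schematic).

    The turn continues while the model keeps emitting valid tool calls
    and ends on the first non-tool message. Under these semantics a
    clarifying question would end the turn immediately, which is why
    BFCL Audio adds the judge-mediated clarification step described next.
    """
    while True:
        reply = model.generate(messages)
        tool_calls = decode_exec(reply)  # parse tool calls from raw text
        if not tool_calls:               # non-tool message: turn is over
            return reply
        for call in tool_calls:
            messages.append(execute(call))  # feed tool results back in
```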
Clarification Mechanism
We add an LLM judge plus a simulated user to support spelling/disambiguation clarifications without rewarding chitchat.
- If the model asks for a spelling-related clarification (as judged by the LLM), we generate a concise user reply using a whitelist of allowed clarifications for that query (e.g., person names, file names, IDs).
- Only spelling/format confirmations count.
- Effect: every task can become multi-step, but only allowed clarifications are honored.
Example Interaction (Idealized):
Per-message whitelist example:
Message: "Um, can you get Liam Neeson, that is L-I-A-M N-E-E-S-O-N, Liam Neeson's contact info? Oh, so I can send him a letter?"
Allowed clarifications: {
  "person_name": "Liam Neeson"
}
Judge prompt:
- The judge sees the intended request, the ASR text, the assistant's message, and the allowed clarification keys.
- It approves only if the assistant is explicitly confirming spellings/values that appear in the whitelist. Otherwise, it rejects.
You are a judge for an audio-chat scenario where a user speaks and an ASR system transcribes their speech for the assistant. The assistant only sees text (the ASR transcript), which is likely to contain transcription errors.
You are given:
– intended_request: the user's original, ground-truth intent.
– asr_text: the ASR-transcribed text the assistant saw.
– allowed_clarifications: a set of fields with canonical spellings/values the user can confirm (e.g., names, IDs, emails, dates, numbers).
– assistant_message: the assistant's latest message.
Your job: decide whether assistant_message is a clarifying question specifically about spelling/verification of intent or of exact strings/values that could plausibly be misheard (e.g., names, organizations, emails, serials/IDs, numbers, dates, addresses, SKUs). Do not allow general follow-ups (preferences, steps to proceed, etc.).
Decision rules:
1. Classify the message as a spelling confirmation only if it explicitly asks to verify the exact spelling/format/value of one or more items (e.g., "Is it Mikaela or Michaela?", "Can you spell the email?", "Is the order number A1B-52?").
2. The request must be reasonable given the ASR risk (i.e., the item is a proper noun, key value, or easily misheard token relevant to the task).
3. To approve (allowed=true), all the topics the assistant asks to confirm must be present in allowed_clarifications. If any requested item is absent or ambiguous, set allowed=false.
4. Output only a JSON object with two fields:
– allowed: boolean
– message: string (a concise simulated user reply only when allowed=true; otherwise empty "").
5. When allowed=true, compose message by supplying only the requested values with correct spelling/format from allowed_clarifications. Keep it brief (one short sentence or a compact list). Do not include extra commentary, JSON, or fields the assistant did not request.
6. If the assistant's message is not a confirmation request, touches topics outside spelling/format/intent verification, or requests values not available in allowed_clarifications, return allowed=false with message="".
Edge cases:
– If the assistant mixes a spelling confirmation with unrelated questions, treat it as not allowed unless the spelling part stands alone and you can fully answer it from allowed_clarifications.
– Treat homophones and near-matches as spelling checks (e.g., "Brian/Bryan", "Steven/Stephen", letters vs. digits).
– Normalize case/diacritics but preserve canonical spelling in the final reply.
– Never reveal intended_request verbatim; only return the exact confirmed values.
The user's original intended request is: {the original text BFCL question}
The ASR-transcribed output is: {the transcribed text from the audio, which is also the input to the model}
assistant_message: {the model's response}
allowed_clarifications (topic -> answer): {the allowed_clarifications}
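Mechanically, the judge is a single LLM call whose output is parsed as JSON. A minimal sketch, assuming the OpenAI SDK, gpt-4o as the judge model, and a template with named placeholders (all assumptions):

```python
import json

from openai import OpenAI

client = OpenAI()

def judge_clarification(judge_template: str, intended_request: str,
                        asr_text: str, assistant_message: str,
                        allowed_clarifications: dict) -> dict:
    """Return {"allowed": bool, "message": str} per the prompt contract."""
    prompt = judge_template.format(
        intended_request=intended_request,
        asr_text=asr_text,
        assistant_message=assistant_message,
        allowed_clarifications=json.dumps(allowed_clarifications),
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # the judge model choice is an assumption
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force parseable JSON
    )
    return json.loads(resp.choices[0].message.content)
```

When allowed is true, message is injected as the simulated user's reply and the conversation continues; otherwise the turn ends as it would in text BFCL.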
Evaluation metric:
We keep the same metrics as text BFCL (AST for single-turn; state-based + response-based checks for multi-turn) and ignore the clarification turns when computing the final function-calling score. In other words, clarifications can enable a correct action, but they don't directly inflate the score.
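Concretely, scoring can filter the clarification exchanges out before the unchanged checkers run; a minimal sketch, under the assumption that each clarification turn is tagged when the judge approves it:

```python
def score_conversation(turns: list, checker) -> float:
    """Apply the unchanged text-BFCL checker (AST or state-based)
    after dropping judge-approved clarification exchanges, so a
    clarification can enable a correct call without adding score."""
    action_turns = [t for t in turns if not t.get("is_clarification")]
    return checker(action_turns)
```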
Results
We present results on BFCL-Audio.
When using the pipelined approach, we typically see a ~10–20% drop relative to BFCL in text mode. This is largely because models fail to handle entity dictation correctly over the pipeline.
In the same audio setting, E2E models and pipelined systems show a measurable gap. We observe that end-to-end systems suffer even greater degradation on BFCL Audio relative to the original text baseline. We suspect this is due to post-training weakness in multimodal function calling. While E2E models are the strongest in naturalness and responsiveness during chitchat, we find that they underperform pipelined systems at function calling.
Example Failure Mode
Blind trust in the ASR transcript leads to the wrong tool call!
User (ASR): Move final report.pdf from the doc directory to the temp directory in doc. Oh, and create the directory if it doesn't exist.
Allowed clarifications:
{
  "source_file": "final_report.pdf",
  "source_directory_name": "doc",
  "destination_directory_name": "doc/temp"
}
Model tool call:
mv("final report.pdf", destination="temp")
Tool response:
Error: no such file exists…
Model reply:
I created the 'temp' directory inside 'doc', but couldn't find "final report.pdf". Where is that file?
What we wanted instead:
Ask for confirmation of the exact filename first (it's on the whitelist), then proceed.
Contributors: Huanzhi Mao (Salesforce AI Research, UC Berkeley), Antonio A. Ginart (Salesforce AI Research), Joseph E. Gonzalez (UC Berkeley), John R. Emmons (Salesforce AI Research)