    BFCL Audio: A Benchmark for Audio-Native Function Calling

    By spicycreatortips_18q76a | August 23, 2025

    We're excited to announce a Salesforce AI Research and Berkeley collaboration: BFCL Audio, a brand-new benchmark that extends BFCL to the audio domain!

    A little Berkeley lore: back in 2022, we couldn't find open-source models that handled zero-shot function calling reliably, so we trained our own. We released Gorilla OpenFunctions v1 (and later v2), then ran into the obvious next question: how do we measure whether models are actually good at function calling? That question became BFCL.

    Since then, function-calling evaluation has turned out to be far richer, and filled with open research questions, than anyone anticipated.

    🧠 BFCL v1 introduced AST-based (Abstract Syntax Tree) evaluation, still the gold standard for zero-shot calls.

    🤝 BFCL v2 became a community effort, with thousands of APIs and prompts contributed by hobbyists and enterprises.

    🔁 BFCL v3 expanded into multi-turn, multi-step evaluation with state-based tracking.

    🧭 BFCL v4 focuses on tool calling in real-world, agentic settings (web search, memory, format sensitivity).

    🚀 Today, BFCL is a foundational benchmark used across major labs, with hundreds of contributors (and thousands more sharing APIs and data). We're deeply grateful for the community's trust; it has shaped BFCL's evolution.

    💻 Run the benchmark: https://github.com/ShishirPatil/gorilla

    Real products don't live in pure text. Voice shows up wherever hands are busy or eyes are busy: phone support, in-car assistants, smart homes, wearables, voice note apps, and accessibility workflows. In these settings, the agent must balance natural, low-latency conversation with reliable, precise action execution, which means evaluation must consider both.

    From the enterprise perspective, businesses seek to automate their customer support and call center operations. This typically involves handling a high volume of diverse customer inquiries, scheduling appointments, resolving issues, and providing information. The need for precise function calling is paramount in these scenarios, as errors can lead to frustrated customers, inefficient operations, and lost revenue. For example, a misheard account number or an incorrect appointment time can severely impact customer satisfaction and operational efficiency. Furthermore, the ability to integrate with existing CRM and backend systems is crucial for seamless automation, making robust and reliable audio-native function calling a significant advantage.

    Architectural Paths for Voice Agents

    There are two common architectures:

    1. End-to-End (E2E) speech ↔ speech

    A single model consumes audio natively and can produce audio directly.

    Strengths:

    • Natural prosody and low latency (no cascaded hops).
    • Unified reasoning over acoustics + semantics (can often recover what ASR would miss).

    Trade-offs:

    • Tool-call precision can lag without extra structure.
    • Fewer knobs for domain adaptation (custom lexicons, per-domain biasing).
    • Very limited model availability (e.g., GPT-4o, Gemini 2.5).
    2. Cascaded (ASR → LLM → TTS)

    Audio is transcribed to text (ASR), processed by a text LLM, then spoken via TTS.

    Strengths:

    • Reuses mature text-LLM stacks and evaluation tooling.
    • Swap components independently (ASR/TTS/LLM).
    • Easy to add guardrails (regex/constrained decoding/AST checks) on the text side.
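    As a rough illustration of what an AST-level guardrail on the text side can look like, the sketch below parses a candidate tool-call string and rejects anything that is not a plain call to a whitelisted tool with literal arguments. The function name and whitelist are illustrative; this is not the benchmark's actual harness.

    ```python
    import ast

    def validate_tool_call(call_str, allowed_tools):
        """Accept a candidate tool call only if it parses cleanly and
        targets a known tool with purely literal (constant) arguments."""
        try:
            tree = ast.parse(call_str, mode="eval")
        except SyntaxError:
            return False
        node = tree.body
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
            return False
        if node.func.id not in allowed_tools:
            return False
        # Require literal arguments so nothing is silently executed.
        args_ok = all(isinstance(a, ast.Constant) for a in node.args)
        kwargs_ok = all(isinstance(kw.value, ast.Constant) for kw in node.keywords)
        return args_ok and kwargs_ok
    ```

    A malformed emission like `mv('a',` or a call to an unlisted tool fails the check before anything reaches the backend.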

    Trade-offs:

    • ASR errors become a bottleneck, often the critical failure point.
    • Latency can add up across hops if not streaming.
    • The LLM never “hears” the waveform, so it can't use acoustic cues to recover intent.

    ASR is excellent, but not perfect. While a small percentage of errors may be acceptable for simple transcription, they can be catastrophic for function calling, where precision is paramount. The link between a minor ASR error and a total task failure is direct. For instance, a user might be interacting with a financial application and state their Employer Identification Number. The ASR system might correctly transcribe most of the number but miss or substitute a single digit. Even if the overall text transcription appears mostly correct to a human observer, the resulting function call will pass an invalid EIN to the backend system, causing the API call to fail. The rigidity of the API endpoint means there is no room for “close enough”.
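    To make the failure concrete, here is a small sketch (the `lookup_account` backend and its format check are hypothetical) showing how a transcript that looks “mostly right” still produces a dead API call:

    ```python
    import re

    # Standard EIN layout is two digits, a hyphen, then seven digits.
    EIN_PATTERN = re.compile(r"^\d{2}-\d{7}$")

    def lookup_account(ein: str) -> dict:
        """Hypothetical backend endpoint: rejects any EIN failing the format check."""
        if not EIN_PATTERN.match(ein):
            raise ValueError(f"invalid EIN: {ein!r}")
        return {"ein": ein, "status": "found"}

    spoken = "12-3456789"       # what the user actually said
    transcribed = "12-345679"   # ASR dropped a single digit

    lookup_account(spoken)      # succeeds
    # lookup_account(transcribed) raises ValueError: the API has no notion of "close enough"
    ```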

    Compared with typed input, audio introduces systematic shifts:

    • Conversational fillers: “uh”, “hmm”, “you know”.
    • Acoustic artifacts and issues absent from text corpora.
    • Accents, background noise, and cross-talk degrade recognition.
    • Homophones & named entities get misheard:
      • John vs Jon
      • final_report.pdf vs final report.pdf vs `finalReport.pdf`
    • Even strong ASR systems still propagate non-trivial word error rates, and crucially, the text LLM never sees the raw audio to recover intent.
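    The kind of lightweight normalization a cascaded pipeline might attempt on these shifts is sketched below. The filler list and regexes are our own illustration; BFCL Audio deliberately leaves such artifacts in the inputs so that models must cope with them.

    ```python
    import re

    # Common fillers to drop (illustrative, not exhaustive).
    FILLER_RE = re.compile(r"\b(?:uh|um|hmm|you know)\b[,.]?\s*", flags=re.IGNORECASE)
    # Letter-by-letter spellings like "L-I-A-M".
    SPELLED_RE = re.compile(r"\b(?:[A-Za-z]-){2,}[A-Za-z]\b")

    def normalize_utterance(text: str) -> str:
        """Drop conversational fillers and collapse dash-spelled words
        ("L-I-A-M" -> "LIAM"); a pre-processing sketch only."""
        text = FILLER_RE.sub("", text)
        text = SPELLED_RE.sub(lambda m: m.group(0).replace("-", ""), text)
        return re.sub(r"\s{2,}", " ", text).strip()
    ```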

    How We Build BFCL Audio

    1) Natural Paraphrasing

    We take existing BFCL queries (single-turn non-live and multi-turn) and rewrite them into conversational-style speech.

    Original (text BFCL):
    “I need to send a letter to Liam Neeson. Find his contact information for me.”

    Paraphrased (audio BFCL):
    “Um, can you get Liam Neeson, that's L-I-A-M N-E-E-S-O-N, Liam Neeson's contact info, oh, so I can send him a letter?”

    2) Synthetic Audio Generation

    We then synthesize audio from the paraphrases using a variety of TTS engines (Qwen, OpenAI, Gemini, ElevenLabs, Cartesia). Each engine has its own style and prosody; we sample across them to diversify inputs.
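    A sketch of how that sampling might look (engine names are the ones listed above; the assignment helper is ours, not the benchmark's code):

    ```python
    import random

    # TTS engines mentioned above; each has its own style and prosody.
    TTS_ENGINES = ["qwen", "openai", "gemini", "elevenlabs", "cartesia"]

    def assign_engines(paraphrases, seed=0):
        """Spread paraphrased queries across engines so that no single
        engine's prosody dominates the audio set (illustrative only)."""
        rng = random.Random(seed)
        return [(text, rng.choice(TTS_ENGINES)) for text in paraphrases]
    ```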

    (Audio example: ElevenLabs TTS)

    • For E2E models, the audio snippet is the input.
    • For cascaded models, we provide transcripts (below).

    3) Three-Tier ASR Transcription (for Pipelined Setups)

    Because pipelined systems can't access the waveform, we pre-transcribe every audio sample using three ASR systems (OpenAI, ElevenLabs, Deepgram) and evaluate models separately on each transcript to expose sensitivity to ASR choices.
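    Schematically, the per-transcript evaluation loop looks something like this (`model` is a callable stand-in and the sample layout is our own assumption; the real harness lives in the Gorilla repo):

    ```python
    def evaluate_per_asr(model, samples):
        """Score a cascaded model once per ASR provider so that sensitivity
        to transcription choices is visible in the results."""
        providers = ("openai", "elevenlabs", "deepgram")
        scores = {}
        for provider in providers:
            correct = sum(
                model(sample["transcripts"][provider]) == sample["expected_call"]
                for sample in samples
            )
            scores[provider] = correct / len(samples)
        return scores
    ```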

    OpenAI: “Um, can you get Liam Neeson-that's L-I-A-M N-E-E-S-O-N- Liam Neeson's contact info? Oh, so I can send him a letter?”

    ElevenLabs: “Um, can you get Liam Neeson, that is L-I-A-M N-E-E-S-O-N, Liam Neeson's contact info? Oh, so I can send him a letter?”

    DeepGram: “Can you get Liam Neeeson? That is l I a m n e e s o n, Liam Neeeson's contact info. Oh, so I can send him a letter?”

    (Also, notice the extra “e” in Neeeson in the DeepGram output.)

    > Note: Only user messages undergo these transformations. Any system messages remain in their original text form.

    Evaluation Protocol & Metric Modifications

    To inform models that they are in an audio setting, we prepend a short system prompt to each conversation:

    You are a voice assistant that interacts with the user solely through spoken dialog. You receive user utterances as text transcribed by an upstream ASR system, and your replies are delivered to the user by a TTS system. Follow the rules below at all times:

    1. Language

    * Mirror the user's language. Reply in the same language detected in the transcription.

    2. Robustness to ASR Errors (Important)

    * Although the upstream ASR system is designed to be robust, it may still make errors.
    * Do not trust the transcription text blindly, especially on important information. You should assume the transcript may contain recognition errors.
    * If the text appears garbled, double-check with the user instead of guessing.

    3. Clarity for TTS

    * When responding to the user, you should **spell out acronyms** as separate letters with spaces (“A I M L”), and **chunk long numbers** into 2- or 3-digit groups, separated by short pauses (“one-two-three, four-five-six”).
    * Prefer spoken-language style: short sentences, everyday vocabulary, and natural contractions.
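    The two TTS-readability rules are easy to picture in code; this helper pair is our own illustration of the transformations the prompt asks for:

    ```python
    def speak_acronym(acronym: str) -> str:
        """Render an acronym as separate letters with spaces: 'AIML' -> 'A I M L'."""
        return " ".join(acronym)

    def chunk_number(digits: str, group: int = 3) -> str:
        """Chunk a long digit string into short groups: '123456' -> '123, 456'."""
        groups = [digits[i:i + group] for i in range(0, len(digits), group)]
        return ", ".join(groups)
    ```

    A reply containing "123456" would then be spoken as "123, 456", with the commas standing in for the short pauses.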

    Turn Semantics (Why Audio Is Different)

    In text-BFCL, each turn continues as long as the model keeps emitting valid non-empty tool calls (decoded by decode_exec). The turn ends the moment the model emits any non-tool message.

    That's not ideal in a voice setting. Because of homophones and ASR issues, a good audio agent should proactively clarify spellings or key values before acting. Penalizing that behavior would encourage reckless tool calls.

    Clarification Mechanism

    We add an LLM judge plus a simulated user to support spelling/disambiguation clarifications without rewarding chitchat.

    • If the model asks for a spelling-related clarification (as judged by the LLM), we generate a concise user reply using a whitelist of allowed clarifications for that query (e.g., person names, file names, IDs).
    • Only spelling/format confirmations count.
    • Effect: every task can become multi-step, but only allowed clarifications are honored.
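    The gating logic reduces to a whitelist check. The sketch below captures the shape of it; topic extraction, done by the LLM judge in the real protocol, is abstracted into a plain list here, so this is schematic rather than the actual judge:

    ```python
    def handle_clarification(question_topics, allowed_clarifications):
        """Approve a clarification only when every topic the assistant asks
        about appears in the per-message whitelist; otherwise reject."""
        if question_topics and all(t in allowed_clarifications for t in question_topics):
            # Simulated user supplies only the requested canonical values.
            reply = "; ".join(f"{t} is {allowed_clarifications[t]}"
                              for t in question_topics)
            return {"allowed": True, "message": reply}
        return {"allowed": False, "message": ""}
    ```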

    Example Interaction (Idealized):

    Per-message whitelist example:

    Message: “Um, can you get Liam Neeson, that is L-I-A-M N-E-E-S-O-N, Liam Neeson's contact info? Oh, so I can send him a letter?”

    Allowed clarifications: {

        "person_name": "Liam Neeson"

    }

    Judge prompt:

    • The judge sees the intended request, the ASR text, the assistant's message, and the allowed clarification keys.
    • It approves only if the assistant is explicitly confirming spellings/values that appear in the whitelist. Otherwise, it rejects.

    You are a judge for an audio-chat scenario where a user speaks and an ASR system transcribes their speech for the assistant. The assistant only sees text (the ASR transcript), which is likely to contain transcription errors.

    You are given:
    – intended_request: the user's original, ground-truth intent.
    – asr_text: the ASR-transcribed text the assistant saw.
    – allowed_clarifications: a set of fields with canonical spellings/values the user can confirm (e.g., names, IDs, emails, dates, numbers).
    – assistant_message: the assistant's latest message.

    Your job: decide whether assistant_message is a clarifying question specifically about spelling/verification of intent or exact strings/values that could plausibly be misheard (e.g., names, organizations, emails, serials/IDs, numbers, dates, addresses, SKUs). Do not allow general follow-ups (preference, steps to proceed, etc.).

    Decision rules:
    1. Classify the message as a spelling confirmation only if it explicitly asks to verify the exact spelling/format/value of one or more items (e.g., “Is it Mikaela or Michaela?”, “Can you spell the email?”, “Is the order number A1B-52?”).
    2. The request must be reasonable given the ASR risk (i.e., the item is a proper noun, key value, or easily misheard token relevant to the task).
    3. To approve (allowed=true), all the topics the assistant asks to confirm must be present in allowed_clarifications. If any requested item is absent or ambiguous, set allowed=false.
    4. Output only a JSON object with two fields:
    – allowed: boolean
    – message: string (a concise simulated user reply only when allowed=true; otherwise empty “”).
    5. When allowed=true, compose message by supplying only the requested values with correct spelling/format from allowed_clarifications. Keep it brief (one short sentence or a compact list). Do not include extra commentary, JSON, or fields the assistant did not request.
    6. If the assistant's message is not a confirmation request, touches topics outside spelling/format/intent verification, or requests values not available in allowed_clarifications, return allowed=false with message=“”.

    Edge cases:
    – If the assistant mixes spelling confirmation with unrelated questions, treat it as not allowed unless the spelling part stands alone and you can fully answer it from allowed_clarifications.
    – Treat homophones and near-matches as spelling checks (e.g., “Brian/Bryan”, “Steven/Stephen”, letters vs. digits).
    – Normalize case/diacritics but preserve canonical spelling in the final reply.
    – Never reveal intended_request verbatim; only return the exact confirmed values.
    The user's original intended request is: {the original text bfcl question}

    The ASR-transcribed output is: {the transcribed text from the audio, which is also the input to the model}

    assistant_message: {the model's response}

    allowed_clarifications (topic -> answer): {the allowed_clarifications}

    Evaluation metric:

    We keep the same metrics as text-BFCL (AST for single-turn; state-based + response-based checks for multi-turn) and ignore the clarification turns when computing the final function-calling score. In other words, clarifications can enable a correct action, but they do not directly inflate the score.
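    In pseudocode terms (our illustration; the turn records and field names are assumptions, not the harness's actual schema), the scoring step simply filters out the clarification turns:

    ```python
    def score_ignoring_clarifications(turns):
        """Compute function-calling accuracy over tool-call turns only.
        Clarification turns are skipped, so they can enable a correct
        call without inflating the final score."""
        scored = [t for t in turns if t["kind"] == "tool_call"]
        if not scored:
            return 0.0
        return sum(t["correct"] for t in scored) / len(scored)
    ```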

    Results

    We present results on BFCL-Audio.

    When using the pipelined approach, we typically see a ~10–20% drop relative to BFCL in text mode. This is largely because models fail to correctly handle entity dictation over the pipeline.

    In the same audio setting, E2E models and pipelined systems show a measurable gap. We observe that End-to-End systems suffer an even higher degradation in BFCL Audio relative to the original text baseline. We suspect this is due to post-training weakness in multimodal function calling. While E2E models are the strongest in terms of naturalness and responsiveness in chitchat, we find that they underperform pipelined systems for function calling.

    Example Failure Mode

    Blind ASR trust leads to the wrong tool call!

    User (ASR): Move final report.pdf from the doc directory to the temp directory in doc. Oh, and create the directory if it does not exist.

    Allowed clarifications:
    {
    "source_file": "final_report.pdf",
    "source_directory_name": "doc",
    "destination_directory_name": "doc/temp"
    }

    Model tool call:
    mv("final report.pdf", destination="temp")

    Tool response:
    Error: no such file exists…

    Model reply:
    I created the ‘temp’ directory inside ‘doc’, but couldn't find “final report.pdf”. Where is that file?

    What we wanted instead:

    Ask for confirmation on the exact filename first (it's on the whitelist), then proceed.

    Contributors: Huanzhi Mao (Salesforce AI Research, UC Berkeley), Antonio A. Ginart (Salesforce AI Research), Joseph E. Gonzalez (UC Berkeley), John R. Emmons (Salesforce AI Research)
