In a world of AI agents that click, scroll, execute, and automate, we’re moving fast from “just understand text” to “actually use software for you.” The new benchmark SCUBA tackles exactly that: how well can agents perform real enterprise workflows inside the Salesforce platform?
What makes SCUBA stand out:
- It is built around actual workflows inside the Salesforce platform.
- It covers 300 task instances derived from real user interviews (platform admins, sales reps, and service agents).
- The tasks test not just “does the model answer the question” but “can the model use the UI, manipulate data, trigger workflows, and troubleshoot issues” (see the sketch after this list).
- It addresses a gap: existing benchmarks typically focus on web navigation and general software manipulation, but enterprise-software “computer use” is hard to measure. SCUBA aims to fill it.
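To make the shape of such a task concrete, here is a minimal sketch of what a task instance could look like; the `ScubaTask` dataclass and all of its field names are illustrative assumptions, not the paper’s actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical shape of a computer-use task instance; the dataclass and
# field names are illustrative assumptions, not SCUBA's actual schema.
@dataclass
class ScubaTask:
    task_id: str
    persona: str       # e.g. "platform admin", "sales rep", "service agent"
    instruction: str   # the natural-language task query given to the agent
    start_url: str     # where the episode begins inside the CRM
    milestones: list[str] = field(default_factory=list)  # intermediate checkpoints
    max_steps: int = 50  # step budget before the episode is cut off

example = ScubaTask(
    task_id="admin-042",
    persona="platform admin",
    instruction="Create a validation rule that blocks Opportunities from "
                "closing without a next step.",
    start_url="https://example.my.salesforce.com/lightning/setup/home",
    milestones=["opened Setup", "rule created", "rule activated"],
)
```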
Key Takeaway: If you want agents that don’t just chat but act in enterprise software, this is a big step.
The Enterprise Impact
Imagine an AI assistant that can navigate your CRM, update records, launch workflows, interpret dashboard failures, and help your service team get unstuck. That’s the vision this paper leans into.
Here’s why it’s compelling:
- Enterprise alignment: Many benchmarks are academic or consumer-web oriented. SCUBA puts the spotlight on business-critical environments (admin, sales, and service).
- Realistic tasks: By deriving tasks from user interviews and genuine personas, it bridges the gap between “toy benchmark” and “live user scenario.”
- Measurable agent performance in context: It enables evaluation of how well an agent operates inside software systems, not just via text.
- Roadmap for future AI assistants: As more organizations adopt AI to automate software use (not just analysis), benchmarks like this set expectations, highlight challenges, and direct progress.
For companies like Salesforce (and their customers) the implications are clear: better agent tooling, fewer manual clicks, faster issue resolution, more efficient sales/service teams. For the AI community: a new frontier of “task execution in the UI” rather than “just text reasoning.”
Key Insights:
1. Stark performance gaps across agent types
- In the zero-shot setting (i.e., the agent is given only the task query), computer-use agents powered by open-source models that perform well on related benchmarks like OSWorld achieve less than a 5% success rate on SCUBA. Meanwhile, methods built on closed-source models (the foundation models behind proprietary agents) achieved up to a 39% task success rate in zero-shot on SCUBA.
- In the demonstration-augmented setting (the agent is shown human demonstrations of similar tasks), success rates can rise to around 50%, while also reducing time and cost (by ~13% and ~16%, respectively). The sketch after this list shows how the two settings differ from the agent’s point of view.
- A clear gap: open-source agents struggle far more than closed-source ones, and domain specificity (CRM tasks) is much harder than generic benchmarks.
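Here is a minimal sketch of the two settings, assuming a text-prompted agent; the prompt layout and the `build_prompt` helper are illustrative assumptions, not the paper’s actual harness.

```python
# A minimal sketch of the two evaluation settings, assuming a text-prompted
# agent. The prompt layout and this helper are illustrative assumptions.

def build_prompt(task_query: str, demonstration: str | None = None) -> str:
    """Zero-shot: the agent sees only the task query.
    Demonstration-augmented: a human trace of a similar task is prepended."""
    if demonstration is None:
        return f"Task: {task_query}\nDecide the next UI action."
    return (
        "Here is a human demonstration of a similar task:\n"
        f"{demonstration}\n\n"
        f"Task: {task_query}\nDecide the next UI action."
    )

# Zero-shot setting
prompt_zero_shot = build_prompt("Reassign all open cases from Alice to Bob.")

# Demonstration-augmented setting
demo = "1. Open the Service console\n2. Filter cases by owner\n3. Bulk-change the owner"
prompt_demo = build_prompt("Reassign all open cases from Alice to Bob.", demonstration=demo)
```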
2. Demonstrations help, but only up to a point
Knowledge articles and tutorials on how to use the Salesforce platform are easily accessible. One natural question is whether AI agents can leverage this information effectively. The experimental results reveal that:
- Human demonstrations (showing the agent how to do a similar task) improved performance across most agents: higher success rates, lower time, lower token usage. However, some agents didn’t benefit as much.
- Also, some browser-use agents ended up taking more steps in demonstration-augmented mode (for example, because they went looking for “shortcuts” the human demo didn’t show). So the design of demonstrations still matters.
3. Real-world domain shift is hard
- The performance drop when moving from the more generic OSWorld benchmark (which covers desktop applications) to SCUBA (CRM, enterprise workflows) is significant. The authors chart the drop in success rates (e.g., −27.8% to −97.6% depending on the agent) when switching benchmarks.
- The qualitative analysis identifies the main failure modes: incorrect planning (which page to go to next), grounding errors (clicking the wrong UI element, wrong coordinates), and failure to recover from errors.
- So enterprise software environments impose challenges beyond language/vision: UI complexity, state management, permissions, error recovery.
4. Cost, latency, and practical deployment matter
- Success rate is not the only metric; latency (time to complete tasks) and cost (API/token costs, number of steps) are also reported. The paper shows scatter plots of “Cost vs. Success Rate” and “Time vs. Success Rate” across agents. For instance, browser-use agents had high success rates but higher latency (due to API service response times and multi-agent framework design).
- Demonstration augmentation not only improves success but can reduce time and cost (the paper reports ~13% lower time and ~16% lower cost in the demonstration-augmented setting).
For enterprise adoption, this matters: an agent that succeeds but is too slow or too costly may be less useful in practice. The sketch below shows the kind of trade-off summary involved.
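As a rough illustration (not the paper’s code), here is a toy aggregation over hypothetical episode logs that produces the three quantities such a scatter plot needs; every name and number is made up.

```python
# Toy aggregation over hypothetical episode logs; every name and number
# here is made up for illustration, not taken from the paper.
from statistics import mean

episodes = [
    # (agent, succeeded, seconds, usd_cost)
    ("agent-a", True, 210.0, 0.42),
    ("agent-a", False, 350.0, 0.61),
    ("agent-b", True, 95.0, 0.18),
    ("agent-b", True, 120.0, 0.22),
]

by_agent: dict[str, list[tuple[bool, float, float]]] = {}
for agent, ok, secs, cost in episodes:
    by_agent.setdefault(agent, []).append((ok, secs, cost))

for agent, runs in sorted(by_agent.items()):
    success_rate = mean(ok for ok, _, _ in runs)  # fraction of tasks solved
    mean_time = mean(s for _, s, _ in runs)       # mean episode latency
    mean_cost = mean(c for _, _, c in runs)       # mean API/token cost
    print(f"{agent}: success={success_rate:.0%} time={mean_time:.0f}s cost=${mean_cost:.2f}")
```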
Implications for the Future of CRM Automation:
- Agents will move from “assist” to “operate”: Instead of just recommending what to click or produce, agents will increasingly do the clicking, configuring, and workflow launching, inside systems like Salesforce.
- Training data will shift toward UI/action context: Rather than text-only datasets, we’ll see more benchmarks and datasets for “agent performs a sequence in software” tasks (click → fill → submit); a sketch of such a trace follows this list.
- Enterprise software UX will matter for AI: As agents navigate interfaces, software products themselves may evolve to be more “agent-friendly” (e.g., more structured actions, better logs, agent-observable state).
- New kinds of robustness challenges: Agents must handle UI changes, versioning, error states, and permissions, issues that are less common in typical NLP benchmarks.
- Metrics will evolve: Success won’t just be “did it answer correctly” but “did it execute the right actions in the UI, in the right sequence, and handle exceptions.” SCUBA already includes milestone scores, latency, and cost.
- Hybrid models and demonstration pipelines will become standard: As the experiments show, demonstrations help. Enterprises might build libraries of “how-to” agent episodes for each workflow.
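To show what a “click → fill → submit” trace could look like as data, here is a minimal sketch; the `UIAction` schema and the selectors are illustrative assumptions, not a format defined by SCUBA.

```python
# Hypothetical record for a "click → fill → submit" action trace; the
# schema and selectors are illustrative assumptions, not SCUBA's format.
from dataclasses import dataclass

@dataclass
class UIAction:
    kind: str        # "click", "fill", or "submit"
    target: str      # selector or accessibility label of the UI element
    value: str = ""  # text typed for "fill" actions, empty otherwise

# One episode (creating a contact) as an action sequence that could be
# logged as training data or replayed as a demonstration.
episode = [
    UIAction("click", "button[name='New Contact']"),
    UIAction("fill", "input[name='LastName']", "Rivera"),
    UIAction("fill", "input[name='Email']", "rivera@example.com"),
    UIAction("submit", "form#contact-new"),
]
```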
Takeaways for Practitioners
If you’re working in CRM, sales automation, or service operations, or you’re building agents for enterprise software, here are some action items inspired by SCUBA:

