
Benchmarks

How jobs run in production—and how we score them

Enterprise General Intelligence

A benchmark is not a model leaderboard. It is a record of completed runs: the agent executed a standardized job against real systems, left an auditable trail, and passed the spec.

We run standardized jobs against live tenant systems, score each execution pass or fail, and aggregate results across businesses in unlike industries. The benchmark you review is that execution history—not a synthetic capability index.

Sample production run

Prospect, outreach & opportunity engagement

REV-PIPE-042 · Revenue · B2B industrial distributor · anonymized

PASS

14m 32s

Systems touched: Salesforce · Outreach · Slack

00:00  Job spec REV-PIPE-042 loaded · tenant CRM bound
00:42  47 stale opportunities identified · re-score started
04:18  Outbound sequences drafted · policy check passed
09:05  12 opportunities engaged · next steps logged in CRM
14:32  Job closed · pipeline fields match spec · audit trail written

Goal persistence: Pass
Tool execution: Pass
Multi-step completion: Pass
Constraint adherence: Pass

How each run executes

Every scored benchmark follows the same execution path—from job spec to systems updated to pass/fail logged.

  1. Load job spec

    The job to be done (JTBD) for the function: what “done” means, which systems may be touched, and the policy limits.

  2. Connect tenant stack

    Live CRM, ERP, WMS, or finance APIs for that business, not an isolated test harness.

  3. Execute end-to-end

    Multi-step run with tool calls and codegen; state is held until the job completes or fails.

  4. Update systems of record

    Production objects (pipeline, journals, POs, tickets) are written per the job definition.

  5. Score the run

    Pass or fail against the job spec, plus the execution checks below. Every run is logged.

  6. Publish benchmark

    Passes are aggregated across unlike tenants before the agent deploys to you.

[JTBD] Job spec loaded → [EXECUTE] Agent runs end-to-end → [TOOLS] APIs · DBs · SaaS → [SYSTEMS] Records updated → [SCORE] Pass / fail logged
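As a rough illustration of the execution path above, the sketch below models a job spec, a sequence of steps, and a run log. It is not the platform's actual implementation: the names `JobSpec`, `Step`, `RunLog`, and `execute_run` are hypothetical, policy limits are left out for brevity, and the only rule enforced is that a step may not touch a system the job spec does not allow.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical job spec: what "done" means and which systems may be touched."""
    job_id: str                      # e.g. "REV-PIPE-042"
    done_when: str                   # plain-language definition of "done"
    allowed_systems: frozenset[str]  # systems the run is permitted to touch


@dataclass(frozen=True)
class Step:
    """Hypothetical unit of work: one write against one system of record."""
    system: str
    action: Callable[[], str]        # stand-in for a tool call; returns a log message


@dataclass
class RunLog:
    job_id: str
    tenant: str
    entries: list[str] = field(default_factory=list)
    passed: Optional[bool] = None    # None until the run has been scored


def execute_run(spec: JobSpec, tenant: str, steps: list[Step]) -> RunLog:
    """Run every step in order, logging each one; fail on the first disallowed system."""
    log = RunLog(spec.job_id, tenant)
    log.entries.append(f"job spec {spec.job_id} loaded")
    for step in steps:
        # Check the spec before acting, so a disallowed system is never touched.
        if step.system not in spec.allowed_systems:
            log.entries.append(f"blocked: {step.system} is outside the job spec")
            log.passed = False
            return log
        log.entries.append(f"{step.system}: {step.action()}")
    log.passed = True
    log.entries.append("job closed")
    return log
```

For example, a spec that allows Salesforce, Outreach, and Slack would pass a run whose steps stay within those three systems, and would fail (with the reason logged) as soon as a step targeted anything else.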

Production runs scored

Real end-to-end workflow runs at real businesses. Names omitted; you see the job, what the agent executed, and the pass criteria.

Revenue · Prospect, outreach & opportunity engagement

Agent executes: Prospects targets, runs outbound outreach, and engages each opportunity in the CRM, with activity logged, next steps set, and pipeline advanced.
Tenants validated: B2B industrial distributor · tech & SaaS · regional healthcare services
Done when: Prospects sourced, outreach executed, opportunities engaged with documented touchpoints and next steps in CRM.
Pass criteria: Pass = full prospect-to-engagement cycle completes on each tenant's CRM and outreach stack.

Finance · Month-end reconciliation assist

Agent executes: Runs the full close workflow: pulls sub-ledgers, matches exceptions, drafts journals inside policy, routes approvals, writes audit entries.
Tenants validated: Multi-entity CPG operator · logistics tech platform · financial services back-office
Done when: Exceptions surfaced, proposed journals within policy, full audit trail per run.
Pass criteria: Pass = the same job spec completes across different charts of accounts and approval workflows.

Purchasing · Signal-driven purchasing decisions

Agent executes: Tracks demand, inventory, vendor, and spend signals across systems; synthesizes inputs and executes or routes purchase decisions within policy.
Tenants validated: Industrial distribution · CPG & beverage · retail & commerce
Done when: Signals monitored, decision rationale documented, PO or requisition created or escalated per approval rules.
Pass criteria: Pass = full signal-to-decision cycle on each tenant's procurement and ERP stack.

Scored on every run

Before a run counts toward the published benchmark, it must clear these checks. No partial credit for a clever single step.

  • Job closed to defined expectation (yes / no)
  • Correct tools and parameters for that tenant's stack
  • Full sequence executed—not a single demo step
  • Audit log captures each action and system change
  • Policies and constraints held for that business
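One way to read the checklist above is as an all-or-nothing gate: a run counts only if every check is true. The sketch below shows that reading. The `RunChecks` field names are hypothetical labels for the five checks, not identifiers from the scoring system itself.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunChecks:
    """Hypothetical record of the five per-run checks, one boolean each."""
    job_closed: bool          # job closed to defined expectation
    correct_tools: bool       # correct tools and parameters for the tenant's stack
    full_sequence: bool       # full sequence executed, not a single demo step
    audit_log_complete: bool  # audit log captures each action and system change
    constraints_held: bool    # policies and constraints held for that business


def counts_toward_benchmark(checks: RunChecks) -> bool:
    """No partial credit: the run counts only when every check passed."""
    return all(getattr(checks, f.name) for f in fields(checks))


def failed_checks(checks: RunChecks) -> list[str]:
    """Names of the checks that did not pass, in declaration order."""
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]
```

Under this reading, a run that closes the job and calls every tool correctly but leaves a gap in the audit log is excluded, and `failed_checks` reports which check kept it out.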

Execution dimensions

Each dimension describes what the agent must do in production—not a lab capability label.

Goal persistence

Finishes the job through handoffs and long runs

Tool execution

Calls the right APIs with correct parameters

Multi-step completion

Runs the full job chain, not one isolated action

State & context

Retains job state across steps and interruptions

Error recovery

Recovers and continues without abandoning the job

Outcome quality

Deliverables match the job spec

Constraint adherence

Stays inside tenant policy and approval rules

Earn the benchmark across tenants

One completed run is a data point. The benchmark publishes when the same job passes at multiple businesses with different stacks—e.g. prospect-and-outreach runs at a SaaS vendor, a healthcare operator, and an industrial distributor, each with its own CRM, sequences, and approval chain.
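The publication rule can be sketched as a simple aggregation over run results. The text says only that a job must pass at "multiple" businesses with different stacks, so the threshold of two distinct tenants and two distinct stacks below is an assumption, as are the `RunResult` type, the `publishable_jobs` function, and the sample identifiers used to illustrate it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical scored run: which job, at which tenant, on which stack."""
    job_id: str
    tenant: str
    stack: str    # e.g. the CRM or ERP the tenant runs
    passed: bool


def publishable_jobs(results: list[RunResult], min_distinct: int = 2) -> list[str]:
    """Return job IDs that passed at enough distinct tenants and distinct stacks.

    min_distinct is an assumed threshold; failed runs are ignored entirely.
    """
    tenants: dict[str, set[str]] = defaultdict(set)
    stacks: dict[str, set[str]] = defaultdict(set)
    for run in results:
        if run.passed:
            tenants[run.job_id].add(run.tenant)
            stacks[run.job_id].add(run.stack)
    return sorted(
        job_id
        for job_id in tenants
        if len(tenants[job_id]) >= min_distinct and len(stacks[job_id]) >= min_distinct
    )
```

With this rule, a job that passes at an industrial distributor on one CRM and at a SaaS vendor on another would be publishable, while a job with a single passing tenant would remain a data point.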

That cross-tenant execution record is what you review before deploy—not a generic model score from a fixture.

Benchmarked across unlike industries

Cybersecurity · Industrial distribution · Retail & commerce · eCommerce & DTC · CPG & beverage · Fashion & jewelry · Logistics tech · Software & SaaS · Financial services · Education & EdTech · Healthcare & life sciences · Pharma, biotech & CROs · Data centers & AI infra · Utilities & energy · Industrial warehousing · Food & beverage · Creative & marketing

Same job spec. Different tenants. Scored before deploy.

What you get before Day 1

→ Job spec for your function
→ Pass/fail history on that job at unlike tenants
→ Execution logs and dimension scores from validating runs
→ Deploy in days: connect your stack, run under your policies