OpenBot
OpenBot Bench · Flagship

Evaluate every robot policy on what actually ships.

Task success, per-subtask survival, sim-to-real gap, intervention rate — across seeds, across embodiments. The numbers your customer signs off on, not the loss curve they don't see.

6 policy types · 8 embodiments · 5 simulators
bench_eval.py
from openbot import Client

ob = Client()                       # reads OPENBOT_API_KEY

# 4-step kitchen handover on a Franka Panda
run = ob.bench.rollout(
    policy="openvla-7b",
    embodiment="franka_panda",
    task="open_drawer → pick_mug → pour → handover",
    rollouts=200,
    seeds=10,
    sim="isaac_sim",
    real_hw=True,
    edge_target="jetson_orin",
)

result = run.wait()                 # poll until the run finishes
print(result.task_success)          # e.g. 0.73
print(result.subtask["handover"])   # e.g. 0.60  ← bottleneck
print(result.sim_to_real_gap)       # e.g. -0.29
print(result.intervention_rate)     # e.g. 0.14
Capabilities

The metrics that decide ship vs. cancel.

Bench is built around how policy failures actually show up in production rollouts — and how to make them measurable.

  1. Long-horizon task success

    Per-subtask survival, time-to-success, recovery rate. A 4-step rollout rarely fails only at the end; it usually breaks at one specific subtask along the way. Bench reports which subtask fails first.

  2. Sim-to-real gap, quantified

    Paired rollouts in sim and on real hardware. Get the success-rate delta per subtask, plus the exact dynamics-randomization sweep that closes the gap.

  3. Across seeds, across embodiments

    Default 200 rollouts × 10 seeds. Run the same policy on Franka, UR, and Unitree, and get the variance and the worst case, not a single lucky seed your customer will never reproduce.

  4. Edge deploy metrics

    Latency, FPS, and memory on Jetson Orin / Thor and your custom inference stack. Targets TensorRT, DeepStream, and bare CUDA, so you know whether it runs before you ship.

  5. Acceptance reports

    PDF / Markdown scorecards your customer can sign off on. Failure videos clustered by root cause, with next-round Data and Synth recommendations attached.

  6. Webhooks & integrations

    Fire to Slack, Linear, or your CI on every completed rollout. Gate a merge on task success ≥ 80%, or auto-trigger a Synth run when the sim-to-real gap widens.
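
The merge gate and Synth trigger described in item 6 can be sketched as two small checks that a CI job runs against a completed-run payload. This is a minimal sketch, not the Bench webhook schema: the payload fields `task_success` and `sim_to_real_gap`, the helper names `gate_merge` and `gap_widened`, and the 2 pp tolerance are all assumptions made for illustration.

```python
# Hypothetical payload shape: field names are assumptions, not the Bench webhook schema.

def gate_merge(payload: dict, min_success: float = 0.80) -> bool:
    """Return True when the run's task success meets the merge threshold."""
    return payload["task_success"] >= min_success


def gap_widened(payload: dict, previous_gap: float, tolerance: float = 0.02) -> bool:
    """Return True when the sim-to-real gap grew by more than `tolerance`.

    Gaps are assumed to be real minus sim success, so a more negative
    value means a wider gap.
    """
    return payload["sim_to_real_gap"] < previous_gap - tolerance


payload = {"run_id": "run_8c91a4", "task_success": 0.73, "sim_to_real_gap": -0.29}

merge_ok = gate_merge(payload)                           # 0.73 < 0.80: the gate blocks
needs_synth = gap_widened(payload, previous_gap=-0.41)   # gap narrowed: no Synth run
```

Keeping the threshold and tolerance as parameters lets each repository set its own acceptance bar without changing the check itself.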

Example metrics

Numbers that decide ship or cancel.

Not loss curves. Not validation accuracy. The metrics your customer sees when the policy runs on their hardware.

Task success: 73% (+8 pp)
Sim→Real gap: −29 pp (+12 pp)
Intervention rate: 14% (−6 pp)
Recovery rate: 82% (+5 pp)
run_8c91a4 · policy: openvla-7b · embodiment: franka_panda · 200 rollouts × 10 seeds
Success 73% · Intervention 14% · Other 13%
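
One way to read the Sim→Real figure is as the difference between real and sim success rates over paired rollouts, in percentage points. The sketch below assumes that sign convention (real minus sim, which matches the negative value printed in bench_eval.py) and uses made-up outcome counts. `success_rate` and `gap_pp` are hypothetical helpers, not part of the openbot client.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Fraction of successful rollouts."""
    return sum(outcomes) / len(outcomes)


def gap_pp(sim: list[bool], real: list[bool]) -> float:
    """Sim-to-real gap in percentage points (real minus sim, assumed convention)."""
    return round(100 * (success_rate(real) - success_rate(sim)), 1)


# Made-up paired outcomes per subtask: (sim, real), 20 rollouts each.
paired = {
    "pour":     ([True] * 18 + [False] * 2, [True] * 16 + [False] * 4),
    "handover": ([True] * 17 + [False] * 3, [True] * 12 + [False] * 8),
}

gaps = {name: gap_pp(sim, real) for name, (sim, real) in paired.items()}
widest = min(gaps, key=gaps.get)   # most negative gap = largest drop on real hardware
```

Computing the gap per subtask rather than per task shows where real hardware diverges from simulation, which is the number a randomization sweep would target.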
Example report

A report built for sign-off.

Task success, failure point, and next action in one view.

openbot bench · kitchen_handover
run_8c91a4 · policy: openvla-7b · embodiment: franka_panda · 200 rollouts × 10 seeds

kitchen_handover · open_drawer → pick_mug → pour → handover

Conditional pass
Task success: 73% (+8 pp)
Sim→Real gap closed: −29 pp (+12 pp)
Intervention rate: 14% (−6 pp)
Mean time-to-success: 18.4 s (−2.1 s)

Subtask success

200 rollouts, real Franka
  • open_drawer: 98%
  • pick_mug: 91%
  • pour: 82%
  • handover: 60%
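
Per-subtask numbers of this kind can be derived from rollout logs that record where each rollout first failed. The sketch below is one way to compute survival (the fraction of all rollouts that complete a subtask) and conditional success (the fraction that complete it among those that reached it). The log format and the counts are illustrative assumptions, not Bench output and not the data behind the report above.

```python
from collections import Counter

SUBTASKS = ["open_drawer", "pick_mug", "pour", "handover"]

# Illustrative log: the subtask where each rollout first failed, or None on full success.
first_failure = (
    ["open_drawer"] * 2 + ["pick_mug"] * 8 + ["pour"] * 10
    + ["handover"] * 20 + [None] * 60
)


def survival_table(first_failure, subtasks):
    """Return {subtask: (survival, conditional_success)} in execution order."""
    failed_at = Counter(first_failure)
    total = len(first_failure)
    reached = total                       # rollouts that started the current subtask
    table = {}
    for name in subtasks:
        completed = reached - failed_at[name]
        conditional = completed / reached if reached else 0.0
        table[name] = (completed / total, conditional)
        reached = completed               # only survivors attempt the next subtask
    return table


table = survival_table(first_failure, SUBTASKS)
# The bottleneck is the subtask with the lowest conditional success.
bottleneck = min(table, key=lambda name: table[name][1])
```

Conditional success isolates how hard each subtask is on its own, while survival shows how much of the rollout budget is left by the time the policy reaches it.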

Success across 10 seeds

73% ± 5.2

[Chart: success rate per seed, seeds 0 to 9]
Worst: seed 3 (65%) · Best: seed 2 (80%)
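
A per-seed summary like this can be reproduced from a list of per-seed success rates. In the sketch below the values are illustrative: only the mean, the worst seed, and the best seed match the report above, and the remaining values are made up. Treating the ± figure as a sample standard deviation is also an assumption.

```python
import statistics

# Illustrative per-seed success rates (%), index = seed. Only the mean, worst and
# best match the report; the other values are invented for the example.
per_seed = [74, 76, 80, 65, 70, 78, 68, 75, 72, 72]

mean = statistics.mean(per_seed)
spread = statistics.stdev(per_seed)   # sample standard deviation (assumed meaning of ±)
worst_seed = min(range(len(per_seed)), key=per_seed.__getitem__)
best_seed = max(range(len(per_seed)), key=per_seed.__getitem__)

print(f"{mean:.0f}% ± {spread:.1f}")
print(f"Worst: seed {worst_seed} · {per_seed[worst_seed]}%")
print(f"Best: seed {best_seed} · {per_seed[best_seed]}%")
```

Reporting the worst seed next to the mean keeps one unlucky initialization from being averaged away.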

Ship for drawer + pick + pour. Replay failed handovers in Synth, then re-run Bench.

Compatibility

Bring your policy, your robot, your stack.

Policies
OpenVLA · π0 · RT-2 · ACT · Diffusion Policy · BYO
Embodiments
Franka Panda · UR3 / 5 / 10e · xArm 6 / 7 · Unitree G1 / H1 · Galaxea · AgileX · ALOHA · Stretch
Simulators
Isaac Sim · Isaac Lab · MuJoCo · RoboCasa · LIBERO
Edge runtime
Jetson Orin · Jetson Thor · TensorRT · DeepStream · ROS 2

Bring a checkpoint, get a verdict.

Read the Bench spec and run the open examples today. Request early access for managed rollouts.