Evaluate every robot policy on what actually ships.
Task success, per-subtask survival, sim-to-real gap, intervention rate — across seeds, across embodiments. The numbers your customer signs off on, not the loss curve they don't see.
- 6 policy types
- 8 embodiments
- 5 simulators

from openbot import Client

ob = Client()  # reads OPENBOT_API_KEY

# 4-step kitchen handover on a Franka Panda
run = ob.bench.rollout(
    policy="openvla-7b",
    embodiment="franka_panda",
    task="open_drawer → pick_mug → pour → handover",
    rollouts=200,
    seeds=10,
    sim="isaac_sim",
    real_hw=True,
    edge_target="jetson_orin",
)

result = run.wait()  # poll until the run finishes
print(result.task_success)  # e.g. 0.73
print(result.subtask["handover"])  # e.g. 0.60 ← bottleneck
print(result.sim_to_real_gap)  # e.g. -0.29
print(result.intervention_rate)  # e.g. 0.14

The metrics that decide ship vs. cancel.
Bench is built around how policy failures actually show up in production rollouts — and how to make them measurable.
- 01 Long-horizon task success
Per-subtask survival, time-to-success, recovery rate. A 4-step rollout doesn't fail at the end — it fails at step 3. Bench tells you exactly which subtask cracks first.
- 02 Sim-to-real gap, quantified
Paired rollouts in sim and on real hardware. Get the success-rate delta per subtask, plus the exact dynamics-randomization sweep that closes the gap.
- 03 Across seeds, across embodiments
Default 200 rollouts × 10 seeds. Run the same policy on Franka, UR, Unitree — get the variance and worst-case, not a single lucky seed your customer won't see.
- 04 Edge deploy metrics
Latency, FPS, memory on Jetson Orin / Thor and your custom inference stack. Targets TensorRT, DeepStream, and bare CUDA so you know if it runs before you ship.
- 05 Acceptance reports
PDF / Markdown scorecards your customer can sign off on. Failure videos clustered by root cause, with next-round Data and Synth recommendations attached.
- 06 Webhooks & integrations
Fire to Slack, Linear, or your CI on every completed rollout. Gate a merge on task success ≥ 80%, or auto-trigger a Synth run when the sim-to-real gap widens.
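
To make the metrics above concrete, here is a minimal sketch of how per-subtask survival, a per-subtask sim-to-real delta, and an 80% merge gate could be computed from raw rollout logs. It does not call the openbot client and is not Bench's implementation. The log format (the index of the first failed subtask per rollout, or None for a full success), every function name, and the toy data are assumptions made for illustration.

```python
from typing import List, Optional, Sequence

SUBTASKS = ["open_drawer", "pick_mug", "pour", "handover"]


def subtask_survival(first_failures: Sequence[Optional[int]], n_subtasks: int) -> List[float]:
    """Fraction of rollouts that completed each subtask.

    Each entry is the index of the first failed subtask in one rollout,
    or None if the rollout succeeded end to end (assumed log format).
    """
    total = len(first_failures)
    if total == 0:
        raise ValueError("need at least one rollout")
    survival = []
    for step in range(n_subtasks):
        survived = sum(1 for f in first_failures if f is None or f > step)
        survival.append(survived / total)
    return survival


def first_bottleneck(survival: Sequence[float]) -> int:
    """Index of the subtask with the largest drop in survival."""
    previous = 1.0
    drops = []
    for s in survival:
        drops.append(previous - s)
        previous = s
    return max(range(len(drops)), key=drops.__getitem__)


def sim_to_real_gap(sim: Sequence[float], real: Sequence[float]) -> List[float]:
    """Per-subtask delta, real minus sim (negative means real is worse)."""
    return [r - s for s, r in zip(sim, real)]


def passes_gate(task_success: float, threshold: float = 0.80) -> bool:
    """Merge gate: pass only if end-to-end success meets the threshold."""
    return task_success >= threshold


# Toy data, not real results: 10 paired rollouts in sim and on hardware.
sim_runs = [None] * 8 + [3, 2]
real_runs = [None] * 5 + [3, 3, 3, 2, 1]

sim_surv = subtask_survival(sim_runs, len(SUBTASKS))
real_surv = subtask_survival(real_runs, len(SUBTASKS))
gap = sim_to_real_gap(sim_surv, real_surv)

print("bottleneck on hardware:", SUBTASKS[first_bottleneck(real_surv)])
print("per-subtask gap:", dict(zip(SUBTASKS, gap)))
print("merge allowed:", passes_gate(real_surv[-1]))
```

Survival here is cumulative, so the last entry equals end-to-end task success. In a CI job, the boolean from `passes_gate` would typically be turned into the process exit code.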
Numbers that decide ship or cancel.
Not loss curves. Not validation accuracy. The metrics your customer sees when the policy runs on their hardware.
A report built for sign-off.
Task success, failure point, and next action in one view.
kitchen_handover · open_drawer → pick_mug → pour → handover
Subtask success
200 rollouts, real Franka
- open_drawer: 98%
- pick_mug: 91%
- pour: 82%
- handover: 60%
Success across 10 seeds
73% ± 5.2
Ship for drawer + pick + pour. Replay failed handovers in Synth, then re-run Bench.
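
For readers who want to reproduce a "success across seeds" figure from their own logs, the following is a rough sketch of one way to summarize per-seed success rates. The page does not say whether the ± value is a standard deviation, a standard error, or a confidence interval; this sketch assumes a sample standard deviation expressed in percentage points. The function name and the per-seed values are made up for illustration.

```python
import statistics
from typing import Dict, Sequence


def summarize_seeds(per_seed_success: Sequence[float]) -> Dict[str, float]:
    """Mean, spread, and worst case of task success across seeds."""
    if len(per_seed_success) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return {
        "mean": statistics.mean(per_seed_success),
        "std": statistics.stdev(per_seed_success),  # sample std (n - 1)
        "worst": min(per_seed_success),
    }


# Made-up success rates for 10 seeds, for illustration only.
seeds = [0.70, 0.75, 0.80, 0.65, 0.70, 0.75, 0.80, 0.70, 0.75, 0.70]
summary = summarize_seeds(seeds)

print(
    f"{summary['mean']:.0%} ± {100 * summary['std']:.1f} "
    f"(worst seed: {summary['worst']:.0%})"
)
```

Reporting the worst seed next to the mean keeps a single lucky run from hiding a weak one.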
Bring your policy, your robot, your stack.
Bring a checkpoint, get a verdict.
Read the Bench spec and run the open examples today. Request early access for managed rollouts.
