OpenBot Bench · Mock runner

Exercise the Bench contract with deterministic mock results.

The API creates runs, queues work, and returns stable report-shaped fixtures. It does not execute a robot, simulator, GPU job, or real policy evaluation.

Read the Bench API

Mock: runner state
JSON: report artifact
0: real evaluators

bench_eval.py

from openbot import Client

ob = Client()                       # reads OPENBOT_API_KEY

# 4-step kitchen handover on a Franka Panda
run = ob.bench.rollout(
    policy="openvla-7b",
    embodiment="franka_panda",
    task="open_drawer → pick_mug → pour → handover",
    rollouts=200,
    seeds=10,
    sim="isaac_sim",                # label only; no simulator runs
    real_hw=True,                   # metadata flag; no robot runs
    edge_target="jetson_orin",      # label only; no edge device runs
)

result = run.wait()                 # poll until the run finishes
print(result.task_success)          # e.g. 0.73
print(result.subtask["handover"])   # e.g. 0.60  ← bottleneck
print(result.sim_to_real_gap)       # e.g. -0.29
print(result.intervention_rate)     # e.g. 0.14

Current contract

What the mock runner can verify.

Use it to integrate the run lifecycle and the report schema, not to judge a model.

01
Run lifecycle contract
Create, queue, poll, and retrieve a stable terminal result.
02
Deterministic fixtures
The same request always produces the same mock metrics payload.
03
Input provenance
Preserve policy, embodiment, task, rollout, seed, and target labels.
04
Schema rehearsal
Exercise edge-metric and subtask fields before a real runner exists.
05
JSON report artifacts
Retrieve mock report and failure-cluster JSON through authenticated routes.
06
Terminal delivery path
Exercise the wired webhook lifecycle without treating mock metrics as evidence.
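
The create, queue, poll, retrieve sequence from the list above can be rehearsed without any network access. The sketch below is a local stand-in, not the openbot client: the `FakeRun` class, the state names (`queued`, `running`, `succeeded`, `failed`, `cancelled`), and the `wait_for_terminal` helper are all hypothetical and only illustrate the polling pattern.

```python
import time
from dataclasses import dataclass, field

# Hypothetical state names; the real Bench API may use different ones.
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


@dataclass
class FakeRun:
    """Local stand-in for a Bench run; advances one state per poll."""
    run_id: str
    _states: list = field(default_factory=lambda: ["queued", "running", "succeeded"])
    _cursor: int = 0

    def poll(self) -> str:
        # Stay on the last state once the scripted sequence is exhausted.
        state = self._states[min(self._cursor, len(self._states) - 1)]
        self._cursor += 1
        return state


def wait_for_terminal(run, interval_s=0.0, max_polls=50):
    """Poll until the run reports a terminal state, or give up."""
    seen = []
    for _ in range(max_polls):
        state = run.poll()
        seen.append(state)
        if state in TERMINAL_STATES:
            return state, seen
        time.sleep(interval_s)
    raise TimeoutError(f"run {run.run_id} not terminal after {max_polls} polls")


final_state, history = wait_for_terminal(FakeRun("run_example"))
print(final_state, history)
```

Bounding the loop with `max_polls` keeps an integration test from hanging if a run never reaches a terminal state.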

A deterministic contract fixture.

These values are generated by deterministic_mock_v1. They test the report shape; they do not evaluate a policy.
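
One common way to guarantee "same request, same payload" is to seed a random generator from a hash of the canonicalized request. The sketch below shows that general technique only; it is not the implementation of deterministic_mock_v1, and the `mock_metrics` function name, the chosen metric keys, and the value ranges are assumptions made for illustration.

```python
import hashlib
import json
import random


def mock_metrics(request: dict) -> dict:
    """Derive a stable, fake metrics payload from the request contents."""
    # Canonical JSON: sorted keys and fixed separators, so key order does not matter.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    # A private Random instance keeps the global RNG untouched.
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return {
        "generator": "sketch_not_deterministic_mock_v1",  # hypothetical label
        "task_success": round(rng.uniform(0.5, 0.9), 2),
        "intervention_rate": round(rng.uniform(0.05, 0.25), 2),
    }


req = {"policy": "openvla-7b", "embodiment": "franka_panda", "rollouts": 200, "seeds": 10}
reordered = dict(reversed(list(req.items())))
print(mock_metrics(req) == mock_metrics(reordered))  # True
```

Because the numbers come from the request hash rather than from any rollout, they can exercise a report schema but carry no information about the policy.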

deterministic_mock_v1 · kitchen_handover · illustrative output · not measured

run_8c91a4 · policy: openvla-7b · embodiment: franka_panda · 200 rollouts × 10 seeds

kitchen_handover · open_drawer → pick_mug → pour → handover

Mock result

Task success

73% · +8 pp

Sim→Real gap closed

−29 pp · +12 pp

Intervention rate

14% · −6 pp

Mean time-to-success

18.4 s · −2.1 s

Subtask success

request metadata · real_hw flag: true

open_drawer: 98%
pick_mug: 91%
pour: 82%
handover: 60%

Success across 10 seeds

73% ± 5.2

Worst: seed 3 · 65%
Best: seed 2 · 80%

A real runner could use this report shape to flag the handover step. This fixture cannot support a ship verdict.
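
Flagging a weak step from a report of this shape amounts to taking the minimum over the subtask success rates. Here is a minimal sketch using the fixture values from the mock result above; the `weakest_subtask` helper is hypothetical, and the plain dict stands in for whatever structure the report's subtask field actually uses.

```python
def weakest_subtask(subtasks: dict[str, float]) -> tuple[str, float]:
    """Return the (name, success_rate) pair with the lowest success rate."""
    if not subtasks:
        raise ValueError("report contains no subtasks")
    name = min(subtasks, key=subtasks.get)
    return name, subtasks[name]


# Mock fixture values from the report above; they are not measurements.
fixture_subtasks = {
    "open_drawer": 0.98,
    "pick_mug": 0.91,
    "pour": 0.82,
    "handover": 0.60,
}

print(weakest_subtask(fixture_subtasks))  # ('handover', 0.6)
```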

Not a policy verdict

Request schema

Fields accepted by the mock contract.

Request identity

policy · policy_uri · embodiment · task

Run controls

rollouts · seeds · sim label · real_hw flag · edge_target

Mock outputs

metrics JSON · subtasks · failure clusters · report URL
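
For client-side checks before submitting, the request fields listed above can be mirrored in a small typed structure. This is a sketch under assumptions: the page lists field names only, so the field types, the defaults, the optional `policy_uri`, and the validation rules in the hypothetical `BenchRequest` class are guesses, not the Bench schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class BenchRequest:
    # Request identity
    policy: str
    embodiment: str
    task: str
    policy_uri: Optional[str] = None   # assumed optional
    # Run controls (types and defaults assumed)
    rollouts: int = 1
    seeds: int = 1
    sim: Optional[str] = None          # a label; the mock runs no simulator
    real_hw: bool = False              # a flag; the mock runs no robot
    edge_target: Optional[str] = None

    def __post_init__(self):
        for name in ("policy", "embodiment", "task"):
            if not getattr(self, name):
                raise ValueError(f"{name} must be a non-empty string")
        if self.rollouts < 1 or self.seeds < 1:
            raise ValueError("rollouts and seeds must be positive integers")


req = BenchRequest(
    policy="openvla-7b",
    embodiment="franka_panda",
    task="open_drawer → pick_mug → pour → handover",
    rollouts=200,
    seeds=10,
    sim="isaac_sim",
    real_hw=True,
    edge_target="jetson_orin",
)
print(asdict(req)["rollouts"])  # 200
```

Keeping the request immutable (`frozen=True`) makes it easy to compare the labels that were sent with the labels echoed back in a report.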

Integrate the contract without mistaking it for evaluation.

Read the API recipe or tell us what a future real runner must prove.

Request Bench review · Read Bench API
