Product notes for the trust layer.
Practical writeups from building OpenBot: policy evaluation, teleop data quality, sim-to-real debugging, API design, and the product decisions behind the stack.
2026
- Deep dive · May 12, 2026
Per-step survival: a more honest metric for VLA evaluation
Aggregate task-success rate hides which subtask a policy actually fails on, and per-frame mAP misses multi-second dropouts. We propose a per-step survival metric for long-horizon manipulation, validated on 14 tasks across four embodiments, and show that it changes the ranking of three popular VLAs.
OpenBot team
- Deep dive · Apr 28, 2026
Closing the sim-to-real gap on long-horizon manipulation
A controlled study of which domain-randomization axes most reduce the sim-to-real gap on a 4-step kitchen handover. Friction and gripper compliance close 60% of the gap; lighting and texture close only 11%. We argue many published randomization recipes are overfit to the demo task.
OpenBot team
- Engineering note · Apr 10, 2026
Real2sim with 3D Gaussian Splatting on a Franka
Rebuilding a real failure case in simulation by Gaussian-splatting the scene and replaying contact dynamics. We compare three contact-modeling choices and report the wall-clock budget to make a single real failure replayable.
OpenBot team
- Field note · Mar 22, 2026
A cross-embodiment study of OpenVLA-7B
We rolled out OpenVLA-7B on Franka, UR5e, xArm-7, and an ALOHA-style bimanual rig, across six tasks with ten seeds each. Task success ranges from 0.71 (Franka) to 0.34 (ALOHA). We break down where the variance comes from and what it implies for zero-shot embodiment transfer.
OpenBot team
- Engineering note · Mar 4, 2026
When does teleop dedup actually help?
Naive behavioral-hash deduplication can hurt performance by removing instructive variance. We study three datasets and find the sweet spot is task-dependent: pick-and-place tolerates aggressive dedup; insertion does not.
OpenBot team
- Engineering note · Feb 18, 2026
Intervention rate as a leading indicator of deployment risk
We track the human-intervention rate during shadow deployments of three policies across four warehouse sites. The metric forecasts customer-cancellable incidents 9 days ahead of the rolling task-success average. We propose it as a default metric in acceptance reports.
OpenBot team
