98 changes: 98 additions & 0 deletions content/blog/introducing-phail.md
@@ -0,0 +1,98 @@
Title: Introducing PhAIL: We Measured How Good Robot AI Actually Is. It's Not.
Date: 2026-03-31
Category: Blog
Slug: introducing-phail
Author: Positronic Team
Summary: We built PhAIL – the Physical AI Leaderboard – to measure how foundation models actually perform on real commercial robotics tasks. The best model runs at 5% of human throughput.
Image: theme/positronic/static/img/phail-leaderboard.png

The best robot AI models can fold laundry, sort objects, and assemble parts – in carefully tuned setups, fine-tuned for each specific task. The results are real and the progress is impressive.

But there's a gap nobody talks about. Existing benchmarks – both real-hardware and simulated – measure success rate on household tasks: folding dishcloths, arranging fruit, tidying a kitchen. They target the home-robot future. We think physical AI will gain traction somewhere else first: commercial and industrial applications where the same operation happens hundreds or thousands of times a day. The industry is paying attention – but when operations teams ask "how close are these models to actually working for us?", there's no answer. Nobody measures what matters to them: throughput, reliability, and consistency over hundreds of runs.

We built PhAIL – the Physical AI Leaderboard – to find out. The results are live at [phail.ai](https://phail.ai).

### The numbers

We evaluated four vision-language-action (VLA) models on bin-to-bin order picking: an operator places items in one tote, and a robot arm equipped with a gripper picks them one by one and places them into another. Same hardware (Franka FR3 arm with Robotiq 2F-85 gripper), same objects, same conditions. Hundreds of runs per model. Blind evaluation – the operator doesn't know which model is running.

We chose metrics that matter to anyone deciding whether to deploy a robot in production. Units Per Hour (UPH) tells you how fast the robot works – if you're running a fulfilment line, this is the number that determines whether automation pays for itself. Mean Time Between Failures or Assists (MTBF/A) tells you how long the robot runs before a human needs to step in – if it's every few minutes, you haven't saved any labour, you've just added a babysitter.
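For concreteness, here is how these two numbers fall out of raw episode logs – a minimal Python sketch with field names of our own invention, not PhAIL's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    items_moved: int    # units successfully transferred
    duration_s: float   # wall-clock episode time in seconds
    interventions: int  # failures plus operator assists

def uph(episodes: list[Episode]) -> float:
    """Units Per Hour: total units moved per hour of robot run time."""
    total_items = sum(e.items_moved for e in episodes)
    total_hours = sum(e.duration_s for e in episodes) / 3600
    return total_items / total_hours

def mtbfa_min(episodes: list[Episode]) -> float:
    """Mean Time Between Failures or Assists, in minutes."""
    total_minutes = sum(e.duration_s for e in episodes) / 60
    total_events = sum(e.interventions for e in episodes)
    return total_minutes / total_events

# Two hypothetical 4-minute episodes, one intervention each:
runs = [Episode(4, 240, 1), Episode(5, 240, 1)]
print(uph(runs), mtbfa_min(runs))  # 67.5 UPH, 4.0 min between events
```

The aggregation is deliberately boring – the hard part is running enough real episodes for these averages to mean something.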

| | UPH | Completion | MTBF/A |
|---|---|---|---|
| Human (hands) | 1,331 | 100% | – |
| Human (teleoperating the robot) | 330 | 100% | – |
| OpenPI (π0.5) | 64 | 48.5% | 4.0 min |
| GR00T N1.6 (NVIDIA) | 60 | 47.7% | 3.9 min |
| ACT (LeRobot) | 44 | 35.0% | 3.6 min |
| SmolVLA (HuggingFace) | 18 | 10.8% | 2.7 min |

The best AI model runs at **5% of human throughput**. In a typical episode with 8 items and a 4-minute time limit, it moves about half of them before time runs out. It needs operator intervention every four minutes.

Three of the four models – OpenPI, GR00T, and SmolVLA – are fine-tuned versions of large pre-trained VLA models. ACT is trained from scratch on our dataset. None of them are close to production-ready on this task.

### Why this matters

These aren't bad models. OpenPI and GR00T are publicly available VLA models from Physical Intelligence and NVIDIA. They were fine-tuned on our task-specific dataset using published training recipes. The gap isn't a failure of any single team – it's where the entire field stands today when you measure with production metrics.

There are datasets and simulated benchmarks in robotics. What's been missing is rigorous, independent measurement on real hardware with the metrics that matter for deployment.

### How PhAIL works

PhAIL is designed to be obvious. There's no trick.

An operator places items in the inbound tote, sets the item count in the interface, and hits start. The robot arm and gripper run the model autonomously. Each episode has a time limit of 30 seconds per item. When the model finishes – or time runs out – the operator records how many items were successfully moved. If the robot does something unsafe (we've seen it try to place one tote inside another, or push totes off the table), the operator triggers a safety stop. Every run is recorded with synchronized video and robot telemetry, so any result can be independently verified.
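In pseudocode terms, each episode is just a loop with a hard deadline and a safety escape hatch – a sketch under our own naming, not the actual PhAIL harness:

```python
import time

def run_episode(model_step, item_count: int, per_item_limit_s: float = 30.0) -> dict:
    """One PhAIL-style episode. `model_step` stands in for one control
    cycle of the policy under test; it reports what just happened."""
    deadline = time.monotonic() + per_item_limit_s * item_count
    moved = 0
    while moved < item_count and time.monotonic() < deadline:
        status = model_step()  # 'placed', 'running', or 'unsafe'
        if status == "unsafe":
            # Operator-triggered safety stop ends the episode immediately.
            return {"moved": moved, "safety_stop": True}
        if status == "placed":
            moved += 1
    return {"moved": moved, "safety_stop": False}
```

An 8-item episode therefore gets a 4-minute budget, which is where the "about half the items before time runs out" figure above comes from.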

Evaluation is blind: model checkpoints are rotated randomly, and the operator does not know which model is running. This eliminates unconscious bias in how episodes are initiated or scored.
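Blinding doesn't require anything elaborate. One way to do it – our own illustration, not PhAIL's actual scheduler – is to shuffle checkpoints behind opaque run IDs and reveal the mapping only after scoring:

```python
import random
import uuid

def schedule_runs(checkpoints: list[str], runs_per_model: int, seed: int = 0):
    """Shuffle (checkpoint, run) pairs; the operator sees only run IDs."""
    rng = random.Random(seed)
    runs = [(ckpt, str(uuid.uuid4()))
            for ckpt in checkpoints
            for _ in range(runs_per_model)]
    rng.shuffle(runs)
    operator_view = [run_id for _, run_id in runs]        # shown to operator
    answer_key = {run_id: ckpt for ckpt, run_id in runs}  # opened after scoring
    return operator_view, answer_key
```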

We publish the fine-tuning dataset, training scripts, and full methodology in our [white paper](https://phail.ai/whitepaper.pdf). The hardware – Franka FR3 arm with Robotiq 2F-85 gripper in the DROID configuration – is widely available and reproducible.

### Why bin-to-bin picking?

When we set out to build a benchmark for physical AI, we wanted to measure what matters for industrial deployment. Not a research task designed to stress-test a model's generalization – a real operation that companies actually need automated.

Bin-to-bin order picking is one of the most common tasks in warehousing, fulfilment, and manufacturing. Millions of these operations happen daily. It's the kind of work where automation has clear economic value – and where the gap between "works in a demo" and "works at production scale" is most visible.

This is the first task. PhAIL is designed to grow: more tasks, more hardware configurations, more object categories. Bin-to-bin picking is where we start because it's commercially important, physically straightforward, and exposes exactly the reliability and throughput gaps that matter.

### What we observed

Beyond the headline numbers, three patterns stood out.

#### More data helps – and it's not enough

We evaluate on four object types: wooden spoons, towels, scissors, and batteries. Performance varies dramatically across them – and it tracks almost perfectly with how much training data we collected for each.

| Object | Training duration | Avg completion | Avg UPH |
|---|---|---|---|
| Wooden spoons | 340 min | 50.3% | 71.0 |
| Towels | 178 min | 38.4% | 48.2 |
| Scissors | 160 min | 25.9% | 32.4 |
| Batteries | 134 min | 27.3% | 34.8 |

The correlation between training data volume and model performance is strong (r = 0.96 for UPH). The task with the most demonstrations (wooden spoons, nearly 6 hours of teleoperation data) performs best across every model. The tasks with the least data (batteries and scissors, around 2.5 hours each) perform worst.
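That correlation can be checked directly against the table above:

```python
# Per-object training data (minutes) and average UPH from the table.
minutes = [340, 178, 160, 134]   # spoons, towels, scissors, batteries
avg_uph = [71.0, 48.2, 32.4, 34.8]

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(round(pearson(minutes, avg_uph), 2))  # → 0.96
```

Four points is a small sample, so treat the exact value lightly – the direction of the effect is the robust part.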

But even the best-performing task tops out at 50% completion. These models are both data-hungry and data-starved – and simply collecting more demonstrations may not be enough to close the gap.

#### Small changes in setup cause large drops in performance

Between evaluation runs, we vary two things: which side of the workspace the outbound tote is placed on, and which side the external camera is mounted on. Same robot, same task, same objects – just a different spatial arrangement.

The models notice. Tote-left is consistently easier than tote-right across all models, likely because tote-right requires a slightly longer reach and a different approach angle, leading to more collisions. But the more interesting finding is about camera placement.

When the external camera is on the same side as the outbound tote, it has a clear view of both totes. When they're on opposite sides, the inbound tote occludes the outbound one from the camera's perspective. GR00T's completion rate drops by 22 percentage points in the occluded configuration – it relies heavily on the external camera and can't compensate with the wrist camera alone. OpenPI handles the occlusion much better, suggesting more robust visual processing.

A human operator wouldn't notice these changes. For current VLA models, they're the difference between half-working and barely working.


### What's next

The leaderboard is open. If your model can do better – with your own architecture, your own fine-tuning, or your own data – [submit a checkpoint](https://phail.ai/submit) and prove it on the same hardware.

PhAIL launches with one task and four models. More tasks, more hardware, and more models are coming. We're also planning to test generalization to unseen objects – currently all items appear in the training dataset, but real deployments face new objects constantly. The consortium – including Nebius as founding compute partner and Toloka as data partner – is actively welcoming new members.

- Full methodology: [white paper](https://phail.ai/whitepaper.pdf)
- Live results: [phail.ai](https://phail.ai)
- [Fine-tuning dataset](https://phail.ai/about) (352 episodes, 12GB)
- Positronic – the platform powering PhAIL: [GitHub](https://github.com/Positronic-Robotics/positronic)
- Submit your model: [phail.ai/submit](https://phail.ai/submit)
64 changes: 64 additions & 0 deletions content/blog/phail-launch-press-release.md
@@ -0,0 +1,64 @@
Title: Is Physical AI Ready? A New Leaderboard Measures Whether Robots Can Actually Do Real Work
Date: 2026-03-31
Category: Blog
Slug: phail-launch
URL: press/phail-launch.html
Save_as: press/phail-launch.html
Author: Positronic Team
Summary: Positronic Robotics launches PhAIL (Physical AI Leaderboard) – an ongoing, real-world benchmark evaluating robotics foundation models on commercial tasks using throughput and reliability metrics.
Image: theme/positronic/static/img/phail-press-card.png

*San Francisco, CA - March 31, 2026* - Positronic Robotics, a company helping developers make robots work with AI, launches PhAIL (Physical AI Leaderboard) - an ongoing, real-world benchmark evaluating robotics foundation models on commercial tasks using throughput and reliability metrics.

PhAIL evaluates models on physical robotic setups performing commercially relevant operations, starting with bin-to-bin order picking - one of the most common tasks in logistics and industrial automation. In this task, items are transferred one at a time from an inbound container to an outbound container. The current evaluation rig uses a Franka Research 3 robotic arm paired with a Robotiq 2F-85 gripper (DROID-style configuration), a reproducible and widely used research platform.

The leaderboard itself is hardware-agnostic, and additional robotic embodiments will be added beginning in Q2 2026 to reflect the diversity of hardware used in real-world deployments. Bin-to-bin picking is only the starting point: the goal of the benchmark is to measure how well AI models perform on repetitive, economically important operations that occur thousands of times per day in real facilities.

Instead of reporting academic "success rate," PhAIL measures Units Per Hour (UPH) and Mean Time Between Failures or Assists (MTBF/A).

Each run is executed on real hardware, not in simulation. Model checkpoints are selected randomly and evaluated in blinded conditions. Every run is logged and published with synchronized video, robot telemetry, station metadata, and scoring artifacts.

Physical AI has advanced rapidly in recent years, with foundation models capable of handling increasingly diverse manipulation tasks. But most benchmarks still rely on simulation or controlled laboratory conditions, and many public evaluations emphasize curated demo videos rather than sustained operation. For industrial deployment, two variables dominate: throughput and reliability.

PhAIL measures both directly. Models perform multiple timed runs on real hardware, each recorded with synchronized video, telemetry, and scoring artifacts. From these runs, PhAIL computes Units Per Hour and Mean Time Between Failures or Assists – the same metrics an operations manager would use to evaluate a deployment. The protocol is fully documented in the PhAIL white paper.

"We all dream about a robot that folds our laundry – but that's a task that happens once a day. In factories and logistics, the same operation runs hundreds of times per shift and most of those still aren't solved," said Sergey Arkhangelskiy, founder of Positronic Robotics. "Physical AI needs to prove itself there first, and PhAIL is how we measure whether it can."

In the inaugural evaluations, four models were fine-tuned and tested: OpenPI 0.5 (Physical Intelligence), GR00T (NVIDIA), SmolVLA (HuggingFace/LeRobot), and ACT (LeRobot) – as well as teleoperated and human baselines. The results show a measurable gap between current foundation models and human-level performance in both throughput and reliability on commercial picking tasks. Positronic frames it as calibration: a transparent baseline that allows progress to be measured consistently over time. As new models are released, they can be evaluated under the same protocol, creating a continuous, comparable record of performance.

PhAIL targets three structural issues in the Physical AI ecosystem:

1. **Lack of objective measurement of commercial readiness.** Most public metrics do not reflect factory-floor constraints.

2. **Unclear ROI signals for operators.** Success rates do not translate directly into deployment decisions.

3. **A broken feedback loop for model builders.** Without standardized, auditable benchmarks, it is difficult to iterate toward real-world reliability.

By publishing synchronized video, logs, firmware versions, hardware configuration, and scoring artifacts for every run, PhAIL emphasizes auditability and reproducibility.

PhAIL launches as a governed consortium rather than as a proprietary product. Nebius, the AI cloud company that provides the cloud foundation for the full robotics lifecycle, joins as a founding consortium partner, and Toloka participates as a data partner supporting evaluation processes. The benchmark is intended as a shared industry yardstick, not as a competitive marketing vehicle.

"Scaling physical AI requires a clear, shared standard for production readiness," said Evan Helda, Head of Physical AI at Nebius. "With no established blueprint for deploying these systems at scale, the PhAIL Leaderboard delivers an important benchmark grounded in real-world performance data—bringing greater transparency to what's ready for deployment. Nebius is committed to accelerating physical AI development across the ecosystem. Through our participation in the PhAIL consortium, we're proud to help advance the next phase of commercial robotics alongside industry partners."

The PhAIL dataset and fine-tuning scripts are publicly available. Model builders can fine-tune their systems and submit checkpoints for evaluation. Hardware vendors can validate model performance across embodiments. Operators can review published artifacts directly.

The leaderboard is live at [phail.ai](https://phail.ai), alongside the full evaluation protocol.

## About Positronic Robotics

Positronic Robotics builds tools that help developers turn AI models into working robot behaviors. The company's platform lets engineers pick a model, connect it to a robot, and run real-world tasks — without having to build the robotics infrastructure from scratch.

Alongside its developer toolkit, Positronic operates PhAIL (Physical AI Leaderboard), an independent benchmark that evaluates how robotics foundation models perform on commercially relevant tasks. While many tools help train robotics models, Positronic focuses on the next step: making robots actually work in the real world.

The company is led by Sergey Arkhangelskiy, former Staff Software Engineer in Search Ranking at Google and CEO and co-founder of WANNA, the AR company acquired by Farfetch.

## About Nebius

Nebius, the AI cloud company, is building the full-stack platform for developers and companies to take charge of their AI future — from data and model training to production deployment. Founded on deep in-house technological expertise and operating at scale with a rapidly expanding global footprint, Nebius serves startups and enterprises building AI products, agents and services worldwide.

Nebius is listed on Nasdaq (NASDAQ: NBIS) and headquartered in Amsterdam.

For more information, please visit [www.nebius.com](https://www.nebius.com)

**Media Contacts**
Nebius: [media@nebius.com](mailto:media@nebius.com)
14 changes: 9 additions & 5 deletions content/pages/index.md
@@ -14,15 +14,15 @@ Not success rates in simulation. Real results on real hardware.
Models on the leaderboard include OpenPI 0.5 (Physical Intelligence), GR00T and DreamZero (NVIDIA), and SmolVLA (HuggingFace) – tested alongside human and teleoperated baselines.

<div style="display: flex; flex-wrap: wrap; justify-content: center; gap: 1.2rem; margin: 3em 0;">
<span class="btn-primary"
<a href="https://phail.ai" class="btn-primary"
style="display: inline-flex; flex-direction: column; align-items: center; gap: 0.3rem;
padding: 1em 2.5em; background: rgb(177, 231, 79); color: #020617;
text-decoration: none; border-radius: 999px; font-weight: 600; font-size: 1.1em;
box-shadow: 0 0 0 1px rgba(17, 24, 39, 0.9), 0 10px 24px rgba(177, 231, 79, 0.35);
transition: all 0.2s ease; cursor: default;">
<span style="font-size: 1em;">Leaderboard Launches March 31</span>
transition: all 0.2s ease;">
<span style="font-size: 1em;">View the Leaderboard</span>
<span style="font-size: 0.82em; opacity: 0.85;">Real models. Real robots. Real metrics.</span>
</span>
</a>
<a href="https://github.com/Positronic-Robotics/positronic" class="btn-github"
style="display: inline-flex; flex-direction: column; align-items: center; gap: 0.3rem;
padding: 1em 2.5em; background: #ffffff; color: #24292f;
@@ -39,6 +39,10 @@ Models on the leaderboard include OpenPI 0.5 (Physical Intelligence), GR00T and
</div>

<style>
.btn-primary:hover {
transform: translateY(-2px) !important;
box-shadow: 0 0 0 1px rgba(17, 24, 39, 0.9), 0 14px 32px rgba(177, 231, 79, 0.45) !important;
}
.btn-github:hover {
transform: translateY(-2px) !important;
box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15), 0 12px 28px rgba(0, 0, 0, 0.12) !important;
@@ -75,7 +79,7 @@ The teams that deploy physical AI at scale will need more than a trained model.

### Get involved

- **PhAIL launches March 31** – the full leaderboard, methodology, and evaluation data.
- **[PhAIL is live](https://phail.ai)** – the full leaderboard, methodology, and evaluation data.
- **[Star on GitHub](https://github.com/Positronic-Robotics/positronic)** – the open-source infrastructure behind PhAIL.
- **[Join Discord](https://discord.gg/PXvBy4NBgv)** – questions, discussion, feature requests.
- Email: **[hi@positronic.ro](mailto:hi@positronic.ro)**
50 changes: 47 additions & 3 deletions theme/positronic/static/css/style.css
@@ -1,6 +1,6 @@
:root {
--bg: #111111;
--text: #f0f0f0;
--text: #b8b8b8;
--accent: #b1e74f;
--accent-soft: rgba(177, 231, 79, 0.18);

@@ -141,6 +141,14 @@ h1 {
width: 100%;
}

h2 {
margin: 2rem 0 0.5rem;
font-size: 26px;
line-height: 1.3;
font-weight: 700;
color: var(--accent);
}

h3 {
margin: 1.5rem 0 0.5rem;
font-size: 20px;
@@ -149,6 +157,14 @@ h3 {
color: var(--accent);
}

h4 {
margin: 1.2rem 0 0.4rem;
font-size: 17px;
line-height: 1.3;
font-weight: 600;
color: var(--accent);
}

p {
margin: 0;
color: var(--text);
@@ -208,6 +224,34 @@ pre code {
color: inherit;
}

table {
width: 100%;
border-collapse: collapse;
margin: 1.5rem 0;
font-size: 0.95rem;
}

th {
text-align: left;
padding: 0.6rem 1rem;
border-bottom: 2px solid var(--border-subtle);
color: var(--accent);
font-weight: 600;
font-size: 0.85rem;
text-transform: uppercase;
letter-spacing: 0.05em;
}

td {
padding: 0.5rem 1rem;
border-bottom: 1px solid var(--border-subtle);
color: var(--text);
}

tr:last-child td {
border-bottom: none;
}

a {
color: var(--text);
font-weight: 600;
@@ -221,15 +265,15 @@ a:hover {
}

.page-content ul li strong {
color: #f0f0f0;
color: var(--text);
}

.page-content ul li strong a {
color: #b0b0b0;
}

.page-content ul li strong a:hover {
color: #f0f0f0;
color: var(--text);
}

/* Navbar */
Binary file added theme/positronic/static/img/phail-press-card.png
2 changes: 1 addition & 1 deletion theme/positronic/templates/article.html
@@ -18,7 +18,7 @@ <h1>{{ article.title }}</h1>

{% if article.image %}
<div class="article-image">
<img src="{{ article.image }}" alt="{{ article.title }}" />
<img src="{{ SITEURL }}/{{ article.image }}" alt="{{ article.title }}" />
</div>
{% endif %}
