Key Takeaways
- Holo3 scores 78.85% on the OSWorld-Verified benchmark, outperforming proprietary models from OpenAI and Anthropic despite having 10x fewer active parameters.
- The model runs on Apache 2.0 open-source weights—any developer can download it, run it locally, modify it, and deploy agents without paying API fees or requesting permission.
- H Company trained Holo3 using a specialized agentic flywheel: synthetic navigation data, out-of-domain augmentation, and curated reinforcement learning optimized for real enterprise workflows.
- Computer use as an AI capability is shifting from proprietary-only to open—and this milestone signals that open models will eventually win on all agent tasks, not just chat.
What does it mean for an AI to "use a computer"?
AI can see your desktop, understand what's there, and click or type—not just talk about it. Models execute workflows inside real applications like you would.
More specifically, an AI that uses computers can see your desktop screen, understand what it's looking at, and take actions like a human would—clicking buttons, typing text, navigating apps, and executing multi-step workflows autonomously. Unlike traditional AI assistants that just talk to you, computer-use models execute commands directly inside the applications you actually use.
Today, a lawyer asks an AI to search contract terms across 1,000 PDFs. Instead of the AI describing how to do it, a computer-use model actually opens the PDFs, extracts the information, and returns the results. A financial analyst asks for a scenario: "Show me revenue impact if we cut costs by 15%." The model opens Excel, builds the model, runs the scenario, and delivers the answer. This automation compounds across enterprises: fewer manual data entry hours, faster analysis cycles, fewer handoffs between teams.
The technical challenge is hard: the model must simultaneously understand vision (seeing pixels on a desktop), language (parsing user intent), and action (predicting the correct sequence of clicks and keystrokes). Most attempts fail because pixels alone aren't enough—the model needs to reason about cause and effect.
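The vision-language-action loop described above can be sketched as a minimal agent loop. Everything here is illustrative: `predict_action` is a stub standing in for the model, and the coordinates and action sequence are made up.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def predict_action(screenshot: bytes, goal: str, history: list) -> Action:
    """Stand-in for the model: maps (pixels, user intent, past actions)
    to the next action. A real computer-use model replaces this stub."""
    if not history:
        return Action("click", x=120, y=48)   # e.g. focus the search box
    if history[-1].kind == "click":
        return Action("type", text=goal)      # e.g. type the query
    return Action("done")

def run_agent(goal: str, max_steps: int = 10) -> list:
    """The perceive-reason-act loop every computer-use agent runs."""
    history = []
    for _ in range(max_steps):
        screenshot = b"\x89PNG..."            # placeholder: capture the desktop here
        action = predict_action(screenshot, goal, history)
        if action.kind == "done":
            break
        history.append(action)                # execute the click/keystroke here
    return history

trace = run_agent("quarterly revenue report")
print([a.kind for a in trace])   # ['click', 'type']
```

The loop's structure, not the stub, is the point: each step conditions on the full pixel state and action history, which is where the cause-and-effect reasoning has to live.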
What exactly did Holo3 achieve—and who did it beat?
Holo3 scored 78.85% on OSWorld-Verified, outscoring GPT-5.4 and Claude Opus 4.6 while using 10x fewer active parameters and no proprietary infrastructure.
Holo3 scored 78.85% on OSWorld-Verified, a benchmark designed by researchers to test exactly this capability. The benchmark evaluates agent performance against real desktop environments (Ubuntu, Windows, macOS), real applications (web browsers, office software, collaboration tools), and multi-app workflows that mirror actual work.
By comparison, humans accomplish 72.36% of OSWorld tasks, meaning Holo3 exceeds human performance on this specific test. OpenAI's GPT-5.4 and Anthropic's Opus 4.6, the previous state-of-the-art models, both score lower than Holo3. Yet here's the kicker: Holo3 uses only 10 billion active parameters (122 billion total), while the models it beat are an order of magnitude larger. To put that plainly: a model with roughly one-tenth the active parameters outperforms the biggest closed models on this task.
| Model | OSWorld-Verified Score | Active Parameters | Total Parameters | Weights Available |
|---|---|---|---|---|
| Holo3 | 78.85% | 10B | 122B | Open Source (Apache 2.0) |
| GPT-5.4 | <78.85% | ~175B (est.) | ~350B+ (est.) | Proprietary API only |
| Claude Opus 4.6 | <78.85% | ~137B (est.) | ~274B+ (est.) | Proprietary API only |
| Human Baseline | 72.36% | N/A | N/A | N/A |
The comparison matters because model size directly correlates with infrastructure cost. Running GPT-5.4 requires enterprise GPU clusters and high per-token API fees. Holo3's efficiency means a startup can run the same capability on mid-range hardware, deploy it in-house, and own the entire automation system.
How does the training method behind Holo3 actually work?
H Company built an "agentic flywheel": synthetic data teaches the basics, augmentation handles edge cases, and reinforcement learning keeps the signal clean. That's efficient learning, not just bigger models.
H Company didn't just scale up a standard language model and add computer use. They built what they call an "agentic learning flywheel"—a specialized training pipeline that teaches models to perceive, reason, and act on desktop environments.
The flywheel has three stages. First: synthetic navigation data. H Company generates scenario-specific examples in which humans and AI create realistic navigation instructions for desktop tasks. Instead of waiting for real-world data, they synthesize thousands of task variations: "Navigate to this PDF, extract this table, save it as CSV." This data teaches the base model the fundamentals.
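One common way to synthesize task variations is template expansion. H Company's actual pipeline is not public, so this is a sketch: the template reuses the article's example instruction, and the slot values are invented.

```python
import itertools
import json

# Hypothetical template and slot values; only the instruction pattern
# comes from the article itself.
TEMPLATE = "Navigate to {source}, extract {target}, save it as {fmt}"
SOURCES = ["this PDF", "the open spreadsheet", "the invoices folder"]
TARGETS = ["this table", "the totals column", "every email address"]
FORMATS = ["CSV", "JSON", "XLSX"]

def synthesize_tasks():
    """Cross the slot values to turn one template into many instructions."""
    for source, target, fmt in itertools.product(SOURCES, TARGETS, FORMATS):
        yield {"instruction": TEMPLATE.format(source=source, target=target, fmt=fmt)}

tasks = list(synthesize_tasks())
print(len(tasks))                 # 27 variations from a single template
print(json.dumps(tasks[0]))
```

Three slots with three values each already yield 27 instructions; real pipelines add many more templates and slots, which is how "thousands of task variations" become cheap to produce.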
Second: out-of-domain augmentation. Real desktops throw edge cases at users constantly—unusual button placements, outdated UI layouts, language variations. H Company programmatically generates augmented versions of training scenarios so the model learns to handle the unexpected. A model trained only on clean examples fails when the interface looks slightly different.
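Out-of-domain augmentation can be sketched as seeded random perturbation of a clean training scenario. The perturbation choices below (button offsets, theme, locale) mirror the edge cases the paragraph names but are otherwise assumptions, not H Company's actual pipeline.

```python
import random

def augment_scenario(scenario: dict, seed: int) -> dict:
    """Programmatically perturb a clean scenario so the model trains on
    off-nominal interfaces instead of only pristine ones."""
    rng = random.Random(seed)                 # seeded, so augmentation is reproducible
    out = dict(scenario)
    out["button_x"] = scenario["button_x"] + rng.randint(-40, 40)   # unusual placement
    out["button_y"] = scenario["button_y"] + rng.randint(-40, 40)
    out["theme"] = rng.choice(["default", "dark", "legacy"])        # outdated layouts
    out["locale"] = rng.choice(["en", "de", "ja"])                  # language variation
    return out

clean = {"button_x": 200, "button_y": 100, "theme": "default", "locale": "en"}
variants = [augment_scenario(clean, seed=s) for s in range(5)]
print(len(variants))   # 5 perturbed copies of one clean scenario
```

Each clean example fans out into many perturbed copies, which is what lets the model generalize when "the interface looks slightly different."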
Third: curated reinforcement learning. Every data sample passes through an aggressive filtering pipeline before feeding into RL feedback loops. The model learns not from sheer data volume but from a high-quality signal. If a model practices the wrong solution to a task 100 times, that corrupts the learning. H Company's curation keeps the training signal clean.
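Curation before RL can be sketched as a trajectory filter. The article only says the filtering is aggressive; the specific criteria here (success, length, repetition) are illustrative assumptions.

```python
def curate(samples):
    """Keep only successful, reasonably short, non-repetitive trajectories
    so the RL loop never reinforces a wrong or degenerate solution."""
    kept = []
    for s in samples:
        if not s["task_succeeded"]:
            continue                  # failed attempts would reward the wrong behavior
        if s["steps"] > 30:
            continue                  # drop meandering trajectories
        if len(set(s["actions"])) < len(s["actions"]) // 2:
            continue                  # drop loops of the same action repeated
        kept.append(s)
    return kept

raw = [
    {"task_succeeded": True,  "steps": 12, "actions": ["click", "type", "click", "save"]},
    {"task_succeeded": False, "steps": 5,  "actions": ["click"]},
    {"task_succeeded": True,  "steps": 80, "actions": ["click"] * 80},
]
print(len(curate(raw)))   # 1: only the clean, successful trajectory survives
```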
The result: better performance per parameter. Traditional scaling (make the model bigger, train longer) hits diminishing returns. Holo3's approach (teach the model smarter) compresses capability into fewer parameters.
Who can use Holo3 right now, and what can they build?
Free API tier for prototyping, open-source weights for deployment. 486 tested enterprise tasks—e-commerce, finance, collaboration, multi-app workflows. Developers ship automation in weeks, not months.
Holo3 is immediately deployable through two channels. First, free inference API: anyone can hit H Company's Inference API endpoints with free tier access—no credit card required, no API key restrictions. This is the no-friction entry point for prototyping.
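Calling such an API typically looks like the sketch below. The endpoint URL and model id are placeholders (check H Company's documentation for the real values), and the OpenAI-compatible message format is an assumption; only the overall request shape is illustrated.

```python
import json
import urllib.request

# Placeholder endpoint; substitute the real Inference API URL from H Company's docs.
API_URL = "https://api.example-hcompany.test/v1/chat/completions"

def build_request(instruction: str, screenshot_b64: str) -> urllib.request.Request:
    """Build a multimodal request: the task as text, the current desktop
    screenshot as a base64-encoded image."""
    payload = {
        "model": "holo3",   # hypothetical model id
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            ],
        }],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},  # free tier: no auth assumed
    )

req = build_request("Open the latest invoice and extract the total", "iVBOR...")
# urllib.request.urlopen(req) would send it; omitted so the sketch stays offline
```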
Second, open-source weights: the full model is on HuggingFace under Apache 2.0 license. Developers can download it, run it on their own infrastructure (cloud or on-prem), fine-tune it for specific workflows, and deploy proprietary versions without licensing friction. This is enterprise-grade deployment.
What can actually be built? H Company tested Holo3 against 486 real enterprise tasks across four categories. E-commerce: agents navigate shopping sites, compare prices, extract product details, and place orders. Business software: agents pull reports from accounting systems, locate data in collaboration tools, and compile analysis. Collaboration: agents retrieve files from shared folders, parse documents, and send notifications. Multi-app: agents coordinate across five apps to complete a task (find expense data, cross-reference budget policy, generate approval, email stakeholders). The model solved complex multi-step problems that require understanding context across tools.
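A multi-app workflow like the expense example above can be modeled as an ordered plan executed per app, with context threaded forward so later steps can use earlier results. The app names and subtasks below paraphrase the article's example; the executor is a stub where a real agent would drive each application's UI.

```python
# Hypothetical decomposition of the article's expense-approval example.
WORKFLOW = [
    ("spreadsheet", "find this quarter's expense data"),
    ("document",    "cross-reference the budget policy"),
    ("forms",       "generate the approval request"),
    ("email",       "notify the stakeholders"),
]

def execute_workflow(workflow, run_step):
    """Run each (app, subtask) in order, passing every prior result forward,
    which is the cross-tool context the text describes."""
    context = {}
    for app, subtask in workflow:
        context[app] = run_step(app, subtask, dict(context))
    return context

# Stub executor for illustration only.
result = execute_workflow(
    WORKFLOW,
    lambda app, task, ctx: f"done: {task} (saw {len(ctx)} prior results)",
)
print(result["email"])   # done: notify the stakeholders (saw 3 prior results)
```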
For a developer today, this means you can build personal AI assistants that automate your own workflows within weeks, not months, without waiting on API quotas or pay-per-use fees that scale with usage.
What does an open-weight computer-use model mean for AI's future?
Open shifts power from API vendors to builders. Developers own infrastructure, control costs, and deploy without waiting for quotas. Open beats proprietary once quality reaches parity.
This milestone marks a shift in where power concentrates in AI. Until now, the companies that could run the largest models—OpenAI, Google, Anthropic—controlled agent capabilities. You used them via API, paid per call, and had no visibility into how they worked. You were a customer, not a builder.
Holo3 changes that. Any engineer can run computer-use AI without asking for permission, without paying ongoing fees, without sharing their data. This shifts power to the people building products on top of AI, not just the model companies.
This pattern recurs across technology history. Open-source databases (PostgreSQL) took market share from proprietary ones (Oracle) because developers preferred owning their infrastructure. Open-source LLMs (Llama) displaced reliance on OpenAI APIs once quality was comparable. Now the same thing is happening to agent capabilities: open source beats proprietary once it reaches feature parity and a lower total cost of ownership.
The second-order effect: this accelerates the timeline when all frontier AI capabilities become open. Today, chat models are mostly open or cheap. Soon agents will be. Eventually, everything will be, because open is better for developers and harder to monopolize once the tech works.
The Flip Side: Open Models and Real Risk
Democratization has costs. If any developer can deploy Holo3, so can bad actors. A computer-use model could automate phishing campaigns, credential harvesting, fraud-detection evasion, or ransomware deployment with far more sophistication than today's toolkits. Gemini's early computer-use demo was restricted immediately after researchers warned about jailbreaking risks, and OpenAI spent months in safety review before releasing GPT-4's computer-use capabilities.
Releasing Holo3 as open source means H Company made a deliberate choice: the benefits of open access outweigh the risks of malicious automation. This tension won't be solved by withholding open models; the capabilities will diffuse regardless. But it means communities building with Holo3 need explicit security practices: sandboxing untrusted agents, monitoring for anomalous behavior, and rapid incident-response frameworks that today's enterprise stacks still lack.
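Sandboxing an untrusted agent can be as simple as gating every proposed action against an explicit policy before it executes. The allowlist and blocklist below are illustrative placeholders, not a recommended production policy.

```python
# Hypothetical sandbox policy: which apps the agent may touch, and which
# action kinds are always refused regardless of app.
ALLOWED_APPS = {"browser", "spreadsheet", "pdf_viewer"}
BLOCKED_ACTIONS = {"shell_exec", "send_email_external", "modify_credentials"}

def gate(action: dict) -> bool:
    """Return True only if the proposed agent action passes the sandbox policy.
    The agent's executor should refuse (and log) anything that fails here."""
    if action["app"] not in ALLOWED_APPS:
        return False
    if action["kind"] in BLOCKED_ACTIONS:
        return False
    return True

print(gate({"app": "browser", "kind": "click"}))        # True: permitted
print(gate({"app": "terminal", "kind": "shell_exec"}))  # False: blocked on both counts
```

A real deployment layers this with OS-level isolation (containers, restricted accounts) and anomaly monitoring; the gate only illustrates the shape of a deny-by-default policy.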
Sources
- H Company. "Holo3: Breaking the Computer Use Frontier." HuggingFace Blog. April 1, 2026.
- Xie, Tianbao et al. "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments." arXiv. April 11, 2024 (revised May 30, 2024).
- H Company. Holo3-35B-A3B Model and Inference API. HuggingFace Hub.
Fact-checked by Jim Smart