In this paper, we introduce OSUniverse: a benchmark of complex, multimodal, desktop-oriented tasks for advanced GUI-navigation AI agents that focuses on ease of use, extensibility, comprehensive coverage of test cases, and automated validation. We divide the tasks into increasing levels of complexity, from basic precision clicking to multi-step, multi-application tests that require dexterity, precision, and clear thinking from the agent. In version one of the benchmark, presented here, we have calibrated the complexity of the test cases so that SOTA (state-of-the-art) agents at the time of publication do not score higher than 50%, while an average white-collar worker can perform all of the tasks with perfect accuracy. The benchmark can be scored manually, but we also introduce an automated validation mechanism with an average error rate of less than 2%. This benchmark therefore provides solid ground for fully automated measurement of the progress, capabilities, and effectiveness of GUI-navigation AI agents over the short and medium term.
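The abstract does not spell out how the validator's error rate is defined; a natural reading is the rate of disagreement between the automated verdict and a manual verdict on the same runs. The sketch below computes that quantity under this assumption; the `RunResult` structure and function name are illustrative and not taken from the OSUniverse codebase.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    task_id: str
    manual_pass: bool     # verdict from a human reviewer
    automated_pass: bool  # verdict from the automated validator


def validator_error_rate(results: list[RunResult]) -> float:
    """Fraction of runs where the automated verdict disagrees with the manual one."""
    if not results:
        raise ValueError("no results to score")
    disagreements = sum(r.manual_pass != r.automated_pass for r in results)
    return disagreements / len(results)


# Example: 3 disagreements across 200 manually reviewed runs -> 1.5% error rate.
```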
| Agent | Paper | Wood | Bronze | Silver | Gold | Total Score |
|---|---|---|---|---|---|---|
| Computer Use Agent with computer-use-preview-2025-03-11 | 100.00% | 86.21% | 75.00% | 34.38% | 9.09% | 47.80% |
| Claude Computer Use with claude-3-5-sonnet-20241022 | 100.00% | 53.45% | 43.75% | 21.88% | 0.00% | 28.36% |
| AgentDesk-based ReACT with claude-3-5-sonnet-20241022 | 90.91% | 56.90% | 39.58% | 9.38% | 0.00% | 23.44% |
| QWEN-based ReACT with qwen2.5-vl-72b-instruct | 90.91% | 46.55% | 31.25% | 6.25% | 0.00% | 18.64% |
| AgentDesk-based ReACT with gemini-2.5-pro-exp-03-25 | 90.91% | 15.52% | 18.75% | 3.12% | 0.00% | 9.59% |
| AgentDesk-based ReACT with gemini-2.0-flash-001 | 90.91% | 13.79% | 14.58% | 3.12% | 0.00% | 8.26% |
| AgentDesk-based ReACT with gpt-4o-2024-11-20 | 100.00% | 6.90% | 12.50% | 3.12% | 0.00% | 6.79% |
| AgentDesk-based ReACT with gemini-1.5-pro-002 | 90.91% | 3.45% | 12.50% | 3.12% | 0.00% | 6.12% |
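The Paper, Wood, Bronze, Silver, and Gold columns are per-level success rates across the benchmark's complexity levels. The Total Score column is not a plain average of those percentages: the published numbers are consistent with a per-test weighted aggregate in which each level's weight doubles (Paper 1, Wood 2, Bronze 4, Silver 8, Gold 16) over level sizes implied by the percentages (11, 58, 48, 32, and 11 test cases, respectively). The sketch below reproduces the Total Score column under those assumptions; the weights and test counts are inferred from the table, not quoted from the paper.

```python
# Weighted total score. Level weights and test counts below are assumptions
# inferred from the leaderboard table, not values quoted from the paper.
LEVELS = {
    # level: (weight, number of test cases)
    "paper":  (1, 11),
    "wood":   (2, 58),
    "bronze": (4, 48),
    "silver": (8, 32),
    "gold":   (16, 11),
}


def total_score(level_pass_rates: dict[str, float]) -> float:
    """Aggregate per-level pass rates (0..1) into a single weighted score."""
    earned = sum(
        weight * count * level_pass_rates[level]
        for level, (weight, count) in LEVELS.items()
    )
    available = sum(weight * count for weight, count in LEVELS.values())
    return earned / available


# Claude Computer Use row: 1.0, 0.5345, 0.4375, 0.2188, 0.0 -> ~0.2836 (28.36%)
print(round(total_score({
    "paper": 1.0, "wood": 0.5345, "bronze": 0.4375, "silver": 0.2188, "gold": 0.0,
}), 4))
```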
@misc{davydova2025osuniversebenchmarkmultimodalguinavigation,
      title={OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents},
      author={Mariya Davydova and Daniel Jeffries and Patrick Barker and Arturo Márquez Flores and Sinéad Ryan},
      year={2025},
      eprint={2505.03570},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.03570},
}