OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents

Kentauros AI Inc.

Abstract

In this paper, we introduce OSUniverse: a benchmark of complex, multimodal desktop-oriented tasks for advanced GUI-navigation AI agents that focuses on ease of use, extensibility, comprehensive coverage of test cases, and automated validation. We divide the tasks into increasing levels of complexity, from basic precision clicking to multi-step, multi-application tests requiring dexterity, precision, and clear thinking from the agent. In version one of the benchmark, presented here, we have calibrated the complexity of the test cases so that SOTA (State of the Art) agents (at the time of publication) do not achieve results higher than 50%, while the average white-collar worker can perform all of these tasks with perfect accuracy. The benchmark can be scored manually, but we also introduce an automated validation mechanism with an average error rate of less than 2%. Therefore, this benchmark presents solid ground for fully automated measurement of the progress, capabilities, and effectiveness of GUI-navigation AI agents over the short and medium-term horizon.

Core Features

  • Multimodality: designed to rely on vision only, without any extra knowledge about the environment.
  • Diversity: contains 160 tasks across 5 levels of complexity and 9 categories; all tasks are carefully crafted to represent real-world scenarios that are easy for the average office worker but challenging for machines.
  • Automated validation: includes a Gemini-powered validator with an average error rate of less than 2%; each test case supports four types of validation: the agent's textual output, the final screenshot of the desktop, the agent's trajectory, and the output of an arbitrary bash command run after the agent has finished the task (see the configuration sketch after this list).
  • Non-deterministic: many scenarios cannot be checked with simple heuristics, such as whether an agent actually drew a smiley face or a flower in GIMP, so we rely on automatic Gemini validation for these more complex scenarios (see the validator sketch after this list).
  • Easily extensible: every test is configured via YAML.
  • Agent architecture independent: ReACT-style agents are currently the most popular architecture, but many others exist, and researchers should be free to implement any architecture they can dream up.
  • Flexibility: you can create custom agents, runners, and validators.
  • Escalating complexity: tasks are grouped into levels, advancing from Paper-level tasks, where the agent only has to look at the screen and accurately describe it, all the way to Gold-level tasks: complex multi-application scenarios that test everything from drawing, to dragging and dropping, to distilling information from one application and entering it into another.
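
To make the YAML-based configuration and the four validation types concrete, the following sketch shows what a test-case definition could look like, loaded in Python for illustration. Every field name below (name, level, task, validation, and so on) is an assumption made for this example and does not reflect the benchmark's actual schema.

    # A minimal sketch, assuming PyYAML is installed. The schema is hypothetical
    # and only illustrates a YAML-defined test case with four validation types.
    import yaml

    TEST_CASE = """
    name: gimp-smiley-face
    level: silver                  # hypothetical complexity tier
    task: >
      Open GIMP, create a new canvas, and draw a yellow smiley face on it.
    validation:
      text_output: "The agent reports that the drawing is complete."
      final_screenshot: "A yellow smiley face is visible on the GIMP canvas."
      trajectory: "The agent opened GIMP before it started drawing."
      bash: "ls ~/Pictures"        # command output is checked after the run
    """

    config = yaml.safe_load(TEST_CASE)
    print(config["validation"]["final_screenshot"])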
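
The screenshot check above is the kind of non-deterministic outcome that a vision-language model can judge better than a heuristic. The sketch below shows how a Gemini-powered screenshot validator could be wired up with the google-generativeai Python package; the model name, prompt wording, and PASS/FAIL convention are illustrative assumptions, not the benchmark's actual validator code.

    # A minimal sketch, assuming the google-generativeai and Pillow packages are
    # installed and a Gemini API key is available in the environment.
    import os

    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-pro")

    def screenshot_satisfies(screenshot_path: str, expectation: str) -> bool:
        """Ask Gemini whether the final desktop screenshot matches the expectation."""
        image = Image.open(screenshot_path)
        prompt = (
            "You are validating the result of a GUI-navigation task.\n"
            f"Expectation: {expectation}\n"
            "Reply with exactly one word: PASS or FAIL."
        )
        response = model.generate_content([prompt, image])
        return response.text.strip().upper().startswith("PASS")

    # Hypothetical usage with the test case sketched above:
    # screenshot_satisfies("final_state.png",
    #                      "A yellow smiley face is visible on the GIMP canvas")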

Agent Performance

Agent                                                    Paper    Wood     Bronze   Silver   Gold    Total Score
Computer Use Agent with computer-use-preview-2025-03-11  100.00%  86.21%   75.00%   34.38%   9.09%   47.80%
Claude Computer Use with claude-3-5-sonnet-20241022      100.00%  53.45%   43.75%   21.88%   0.00%   28.36%
AgentDesk-based ReACT with claude-3-5-sonnet-20241022    90.91%   56.90%   39.58%   9.38%    0.00%   23.44%
QWEN-based ReACT with qwen2.5-vl-72b-instruct            90.91%   46.55%   31.25%   6.25%    0.00%   18.64%
AgentDesk-based ReACT with gemini-2.5-pro-exp-03-25      90.91%   15.52%   18.75%   3.12%    0.00%   9.59%
AgentDesk-based ReACT with gemini-2.0-flash-001          90.91%   13.79%   14.58%   3.12%    0.00%   8.26%
AgentDesk-based ReACT with gpt-4o-2024-11-20             100.00%  6.90%    12.50%   3.12%    0.00%   6.79%
AgentDesk-based ReACT with gemini-1.5-pro-002            90.91%   3.45%    12.50%   3.12%    0.00%   6.12%

Observations and Insights

BibTeX

@misc{davydova2025osuniversebenchmarkmultimodalguinavigation,
    title={OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents}, 
    author={Mariya Davydova and Daniel Jeffries and Patrick Barker and Arturo Márquez Flores and Sinéad Ryan},
    year={2025},
    eprint={2505.03570},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2505.03570}, 
}