OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents

Kentauros AI Inc.

Abstract

In this paper, we introduce OSUniverse: a benchmark of complex, multimodal desktop-oriented tasks for advanced GUI-navigation AI agents that focuses on ease of use, extensibility, comprehensive coverage of test cases, and automated validation. We divide the tasks into increasing levels of complexity, from basic precision clicking to multi-step, multi-application tests requiring dexterity, precision, and clear thinking from the agent. In version one of the benchmark, presented here, we have calibrated the complexity of the test cases so that SOTA (State of the Art) agents (at the time of publication) do not achieve results higher than 50%, while the average white-collar worker can perform all of these tasks with perfect accuracy. The benchmark can be scored manually, but we also introduce an automated validation mechanism with an average error rate of less than 2%. Therefore, this benchmark presents solid ground for fully automated measurement of the progress, capabilities, and effectiveness of GUI-navigation AI agents over the short and medium-term horizon.

Core Features

  • Multimodality: designed to rely on vision only, without any extra knowledge about the environment.
  • Diversity: contains 160 tasks across 5 levels of complexity and 9 categories; all tasks are carefully crafted to represent real-world scenarios that are easy for the average office worker but challenging for machines.
  • Automated validation: includes a Gemini-powered validator with an average error rate of less than 2%; each test case supports four types of validation: the agent's textual output, the final screenshot of the desktop, the agent's trajectory, and the output of an arbitrary bash command run after the agent has finished the task (see the configuration sketch after this list).
  • Non-deterministic: many scenarios cannot be checked with simple heuristics, such as whether an agent actually drew a smiley face or a flower in GIMP, so we rely on automatic Gemini validation for these more complex scenarios (see the validator sketch after this list).
  • Easily extensible: every test is configured via YAML.
  • Agent architecture independent: ReACT-style agents are currently the most popular architecture, but many others exist, and researchers should be free to implement any architecture they can dream up.
  • Flexibility: you can create custom agents, runners, and validators.
  • Escalating complexity: tasks are grouped into levels, advancing from Paper-level tasks, where the agent only has to look at the screen and accurately describe it, all the way to Gold-level tasks: complex multi-application scenarios that test everything from drawing, to dragging and dropping, to distilling information from one application and entering it into another.
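
To make the YAML-based configuration and the four validation types concrete, the following sketch shows what a test-case definition could look like, loaded in Python for illustration. Every field name below (name, level, task, validation, and so on) is an assumption made for this example and does not reflect the benchmark's actual schema.

    # A minimal sketch, assuming PyYAML is installed. The schema is hypothetical
    # and only illustrates a YAML-defined test case with four validation types.
    import yaml

    TEST_CASE = """
    name: gimp-smiley-face
    level: silver                  # hypothetical complexity tier
    task: >
      Open GIMP, create a new canvas, and draw a yellow smiley face on it.
    validation:
      text_output: "The agent reports that the drawing is complete."
      final_screenshot: "A yellow smiley face is visible on the GIMP canvas."
      trajectory: "The agent opened GIMP before it started drawing."
      bash: "ls ~/Pictures"        # command output is checked after the run
    """

    config = yaml.safe_load(TEST_CASE)
    print(config["validation"]["final_screenshot"])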
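
The screenshot check above is the kind of non-deterministic outcome that a vision-language model can judge better than a heuristic. The sketch below shows how a Gemini-powered screenshot validator could be wired up with the google-generativeai Python package; the model name, prompt wording, and PASS/FAIL convention are illustrative assumptions, not the benchmark's actual validator code.

    # A minimal sketch, assuming the google-generativeai and Pillow packages are
    # installed and a Gemini API key is available in the environment.
    import os

    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-pro")

    def screenshot_satisfies(screenshot_path: str, expectation: str) -> bool:
        """Ask Gemini whether the final desktop screenshot matches the expectation."""
        image = Image.open(screenshot_path)
        prompt = (
            "You are validating the result of a GUI-navigation task.\n"
            f"Expectation: {expectation}\n"
            "Reply with exactly one word: PASS or FAIL."
        )
        response = model.generate_content([prompt, image])
        return response.text.strip().upper().startswith("PASS")

    # Hypothetical usage with the test case sketched above:
    # screenshot_satisfies("final_state.png",
    #                      "A yellow smiley face is visible on the GIMP canvas")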

Agent Performance

Agent                                                    Paper    Wood     Bronze   Silver   Gold    Total Score
Computer Use Agent with computer-use-preview-2025-03-11  100.00%  86.21%   75.00%   34.38%   9.09%   47.80%
Claude Computer Use with claude-3-5-sonnet-20241022      100.00%  53.45%   43.75%   21.88%   0.00%   28.36%
AgentDesk-based ReACT with claude-3-5-sonnet-20241022    90.91%   56.90%   39.58%   9.38%    0.00%   23.44%
QWEN-based ReACT with qwen2.5-vl-72b-instruct            90.91%   46.55%   31.25%   6.25%    0.00%   18.64%
AgentDesk-based ReACT with gemini-2.5-pro-exp-03-25      90.91%   15.52%   18.75%   3.12%    0.00%   9.59%
AgentDesk-based ReACT with gemini-2.0-flash-001          90.91%   13.79%   14.58%   3.12%    0.00%   8.26%
AgentDesk-based ReACT with gpt-4o-2024-11-20             100.00%  6.90%    12.50%   3.12%    0.00%   6.79%
AgentDesk-based ReACT with gemini-1.5-pro-002            90.91%   3.45%    12.50%   3.12%    0.00%   6.12%

Observations and Insights

BibTeX

@misc{davydova2025osuniversebenchmarkmultimodalguinavigation,
    title={OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents}, 
    author={Mariya Davydova and Daniel Jeffries and Patrick Barker and Arturo Márquez Flores and Sinéad Ryan},
    year={2025},
    eprint={2505.03570},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2505.03570}, 
}