Papers Read
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
This week for my Agentic Systems Made Real class, we read two papers, SWE-agent and OSWorld. Overall, I thought both papers provided a good overview of the efficacy of agents on certain tasks and the challenges researchers face in creating them. The field is moving quickly (OpenAI released Operator, which achieved a 38.1% success rate on OSWorld vs. the 12.2% reported in the original paper), so some of the observations below might be slightly outdated, but I am including them anyway.
First, I was a bit surprised by the contrast between the OSWorld paper's views on ACIs and SWE-agent's. In the OSWorld paper, the authors mention that they believe "VLM agents using screenshots only…should be the ultimate configuration in the long run". This gave me the impression that the researchers were trying to create an agent that mimics human behavior, not necessarily one that optimizes performance (especially since the a11y tree configuration performed best). That view contrasts with the main philosophy behind the SWE-agent paper: that agents might perform better with interfaces designed for them rather than for humans. The SWE-agent philosophy seems more likely to lead to effective agents, since it does not limit agents to how we perceive the world (and "they" certainly perceive it differently). However, I do realize that an agent operating a GUI could help with explainability, since we can observe what the agent is doing. Another interesting point brought up during class discussion is that creating a unique ACI for each problem or task might not scale. Both of these points suggest there will likely need to be a balance between the two philosophies. Interestingly (and contrary to my initial expectation), OpenAI's Operator uses only screenshots and text commands (similar to the configuration the OSWorld authors favor) to achieve its 38.1% success rate.
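To make the contrast concrete, here is a minimal sketch of the two observation configurations discussed above. The `Observation` container and the `env` methods are hypothetical names I'm using for illustration; they are not OSWorld's or SWE-agent's actual APIs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Observation:
    screenshot_png: Optional[bytes] = None  # raw pixels, as a human would see them
    a11y_tree: Optional[str] = None         # serialized accessibility tree

def screenshot_only_obs(env) -> Observation:
    # "Human-like" configuration: the agent gets only what a person sees on screen.
    return Observation(screenshot_png=env.capture_screenshot())

def a11y_tree_obs(env) -> Observation:
    # Agent-centric configuration: structured UI state the model can parse directly,
    # which is the configuration that performed best in the OSWorld experiments.
    return Observation(a11y_tree=env.dump_accessibility_tree())
```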
Second, I was impressed by the OSWorld paper's commentary on the number of man-hours (1,800 hours over three months) needed to create the 369 problems included in its benchmark. Third, I found it interesting that both papers used pre-trained models for their respective tasks, with the larger / costlier / more advanced versions of GPT and Claude outperforming the smaller models. Even so, performance tends to be low, with the OSWorld paper noting that the models are being used in environments they weren't trained for. The latter observation is likely a consequence of the former: creating these benchmarks is a bottleneck to fine-tuning / training agents to perform better and be more cost effective. I did come across this paper, which detailed its data synthesis process for creating a VLM agent (generating synthetic data and then manually verifying examples), suggesting there may be more efficient ways to generate benchmark data going forward. Similarly, another potential bottleneck is creating accurate and scalable evaluation functions. Taking it a step further, for training purposes it might also be helpful to have evaluation functions that can evaluate (and thus reward) the agent mid-task, allowing the agent to learn the optimal steps to complete the task.
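As a rough illustration of that last point, here is a minimal sketch contrasting an end-of-episode evaluation function with one that grants partial credit mid-task. The task (producing a report.pdf via an intermediate draft.txt) and the file names are hypothetical, and this is not OSWorld's actual evaluator interface.

```python
import os

def final_eval(task_dir: str) -> float:
    # Post-hoc check: reward only if the expected artifact exists when the episode ends.
    return 1.0 if os.path.exists(os.path.join(task_dir, "report.pdf")) else 0.0

def stepwise_eval(task_dir: str) -> float:
    # Mid-task shaping: partial credit for intermediate milestones, so an agent
    # being trained gets a learning signal before the episode finishes.
    reward = 0.0
    if os.path.exists(os.path.join(task_dir, "draft.txt")):
        reward += 0.5  # intermediate milestone: a draft was produced
    if os.path.exists(os.path.join(task_dir, "report.pdf")):
        reward += 0.5  # final goal: the report was exported
    return reward
```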
On the topic of training / fine-tuning an agent: while training will likely improve performance, it can also increase the risk of the agent failing to generalize. For example, Google created the Android in the Wild dataset and used it to train its own agents. Below is a chart showing the performance of Google's two trained agents (BC-single / BC-history) compared to prompting a PaLM model (LLM-0 / LLM-hist-5-CoT).
[Chart from the Android in the Wild paper: BC-single and BC-history vs. LLM-0 and LLM-hist-5-CoT across the dataset's evaluation splits]
Unsurprisingly, Google's trained agents outperform the prompted LLM agents in the standard configuration. However, performance deteriorates relatively more for Google's trained agents in unseen domains than it does for the prompted agents. Their absolute performance is nonetheless higher, but this still suggests Google's agents were likely overfit to the standard setting. The PaLM-powered LLM agent, which is trained on more general text generation, does not suffer as much of a deterioration (and in some cases performs better, although this is likely due to random chance), suggesting that training agents on more general tasks (e.g., understanding core Android UI features and navigation) could help with generalization.
Finally, I got the sense that how an agent receives its instructions and history can be consolidated further. SWE-agent demonstrated this: it didn't need the full history of its actions to perform well. With smaller context windows, smaller models might achieve good performance, again making agents more cost effective and accessible.
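As a rough sketch of that kind of history consolidation (the message format and the keep_last threshold here are my own illustrative choices, not SWE-agent's implementation), older observations can be collapsed into one-line placeholders while the most recent ones are kept in full:

```python
def build_context(history: list[dict], keep_last: int = 5) -> list[dict]:
    """Collapse older observations to one-line placeholders, keeping the most
    recent ones verbatim so the prompt fits in a smaller context window."""
    cutoff = len(history) - keep_last
    context = []
    for i, turn in enumerate(history):
        if turn["role"] == "observation" and i < cutoff:
            context.append({
                "role": "observation",
                "content": f"[output omitted: {len(turn['content'])} characters]",
            })
        else:
            context.append(turn)
    return context
```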