Papers Read / Referenced

Over the last few weeks, I have started reading more papers on AI Agents and Agentic Systems. Below, I share my thoughts on each paper and synthesize my view of the current state of research in the field.

For clarity, I think of Agentic Systems as systems that solve tasks in non-deterministic ways by interacting with an environment and tools. Examples include a digital AI assistant that writes emails for you, or books vacations for you based on your preferences.

Clearly, creating these systems is a large undertaking, given how vast an agent's action space can be. However, the famous AlphaGo Zero paper highlights the potential for an AI agent to 1) learn about its environment and task with limited labeled data, and 2) outperform humans at those tasks. The paper provides a useful roadmap for how to create an AI agent using reinforcement learning. Below, I outline some items on that roadmap and include my thoughts on each.

1) An environment for the agent to interact with

Simulating a game of Go is straightforward. Simulating the environment for a more advanced agent is not. Most importantly, the environment has to be realistic. If you are creating a coding assistant, it is not enough for your environment to be a single .py file or Python function. When fixing a bug or adding new code, human software engineers need to navigate through several files and understand how changes in one file might affect code in another; the agent should be able to operate in such an environment. Further, as mentioned in the OSWorld paper, an agent might get distracted by background noise, or events unrelated to the main task at hand. For example, if an agent is buying an item on Amazon, it might not know how to handle a pop-up from an unrelated process if it was only tested on Amazon's website (as opposed to a full web browser). Creating a realistic environment for the agent to explore helps avoid these issues.
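To make this concrete, here is a minimal sketch of what a noisy, gym-style desktop environment could look like. The class, event list, and probabilities are my own illustration (not from OSWorld); the point is that distractions are injected by the environment itself, so the agent must learn to handle them.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Observation:
    """What the agent sees each step: a screen stand-in plus any pop-ups."""
    screen_text: str
    popups: list[str] = field(default_factory=list)

class NoisyDesktopEnv:
    """Hypothetical OSWorld-style environment: the agent operates on a full
    desktop, and unrelated background events can interrupt it at any step."""

    BACKGROUND_EVENTS = ["OS update dialog", "calendar reminder", "low-battery warning"]

    def __init__(self, task: str, noise_prob: float = 0.1):
        self.task = task
        self.noise_prob = noise_prob

    def reset(self) -> Observation:
        return Observation(screen_text=f"Desktop ready. Task: {self.task}")

    def step(self, action: str) -> tuple[Observation, bool]:
        # Occasionally inject a distraction unrelated to the task. An agent
        # tested only on the happy path will fail here until it learns to
        # dismiss pop-ups and return to its task.
        popups = [random.choice(self.BACKGROUND_EVENTS)] if random.random() < self.noise_prob else []
        done = (action == "done")
        return Observation(screen_text=f"Executed: {action}", popups=popups), done
```

An agent developed only against the noise-free version of this environment would have no policy for the pop-up case, which is exactly the failure mode the OSWorld authors describe.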

2) A benchmark to determine how well the agent performs

Determining whether an agent won a game of Go is straightforward, since the rules are clearly delineated. Evaluating whether an agent achieved its task in a more complex environment can be much harder. In the OSWorld paper, the authors mention that it took 1,800 hours over 3 months to create 369 tests to evaluate digital agents. If we are to gain more confidence in using agents in our day-to-day, we need a more scalable way of evaluating whether an agent is performing its task accurately. Further, evaluating an agent solely on the accuracy of the completed task may not be sufficient. For example, when fixing a bug, a software engineer might make the code base more robust to avoid future problems, or identify new bugs that fixing the original problem might introduce. An agent narrowly focused on solving the original issue may not do this, as noted in the SWE-bench paper. The cost of achieving a given accuracy should also be included in these evaluations. Dollar cost is the easiest to measure, but other metrics matter too, such as latency and unintended side effects of the agent (e.g. you asked the agent to delete one table in your database, but it deleted that table and a few more). In summary, to create more effective agents, we need scalable ways to measure their efficacy.
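As a sketch of what more scalable, multi-dimensional evaluation could look like, here is a hypothetical per-task record that tracks success alongside cost, latency, and side effects. The field names, and the `agent`, `task`, and `verifier` objects, are assumptions for illustration, not an API from OSWorld or SWE-bench; the key idea is that a programmatic verifier checks the final state so grading scales without manual review.

```python
import time
from dataclasses import dataclass

@dataclass
class EvalResult:
    task_id: str
    success: bool            # did the agent complete the task?
    dollar_cost: float       # API spend attributed to this run
    latency_s: float         # wall-clock time for the run
    side_effects: list[str]  # unintended changes, e.g. extra tables deleted

def evaluate(agent, task, verifier) -> EvalResult:
    """Run one task and grade it with an execution-based verifier.

    `agent.run`, `transcript.cost`, and `transcript.side_effects` are
    hypothetical; substitute your own harness's equivalents.
    """
    start = time.time()
    transcript = agent.run(task)
    return EvalResult(
        task_id=task.id,
        success=verifier(task, transcript),  # checks end state, not the transcript's claims
        dollar_cost=transcript.cost,
        latency_s=time.time() - start,
        side_effects=transcript.side_effects,
    )
```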

3) An understanding of the action space

In Go, the action space is easy to understand: the agent can either place a stone on an empty spot or pass. For most agents, the action space is much larger. For example, an agent in charge of booking a trip through an online website needs to understand how the internet works, and how the particular booking website works. However, the models powering these agents typically were not trained to understand the environments they are placed in (e.g. the internet, or a computer). In the ExAct paper, the authors note that over 90% of their agent's errors came from the agent's lack of understanding of its environment. The OSWorld paper similarly noted that its agents, powered by vision-language models, struggled with higher-quality images because the underlying models were trained on lower-quality ones. To me, this suggests that the greatest improvements to these agents might come from further training these models to better understand their environments. Once an agent can generally interpret its environment (for example, understand the core components of a webpage), RL and reflective prompting techniques might fill in the gaps needed to solve the final tasks. Another avenue that might help agents understand their environments is the concept of agent-computer interfaces (as opposed to HCI, human-computer interfaces) proposed in the SWE-agent paper. Foundational models "perceive" the world in a very different way than we do, so there might be more optimal ways to present an environment to an agent than what we are accustomed to.
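To illustrate the ACI idea, here is a rough sketch of the kind of interface SWE-agent argues for: a few compact commands with bounded, structured feedback, rather than a raw shell. The command names are loosely modeled on the paper's (open, search, edit), but the implementation here is entirely my own toy version.

```python
class AgentComputerInterface:
    """Toy ACI: compact commands, concise structured output, and explicit
    confirmation of edits, instead of dumping raw terminal output."""

    def __init__(self, repo_files: dict[str, list[str]], window: int = 50):
        self.files = repo_files  # path -> list of source lines
        self.window = window     # show at most `window` lines at a time

    def open(self, path: str, start: int = 0) -> str:
        # A bounded window keeps the model's context from being flooded
        # with thousands of irrelevant lines.
        lines = self.files[path][start:start + self.window]
        return "\n".join(f"{start + i + 1}: {l}" for i, l in enumerate(lines))

    def search(self, term: str) -> str:
        # Summarized matches instead of raw grep output.
        hits = [f"{path}:{i + 1}"
                for path, lines in self.files.items()
                for i, l in enumerate(lines) if term in l]
        return f"{len(hits)} matches: " + ", ".join(hits[:10])

    def edit(self, path: str, start: int, end: int, new_lines: list[str]) -> str:
        # Replace a 1-indexed line range and confirm succinctly, so a
        # bad edit is immediately visible to the model.
        self.files[path][start - 1:end] = new_lines
        return f"Edited {path} lines {start}-{end}."
```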

Ensuring an agent understands its environment leads me to my last point / concern. As agents interact with more complex environments, they will need a strong causal understanding of those environments. Simply fine-tuning an agent on more observational data will not be enough, as agents will frequently encounter slightly different environments and situations (see my previous blog post for an example with Google's Android dataset). Instead, we need methods that help an agent learn the underlying structural causal model of its environment. Further research in that space can be found here from Columbia's Causal AI lab.
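A toy structural causal model makes the observational-vs-causal gap concrete. In this made-up example (mine, not from the lab's papers), hidden background load causes both a pop-up and UI lag; conditioning on seeing a pop-up and intervening to force one give very different misclick rates, and an agent fine-tuned only on logged observations can only learn the former.

```python
import random

def sample(do_popup=None):
    """Toy SCM: hidden background load U causes both a pop-up and UI lag;
    the agent misclicks only when a pop-up appears while the UI lags."""
    u = random.random() < 0.5                    # hidden confounder (background load)
    popup = u if do_popup is None else do_popup  # do() severs U's influence on popup
    lag = u                                      # lag is caused by U, not by the pop-up
    return popup, popup and lag                  # (popup, misclick)

N = 100_000
obs = [mis for pop, mis in (sample() for _ in range(N)) if pop]
intv = [mis for _, mis in (sample(do_popup=True) for _ in range(N))]

print(f"P(misclick | popup=1)     = {sum(obs) / len(obs):.2f}")  # 1.00: confounded by U
print(f"P(misclick | do(popup=1)) = {sum(intv) / N:.2f}")        # ~0.50: true causal effect
```

An agent trained purely on the observational logs would conclude that pop-ups always cause misclicks; only a model of the underlying mechanism recovers the interventional answer.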

While most of this post has centered on areas that need further research, I want to briefly discuss how a modular framework, as proposed in DSPy, could (partially) help overcome some of the challenges mentioned in this post. The paper (and the subsequent package) emphasizes building agentic systems that break the original problem into smaller, more manageable pieces, which should improve both the results and the robustness of the agent. I would recommend reading Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models to see an example of this concept. I suspect that systems which break a task down into smaller pieces are one route to addressing the limitations outlined in point 3.
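Here is a minimal sketch of that decomposition pattern using DSPy's signature/module abstractions. I am writing this from memory of the paper's API, so treat the exact names and the line-per-sub-question format as approximate; you would also need to configure a language model (e.g. via dspy.settings.configure) before running it.

```python
import dspy

class Outline(dspy.Signature):
    """Break a complex question into sub-questions."""
    question = dspy.InputField()
    outline = dspy.OutputField(desc="one sub-question per line")

class AnswerSub(dspy.Signature):
    """Answer a single sub-question concisely."""
    sub_question = dspy.InputField()
    answer = dspy.OutputField()

class Synthesize(dspy.Signature):
    """Combine sub-answers into a final answer."""
    question = dspy.InputField()
    sub_answers = dspy.InputField()
    final_answer = dspy.OutputField()

class DecomposedQA(dspy.Module):
    """Plan, solve each piece, then recombine. Each sub-module is a small,
    separately optimizable step rather than one monolithic prompt."""

    def __init__(self):
        super().__init__()
        self.plan = dspy.ChainOfThought(Outline)
        self.solve = dspy.ChainOfThought(AnswerSub)
        self.merge = dspy.ChainOfThought(Synthesize)

    def forward(self, question: str):
        outline = self.plan(question=question).outline
        subs = [self.solve(sub_question=q).answer
                for q in outline.split("\n") if q.strip()]
        return self.merge(question=question, sub_answers="\n".join(subs))
```

Because each module has its own signature, failures localize to a single step, which is part of what makes these systems more robust than one long prompt.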

Overall, as foundational models become stronger and more capable, they open up many exciting opportunities for agents. However, as outlined above, the field still needs significant advances before we get there.