Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation

Overview

Large language models (LLMs) have recently gained much attention in building autonomous agents. However, the performance of current LLM-based web agents in long-horizon tasks is far from optimal, often yielding errors such as repeatedly buying a non-refundable flight ticket. By contrast, humans can avoid such an irreversible mistake, as we have an awareness of the potential outcomes (e.g., losing money) of our actions, also known as the "world model". Motivated by this, our study first starts with preliminary analyses, confirming the absence of world models in current LLMs (e.g., GPT-4, Claude-3.5-Sonnet, etc.). Then, we present a World-Model-Augmented (WMA) web agent, which simulates the outcomes of its actions for better decision-making. To overcome the challenges in training LLMs as world models predicting next observations, such as repeated elements across observations and long HTML inputs, we propose a transition-focused observation abstraction, where the prediction objectives are free-form natural language descriptions exclusively highlighting important state differences between time steps. Experiments on WebArena and Mind2Web show that our world models improve agents' policy selection without training and demonstrate our agents' cost- and time-efficiency compared to recent tree-search-based agents.

🌍 News

[2025/01/22] WMA Web Agent has been accepted to ICLR 2025!
[2024/06/12] WMA Web Agent is out!

Methodology

Step I: Harvesting Agent-Environment Interaction Data

We start by collecting the dataset \( \mathcal{D} = \sum^{n}_{t=1} \{ I, o_t, a_t, o_{t+1} \} \) from the environment \( \mathcal{E} \) for training world models. To do so, we prompt an LLM to act as a web agent that pursues the goal given in the user instruction \( I \) by iteratively predicting an action \( a_t \) from the current observation \( o_t \) at each of the \( n \) time steps. Consequently, we obtain \( \mathcal{D} \) from the trajectory \( \tau = \{o_1, a_1, o_2, \ldots, a_{n}, o_{n+1}\} \) for \( I \), together with the \( n+1 \) environment states \( \{s_1, \ldots, s_{n+1}\} \subset \mathcal{S} \) produced by the transition function \( \mathcal{T} \).
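The collection loop described above can be sketched as follows. This is a minimal illustration, not the released pipeline: `reset`, `step`, and `policy` are hypothetical callables standing in for the web environment and the prompted LLM agent, and observations and actions are assumed to be plain strings.

```python
from typing import Callable, List, NamedTuple


class Transition(NamedTuple):
    """One training tuple {I, o_t, a_t, o_{t+1}}."""
    instruction: str
    observation: str
    action: str
    next_observation: str


def collect_trajectory(
    instruction: str,
    reset: Callable[[], str],           # hypothetical: returns the first observation o_1
    step: Callable[[str], str],         # hypothetical: executes a_t, returns o_{t+1}
    policy: Callable[[str, str], str],  # hypothetical: LLM agent mapping (I, o_t) to a_t
    n_steps: int,
) -> List[Transition]:
    """Roll out the agent for n_steps and record every transition."""
    data: List[Transition] = []
    observation = reset()
    for _ in range(n_steps):
        action = policy(instruction, observation)
        next_observation = step(action)
        data.append(Transition(instruction, observation, action, next_observation))
        observation = next_observation
    return data
```

Each recorded tuple reuses the previous step's `next_observation` as its `observation`, so the list mirrors the trajectory \( \tau \) one transition at a time.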

Step II: Transition-Focused Observation Abstraction

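With the collected data \( \mathcal{D} = \sum^{n}_{t=1} \{ I, o_t, a_t, o_{t+1} \} \), a natural approach is to train LLM-based world models to predict \( o_{t+1} \) directly, which is expressed as text (e.g., HTML or an accessibility tree). However, consecutive observations share many repeated elements and raw HTML inputs are long, which makes direct prediction of the full next observation difficult. We therefore abstract each next observation into a description of what changed.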

Figure 5: Transition-Focused Observation Abstraction

As shown in Figure 5, we first (i) build a cost matrix over element pairs and apply the Hungarian algorithm to match elements between \( o_t \) and \( o_{t+1} \), and (ii) mechanically transform the results into a list of state transitions \( \Delta(o_t, o_{t+1}) \), pointing out UPDATED, DELETED, and ADDED elements on the web page. After that, we prompt an LLM to convert the extracted \( \Delta(o_t, o_{t+1}) \) into a free-form natural language description \( \tilde{o}_{t+1} \), which highlights the difference between the new observation \( o_{t+1} \) and \( o_t \). Replacing \( o_{t+1} \) in the dataset \( \mathcal{D} = \sum^{n}_{t=1} \{ I, o_t, a_t, o_{t+1} \} \) collected in Step I with the \( \tilde{o}_{t+1} \) obtained here, we get the final dataset \( \tilde{\mathcal{D}} = \sum^{n}_{t=1} \{ I, o_t, a_t, \tilde{o}_{t+1} \} \) for training world models.
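The matching-then-labeling step can be illustrated with a small sketch. Several things here are assumptions rather than the authors' implementation: elements are plain strings, the pairwise cost is a `difflib` string dissimilarity, the `threshold` value is arbitrary, and the minimum-cost assignment is found by exhaustive search over permutations. Exhaustive search returns the same optimal matching the Hungarian algorithm would on tiny inputs but does not scale, so a real pipeline would swap in a proper assignment solver.

```python
from difflib import SequenceMatcher
from itertools import permutations
from typing import List, Optional, Tuple


def cost(a: str, b: str) -> float:
    """Dissimilarity in [0, 1] between two element strings (assumed cost function)."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def match_elements(old: List[str], new: List[str], threshold: float = 0.5):
    """Minimum-cost one-to-one matching; brute-force stand-in for the Hungarian algorithm."""
    k = max(len(old), len(new))
    # Pad the shorter side with None so that unmatched elements pair with a dummy.
    old_p: List[Optional[str]] = old + [None] * (k - len(old))
    new_p: List[Optional[str]] = new + [None] * (k - len(new))

    def pair_cost(a: Optional[str], b: Optional[str]) -> float:
        # Pairing with a dummy costs `threshold`, i.e. the price of leaving it unmatched.
        return threshold if a is None or b is None else cost(a, b)

    best = min(
        permutations(range(k)),
        key=lambda perm: sum(pair_cost(old_p[i], new_p[j]) for i, j in enumerate(perm)),
    )
    return [(old_p[i], new_p[j]) for i, j in enumerate(best)]


def extract_transition(old: List[str], new: List[str], threshold: float = 0.5) -> List[Tuple]:
    """Turn the matching into a list of UPDATED / DELETED / ADDED entries."""
    delta: List[Tuple] = []
    for a, b in match_elements(old, new, threshold):
        if a is None:
            delta.append(("ADDED", b))
        elif b is None:
            delta.append(("DELETED", a))
        elif cost(a, b) > threshold:
            # Too dissimilar to count as the same element.
            delta.append(("DELETED", a))
            delta.append(("ADDED", b))
        elif a != b:
            delta.append(("UPDATED", a, b))
    return delta


o_t = ['button "Add to Cart"', 'text "Cart: 0 items"']
o_next = ['button "Add to Cart"', 'text "Cart: 1 items"', 'link "Checkout"']
delta = extract_transition(o_t, o_next)
```

On this toy pair of observations, the unchanged button is dropped, the cart text is reported as UPDATED, and the new link is reported as ADDED. Only such differences would then be passed to the LLM that writes the free-form description.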

Step III: Learning Environment Dynamics

Lastly, using \( \tilde{\mathcal{D}} \), we train the internal world model \( \phi \) of the web agent to learn the environment dynamics. Formally, an LLM acting as the world model is trained to predict the abstracted observation \( \tilde{o}_{t+1} \) of the next state \( s_{t+1} \), given three inputs: the user instruction \( I \), the current observation \( o_t \), and the current action \( a_t \). This LLM is trained to minimize the following loss term via the next-token prediction objective:

\[ \mathcal{L}_{\phi} = -\sum_{(I, o_t, a_t, \tilde{o}_{t+1}) \in \tilde{\mathcal{D}}} \log p(\tilde{o}_{t+1} \mid o_t, a_t, I) \]
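Under a next-token prediction objective, the log-probability of a whole abstracted observation is the sum of the log-probabilities of its tokens, each conditioned on the inputs and the preceding tokens. The sketch below shows this bookkeeping only. It assumes the per-token probabilities have already been produced by the world model and are given as plain lists of floats; the function names are hypothetical.

```python
import math
from typing import List


def sequence_nll(token_probs: List[float]) -> float:
    """Negative log-likelihood of one target sequence.

    token_probs[k] is the probability the model assigns to the k-th target
    token given (I, o_t, a_t) and the preceding target tokens.
    """
    return -sum(math.log(p) for p in token_probs)


def world_model_loss(dataset_token_probs: List[List[float]]) -> float:
    """Sum the sequence-level negative log-likelihood over all training tuples."""
    return sum(sequence_nll(probs) for probs in dataset_token_probs)


# Two toy targets: one with two tokens, one with a single token.
loss = world_model_loss([[0.5, 0.5], [0.25]])
```

In practice a training framework computes this quantity as token-level cross-entropy over the target span, usually averaged over tokens rather than summed.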

Experimental Setup

Benchmarks and evaluation metrics

For evaluation, we use the official WebArena and Mind2Web benchmarks. WebArena includes 812 real-life tasks in simulated environments across five different websites, spanning four key domains, namely e-commerce (Shopping), social forums (Reddit), collaborative software development (Gitlab), and content management (CMS), plus Map. The main metric, Success Rate (SR), is the percentage of user instructions that are successfully accomplished by the generated agent trajectory. Mind2Web, in turn, covers over 2,000 open-ended tasks collected from 137 websites across 31 domains, with crowdsourced action sequences for the tasks. Along with SR, Mind2Web also uses Step SR, which measures whether the predicted action selects both the correct action type (action F1) and element ID (element accuracy). A trajectory is counted as a success only when the agent succeeds in every step.
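The relationship between Step SR and SR can be made concrete with a simplified sketch. It is not the official Mind2Web scorer: a step is treated as correct only on an exact match of a hypothetical `action_type` and `element_id` field, and Step SR is averaged over all steps, both of which are assumptions made for illustration.

```python
from typing import Dict, List

Step = Dict[str, str]  # hypothetical record with "action_type" and "element_id"


def step_correct(pred: Step, gold: Step) -> bool:
    """A step succeeds only if both the action type and the element match."""
    return (
        pred["action_type"] == gold["action_type"]
        and pred["element_id"] == gold["element_id"]
    )


def evaluate(predictions: List[List[Step]], references: List[List[Step]]) -> Dict[str, float]:
    """Return Step SR and task-level SR as percentages."""
    step_hits = total_steps = task_hits = 0
    for pred_traj, gold_traj in zip(predictions, references):
        hits = [step_correct(p, g) for p, g in zip(pred_traj, gold_traj)]
        step_hits += sum(hits)
        total_steps += len(gold_traj)
        # A task counts as a success only if every step succeeds.
        task_hits += int(len(pred_traj) == len(gold_traj) and all(hits))
    return {
        "step_sr": 100.0 * step_hits / total_steps,
        "sr": 100.0 * task_hits / len(references),
    }


gold = [
    [{"action_type": "CLICK", "element_id": "1"}, {"action_type": "TYPE", "element_id": "2"}],
    [{"action_type": "CLICK", "element_id": "3"}, {"action_type": "CLICK", "element_id": "4"}],
]
pred = [
    [{"action_type": "CLICK", "element_id": "1"}, {"action_type": "TYPE", "element_id": "2"}],
    [{"action_type": "CLICK", "element_id": "3"}, {"action_type": "CLICK", "element_id": "9"}],
]
scores = evaluate(pred, gold)
```

In this toy case three of four steps are correct but only one of two tasks is fully correct, which shows why SR is the stricter of the two metrics.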

Results

Agent Performance in WebArena

From our experiments in Table 1 and Table 2, we observed the following results:

WMA vs. Vanilla CoT
- WMA web agent achieves a 16.6% success rate compared to 13.1% for vanilla CoT.
- Significant improvements are observed across almost all domains in WebArena (see Table 2).

Performance Gains with GPT-4o-mini
- 181% performance gain over CoT in the Gitlab domain.
- 92% performance gain over CoT in the Map domain.

Comparison with Tree Search Agent (Koh et al., 2024)
- The Tree search agent has a slightly higher absolute success rate (19.2%) than the WMA agent (16.6%).
- The WMA agent shows a larger relative improvement over vanilla CoT (+29.7%) than the Tree search agent (+28.0%).

Agent Performance in Mind2Web

From our experiments in Table 3, we observed the following results:

Comparison with Previous SOTA Methods
- WMA web agent is compared with MindAct (Deng et al., 2024) and AWM (Wang et al., 2024b).
- WMA web agent significantly outperforms AWM, achieving new SOTA performance.

Generalization Capability of WMA
- WMA web agent, trained on Mind2Web data, shows strong generalization capabilities.
- This makes our approach much more valuable in scenarios where collecting data for new web environments is non-trivial.

Analysis

Time and Cost Efficiency

Time Efficiency
- The Tree search agent takes an average of 748.3 seconds per user instruction due to state exploration and backtracking.
- The WMA web agent completes the same task in only 140.3 seconds by simulating actions instead of executing them.
- WMA is 5.3 times faster than the Tree search agent.
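The speedup figure follows directly from the two reported averages, as this short check shows (the function name is illustrative only):

```python
def speedup(baseline_seconds: float, ours_seconds: float) -> float:
    """Ratio of baseline time to our time per user instruction."""
    return baseline_seconds / ours_seconds


ratio = speedup(748.3, 140.3)
print(f"{ratio:.1f}x")  # prints 5.3x
```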

API Cost Efficiency
- Tree search agent incurs 6.8 times higher API costs due to its multi-modal inputs.

Ablation Study

We conduct several ablation studies on our WMA web agent with 200 randomly sampled instances from WebArena (Shopping: 50; Gitlab: 50; Map: 100). We use GPT-4o-mini as the policy model.

We observe the following findings in Table 5:

- Accessing simulated next states in reward estimation improves agent performance.
- Fine-tuning facilitates better world models than prompt-based approaches.
- Abstracting observations elicits better next-state prediction.

Additionally, we reveal the following findings in Table 6 and Figure 6:

- Fine-tuning the value function is a reasonable alternative in scenarios where API budgets are limited.
- Our WMA web agent may benefit from exploring more future states when the budget allows.

Case Study

The WMA web agent successfully completes a task in the Gitlab domain of the WebArena benchmark (instance #175). Using GPT-4o as the policy model, the WMA web agent selects the most appropriate action, click [88], by leveraging its learned environment dynamics.
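The decision step behind such a case can be sketched as follows: candidate actions from the policy model are each simulated by the world model, the simulated outcomes are scored, and the best-scoring action is executed. This is a hedged illustration of the general idea, not the released code. `world_model` and `value_fn` are hypothetical callables, and the scoring scale is an assumption.

```python
from typing import Callable, List


def select_action(
    instruction: str,
    observation: str,
    candidates: List[str],                        # actions proposed by the policy model
    world_model: Callable[[str, str, str], str],  # hypothetical: (I, o_t, a_t) -> predicted change
    value_fn: Callable[[str, str], float],        # hypothetical: (I, predicted change) -> score
) -> str:
    """Pick the candidate whose simulated outcome receives the highest score."""
    scored = []
    for action in candidates:
        # Simulate the outcome instead of executing the action in the environment.
        predicted_change = world_model(instruction, observation, action)
        scored.append((value_fn(instruction, predicted_change), action))
    best_score, best_action = max(scored, key=lambda pair: pair[0])
    return best_action
```

Because the candidates are only simulated, no environment step is spent on actions that are rejected, which is where the time savings over executing and backtracking come from.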

Citation

@inproceedings{chae2024web,
  title={Web agents with world models: Learning and leveraging environment dynamics in web navigation},
  author={Chae, Hyungjoo and Kim, Namyoung and Ong, Kai Tzu-iunn and Gwak, Minju and Song, Gwanwoo and Kim, Jihoon and Kim, Sunghwan and Lee, Dongha and Yeo, Jinyoung},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}