Arrow Research search

Author name cluster

Huan Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
1 author row

Possible papers

9

TMLR Journal 2025 Journal Article

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

  • Yu Gu
  • Kai Zhang
  • Yuting Ning
  • Boyuan Zheng
  • Boyu Gou
  • Tianci Xue
  • Cheng Chang
  • Sanjari Srivastava

Language agents based on large language models (LLMs) have demonstrated great promise in automating web-based tasks. Recent work has shown that incorporating advanced planning algorithms, e.g., tree search, is advantageous over reactive planning for web agents. However, unlike simulated sandbox environments, real-world environments such as the web are rife with irreversible actions. This undermines the feasibility of backtracking, a cornerstone of (tree) search. Overly relying on test-time search also hurts efficiency. We advocate model-based planning for web agents that employs a world model to simulate and deliberate over the outcome of each candidate action before committing to one. We systematically explore this paradigm by: (1) Proposing a model-based planning framework, WebDreamer, which employs LLMs to serve as both world models and value functions; (2) Training specialized LLMs as world models with a scalable data synthesis pipeline. Empirical results demonstrate that WebDreamers achieves substantial performance improvements over reactive baselines. It is competitive, while being - times more efficient, with tree search in sandbox environments (VisualWebArena) and also works effectively on real-world websites (Online-Mind2Web and Mind2Web-Live). Furthermore, our trained world model, Dreamer-7B, performs comparable to GPT-4o, highlighting the potential of specialized world models for efficient and effective planning in complex web environments. All code, models, and data are publicly available at https://github.com/OSU-NLP-Group/WebDreamer

NeurIPS Conference 2025 Conference Paper

Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

  • Boyu Gou
  • Zanming Huang
  • Yuting Ning
  • Yu Gu
  • Michael Lin
  • Weijian Qi
  • Andrei Kopanev
  • Botao Yu

Agentic search such as Deep Research systems-where agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers-represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of ten frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, highlighting its great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.

NeurIPS Conference 2024 Conference Paper

Grokking of Implicit Reasoning in Transformers: A Mechanistic Journey to the Edge of Generalization

  • Boshi Wang
  • Xiang Yue
  • Yu Su
  • Huan Sun

We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two representative reasoning types, composition and comparison, we consistently find that transformers can learn implicit reasoning, but only through grokking, i. e. , extended training far beyond overfitting. The levels of generalization also vary across reasoning types: when faced with out-of-distribution examples, transformers fail to systematically generalize for composition but succeed for comparison. We delve into the model's internals throughout training, conducting analytical experiments that reveal: 1) the mechanism behind grokking, such as the formation of the generalizing circuit and its relation to the relative efficiency of generalizing and memorizing circuits, and 2) the connection between systematicity and the configuration of the generalizing circuit. Our findings guide data and training setup to better induce implicit reasoning and suggest potential improvements to the transformer architecture, such as encouraging cross-layer knowledge sharing. Furthermore, we demonstrate that for a challenging reasoning task with a large search space, GPT-4-Turbo and Gemini-1. 5-Pro based on non-parametric memory fail badly regardless of prompting styles or retrieval augmentation, while a fully grokked transformer can achieve near-perfect accuracy, showcasing the power of parametric memory for complex reasoning.

NeurIPS Conference 2023 Conference Paper

MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

  • Kai Zhang
  • Lingbo Mo
  • Wenhu Chen
  • Huan Sun
  • Yu Su

Text-guided image editing is widely needed in daily life, ranging from personal use to professional applications such as Photoshop. However, existing methods are either zero-shot or trained on an automatically synthesized dataset, which contains a high volume of noise. Thus, they still require lots of manual tuning to produce desirable outcomes in practice. To address this issue, we introduce MagicBrush, the first large-scale, manually annotated dataset for instruction-guided real image editing that covers diverse scenarios: single-turn, multi-turn, mask-provided, and mask-free editing. MagicBrush comprises over 10K manually annotated triplets (source image, instruction, target image), which supports trainining large-scale text-guided image editing models. We fine-tune InstructPix2Pix on MagicBrush and show that the new model can produce much better images according to human evaluation. We further conduct extensive experiments to evaluate current image editing baselines from multiple dimensions including quantitative, qualitative, and human evaluations. The results reveal the challenging nature of our dataset and the gap between current baselines and real-world editing needs.

NeurIPS Conference 2023 Conference Paper

Mind2Web: Towards a Generalist Agent for the Web

  • Xiang Deng
  • Yu Gu
  • Boyuan Zheng
  • Shijie Chen
  • Sam Stevens
  • Boshi Wang
  • Huan Sun
  • Yu Su

We introduce Mind2Web, the first dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Existing datasets for web agents either use simulated websites or only cover a limited set of websites and tasks, thus not suitable for generalist web agents. With over 2, 000 open-ended tasks collected from 137 websites spanning 31 domains and crowdsourced action sequences for the tasks, Mind2Web provides three necessary ingredients for building generalist web agents: 1) diverse domains, websites, and tasks, 2) use of real-world websites instead of simulated and simplified ones, and 3) a broad spectrum of user interaction patterns. Based on Mind2Web, we conduct an initial exploration of using large language models (LLMs) for building generalist web agents. While the raw HTML of real-world websites are often too large to be fed to LLMs, we show that first filtering it with a small LM significantly improves the effectiveness and efficiency of LLMs. Our solution demonstrates a decent level of performance, even on websites or entire domains the model has never seen before, but there is still a substantial room to improve towards truly generalizable agents. We open-source our dataset, model implementation, and trained models (https: //osu-nlp-group. github. io/Mind2Web) to facilitate further research on building a generalist agent for the web.

IJCAI Conference 2020 Conference Paper

EndCold: An End-to-End Framework for Cold Question Routing in Community Question Answering Services

  • Jiankai Sun
  • Jie Zhao
  • Huan Sun
  • Srinivasan Parthasarathy

Routing newly posted questions (a. k. a cold questions) to potential answerers with suitable expertise in Community Question Answering sites (CQAs) is an important and challenging task. The existing methods either focus only on embedding the graph structural information and are less effective for newly posted questions, or adopt manually engineered feature vectors that are not as representative as the graph embedding methods. Therefore, we propose to address the challenge of leveraging heterogeneous graph and textual information for cold question routing by designing an end-to-end framework that jointly learns CQA node embeddings and finds best answerers for cold questions. We conducted extensive experiments to confirm the usefulness of incorporating the textual information from question tags and demonstrate that an end-2-end framework can achieve promising performances on routing newly posted questions asked by both existing users and newly registered users.

AAAI Conference 2020 Conference Paper

Question-Driven Purchasing Propensity Analysis for Recommendation

  • Long Chen
  • Ziyu Guan
  • Qibin Xu
  • Qiong Zhang
  • Huan Sun
  • Guangyue Lu
  • Deng Cai

Merchants of e-commerce Websites expect recommender systems to entice more consumption which is highly correlated with the customers’ purchasing propensity. However, most existing recommender systems focus on customers’ general preference rather than purchasing propensity often governed by instant demands which we deem to be well conveyed by the questions asked by customers. A typical recommendation scenario is: Bob wants to buy a cell phone which can play the game PUBG. He is interested in HUAWEI P20 and asks “can PUBG run smoothly on this phone? ” under it. Then our system will be triggered to recommend the most eligible cell phones to him. Intuitively, diverse user questions could probably be addressed in reviews written by other users who have similar concerns. To address this recommendation problem, we propose a novel Question-Driven Attentive Neural Network (QDANN) to assess the instant demands of questioners and the eligibility of products based on user generated reviews, and do recommendation accordingly. Without supervision, QDANN can well exploit reviews to achieve this goal. The attention mechanisms can be used to provide explanations for recommendations. We evaluate QDANN in three domains of Taobao. The results show the efficacy of our method and its superiority over baseline methods.

AAAI Conference 2019 Conference Paper

Answer Identification from Product Reviews for User Questions by Multi-Task Attentive Networks

  • Long Chen
  • Ziyu Guan
  • Wei Zhao
  • Wanqing Zhao
  • Xiaopeng Wang
  • Zhou Zhao
  • Huan Sun

Online Shopping has become a part of our daily routine, but it still cannot offer intuitive experience as store shopping. Nowadays, most e-commerce Websites offer a Question Answering (QA) system that allows users to consult other users who have purchased the product. However, users still need to wait patiently for others’ replies. In this paper, we investigate how to provide a quick response to the asker by plausible answer identification from product reviews. By analyzing the similarity and discrepancy between explicit answers and reviews that can be answers, a novel multi-task deep learning method with carefully designed attention mechanisms is developed. The method can well exploit large amounts of user generated QA data and a few manually labeled review data to address the problem. Experiments on data collected from Amazon demonstrate its effectiveness and superiority over competitive baselines.

AAAI Conference 2019 Conference Paper

Interactive Semantic Parsing for If-Then Recipes via Hierarchical Reinforcement Learning

  • Ziyu Yao
  • Xiujun Li
  • Jianfeng Gao
  • Brian Sadler
  • Huan Sun

Given a text description, most existing semantic parsers synthesize a program in one shot. However, it is quite challenging to produce a correct program solely based on the description, which in reality is often ambiguous or incomplete. In this paper, we investigate interactive semantic parsing, where the agent can ask the user clarification questions to resolve ambiguities via a multi-turn dialogue, on an important type of programs called “If-Then recipes. ” We develop a hierarchical reinforcement learning (HRL) based agent that significantly improves the parsing performance with minimal questions to the user. Results under both simulation and human evaluation show that our agent substantially outperforms non-interactive semantic parsers and rule-based agents. 1