Arrow Research search

Author name cluster

Zhi Gao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
1 author row

Possible papers


AAAI Conference 2026 Conference Paper

TongUI: Internet-Scale Trajectories from Multimodal Web Tutorials for Generalized GUI Agents

  • Bofei Zhang
  • Zirui Shang
  • Zhi Gao
  • Wang Zhang
  • Rui Xie
  • Xiaojian Ma
  • Tao Yuan
  • Xinxiao Wu

Building Graphical User Interface (GUI) agents is a promising research direction: such agents simulate human interaction with computers or mobile phones to perform diverse GUI tasks. However, a major challenge in developing generalized GUI agents is the lack of sufficient trajectory data across various operating systems and applications, mainly due to the high cost of manual annotation. In this paper, we propose the TongUI framework, which transforms millions of multimodal web tutorials into GUI trajectories for generalized GUI agents. Concretely, we crawl GUI videos and articles from the Internet and process them into GUI agent trajectory data. Based on this, we construct the GUI-Net-1M dataset, which contains 1 million trajectories across five operating systems and over 280 applications. To the best of our knowledge, this is the largest open-source GUI trajectory dataset. We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B/32B models on GUI-Net-1M. The agent shows consistent improvements on commonly used grounding and navigation benchmarks, outperforming baseline agents by 10% on multiple benchmarks, which demonstrates the effectiveness of the GUI-Net-1M dataset and the value of the TongUI framework.
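The abstract describes trajectories mined from tutorials: per-task sequences of screenshots paired with grounded actions. A minimal sketch of what one such record might look like — all field names here are hypothetical, not the released GUI-Net-1M schema:

```python
# Hypothetical shape of a single GUI trajectory record, in the spirit of a
# dataset like GUI-Net-1M; every field name below is illustrative only.
trajectory = {
    "source": "web_tutorial_video",      # crawled video or article
    "os": "Windows",                     # one of the five operating systems
    "app": "Excel",                      # one of the 280+ applications
    "task": "Insert a pivot table",      # natural-language goal
    "steps": [
        {"screenshot": "frame_001.png",  # observation before the action
         "action": "click",
         "target": (412, 236)},          # screen coordinates of the click
        {"screenshot": "frame_002.png",
         "action": "type",
         "text": "Sales2024"},
    ],
}

def step_count(record):
    """Return the number of grounded action steps in a trajectory."""
    return len(record["steps"])
```

A fine-tuning pipeline would flatten such records into (screenshot, instruction, action) training examples for the vision-language controller.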

NeurIPS Conference 2025 Conference Paper

Beyond the Seen: Bounded Distribution Estimation for Open-Vocabulary Learning

  • Xiaomeng Fan
  • Yuchuan Mao
  • Zhi Gao
  • Yuwei Wu
  • Jin Chen
  • Yunde Jia

Open-vocabulary learning requires modeling the data distribution in open environments, which comprises both seen-class and unseen-class data. Existing methods estimate this distribution using seen-class data alone, and the absence of unseen classes makes the estimation error inherently unidentifiable. Intuitively, learning beyond the seen classes is crucial for bounding the estimation error. We theoretically demonstrate that the distribution can be effectively estimated by generating unseen-class data, through which the estimation error is upper-bounded. Building on this theoretical insight, we propose a novel open-vocabulary learning method that generates unseen-class data to estimate the distribution in open environments. The method consists of a class-domain-wise data generation pipeline and a distribution alignment algorithm. The generation pipeline produces unseen-class data under the guidance of a hierarchical semantic tree and domain information inferred from the seen-class data, facilitating accurate distribution estimation. With the generated data, the distribution alignment algorithm estimates and maximizes the posterior probability to enhance generalization in open-vocabulary learning. Extensive experiments on 11 datasets demonstrate that our method outperforms baseline approaches by up to 14%, highlighting its effectiveness and superiority.

NeurIPS Conference 2025 Conference Paper

Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning

  • Pengxiang Li
  • Zhi Gao
  • Bofei Zhang
  • Yapeng Mi
  • Xiaojian (Shawn) Ma
  • Chenrui Shi
  • Tao Yuan
  • Yuwei Wu

Multimodal agents, which integrate a controller (e.g., a vision-language model) with external tools, have demonstrated remarkable capabilities in tackling complex multimodal tasks. Existing approaches for training these agents, whether supervised fine-tuning or reinforcement learning, depend on extensive human-annotated task-answer pairs and tool trajectories. For complex multimodal tasks, however, such annotations are prohibitively expensive or impractical to obtain. In this paper, we propose SPORT, an iterative tool usage exploration method for multimodal agents that requires no pre-collected data and refines tool-usage trajectories via step-wise preference optimization. Our method enables multimodal agents to autonomously discover effective tool usage strategies through self-exploration and optimization, eliminating the bottleneck of human annotation. SPORT has four iterative components: task synthesis, step sampling, step verification, and preference tuning. We first synthesize multimodal tasks using language models. Then, we introduce a novel trajectory exploration scheme in which step sampling and step verification are executed alternately to solve the synthesized tasks. In step sampling, the agent tries different tools and obtains the corresponding results. In step verification, a verifier provides AI feedback to construct step-wise preference data. This data is subsequently used to update the controller for tool usage through preference tuning, producing a SPORT agent. By interacting with real environments, the SPORT agent gradually evolves into a more refined and capable system. Evaluation on the GTA and GAIA benchmarks shows that the SPORT agent achieves 6.41% and 3.64% improvements, respectively, underscoring the generalization and effectiveness of our method.
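The four-component loop the abstract names (task synthesis, step sampling, step verification, preference tuning) can be sketched with stub functions. Everything below is an illustrative stand-in — the task generator, tool sampler, and verifier are toys, not the authors' implementation:

```python
import random

# Toy sketch of SPORT's four iterative components. In the real method the
# tasks come from language models, the candidates from actual tool calls,
# and the verifier from an AI feedback model; here all three are stubs.

def synthesize_tasks(n):
    return [f"task_{i}" for i in range(n)]

def sample_steps(task, n_candidates=3):
    # The agent tries different tools; each candidate is a (tool, score) stub.
    rng = random.Random(task)  # deterministic per-task stub randomness
    return [(f"tool_{i}", rng.random()) for i in range(n_candidates)]

def verify(candidates):
    # Rank candidates by the stub score and form a (chosen, rejected) pair.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ranked[0], ranked[-1]

def sport_iteration(n_tasks=4):
    preference_data = []
    for task in synthesize_tasks(n_tasks):
        chosen, rejected = verify(sample_steps(task))
        preference_data.append(
            {"task": task, "chosen": chosen, "rejected": rejected}
        )
    # In the real method this data drives preference tuning of the controller,
    # after which the loop repeats with the updated agent.
    return preference_data
```

Each outer iteration yields step-wise preference pairs; tuning the controller on them and re-entering the loop is what makes the exploration iterative.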

NeurIPS Conference 2024 Conference Paper

FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models

  • Pengxiang Li
  • Zhi Gao
  • Bofei Zhang
  • Tao Yuan
  • Yuwei Wu
  • Mehrtash Harandi
  • Yunde Jia
  • Song-Chun Zhu

Vision-language models (VLMs) have achieved impressive progress in diverse applications, becoming a prevalent research direction. In this paper, we build FIRE, a feedback-refinement dataset consisting of 1.1M multi-turn conversations derived from 27 source datasets, empowering VLMs to spontaneously refine their responses based on user feedback across diverse tasks. To scale up data collection, FIRE is built in two components: FIRE-100K and FIRE-1M, where FIRE-100K is generated by GPT-4V and FIRE-1M is freely generated by models trained on FIRE-100K. We then build FIRE-Bench, a benchmark to comprehensively evaluate the feedback-refining capability of VLMs, which contains 11K feedback-refinement conversations as test data, two evaluation settings, and a model that provides feedback to VLMs. We develop the FIRE-LLaVA model by fine-tuning LLaVA on FIRE-100K and FIRE-1M; it shows remarkable feedback-refining capability on FIRE-Bench and outperforms untrained VLMs by 50%, enabling more efficient user-agent interactions and underscoring the significance of the FIRE dataset.
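The feedback-refinement interaction FIRE trains for is a short loop: the model answers, a feedback provider critiques, and the model revises until the answer is accepted. A minimal sketch with stub components — neither the model nor the feedback function here resembles the actual FIRE pipeline:

```python
# Toy feedback-refinement loop of the kind FIRE is built to train and
# evaluate. toy_model and toy_feedback are stubs: the real setting uses a
# VLM and (in FIRE-Bench) a trained feedback model.

def toy_model(question, feedback=None):
    # Stub VLM: produces a wrong draft first, then corrects after feedback.
    return "cat" if feedback is None else "dog"

def toy_feedback(answer, ground_truth):
    # Stub feedback provider; None signals the answer is accepted.
    return None if answer == ground_truth else f"'{answer}' is wrong; look again."

def refine(question, ground_truth, max_turns=3):
    conversation, feedback = [], None
    for _ in range(max_turns):
        answer = toy_model(question, feedback)
        conversation.append({"answer": answer, "feedback_in": feedback})
        feedback = toy_feedback(answer, ground_truth)
        if feedback is None:             # accepted: stop refining
            break
    return answer, conversation
```

Each FIRE conversation is essentially a logged run of such a loop; the dataset supervises the model to make productive use of the feedback turn.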

AAAI Conference 2022 Conference Paper

Efficient Riemannian Meta-Optimization by Implicit Differentiation

  • Xiaomeng Fan
  • Yuwei Wu
  • Zhi Gao
  • Yunde Jia
  • Mehrtash Harandi

To solve optimization problems with nonlinear constraints, recently developed Riemannian meta-optimization methods show promise: they train neural networks as optimizers to perform optimization on Riemannian manifolds. A key challenge is the heavy computational and memory burden, because computing the meta-gradient with respect to the optimizer involves a series of time-consuming derivatives and requires storing large computation graphs in memory. In this paper, we propose an efficient Riemannian meta-optimization method that decouples the complex computation scheme from the meta-gradient. We derive Riemannian implicit differentiation to compute the meta-gradient by establishing a link between Riemannian optimization and the implicit function theorem. As a result, updating our optimizer depends only on the final two iterations, which speeds up our method and significantly reduces its memory footprint. We theoretically study the computational load and memory footprint of our method for long optimization trajectories, and conduct an empirical study to demonstrate its benefits. Evaluations on three optimization problems over different Riemannian manifolds show that our method achieves state-of-the-art performance in both convergence speed and the quality of the obtained optima.
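The implicit-differentiation idea can be written down in its generic Euclidean form (a standard sketch under a fixed-point assumption, not the paper's exact Riemannian derivation): if the inner optimization with learned update \(\Phi\) and meta-parameters \(\theta\) converges to a fixed point \(w^\* = \Phi(w^\*, \theta)\), the implicit function theorem gives the meta-gradient without unrolling the whole trajectory:

```latex
% Generic implicit-differentiation form of the meta-gradient, assuming the
% inner loop converges to a fixed point w^* = \Phi(w^*, \theta):
\frac{\mathrm{d} w^*}{\mathrm{d} \theta}
  = \Bigl(I - \frac{\partial \Phi}{\partial w}\Big|_{w^*}\Bigr)^{-1}
    \frac{\partial \Phi}{\partial \theta}\Big|_{w^*},
\qquad
\frac{\mathrm{d} \mathcal{L}}{\mathrm{d} \theta}
  = \frac{\partial \mathcal{L}}{\partial w^*}\,
    \frac{\mathrm{d} w^*}{\mathrm{d} \theta}.
```

Only quantities at (or near) the fixed point appear, which is why such methods can avoid storing the full computation graph of the inner trajectory.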

NeurIPS Conference 2022 Conference Paper

Hyperbolic Feature Augmentation via Distribution Estimation and Infinite Sampling on Manifolds

  • Zhi Gao
  • Yuwei Wu
  • Yunde Jia
  • Mehrtash Harandi

Learning in hyperbolic spaces has attracted growing attention recently, owing to their capabilities in capturing hierarchical structures of data. However, existing learning algorithms in the hyperbolic space tend to overfit when limited data is given. In this paper, we propose a hyperbolic feature augmentation method that generates diverse and discriminative features in the hyperbolic space to combat overfitting. We employ a wrapped hyperbolic normal distribution to model augmented features, and use a neural ordinary differential equation module that benefits from meta-learning to estimate the distribution. This is to reduce the bias of estimation caused by the scarcity of data. We also derive an upper bound of the augmentation loss, which enables us to train a hyperbolic model by using an infinite number of augmentations. Experiments on few-shot learning and continual learning tasks show that our method significantly improves the performance of hyperbolic algorithms in scarce data regimes.
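In spirit, generating hyperbolic features means sampling a Gaussian in a tangent space and pushing the samples onto the manifold. A minimal sketch on the Poincaré ball of curvature -1, using only the exponential map at the origin — the paper's full method additionally uses a wrapped normal at a learned mean and a neural-ODE estimator, none of which is reproduced here:

```python
import numpy as np

def exp_map_origin(v):
    """Exponential map at the origin of the unit Poincare ball (c = 1)."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return v
    return np.tanh(n) * v / n            # result has norm tanh(n) < 1

def sample_hyperbolic_features(num, dim, scale=0.5, seed=0):
    """Sample tangent Gaussians at the origin and map them onto the ball."""
    rng = np.random.default_rng(seed)
    tangent = rng.normal(0.0, scale, size=(num, dim))
    return np.stack([exp_map_origin(v) for v in tangent])

feats = sample_hyperbolic_features(8, 4)
# Every generated feature lies strictly inside the unit ball.
```

Because the norm of each output is tanh of the tangent norm, generated features always stay on the manifold, which is the property any hyperbolic augmentation scheme must preserve.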

AAAI Conference 2021 Conference Paper

Learning a Gradient-free Riemannian Optimizer on Tangent Spaces

  • Xiaomeng Fan
  • Zhi Gao
  • Yuwei Wu
  • Yunde Jia
  • Mehrtash Harandi

A principal way of addressing constrained optimization problems is to model them as problems on Riemannian manifolds. Recently, Riemannian meta-optimization has provided a promising way to solve constrained optimization problems by learning optimizers on Riemannian manifolds in a data-driven fashion, making it possible to design task-specific constrained optimizers. A close look at Riemannian meta-optimization reveals that learning optimizers on Riemannian manifolds requires differentiating through the nonlinear Riemannian optimization, which is complex and computationally expensive. In this paper, we propose a simple yet efficient Riemannian meta-optimization method that learns to optimize on the tangent spaces of manifolds. To this end, we present a gradient-free optimizer on tangent spaces, which takes the parameters of the model along with the training data as inputs and generates the updated parameters directly. As a result, the constrained optimization is transferred from Riemannian manifolds to tangent spaces, where complex Riemannian operations (e.g., retraction operations) are removed from the optimizer, and learning the optimizer does not require differentiating through the Riemannian optimization. We empirically show that our method enables efficient learning of the optimizer while enjoying a good optimization trajectory in a data-driven manner.
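The tangent-space pattern the abstract describes — propose an update in the flat tangent space, then map it back onto the manifold — can be illustrated on the unit sphere, where the projection and exponential map have closed forms. The "optimizer" below is a placeholder function; in the paper that role is played by a learned, gradient-free network:

```python
import numpy as np

def project_to_tangent(x, v):
    """Project an ambient vector v onto the tangent space of the unit sphere at x."""
    return v - np.dot(x, v) * x

def exp_map(x, v):
    """Exponential map on the unit sphere: move from x along tangent vector v."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return x
    return np.cos(n) * x + np.sin(n) * (v / n)

def toy_tangent_update(x, raw_update):
    # Stand-in for a learned optimizer: work in the tangent space (flat,
    # unconstrained), then return to the manifold in a single final step.
    v = project_to_tangent(x, raw_update)
    return exp_map(x, v)

x = np.array([1.0, 0.0, 0.0])                    # a point on the unit sphere
x_new = toy_tangent_update(x, np.array([0.0, 0.3, -0.2]))
```

The appeal of the tangent-space view is visible even in this toy: the update itself is ordinary Euclidean arithmetic, and the manifold constraint is restored by one map at the end rather than threaded through every optimizer operation.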

AAAI Conference 2020 Conference Paper

Revisiting Bilinear Pooling: A Coding Perspective

  • Zhi Gao
  • Yuwei Wu
  • Xiaoxun Zhang
  • Jindou Dai
  • Yunde Jia
  • Mehrtash Harandi

Bilinear pooling has achieved state-of-the-art performance in fusing features for various machine learning tasks, owing to its ability to capture complex associations between features. Despite this success, bilinear pooling suffers from redundancy and burstiness issues, mainly due to the rank-one property of the resulting representation. In this paper, we prove that bilinear pooling is in fact a similarity-based coding-pooling formulation. This insight enables us to devise a new feature fusion algorithm, the factorized bilinear coding (FBC) method, that overcomes the drawbacks of bilinear pooling. We show that FBC can generate compact and discriminative representations with substantially fewer parameters. Experiments on two challenging tasks, image classification and visual question answering, demonstrate that our method surpasses bilinear pooling by a large margin.
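For reference, the baseline the paper revisits is the classic bilinear pooling operation: the flattened outer product of two feature vectors, usually followed by signed square-root and L2 normalization. This sketch shows that standard formulation only, not the paper's FBC method:

```python
import numpy as np

def bilinear_pool(x, y):
    """Classic bilinear pooling of two feature vectors."""
    z = np.outer(x, y).ravel()              # rank-one outer product, flattened
    z = np.sign(z) * np.sqrt(np.abs(z))     # signed sqrt tempers burstiness
    norm = np.linalg.norm(z)
    return z / norm if norm > 0 else z      # L2 normalization

x = np.array([1.0, -2.0])
y = np.array([0.5, 3.0])
z = bilinear_pool(x, y)                     # 4-dimensional fused feature
```

The rank-one structure of `np.outer(x, y)` is exactly the property the abstract blames for redundancy: the output dimension grows as the product of the input dimensions while carrying a single rank of information, which is what a factorized coding scheme like FBC is designed to compress.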