Author name cluster

Xiaojian Ma

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers

1 author row

AAAI Conference 2026 Conference Paper

TongUI: Internet-Scale Trajectories from Multimodal Web Tutorials for Generalized GUI Agents

Bofei Zhang
Zirui Shang
Zhi Gao
Wang Zhang
Rui Xie
Xiaojian Ma
Tao Yuan
Xinxiao Wu

Building Graphical User Interface (GUI) agents is a promising research direction, which simulates human interaction with computers or mobile phones to perform diverse GUI tasks. However, a major challenge in developing generalized GUI agents is the lack of sufficient trajectory data across various operating systems and applications, mainly due to the high cost of manual annotations. In this paper, we propose the TongUI framework that transforms millions of multimodal web tutorials into GUI trajectories for generalized GUI agents. Concretely, we crawl GUI videos and articles from the Internet and process them into GUI agent trajectory data. Based on this, we construct the GUI-Net-1M dataset, which contains 1 million trajectories across five operating systems and over 280 applications. To the best of our knowledge, this is the largest open-source GUI trajectory dataset. We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B/32B models on GUI-Net-1M, which shows consistent performance improvements on commonly used grounding and navigation benchmarks, outperforming baseline agents by 10\% on multiple benchmarks, showing the effectiveness of the GUI-Net-1M dataset and underscoring the significance of our TongUI framework.

PDF Details DOI

EAAI Journal 2024 Journal Article

A deep evidence fusion framework for apple leaf disease classification

Hang Wang
Jiaxu Zhang
Zhu Yin
Liucheng Huang
Jie Wang
Xiaojian Ma

Apple leaf disease is one of the main culprits of apple yield reduction. The accurate classification of apple leaf diseases is essential to reduce economic losses. However, current studies face the challenge that when diseases have similar visual symptoms, they are difficult to distinguish. This study provides a new solution from the perspective of multi-source evidence fusion. Specifically, we propose a deep evidence fusion framework using both multi-saliency map in Hue Saturation Value (HSV) color space and a belief Cauchy–Schwarz divergence. Then, a new evidence fusion method based on the belief Cauchy–Schwarz is proposed, which fills the gap between evidence theory and apple leaf disease classification. Experimental results present that the proposed method can boost the accuracy of particularly classification backbone networks, which achieves the best performance with 98. 1% in EfficientNetV2-S network and the highest improvement with 4. 8% in Van-T networks. In addition, a series of experiments are implemented to evaluate the proposal’s effectiveness and superiority. The proposed method is a suitable alternative for classifying apple leaf diseases with similar visual symptoms, and in the future, more plant diseases will be extended to this fusion framework as well.

Details DOI

NeurIPS Conference 2024 Conference Paper

Multi-modal Situated Reasoning in 3D Scenes

Xiongkun Linghu
Jiangyong Huang
Xuesong Niu
Xiaojian Ma
Baoxiong Jia
Siyuan Huang

Situation awareness is essential for understanding and reasoning about 3D scenes in embodied AI agents. However, existing datasets and benchmarks for situated understanding suffer from severe limitations in data modality, scope, diversity, and scale. To address these limitations, we propose Multi-modal Situated Question Answering (MSQA), a large-scale multi-modal situated reasoning dataset, scalably collected leveraging 3D scene graphs and vision-language models (VLMs) across a diverse range of real-world 3D scenes. MSQA includes 251K situated questionanswering pairs across 9 distinct question categories, covering complex scenarios and object modalities within 3D scenes. We introduce a novel interleaved multimodal input setting in our benchmark to provide both texts, images, and point clouds for situation and question description, aiming to resolve ambiguity in describing situations with single-modality inputs (e. g. , texts). Additionally, we devise the Multi-modal Next-step Navigation (MSNN) benchmark to evaluate models’ grounding of actions and transitions between situations. Comprehensive evaluations on reasoning and navigation tasks highlight the limitations of existing vision-language models and underscore the importance of handling multi-modal interleaved inputs and situation modeling. Experiments on data scaling and crossdomain transfer further demonstrate the effectiveness of leveraging MSQA as a pre-training dataset for developing more powerful situated reasoning models, contributing to advancements in 3D scene understanding for embodied AI.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

Zihao Wang
Shaofei Cai
Zhancun Mu
Haowei Lin
Ceyao Zhang
Xuejie Liu
Qing Li
Anji Liu

This paper presents OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$ and an imitation learning policy decoder conditioned on these tokens. These additional behavior tokens will be augmented to the vocabulary of pretrained Multimodal Language Models. With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc into unified token sequences and model them with autoregressive transformers. Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chain-of-thoughts), plan, answer questions, and act (by producing behavior tokens for the imitation learning policy decoder). OmniJARVIS demonstrates excellent performances on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles in interaction data formation, unified tokenization, and its scaling potentials. The dataset, models, and code will be released at https: //craftjarvis. org/OmniJARVIS.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

UltraEdit: Instruction-based Fine-Grained Image Editing at Scale

Haozhe Zhao
Xiaojian Ma
Liang Chen
Shuzheng Si
Rujie Wu
Kaikai An
Peiyu Yu
Minjia Zhang

This paper presents UltraEdit, a large-scale (~ 4M editing samples), automatically generated dataset for instruction-based image editing. Our key idea is to address the drawbacks in existing image editing datasets like InstructPix2Pix and MagicBrush, and provide a systematic approach to producing massive and high-quality image editing samples: 1) UltraEdit includes more diverse editing instructions by combining LLM creativity and in-context editing examples by human raters; 2) UltraEdit is anchored on real images (photographs or artworks), which offers more diversity and less biases than those purely synthesized by text-to-image models; 3) UltraEdit supports region-based editing with high-quality, automatically produced region annotations. Our experiments show that canonical diffusion-based editing baselines trained on UltraEdit set new records on challenging MagicBrush and Emu-Edit benchmarks, respectively. Our analysis further confirms the crucial role of real image anchors and region-based editing data. The dataset, code, and models will be made public.

PDF Details DOI

NeurIPS Conference 2021 Conference Paper

Unsupervised Foreground Extraction via Deep Region Competition

Peiyu Yu
Sirui Xie
Xiaojian Ma
Yixin Zhu
Ying Nian Wu
Song-Chun Zhu

We present Deep Region Competition (DRC), an algorithm designed to extract foreground objects from images in a fully unsupervised manner. Foreground extraction can be viewed as a special case of generic image segmentation that focuses on identifying and disentangling objects from the background. In this work, we rethink the foreground extraction by reconciling energy-based prior with generative image modeling in the form of Mixture of Experts (MoE), where we further introduce the learned pixel re-assignment as the essential inductive bias to capture the regularities of background regions. With this modeling, the foreground-background partition can be naturally found through Expectation-Maximization (EM). We show that the proposed method effectively exploits the interaction between the mixture components during the partitioning process, which closely connects to region competition, a seminal approach for generic image segmentation. Experiments demonstrate that DRC exhibits more competitive performances on complex real-world data and challenging multi-object scenes compared with prior methods. Moreover, we show empirically that DRC can potentially generalize to novel foreground objects even from categories unseen during training.

PDF Details

AAAI Conference 2020 Conference Paper

Reinforcement Learning from Imperfect Demonstrations under Soft Expert Guidance

Mingxuan Jing
Xiaojian Ma
Wenbing Huang
Fuchun Sun
Chao Yang
Bin Fang
Huaping Liu

In this paper, we study Reinforcement Learning from Demonstrations (RLfD) that improves the exploration efﬁciency of Reinforcement Learning (RL) by providing expert demonstrations. Most of existing RLfD methods require demonstrations to be perfect and sufﬁcient, which yet is unrealistic to meet in practice. To work on imperfect demonstrations, we ﬁrst deﬁne an imperfect expert setting for RLfD in a formal way, and then point out that previous methods suffer from two issues in terms of optimality and convergence, respectively. Upon the theoretical ﬁndings we have derived, we tackle these two issues by regarding the expert guidance as a soft constraint on regulating the policy exploration of the agent, which eventually leads to a constrained optimization problem. We further demonstrate that such problem is able to be addressed efﬁciently by performing a local linear search on its dual form. Considerable empirical evaluations on a comprehensive collection of benchmarks indicate our method attains consistent improvement over other RLfD counterparts.

PDF Details

AAAI Conference 2020 Conference Paper

Theory-Based Causal Transfer:Integrating Instance-Level Induction and Abstract-Level Structure Learning

Mark Edmonds
Xiaojian Ma
Siyuan Qi
Yixin Zhu
Hongjing Lu
Song-Chun Zhu

Learning transferable knowledge across similar but different settings is a fundamental component of generalized intelligence. In this paper, we approach the transfer learning challenge from a causal theory perspective. Our agent is endowed with two basic yet general theories for transfer learning: (i) a task shares a common abstract structure that is invariant across domains, and (ii) the behavior of speciﬁc features of the environment remain constant across domains. We adopt a Bayesian perspective of causal theory induction and use these theories to transfer knowledge between environments. Given these general theories, the goal is to train an agent by interactively exploring the problem space to (i) discover, form, and transfer useful abstract and structural knowledge, and (ii) induce useful knowledge from the instance-level attributes observed in the environment. A hierarchy of Bayesian structures is used to model abstract-level structural causal knowledge, and an instance-level associative learning scheme learns which speciﬁc objects can be used to induce state changes through interaction. This model-learning scheme is then integrated with a model-based planner to achieve a task in the Open- Lock environment, a virtual “escape room” with a complex hierarchy that requires agents to reason about an abstract, generalized causal structure. We compare performances against a set of predominate model-free reinforcement learning (RL) algorithms. RL agents showed poor ability transferring learned knowledge across different trials. Whereas the proposed model revealed similar performance trends as human learners, and more importantly, demonstrated transfer behavior across trials and learning situations. 1

PDF Details

NeurIPS Conference 2019 Conference Paper

Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement

Chao Yang
Xiaojian Ma
Wenbing Huang
Fuchun Sun
Huaping Liu
Junzhou Huang
Chuang Gan

This paper studies Learning from Observations (LfO) for imitation learning with access to state-only demonstrations. In contrast to Learning from Demonstration (LfD) that involves both action and state supervisions, LfO is more practical in leveraging previously inapplicable resources (e. g. , videos), yet more challenging due to the incomplete expert guidance. In this paper, we investigate LfO and its difference with LfD in both theoretical and practical perspectives. We first prove that the gap between LfD and LfO actually lies in the disagreement of inverse dynamics models between the imitator and expert, if following the modeling approach of GAIL. More importantly, the upper bound of this gap is revealed by a negative causal entropy which can be minimized in a model-free way. We term our method as Inverse-Dynamics-Disagreement-Minimization (IDDM) which enhances the conventional LfO method through further bridging the gap to LfD. Considerable empirical results on challenging benchmarks indicate that our method attains consistent improvements over other LfO counterparts.

PDF Details

AAAI Conference 2019 Conference Paper

Task Transfer by Preference-Based Cost Learning

Mingxuan Jing
Xiaojian Ma
Wenbing Huang
Fuchun Sun
Huaping Liu

The goal of task transfer in reinforcement learning is migrating the action policy of an agent to the target task from the source task. Given their successes on robotic action planning, current methods mostly rely on two requirements: exactlyrelevant expert demonstrations or the explicitly-coded cost function on target task, both of which, however, are inconvenient to obtain in practice. In this paper, we relax these two strong conditions by developing a novel task transfer framework where the expert preference is applied as a guidance. In particular, we alternate the following two steps: Firstly, letting experts apply pre-defined preference rules to select related expert demonstrates for the target task. Secondly, based on the selection result, we learn the target cost function and trajectory distribution simultaneously via enhanced Adversarial MaxEnt IRL and generate more trajectories by the learned target distribution for the next preference selection. The theoretical analysis on the distribution learning and convergence of the proposed algorithm are provided. Extensive simulations on several benchmarks have been conducted for further verifying the effectiveness of the proposed method.

PDF Details