Arrow Research search

Author name cluster

Yi Bin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers (7)

AAAI 2026 · Conference Paper

D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies

  • Sen Chen
  • Tong Zhao
  • Yi Bin
  • Fei Ma
  • Wenqi Shao
  • Zheng Wang

Developing intelligent agents capable of operating a wide range of Graphical User Interfaces (GUIs) with human-level proficiency is a key milestone on the path toward Artificial General Intelligence. However, most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, failing to reflect the complexity and unpredictability of real-world environments, particularly the presence of anomalies. To bridge this research gap, we propose D-GARA, a dynamic benchmarking framework for evaluating the robustness of Android GUI agents under real-world anomalies. D-GARA introduces a diverse set of real-world anomalies that GUI agents commonly face in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Based on the D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies to support broader community research. Comprehensive experiments demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments, highlighting the need for robustness-aware learning. D-GARA is modular and extensible, supporting the seamless integration of new tasks, anomaly types, and interaction scenarios to meet specific evaluation goals.
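
As a rough illustration of the kind of dynamic anomaly injection the abstract describes, the sketch below wraps a GUI environment and randomly overlays interruption screens that the agent must dismiss before making task progress; every class and method name here is hypothetical and not taken from the paper.

```python
import random

# Hypothetical sketch: wrap a base GUI environment and randomly surface
# real-world anomalies (permission dialogs, battery warnings, update prompts).

ANOMALIES = ["permission_dialog", "battery_warning", "update_prompt"]

class AnomalyInjectingEnv:
    def __init__(self, base_env, anomaly_prob=0.2, seed=0):
        self.base_env = base_env
        self.anomaly_prob = anomaly_prob
        self.rng = random.Random(seed)
        self.pending_anomaly = None

    def step(self, action):
        # While an anomaly is on screen, only a matching dismiss action clears it.
        if self.pending_anomaly is not None:
            if action == ("dismiss", self.pending_anomaly):
                self.pending_anomaly = None
            return self._observe(), 0.0, False  # no task progress meanwhile

        obs, reward, done = self.base_env.step(action)
        if not done and self.rng.random() < self.anomaly_prob:
            self.pending_anomaly = self.rng.choice(ANOMALIES)
        return self._observe(), reward, done

    def _observe(self):
        obs = self.base_env.observe()
        if self.pending_anomaly is not None:
            return {"screen": obs, "overlay": self.pending_anomaly}
        return obs
```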

AAAI 2025 · Conference Paper

DFDNet: Disentangling and Filtering Dynamics for Enhanced Video Prediction

  • Lianqiang Gan
  • Junyu Lai
  • Jingze Ju
  • Lianli Gao
  • Yi Bin

Videos inherently contain complex temporal dynamics across various spatial directions, often entangled in ways that obscure effective dynamic extraction. Previous studies typically process video spatiotemporal features without disentangling them, which hampers their ability to extract dynamic information. Additionally, the extraction of dynamics is disrupted by transient high-dynamic information in video sequences, e.g., noise or flicker, which has received limited attention in the literature. To tackle these problems, this paper proposes the Disentangling and Filtering Dynamics Network (DFDNet). Firstly, to disentangle the interwoven dynamics, DFDNet decomposes the spatially encoded video sequences into lower-dimensional sequences. Secondly, a learnable threshold filter is proposed to eliminate the transient high-dynamic information. Thirdly, the model incorporates an MLP to extract temporal dependencies from the disentangled and filtered sequences. DFDNet demonstrates competitive performance across four datasets covering both low- and high-resolution videos. Specifically, on the low-resolution Moving MNIST dataset, DFDNet achieves a 19% improvement in MSE over the previous state-of-the-art model. On the high-resolution SJTU4K dataset, it outperforms the previous state-of-the-art model by 10% on the LPIPS metric at similar inference time.
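
The learnable threshold filter is the most self-contained piece of the description; below is a minimal sketch of one plausible form, assuming the disentangled dynamics arrive as (batch, time, channels) sequences. It is an illustration only, not the paper's exact layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableThresholdFilter(nn.Module):
    """Drop transient high-dynamic components (e.g. noise or flicker) from
    frame-to-frame differences using a per-channel learned threshold."""

    def __init__(self, channels: int):
        super().__init__()
        self.raw_threshold = nn.Parameter(torch.zeros(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) disentangled low-dimensional sequences
        diff = x[:, 1:] - x[:, :-1]                 # frame-to-frame dynamics
        tau = F.softplus(self.raw_threshold)        # non-negative thresholds
        keep = (diff.abs() <= tau).float()          # suppress transient spikes
        return diff * keep
```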

ICLR 2025 · Conference Paper

Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

  • Yue Yang
  • Shuibo Zhang
  • Kaipeng Zhang
  • Yi Bin
  • Yu Wang 0002
  • Ping Luo 0002
  • Wenqi Shao

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across multimodal tasks such as visual perception and reasoning, leading to strong performance on various multimodal evaluation benchmarks. However, these benchmarks remain static and overlap with the pre-training data, resulting in fixed complexity constraints and data contamination issues, which raises concerns about the validity of the evaluation. To address these two challenges, we introduce a dynamic multimodal evaluation protocol called Vision-Language Bootstrapping (VLB). VLB provides a robust and comprehensive assessment of LVLMs with reduced data contamination and flexible complexity. To this end, VLB dynamically generates new visual question-answering samples through a multimodal bootstrapping module that modifies both images and language, while a judge module ensures that the newly generated samples remain consistent with the original ones. By composing various bootstrapping strategies, VLB offers dynamic variants of existing benchmarks with diverse complexities, enabling the evaluation to co-evolve with the ever-evolving capabilities of LVLMs. Extensive experimental results across multiple benchmarks, including SEEDBench, MMBench, and MME, show that VLB significantly reduces data contamination and exposes performance limitations of LVLMs.
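
The protocol can be read as a generate-then-verify loop. The sketch below shows that control flow only; the image/question editing strategies and the judge interface are assumptions, not VLB's actual API.

```python
# Illustrative bootstrapping loop: perturb an existing VQA sample and keep the
# variant only if a judge model deems it consistent with the original.

def bootstrap_sample(sample, edit_image, edit_question, judge, max_tries=5):
    for _ in range(max_tries):
        candidate = {
            "image": edit_image(sample["image"]),            # e.g. add/remove objects
            "question": edit_question(sample["question"]),   # e.g. rephrase, extend
            "answer": sample["answer"],
        }
        if judge(original=sample, candidate=candidate):      # consistency check
            return candidate
    return sample  # fall back to the original if no consistent variant is found
```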

AAAI 2025 · Conference Paper

Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation

  • Thong Thanh Nguyen
  • Xiaobao Wu
  • Yi Bin
  • Cong-Duy T Nguyen
  • See-Kiong Ng
  • Anh Tuan Luu

To equip artificial intelligence with a comprehensive understanding of the temporal world, video and 4D panoptic scene graph generation abstracts visual data into nodes that represent entities and edges that capture temporal relations. Existing methods encode entity masks tracked across the temporal dimension (mask tubes) and then predict their relations with a temporal pooling operation, which does not fully exploit the motion indicative of the entities' relations. To overcome this limitation, we introduce a contrastive representation learning framework that focuses on motion patterns for temporal scene graph generation. Firstly, our framework encourages the model to learn close representations for mask tubes of similar subject-relation-object triplets. Secondly, we push apart mask tubes from their temporally shuffled versions. Moreover, we also learn distant representations for mask tubes that belong to the same video but to different triplets. Extensive experiments show that our motion-aware contrastive framework significantly improves state-of-the-art methods on both video and 4D datasets.
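
The three objectives above fit a standard InfoNCE template: the anchor mask tube is pulled toward tubes of the same triplet and pushed away from its temporally shuffled copy and from other triplets in the same video. A minimal sketch with illustrative names and shapes (not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def motion_contrastive_loss(anchor, positive, shuffled, others, temperature=0.07):
    # anchor, positive, shuffled: (batch, dim) mask-tube embeddings
    # others: (batch, num_neg, dim) tubes from the same video, different triplets
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(torch.cat([shuffled.unsqueeze(1), others], dim=1), dim=-1)

    pos_logit = (anchor * positive).sum(-1, keepdim=True) / temperature
    neg_logits = torch.einsum("bd,bnd->bn", anchor, negatives) / temperature
    logits = torch.cat([pos_logit, neg_logits], dim=1)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)   # the positive pair sits at index 0
```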

AAAI 2025 · Conference Paper

Multi-Scale Contrastive Learning for Video Temporal Grounding

  • Thong Thanh Nguyen
  • Yi Bin
  • Xiaobao Wu
  • Zhiyuan Hu
  • Cong-Duy T Nguyen
  • See-Kiong Ng
  • Anh Tuan Luu

Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a multi-level structure known as a feature pyramid. In this structure, lower levels concentrate on short-range video moments, while higher levels address long-range moments. Because higher levels are downsampled to accommodate increasing moment lengths, their capacity to capture information is reduced, which in turn degrades the moment representations. To resolve this problem, we propose a contrastive learning framework to capture salient semantics among video moments. Our key idea is to leverage samples from the feature space of multiple stages of the video encoder itself, requiring neither data augmentation nor online memory banks to obtain positive and negative samples. To enable this, we introduce a sampling process that draws multiple video moments corresponding to a common query. By utilizing these moments' representations across video encoder layers, we instantiate a novel form of multi-scale and cross-scale contrastive learning that links local short-range video moments with global long-range video moments. Extensive experiments demonstrate the effectiveness of our framework for not only long-form but also short-form video grounding.
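
One simple way to picture the cross-scale term is as a symmetric InfoNCE between moment features of the same queries drawn from a short-range and a long-range pyramid level; the sketch below is only an illustration of that idea, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def cross_scale_contrastive_loss(short_range, long_range, temperature=0.1):
    # short_range, long_range: (num_queries, dim) moment features for the same
    # queries, taken from two different levels of the feature pyramid.
    s = F.normalize(short_range, dim=-1)
    l = F.normalize(long_range, dim=-1)
    logits = s @ l.t() / temperature               # (num_queries, num_queries)
    targets = torch.arange(s.size(0), device=s.device)
    # Diagonal entries are cross-scale positives; off-diagonal entries are negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```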

JBHI 2022 · Journal Article

Multi-Modality MR Image Synthesis via Confidence-Guided Aggregation and Cross-Modality Refinement

  • Bo Peng
  • Bingzheng Liu
  • Yi Bin
  • Lili Shen
  • Jianjun Lei

Magnetic resonance imaging (MRI) can provide multi-modality MR images by setting task-specific scan parameters, and has been widely used in disease diagnosis and treatment planning. However, in practical clinical applications, it is often difficult to obtain all modalities simultaneously due to patient discomfort, scanning costs, and other factors. Therefore, how to effectively utilize the available modality images to synthesize a missing modality image has become a hot research topic. In this paper, we propose a novel confidence-guided aggregation and cross-modality refinement network (CACR-Net) for multi-modality MR image synthesis, which effectively utilizes complementary and correlative information across modalities to synthesize high-quality target-modality images. Specifically, to exploit complementary modality-specific characteristics, a confidence-guided aggregation module adaptively aggregates the multiple target-modality images generated from the individual source-modality images using their corresponding confidence maps. Based on the aggregated target-modality image, a cross-modality refinement module further refines the result by mining correlative information among the source-modality images and the aggregated target-modality image. By training the proposed CACR-Net end to end, high-quality and sharp target-modality MR images are effectively synthesized. Experimental results on a widely used benchmark demonstrate that the proposed method outperforms state-of-the-art methods.
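
The confidence-guided aggregation step can be pictured as a per-pixel weighted fusion of the candidate target-modality images; a minimal sketch is below, with the softmax weighting and all tensor shapes assumed rather than taken from the paper.

```python
import torch

def confidence_guided_aggregation(candidates, confidences):
    # candidates:  (num_sources, batch, 1, H, W) target images, one synthesized
    #              from each source modality
    # confidences: (num_sources, batch, 1, H, W) per-pixel confidence maps
    weights = torch.softmax(confidences, dim=0)    # normalize across sources
    return (weights * candidates).sum(dim=0)       # pixel-wise weighted fusion
```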

AAAI 2019 · Conference Paper

MR-NET: Exploiting Mutual Relation for Visual Relationship Detection

  • Yi Bin
  • Yang Yang
  • Chaofan Tao
  • Zi Huang
  • Jingjing Li
  • Heng Tao Shen

Inferring the interactions between objects, a.k.a. visual relationship detection, is crucial for visual understanding, as it captures more definite concepts than object detection. Most previous work treats the interaction between a pair of objects as one-way and thus fails to exploit the mutual relation between objects, which is essential for modern visual applications. In this work, we propose a mutual relation net, dubbed MR-Net, to explore the mutual relation between paired objects for visual relationship detection. Specifically, we construct a mutual relation space to model the mutual interaction of paired objects, and employ a linear constraint to optimize the mutual interaction, which we call mutual relation learning. Our mutual relation learning does not introduce any parameters and can be adapted to improve the performance of other methods. In addition, we devise a semantic ranking loss that discriminatively penalizes predicates according to their semantic similarity, which is ignored by traditional loss functions (e.g., cross-entropy with softmax). MR-Net then optimizes mutual relation learning together with the semantic ranking loss using a Siamese network. Experimental results on two commonly used datasets (VG and VRD) demonstrate the superior performance of the proposed approach.
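
The semantic ranking loss can be read as a margin-based ranking objective in which the margin shrinks for predicates that are semantically close to the ground truth. The sketch below assumes a precomputed predicate-similarity matrix (e.g. from word embeddings of the predicate labels); it is an illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def semantic_ranking_loss(scores, target, predicate_similarity, base_margin=0.2):
    # scores: (batch, num_predicates) predicate logits
    # target: (batch,) ground-truth predicate indices
    # predicate_similarity: (num_predicates, num_predicates), values in [0, 1]
    target_score = scores.gather(1, target.unsqueeze(1))           # (batch, 1)
    sim_to_target = predicate_similarity[target]                   # (batch, num_predicates)
    margins = base_margin * (1.0 - sim_to_target)                  # smaller margin if similar
    violations = F.relu(scores - target_score + margins)
    violations = violations.scatter(1, target.unsqueeze(1), 0.0)   # skip the target itself
    return violations.mean()
```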