Arrow Research search

Author name cluster

Yali Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

20 papers
2 author rows

Possible papers (20)

AAAI Conference 2026 Conference Paper

G-UBS: Towards Robust Understanding of Implicit Feedback via Group-Aware User Behavior Simulation

  • Boyu Chen
  • Siran Chen
  • Zhengrong Yue
  • Kainan Yan
  • Chenyun Yu
  • Beibei Kong
  • Lei Cheng
  • Chengxiang Zhuo

User feedback is critical for refining recommendation systems, yet explicit feedback (e.g., likes or dislikes) remains scarce in practice. As a more feasible alternative, inferring user preferences from massive implicit feedback has shown great potential (e.g., a user quickly skipping a recommended video usually indicates disinterest). Unfortunately, implicit feedback is often noisy: a user might skip a video due to accidental clicks or other reasons, rather than disliking it. Such noise can easily lead to misjudged user interests, thereby undermining recommendation performance. To address this issue, we propose a novel Group-aware User Behavior Simulation (G-UBS) paradigm, which leverages contextual guidance from relevant user groups, enabling robust and in-depth interpretation of implicit feedback for individual users. Specifically, G-UBS operates via two key agents. First, the User Group Manager (UGM) effectively clusters users to generate group profiles, utilizing a "summarize-cluster-reflect" workflow based on LLMs. Second, the User Feedback Modeler (UFM) employs an innovative group-aware reinforcement learning approach, where each user is guided by the associated group profiles during the reinforcement learning process, allowing UFM to robustly and deeply examine the reasons behind implicit feedback. To assess our G-UBS paradigm, we have constructed a Video Recommendation benchmark with Implicit Feedback (IF-VR). To the best of our knowledge, this is the first multi-modal benchmark for implicit feedback evaluation in video recommendation, encompassing 15k users, 25k videos, and 933k interaction records with implicit feedback. Extensive experiments on IF-VR demonstrate that G-UBS significantly outperforms mainstream LLMs and MLLMs, with a 4.0% higher proportion of videos achieving a play rate > 30% and 14.9% higher reasoning accuracy on IF-VR.

AAAI Conference 2026 Conference Paper

VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning

  • Zikang Wang
  • Boyu Chen
  • Zhengrong Yue
  • Yi Wang
  • Yu Qiao
  • Limin Wang
  • Yali Wang

Recent advances in video understanding have been driven by MLLMs. While these MLLMs are good at analyzing short videos, they have difficulty understanding videos with a longer context. To address this difficulty, several agent paradigms have recently been proposed, using MLLMs as agents to retrieve extra contextual knowledge from a long video. However, most existing agents ignore the key fact that a long video is composed of multiple shots; i.e., to answer a user question about a long video, it is critical to deeply understand its relevant shots, as a human would. Without such insight, these agents often retrieve redundant or even noisy temporal context, restricting their capacity for long video understanding. To fill this gap, we propose VideoChat-A1, a novel long video agent paradigm. Different from previous works, our VideoChat-A1 can deeply think with long videos via a distinct chain-of-shot reasoning paradigm. More specifically, it progressively selects the shots relevant to the user question and examines these shots in a coarse-to-fine partition. By multi-modal reasoning along the shot chain, VideoChat-A1 can effectively mimic the step-by-step human thinking process, allowing interactive discovery of preferable temporal context for thoughtful understanding of long videos. Extensive experiments show that VideoChat-A1 achieves state-of-the-art performance on mainstream long video QA benchmarks, e.g., 77.0 on VideoMME (w/ subs) and 70.1 on EgoSchema, outperforming its strong baselines (e.g., InternVL2.5-8B and InternVideo2.5-8B) by up to 10.1% and 6.2%. Compared to the leading closed-source GPT-4o and Gemini 1.5 Pro, VideoChat-A1 offers competitive accuracy with only 7% of the input frames and 12% of the inference time on average.

AAAI Conference 2026 Conference Paper

VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning

  • Siran Chen
  • Boyu Chen
  • Yuxiao Luo
  • Chenyun Yu
  • Yi Ouyang
  • Lei Cheng
  • Chengxiang Zhuo
  • Zang Li

Large language model (LLM) agents have emerged as a promising solution for enhancing recommendation systems via user simulation. However, existing studies predominantly resort to prompt-based simulation using frozen LLMs, which frequently results in suboptimal item modeling and user preference learning, thereby ultimately constraining recommendation performance. To address these challenges, we introduce VRAgent-R1, a novel agent-based paradigm that incorporates human-like intelligence in user simulation. Specifically, VRAgent-R1 comprises two distinct agents: the Item Perception (IP) Agent and the User Simulation (US) Agent, designed for interactive user-item modeling. Firstly, the IP Agent emulates human-like progressive thinking based on MLLMs, effectively capturing hidden recommendation semantics in videos. With a more comprehensive multimodal content understanding provided by the IP Agent, the video recommendation system is equipped to provide higher-quality candidate items. Subsequently, the US Agent refines the recommended video sets based on in-depth chain-of-thought (CoT) reasoning and achieves better alignment with real user preferences through reinforcement learning. Experimental results on the large-scale video recommendation benchmark MicroLens-100k demonstrate the effectiveness of our proposed VRAgent-R1 method, e.g., the IP Agent achieves a 6.0% improvement in NDCG@10, while the US Agent shows approximately 45.0% higher accuracy in user decision simulation compared to state-of-the-art baselines.

AAAI Conference 2026 Conference Paper

When Top-ranked Recommendations Fail: Modeling Multi-Granular Negative Feedback for Explainable and Robust Video Recommendation

  • Siran Chen
  • Boyu Chen
  • Chenyun Yu
  • Yi Ouyang
  • Lei Cheng
  • Chengxiang Zhuo
  • Zang Li
  • Yali Wang

Existing video recommendation systems, relying mainly on ID-based embedding mapping and collaborative filtering, often fail to capture in-depth video content semantics. Moreover, most struggle to address biased user behaviors (e.g., accidental clicks, fast skips), leading to inaccurate interest modeling and frequent negative feedback in top recommendations with unclear causes. To tackle this issue, we collect real-world user video-watching sequences, annotate the reasons for users' dislikes, and construct a benchmark dataset for personalized explanations. We then introduce the Agentic Explainable Negative Feedback (ENF) framework, which integrates three core components: (1) the Profile Agent, extracting behavioral cues from users' historical data to derive psychological and personality profiles; (2) the Video Agent, performing comprehensive multimodal video analysis; and (3) the Reason Agent, synthesizing information from the other two agents to predict user engagement and generate explanations. Additionally, we propose the S-GRPO algorithm, enabling the model to progressively address complex tasks during reinforcement fine-tuning. Experimental results on the collected dataset show that our method significantly outperforms state-of-the-art baselines in negative feedback prediction and reason explanation. Notably, it achieves an 8.6% improvement over GPT-4o in reason classification. Deployment on the business platform further validates its benefits: increasing average user watch time by 6.2%, reducing the fast-skip rate by 9.4%, and significantly enhancing user satisfaction.

AAAI Conference 2025 Conference Paper

H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving

  • Siran Chen
  • Yuxiao Luo
  • Yue Ma
  • Yu Qiao
  • Yali Wang

With the prevalence of Multimodal Large Language Models (MLLMs), autonomous driving has encountered new opportunities and challenges. In particular, multi-modal video understanding is critical for interactively analyzing what will happen during autonomous driving. However, videos in such dynamic scenes often contain complex spatial-temporal movements, which restricts the generalization capacity of existing MLLMs in this field. To bridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos. Specifically, our H-MBA consists of two distinct modules: Context Mamba (C-Mamba) and Query Mamba (Q-Mamba). First, C-Mamba contains various types of structured state space models, which can effectively capture multi-granularity video context for different temporal resolutions. Second, Q-Mamba flexibly transforms the current frame into a learnable query and attentively selects multi-granularity video context into the query. Consequently, it can adaptively integrate the video contexts of multi-scale temporal resolutions to enhance video understanding. Via a plug-and-play paradigm in MLLMs, our H-MBA shows remarkable performance on multi-modal video tasks in autonomous driving, e.g., for risk object detection, it outperforms the previous SOTA method with a 5.5% mIoU improvement.

AAAI Conference 2025 Conference Paper

Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

  • Yanbo Ding
  • Shaobin Zhuang
  • Kunchang Li
  • Zhengrong Yue
  • Yu Qiao
  • Yali Wang

Despite recent advancements in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in the 3D world. To tackle this limitation, we introduce a generic AI system, namely MUSES, for 3D-controllable image generation from user queries. Specifically, our MUSES develops a progressive workflow with three key components: (1) Layout Manager for 2D-to-3D layout lifting, (2) Model Engineer for 3D object acquisition and calibration, and (3) Image Artist for 3D-to-2D image rendering. By mimicking the collaboration of human professionals, this multi-modal agent pipeline facilitates the effective and automatic creation of images with 3D-controllable objects, through an explainable integration of top-down planning and bottom-up generation. Additionally, existing benchmarks lack detailed descriptions of complex 3D spatial relationships of multiple objects. To fill this gap, we further construct a new benchmark, T2I-3DisBench, which describes diverse 3D image scenes with 50 detailed prompts. Extensive experiments show the state-of-the-art performance of MUSES on both T2I-CompBench and T2I-3DisBench, outperforming recent strong competitors such as DALL-E 3 and Stable Diffusion 3. These results demonstrate a significant step forward for MUSES in bridging natural language, 2D image generation, and the 3D world.

NeurIPS Conference 2025 Conference Paper

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception

  • Ziang Yan
  • Yinan He
  • Xinhao Li
  • Zhengrong Yue
  • Xiangyu Zeng
  • Yali Wang
  • Yu Qiao
  • Limin Wang

Inducing reasoning in multimodal large language models (MLLMs) is critical for achieving human-level perception and understanding. Existing methods mainly leverage LLM reasoning to analyze parsed visuals, often limited by static perception stages. This paper introduces Visual Test-Time Scaling (VTTS), a novel approach to enhance MLLMs' reasoning via iterative perception during inference. VTTS mimics humans' hierarchical attention by progressively refining focus on high-confidence spatio-temporal regions, guided by updated textual predictions. Specifically, VTTS employs an Iterative Perception (ITP) mechanism, incorporating reinforcement learning with spatio-temporal supervision to optimize reasoning. To support this paradigm, we also present VTTS-80K, a dataset tailored for iterative perception. These designs allow an MLLM to enhance its performance by increasing its perceptual compute. Extensive experiments validate VTTS's effectiveness and generalization across diverse tasks and benchmarks. Our newly introduced VideoChat-R1.5 model has achieved remarkable improvements, with an average increase of over 5%, compared to robust baselines such as Qwen2.5VL-3B and -7B, across more than 15 benchmarks that encompass video conversation, video reasoning, and spatio-temporal perception.

AAAI Conference 2024 Conference Paper

M-BEV: Masked BEV Perception for Robust Autonomous Driving

  • Siran Chen
  • Yue Ma
  • Yu Qiao
  • Yali Wang

3D perception is a critical problem in autonomous driving. Recently, the Bird's-Eye-View (BEV) approach has attracted extensive attention, due to low-cost deployment and desirable vision detection capacity. However, existing models ignore a realistic scenario during the driving procedure, i.e., one or more view cameras may fail, which largely deteriorates their performance. To tackle this problem, we propose a generic Masked BEV (M-BEV) perception framework, which can effectively improve robustness to this challenging scenario by randomly masking and reconstructing camera views in end-to-end training. More specifically, we develop a novel Masked View Reconstruction (MVR) module in our M-BEV. It mimics various missing cases by randomly masking features of different camera views, then leverages the original features of these views as self-supervision and reconstructs the masked ones with the distinct spatio-temporal context across camera views. Via such a plug-and-play MVR, our M-BEV is capable of learning the missing views from the remaining ones, and thus generalizes well for robust view recovery and accurate perception at test time. We perform extensive experiments on the popular NuScenes benchmark, where our framework can significantly boost the 3D perception performance of state-of-the-art models on various missing-view cases, e.g., for the absence of the back view, our M-BEV promotes the PETRv2 model with a 10.3% mAP gain.
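The masked-view reconstruction signal can be illustrated numerically: when per-view features share a common scene latent, a masked view is predictable from the remaining ones. The sketch below uses synthetic features and a ridge-regression "decoder" in place of the paper's learned spatio-temporal MVR module; all shapes, names, and the data-generating process are illustrative assumptions, not M-BEV's implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
n_views, d_lat, d = 6, 4, 8
N = 500

# synthetic per-view features driven by a shared scene latent, so that views
# are mutually predictable -- the property masked-view reconstruction exploits
A = rng.normal(size=(n_views, d, d_lat))
z = rng.normal(size=(N, d_lat))
feats = np.einsum('vdl,nl->nvd', A, z) + 0.05 * rng.normal(size=(N, n_views, d))

masked = 2                                           # pretend one camera failed
X = np.delete(feats, masked, axis=1).reshape(N, -1)  # remaining views
Y = feats[:, masked, :]                              # self-supervision target

# ridge-regression "decoder" standing in for the learned reconstruction module
lam = 1e-3
W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
err = float(np.mean((X @ W - Y) ** 2) / np.mean(Y**2))
print(round(err, 4))
```

The small relative error confirms the premise: the masked view is largely recoverable from the spatial context of the remaining views.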

NeurIPS Conference 2024 Conference Paper

TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration

  • Yiwei Guo
  • Shaobin Zhuang
  • Kunchang Li
  • Yu Qiao
  • Yali Wang

Vision-language foundation models (such as CLIP) have recently shown their power in transfer learning, owing to large-scale image-text pre-training. However, target domain data in the downstream tasks can be highly different from the pre-training phase, which makes it hard for such a single model to generalize well. Alternatively, there exists a wide range of expert models that contain diversified vision and/or language knowledge pre-trained on different modalities, tasks, networks, and datasets. Unfortunately, these models are "isolated agents" with heterogeneous structures, and how to integrate their knowledge for generalizing CLIP-like models has not been fully explored. To bridge this gap, we propose a general and concise TransAgent framework, which transports the knowledge of the isolated agents in a unified manner, and effectively guides CLIP to generalize with multi-source knowledge distillation. With such a distinct framework, we flexibly collaborate with 11 heterogeneous agents to empower vision-language foundation models, without further cost in the inference phase. Finally, our TransAgent achieves state-of-the-art performance on 11 visual recognition datasets. Under the same low-shot setting, it outperforms the popular CoOp by around 10% on average, and by 20% on EuroSAT, which contains large domain shifts.
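The multi-source distillation signal at the core of this framework can be shown minimally: average several teacher ("agent") distributions and penalize the student's divergence from the mixture. TransAgent's actual agent gating and collaboration are richer; the temperature, shapes, and names below are an assumed toy setup, not the paper's training objective.

```python
import numpy as np

def softmax(x, t=1.0):
    # temperature-scaled softmax over the last axis
    x = np.asarray(x, dtype=float) / t
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # KL divergence between two discrete distributions
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(4)
n_agents, n_classes = 3, 5
teacher_logits = rng.normal(size=(n_agents, n_classes))  # the "agents"
student_logits = rng.normal(size=n_classes)

# distillation target: mixture of the softened teacher distributions
mixture = softmax(teacher_logits, t=2.0).mean(axis=0)
loss = kl(mixture, softmax(student_logits, t=2.0))
print(round(loss, 3))
```

Minimizing this loss pulls the student toward the consensus of the heterogeneous teachers while each teacher remains frozen, which is why no extra cost appears at inference time.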

AAAI Conference 2021 Conference Paper

PC-HMR: Pose Calibration for 3D Human Mesh Recovery from 2D Images/Videos

  • Tianyu Luan
  • Yali Wang
  • Junhao Zhang
  • Zhe Wang
  • Zhipeng Zhou
  • Yu Qiao

The end-to-end Human Mesh Recovery (HMR) approach (Kanazawa et al. 2018) has been successfully used for 3D body reconstruction. However, most HMR-based frameworks reconstruct the human body by directly learning mesh parameters from images or videos, while lacking explicit guidance from the 3D human pose in visual data. As a result, the generated mesh often exhibits incorrect poses for complex activities. To tackle this problem, we propose to exploit 3D pose to calibrate the human mesh. Specifically, we develop two novel Pose Calibration frameworks, i.e., Serial PC-HMR and Parallel PC-HMR. By coupling advanced 3D pose estimators and HMR in a serial or parallel manner, these two frameworks can effectively correct the human mesh with the guidance of a concise pose calibration module. Furthermore, since the calibration module is designed via non-rigid pose transformation, our PC-HMR frameworks can flexibly handle bone length variations to alleviate misplacement in the calibrated mesh. Finally, our frameworks are based on a generic and complementary integration of data-driven learning and geometrical modeling. Via plug-and-play modules, they can be efficiently adapted for both image- and video-based human mesh recovery. Additionally, they require no extra 3D pose annotations in the testing phase, which eases inference in practice. We perform extensive experiments on the popular benchmarks, i.e., Human3.6M, 3DPW and SURREAL, where our PC-HMR frameworks achieve the SOTA results.

AAAI Conference 2020 Conference Paper

Context-Transformer: Tackling Object Confusion for Few-Shot Detection

  • Ze Yang
  • Yali Wang
  • Xianyu Chen
  • Jianzhuang Liu
  • Yu Qiao

Few-shot object detection is a challenging but realistic scenario, where only a few annotated training images are available for training detectors. A popular approach to this problem is transfer learning, i.e., fine-tuning a detector pretrained on a source-domain benchmark. However, such a transferred detector often fails to recognize new objects in the target domain, due to the low data diversity of the training samples. To tackle this problem, we propose a novel Context-Transformer within a concise deep transfer framework. Specifically, Context-Transformer can effectively leverage source-domain object knowledge as guidance, and automatically exploit contexts from only a few training images in the target domain. Subsequently, it can adaptively integrate these relational clues to enhance the discriminative power of the detector, in order to reduce object confusion in few-shot scenarios. Moreover, Context-Transformer is flexibly embedded in the popular SSD-style detectors, which makes it a plug-and-play module for end-to-end few-shot learning. Finally, we evaluate Context-Transformer on the challenging settings of few-shot detection and incremental few-shot detection. The experimental results show that our framework outperforms recent state-of-the-art approaches.

AAAI Conference 2020 Conference Paper

Learning Attentive Pairwise Interaction for Fine-Grained Classification

  • Peiqin Zhuang
  • Yali Wang
  • Yu Qiao

Fine-grained classification is a challenging problem, due to subtle differences among highly confusable categories. Most approaches address this difficulty by learning a discriminative representation of an individual input image. On the other hand, humans can effectively identify contrastive clues by comparing image pairs. Inspired by this fact, this paper proposes a simple but effective Attentive Pairwise Interaction Network (API-Net), which can progressively recognize a pair of fine-grained images by interaction. Specifically, API-Net first learns a mutual feature vector to capture semantic differences in the input pair. It then compares this mutual vector with the individual vectors to generate gates for each input image. These distinct gate vectors inherit mutual context on semantic differences, which allows API-Net to attentively capture contrastive clues via pairwise interaction between the two images. Additionally, we train API-Net in an end-to-end manner with a score-ranking regularization, which can further generalize API-Net by taking feature priorities into account. We conduct extensive experiments on five popular benchmarks in fine-grained classification. API-Net outperforms the recent SOTA methods, achieving 90.0% on CUB-200-2011, 93.9% on Aircraft, 95.3% on Stanford Cars, 90.3% on Stanford Dogs, and 88.1% on NABirds.
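The mutual-vector-and-gates mechanism described above can be sketched in a few lines. This is a rough numpy rendering under assumed shapes, with a single linear map standing in for the paper's MLP; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pairwise_interaction(x1, x2, W):
    # mutual vector capturing the pair's shared/contrastive semantics
    xm = np.tanh(W @ np.concatenate([x1, x2]))
    # per-image gates from channel-wise comparison with the mutual vector
    g1, g2 = sigmoid(xm * x1), sigmoid(xm * x2)
    # attentive features: each image enhanced by its own gate and the other's
    return (x1 + g1 * x1, x1 + g2 * x1, x2 + g2 * x2, x2 + g1 * x2)

d = 8
x1, x2 = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(d, 2 * d)) / np.sqrt(2 * d)
feats = pairwise_interaction(x1, x2, W)
print([f.shape for f in feats])
```

Each image yields two gated views (self-gated and other-gated), which is what lets a classifier on top contrast the pair rather than score each image in isolation.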

AAAI Conference 2018 Conference Paper

LSTD: A Low-Shot Transfer Detector for Object Detection

  • Hao Chen
  • Yali Wang
  • Guoyou Wang
  • Yu Qiao

Recent advances in object detection are mainly driven by deep learning with large-scale detection benchmarks. However, the fully-annotated training set is often limited for a target detection task, which may deteriorate the performance of deep detectors. To address this challenge, we propose a novel low-shot transfer detector (LSTD) in this paper, where we leverage rich source-domain knowledge to construct an effective target-domain detector with very few training examples. The main contributions are described as follows. First, we design a flexible deep architecture of LSTD to alleviate transfer difficulties in low-shot detection. This architecture can integrate the advantages of both SSD and Faster RCNN in a unified deep framework. Second, we introduce a novel regularized transfer learning framework for low-shot detection, where the transfer knowledge (TK) and background depression (BD) regularizations are proposed to leverage object knowledge respectively from source and target domains, in order to further enhance fine-tuning with a few target images. Finally, we examine our LSTD on a number of challenging low-shot detection experiments, where LSTD outperforms other state-of-the-art approaches. The results demonstrate that LSTD is a preferable deep detector for low-shot scenarios.

AAAI Conference 2017 Conference Paper

Sparse Deep Transfer Learning for Convolutional Neural Network

  • Jiaming Liu
  • Yali Wang
  • Yu Qiao

Extensive studies have demonstrated that the representations of convolutional neural networks (CNN), which are learned from a large-scale data set in the source domain, can be effectively transferred to a new target domain. However, compared to the source domain, the target domain often has limited data in practice. In this case, overfitting may significantly depress transferability, due to the model redundancy of the intensive CNN structures. To deal with this difficulty, we propose a novel sparse deep transfer learning approach for CNN. There are three main contributions in this work. First, we introduce a Sparse-SourceNet to reduce the redundancy in the source domain. Second, we introduce a Hybrid-TransferNet to improve the generalization ability and the prediction accuracy of transfer learning, by taking advantage of both model sparsity and implicit knowledge. Third, we introduce a Sparse-TargetNet, where we prune our Hybrid-TransferNet to obtain a highly compact, source-knowledge-integrated CNN in the target domain. To examine the effectiveness of our methods, we perform our sparse deep transfer learning approach on a number of benchmark transfer learning tasks. The results show that, compared to the standard fine-tuning approach, our proposed approach achieves a significant pruning rate on CNN while improving the accuracy of transfer learning.

UAI Conference 2014 Conference Paper

Bayesian Filtering with Online Gaussian Process Latent Variable Models

  • Yali Wang
  • Marcus A. Brubaker
  • Brahim Chaib-draa
  • Raquel Urtasun

In this paper we present a novel non-parametric approach to Bayesian filtering, where the prediction and observation models are learned in an online fashion. Our approach is able to handle multimodal distributions over both models by employing a mixture model representation with Gaussian Processes (GP) based components. To cope with the increasing complexity of the estimation process, we explore two computationally efficient GP variants, sparse online GP and local GP, which help to manage computation requirements for each mixture component. Our experiments demonstrate that our approach can track human motion much more accurately than existing approaches that learn the prediction and observation models offline and do not update these models with the incoming data stream.

ICML Conference 2014 Conference Paper

Gaussian Processes for Bayesian Estimation in Ordinary Differential Equations

  • David Barber
  • Yali Wang

Bayesian parameter estimation in coupled ordinary differential equations (ODEs) is challenging due to the high computational cost of numerical integration. In gradient matching a separate data model is introduced with the property that its gradient can be calculated easily. Parameter estimation is achieved by requiring consistency between the gradients computed from the data model and those specified by the ODE. We propose a Gaussian process model that directly links state derivative information with system observations, simplifying previous approaches and providing a natural generative model.
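Gradient matching as summarized here can be illustrated on a toy ODE x' = -θx: fit an RBF-kernel GP to noisy trajectory data, differentiate its posterior mean analytically, and score candidate θ by consistency between the GP gradient and the ODE right-hand side. The kernel settings and grid search below are illustrative assumptions, not the paper's generative model.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 0.7
t = np.linspace(0.0, 4.0, 25)
y = np.exp(-theta_true * t) + 0.01 * rng.normal(size=t.size)  # noisy trajectory

ell, sf2, sn2 = 1.0, 1.0, 1e-4  # assumed kernel hyperparameters

def k(a, b):  # RBF kernel
    return sf2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def dk(a, b):  # derivative of the RBF kernel w.r.t. its first argument
    return -(a[:, None] - b[None, :]) / ell**2 * k(a, b)

alpha = np.linalg.solve(k(t, t) + sn2 * np.eye(t.size), y)
mean = k(t, t) @ alpha   # GP posterior mean at the observation times
grad = dk(t, t) @ alpha  # its analytic time derivative

def mismatch(theta):
    # discrepancy between the GP gradient and the ODE RHS f(x) = -theta * x
    return float(np.sum((grad + theta * mean) ** 2))

thetas = np.linspace(0.1, 1.5, 141)
theta_hat = float(thetas[np.argmin([mismatch(th) for th in thetas])])
print(theta_hat)
```

The estimate lands near the true θ without ever numerically integrating the ODE, which is the computational appeal of gradient matching.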

IJCAI Conference 2013 Conference Paper

A KNN Based Kalman Filter Gaussian Process Regression

  • Yali Wang
  • Brahim Chaib-draa

The standard Gaussian process (GP) regression is often intractable when a data set is large or spatially nonstationary. In this paper, we address these challenging data properties by designing a novel K nearest neighbor based Kalman filter Gaussian process (KNN-KFGP) regression. Based on a state space model established by the KNN driven data grouping, our KNN-KFGP recursively filters out the latent function values in a computationally efficient and accurate Kalman filtering framework. Moreover, KNN allows each test point to find its strongly correlated local training subset, so our KNN-KFGP provides a suitable way to deal with spatial nonstationary problems. We evaluate the performance of our KNN-KFGP on several synthetic and real data sets to show its validity.
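The KNN ingredient of this method can be isolated in a short sketch: predict each test point with an exact GP fit only to its K nearest training points. The Kalman-filtering state-space machinery of the full KNN-KFGP is omitted, and the hyperparameters and names below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
xtr = np.sort(rng.uniform(0.0, 10.0, 200))
ytr = np.sin(xtr) + 0.1 * rng.normal(size=xtr.size)  # noisy training data

def rbf(a, b, ell=1.0, sf2=1.0):
    return sf2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def knn_gp_predict(xs, k=20, sn2=0.01):
    preds = []
    for x in np.atleast_1d(xs):
        idx = np.argsort(np.abs(xtr - x))[:k]  # K nearest training points
        Xl, yl = xtr[idx], ytr[idx]
        K = rbf(Xl, Xl) + sn2 * np.eye(k)      # exact GP on the local subset
        preds.append(float(rbf(np.array([x]), Xl) @ np.linalg.solve(K, yl)))
    return np.array(preds)

xte = np.array([1.0, 3.0, 7.5])
print(np.round(knn_gp_predict(xte), 2))
```

Each prediction costs O(k^3) instead of O(n^3), and because the local subset adapts to the test location, the model can track spatially varying behavior that a single global GP would smooth over.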

NeurIPS Conference 2012 Conference Paper

A Marginalized Particle Gaussian Process Regression

  • Yali Wang
  • Brahim Chaib-draa

We present a novel marginalized particle Gaussian process (MPGP) regression, which provides a fast, accurate online Bayesian filtering framework to model the latent function. Using a state space model established by the data construction procedure, our MPGP recursively filters out the estimation of hidden function values by a Gaussian mixture. Meanwhile, it provides a new online method for training hyperparameters with a number of weighted particles. We demonstrate the estimation performance of our MPGP on both simulated and real large data sets. The results show that our MPGP is a robust estimation algorithm with high computational efficiency, which outperforms other state-of-the-art sparse GP methods.

ICRA Conference 2012 Conference Paper

An adaptive nonparametric particle filter for state estimation

  • Yali Wang
  • Brahim Chaib-draa

The particle filter is one of the most widely applied stochastic sampling tools for state estimation problems in practice. However, the proposal distribution in the traditional particle filter is the transition probability based on the state equation, which can heavily degrade estimation performance because the samples are drawn blindly, without considering the current observation. Additionally, the fixed particle number in the typical particle filter leads to wasteful computation, especially when the posterior distribution varies greatly over time. In this paper, an advanced adaptive nonparametric particle filter is proposed by incorporating a Gaussian process based proposal distribution into the KLD-Sampling particle filter framework, so that high-quality particles, whose number is adaptively determined by the KLD criterion, are drawn at each time step from a proposal learned with observation information, improving approximation accuracy and efficiency. Our state estimation experiments on a univariate nonstationary growth model and a two-link robot arm show that the adaptive nonparametric particle filter outperforms existing approaches with fewer particles.
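The KLD-Sampling side of this construction has a closed-form particle-count bound (Fox's KLD-sampling): given the number k of histogram bins occupied by the current particle set, it gives the number of samples needed to keep the KL divergence between the sample-based and true posterior below ε with a chosen confidence. The defaults for ε and the normal quantile below are assumed for illustration.

```python
import math

def kld_particle_bound(k, eps=0.05, z=1.645):
    """Particles needed so KL(sample || true) <= eps with ~95% confidence.

    k: number of histogram bins occupied by the current particle set;
    z: upper quantile of the standard normal for the confidence level.
    """
    if k < 2:
        return 1
    a = 2.0 / (9.0 * (k - 1))
    return int(math.ceil((k - 1) / (2.0 * eps) * (1.0 - a + math.sqrt(a) * z) ** 3))

# a spread-out posterior (many occupied bins) demands more particles than a
# concentrated one -- the adaptivity described in the abstract
print(kld_particle_bound(2), kld_particle_bound(50))
```

This is what lets the filter shrink its particle set when the posterior is tightly concentrated and grow it again when the posterior spreads out.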