Arrow Research search

Author name cluster

Sanping Zhou

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers (16)

AAAI Conference 2026 · Conference Paper

TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

  • Canhui Tang
  • Zifan Han
  • Hongbo Sun
  • Sanping Zhou
  • Xuchong Zhang
  • Xin Wei
  • Ye Yuan
  • Huayu Zhang

Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs' context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. However, building a trainable sampling method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling in Video-MLLMs. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization for the temporal sampling policy. Furthermore, we propose a dual-style long video training data construction pipeline, balancing comprehensive temporal understanding and key segment localization. Finally, we incorporate rule-based answering accuracy and temporal locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks and transfers well across different cutting-edge Video-MLLMs.
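
A minimal sketch of the probabilistic keyframe selection idea, assuming per-frame embeddings and a query embedding; the scoring rule, tensor shapes, and the REINFORCE-style surrogate loss are illustrative assumptions, not the paper's actual TSPO implementation.

```python
import torch
import torch.nn.functional as F

def sample_keyframes(frame_feats, query_feat, k=8):
    """Score frames against the query and sample k keyframes.

    frame_feats: (T, D) per-frame embeddings; query_feat: (D,).
    Returns sampled indices plus a log-probability term that a
    reward (e.g., rule-based answer accuracy) can push through a
    policy-gradient update.
    """
    scores = frame_feats @ query_feat          # (T,) event-query correlation
    probs = F.softmax(scores, dim=0)
    idx = torch.multinomial(probs, k, replacement=False)
    # Approximate log-prob of the sampled set (ignores the
    # without-replacement correction; this is only a sketch).
    log_prob = torch.log(probs[idx] + 1e-9).sum()
    return idx.sort().values, log_prob

frames, query = torch.randn(256, 64), torch.randn(64)
idx, logp = sample_keyframes(frames, query)
reward = 1.0                                   # stand-in for answer correctness
loss = -reward * logp                          # REINFORCE-style surrogate
```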

AAAI Conference 2025 · Conference Paper

Diversifying Query: Region-Guided Transformer for Temporal Sentence Grounding

  • Xiaolong Sun
  • Liushuai Shi
  • Le Wang
  • Sanping Zhou
  • Kun Xia
  • Yabing Wang
  • Gang Hua

Temporal sentence grounding is a challenging task that aims to localize the moment spans relevant to a language description. Although recent DETR-based models have achieved notable progress by leveraging multiple learnable moment queries, they suffer from overlapped and redundant proposals, leading to inaccurate predictions. We attribute this limitation to the lack of task-related guidance for the learnable queries to serve a specific mode. Furthermore, the complex solution space generated by variable and open-vocabulary language descriptions complicates optimization, making it harder for learnable queries to adaptively distinguish themselves from one another, leading to more severely overlapped proposals. To address these issues, we present the Region-Guided TRansformer (RGTR) for temporal sentence grounding, which introduces regional guidance to increase query diversity and eliminate overlapped proposals. Instead of using learnable queries, RGTR adopts a set of anchor pairs as moment queries to introduce explicit regional guidance. Each moment query takes charge of moment prediction for a specific temporal region, which reduces the optimization difficulty and ensures the diversity of the proposals. In addition, we design an IoU-aware scoring head to improve proposal quality. Extensive experiments demonstrate the effectiveness of RGTR, outperforming state-of-the-art methods on three public benchmarks and exhibiting good generalization and robustness on out-of-distribution splits.
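
To make the anchor-pair idea concrete, here is a hedged sketch that tiles (center, width) moment queries over the normalized timeline and computes the 1D IoU an IoU-aware scoring head could be trained against; the anchor layout and function names are assumptions, not RGTR's code.

```python
import torch

def make_anchor_pairs(num_regions=10, widths=(0.1, 0.3, 0.5)):
    """Tile (center, width) anchor pairs over the normalized timeline,
    so each moment query is tied to a specific temporal region."""
    centers = (torch.arange(num_regions) + 0.5) / num_regions
    pairs = [(c.item(), w) for c in centers for w in widths]
    return torch.tensor(pairs)                 # (num_regions * len(widths), 2)

def temporal_iou(pred, gt):
    """1D IoU between [start, end] spans, a natural IoU-aware scoring target."""
    inter = (torch.min(pred[..., 1], gt[..., 1])
             - torch.max(pred[..., 0], gt[..., 0])).clamp(min=0)
    union = (pred[..., 1] - pred[..., 0]) + (gt[..., 1] - gt[..., 0]) - inter
    return inter / union.clamp(min=1e-6)

anchors = make_anchor_pairs()
spans = torch.stack([anchors[:, 0] - anchors[:, 1] / 2,
                     anchors[:, 0] + anchors[:, 1] / 2], dim=-1).clamp(0, 1)
print(temporal_iou(spans, torch.tensor([0.2, 0.4])))
```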

NeurIPS Conference 2025 · Conference Paper

DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation

  • Jingyi Tian
  • Le Wang
  • Sanping Zhou
  • Sen Wang
  • Gang Hua

Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
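
As a rough illustration of querying a triplane representation (not the paper's code), the sketch below projects 3D points onto the three axis-aligned planes, bilinearly samples each feature map, and sums the results; the shapes and the downstream rendering/prediction heads are assumptions.

```python
import torch
import torch.nn.functional as F

def query_triplane(planes, pts):
    """Sample triplane features at 3D points in [-1, 1]^3.

    planes: (3, C, H, W) feature maps for the XY, XZ, YZ planes.
    pts:    (N, 3) query points. Returns (N, C) summed features.
    """
    coords = [pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]]
    feats = 0
    for plane, uv in zip(planes, coords):
        grid = uv.view(1, -1, 1, 2)                      # (1, N, 1, 2)
        f = F.grid_sample(plane[None], grid, align_corners=True)
        feats = feats + f[0, :, :, 0].t()                # (N, C)
    return feats

planes = torch.randn(3, 32, 64, 64)
pts = torch.rand(1024, 3) * 2 - 1
print(query_triplane(planes, pts).shape)                 # torch.Size([1024, 32])
```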

AAAI Conference 2025 · Conference Paper

RefDetector: A Simple Yet Effective Matching-based Method for Referring Expression Comprehension

  • Yabing Wang
  • Zhuotao Tian
  • Zheng Qin
  • Sanping Zhou
  • Le Wang

Despite the rapid and substantial advancements in object detection, it continues to face limitations imposed by pre-defined category sets. Current methods for visual grounding primarily focus on how to better leverage the visual backbone to generate text-tailored visual features, which may require adjusting the parameters of the entire model. Besides, some early methods, i.e., matching-based methods, build upon and extend the functionality of existing object detectors by enabling them to localize an object based on free-form linguistic expressions, which gives them good application potential. However, the potential of the matching-based approach remains underexplored. In this paper, we first analyze the limitations of current matching-based methods (i.e., the mismatch problem and complicated fusion mechanisms), and then present a simple yet effective matching-based method, namely RefDetector. To tackle the above issues, we devise a simple heuristic rule to generate proposals with improved referent recall. Additionally, we introduce a straightforward vision-language interaction module that eliminates the need for intricate manually-designed mechanisms. Moreover, we explore visual grounding based on the modern detector DETR and achieve significant performance improvement. Extensive experiments on three REC benchmark datasets, i.e., RefCOCO, RefCOCO+, and RefCOCOg, validate the effectiveness of the proposed method.
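
A minimal sketch of the matching-based core: score a frozen detector's region features against a sentence embedding and return the best box. The cosine-similarity scoring and the names are illustrative assumptions, not RefDetector's interaction module.

```python
import torch
import torch.nn.functional as F

def match_proposals(box_feats, expr_feat):
    """Pick the proposal whose visual feature best matches the
    referring expression — the essence of a matching-based
    referring expression comprehension pipeline.

    box_feats: (N, D) region features from a frozen detector.
    expr_feat: (D,) sentence embedding of the expression.
    """
    sim = F.cosine_similarity(box_feats, expr_feat[None], dim=-1)  # (N,)
    return sim.argmax(), sim

box_feats = torch.randn(100, 256)       # e.g., outputs of DETR object queries
expr_feat = torch.randn(256)
best, scores = match_proposals(box_feats, expr_feat)
```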

AAAI Conference 2025 · Conference Paper

REGNav: Room Expert Guided Image-Goal Navigation

  • Pengna Li
  • Kangyi Wu
  • Jingwen Fu
  • Sanping Zhou

Image-goal navigation aims to steer an agent towards the goal location specified by an image. Most prior methods tackle this task by learning a navigation policy, which extracts visual features of goal and observation images, compares their similarity and predicts actions. However, if the agent is in a different room from the goal image, it is extremely challenging to identify their similarity and infer the likely goal location, which may result in the agent wandering around. Intuitively, when humans carry out this task, they may roughly compare the current observation with the goal image, forming an approximate notion of whether they are in the same room before executing the actions. Inspired by this intuition, we try to imitate human behaviour and propose a Room Expert Guided Image-Goal Navigation model (REGNav) to equip the agent with the ability to analyze whether goal and observation images are taken in the same room. Specifically, we first pre-train a room expert with an unsupervised learning technique on self-collected unlabelled room images. The expert can extract the hidden room style information of goal and observation images and predict whether they belong to the same room. In addition, two different fusion approaches are explored to efficiently guide the agent's navigation with the room relation knowledge. Extensive experiments show that our REGNav surpasses prior state-of-the-art works on three popular benchmarks.
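
A hedged sketch of what such a "room expert" head might look like: a small classifier over a pair of style embeddings predicting a same-room probability. The architecture and dimensions are assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class RoomExpert(nn.Module):
    """Predict whether two images were taken in the same room from
    their style embeddings (e.g., from an unsupervised encoder)."""
    def __init__(self, dim=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 1))

    def forward(self, goal_emb, obs_emb):
        logit = self.head(torch.cat([goal_emb, obs_emb], dim=-1))
        return torch.sigmoid(logit)            # P(same room)

expert = RoomExpert()
p = expert(torch.randn(4, 128), torch.randn(4, 128))   # (4, 1)
```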

NeurIPS Conference 2025 · Conference Paper

SAMPO: Scale-wise Autoregression with Motion Prompt for Generative World Models

  • Sen Wang
  • Jingyi Tian
  • Le Wang
  • Zhimin Liao
  • Huaiyi Dong
  • Kun Xia
  • Sanping Zhou
  • Wei Tang

World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose Scale-wise Autoregression with Motion PrOmpt (SAMPO), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4× faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.
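
The scale-wise decoding loop, in caricature: tokens within a scale are produced in parallel while scales (and frames) are produced sequentially. The stand-in DummyScaleModel, scale sizes, and shapes are assumptions purely for illustration.

```python
import torch

class DummyScaleModel:
    """Stand-in for the scale predictor: returns a random token grid.
    A real model would attend causally over previous frames and
    bidirectionally within the current scale."""
    def predict_scale(self, prev_frames, coarser_tokens, size, dim=16):
        return torch.randn(size, size, dim)

def scalewise_decode(model, prev_frames, scales=(4, 8, 16)):
    """Decode one frame coarse-to-fine: tokens within a scale are produced
    in parallel, scales are produced sequentially, frames stay causal."""
    tokens = []
    for s in scales:
        tokens.append(model.predict_scale(prev_frames, tokens, size=s))
    return tokens                              # multi-scale tokens for this frame

frame_tokens = scalewise_decode(DummyScaleModel(), prev_frames=[])
print([t.shape for t in frame_tokens])
```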

NeurIPS Conference 2024 · Conference Paper

Molecule Design by Latent Prompt Transformer

  • Deqian Kong
  • Yuhao Huang
  • Jianwen Xie
  • Edouardo Honig
  • Ming Xu
  • Shuanghong Xue
  • Pei Lin
  • Sanping Zhou

This work explores the challenging problem of molecule design by framing it as a conditional generative modeling task, where target biological properties or desired chemical constraints serve as conditioning variables. We propose the Latent Prompt Transformer (LPT), a novel generative model comprising three components: (1) a latent vector with a learnable prior distribution modeled by a neural transformation of Gaussian white noise; (2) a molecule generation model based on a causal Transformer, which uses the latent vector as a prompt; and (3) a property prediction model that predicts a molecule's target properties and/or constraint values using the latent prompt. LPT can be learned by maximum likelihood estimation on molecule-property pairs. During property optimization, the latent prompt is inferred from target properties and constraints through posterior sampling and then used to guide the autoregressive molecule generation. After initial training on existing molecules and their properties, we adopt an online learning algorithm to progressively shift the model distribution towards regions that support desired target properties. Experiments demonstrate that LPT not only effectively discovers useful molecules across single-objective, multi-objective, and structure-constrained optimization tasks, but also exhibits strong sample efficiency.
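
A minimal sketch of the latent-prompt idea: transform Gaussian white noise through a learnable network into prompt tokens that would be prepended to the causal molecule decoder. Sizes and names are illustrative assumptions, not LPT's configuration.

```python
import torch
import torch.nn as nn

class LatentPrompt(nn.Module):
    """Map Gaussian noise through a learnable transformation into prompt
    tokens for a causal decoder — the latent-prompt idea in miniature."""
    def __init__(self, noise_dim=64, n_tokens=4, d_model=128):
        super().__init__()
        self.noise_dim, self.n_tokens, self.d_model = noise_dim, n_tokens, d_model
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 256), nn.GELU(),
            nn.Linear(256, n_tokens * d_model))

    def forward(self, batch):
        z0 = torch.randn(batch, self.noise_dim)          # Gaussian white noise
        return self.net(z0).view(batch, self.n_tokens, self.d_model)

prompt = LatentPrompt()(batch=2)   # (2, 4, 128): prepend to molecule tokens
```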

NeurIPS Conference 2024 · Conference Paper

Referencing Where to Focus: Improving Visual Grounding with Referential Query

  • Yabing Wang
  • Zhuotao Tian
  • Qingpei Guo
  • Zheng Qin
  • Sanping Zhou
  • Ming Yang
  • Le Wang

Visual Grounding aims to localize the referring object in an image given a natural language expression. Recent advancements in DETR-based visual grounding methods have attracted considerable attention, as they directly predict the coordinates of the target object without relying on additional efforts, such as pre-generated proposal candidates or pre-defined anchor boxes. However, existing research primarily focuses on designing stronger multi-modal decoders, which typically generate learnable queries by random initialization or by using linguistic embeddings. This vanilla query generation approach inevitably increases the learning difficulty for the model, as it does not involve any target-related information at the beginning of decoding. Furthermore, these methods only use the deepest image feature during the query learning process, overlooking the importance of features from other levels. To address these issues, we propose a novel approach, called RefFormer. It consists of a query adaption module, which can be seamlessly integrated into CLIP and generates the referential query to provide prior context for the decoder, along with a task-specific decoder. By incorporating the referential query into the decoder, we can effectively mitigate its learning difficulty and accurately concentrate on the target object. Additionally, our proposed query adaption module can also act as an adapter, preserving the rich knowledge within CLIP without the need to tune the parameters of the backbone network. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method, outperforming state-of-the-art approaches on five visual grounding benchmarks.
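
As a rough sketch of a referential query (not the paper's query adaption module), a learnable seed attends over text-conditioned, multi-level image features to produce a target-aware query for the decoder; the layer choices and the additive text conditioning are assumptions.

```python
import torch
import torch.nn as nn

class QueryAdapter(nn.Module):
    """Form a target-aware ('referential') query by letting a learnable
    seed attend to text-conditioned image features from several backbone
    levels, instead of starting decoding from a random query."""
    def __init__(self, dim=256):
        super().__init__()
        self.seed = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, image_feats, text_feat):
        # image_feats: list of (B, N_l, D) levels; text_feat: (B, D)
        ctx = torch.cat(image_feats, dim=1) + text_feat[:, None, :]
        q = self.seed.expand(image_feats[0].size(0), -1, -1)
        ref_query, _ = self.attn(q, ctx, ctx)
        return ref_query                      # (B, 1, D) prior for the decoder

adapter = QueryAdapter()
feats = [torch.randn(2, 100, 256) for _ in range(3)]
print(adapter(feats, torch.randn(2, 256)).shape)
```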

AAAI Conference 2024 · Conference Paper

Temporal Correlation Vision Transformer for Video Person Re-Identification

  • Pengfei Wu
  • Le Wang
  • Sanping Zhou
  • Gang Hua
  • Changyin Sun

Video Person Re-Identification (Re-ID) is a task of retrieving persons from multi-camera surveillance systems. Despite the progress made in leveraging spatio-temporal information in videos, occlusion in dense crowds still hinders further progress. To address this issue, we propose a Temporal Correlation Vision Transformer (TCViT) for video person Re-ID. TCViT consists of a Temporal Correlation Attention (TCA) module and a Learnable Temporal Aggregation (LTA) module. The TCA module is designed to reduce the impact of non-target persons by relative state, while the LTA module is used to aggregate frame-level features based on their completeness. Specifically, TCA is a parameter-free module that first aligns frame-level features to restore semantic coherence in videos and then enhances the features of the target person according to temporal correlation. Additionally, unlike previous methods that treat each frame equally with a pooling layer, LTA introduces a lightweight learnable module to weigh and aggregate frame-level features under the guidance of a classification score. Extensive experiments on four prevalent benchmarks demonstrate that our method achieves state-of-the-art performance in video Re-ID.
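
A hedged sketch of learnable temporal aggregation in the spirit of LTA: weight frame features by a confidence derived from a per-frame classification score instead of pooling uniformly. The confidence measure and dimensions are assumptions, not TCViT's module.

```python
import torch
import torch.nn as nn

class LearnableTemporalAggregation(nn.Module):
    """Aggregate frame-level features with weights derived from a
    per-frame classification score, rather than uniform average pooling."""
    def __init__(self, dim=768, num_ids=100):
        super().__init__()
        self.classifier = nn.Linear(dim, num_ids)

    def forward(self, frame_feats):                       # (B, T, D)
        logits = self.classifier(frame_feats)             # (B, T, num_ids)
        conf = logits.softmax(-1).amax(-1)                # per-frame confidence
        w = conf.softmax(dim=1).unsqueeze(-1)             # (B, T, 1)
        return (w * frame_feats).sum(dim=1)               # (B, D) clip feature

lta = LearnableTemporalAggregation()
clip_feat = lta(torch.randn(2, 8, 768))
```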

IROS Conference 2024 · Conference Paper

Vehicle Trajectory Prediction with Soft Behavior Constraints

  • Ke Ye
  • Sanping Zhou
  • Miao Kang
  • Jingwen Fu
  • Nanning Zheng 0001

Trajectory prediction plays a crucial role in autonomous driving, but it is challenging due to the multi-modal nature of future trajectories. Behavior information is frequently employed to capture more diverse modalities of future trajectories. Traditional behavior information is typically hard-encoded, which is often inaccurate and inadequate for reflecting future multimodality. Therefore, we introduce the concept of soft vehicle behavior, which is represented as a probability distribution over a predefined comprehensive set of behaviors. This approach allows for a more rational depiction of vehicle behavior and captures potential future driving modalities. Building on this, we propose a new soft-behavior-constrained vehicle trajectory prediction framework. The framework consists of a backbone and a lightweight, plug-and-play behavior prediction module, which imposes soft behavior constraints to assist representation learning. We integrated the behavior prediction module into five representative trajectory predictors and achieved improvements of at least 4.2% in minFDE (K=5) on the nuScenes dataset and 0.5% in minFDE (K=6) on the Argoverse 1 motion forecasting dataset. These consistent gains demonstrate the effectiveness and generalizability of soft behavior constraints in vehicle trajectory prediction.
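
A minimal sketch of the soft-behavior idea: a lightweight head maps a trajectory embedding to a probability distribution over a predefined behavior set and is supervised with a KL term. The behavior set, the loss form, and the shapes are assumptions, not the paper's module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BEHAVIORS = ["keep_lane", "turn_left", "turn_right", "accelerate", "brake"]

class BehaviorHead(nn.Module):
    """Lightweight plug-and-play head: map a trajectory embedding to a
    distribution over a predefined behavior set ('soft behavior')."""
    def __init__(self, dim=128):
        super().__init__()
        self.fc = nn.Linear(dim, len(BEHAVIORS))

    def forward(self, emb):
        return F.log_softmax(self.fc(emb), dim=-1)

head = BehaviorHead()
log_p = head(torch.randn(4, 128))
# Stand-in soft target; in practice it would come from annotated or
# derived behavior distributions.
soft_target = torch.full((4, len(BEHAVIORS)), 1 / len(BEHAVIORS))
aux_loss = F.kl_div(log_p, soft_target, reduction="batchmean")
```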

AAAI Conference 2024 · Conference Paper

Voxel or Pillar: Exploring Efficient Point Cloud Representation for 3D Object Detection

  • Yuhao Huang
  • Sanping Zhou
  • Junjie Zhang
  • Jinpeng Dong
  • Nanning Zheng

Efficient representation of point clouds is fundamental for LiDAR-based 3D object detection. While recent grid-based detectors often encode point clouds into either voxels or pillars, the distinctions between these approaches remain underexplored. In this paper, we quantify the differences between the current encoding paradigms and highlight the limited vertical learning within. To tackle these limitations, we propose a hybrid detection framework named Voxel-Pillar Fusion (VPF), which synergistically combines the unique strengths of both voxels and pillars. To be concrete, we first develop a sparse voxel-pillar encoder that encodes point clouds into voxel and pillar features through 3D and 2D sparse convolutions respectively, and then introduce the Sparse Fusion Layer (SFL), facilitating bidirectional interaction between sparse voxel and pillar features. Our computationally efficient, fully sparse method can be seamlessly integrated into both dense and sparse detectors. Leveraging this powerful yet straightforward representation, VPF delivers competitive performance, achieving real-time inference speeds on the nuScenes and Waymo Open Dataset.
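
To illustrate the pillar half of such a hybrid encoder, here is a dense scatter-mean of point features into a BEV grid (the paper's encoders use sparse 2D/3D convolutions); the grid size and spatial extent are assumptions.

```python
import torch

def pillarize(points, feats, grid=(128, 128), extent=50.0):
    """Scatter-mean point features into a 2D BEV pillar grid — the
    'pillar' encoding in dense miniature.

    points: (N, 3) xyz in meters within [-extent, extent];
    feats:  (N, C) per-point features.
    """
    H, W = grid
    ix = ((points[:, 0] / extent + 1) / 2 * W).long().clamp(0, W - 1)
    iy = ((points[:, 1] / extent + 1) / 2 * H).long().clamp(0, H - 1)
    flat = iy * W + ix
    C = feats.size(1)
    acc = torch.zeros(H * W, C).index_add_(0, flat, feats)
    cnt = torch.zeros(H * W).index_add_(0, flat, torch.ones(len(points)))
    bev = acc / cnt.clamp(min=1).unsqueeze(1)
    return bev.view(H, W, C)

pts = (torch.rand(2048, 3) - 0.5) * 100
bev = pillarize(pts, torch.randn(2048, 16))
```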

AAAI Conference 2023 · Conference Paper

Multi-Stream Representation Learning for Pedestrian Trajectory Prediction

  • Yuxuan Wu
  • Le Wang
  • Sanping Zhou
  • Jinghai Duan
  • Gang Hua
  • Wei Tang

Forecasting the future trajectory of pedestrians is an important task in computer vision with a range of applications, from security cameras to autonomous driving. It is very challenging because pedestrians not only move individually across time but also interact spatially, and the spatial and temporal information is deeply coupled in a multi-agent scenario. Learning such complex spatio-temporal correlation is a fundamental issue in pedestrian trajectory prediction. Inspired by the procedure by which the hippocampus processes and integrates spatio-temporal information to form memories, we propose a novel multi-stream representation learning module to learn complex spatio-temporal features of pedestrian trajectory. Specifically, we learn temporal, spatial and cross spatio-temporal correlation features in three respective pathways and then adaptively integrate these features with learnable weights by a gated network. Besides, we leverage a sparse attention gate to select informative interactions and correlations brought by complex spatio-temporal modeling and to reduce the complexity of our model. We evaluate our proposed method on two commonly used datasets, i.e., ETH-UCY and SDD, and the experimental results demonstrate that our method achieves state-of-the-art performance. Code: https://github.com/YuxuanIAIR/MSRL-master
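
A minimal sketch of gated multi-stream fusion, assuming three (B, D) stream features; the single-linear-layer gate is purely illustrative, not the paper's gated network.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Adaptively fuse temporal, spatial, and cross spatio-temporal
    stream features with weights produced by a small gating network."""
    def __init__(self, dim=64, streams=3):
        super().__init__()
        self.gate = nn.Linear(streams * dim, streams)

    def forward(self, feats):                 # list of (B, D) stream features
        stacked = torch.stack(feats, dim=1)   # (B, S, D)
        w = self.gate(torch.cat(feats, dim=-1)).softmax(-1)  # (B, S)
        return (w.unsqueeze(-1) * stacked).sum(dim=1)        # (B, D)

fusion = GatedFusion()
out = fusion([torch.randn(8, 64) for _ in range(3)])
```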

AAAI Conference 2022 · Conference Paper

Complementary Attention Gated Network for Pedestrian Trajectory Prediction

  • Jinghai Duan
  • Le Wang
  • Chengjiang Long
  • Sanping Zhou
  • Fang Zheng
  • Liushuai Shi
  • Gang Hua

Pedestrian trajectory prediction is crucial in many practical applications due to the diversity of pedestrian movements, such as social interactions and individual motion behaviors. With similar observable trajectories and social environments, different pedestrians may make completely different future decisions. However, most existing methods only focus on the frequent mode of the trajectory and thus struggle to generalize to peculiar scenarios, which degrades their multimodal fitting ability when facing similar scenarios. In this paper, we propose a complementary attention gated network (CAGN) for pedestrian trajectory prediction, in which a dual-path architecture including normal and inverse attention is proposed to capture both frequent and peculiar modes in spatial and temporal patterns, respectively. Specifically, a complementary block is proposed to guide normal and inverse attention, which are then summed with learnable weights by a gated network to obtain attention features. Finally, multiple trajectory distributions are estimated based on the fused spatio-temporal attention features due to the multimodality of future trajectory. Experimental results on benchmark datasets, i.e., ETH and UCY, demonstrate that our method outperforms state-of-the-art methods by 13.8% in Average Displacement Error (ADE) and 10.4% in Final Displacement Error (FDE). Code will be available at https://github.com/jinghaiD/CAGN
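
A hedged sketch of complementary (normal + inverse) attention blended by a learnable gate; the renormalized complement and the scalar gate are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class ComplementaryAttention(nn.Module):
    """Dual-path attention: the normal path emphasizes frequent patterns,
    the inverse path (complement of the normal weights) emphasizes
    peculiar ones; a learnable gate blends the two outputs."""
    def __init__(self, dim=64):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, q, k, v):               # (B, N, D) each, N > 1
        attn = (q @ k.transpose(1, 2)) / q.size(-1) ** 0.5
        normal = attn.softmax(dim=-1)
        inverse = 1.0 - normal
        inverse = inverse / inverse.sum(dim=-1, keepdim=True)  # renormalize
        g = torch.sigmoid(self.alpha)
        return g * (normal @ v) + (1 - g) * (inverse @ v)

ca = ComplementaryAttention()
x = torch.randn(2, 10, 64)
out = ca(x, x, x)
```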

IROS Conference 2022 · Conference Paper

Pedestrian Intention Prediction Based on Traffic-Aware Scene Graph Model

  • Xingchen Song
  • Miao Kang
  • Sanping Zhou
  • Jianji Wang 0001
  • Yishu Mao 0003
  • Nanning Zheng 0001

Anticipating the future behavior of pedestrians is a crucial part of deploying Automated Driving Systems (ADS) in urban traffic scenarios. Most recent works utilize a convolutional neural network (CNN) to extract visual information, which is then input to a recurrent neural network (RNN) along with pedestrian-specific features like location and speed to obtain temporal features. However, the majority of these approaches lack the ability to parse the relationships of the related objects in the specific traffic scene, and therefore omit both the interactions among pedestrians and the interactions between pedestrians and the traffic. For this purpose, we propose a graph-structured model which can uncover pedestrians' dynamic constraints by constructing a traffic-aware scene graph within each frame. In addition, to capture pedestrian movement more effectively, we also introduce a temporal feature representation model, which first uses inter-frame and intra-frame GRU (II-GRU) to mine inter-frame and intra-frame information together, and then employs a novel attention mechanism to adaptively generate attention weights. Extensive experiments on the JAAD and PIE datasets show that our proposed model achieves and improves upon state-of-the-art performance.
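
As a rough sketch of the II-GRU idea: an intra-frame GRU summarizes the objects within each frame, and an inter-frame GRU then runs over the per-frame summaries. The shapes and the single-layer setup are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class IIGRU(nn.Module):
    """Intra-frame GRU summarizes the objects within each frame;
    an inter-frame GRU then models the sequence of frame summaries."""
    def __init__(self, dim=64):
        super().__init__()
        self.intra = nn.GRU(dim, dim, batch_first=True)
        self.inter = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x):                     # (B, T, N, D): T frames, N objects
        B, T, N, D = x.shape
        _, h = self.intra(x.reshape(B * T, N, D))
        frame_summ = h[-1].view(B, T, D)      # per-frame summary
        out, _ = self.inter(frame_summ)
        return out                            # (B, T, D) temporal features

m = IIGRU()
print(m(torch.randn(2, 5, 7, 64)).shape)     # torch.Size([2, 5, 64])
```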

AAAI Conference 2022 · Conference Paper

Social Interpretable Tree for Pedestrian Trajectory Prediction

  • Liushuai Shi
  • Le Wang
  • Chengjiang Long
  • Sanping Zhou
  • Fang Zheng
  • Nanning Zheng
  • Gang Hua

Understanding the multiple socially-acceptable future behaviors is an essential task for many vision applications. In this paper, we propose a tree-based method, termed Social Interpretable Tree (SIT), to address this multi-modal prediction task, where a hand-crafted tree is built depending on the prior information of observed trajectory to model multiple future trajectories. Specifically, a path in the tree from the root to a leaf represents an individual possible future trajectory. SIT employs a coarse-to-fine optimization strategy, in which the tree is first built by high-order velocity to balance the complexity and coverage of the tree and then optimized greedily to encourage multimodality. Finally, a teacher-forcing refining operation is used to predict the final fine trajectory. Compared with prior methods which leverage implicit latent variables to represent possible future trajectories, a path in the tree can explicitly explain the rough moving behaviors (e.g., go straight and then turn right), and thus provides better interpretability. Despite the hand-crafted tree, the experimental results on the ETH-UCY and Stanford Drone datasets demonstrate that our method is capable of matching or exceeding the performance of state-of-the-art methods. Interestingly, the experiments show that the raw built tree without training outperforms many prior deep neural network based approaches. Meanwhile, our method presents sufficient flexibility in long-term prediction and different best-of-K predictions.
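
A minimal sketch of expanding a hand-crafted tree of coarse futures from the observed velocity, one interpretable heading offset per branch; the offsets and the constant-velocity rollout rule are assumptions, not the paper's construction.

```python
import torch

def build_coarse_tree(obs, headings=(-0.5, -0.25, 0.0, 0.25, 0.5), horizon=12):
    """Expand coarse future branches: each child continues the last
    observed velocity rotated by a fixed heading offset (radians), so a
    root-to-leaf path reads as 'go straight', 'turn right', etc.

    obs: (T_obs, 2) observed positions. Returns (branches, horizon, 2)."""
    vel = obs[-1] - obs[-2]                   # last observed velocity
    paths = []
    for dtheta in headings:
        c, s = torch.cos(torch.tensor(dtheta)), torch.sin(torch.tensor(dtheta))
        R = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])
        step = R @ vel
        path = obs[-1] + step * torch.arange(1, horizon + 1)[:, None]
        paths.append(path)
    return torch.stack(paths)

obs = torch.cumsum(torch.ones(8, 2) * 0.1, dim=0)   # straight observed track
tree = build_coarse_tree(obs)                       # (5, 12, 2)
```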

NeurIPS Conference 2019 · Conference Paper

Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting

  • Jun Shu
  • Qi Xie
  • Lixuan Yi
  • Qian Zhao
  • Sanping Zhou
  • Zongben Xu
  • Deyu Meng

Current deep neural networks (DNNs) can easily overfit to biased training data with corrupted labels or class imbalance. The sample re-weighting strategy is commonly used to alleviate this issue by designing a weighting function mapping from training loss to sample weight, and then iterating between weight recalculation and classifier updating. Current approaches, however, need to manually pre-specify the weighting function as well as its additional hyper-parameters. This makes them fairly hard to apply generally in practice, since the proper weighting scheme varies significantly with the investigated problem and training data. To address this issue, we propose a method capable of adaptively learning an explicit weighting function directly from data. The weighting function is an MLP with one hidden layer, constituting a universal approximator for almost any continuous function, making the method able to fit a wide range of weighting function forms, including those assumed in conventional research. Guided by a small amount of unbiased meta-data, the parameters of the weighting function can be updated simultaneously with the learning process of the classifiers. Synthetic and real experiments substantiate the capability of our method for achieving proper weighting functions in class imbalance and noisy label cases, fully complying with the common settings in traditional methods, and in more complicated scenarios beyond conventional cases. This naturally leads to better accuracy than other state-of-the-art methods.
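
Following the abstract's own description, a one-hidden-layer MLP mapping per-sample loss to a weight; the hidden width and sigmoid output are assumptions here, and the bi-level meta-update on unbiased meta-data is only indicated in a comment rather than implemented.

```python
import torch
import torch.nn as nn

class WeightNet(nn.Module):
    """The weighting function in miniature: a one-hidden-layer MLP
    mapping each sample's training loss to a weight in (0, 1)."""
    def __init__(self, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, loss):                  # (B,) per-sample losses
        return self.net(loss.unsqueeze(1)).squeeze(1)

wnet = WeightNet()
per_sample_loss = torch.rand(32)              # e.g., per-sample cross-entropy
weighted = (wnet(per_sample_loss) * per_sample_loss).mean()
# In the full method, wnet's parameters are meta-updated on a small
# unbiased meta-set (bi-level optimization), alongside the classifier.
```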