Arrow Research search

Author name cluster

Han Jiang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
2 author rows

Possible papers

10

AAAI Conference 2026 Conference Paper

Decompose and Conquer: Compositional Reasoning for Zero-Shot Temporal Action Localization

  • Haoyu Tang
  • Tianyuan Liang
  • Han Jiang
  • Xuesong Liu
  • Qinghai Zheng
  • Yupeng Hu

Current Zero-Shot Temporal Action Localization (ZSTAL) methods, whether training-based or training-free, still predominantly rely on a single, unified query to localize an entire action. This unified representation is fundamentally ill-suited for complex real-world activities, as it fails to capture their internal compositional structure and adapt to dynamic, multi-stage variations across videos. To address this, we regard ZSTAL as a compositional reasoning task and introduce CASCADE, a Context-Aware Staged Action DEcomposition framework. Inspired by the human cognitive process of perceiving context, decomposing events, and reconstructing instances, CASCADE follows a training-free pipeline. It first perceives the video's context by leveraging a Multimodal Large Language Model (MLLM) to filter out irrelevant actions and then generate a rich, video-specific caption for each action present in the video. An LLM then decomposes this caption into multiple, temporally ordered stages, which serve as fine-grained queries to guide the MLLM in estimating frame-level confidence scores. Recognizing that this decomposition can fragment a single action, a novel hierarchical merging logic then reconstructs complete instances by intelligently fusing these preliminary temporal segments based on their semantic progression and coherence. Extensive experiments and ablation studies on THUMOS14 and ActivityNet-1.3 show that CASCADE not only sets a new state-of-the-art among training-free methods but, most notably, also significantly outperforms all prior training-based approaches on ActivityNet-1.3.
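The hierarchical merging step in this abstract — fusing per-stage temporal segments back into complete action instances — can be sketched roughly as follows. This is a hypothetical illustration, not the authors' code: the thresholding rule, the `max_gap` tolerance, and all function names are assumptions.

```python
def segments_from_scores(scores, thresh=0.5):
    """Threshold per-frame confidences into (start, end) segments (end exclusive)."""
    segs, start = [], None
    for i, s in enumerate(scores):
        if s >= thresh and start is None:
            start = i
        elif s < thresh and start is not None:
            segs.append((start, i))
            start = None
    if start is not None:
        segs.append((start, len(scores)))
    return segs

def merge_stage_segments(stage_segs, max_gap=2):
    """Fuse temporally ordered stage segments into complete instances:
    a later-stage segment extends an instance when it starts within
    max_gap frames of that instance's current end."""
    instances = [list(seg) for seg in stage_segs[0]]
    for segs in stage_segs[1:]:
        for s, e in segs:
            attached = False
            for inst in instances:
                if s <= inst[1] + max_gap and e > inst[1]:
                    inst[1] = e
                    attached = True
                    break
            if not attached:
                instances.append([s, e])
    return [tuple(i) for i in instances]
```

A stage-1 segment ending at frame 5 followed by a stage-2 segment starting at frame 5 would merge into one instance spanning both.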

NeurIPS Conference 2025 Conference Paper

AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation

  • Sixiang Chen
  • Jiaming Liu
  • Siyuan Qian
  • Han Jiang
  • Zhuoyang Liu
  • Chenyang Gu
  • Xiaoqi Li
  • Chengkai Hou

Recently, mobile manipulation has attracted increasing attention for enabling language-conditioned robotic control in household tasks. However, existing methods still face challenges in coordinating the mobile base and the manipulator, primarily due to two limitations. On the one hand, they fail to explicitly model the influence of the mobile base on manipulator control, which easily leads to error accumulation under high degrees of freedom. On the other hand, they treat the entire mobile manipulation process with the same visual observation modality (e.g., either all 2D or all 3D), overlooking the distinct multimodal perception requirements at different stages of mobile manipulation. To address this, we propose the Adaptive Coordination Diffusion Transformer (AC-DiT), which enhances mobile base and manipulator coordination for end-to-end mobile manipulation. First, since the motion of the mobile base directly influences the manipulator's actions, we introduce a mobility-to-body conditioning mechanism that guides the model to first extract base motion representations, which are then used as a context prior for predicting whole-body actions. This enables whole-body control that accounts for the potential impact of the mobile base's motion. Second, to meet the perception requirements at different stages of mobile manipulation, we design a perception-aware multimodal conditioning strategy that dynamically adjusts the fusion weights between various 2D visual images and 3D point clouds, yielding visual features tailored to the current perceptual needs. This allows the model to, for example, adaptively rely more on 2D inputs when semantic information is crucial for action prediction, while placing greater emphasis on 3D geometric information when precise spatial understanding is required. We empirically validate AC-DiT through extensive experiments on both simulated and real-world mobile manipulation tasks, demonstrating superior performance compared to existing methods.
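The perception-aware conditioning described in this abstract amounts to a gate that reweights 2D and 3D features per step. A minimal sketch, assuming a softmax gate over the two modalities; the fixed gate matrix stands in for a learned network, and every name here is a hypothetical illustration, not AC-DiT's actual interface:

```python
import numpy as np

def fuse_modalities(feat_2d, feat_3d, w_gate, context):
    """Map a context vector to one logit per modality, softmax the
    logits into fusion weights, and blend the two feature vectors."""
    logits = w_gate @ context            # shape (2,): [logit_2d, logit_3d]
    logits = logits - logits.max()       # subtract max for numerical stability
    w = np.exp(logits) / np.exp(logits).sum()
    return w[0] * feat_2d + w[1] * feat_3d, w
```

With a context favoring semantics, the 2D weight dominates; with one favoring geometry, the 3D weight would.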

AAAI Conference 2025 Conference Paper

Boundary-Aware Temporal Dynamic Pseudo-Supervision Pairs Generation for Zero-Shot Natural Language Video Localization

  • Xiongwen Deng
  • Haoyu Tang
  • Han Jiang
  • Qinghai Zheng
  • Jihua Zhu

Zero-shot Natural Language Video Localization (NLVL) aims to automatically generate moments and corresponding pseudo queries from raw videos for the training of the localization model without any manual annotations. Existing approaches typically produce pseudo queries as simple words, which overlooks the complexity of queries in real-world scenarios. Considering the powerful text modeling capabilities of large language models (LLMs), leveraging LLMs to generate complete queries that are closer to human descriptions is a potential solution. However, directly integrating LLMs into existing approaches introduces several issues, including insensitivity, isolation, and lack of regulation, which prevent the full exploitation of LLMs to enhance zero-shot NLVL performance. To address these issues, we propose BTDP, an innovative framework for Boundary-aware Temporal Dynamic Pseudo-supervision pairs generation. Our method contains two crucial operations: 1) Boundary Segmentation, which identifies both visual boundaries and semantic boundaries to generate the atomic segments and activity descriptions, tackling the issue of insensitivity; 2) Context Aggregation, which employs the LLMs with a self-evaluation process to aggregate and summarize global video information for optimized pseudo moment-query pairs, tackling the issues of isolation and lack of regulation. Comprehensive experimental results on the Charades-STA and ActivityNet Captions datasets demonstrate the effectiveness of our BTDP method.
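The Boundary Segmentation operation above detects visual boundaries in raw video. One common way to realize such a detector — purely an assumption here, not necessarily the paper's method — is to mark a boundary wherever the cosine similarity between consecutive frame features drops below a threshold:

```python
import numpy as np

def visual_boundaries(frame_feats, thresh=0.8):
    """Return indices t such that a boundary falls between frames
    t-1 and t, judged by a drop in cosine similarity of their
    features (frame_feats: array of shape [T, D])."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sims = (f[:-1] * f[1:]).sum(axis=1)   # cosine similarity of adjacent rows
    return [t + 1 for t, s in enumerate(sims) if s < thresh]
```

The resulting cut points delimit the atomic segments that the LLM then describes and aggregates.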

NeurIPS Conference 2025 Conference Paper

PanoWan: Lifting Diffusion Video Generation Models to 360$^\circ$ with Latitude/Longitude-aware Mechanisms

  • Yifei Xia
  • Shuchen Weng
  • Siqi Yang
  • Jingqi Liu
  • Chengxuan Zhu
  • Minggui Teng
  • Zijian Jia
  • Han Jiang

Panoramic video generation enables immersive 360$^\circ$ content creation, valuable in applications that demand scene-consistent world exploration. However, existing panoramic video generation models struggle to leverage pre-trained generative priors from conventional text-to-video models for high-quality and diverse panoramic video generation, due to limited dataset scale and the gap in spatial feature representations. In this paper, we introduce PanoWan to effectively lift pre-trained text-to-video models to the panoramic domain, equipped with minimal modules. PanoWan employs latitude-aware sampling to avoid latitudinal distortion, while its rotated semantic denoising and padded pixel-wise decoding ensure seamless transitions at longitude boundaries. To provide sufficient panoramic videos for learning these lifted representations, we contribute PanoVid, a high-quality panoramic video dataset with captions and diverse scenarios. Consequently, PanoWan achieves state-of-the-art performance in panoramic video generation and demonstrates robustness for zero-shot downstream tasks.
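The latitude-aware sampling mentioned above responds to a basic property of equirectangular projection: pixel rows near the poles cover far less spherical area than rows near the equator. A minimal sketch of cosine-latitude row weighting, an illustrative stand-in rather than PanoWan's exact mechanism:

```python
import numpy as np

def latitude_weights(height):
    """Per-row weights proportional to cos(latitude) for an
    equirectangular frame of the given pixel height; rows near the
    poles cover less spherical area, so they get smaller weight."""
    lat = (np.arange(height) + 0.5) / height * np.pi - np.pi / 2
    w = np.cos(lat)
    return w / w.sum()   # normalize to a distribution over rows
```

Sampling (or weighting losses) by these values keeps polar rows from being over-represented relative to their true solid angle.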

IJCAI Conference 2024 Conference Paper

DenseKoopman: A Plug-and-Play Framework for Dense Pedestrian Trajectory Prediction

  • Xianbang Li
  • Yilong Ren
  • Han Jiang
  • Haiyang Yu
  • Yanlei Cui
  • Liang Xu

Pedestrian trajectory prediction has emerged as a core component of human-robot interaction and autonomous driving. Fast and accurate prediction of surrounding pedestrians supports decision-making and improves safety and efficiency. However, pedestrians' future trajectories interact with those of their surrounding traffic participants. As the density of pedestrians increases, the complexity of such interactions also increases significantly, leading to an inevitable decrease in the accuracy of pedestrian trajectory prediction. To address this issue, we propose DenseKoopman, a plug-and-play framework for dense pedestrian trajectory prediction. Specifically, we introduce the Koopman operator theory to find an embedding space for a global linear approximation of a nonlinear pedestrian motion system. By encoding historical trajectories as linear state embeddings in the Koopman space, we transform nonlinear trajectory data for pedestrians in dense scenes. This linearized representation greatly reduces the complexity of dense pedestrian trajectory prediction. Extensive experiments on pedestrian trajectory prediction benchmarks demonstrate the superiority of the proposed framework. We also analyzed the data transformation to explore how our DenseKoopman framework works with each validation method and to uncover motion patterns that may be hidden within the trajectory data. Code is available at https://github.com/lixianbang/DenseKoopman.
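The core Koopman idea in this abstract — a global linear approximation of a nonlinear motion system — can be sketched with a DMD-style least-squares fit of a linear one-step operator. Learning a good embedding is the hard part in practice; this toy version, which is not the paper's implementation, uses the raw state itself as the embedding:

```python
import numpy as np

def fit_koopman(states):
    """Least-squares fit of K with x_{t+1} ≈ K x_t: a finite-dimensional
    (DMD-style) approximation of the Koopman operator on the chosen
    embedding (states: array of shape [T, D])."""
    X, Y = states[:-1].T, states[1:].T   # columns are successive time steps
    return Y @ np.linalg.pinv(X)

def rollout(K, x0, steps):
    """Predict future embeddings by repeatedly applying K."""
    xs, x = [], x0
    for _ in range(steps):
        x = K @ x
        xs.append(x)
    return np.stack(xs)
```

On a genuinely linear system (e.g., constant-velocity motion) the fit recovers the true transition matrix exactly; the embedding network's job is to make real pedestrian dynamics look this linear.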

IROS Conference 2024 Conference Paper

Small Multi-Rotor UAV Oriented Direct Thrust Sensor Based on Lightweight Barometers

  • Han Jiang
  • Yanchun Chang
  • Liying Yang 0002
  • Yuqing He

The multirotor unmanned aerial vehicle (UAV) requires precise control over thrust output when operating in wind-disturbed environments or executing intricate flight missions. Although current commercial force sensors offer high sensitivity and accuracy, they are often heavy and costly. These characteristics restrict their applicability in weight-sensitive and cost-sensitive scenarios, such as thrust measurement in UAVs. To overcome this difficulty, we have developed an embedded barometric force sensor (BFS) that mounts between the UAV’s airframe and motor, allowing direct measurement of the force exerted by the rotor on the UAV’s rigid body. The BFS is designed using low-cost MEMS barometers as tactile force sensors, encased in polyurethane rubber. Subsequently, we established its parameter model and devised a stability improvement strategy to reduce the impact of temperature. Additionally, we designed a structure suitable for mounting the BFS on the UAV to safeguard the rubber module from damage and reconstructed the thrust model to account for the impact of weight and friction on thrust measurement. Finally, we assembled testing platforms to validate the performance of the BFS. Experimental results demonstrate the BFS’s excellent linearity, wide range, adequate bandwidth to respond to UAV thrust variations, and confirm the feasibility of mounting the BFS on the UAV for thrust measurement and force feedback control.
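The parameter model mentioned above maps barometer readings to force, with a correction for temperature drift. A toy least-squares calibration — the linear form F ≈ a·p + b·T + c is an assumption for illustration only, not the paper's actual model:

```python
import numpy as np

def fit_force_model(pressure, temperature, force):
    """Fit F ≈ a*p + b*T + c by ordinary least squares: a stand-in for
    a sensor parameter model with a linear temperature-compensation
    term. Returns the coefficients (a, b, c)."""
    A = np.column_stack([pressure, temperature, np.ones_like(pressure)])
    coeffs, *_ = np.linalg.lstsq(A, force, rcond=None)
    return coeffs
```

Given calibration runs at varying thrust and ambient temperature, the fitted coefficients let the sensor report force from raw pressure while cancelling the temperature term.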

JBHI Journal 2022 Journal Article

A Progressive Generative Adversarial Method for Structurally Inadequate Medical Image Data Augmentation

  • Ruixuan Zhang
  • Wenhuan Lu
  • Xi Wei
  • Jialin Zhu
  • Han Jiang
  • Zhiqiang Liu
  • Jie Gao
  • Xuewei Li

The generation-based data augmentation method can overcome, to a certain extent, the challenge caused by the imbalance of medical image data. However, most current research focuses on images with a unified structure, which are easy to learn. Ultrasound images, by contrast, are structurally inadequate: their structure is difficult for the generative network to capture, so the generated images lack structural legitimacy. Therefore, a Progressive Generative Adversarial Method for Structurally Inadequate Medical Image Data Augmentation is proposed in this paper, comprising a network and a strategy. Our Progressive Texture Generative Adversarial Network alleviates the adverse effect of completely truncating the reconstruction of structure and texture during the generation process and enhances the implicit association between structure and texture. The Image Data Augmentation Strategy based on Mask-Reconstruction overcomes data imbalance from a novel perspective, maintains the legitimacy of the structure in the generated data, and interpretably increases the diversity of disease data. The experiments prove the effectiveness of our method for data augmentation and image reconstruction on structurally inadequate medical images, both qualitatively and quantitatively. Finally, weakly supervised segmentation of lesions is an additional contribution of our method.

IJCAI Conference 2013 Conference Paper

i, Poet: Automatic Chinese Poetry Composition through a Generative Summarization Framework under Constrained Optimization

  • Rui Yan
  • Han Jiang
  • Mirella Lapata
  • Shou-De Lin
  • Xueqiang Lv
  • Xiaoming Li

Part of the long-lasting cultural heritage of China is the classical ancient Chinese poems, which follow strict formats and complicated linguistic rules. Automatic Chinese poetry composition is considered a challenging problem in computational linguistics, requires substantial artificial intelligence, and has not been well addressed. In this paper, we formulate the poetry composition task as an optimization problem based on a generative summarization framework under several constraints. Given the user-specified writing intents, the system retrieves candidate terms out of a large poem corpus, and then orders these terms to fit into poetry formats, satisfying tonal and rhythm requirements. The optimization process under constraints is conducted via iterative term substitutions until convergence, and outputs the subset with the highest utility as the generated poem. For experiments, we perform generation on a large dataset of 61,960 classic poems from the Tang and Song Dynasties of China. A comprehensive evaluation, using both human judgments and ROUGE scores, has demonstrated the effectiveness of our proposed approach.
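The iterative term-substitution loop in this abstract can be sketched as a simple hill climb over candidate terms: keep swapping any position for a candidate that raises the utility, and stop when no swap helps. A hypothetical illustration, not the paper's optimizer, in which `utility` is assumed to fold the format and tonal constraints into a single score:

```python
def compose(candidates, utility, length):
    """Greedy generation-by-substitution: start from the first `length`
    candidate terms, then repeatedly try replacing each position with a
    higher-utility candidate until the poem no longer improves."""
    poem = list(candidates[:length])
    improved = True
    while improved:
        improved = False
        for i in range(length):
            for term in candidates:
                trial = poem[:i] + [term] + poem[i + 1:]
                if utility(trial) > utility(poem):
                    poem, improved = trial, True
    return poem
```

Because every accepted swap strictly increases a bounded utility, the loop always terminates, though — like any hill climb — only at a local optimum.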