Author name cluster

Weiliang Meng

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers

2 author rows

AAAI Conference 2026 Conference Paper

Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction

Mingda Jia
Weiliang Meng
Zenghuang Fu
Yiheng Li
Qi Zeng
Yifan Zhang
Ju Xin
Rongtao Xu

Dense video captioning jointly localizes and captions salient events in untrimmed videos. Recent methods primarily focus on leveraging additional prior knowledge and advanced multi-task architectures to achieve competitive performance. However, these pipelines rely on implicit modeling that uses frame-level or fragmented video features, failing to capture the temporal coherence across event sequences and comprehensive semantics within visual contexts. To address this, we propose an explicit temporal-semantic modeling framework called Context-Aware Cross-Modal Interaction (CACMI), which leverages both latent temporal characteristics within videos and linguistic semantics from a text corpus. Specifically, our model consists of two core components: Cross-modal Frame Aggregation aggregates relevant frames to extract temporally coherent, event-aligned textual features through cross-modal retrieval; and Context-aware Feature Enhancement utilizes query-guided attention to integrate visual dynamics with pseudo-event semantics. Extensive experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that CACMI achieves state-of-the-art performance on the dense video captioning task.


IROS Conference 2025 Conference Paper

AccidentX: A Large-Scale Multimodal BEV Dataset for Traffic Accident Analysis and Prevention

Muyang Zhang
Zhe Feng
Jinming Yang
Mingda Jia
Weiliang Meng
Wenxuan Wu
Jiguang Zhang
Xiaopeng Zhang 0001

With the rapid development and widespread application of autonomous driving technology, the accurate analysis and prevention of traffic accidents have become critical challenges. However, current traffic accident datasets are often constrained by limited scale and diversity, impeding progress in this field. To address these limitations, we introduce AccidentX, a large-scale multimodal dataset specifically curated for comprehensive traffic accident analysis and prevention. Our AccidentX comprises over 10,000 bird’s-eye view (BEV) videos generated using the CARLA simulator, with detailed annotations covering a wide range of traffic scenarios. In comparison to existing datasets such as nuScenes, our AccidentX offers seven times more video frames and leverages Vision-Language Models (VLMs) and GPT-4o for enhanced scene understanding and decision-making. We also establish a benchmark for state-of-the-art Multimodal Large Language Models (MLLMs) on AccidentX, fostering further research and innovation within the community. AccidentX will be made available as a fully open source resource for the advancement of the autonomous driving safety algorithm community.


IJCAI Conference 2025 Conference Paper

DiffusionIMU: Diffusion-Based Inertial Navigation with Iterative Motion Refinement

Xiaoqiang Teng
Chenyang Li
Shibiao Xu
Zhihao Hao
Deke Guo
Jingyuan Li
Haisheng Li
Weiliang Meng

Inertial navigation enables self-contained localization using only Inertial Measurement Units (IMUs), making it widely applicable in various domains such as navigation, augmented reality, and robotics. However, existing methods suffer from drift accumulation due to sensor noise and the difficulty of capturing long-range temporal dependencies, limiting their robustness and accuracy. To address these challenges, we propose DiffusionIMU, a novel diffusion-based framework for inertial navigation. DiffusionIMU enhances direct velocity regression from IMU data through an iterative generative denoising process, progressively refining motion state estimation. It integrates noise-adaptive feature modulation for handling sensor variability, a feature alignment mechanism for representation consistency, and diffusion-based temporal modeling to decrease accumulated drift. Experiments show that DiffusionIMU consistently outperforms existing methods, demonstrating superior generalization to unseen users while alleviating the impact of sensor noise.


AAAI Conference 2025 Conference Paper

PanoDiT: Panoramic Videos Generation with Diffusion Transformer

Muyang Zhang
Yuzhi Chen
Rongtao Xu
Changwei Wang
Jinming Yang
Weiliang Meng
Jianwei Guo
Huihuang Zhao

As immersive experiences become increasingly popular, panoramic video has garnered significant attention in both research and applications. The high cost associated with capturing panoramic video underscores the need for efficient prompt-based generation methods. Although recent text-to-video (T2V) diffusion techniques have shown potential in standard video generation, they face challenges when applied to panoramic videos due to substantial differences in content and motion patterns. In this paper, we propose PanoDiT, a framework that utilizes the Diffusion Transformer (DiT) architecture to generate panoramic videos from text descriptions. Unlike traditional methods that rely on UNet-based denoising, our method leverages a transformer architecture for denoising, incorporating both temporal and global attention mechanisms. This ensures coherent frame generation and smooth motion transitions, offering distinct advantages in long-horizon generation tasks. To further enhance motion and consistency in the generated videos, we introduce DTM-LoRA and two panoramic-specific losses. Compared to previous methods, our PanoDiT achieves state-of-the-art performance across various evaluation metrics and a user study. Code is available in the supplementary material.


ICML Conference 2025 Conference Paper

Reidentify: Context-Aware Identity Generation for Contextual Multi-Agent Reinforcement Learning

Zhiwei Xu 0005
Kun Hu
Xin Xin 0003
Weiliang Meng
Yiwei Shi
Hangyu Mao
Bin Zhang 0052
Dapeng Li 0001

Generalizing multi-agent reinforcement learning (MARL) to accommodate variations in problem configurations remains a critical challenge in real-world applications, where even subtle differences in task setups can cause pre-trained policies to fail. To address this, we propose Context-Aware Identity Generation (CAID), a novel framework to enhance MARL performance under the Contextual MARL (CMARL) setting. CAID dynamically generates unique agent identities through the agent identity decoder built on a causal Transformer architecture. These identities provide contextualized representations that align corresponding agents across similar problem variants, facilitating policy reuse and improving sample efficiency. Furthermore, the action regulator in CAID incorporates these agent identities into the action-value space, enabling seamless adaptation to varying contexts. Extensive experiments on CMARL benchmarks demonstrate that CAID significantly outperforms existing approaches by enhancing both sample efficiency and generalization across diverse context variants.


ICRA Conference 2024 Conference Paper

DefFusion: Deformable Multimodal Representation Fusion for 3D Semantic Segmentation

Rongtao Xu
Changwei Wang 0001
Duzhen Zhang
Man Zhang 0005
Shibiao Xu
Weiliang Meng
Xiaopeng Zhang 0001

The complementarity between camera and LiDAR data makes fusion methods a promising approach to improve 3D semantic segmentation performance. Recent transformer-based methods have also demonstrated superiority in segmentation. However, multimodal solutions incorporating transformers are underexplored and face two key inherent difficulties: over-attention and noise from different modal data. To overcome these challenges, we propose a Deformable Multimodal Representation Fusion (DefFusion) framework consisting mainly of a Deformable Representation Fusion Transformer and Dynamic Representation Augmentation Modules. The Deformable Representation Fusion Transformer introduces the deformable mechanism in multimodal fusion, avoiding over-attention and improving efficiency by adaptively modeling a 2D key/value set for a given 3D query, thus enabling multimodal fusion with higher flexibility. To enhance the 2D representation and 3D representation, the Dynamic Representation Enhancement Module is proposed to dynamically remove noise in the input representation via Dynamic Grouped Representation Generation and Dynamic Mask Generation. Extensive experiments validate that our model achieves the best 3D semantic segmentation performance on SemanticKITTI and NuScenes benchmarks.


JBHI Journal 2024 Journal Article

SkinFormer: Learning Statistical Texture Representation With Transformer for Skin Lesion Segmentation

Rongtao Xu
Changwei Wang
Jiguang Zhang
Shibiao Xu
Weiliang Meng
Xiaopeng Zhang

Accurate skin lesion segmentation from dermoscopic images is of great importance for skin cancer diagnosis. However, automatic segmentation of melanoma remains a challenging task because it is difficult to incorporate useful texture representations into the learning process. Texture representations are not only related to the local structural information learned by CNN, but also include the global statistical texture information of the input image. In this paper, we propose a transFormer network (SkinFormer) that efficiently extracts and fuses statistical texture representation for Skin lesion segmentation. Specifically, to quantify the statistical texture of input features, a Kurtosis-guided Statistical Counting Operator is designed. We propose Statistical Texture Fusion Transformer and Statistical Texture Enhance Transformer with the help of Kurtosis-guided Statistical Counting Operator by utilizing the transformer's global attention mechanism. The former fuses structural texture information and statistical texture information, and the latter enhances the statistical texture of multi-scale features. Extensive experiments on three publicly available skin lesion datasets validate that our SkinFormer outperforms other SOTA methods, and our method achieves a 93.2% Dice score on ISIC 2018. SkinFormer could readily be extended to segment 3D images in the future.


EAAI Journal 2023 Journal Article

Automatic polyp segmentation via image-level and surrounding-level context fusion deep neural network

Changwei Wang
Rongtao Xu
Shibiao Xu
Weiliang Meng
Xiaopeng Zhang

More than 95% of colorectal cancers are gradually transformed from polyps, so regular colonoscopy polyp examination plays an important role in cancer prevention and early treatment. However, automatic polyp segmentation remains a challenging task due to the low-contrast tissue environment and the small size and variety (e.g., shape, color, texture) of polyps. In this case, the rich context information in colonoscopy images is worth exploring to address the above issues. On the one hand, the image-level context with a global receptive field can be used to enhance the discrimination between the foreground and the background to alleviate the concealment and indistinguishability of polyps in colonoscopy images. On the other hand, the surrounding-level context focused on the surrounding pathological region of the polyp has more detailed features that are beneficial for polyp segmentation. Therefore, we propose a novel network named ISCNet that aims to fuse image-level and surrounding-level context information for polyp segmentation. Specifically, we first introduce the Global-Guided Context Aggregation (GGCA) module to explicitly model the foreground and background of polyp segmentation through image-level context, thereby flexibly enhancing polyp-related features and suppressing background-related features. Then, we design the Diverse Surrounding Context Focus (DSCF) module to focus on the surrounding area of the polyp to extract diverse local contexts to refine the segmentation results. Finally, we fuse the feature maps derived from these two modules so that our ISCNet can benefit from both the image-level and surrounding-level context information. To verify the effectiveness of our method, we conduct comprehensive experimental evaluations on three challenging datasets. The quantitative and qualitative experimental results confirm that our ISCNet outperforms current state-of-the-art methods by a large margin. Our code is available at https://github.com/vvmedical/ISCNet.


EAAI Journal 2023 Journal Article

Dual-stream Representation Fusion Learning for accurate medical image segmentation

Rongtao Xu
Changwei Wang
Shibiao Xu
Weiliang Meng
Xiaopeng Zhang

Accurately segmenting regions of interest in various medical images is essential to clinical research and applications. Although deep learning-based methods have achieved good results, fully automated segmentation results still need refinement to handle the small size, complexity, and irregularity of lesion shapes. To address this issue, we propose a Dual-stream Representation Fusion Learning (DRFL) paradigm for accurate clinical segmentation, including Dual-stream Fusion Module, Representation Fusion Transformer Module and Peakiness Fusion Attention Module. Specifically, Dual-stream Fusion Module can simultaneously generate binary masks and high-resolution images with a segmentation stream and a super-resolution stream that share a feature extractor; both prediction outputs are then merged as the input of Fusion Module to further improve the performance of the network for generating the final segmentation result. Representation Fusion Transformer Module is a lightweight module that fuses high-resolution representation and fine-grained structure representation. Peakiness Fusion Attention Module can capture more salient features while fusing more spatial information to improve the performance of the network. The effectiveness of our dual-stream representation fusion learning is validated on different medical image segmentation tasks, and extensive experiments show that our DRFL outperforms the state-of-the-art methods in segmentation quality on lung nodule segmentation, lung segmentation, cell contour segmentation, and prostate segmentation. Our code is available at https://github.com/Rongtao-Xu/RepresentationLearning/tree/main/DRFL-EAAI2023.


AAAI Conference 2023 Conference Paper

Self Correspondence Distillation for End-to-End Weakly-Supervised Semantic Segmentation

Rongtao Xu
Changwei Wang
Jiaxi Sun
Shibiao Xu
Weiliang Meng
Xiaopeng Zhang

Efficiently training accurate deep models for weakly supervised semantic segmentation (WSSS) with image-level labels is challenging and important. Recently, end-to-end WSSS methods have become the focus of research due to their high training efficiency. However, current methods suffer from insufficient extraction of comprehensive semantic information, resulting in low-quality pseudo-labels and sub-optimal solutions for end-to-end WSSS. To this end, we propose a simple and novel Self Correspondence Distillation (SCD) method to refine pseudo-labels without introducing external supervision. Our SCD enables the network to utilize feature correspondence derived from itself as a distillation target, which can enhance the network's feature learning process by complementing semantic information. In addition, to further improve the segmentation accuracy, we design a Variation-aware Refine Module to enhance the local consistency of pseudo-labels by computing pixel-level variation. Finally, we present an efficient end-to-end Transformer-based framework (TSCD) via SCD and Variation-aware Refine Module for the accurate WSSS task. Extensive experiments on the PASCAL VOC 2012 and MS COCO 2014 datasets demonstrate that our method significantly outperforms other state-of-the-art methods. Our code is available at https://github.com/Rongtao-Xu/RepresentationLearning/tree/main/SCD-AAAI2023.


IROS Conference 2022 Conference Paper

GeoROS: Georeferenced Real-time Orthophoto Stitching with Unmanned Aerial Vehicle

Guangze Gao
Mengke Yuan
Zhihao Ma
Jiaming Gu
Weiliang Meng
Shibiao Xu
Xiaopeng Zhang 0001

Simultaneous orthophoto stitching during the flight of Unmanned Aerial Vehicles (UAV) can greatly promote the practicability and instantaneity of diverse applications such as emergency disaster rescue, digital agriculture, and cadastral survey, which is of remarkable interest in aerial photogrammetry. However, the inaccurately estimated camera poses and the intuitive fusion strategy of existing methods lead to misalignment and distortion artifacts in orthophoto mosaics. To address these issues, we propose a Georeferenced Real-time Orthophoto Stitching method (GeoROS), which can achieve efficient and accurate camera pose estimation through exploiting geolocation information in monocular visual simultaneous localization and mapping (SLAM) and fuse transformed images via orthogonality-preserving criterion. Specifically, in the SLAM process, georeferenced tracking is employed to acquire high-quality initial camera poses with a geolocation based motion model and facilitate non-linear pose optimization. Meanwhile, we design a georeferenced mapping scheme by introducing robust geolocation constraints in joint optimization of camera poses and the position of landmarks. Finally, aerial images warped with localized cameras are fused by considering both the orthogonality of camera orientation relative to the ground plane and the pixel centrality to fulfill global orthorectification. Besides, we construct two datasets with global navigation satellite system (GNSS) information of different scenarios and validate the superiority of our GeoROS method compared with state-of-the-art methods in accuracy and efficiency.


EAAI Journal 2022 Journal Article

Instance segmentation of biological images using graph convolutional network

Rongtao Xu
Ye Li
Changwei Wang
Shibiao Xu
Weiliang Meng
Xiaopeng Zhang

Instance segmentation in biological images is an important task in the field of biological images and biomedical analysis. Different from the instance segmentation of natural image scenes, this task is still challenging because there are a large number of overlapping objects with similar appearance as well as great variability in shape, size and texture in the foreground and background. In this paper, we propose a novel graph-guided method for instance segmentation of biological images, which successfully addresses these peculiarities. Our method predicts the embedding at each pixel and uses clustering to recover instances during testing. Specifically, we design the Graph-guided Feature Fusion Module in response to overlapping instances. Our Graph-guided Feature Fusion Module combines fine deep features and coarse shallow features to learn the affinity matrix, and then uses a graph convolutional network to guide the network to learn object-level local features. Next, we devise the Gated Spatial Attention Module to effectively learn key spatial information by introducing a gating mechanism. Furthermore, we introduce the Cluster Distance Loss, which can effectively distinguish foreground objects from similar backgrounds. The effectiveness of our proposed method has been verified on various biological and biomedical datasets. The experimental results show that our method is superior to previous embedding-based instance segmentation methods. The SBD metric for our method reached 90.8% on the plant phenotype dataset (CVPPP), 72.5% on the cell nucleus dataset (DSB2018), and 81.8% on the C. elegans dataset, all achieving state-of-the-art performance.


AAAI Conference 2022 Conference Paper

MTLDesc: Looking Wider to Describe Better

Changwei Wang
Rongtao Xu
Yuyang Zhang
Shibiao Xu
Weiliang Meng
Bin Fan
Xiaopeng Zhang

Limited by the locality of convolutional neural networks, most existing local features description methods only learn local descriptors with local information and lack awareness of global and surrounding spatial context. In this work, we focus on making local descriptors “look wider to describe better” by learning local Descriptors with More Than just Local information (MTLDesc). Specifically, we resort to context augmentation and spatial attention mechanisms to make our MTLDesc obtain non-local awareness. First, Adaptive Global Context Augmented Module and Diverse Local Context Augmented Module are proposed to construct robust local descriptors with context information from global to local. Second, Consistent Attention Weighted Triplet Loss is designed to integrate spatial attention awareness into both optimization and matching stages of local descriptors learning. Third, Local Features Detection with Feature Pyramid is given to obtain more stable and accurate keypoints localization. With the above innovations, the performance of our MTLDesc significantly surpasses the prior state-of-the-art local descriptors on HPatches, Aachen Day-Night localization and InLoc indoor localization benchmarks. Our code is available at https://github.com/vignywang/MTLDesc.
