Arrow Research search

Author name cluster

Xuemeng Song

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
2 author rows

Possible papers

13

AAAI Conference 2026 Conference Paper

D2MoRA: Diversity-Regulated Asymmetric MoE-LoRA Decomposition for Efficient Multi-Task Adaptation

  • Jianhui Zuo
  • Xuemeng Song
  • Haokun Wen
  • Meng Liu
  • Yupeng Hu
  • Jiuru Wang
  • Liqiang Nie

Low-Rank Adaptation (LoRA) has emerged as a powerful parameter-efficient fine-tuning method for adapting large language models to downstream tasks. Recent studies have leveraged the Mixture-of-Experts (MoE) mechanism to effectively integrate multiple LoRA modules, facilitating efficient parameter adaptation for multi-task scenarios. It has been shown that fostering knowledge sharing across LoRA experts can greatly enhance parameter adaptation efficiency. However, existing approaches to LoRA expert knowledge sharing still face two key limitations: constrained functional specialization and induced expert homogenization. To address these issues, we propose a novel diversity-regulated asymmetric MoE-LoRA decomposition framework, which achieves flexible knowledge sharing through asymmetric expert decomposition and guarantees expert diversity with a dual orthogonality regularization. Extensive experiments on eight public benchmarks, spanning both multi-task and single-task settings, demonstrate the superiority of our approach over existing methods.
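The abstract's building blocks are standard. A minimal sketch of the generic MoE-LoRA combination it starts from (not D2MoRA's asymmetric decomposition or its orthogonality regularizer; all shapes and names here are illustrative): each expert is a low-rank pair (B, A), and a softmax gate mixes the experts' weight updates.

```python
import numpy as np

def lora_delta(B, A, alpha, r):
    """Standard LoRA update term: (alpha / r) * B @ A."""
    return (alpha / r) * (B @ A)

def moe_lora_forward(x, W, experts, gate_logits, alpha=16.0, r=4):
    """Mix several LoRA experts with softmax gating, then apply
    the adapted weight W + delta to the input batch x."""
    gates = np.exp(gate_logits - gate_logits.max())
    gates /= gates.sum()
    delta = sum(g * lora_delta(B, A, alpha, r)
                for g, (B, A) in zip(gates, experts))
    return x @ (W + delta).T

rng = np.random.default_rng(0)
d_out, d_in, r, n_experts = 8, 16, 4, 3
W = rng.normal(size=(d_out, d_in))                     # frozen base weight
experts = [(rng.normal(size=(d_out, r)),               # B_i
            rng.normal(size=(r, d_in)))                # A_i
           for _ in range(n_experts)]
x = rng.normal(size=(2, d_in))
y = moe_lora_forward(x, W, experts, gate_logits=np.array([0.5, 0.1, -0.3]))
print(y.shape)  # (2, 8)
```

Only the low-rank factors and the gate are trained, which is what makes the scheme parameter-efficient in multi-task settings.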

AAAI Conference 2026 Conference Paper

TIME: Temporal-Sensitive Multi-Dimensional Instruction Tuning and Robust Benchmarking for Video-LLMs

  • Yunxiao Wang
  • Meng Liu
  • Wenqi Liu
  • Xuemeng Song
  • Bin Wen
  • Fan Yang
  • Tingting Gao
  • Di Zhang

Video large language models have achieved remarkable performance in tasks such as video question answering; however, their temporal understanding remains suboptimal. To address this limitation, we curate a dedicated instruction fine-tuning dataset that focuses on enhancing temporal comprehension across five key dimensions. To reduce reliance on costly temporal annotations, we introduce a multi-task prompt fine-tuning approach that seamlessly integrates temporal-sensitive tasks into existing instruction datasets without requiring additional annotations. Furthermore, we develop a novel benchmark for temporal-sensitive video understanding that not only fills the gaps in dimension coverage left by existing benchmarks but also rigorously filters out potential shortcuts, ensuring a more accurate evaluation. Extensive experimental results demonstrate that our approach significantly enhances the temporal understanding of video-LLMs while avoiding reliance on shortcuts.

AAAI Conference 2025 Conference Paper

Content-aware Balanced Spectrum Encoding in Masked Modeling for Time Series Classification

  • Yudong Han
  • Haocong Wang
  • Yupeng Hu
  • Yongshun Gong
  • Xuemeng Song
  • Weili Guan

Due to their superior ability to model global dependencies, the transformer and its variants have become the primary choice in Masked Time-series Modeling (MTM) for the time-series classification task. In this paper, we experimentally show that existing transformer-based MTM methods encounter two under-explored issues when dealing with time-series data: (1) they encode features by performing long-dependency ensemble averaging, which easily results in rank collapse and feature homogenization as the layers go deeper; (2) they exhibit distinct priorities in fitting different frequency components contained in the time series, inevitably leading to spectrum energy imbalance in the encoded features. To tackle these issues, we propose an auxiliary content-aware balanced decoder (CBD) to optimize the encoding quality in the spectrum space within the masked modeling scheme. Specifically, the CBD iterates over a series of fundamental blocks, and thanks to two tailored units, each block progressively refines the masked representation by adjusting the interaction pattern based on local content variations of the time series and learning to recalibrate the energy distribution across different frequency components. Moreover, a dual-constraint loss is devised to enhance the mutual optimization of the vanilla decoder and our CBD. Extensive experimental results on ten time-series classification datasets show that our method surpasses a wide range of baselines. Meanwhile, a series of explanatory results is showcased to demystify the behaviors of our method.

NeurIPS Conference 2025 Conference Paper

Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization

  • Wenqi Liu
  • Xuemeng Song
  • Jiaxi Li
  • Yinwei Wei
  • Na Zheng
  • Jianhua Yin
  • Liqiang Nie

Direct Preference Optimization (DPO) has emerged as an effective approach for mitigating hallucination in Multimodal Large Language Models (MLLMs). Although existing methods have achieved significant progress by utilizing vision-oriented contrastive objectives to enhance MLLMs' attention to visual inputs and hence reduce hallucination, they suffer from a non-rigorous optimization objective and indirect preference supervision. To address these limitations, we propose Symmetric Multimodal Preference Optimization (SymMPO), which conducts symmetric preference learning with direct preference supervision (i.e., response pairs) for visual understanding enhancement, while maintaining rigorous theoretical alignment with standard DPO. In addition to conventional ordinal preference learning, SymMPO introduces a preference margin consistency loss to quantitatively regulate the preference gap between symmetric preference pairs. Comprehensive evaluation across five benchmarks demonstrates SymMPO's superior performance, validating its effectiveness in mitigating hallucination in MLLMs.
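For readers unfamiliar with the baseline the paper claims alignment with, here is the standard DPO loss for a single preference pair (this is the textbook objective, not SymMPO's symmetric variant or its margin consistency loss; the argument names are illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))),
    where logp_* are policy log-probs and ref_logp_* reference log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At zero margin the loss is log 2; a policy that favors the chosen
# response relative to the reference drives the loss below that.
print(round(dpo_loss(0.0, 0.0, 0.0, 0.0), 4))  # 0.6931
```

Vision-oriented DPO variants typically build contrastive signals on top of this objective; the paper's point is that doing so without care breaks the rigor of the derivation above.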

AAAI Conference 2024 Conference Paper

Exploiting the Social-Like Prior in Transformer for Visual Reasoning

  • Yudong Han
  • Yupeng Hu
  • Xuemeng Song
  • Haoyu Tang
  • Mingzhu Xu
  • Liqiang Nie

Benefiting from the instrumental global dependency modeling of self-attention (SA), transformer-based approaches have become the pivotal choice for numerous downstream visual reasoning tasks, such as visual question answering (VQA) and referring expression comprehension (REC). However, some studies have recently suggested that SA tends to suffer from rank collapse, which inevitably leads to representation degradation as the transformer layers go deeper. Inspired by social network theory, we draw an analogy between social behavior and regional information interaction in SA, and harness two crucial notions from social networks, structural hole and degree centrality, to explore possible optimizations of SA learning, which naturally yields two plug-and-play social-like modules. Based on structural holes, the former module makes information interaction in SA more structured, which effectively avoids redundant information aggregation and global feature homogenization for a better rank remedy; the latter module then comprehensively characterizes and refines representation discrimination by considering the degree centrality of regions and the transitivity of relations. Without bells and whistles, our model outperforms a wide range of baselines by a noticeable margin on five benchmarks in VQA and REC tasks, and a series of explanatory results is showcased to reveal the social-like behaviors in SA.

ICML Conference 2024 Conference Paper

Multi-Factor Adaptive Vision Selection for Egocentric Video Question Answering

  • Haoyu Zhang
  • Meng Liu 0006
  • Zixin Liu
  • Xuemeng Song
  • Yaowei Wang 0001
  • Liqiang Nie

The challenge of interpreting the world from a human perspective in Artificial Intelligence (AI) is particularly evident in egocentric video question answering, which grapples with issues like small object recognition, noise suppression, and spatial-temporal reasoning. To address these challenges, we introduce the Multi-Factor Adaptive vision Selection (MFAS) framework. MFAS integrates a patch partition and merging module for enhanced small object recognition, a prior-guided patch selection module for noise suppression and focused analysis, and a hierarchical aggregation network to aggregate visual semantics guided by questions. Extensive experiments on several public egocentric datasets have validated the effectiveness and generalization of our framework. Code and data are available at https://github.com/Hyu-Zhang/EgoVideoQA.

AAAI Conference 2023 Conference Paper

Mutual-Enhanced Incongruity Learning Network for Multi-Modal Sarcasm Detection

  • Yang Qiao
  • Liqiang Jing
  • Xuemeng Song
  • Xiaolin Chen
  • Lei Zhu
  • Liqiang Nie

Sarcasm is a sophisticated linguistic phenomenon that is prevalent on today's social media platforms. Multi-modal sarcasm detection aims to identify whether a given sample with multi-modal information (i.e., text and image) is sarcastic. The key to this task lies in capturing both inter- and intra-modal incongruities within the same context. Although existing methods have achieved compelling success, they are either disturbed by irrelevant information extracted from the whole image and text, or overlook important information due to incomplete input. To address these limitations, we propose a Mutual-enhanced Incongruity Learning Network for multi-modal sarcasm detection, named MILNet. In particular, we design a local semantic-guided incongruity learning module and a global incongruity learning module. Moreover, we introduce a mutual enhancement module that exploits the underlying consistency between the two modules to boost performance. Extensive experiments on a widely-used dataset demonstrate the superiority of our model over cutting-edge methods.

IJCAI Conference 2020 Conference Paper

Auxiliary Template-Enhanced Generative Compatibility Modeling

  • Jinhuan Liu
  • Xuemeng Song
  • Zhaochun Ren
  • Liqiang Nie
  • Zhaopeng Tu
  • Jun Ma

In recent years, there has been growing interest in fashion analysis (e.g., clothing matching) due to the huge economic value of the fashion industry. The essential problem is to model the compatibility between complementary fashion items, such as the top and bottom in clothing matching. The majority of existing work on fashion analysis has focused on measuring item-item compatibility in a latent space with deep learning methods. In this work, we aim to improve compatibility modeling by sketching a compatible template for a given item as an auxiliary link between fashion items. Specifically, we propose an end-to-end Auxiliary Template-enhanced Generative Compatibility Modeling (AT-GCM) scheme, which introduces an auxiliary complementary template generation network equipped with pixel-wise consistency and compatible template regularization. Extensive experiments on two real-world datasets demonstrate the superiority of the proposed approach.

IJCAI Conference 2018 Conference Paper

A^3NCF: An Adaptive Aspect Attention Model for Rating Prediction

  • Zhiyong Cheng
  • Ying Ding
  • Xiangnan He
  • Lei Zhu
  • Xuemeng Song
  • Mohan Kankanhalli

Current recommender systems consider the various aspects of items for making accurate recommendations. Different users attach different importance to these aspects, which can be thought of as a preference/attention weight vector. Most existing recommender systems assume that for an individual, this vector is the same for all items. However, this assumption is often invalid, especially when considering a user's interactions with items of diverse characteristics. To tackle this problem, in this paper, we develop a novel aspect-aware recommender model named A^3NCF, which can capture the varying aspect attention that a user pays to different items. Specifically, we design a new topic model to extract user preferences and item characteristics from review texts. They are then used to 1) guide the representation learning of users and items, and 2) capture a user's special attention on each aspect of the targeted item with an attention network. Through extensive experiments on several large-scale datasets, we demonstrate that our model outperforms state-of-the-art review-aware recommender systems in the rating prediction task.
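The core idea of item-dependent aspect attention can be sketched generically (this is not A^3NCF's actual topic model or attention network; all names and shapes are illustrative): score a user-item pair by softmax-weighting per-aspect affinities, where the attention logits may differ per item.

```python
import numpy as np

def aspect_attention_score(user_aspects, item_aspects, att_logits):
    """Weight per-aspect user-item affinities by a softmax attention vector."""
    att = np.exp(att_logits - np.max(att_logits))   # numerically stable softmax
    att = att / att.sum()
    affinities = np.array([u @ v for u, v in zip(user_aspects, item_aspects)])
    return float(att @ affinities)

rng = np.random.default_rng(1)
n_aspects, dim = 3, 4
user = [rng.normal(size=dim) for _ in range(n_aspects)]
item = [rng.normal(size=dim) for _ in range(n_aspects)]

# Item-dependent logits: the same user can attend to aspects differently
# for different items, which is the assumption the paper relaxes.
score_a = aspect_attention_score(user, item, np.array([2.0, 0.0, 0.0]))
score_b = aspect_attention_score(user, item, np.array([0.0, 0.0, 2.0]))
print(score_a, score_b)
```

With uniform logits the score reduces to the plain mean of aspect affinities, i.e. the "same vector for all items" assumption the abstract criticizes.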

IJCAI Conference 2018 Conference Paper

Adaptive Collaborative Similarity Learning for Unsupervised Multi-view Feature Selection

  • Xiao Dong
  • Lei Zhu
  • Xuemeng Song
  • Jingjing Li
  • Zhiyong Cheng

In this paper, we investigate the problem of unsupervised multi-view feature selection. Conventional solutions first simply combine multiple pre-constructed view-specific similarity structures into a collaborative similarity structure, and then perform the subsequent feature selection. These two processes are separate and independent, and the collaborative similarity structure remains fixed during feature selection. Further, the simple undirected view combination may adversely reduce the reliability of the ultimate similarity structure for feature selection, as the view-specific similarity structures generally involve noises and outlying entries. To alleviate these problems, we propose adaptive collaborative similarity learning (ACSL) for multi-view feature selection. We propose to dynamically learn the collaborative similarity structure, and further integrate it with the ultimate feature selection into a unified framework. Moreover, a reasonable rank constraint is devised to adaptively learn an ideal collaborative similarity structure with proper similarity combination weights and desirable neighbor assignment, both of which positively facilitate the feature selection. An effective iterative solution with proved convergence is derived to tackle the formulated optimization problem. Experiments demonstrate the superiority of the proposed approach.

IJCAI Conference 2018 Conference Paper

SDMCH: Supervised Discrete Manifold-Embedded Cross-Modal Hashing

  • Xin Luo
  • Xiao-Ya Yin
  • Liqiang Nie
  • Xuemeng Song
  • Yongxin Wang
  • Xin-Shun Xu

Cross-modal hashing methods have attracted considerable attention. Most pioneering approaches only preserve the neighborhood relationship by constructing correlations among heterogeneous modalities. However, they neglect the fact that high-dimensional data often lies on a low-dimensional manifold embedded in the ambient space, and that the relative proximity between neighbors is also important. Although some methods leverage manifold learning to generate the hash codes, most of them fail to explicitly explore the discriminative information in the class labels and discard the binary constraints during optimization, generating large quantization errors. To address these issues, in this paper, we present a novel cross-modal hashing method, named Supervised Discrete Manifold-Embedded Cross-Modal Hashing (SDMCH). It can not only exploit the non-linear manifold structure of data and construct correlations among heterogeneous modalities, but also fully utilize the semantic information. Moreover, the hash codes can be generated discretely by an iterative optimization algorithm, which avoids large quantization errors. Extensive experimental results on three benchmark datasets demonstrate that SDMCH outperforms ten state-of-the-art cross-modal hashing methods.

AAAI Conference 2016 Conference Paper

Fusing Social Networks with Deep Learning for Volunteerism Tendency Prediction

  • Yongpo Jia
  • Xuemeng Song
  • Jingbo Zhou
  • Li Liu
  • Liqiang Nie
  • David Rosenblum

Social networks contain a wealth of useful information. In this paper, we study the challenging task of integrating users' information from multiple heterogeneous social networks to gain a comprehensive understanding of users' interests and behaviors. Although much effort has been dedicated to studying this problem, most existing approaches adopt linear or shallow models to fuse information from multiple sources. Such approaches cannot properly capture the complex nature of, and relationships among, different social networks. Adopting deep learning approaches to learn a joint representation can better capture this complexity, but neglects measuring the level of confidence in each source and the consistency among different sources. In this paper, we present a framework for multiple social network learning, whose core is a novel model that fuses social networks using deep learning with source confidence and consistency regularization. To evaluate the model, we apply it to predict individuals' tendency toward volunteerism. With extensive experimental evaluations, we demonstrate the effectiveness of our model, which outperforms several state-of-the-art approaches in terms of precision, recall and F1-score.

IJCAI Conference 2015 Conference Paper

Interest Inference via Structure-Constrained Multi-Source Multi-Task Learning

  • Xuemeng Song
  • Liqiang Nie
  • Luming Zhang
  • Maofu Liu
  • Tat-Seng Chua

User interest inference from social networks is a fundamental problem for many applications. It usually exhibits dual heterogeneities: a user's interests are complementarily and comprehensively reflected by multiple social networks, and interests are inter-correlated in a non-uniform way rather than independent of each other. Although great success has been achieved by previous approaches, few of them consider these dual heterogeneities simultaneously. In this work, we propose a structure-constrained multi-source multi-task learning scheme to co-regularize the source consistency and the tree-guided task relatedness. Meanwhile, it is able to jointly learn task-sharing and task-specific features. Comprehensive experiments on a real-world dataset validate our scheme. In addition, we have released our dataset to facilitate the research community.