Author name cluster

Jiawei Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

35 papers

2 author rows

AAAI Conference 2026 Conference Paper

Rethinking the Reliability of Multi-agent System: A Perspective from Byzantine Fault Tolerance

Lifan Zheng
Jiawei Chen
Qinghong Yin
Jingyuan Zhang
Xinyi Zeng
Yu Tian

Ensuring the reliability of agent architectures and effectively identifying problematic agents when failures occur are crucial challenges in multi-agent systems (MAS). Advances in large language models (LLMs) have established LLM-based agents as a major branch of MAS, enabling major breakthroughs in complex problem solving and world modeling. However, the reliability implications of this shift remain largely unexplored, i.e., whether substituting traditional agents with LLM-based agents can effectively enhance the reliability of MAS. In this work, we investigate and quantify the reliability of LLM-based agents from the perspective of Byzantine fault tolerance. We observe that LLM-based agents demonstrate stronger skepticism when processing erroneous message flows, a characteristic that enables them to outperform traditional agents across different topological structures. Motivated by the results of the pilot experiment, we design CP-WBFT, a confidence probe-based weighted Byzantine Fault Tolerant consensus mechanism to enhance the stability of MAS with different topologies. It capitalizes on the intrinsic reflective and discriminative capabilities of LLMs by employing a probe-based, weighted information flow transmission method to improve the reliability of LLM-based agents. Extensive experiments demonstrate that CP-WBFT achieves superior performance across diverse network topologies under extreme Byzantine conditions (85.7% fault rate). Notably, our approach surpasses traditional methods by attaining remarkable accuracy on various topologies and maintaining strong reliability in both mathematical reasoning and safety assessment tasks.
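A toy sketch of why confidence weighting can help when 6 of 7 agents (about 85.7%) are faulty. This is not CP-WBFT itself: `majority_vote` and `weighted_consensus` are hypothetical helpers, and the sketch assumes each agent already reports a confidence in [0, 1] and that faulty agents report low confidence, which is exactly what a confidence probe would have to provide.

```python
from collections import Counter, defaultdict

def majority_vote(votes):
    """Unweighted baseline: every agent's answer counts once."""
    return Counter(answer for answer, _ in votes).most_common(1)[0][0]

def weighted_consensus(votes):
    """Sum the reported confidence per answer and return the heaviest answer."""
    scores = defaultdict(float)
    for answer, confidence in votes:
        scores[answer] += confidence
    return max(scores, key=scores.get)

# Hypothetical round: 1 honest agent, 6 faulty agents (6/7 ≈ 85.7% faulty).
# Assumption: faulty agents report low confidence; a real probe must earn this.
votes = [("42", 0.95)] + [("41", 0.10)] * 6
print(majority_vote(votes))       # 41
print(weighted_consensus(votes))  # 42
```

If the faulty agents reported high confidence instead, the weighted vote would fail just like the plain majority; the benefit depends entirely on confidences that separate honest from faulty agents.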

AAAI Conference 2026 Conference Paper

RFNNS: Robust Fixed Neural Network Steganography with Universal Text-to-Image Models

Yu Cheng
Jiuan Zhou
Jiawei Chen
Zhaoxia Yin
Xinpeng Zhang

With the rapid development of generative AI, image steganography has garnered widespread attention due to its unique concealment. Recent studies have demonstrated the practical advantages of Fixed Neural Network Steganography (FNNS), notably its ability to achieve stable information embedding and extraction without any additional network training. However, the stego images generated by FNNS still exhibit noticeable distortion and limited robustness. These drawbacks compromise the security of the embedded information and restrict the practical applicability of the method. To address these limitations, we propose Robust Fixed Neural Network Steganography (RFNNS). Specifically, a texture-aware localization technique selectively embeds perturbations carrying secret information into regions of complex textures, effectively preserving visual quality. Additionally, a robust steganographic perturbation generation (RSPG) strategy is designed to enhance the decoding accuracy, even under common and unknown attacks. These robust perturbations are combined with AI-generated cover images to produce stego images. Experimental results demonstrate that RFNNS significantly improves robustness compared to state-of-the-art FNNS methods, achieving an average increase in SSIM of 23% for recovered secret images under common attacks. Furthermore, under previously unknown attacks, RFNNS reduces the LPIPS of recovered secret images to 39% of that achieved by the SOTA method, underscoring its practical value for covert communication.
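The texture-aware localization step can be illustrated with a simple proxy: score fixed-size blocks by pixel variance and keep only the most textured blocks as the embedding region. Local variance as the texture measure, and the `texture_mask`, `block`, and `keep` names, are assumptions for illustration; the localization technique used in RFNNS may differ.

```python
import numpy as np

def texture_mask(img, block=4, keep=0.5):
    """Mark the `keep` fraction of blocks with the highest pixel variance."""
    h, w = img.shape
    scored = []
    for i in range(0, h, block):
        for j in range(0, w, block):
            scored.append((img[i:i + block, j:j + block].var(), i, j))
    scored.sort(key=lambda item: item[0], reverse=True)  # most textured first
    n_keep = int(round(keep * len(scored)))
    mask = np.zeros(img.shape, dtype=bool)
    for _, i, j in scored[:n_keep]:
        mask[i:i + block, j:j + block] = True  # perturbations allowed here only
    return mask

# Hypothetical 8x8 grayscale cover: checkerboard (textured) left half, flat right half.
img = np.zeros((8, 8))
img[:, :4] = np.indices((8, 4)).sum(axis=0) % 2
mask = texture_mask(img)
print(mask[:, :4].all(), mask[:, 4:].any())  # True False
```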

IROS Conference 2025 Conference Paper

A Mole-inspired Incisor-Burrowing Robotic Platform for Planetary Exploration

Ran Xu
Jiabin Liu
Zhaofeng Liang
Hongmin Zheng
Kunquan Zheng
Zibiao Chen
Jiawei Chen
Tao Zhang 0064

Planetary exploration requires efficient methods for subsurface sampling, especially under extreme energy limitations. Traditional drilling methods are often energy-intensive and require large platforms, limiting their applicability. Bio-inspired burrowing techniques, modeled on animals such as moles, offer lightweight, low-power alternatives suitable for small robotic platforms. This paper presents a novel bio-inspired robotic platform, the Mole-like Incisor-Burrowing Robotic Platform (MIRP), designed to mimic the incisor-burrowing behavior of naked mole rats. The MIRP features an 11-DOF mechanism with a compact design (220 mm × 140 mm × 80 mm) and uses servomotors to achieve low energy consumption. The robot combines a quadrupedal locomotion mechanism with an incisor-burrowing mechanism, allowing it to navigate granular terrains and perform excavation tasks. Kinematic analysis, including inverse kinematics and closed-chain analysis, was conducted to optimize the robot's motion strategy. A prototype was developed and tested in a simulated lunar regolith environment to evaluate its maneuverability and burrowing performance. The power consumption of the prototype is below 10 W. This work validates the feasibility of bio-inspired incisor burrowing for planetary exploration, offering a cost-effective and efficient solution for future extraterrestrial missions.

AAAI Conference 2025 Conference Paper

Advancing Loss Functions in Recommender Systems: A Comparative Study with a Rényi Divergence-Based Solution

Shengjia Zhang
Jiawei Chen
Changdong Li
Sheng Zhou
Qihao Shi
Yan Feng
Chun Chen
Can Wang

Loss functions play a pivotal role in optimizing recommendation models. Among various loss functions, Softmax Loss (SL) and Cosine Contrastive Loss (CCL) are particularly effective. Their theoretical connections and differences warrant in-depth exploration. This work conducts comprehensive analyses of these losses, yielding significant insights: 1) Common strengths: both can be viewed as augmentations of traditional losses with Distributionally Robust Optimization (DRO), enhancing robustness to distributional shifts; 2) Respective limitations: stemming from their use of different distribution distance metrics in DRO optimization, SL exhibits high sensitivity to false negative instances, whereas CCL suffers from low data utilization. To address these limitations, this work proposes a new loss function, DrRL, which generalizes SL and CCL by leveraging Rényi divergence in DRO optimization. DrRL incorporates the advantageous structures of both SL and CCL, and can be demonstrated to effectively mitigate their limitations. Extensive experiments have been conducted to validate the superiority of DrRL in both recommendation accuracy and robustness.
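DrRL is described as swapping the distribution distance in the DRO view of these losses for a Rényi divergence. For reference, a minimal sketch of the Rényi divergence of order α between two discrete distributions, showing that it approaches the KL divergence as α tends to 1 and grows with α. It illustrates only the divergence, not the DrRL loss; the example distributions are arbitrary.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def renyi_divergence(p, q, alpha):
    """D_alpha(P || Q) = log(sum_i p_i^alpha * q_i^(1 - alpha)) / (alpha - 1)."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(total) / (alpha - 1)

# Arbitrary example distributions.
p = [0.7, 0.2, 0.1]
q = [1 / 3, 1 / 3, 1 / 3]
for alpha in (0.5, 0.999, 2.0):
    print(alpha, round(renyi_divergence(p, q, alpha), 4))
print("KL", round(kl_divergence(p, q), 4))
```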

AAAI Conference 2025 Conference Paper

BloomScene: Lightweight Structured 3D Gaussian Splatting for Crossmodal Scene Generation

Xiaolu Hou
Mingcheng Li
Dingkang Yang
Jiawei Chen
Ziyun Qian
Xiao Zhao
Yue Jiang
Jinjie Wei

With the widespread use of virtual reality applications, 3D scene generation has become a challenging new research frontier. 3D scenes have highly complex structures, and generation must ensure that the output is dense, coherent, and contains all necessary structures. Many current 3D scene generation methods rely on pre-trained text-to-image diffusion models and monocular depth estimators. However, the generated scenes occupy large amounts of storage space and often lack effective regularization methods, leading to geometric distortions. To this end, we propose BloomScene, a lightweight structured 3D Gaussian splatting for crossmodal scene generation, which creates diverse and high-quality 3D scenes from text or image inputs. Specifically, a crossmodal progressive scene generation framework is proposed to generate coherent scenes utilizing incremental point cloud reconstruction and 3D Gaussian splatting. Additionally, we propose a hierarchical depth prior-based regularization mechanism that utilizes multi-level constraints on depth accuracy and smoothness to enhance the realism and continuity of the generated scenes. Finally, we propose a structured context-guided compression mechanism that exploits structured hash grids to model the context of unorganized anchor attributes, which significantly eliminates structural redundancy and reduces storage overhead. Comprehensive experiments across multiple scenes demonstrate the significant potential and advantages of our framework compared with several baselines.

TMLR Journal 2025 Journal Article

Conditional Image Synthesis with Diffusion Models: A Survey

Zheyuan Zhan
Defang Chen
Jian-Ping Mei
Zhenghe Zhao
Jiawei Chen
Chun Chen
Siwei Lyu
Can Wang

Conditional image synthesis based on user-specified requirements is a key component in creating complex visual content. In recent years, diffusion-based generative modeling has become a highly effective way for conditional image synthesis, leading to exponential growth in the literature. However, the complexity of diffusion-based modeling, the wide range of image synthesis tasks, and the diversity of conditioning mechanisms present significant challenges for researchers to keep up with rapid developments and to understand the core concepts on this topic. In this survey, we categorize existing works based on how conditions are integrated into the two fundamental components of diffusion-based modeling, i.e., the denoising network and the sampling process. We specifically highlight the underlying principles, advantages, and potential challenges of various conditioning approaches during the training, re-purposing, and specialization stages to construct a desired denoising network. We also summarize six mainstream conditioning mechanisms in the sampling process. All discussions are centered around popular applications. Finally, we pinpoint several critical yet still unsolved problems and suggest some possible solutions for future research.

AAAI Conference 2025 Conference Paper

Debiased Multimodal Understanding for Human Language Sequences

Zhi Xu
Dingkang Yang
Mingcheng Li
Yuzheng Wang
Zhaoyu Chen
Jiawei Chen
Jinjie Wei
Lihua Zhang

Human multimodal language understanding (MLU) is an indispensable component of expression analysis (e.g., sentiment or humor) from heterogeneous modalities, including visual postures, linguistic contents, and acoustic behaviours. Existing works invariably focus on designing sophisticated structures or fusion strategies to achieve impressive improvements. Unfortunately, they all suffer from the subject variation problem due to data distribution discrepancies among subjects. Concretely, MLU models are easily misled by distinct subjects with different expression customs and characteristics in the training data to learn subject-specific spurious correlations, limiting performance and generalizability across new subjects. Motivated by this observation, we introduce a recapitulative causal graph to formulate the MLU procedure and analyze the confounding effect of subjects. Then, we propose SuCI, a simple yet effective causal intervention module to disentangle the impact of subjects acting as unobserved confounders and achieve model training via true causal effects. As a plug-and-play component, SuCI can be widely applied to most methods that seek unbiased predictions. Comprehensive experiments on several MLU benchmarks clearly show the effectiveness of the proposed module.
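Causal intervention against a confounder usually rests on the standard backdoor adjustment, P(Y | do(X=x)) = Σ_s P(Y | X=x, S=s) P(S=s), which averages over the confounder's marginal rather than its distribution given the input. A toy sketch with two hypothetical subjects and made-up probabilities; SuCI itself operates on learned representations and is not reproduced here.

```python
def expected_outcome(p_y_given_x_s, weights):
    """Sum_s P(Y=1 | X=x, S=s) * w(s) over the subjects s in `weights`."""
    return sum(p_y_given_x_s[s] * w for s, w in weights.items())

# Hypothetical numbers for one input x and two subjects A and B.
p_y_given_x_s = {"A": 0.9, "B": 0.3}  # P(Y=1 | X=x, S=s)
p_s = {"A": 0.5, "B": 0.5}            # subject prior P(S=s)
p_s_given_x = {"A": 0.9, "B": 0.1}    # P(S=s | X=x): x mostly comes from subject A

observational = expected_outcome(p_y_given_x_s, p_s_given_x)  # P(Y=1 | X=x)
interventional = expected_outcome(p_y_given_x_s, p_s)         # P(Y=1 | do(X=x))
print(round(observational, 2), round(interventional, 2))      # 0.84 0.6
```

The gap between the two numbers is the subject-specific spurious correlation the abstract describes: the observational estimate is inflated because input x is mostly seen from subject A.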

ICML Conference 2025 Conference Paper

DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation

Dongya Jia
Zhuo Chen 0006
Jiawei Chen
Chenpeng Du
Jian Wu
Jian Cong
Xiaobin Zhuang
Chumin Li 0002

Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive computational loads or suboptimal outcomes. In this work, we propose Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework combining a language model with a diffusion transformer. This approach significantly enhances the efficacy of autoregressive models for continuous tokens and reduces computational demands. DiTAR utilizes a divide-and-conquer strategy for patch generation, where the language model processes aggregated patch embeddings, and the diffusion transformer subsequently generates the next patch based on the output of the language model. For inference, we propose defining temperature as the time point of introducing noise during the reverse diffusion ODE to balance diversity and determinism. We also show in the extensive scaling analysis that DiTAR has superb scalability. In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness.

NeurIPS Conference 2025 Conference Paper

Learning Counterfactual Outcomes Under Rank Preservation

Peng Wu
Haoxuan Li
Chunyuan Zheng
Yan Zeng
Jiawei Chen
Yang Liu
Ruocheng Guo
Kun Zhang

Counterfactual inference aims to estimate the counterfactual outcome at the individual level given knowledge of an observed treatment and the factual outcome, with broad applications in fields such as epidemiology, econometrics, and management science. Previous methods rely on a known structural causal model (SCM) or assume the homogeneity of the exogenous variable and strict monotonicity between the outcome and exogenous variable. In this paper, we propose a principled approach for identifying and estimating the counterfactual outcome. We first introduce a simple and intuitive rank preservation assumption to identify the counterfactual outcome without relying on a known structural causal model. Building on this, we propose a novel ideal loss for theoretically unbiased learning of the counterfactual outcome and further develop a kernel-based estimator for its empirical estimation. Our theoretical analysis shows that the rank preservation assumption is not stronger than the homogeneity and strict monotonicity assumptions, that the proposed ideal loss is convex, and that the proposed estimator is unbiased. Extensive semi-synthetic and real-world experiments are conducted to demonstrate the effectiveness of the proposed method.
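Under rank preservation, an individual keeps the same position in the outcome distribution under either treatment, so a counterfactual can be read off by matching empirical quantiles. The sketch below does this on hypothetical samples; it is a plain quantile-mapping illustration of the assumption, not the paper's kernel-based estimator, and it ignores covariates.

```python
import math
from fractions import Fraction

def counterfactual_by_rank(y_factual, factual_samples, counterfactual_samples):
    """Map y to the outcome with the same empirical rank under the other treatment."""
    # Exact empirical CDF value F_t(y) as a fraction to avoid float rounding.
    rank = Fraction(sum(s <= y_factual for s in factual_samples), len(factual_samples))
    ordered = sorted(counterfactual_samples)
    # Empirical quantile F_{t'}^{-1}(rank): smallest value whose CDF reaches `rank`.
    idx = max(math.ceil(rank * len(ordered)) - 1, 0)
    return ordered[idx]

# Hypothetical samples: treatment shifts every outcome by +5 and preserves ranks.
control = list(range(1, 101))
treated = [y + 5 for y in control]
print(counterfactual_by_rank(50, control, treated))  # 55
print(counterfactual_by_rank(7, control, treated))   # 12
```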

NeurIPS Conference 2025 Conference Paper

Making Classic GNNs Strong Baselines Across Varying Homophily: A Smoothness–Generalization Perspective

Ming Gu
Zhuonan Zheng
Sheng Zhou
Meihan Liu
Jiawei Chen
Qiaoyu Tan
Liangcheng Li
Jiajun Bu

Graph Neural Networks (GNNs) have achieved great success but are often considered to be challenged by varying levels of homophily in graphs. Recent empirical studies have surprisingly shown that homophilic GNNs can perform well across datasets of different homophily levels with proper hyperparameter tuning, but the underlying theory and effective architectures remain unclear. To advance GNN universality across varying homophily, we theoretically revisit GNN message passing and uncover a novel smoothness-generalization dilemma, where increasing hops inevitably enhances smoothness at the cost of generalization. This dilemma hinders learning in high-order homophilic neighborhoods and all heterophilic ones, where generalization is critical due to complex neighborhood class distributions that are sensitive to shifts induced by noise or sparsity. To address this, we introduce the Inceptive Graph Neural Network (IGNN) built on three simple yet effective design principles, which alleviate the dilemma by enabling distinct hop-wise generalization alongside improved overall generalization with adaptive smoothness. Benchmarking against 30 baselines demonstrates IGNN's superiority and reveals notable universality in certain homophilic GNN variants. Our code and datasets are available at https://github.com/galogm/IGNN.
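The smoothness side of the dilemma can be seen with plain mean aggregation: each extra hop averages node features further, so on a connected graph they drift toward a common value. A minimal NumPy sketch on a hypothetical 4-node path graph; it shows smoothing only and says nothing about the generalization side or about IGNN's design.

```python
import numpy as np

def propagate(adj, x, hops):
    """Apply `hops` rounds of mean aggregation over each node's neighbors and itself."""
    a = adj + np.eye(adj.shape[0])         # add self-loops
    p = a / a.sum(axis=1, keepdims=True)   # row-normalized propagation matrix
    for _ in range(hops):
        x = p @ x
    return x

def spread(h):
    """Gap between the largest and smallest node feature."""
    return float(h.max() - h.min())

# Hypothetical 4-node path graph 0-1-2-3 with one scalar feature per node.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
x = np.array([[1.0], [0.0], [0.0], [-1.0]])
for k in (1, 2, 5, 20):
    print(k, round(spread(propagate(adj, x, k)), 4))  # shrinks as hops grow
```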

ICML Conference 2025 Conference Paper

Sounding that Object: Interactive Object-Aware Image to Audio Generation

Tingle Li
Baihe Huang
Xiaobin Zhuang
Dongya Jia
Jiawei Chen
Yuping Wang 0005
Zhuo Chen 0006
Gopala Anumanchipalli

Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds.

NeurIPS Conference 2025 Conference Paper

Tree of Preferences for Diversified Recommendation

Hanyang Yuan
Ning Tang
Tongya Zheng
Jiarong Xu
Xintong Hu
Renhong Huang
Shunyu Liu
Jiacong Hu

Diversified recommendation, which can effectively address the homogeneity of recommended items, has attracted increasing attention from both researchers and practitioners. Existing approaches predominantly aim to infer the diversity of user preferences from observed user feedback. Nonetheless, due to inherent data biases, the observed data may not fully reflect user interests, where underexplored preferences can be overwhelmed or remain unmanifested. Failing to capture these preferences can lead to suboptimal diversity in recommendations. To fill this gap, this work aims to study diversified recommendation from a data-bias perspective. Inspired by the outstanding performance of large language models (LLMs) in zero-shot inference leveraging world knowledge, we propose a novel approach that utilizes LLMs' expertise to uncover underexplored user preferences from observed behavior, ultimately providing diverse and relevant recommendations. To achieve this, we first introduce Tree of Preferences (ToP), an innovative structure constructed to model user preferences from coarse to fine. ToP enables LLMs to systematically reason over the user's rationale behind their behavior, thereby uncovering their underexplored preferences. To guide diversified recommendations using uncovered preferences, we adopt a data-centric approach, identifying candidate items that match user preferences and generating synthetic interactions that reflect underexplored preferences. These interactions are integrated to train a general recommender for diversification. Moreover, we scale up overall efficiency by dynamically selecting influential users during optimization. Extensive evaluations of both diversity and relevance show that our approach outperforms existing methods in most cases and achieves near-optimal performance in others, with reasonable inference latency.

NeurIPS Conference 2025 Conference Paper

Understanding and Enhancing Message Passing on Heterophilic Graphs via Compatibility Matrix

Zhuonan Zheng
Yuanchen Bei
Zhiyao Zhou
Sheng Zhou
Yao Ma
Ming Gu
Hongjia Xu
Jiawei Chen

Graph Neural Networks (GNNs) excel in graph mining tasks thanks to their message-passing mechanism, which aligns with the homophily assumption. However, connected nodes can also exhibit inconsistent behaviors, termed heterophilic patterns, sparking interest in heterophilic GNNs (HTGNNs). Although the message-passing mechanism seems unsuitable for heterophilic graphs owing to the propagation of dissimilar messages, it is still popular in HTGNNs and consistently achieves notable success. Some efforts have investigated this interesting phenomenon, but they are limited to the data perspective. A model-perspective understanding, which would help guide the design of HTGNNs, remains largely unexplored. To fill this gap, we build the connection between node discriminability and the compatibility matrix (CM). We reveal that the effectiveness of message passing in HTGNNs may be credited to increasing the proposed Compatibility Matrix Discriminability (CMD). However, the issues of sparsity and noise pose great challenges to leveraging CM. Thus, we propose CMGNN, a novel approach that alleviates these issues while explicitly enhancing the CM and node embeddings. A thorough evaluation involving 13 datasets and comparison against 20 well-established baselines highlights the superiority of CMGNN.
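For readers new to the term, a compatibility matrix summarizes how often nodes of one class link to nodes of another; heterophilic graphs put much of their mass off the diagonal. A minimal sketch of the empirical, label-based version on a hypothetical toy graph; CMGNN's own estimation (which has to cope with sparse and noisy labels) and the CMD measure are not reproduced here.

```python
def compatibility_matrix(edges, labels, num_classes):
    """Row i: fraction of edges leaving class-i nodes that end at class-j nodes."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for u, v in edges:  # undirected edges: count both directions
        counts[labels[u]][labels[v]] += 1
        counts[labels[v]][labels[u]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Hypothetical toy graph: nodes 0, 1 in class 0; nodes 2, 3 in class 1.
labels = [0, 0, 1, 1]
edges = [(0, 2), (1, 3), (0, 1)]
print(compatibility_matrix(edges, labels, 2))  # [[0.5, 0.5], [1.0, 0.0]]
```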

NeurIPS Conference 2024 Conference Paper

Addressing Spatial-Temporal Heterogeneity: General Mixed Time Series Analysis via Latent Continuity Recovery and Alignment

Jiawei Chen
Chunhui Zhao

Mixed time series (MiTS) comprising both continuous variables (CVs) and discrete variables (DVs) are frequently encountered yet under-explored in time series analysis. Essentially, CVs and DVs exhibit different temporal patterns and distribution types. Overlooking these heterogeneities would lead to insufficient and imbalanced representation learning, bringing biased results. This paper addresses the problem with two insights: 1) DVs may originate from intrinsic latent continuous variables (LCVs), which lose fine-grained information due to extrinsic discretization; 2) LCVs and CVs share similar temporal patterns and interact spatially. Considering these similarities and interactions, we propose a general MiTS analysis framework MiTSformer, which recovers LCVs behind DVs for sufficient and balanced spatial-temporal modeling by designing two essential inductive biases: 1) hierarchically aggregating multi-scale temporal context information to enrich the information granularity of DVs; 2) adaptively learning the aggregation processes via the adversarial guidance from CVs. Subsequently, MiTSformer captures complete spatial-temporal dependencies within and across LCVs and CVs via cascaded self- and cross-attention blocks. Empirically, MiTSformer achieves consistent SOTA on five mixed time series analysis tasks, including classification, extrinsic regression, anomaly detection, imputation, and long-term forecasting. The code is available at https://github.com/chunhuiz/MiTSformer.

AAAI Conference 2024 Conference Paper

Benchmarking Large Language Models in Retrieval-Augmented Generation

Jiawei Chen
Hongyu Lin
Xianpei Han
Le Sun

Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish the Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.

ICRA Conference 2024 Conference Paper

Dynamic Interaction Control in Legged Mobile Manipulators: A Decoupled Approach

Qikai Li
Qinchen Meng
Yuxing Qin
Jiawei Chen
Xilun Ding
Kun Xu 0007

Legged mobile manipulators are receiving increasing attention. Mobile platforms can, in principle, extend the workspace of robotic arms without limit, opening up more application scenarios for robots. Compared with wheeled mobile manipulators, legged mobile manipulators place higher demands on the cooperative control of the legged robot and the robotic arm. This work decouples the control of the robotic arm and the legged robot. On the legged robot side, we explicitly estimate the wrench exerted by the robotic arm on the base, incorporate it into the legged robot's dynamics, and then use a nonlinear model predictive controller (NMPC) to control the legged robot. On the robotic arm side, we adopt an impedance controller to realize force control of the end effector; introducing impedance control improves the safety and interactivity of legged mobile manipulators. We conducted experiments on a physical robot to compare decoupled control with independent control, and the results show that decoupled control improves the stability and robustness of the robot system.
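A common textbook form of an end-effector impedance controller is a Cartesian spring-damper law. This sketch omits the inertia term and the mapping to joint torques (τ = Jᵀ F); `impedance_force` and the gains, positions, and velocities are hypothetical, and the controller used in the paper may include terms not shown here.

```python
import numpy as np

def impedance_force(x, xdot, x_des, xdot_des, stiffness, damping):
    """Cartesian spring-damper law F = K (x_des - x) + D (xdot_des - xdot)."""
    K = np.diag(stiffness)  # N/m along each axis
    D = np.diag(damping)    # N*s/m along each axis
    return K @ (np.asarray(x_des) - np.asarray(x)) + D @ (np.asarray(xdot_des) - np.asarray(xdot))

# Hypothetical state: end effector 2 cm below its target and moving up at 0.1 m/s.
force = impedance_force(x=[0.0, 0.0, 0.48], xdot=[0.0, 0.0, 0.1],
                        x_des=[0.0, 0.0, 0.50], xdot_des=[0.0, 0.0, 0.0],
                        stiffness=[300.0, 300.0, 300.0], damping=[20.0, 20.0, 20.0])
print(np.round(force, 3))  # [0. 0. 4.]
```

The spring term pulls the end effector up by 6 N while the damper subtracts 2 N because it is already moving toward the target; lowering the stiffness makes contact softer, which is the safety benefit the abstract attributes to impedance control.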

EAAI Journal 2024 Journal Article

Generating and encouraging: An effective framework for solving class imbalance in multimodal emotion recognition conversation

Qianer Li
Peijie Huang
Yuhong Xu
Jiawei Chen
Yuyang Deng
Shangjian Yin

NeurIPS Conference 2024 Conference Paper

PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications

Dingkang Yang
Jinjie Wei
Dongling Xiao
Shunli Wang
Tong Wu
Gang Li
Mingcheng Li
Shuaibing Wang

Developing intelligent pediatric consultation systems offers promising prospects for improving diagnostic efficiency, especially in China, where healthcare resources are scarce. Despite recent advances in Large Language Models (LLMs) for Chinese medicine, their performance is sub-optimal in pediatric applications due to inadequate instruction data and vulnerable training procedures. To address the above issues, this paper builds PedCorpus, a high-quality dataset of over 300,000 multi-task instructions from pediatric textbooks, guidelines, and knowledge graph resources to fulfil diverse diagnostic demands. Building on the well-designed PedCorpus, we propose PediatricsGPT, the first Chinese pediatric LLM assistant built on a systematic and robust training pipeline. In the continuous pre-training phase, we introduce a hybrid instruction pre-training mechanism to mitigate the internal-injected knowledge inconsistency of LLMs for medical domain adaptation. Next, full-parameter Supervised Fine-Tuning (SFT) is utilized to incorporate the general medical knowledge schema into the models. After that, we devise a direct following preference optimization to enhance the generation of pediatrician-like humanistic responses. In the parameter-efficient secondary SFT phase, a mixture of universal-specific experts strategy is presented to resolve the competency conflict between medical generalist and pediatric expertise mastery. Extensive results based on the metrics, GPT-4, and doctor evaluations on distinct downstream tasks show that PediatricsGPT consistently outperforms previous Chinese medical LLMs. The project and data will be released at https://github.com/ydk122024/PediatricsGPT.

NeurIPS Conference 2024 Conference Paper

PSL: Rethinking and Improving Softmax Loss from Pairwise Perspective for Recommendation

Weiqin Yang
Jiawei Chen
Xin Xin
Sheng Zhou
Binbin Hu
Yan Feng
Chun Chen
Can Wang

Softmax Loss (SL) is widely applied in recommender systems (RS) and has demonstrated effectiveness. This work analyzes SL from a pairwise perspective, revealing two significant limitations: 1) the relationship between SL and conventional ranking metrics like DCG is not sufficiently tight; 2) SL is highly sensitive to false negative instances. Our analysis indicates that these limitations are primarily due to the use of the exponential function. To address these issues, this work extends SL to a new family of loss functions, termed Pairwise Softmax Loss (PSL), which replaces the exponential function in SL with other appropriate activation functions. While the revision is minimal, we highlight three merits of PSL: 1) it serves as a tighter surrogate for DCG with suitable activation functions; 2) it better balances data contributions; and 3) it acts as a specific BPR loss enhanced by Distributionally Robust Optimization (DRO). We further validate the effectiveness and robustness of PSL through empirical experiments. The code is available at https://github.com/Tiny-Snow/IR-Benchmark.
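The pairwise perspective rests on an identity: Softmax Loss for a positive item can be rewritten as log(1 + Σ_j exp((s_j - s⁺)/τ)), a sum over score differences with the negatives. A minimal sketch that checks this numerically on made-up scores. The `activation` argument marks where PSL swaps out the exponential, but the specific activations and the exact placement of τ used in the paper are not reproduced here.

```python
import math

def softmax_loss(pos_score, neg_scores, tau=1.0):
    """SL = -log( exp(s+/tau) / (exp(s+/tau) + sum_j exp(s_j/tau)) ), computed stably."""
    logits = [pos_score / tau] + [s / tau for s in neg_scores]
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - pos_score / tau

def pairwise_softmax_loss(pos_score, neg_scores, tau=1.0, activation=math.exp):
    """log(1 + sum_j act((s_j - s+) / tau)); with act = exp this equals SL exactly."""
    return math.log1p(sum(activation((s - pos_score) / tau) for s in neg_scores))

# Made-up scores for one user: one positive item and three sampled negatives.
pos, negs, tau = 2.0, [1.0, 0.5, -1.0], 0.5
print(round(softmax_loss(pos, negs, tau), 6))
print(round(pairwise_softmax_loss(pos, negs, tau), 6))  # same value
```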

NeurIPS Conference 2024 Conference Paper

Self-Retrieval: End-to-End Information Retrieval with One Large Language Model

Qiaoyu Tang
Jiawei Chen
Zhuoqun Li
Bowen Yu
Yaojie Lu
Cheng Fu
Haiyang Yu
Hongyu Lin

The rise of large language models (LLMs) has significantly transformed both the construction and application of information retrieval (IR) systems. However, current interactions between IR systems and LLMs remain limited, with LLMs merely serving as components within IR systems, and IR systems being constructed independently of LLMs. This separated architecture restricts knowledge sharing and deep collaboration between them. In this paper, we introduce Self-Retrieval, a novel end-to-end LLM-driven information retrieval architecture. Self-Retrieval unifies all essential IR functions within a single LLM, leveraging the inherent capabilities of LLMs throughout the IR process. Specifically, Self-Retrieval internalizes the retrieval corpus through self-supervised learning, transforms the retrieval process into sequential passage generation, and performs relevance assessment for reranking. Experimental results demonstrate that Self-Retrieval not only outperforms existing retrieval approaches by a significant margin, but also substantially enhances the performance of LLM-driven downstream applications like retrieval-augmented generation.

NeurIPS Conference 2024 Conference Paper

Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning

Mingcheng Li
Dingkang Yang
Yang Liu
Shunli Wang
Jiawei Chen
Shuaibing Wang
Jinjie Wei
Yue Jiang

Multimodal Sentiment Analysis (MSA) is an important research area that aims to understand and recognize human sentiment through multiple modalities. The complementary information provided by multimodal fusion promotes better sentiment analysis compared to utilizing only a single modality. Nevertheless, in real-world applications, many unavoidable factors may lead to situations of uncertain modality missing, thus hindering the effectiveness of multimodal modeling and degrading the model’s performance. To this end, we propose a Hierarchical Representation Learning Framework (HRLF) for the MSA task under uncertain missing modalities. Specifically, we propose a fine-grained representation factorization module that sufficiently extracts valuable sentiment information by factorizing modality into sentiment-relevant and modality-specific representations through crossmodal translation and sentiment semantic reconstruction. Moreover, a hierarchical mutual information maximization mechanism is introduced to incrementally maximize the mutual information between multi-scale representations to align and reconstruct the high-level semantics in the representations. Ultimately, we propose a hierarchical adversarial learning mechanism that further aligns and adapts the latent distribution of sentiment-relevant representations to produce robust joint multimodal representations. Comprehensive experiments on three datasets demonstrate that HRLF significantly improves MSA performance under uncertain modality missing cases.

ICRA Conference 2024 Conference Paper

Unlocking Versatile Locomotion: A Novel Quadrupedal Robot with 4-DoFs Legs for Roller Skating

Jiawei Chen
Ripeng Qin
Longfei Huang
Zongbo He
Kun Xu 0007
Xilun Ding

Roller skating with passive wheels on a quadrupedal robot is more efficient than traditional walking. However, the typical mammalian quadruped robot with 3-DoF legs can perform only one dynamic roller-skating gait and has difficulty turning. To address this limitation, we designed a novel quadrupedal robot with 4-DoF legs that enables various roller-skating gaits, including Swizzling, Stroking, and trot-like gaits, while easily achieving turning motions. We took the geometric characteristics of the passive wheel into account and used the Levenberg-Marquardt method in the robot kinematics to improve the precision of both the roller-skating kinematics and the contact-point position used by the dynamics controller. The position of the robot foot and the yaw angle of the passive wheel are decoupled for motion planning of all proposed gaits. Experiments verified that our proposed kinematics with wheel geometry achieves higher precision, and the feasibility of all proposed roller-skating gaits was confirmed on our prototype robot during straight motion and during turning motion with a small radius. Finally, we discuss the mobility efficiency of the different roller-skating gaits, all of which were found to be more efficient than walking.
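
For readers unfamiliar with the Levenberg-Marquardt method in kinematics, here is a minimal sketch of damped least-squares inverse kinematics for a hypothetical planar two-link leg. The link lengths, damping schedule, and function names are assumptions, and the sketch ignores the passive-wheel geometry that the paper incorporates.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical link lengths (metres) for a planar two-link leg; not taken from the paper.
L1, L2 = 0.3, 0.25


def forward(q: Sequence[float]) -> Tuple[float, float]:
    """Foot position (x, y) for joint angles q = (q1, q2)."""
    q1, q2 = q
    x = L1 * math.cos(q1) + L2 * math.cos(q1 + q2)
    y = L1 * math.sin(q1) + L2 * math.sin(q1 + q2)
    return x, y


def jacobian(q: Sequence[float]) -> List[List[float]]:
    """2x2 Jacobian of forward() with respect to (q1, q2)."""
    q1, q2 = q
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-L1 * s1 - L2 * s12, -L2 * s12],
            [L1 * c1 + L2 * c12, L2 * c12]]


def solve_2x2(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b for a 2x2 matrix by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - b[0] * a[1][0]) / det]


def lm_ik(target: Tuple[float, float], q0: Sequence[float],
          lam: float = 1e-2, tol: float = 1e-10, max_iter: int = 200) -> List[float]:
    """Levenberg-Marquardt IK: each step solves (J^T J + lam I) dq = J^T e.

    The damping lam is increased when a step fails to reduce the squared
    position error and decreased when it succeeds.
    """
    def sq_err(q: Sequence[float]) -> float:
        x, y = forward(q)
        return (target[0] - x) ** 2 + (target[1] - y) ** 2

    q = list(q0)
    err = sq_err(q)
    for _ in range(max_iter):
        if err < tol:
            break
        x, y = forward(q)
        e = [target[0] - x, target[1] - y]
        j = jacobian(q)
        off = j[0][0] * j[0][1] + j[1][0] * j[1][1]
        # Damped normal equations: J^T J + lam * I is positive definite for lam > 0.
        jtj = [[j[0][0] ** 2 + j[1][0] ** 2 + lam, off],
               [off, j[0][1] ** 2 + j[1][1] ** 2 + lam]]
        jte = [j[0][0] * e[0] + j[1][0] * e[1], j[0][1] * e[0] + j[1][1] * e[1]]
        dq = solve_2x2(jtj, jte)
        candidate = [q[0] + dq[0], q[1] + dq[1]]
        cand_err = sq_err(candidate)
        if cand_err < err:
            q, err = candidate, cand_err
            lam *= 0.5   # good step: behave more like Gauss-Newton
        else:
            lam *= 10.0  # bad step: behave more like gradient descent
    return q
```

Starting from a non-singular pose such as q0 = (0.1, 0.5), the solver drives the foot to a reachable target such as (0.35, 0.2); the adaptive damping is what distinguishes LM from plain Gauss-Newton near singular poses.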

IJCAI Conference 2023 Conference Paper

Discriminative-Invariant Representation Learning for Unbiased Recommendation

Hang Pan
Jiawei Chen
Fuli Feng
Wentao Shi
Junkang Wu
Xiangnan He

Selection bias hinders recommendation models from learning unbiased user preferences. Recent works empirically reveal that pursuing invariant user and item representations across biased and unbiased data is crucial for counteracting selection bias. However, our theoretical analysis reveals that simply optimizing representation invariance is insufficient for addressing selection bias: recommendation performance is bounded by both representation invariance and discriminability. Worse still, current invariant representation learning methods in recommendation neglect, or even hurt, representation discriminability due to data sparsity and label shift. In this light, we propose a new Discriminative-Invariant Representation Learning framework for unbiased recommendation, which incorporates label-conditional clustering and prior-guided contrasting into conventional invariant representation learning to mitigate the impact of data sparsity and label shift, respectively. We conduct extensive experiments on three real-world datasets, validating the rationality and effectiveness of the proposed framework. Code and supplementary materials are available at: https://github.com/HungPaan/DIRL.

NeurIPS Conference 2023 Conference Paper

OpenGSL: A Comprehensive Benchmark for Graph Structure Learning

Zhiyao Zhou
Sheng Zhou
Bochao Mao
Xuanyi Zhou
Jiawei Chen
Qiaoyu Tan
Daochen Zha
Yan Feng

Graph Neural Networks (GNNs) have emerged as the de facto standard for representation learning on graphs, owing to their ability to effectively integrate graph topology and node attributes. However, the inherently suboptimal nature of node connections, resulting from the complex and contingent formation process of graphs, presents significant challenges in modeling them effectively. To tackle this issue, Graph Structure Learning (GSL), a family of data-centric learning approaches, has garnered substantial attention in recent years. The core concept behind GSL is to jointly optimize the graph structure and the corresponding GNN models. Although numerous GSL methods have been proposed, progress in this field remains unclear due to inconsistent experimental protocols, including variations in datasets, data processing techniques, and splitting strategies. In this paper, we introduce OpenGSL, the first comprehensive benchmark for GSL, aimed at addressing this gap. OpenGSL enables a fair comparison among state-of-the-art GSL methods by evaluating them across various popular datasets using uniform data processing and splitting strategies. Through extensive experiments, we observe that existing GSL methods do not consistently outperform their vanilla GNN counterparts. We also find that there is no significant correlation between the homophily of the learned structure and task performance, challenging a common belief. Moreover, we observe that the learned graph structure demonstrates a strong generalization ability across different GNN models, despite the high computational and space consumption. We hope that our open-sourced library will facilitate rapid and equitable evaluation and inspire further innovative research in this field. The code of the benchmark can be found at https://github.com/OpenGSL/OpenGSL.
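
The homophily mentioned above is commonly measured by the edge homophily ratio, the fraction of edges whose endpoints share a label. A minimal sketch, assuming an edge list and a label list indexed by node id; the function name is illustrative and is not part of the OpenGSL library:

```python
from typing import Sequence, Tuple


def edge_homophily(edges: Sequence[Tuple[int, int]], labels: Sequence[int]) -> float:
    """Fraction of edges (u, v) whose endpoints carry the same label.

    Self-loops are counted like any other edge; an empty edge list yields 0.0.
    """
    if not edges:
        return 0.0
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)
```

For example, edges [(0, 1), (1, 2), (2, 3)] with labels [0, 0, 1, 1] give a ratio of 2/3. Computing this ratio on both the original and the learned structure is one way to probe the relationship between homophily and accuracy discussed above.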

AAAI Conference 2023 Conference Paper

Robust Sequence Networked Submodular Maximization

Qihao Shi
Bingyang Fu
Can Wang
Jiawei Chen
Sheng Zhou
Yan Feng
Chun Chen

In this paper, we study the Robust optimization for sequence Networked submodular maximization (RoseNets) problem, which interweaves robust optimization with sequence networked submodular maximization. The elements are connected by a directed acyclic graph, and the objective function is submodular not on the elements but on the edges of the graph. In this networked submodular setting, the impact of removing an element from a sequence depends both on its position in the sequence and on its position in the network. This makes existing robust algorithms inapplicable and calls for new ones. We take the first step toward studying the RoseNets problem and design a robust greedy algorithm that is robust against the removal of an arbitrary subset of the selected elements. The approximation ratio of the algorithm depends both on the number of removed elements and on the network topology. We further conduct experiments on real-world recommendation and link prediction applications. The experimental results demonstrate the effectiveness of the proposed algorithm.
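
To make the sequence-and-edge objective concrete, here is a minimal sketch, assuming the common sequence-submodular setup in which a sequence covers a DAG edge (u, v) when u appears no later than v, and a hypothetical edge-weight objective. The greedy below is the plain, non-robust baseline, not the paper's robust algorithm, and all names are illustrative.

```python
from typing import Dict, List, Sequence, Set, Tuple

Edge = Tuple[str, str]


def induced_edges(seq: Sequence[str], edges: Set[Edge]) -> Set[Edge]:
    """Edges (u, v) whose endpoints both appear in seq with u no later than v.

    A self-loop (u, u) is induced as soon as u is selected.
    """
    pos = {x: i for i, x in enumerate(seq)}
    return {(u, v) for (u, v) in edges if u in pos and v in pos and pos[u] <= pos[v]}


def value(seq: Sequence[str], weights: Dict[Edge, float]) -> float:
    """Hypothetical edge objective: total weight of induced edges (modular, hence submodular)."""
    return sum(weights[e] for e in induced_edges(seq, set(weights)))


def greedy_sequence(elements: Sequence[str], weights: Dict[Edge, float], k: int) -> List[str]:
    """Plain (non-robust) greedy: k times, append the element with the largest gain."""
    seq: List[str] = []
    for _ in range(k):
        candidates = [x for x in elements if x not in seq]
        if not candidates:
            break
        # Maximizing value(seq + [x]) is equivalent to maximizing the marginal gain.
        best = max(candidates, key=lambda x: value(seq + [x], weights))
        seq.append(best)
    return seq
```

With weights {("a", "a"): 1, ("a", "b"): 3}, the sequence ["a", "b"] scores 4 while ["b", "a"] scores only 1, which illustrates why the effect of removing or reordering an element depends on its position.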

NeurIPS Conference 2023 Conference Paper

Understanding Contrastive Learning via Distributionally Robust Optimization

Junkang Wu
Jiawei Chen
Jiancan Wu
Wentao Shi
Xiang Wang
Xiangnan He

This study reveals the inherent tolerance of contrastive learning (CL) toward sampling bias, wherein negative samples may share similar semantics (e.g., labels). However, existing theories fall short of explaining this phenomenon. We bridge this research gap by analyzing CL through the lens of distributionally robust optimization (DRO), yielding several key insights: (1) CL essentially performs DRO over the negative sampling distribution, which enables robust performance across a variety of potential distributions and explains its robustness to sampling bias; (2) the temperature $\tau$ is not merely a heuristic design choice but acts as a Lagrange coefficient that regulates the size of the potential distribution set; (3) a theoretical connection is established between DRO and mutual information, presenting fresh evidence for "InfoNCE as an estimate of MI" and a new estimation approach for $\phi$-divergence-based generalized mutual information. We also identify CL's potential shortcomings, including over-conservatism and sensitivity to outliers, and introduce a novel Adjusted InfoNCE loss (ADNCE) to mitigate these issues. ADNCE refines the potential distribution, improving performance and accelerating convergence. Extensive experiments in various domains (image, sentence, and graph) validate the effectiveness of the proposal.
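
For reference, the standard InfoNCE loss with temperature $\tau$, which the analysis above reinterprets as DRO, can be sketched as follows. This is the common textbook form using cosine similarity, not the paper's ADNCE variant, and the function names are illustrative.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def info_nce(anchor: Sequence[float], positive: Sequence[float],
             negatives: Sequence[Sequence[float]], tau: float = 0.5) -> float:
    """InfoNCE: -log( exp(s_pos / tau) / sum over {pos} + negatives of exp(s / tau) ).

    Computed with the log-sum-exp trick for numerical stability.
    """
    logits = [cosine(anchor, positive) / tau] + [cosine(anchor, n) / tau for n in negatives]
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]
```

Lowering `tau` sharpens the softmax, so the loss is increasingly dominated by the negatives most similar to the anchor.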

NeurIPS Conference 2022 Conference Paper

BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis

Yichong Leng
Zehua Chen
Junliang Guo
Haohe Liu
Jiawei Chen
Xu Tan
Danilo Mandic
Lei He

Binaural audio plays a significant role in constructing immersive augmented and virtual realities. As it is expensive to record binaural audio in the real world, synthesizing it from mono audio has attracted increasing attention. This synthesis process involves not only the basic physical warping of the mono audio but also room reverberation and head/ear-related filtering, which are difficult to simulate accurately with traditional digital signal processing. In this paper, we formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part that is shared by the left and right channels and a specific part that differs in each channel. Accordingly, we propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize the two parts respectively. Specifically, in the first stage, the common information of the binaural audio is generated with a single-channel diffusion model conditioned on the mono audio; in the second stage, the binaural audio is generated from it by a two-channel diffusion model. By combining this two-stage synthesis perspective with advanced generative models (i.e., diffusion models), BinauralGrad is able to generate accurate and high-fidelity binaural audio samples. Experimental results show that on a benchmark dataset, BinauralGrad outperforms existing baselines by a large margin in terms of both objective and subjective evaluation metrics (Wave L2: $0.128$ vs. $0.157$; MOS: $3.80$ vs. $3.61$). The generated audio samples (https://speechresearch.github.io/binauralgrad) and code (https://github.com/microsoft/NeuralSpeech/tree/master/BinauralGrad) are available online.

JBHI Journal 2022 Journal Article

Mix-and-Interpolate: A Training Strategy to Deal With Source-Biased Medical Data

Yuexiang Li
Jiawei Chen
Dong Wei
Yanchun Zhu
Jianrong Wu
Junfeng Xiong
Yadong Gang
Wenbo Sun

As of March 31, 2021, coronavirus disease 2019 (COVID-19) had reportedly infected more than 127 million people and caused over 2.5 million deaths worldwide. Timely diagnosis of COVID-19 is crucial for managing individual patients as well as for containing this highly contagious disease. Given the clinical value of non-contrast chest computed tomography (CT) for diagnosing COVID-19, deep learning (DL)-based automated methods have been proposed to help radiologists read the huge number of CT exams generated by the pandemic. In this work, we address an overlooked problem in training deep convolutional neural networks for COVID-19 classification on real-world multi-source data: the data source bias problem. This problem refers to the situation in which certain data sources contain only a single class, so that training on such source-biased data may lead DL models to learn to distinguish data sources rather than COVID-19. To overcome this problem, we propose MIx-aNd-Interpolate (MINI), a conceptually simple, easy-to-implement, efficient, yet effective training strategy. MINI generates volumes of the absent class by combining samples collected from different hospitals, which enlarges the sample space of the original source-biased dataset. Experimental results on a large collection of real patient data (1,221 COVID-19 and 1,520 negative CT images, the latter consisting of 786 community-acquired pneumonia and 734 non-pneumonia cases) from eight hospitals and health institutions show that: 1) MINI improves COVID-19 classification performance over the baseline (which does not deal with source bias), and 2) MINI is superior to competing methods in the extent of improvement.

TCS Journal 2021 Journal Article

Profit maximization for competitive social advertising

Qihao Shi
Can Wang
Deshi Ye
Jiawei Chen
Sheng Zhou
Yan Feng
Chun Chen
Yanhao Huang

NeurIPS Conference 2021 Conference Paper

Speech-T: Transducer for Text to Speech and Beyond

Jiawei Chen
Xu Tan
Yichong Leng
Jin Xu
Guihua Wen
Tao Qin
Tie-Yan Liu

Neural Transducer (e.g., RNN-T) has been widely used in automatic speech recognition (ASR) due to its ability to efficiently model monotonic alignments between input and output sequences and to naturally support streaming inputs. Since monotonic alignments are also critical to text-to-speech (TTS) synthesis, and streaming TTS is also an important application scenario, in this work we explore applying Transducer to TTS and beyond. This is challenging for two reasons: it is difficult to trade off the emission probability (continuous mel-spectrogram prediction) against the transition probability (the ASR Transducer predicts a blank token to indicate a transition to the next input) when calculating the output probability lattice in Transducer, and it is not easy to learn the alignments between text and speech through the output probability lattice. We propose SpeechTransducer (Speech-T for short), a Transformer-based Transducer model that 1) uses a new forward algorithm to separate the transition prediction from the continuous mel-spectrogram prediction when calculating the output probability lattice, and uses a diagonal constraint in the probability lattice to help alignment learning; 2) supports both full-sentence and streaming TTS by adjusting the look-ahead context; and 3) further supports TTS and ASR together for the first time, which enjoys several advantages, including fewer parameters and streaming synthesis and recognition in a single model. Experiments on the LJSpeech dataset demonstrate that Speech-T 1) is more robust than the attention-based autoregressive TTS model due to its inherent monotonic alignments between text and speech; 2) naturally supports streaming TTS with good voice quality; and 3) enjoys the benefits of jointly modeling TTS and ASR in a single network.

AAAI Conference 2021 Conference Paper

Time Series Domain Adaptation via Sparse Associative Structure Alignment

Ruichu Cai
Jiawei Chen
Zijian Li
Wei Chen
Keli Zhang
Junjian Ye
Zhuozhang Li
Xiaoyan Yang

Domain adaptation on time series data is an important but challenging task. Most existing works in this area learn a domain-invariant representation of the data with the help of constraints such as maximum mean discrepancy (MMD). However, extracting such a domain-invariant representation is non-trivial for time series data due to the complex dependence among timestamps. In fully dependent time series, a small change in the time lags or offsets may make domain-invariant extraction difficult. Fortunately, the stability of causal structure inspired us to explore the domain-invariant structure of the data. To reduce the difficulty of discovering the causal structure, we relax it to a sparse associative structure and propose a novel sparse associative structure alignment model for domain adaptation. First, we generate a segment set to remove the obstacle posed by offsets. Second, intra-variable and inter-variable sparse attention mechanisms are devised to extract the associative structure of time-series data while taking time lags into account. Finally, associative structure alignment is used to guide the transfer of knowledge from the source domain to the target domain. Experimental studies on three real-world datasets not only verify the good performance of our method but also provide some insightful discoveries about the transferred knowledge.

AAAI Conference 2020 Conference Paper

DGE: Deep Generative Network Embedding Based on Commonality and Individuality

Sheng Zhou
Xin Wang
Jiajun Bu
Martin Ester
Pinggang Yu
Jiawei Chen
Qihao Shi
Can Wang

Network embedding plays a crucial role in network analysis by providing effective representations for a variety of learning tasks. Existing attributed network embedding methods mainly focus on preserving the observed node attributes and network topology in the latent embedding space, under the assumption that nodes connected by edges share similar attributes. However, our empirical analysis of real-world datasets shows that there exist both commonality and individuality between node attributes and network topology. On the one hand, similar nodes are expected to share similar attributes and to have edges connecting them (commonality). On the other hand, each information source may also maintain individual differences (individuality). Simultaneously capturing commonality and individuality is very challenging due to their exclusive nature, and existing work fails to do so. In this paper, we propose a deep generative embedding (DGE) framework that simultaneously captures commonality and individuality between network topology and node attributes in a generative process. Stochastic gradient variational Bayes (SGVB) optimization is employed to infer the model parameters as well as the node embeddings. Extensive experiments on four real-world datasets show the superiority of the proposed DGE framework in various tasks, including node classification and link prediction.
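
SGVB optimizes a variational lower bound using the reparameterization trick; for a diagonal Gaussian posterior and a standard normal prior, the KL part of that bound has a closed form. A minimal sketch of these two generic ingredients, assuming those distributions (DGE's actual priors and likelihoods over topology and attributes are not specified here, and the function names are illustrative):

```python
import math
import random
from typing import List, Sequence


def reparameterize(mu: Sequence[float], log_var: Sequence[float],
                   rng: random.Random) -> List[float]:
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    Because the randomness sits in eps, an autodiff framework can pass
    gradients through mu and log_var; plain Python here only does the sampling.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Per dimension: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2).
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))
```

The KL term vanishes exactly when the posterior equals the prior (mu = 0, log_var = 0), which makes a convenient sanity check.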

JBHI Journal 2020 Journal Article

Efficient and Effective Training of COVID-19 Classification Networks With Self-Supervised Dual-Track Learning to Rank

Yuexiang Li
Dong Wei
Jiawei Chen
Shilei Cao
Hongyu Zhou
Yanchun Zhu
Jianrong Wu
Lan Lan

Coronavirus Disease 2019 (COVID-19) has spread rapidly worldwide since it was first reported. Timely diagnosis of COVID-19 is crucial both for disease control and for patient care. Non-contrast thoracic computed tomography (CT) has been identified as an effective diagnostic tool, yet the outbreak has placed tremendous pressure on the radiologists reading the exams and may lead to fatigue-related misdiagnosis. Reliable automatic classification algorithms can be very helpful; however, they usually require a considerable number of COVID-19 cases for training, which are difficult to acquire in a timely manner. Meanwhile, how to effectively utilize the existing archive of non-COVID-19 data (the negative samples) in the presence of severe class imbalance is another challenge. In addition, the sudden outbreak necessitates fast algorithm development. In this work, we propose a novel approach for the effective and efficient training of COVID-19 classification networks using a small number of COVID-19 CT exams and an archive of negative samples. Concretely, a novel self-supervised learning method is proposed to extract features from the COVID-19 and negative samples. Then, two kinds of soft labels (‘difficulty’ and ‘diversity’) are generated for the negative samples by computing the earth mover's distances between the features of the negative and COVID-19 samples, from which the data ‘values’ of the negative samples can be assessed. A preset number of negative samples is then selected accordingly and fed to the neural network for training. Experimental results show that our approach achieves superior performance using about half of the negative samples, substantially reducing model training time.
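
The earth mover's distance step can be illustrated in one dimension, where it has a simple closed form. A minimal sketch, assuming each feature vector is treated as a 1-D empirical distribution of equal size and compared against a single COVID-19 reference feature; the paper's actual ‘difficulty’ and ‘diversity’ labels are not reproduced, and all names are hypothetical:

```python
from typing import List, Sequence


def emd_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """Earth mover's (Wasserstein-1) distance between two equal-size 1-D samples.

    With uniform weights and equal sizes, the optimal transport plan matches
    sorted values, so the distance is the mean absolute difference of the sorted samples.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def select_closest_negatives(negatives: List[Sequence[float]],
                             positive: Sequence[float], k: int) -> List[int]:
    """Indices of the k negatives whose features are closest (in EMD) to the positive reference."""
    scores = [emd_1d(n, positive) for n in negatives]
    return sorted(range(len(negatives)), key=lambda i: scores[i])[:k]
```

Using a single reference vector is a simplification; a fuller version would compare against the whole set of COVID-19 features.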

AAAI Conference 2020 Conference Paper

Fast Adaptively Weighted Matrix Factorization for Recommendation with Implicit Feedback

Jiawei Chen
Can Wang
Sheng Zhou
Qihao Shi
Jingbang Chen
Yan Feng
Chun Chen

Recommendation from implicit feedback is a highly challenging task due to the lack of reliable observed negative data. A popular and effective approach for implicit recommendation is to treat unobserved data as negative but downweight their confidence. Naturally, how to assign confidence weights and how to handle the large volume of unobserved data are two key problems for implicit recommendation models. However, existing methods either pursue fast learning by manually assigning simple confidence weights, which lacks flexibility and may introduce empirical bias in evaluating users' preferences, or adaptively infer personalized confidence weights but suffer from low efficiency. To achieve both adaptive weight assignment and efficient model learning, we propose a fast adaptively weighted matrix factorization (FAWMF) based on a variational auto-encoder. The personalized data confidence weights are adaptively assigned by a parameterized neural network (function) that can be inferred from the data. Further, to support fast and stable learning of FAWMF, a new specific batch-based learning algorithm, fBGD, has been developed, which trains on all feedback data while its complexity is linear in the number of observed data. Extensive experiments on real-world datasets demonstrate the superiority of the proposed FAWMF and its learning algorithm fBGD.
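
Training on all feedback data at a cost linear in the number of observed entries is usually made possible by rewriting the loss on missing entries through small Gram matrices. A minimal sketch, assuming a per-user confidence weight w[u] on every missing entry and a squared loss toward 0; it illustrates the general trick under those assumptions, not the paper's fBGD algorithm, and the function names are hypothetical:

```python
from typing import Dict, List, Sequence, Set, Tuple

Vec = Sequence[float]
Pair = Tuple[int, int]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def naive_loss(P: List[Vec], Q: List[Vec], observed: Set[Pair],
               c: Dict[Pair, float], w: Sequence[float]) -> float:
    """Reference loss that visits every (user, item) pair: O(|U| |I| d)."""
    loss = 0.0
    for u, pu in enumerate(P):
        for i, qi in enumerate(Q):
            pred = dot(pu, qi)
            if (u, i) in observed:
                loss += c[(u, i)] * (1.0 - pred) ** 2  # observed: pull toward 1
            else:
                loss += w[u] * pred ** 2               # missing: pull toward 0, down-weighted
    return loss


def fast_loss(P: List[Vec], Q: List[Vec], observed: Set[Pair],
              c: Dict[Pair, float], w: Sequence[float]) -> float:
    """Same value in O((|U| + |I|) d^2 + |observed| d).

    Uses sum_{u,i} w_u (p_u . q_i)^2 = sum_{k,l} (sum_u w_u p_uk p_ul) (sum_i q_ik q_il).
    """
    d = len(P[0])
    user_gram = [[sum(w[u] * p[k] * p[l] for u, p in enumerate(P)) for l in range(d)]
                 for k in range(d)]
    item_gram = [[sum(q[k] * q[l] for q in Q) for l in range(d)] for k in range(d)]
    # Treat every pair as missing...
    loss = sum(user_gram[k][l] * item_gram[k][l] for k in range(d) for l in range(d))
    # ...then swap the missing-entry term for the observed-entry term where needed.
    for (u, i) in observed:
        pred = dot(P[u], Q[i])
        loss += c[(u, i)] * (1.0 - pred) ** 2 - w[u] * pred ** 2
    return loss
```

The two functions return the same value; only `fast_loss` avoids touching every user-item pair, which is what makes full-data training tractable on large catalogues.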

AAAI Conference 2020 Conference Paper

Generative Adversarial Networks for Video-to-Video Domain Adaptation

Jiawei Chen
Yuexiang Li
Kai Ma
Yefeng Zheng

Endoscopic videos from multiple centres often have different imaging conditions, e.g., color and illumination, which cause models trained on one domain to generalize poorly to another. Domain adaptation is one potential solution to this problem. However, few existing works have focused on the translation of video-based data. In this work, we propose a novel generative adversarial network (GAN), namely VideoGAN, to transfer video-based data across different domains. As the frames of a video may have similar content and imaging conditions, the proposed VideoGAN has an X-shape generator to preserve intra-video consistency during translation. Furthermore, a loss function, namely color histogram loss, is proposed to tune the color distribution of each translated frame. Two colonoscopic datasets from different centres, i.e., CVC-Clinic and ETIS-Larib, are adopted to evaluate the domain adaptation performance of our VideoGAN. Experimental results demonstrate that the adapted colonoscopic video generated by our VideoGAN can significantly boost the segmentation accuracy of colorectal polyps on multicentre datasets (an improvement of 5%). As VideoGAN is a general network architecture, we also evaluate its performance on the CamVid driving video dataset for the cloudy-to-sunny translation task. Comprehensive experiments show that the domain gap can be substantially narrowed by our VideoGAN.
