Arrow Research search

Author name cluster

Lihua Zhang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

18 papers
2 author rows

Possible papers

18

TAAS Journal 2026 Journal Article

Auto-Follower: A Person-Following System for Urban Ackermann Human–Machine Collaborative Robotics

  • Zhijian Li
  • Dongliang Kou
  • Yizhao Wang
  • Wei Li
  • Zhiyan Dong
  • Lihua Zhang

Industry 5.0 is emerging as the next phase of industrial evolution, emphasizing human-centric manufacturing through close human–robot collaboration and the deployment of intelligent autonomous systems. As a representative example of such autonomy, person-following robots are typically implemented on differential-drive or omnidirectional mobile bases. However, certain tasks require Ackermann-steered robots, which face unique challenges due to limited maneuverability and the complexity of urban environments, often leading to target loss or navigation into non-drivable areas. To address these issues, we propose Auto-Follower, a person-following framework with enhanced perception and navigation capabilities. Auto-Follower integrates a vision–LiDAR servo tracker that fuses camera images with LiDAR points from a motorized rotating sensor, enabling 360° target perception. Instead of relying on a global map, the system employs real-time LiDAR-based local mapping for efficient path planning. In addition, an Iterative Radius Points Search (IRPS) method is developed to identify obstacle-free navigation goals when the target enters non-drivable regions, ensuring safe and continuous following. The framework has been validated extensively in both laboratory and urban environments and demonstrates robust, reliable performance, with strong potential for adaptation to diverse real-world person-following applications.
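The Iterative Radius Points Search (IRPS) named above can be pictured with a small sketch: sample candidate goals on rings of growing radius around the lost target and return the first drivable one. The paper's exact formulation is not reproduced here, so the ring sampling scheme, the occupancy-query callback, and every parameter name below are illustrative assumptions:

```python
import math

def iterative_radius_points_search(target, occupancy, is_free,
                                   r_step=0.5, r_max=5.0, n_angles=16):
    """Hypothetical IRPS sketch: probe points on concentric rings around
    the target position and return the first obstacle-free candidate."""
    r = r_step
    while r <= r_max:
        for k in range(n_angles):
            theta = 2.0 * math.pi * k / n_angles
            cand = (target[0] + r * math.cos(theta),
                    target[1] + r * math.sin(theta))
            if is_free(occupancy, cand):  # delegate drivability check
                return cand
        r += r_step  # widen the search ring
    return None  # no drivable goal within r_max
```

Because the inner loop exhausts one ring before enlarging the radius, the returned goal is (up to angular resolution) the drivable point nearest to the target, which matches the stated aim of safe, continuous following.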

AAAI Conference 2026 Conference Paper

SatireDecoder: Visual Cascaded Decoupling for Enhancing Satirical Image Comprehension

  • Yue Jiang
  • Haiwei Xue
  • Minghao Han
  • Mingcheng Li
  • Xiaolu Hou
  • Dingkang Yang
  • Lihua Zhang
  • Xu Zheng

Satire, a form of artistic expression combining humor with implicit critique, holds significant social value by illuminating societal issues. Despite its cultural and societal significance, satire comprehension, particularly in purely visual forms, remains a challenging task for current vision-language models. This task requires not only detecting satire but also deciphering its nuanced meaning and identifying the implicated entities. Existing models often fail to effectively integrate local entity relationships with global context, leading to misinterpretation, comprehension biases, and hallucinations. To address these limitations, we propose SatireDecoder, a training-free framework designed to enhance satirical image comprehension. Our approach employs a multi-agent system that performs visual cascaded decoupling to decompose images into fine-grained local and global semantic representations. In addition, we introduce a chain-of-thought reasoning strategy guided by uncertainty analysis, which breaks down the complex satire comprehension process into sequential subtasks with minimized uncertainty. Our method significantly improves interpretive accuracy while reducing hallucinations. Experimental results validate that SatireDecoder outperforms existing baselines in comprehending visual satire, offering a promising direction for vision-language reasoning in nuanced, high-level semantic tasks.

AAAI Conference 2026 Conference Paper

SpaCRD: Multimodal Deep Fusion of Histology and Spatial Transcriptomics for Cancer Region Detection

  • Shuailin Xue
  • Jun Wan
  • Lihua Zhang
  • Wenwen Min

Accurate detection of cancer tissue regions (CTR) enables deeper analysis of the tumor microenvironment and offers crucial insights into treatment response. Traditional CTR detection methods, which typically rely on the rich cellular morphology in histology images, are susceptible to a high rate of false positives due to morphological similarities across different tissue regions. The groundbreaking advances in spatial transcriptomics (ST) provide detailed cellular phenotypes and spatial localization information, offering new opportunities for more accurate cancer region detection. However, current methods are unable to effectively integrate histology images with ST data, especially in cross-sample and cross-platform/batch settings for CTR detection. To address this challenge, we propose SpaCRD, a transfer learning-based method that deeply integrates histology images and ST data to enable reliable CTR detection across diverse samples, platforms, and batches. Once trained on source data, SpaCRD can be readily generalized to accurately detect cancerous regions across samples from different platforms and batches. The core of SpaCRD is a category-regularized variational reconstruction-guided bidirectional cross-attention fusion network, which enables the model to adaptively capture latent co-expression patterns between histological features and gene expression from multiple perspectives. Extensive benchmark analysis on 23 matched histology-ST datasets spanning various disease types, platforms, and batches demonstrates that SpaCRD consistently outperforms eight existing state-of-the-art methods in CTR detection.

AAAI Conference 2026 Conference Paper

UniMGS: Unifying Mesh and 3D Gaussian Splatting with Single-Pass Rasterization and Proxy-Based Deformation

  • Zeyu Xiao
  • Mingyang Sun
  • Yimin Cong
  • Lintao Wang
  • Dongliang Kou
  • Zhenyi Wu
  • Dingkang Yang
  • Peng Zhai

Joint rendering and deformation of mesh and 3D Gaussian Splatting (3DGS) have significant value as both representations offer complementary advantages for graphics applications. However, due to differences in representation and rendering pipelines, existing studies render meshes and 3DGS separately, making it difficult to accurately handle occlusions and transparency. Moreover, the deformed 3DGS still suffers from visual artifacts due to the sensitivity to the topology quality of the proxy mesh. These issues pose serious obstacles to the joint use of 3DGS and meshes, making it difficult to adapt 3DGS to conventional mesh-oriented graphics pipelines. We propose UniMGS, the first unified framework for rasterizing mesh and 3DGS in a single-pass anti-aliased manner, with a novel binding strategy for 3DGS deformation based on proxy mesh. Our key insight is to blend the colors of both triangle and Gaussian fragments by anti-aliased α-blending in a single pass, achieving visually coherent results with precise handling of occlusion and transparency. To improve the visual appearance of the deformed 3DGS, our Gaussian-centric binding strategy employs a proxy mesh and spatially associates Gaussians with the mesh faces, significantly reducing rendering artifacts. With these two components, UniMGS enables the visualization and manipulation of 3D objects represented by mesh or 3DGS within a unified framework, opening up new possibilities in embodied AI, virtual reality, and gaming. We will release our source code to facilitate future research.
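The single-pass insight above can be illustrated with a minimal sketch: once triangle and Gaussian fragments are both reduced to (depth, color, alpha) tuples, one front-to-back α-blending loop resolves occlusion and transparency for the mixed set. The fragment representation and early-termination threshold below are assumptions for illustration, not the paper's rasterizer:

```python
def blend_fragments(fragments):
    """Front-to-back alpha blending over depth-sorted fragments.
    Each fragment is (depth, color, alpha); triangle and Gaussian
    fragments are treated uniformly, which is the core idea of a
    single-pass mixed rasterizer."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through
    for _, c, a in sorted(fragments, key=lambda f: f[0]):
        w = transmittance * a
        color = [cc + w * ci for cc, ci in zip(color, c)]
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:  # early exit once effectively opaque
            break
    return color, transmittance
```

Blending both fragment types in one sorted pass is what lets an opaque triangle correctly occlude a Gaussian behind it, and a semi-transparent Gaussian correctly tint a triangle behind it, without a second compositing stage.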

AAAI Conference 2025 Conference Paper

BloomScene: Lightweight Structured 3D Gaussian Splatting for Crossmodal Scene Generation

  • Xiaolu Hou
  • Mingcheng Li
  • Dingkang Yang
  • Jiawei Chen
  • Ziyun Qian
  • Xiao Zhao
  • Yue Jiang
  • Jinjie Wei

With the widespread use of virtual reality applications, 3D scene generation has become a new challenging research frontier. 3D scenes have highly complex structures, and generation methods must ensure that the output is dense, coherent, and contains all necessary structures. Many current 3D scene generation methods rely on pre-trained text-to-image diffusion models and monocular depth estimators. However, the generated scenes occupy large amounts of storage space and often lack effective regularization methods, leading to geometric distortions. To this end, we propose BloomScene, a lightweight structured 3D Gaussian splatting for crossmodal scene generation, which creates diverse and high-quality 3D scenes from text or image inputs. Specifically, a crossmodal progressive scene generation framework is proposed to generate coherent scenes utilizing incremental point cloud reconstruction and 3D Gaussian splatting. Additionally, we propose a hierarchical depth prior-based regularization mechanism that utilizes multi-level constraints on depth accuracy and smoothness to enhance the realism and continuity of the generated scenes. Ultimately, we propose a structured context-guided compression mechanism that exploits structured hash grids to model the context of unorganized anchor attributes, which significantly eliminates structural redundancy and reduces storage overhead. Comprehensive experiments across multiple scenes demonstrate the significant potential and advantages of our framework compared with several baselines.

AAAI Conference 2025 Conference Paper

Debiased Multimodal Understanding for Human Language Sequences

  • Zhi Xu
  • Dingkang Yang
  • Mingcheng Li
  • Yuzheng Wang
  • Zhaoyu Chen
  • Jiawei Chen
  • Jinjie Wei
  • Lihua Zhang

Human multimodal language understanding (MLU) is an indispensable component of expression analysis (e.g., sentiment or humor) from heterogeneous modalities, including visual postures, linguistic contents, and acoustic behaviours. Existing works invariably focus on designing sophisticated structures or fusion strategies to achieve impressive improvements. Unfortunately, they all suffer from the subject variation problem due to data distribution discrepancies among subjects. Concretely, MLU models are easily misled by distinct subjects with different expression customs and characteristics in the training data to learn subject-specific spurious correlations, limiting performance and generalizability across new subjects. Motivated by this observation, we introduce a recapitulative causal graph to formulate the MLU procedure and analyze the confounding effect of subjects. Then, we propose SuCI, a simple yet effective causal intervention module to disentangle the impact of subjects acting as unobserved confounders and achieve model training via true causal effects. As a plug-and-play component, SuCI can be widely applied to most methods that seek unbiased predictions. Comprehensive experiments on several MLU benchmarks clearly show the effectiveness of the proposed module.

AAAI Conference 2025 Conference Paper

Improving Factuality in Large Language Models via Decoding-Time Hallucinatory and Truthful Comparators

  • Dingkang Yang
  • Dongling Xiao
  • Jinjie Wei
  • Mingcheng Li
  • Zhaoyu Chen
  • Ke Li
  • Lihua Zhang

Despite their remarkable capabilities, Large Language Models (LLMs) are prone to generating responses that contradict verifiable facts, i.e., unfaithful hallucination content. Existing efforts generally focus on optimizing model parameters or editing semantic representations, which compromises the internal factual knowledge of target LLMs. In addition, hallucinations typically exhibit multifaceted patterns in downstream tasks, limiting the model's holistic performance across tasks. In this paper, we propose a Comparator-driven Decoding-Time (CDT) framework to alleviate response hallucination. First, we construct hallucinatory and truthful comparators with multi-task fine-tuning samples. We then present an instruction prototype-guided mixture-of-experts strategy to enhance the ability of the corresponding comparators to capture different hallucination or truthfulness patterns in distinct task instructions. CDT constrains next-token predictions to factuality-robust distributions by contrasting the logit differences between the target LLMs and these comparators. Systematic experiments on multiple downstream tasks show that our framework significantly improves model performance and response factuality.
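The decoding-time contrast described above can be sketched in a few lines: shift the target model's next-token logits toward the truthful comparator and away from the hallucinatory one, then renormalize. The additive combining rule and the `alpha` weight below are illustrative assumptions, not the paper's exact formulation:

```python
import math

def cdt_adjusted_logits(target, truthful, hallucinatory, alpha=1.0):
    """Illustrative decoding-time contrast: move the target logits in
    the direction (truthful - hallucinatory). alpha scales the shift."""
    return [t + alpha * (tr - ha)
            for t, tr, ha in zip(target, truthful, hallucinatory)]

def softmax(logits):
    """Convert adjusted logits back into a next-token distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

For instance, if the target model is indifferent between two tokens but the truthful comparator prefers the first while the hallucinatory comparator prefers the second, the adjusted distribution shifts sharply toward the first token.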

AAAI Conference 2025 Conference Paper

MMPF: Multi-Modal Perception Framework for Abnormal Medical Condition Detection

  • Chuyi Zhong
  • Dingkang Yang
  • Peng Zhai
  • Lihua Zhang

As the global population ages and the incidence of chronic diseases increases, the demand for early detection of abnormal medical conditions is increasing. Traditional health monitoring methods often require significant resources and specialized personnel, limiting their widespread use. Leveraging advancements in AI technologies, this study proposes a non-invasive method for detecting abnormal medical conditions from image data. A multimodal perception framework is introduced, integrating features from various modalities, including facial expressions and body postures, to enhance detection accuracy. The framework employs a Cascaded Squeeze-Excitation (CSE) module, consisting of Adaptive and Multi-modal Squeeze-Excitation components, to capture complex feature dependencies and improve cross-modal performance. Extensive experiments demonstrate the effectiveness of this approach, showing improved performance over existing methods. In addition, a new dataset that encompasses a wide range of medical conditions has been released, providing a valuable resource for future research in this domain.

IROS Conference 2025 Conference Paper

Robust Reinforcement Learning based on Momentum Adversarial Training

  • Li He
  • Hanchen Liu
  • Junru Sheng
  • Lihua Zhang
  • Zhiyan Dong

Reinforcement learning (RL) is a fundamental algorithm in the advancement of autonomous intelligence, including Embodied Intelligence and Physical Intelligence. The performance of RL directly influences the quality and efficiency of a robot’s decision-making and execution during interactions with its environment. Moreover, the robustness of RL remains a critical challenge that needs to be addressed. A promising approach to enhancing robustness is adversarial reinforcement learning. However, existing methods primarily focus on perturbations in the state space, while perturbations in the action space have been relatively underexplored. The action space in RL is as crucial as the state space, and action-space perturbations provide a more comprehensive evaluation of RL robustness. It is therefore necessary and valuable to investigate RL robustness under action-space perturbations. To this end, we propose an adversarial learning framework that employs momentum-based gradient descent to model perturbations in the action space, such as actuator disturbances. Furthermore, we introduce an improved optimization method that integrates historical gradient information into conventional Stochastic Gradient Descent (SGD). This approach enhances training stability and improves perturbation efficiency. The proposed method is evaluated through simulations in the MuJoCo environment and UAV control experiments in GymFC, demonstrating significant improvements in robustness and adaptability under action-space perturbations. Additionally, real-world UAV flight tests are conducted to further validate the effectiveness of the proposed framework. The results confirm that the Sim-to-Real transfer is successful, providing empirical evidence for the applicability of our method in real-world scenarios. This study shows that enhancing RL robustness through action-space perturbations is feasible and effective. More importantly, our findings contribute to the future development of autonomous intelligence, particularly in improving its resilience to uncertainties and dynamic environments.

AAAI Conference 2024 Conference Paper

A Unified Self-Distillation Framework for Multimodal Sentiment Analysis with Uncertain Missing Modalities

  • Mingcheng Li
  • Dingkang Yang
  • Yuxuan Lei
  • Shunli Wang
  • Shuaibing Wang
  • Liuzhen Su
  • Kun Yang
  • Yuzheng Wang

Multimodal Sentiment Analysis (MSA) has attracted widespread research attention recently. Most MSA studies are based on the assumption of modality completeness. However, many inevitable factors in real-world scenarios lead to uncertain missing modalities, which invalidate the fixed multimodal fusion approaches. To this end, we propose a Unified multimodal Missing modality self-Distillation Framework (UMDF) to handle the problem of uncertain missing modalities in MSA. Specifically, a unified self-distillation mechanism in UMDF drives a single network to automatically learn robust inherent representations from the consistent distribution of multimodal data. Moreover, we present a multi-grained crossmodal interaction module to deeply mine the complementary semantics among modalities through coarse- and fine-grained crossmodal attention. Eventually, a dynamic feature integration module is introduced to enhance the beneficial semantics in incomplete modalities while filtering the redundant information therein to obtain a refined and robust multimodal representation. Comprehensive experiments on three datasets demonstrate that our framework significantly improves MSA performance under both uncertain missing-modality and complete-modality testing conditions.

IJCAI Conference 2024 Conference Paper

Optimal Auction Design with User Coupons in Advertising Systems

  • Xiaodong Liu
  • Zhikang Fan
  • Yiming Ding
  • Yuan Guo
  • Lihua Zhang
  • Changcheng Li
  • Dongying Kong
  • Han Li

Online advertising is a major revenue source for most Internet companies. The advertising opportunities are usually sold to advertisers through auctions that take into account the advertisers' bids as well as the users' click-through rates (CTRs) and conversion rates (CVRs). Standard auction design theory perceives both the CTRs and the CVRs as constants. We consider a new auction mechanism that offers coupons to users when displaying the ads. Such coupons allow the user to buy the advertisers' products or services at a lower price, which increases both the CTRs and the CVRs of the ads. In this paper, we formulate the problem mathematically and perform a systematic analysis. We characterize the set of individually rational and incentive compatible mechanisms in our setting. Based on the characterization, we identify the optimal strategy of offering coupons that maximizes the platform's expected revenue. We also conduct extensive experiments on both synthetic data and industrial data. Our experiment results show that our mechanism significantly improves both the revenue and welfare of the platform, thereby creating a win-win situation for all parties including the platform, the advertisers, and the users.

NeurIPS Conference 2024 Conference Paper

PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications

  • Dingkang Yang
  • Jinjie Wei
  • Dongling Xiao
  • Shunli Wang
  • Tong Wu
  • Gang Li
  • Mingcheng Li
  • Shuaibing Wang

Developing intelligent pediatric consultation systems offers promising prospects for improving diagnostic efficiency, especially in China, where healthcare resources are scarce. Despite recent advances in Large Language Models (LLMs) for Chinese medicine, their performance is sub-optimal in pediatric applications due to inadequate instruction data and vulnerable training procedures. To address the above issues, this paper builds PedCorpus, a high-quality dataset of over 300,000 multi-task instructions from pediatric textbooks, guidelines, and knowledge graph resources to fulfil diverse diagnostic demands. Building on the well-designed PedCorpus, we propose PediatricsGPT, the first Chinese pediatric LLM assistant built on a systematic and robust training pipeline. In the continuous pre-training phase, we introduce a hybrid instruction pre-training mechanism to mitigate the internal-injected knowledge inconsistency of LLMs for medical domain adaptation. Subsequently, full-parameter Supervised Fine-Tuning (SFT) is utilized to incorporate the general medical knowledge schema into the models. After that, we devise a direct following preference optimization to enhance the generation of pediatrician-like humanistic responses. In the parameter-efficient secondary SFT phase, a mixture of universal-specific experts strategy is presented to resolve the competency conflict between medical generalist and pediatric expertise mastery. Extensive results based on the metrics, GPT-4, and doctor evaluations on distinct downstream tasks show that PediatricsGPT consistently outperforms previous Chinese medical LLMs. The project and data will be released at https://github.com/ydk122024/PediatricsGPT.

AAMAS Conference 2024 Conference Paper

Successively Pruned Q-Learning: Using Self Q-function to Reduce the Overestimation

  • Zhaolin Xue
  • Lihua Zhang
  • Zhiyan Dong

It is well known that Q-learning suffers from overestimation because it uses the maximum state-action value as an approximation of the maximum expected state-action value. Double Q-learning and other algorithms have been proposed as efficient solutions to alleviate this overestimation. However, these methods rely on multiple Q-functions to reduce the overestimation and ignore the information contained in a single Q-function. In this paper, 1) we reinterpret the update process of Q-learning and build a more precise model compatible with the previous one. 2) We propose a novel and simple method to control the maximum bias by employing the information of a single Q-function. 3) Our method not only balances the overestimation and the underestimation, but also attains the minimum bias under proper hyper-parameters. 4) Moreover, it generalizes naturally to both discrete and continuous control tasks. We show that our algorithms outperform Double DQN and other algorithms on several representative games, and that classical off-policy actor-critic algorithms can also benefit from our method.
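The overestimation this paper targets is easy to demonstrate numerically: the maximum of several noisy Q-estimates systematically exceeds the maximum of the true values, even when the estimation noise is zero-mean. A minimal simulation (function and parameter names here are illustrative, not from the paper):

```python
import random

def overestimation_bias(true_values, noise_std=1.0, n_trials=10000, seed=0):
    """Estimate E[max_a Qhat(a)] - max_a Q(a), where Qhat adds
    zero-mean Gaussian noise to each true action value. A positive
    result is exactly the overestimation Q-learning suffers."""
    rng = random.Random(seed)
    true_max = max(true_values)
    acc = 0.0
    for _ in range(n_trials):
        noisy = [q + rng.gauss(0.0, noise_std) for q in true_values]
        acc += max(noisy)  # the max operator picks up positive noise
    return acc / n_trials - true_max
```

With five actions whose true values are all zero, the bias is roughly the expected maximum of five standard normals (about 1.16), which is why methods that temper the max operator, such as Double Q-learning or the single-Q-function pruning proposed here, matter in practice.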

NeurIPS Conference 2024 Conference Paper

Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning

  • Mingcheng Li
  • Dingkang Yang
  • Yang Liu
  • Shunli Wang
  • Jiawei Chen
  • Shuaibing Wang
  • Jinjie Wei
  • Yue Jiang

Multimodal Sentiment Analysis (MSA) is an important research area that aims to understand and recognize human sentiment through multiple modalities. The complementary information provided by multimodal fusion promotes better sentiment analysis compared to utilizing only a single modality. Nevertheless, in real-world applications, many unavoidable factors may lead to situations of uncertain modality missing, thus hindering the effectiveness of multimodal modeling and degrading the model’s performance. To this end, we propose a Hierarchical Representation Learning Framework (HRLF) for the MSA task under uncertain missing modalities. Specifically, we propose a fine-grained representation factorization module that sufficiently extracts valuable sentiment information by factorizing each modality into sentiment-relevant and modality-specific representations through crossmodal translation and sentiment semantic reconstruction. Moreover, a hierarchical mutual information maximization mechanism is introduced to incrementally maximize the mutual information between multi-scale representations to align and reconstruct the high-level semantics in the representations. Ultimately, we propose a hierarchical adversarial learning mechanism that further aligns and adapts the latent distribution of sentiment-relevant representations to produce robust joint multimodal representations. Comprehensive experiments on three datasets demonstrate that HRLF significantly improves MSA performance under uncertain modality missing cases.

NeurIPS Conference 2023 Conference Paper

How2comm: Communication-Efficient and Collaboration-Pragmatic Multi-Agent Perception

  • Dingkang Yang
  • Kun Yang
  • Yuzheng Wang
  • Jing Liu
  • Zhi Xu
  • Rongbin Yin
  • Peng Zhai
  • Lihua Zhang

Multi-agent collaborative perception has recently received widespread attention as an emerging application in driving scenarios. Despite the advancements in previous efforts, challenges remain due to various noises in the perception procedure, including communication redundancy, transmission delay, and collaboration heterogeneity. To tackle these issues, we propose How2comm, a collaborative perception framework that seeks a trade-off between perception performance and communication bandwidth. Our novelties lie in three aspects. First, we devise a mutual information-aware communication mechanism to maximally sustain the informative features shared by collaborators. The spatial-channel filtering is adopted to perform effective feature sparsification for efficient communication. Second, we present a flow-guided delay compensation strategy to predict future characteristics from collaborators and eliminate feature misalignment due to temporal asynchrony. Ultimately, a pragmatic collaboration transformer is introduced to integrate holistic spatial semantics and temporal context clues among agents. Our framework is thoroughly evaluated on several LiDAR-based collaborative detection datasets in real-world and simulated scenarios. Comprehensive experiments demonstrate the superiority of How2comm and the effectiveness of all its vital components. The code will be released at https://github.com/ydk122024/How2comm.

AAAI Conference 2022 Conference Paper

Robust Adversarial Reinforcement Learning with Dissipation Inequation Constraint

  • Peng Zhai
  • Jie Luo
  • Zhiyan Dong
  • Lihua Zhang
  • Shunli Wang
  • Dingkang Yang

Robust adversarial reinforcement learning is an effective method to train agents to manage uncertain disturbances and modeling errors in real environments. However, for systems that are sensitive to disturbances or those that are difficult to stabilize, it is easier to learn a powerful adversary than to establish a stable control policy. An improperly strong adversary can destabilize the system, introduce biases in the sampling process, make the learning process unstable, and even reduce the robustness of the policy. In this study, we consider the problem of ensuring system stability during training in the adversarial reinforcement learning architecture. The dissipative principle of robust H∞ control is extended to the Markov Decision Process, and robust stability constraints are obtained based on L2 gain performance in the reinforcement learning system. Thus, we propose a dissipation-inequation-constraint-based adversarial reinforcement learning architecture. This architecture ensures the stability of the system during training by imposing constraints on the normal and adversarial agents. Theoretically, this architecture can be applied to a large family of deep reinforcement learning algorithms. Results of experiments in MuJoCo and GymFc environments show that our architecture effectively improves the robustness of the controller against environmental changes and adapts to more powerful adversaries. Results of the flight experiments on a real quadcopter indicate that our method can directly deploy the policy trained in the simulation environment to the real environment, and our controller outperforms the PID controller based on hardware-in-the-loop. Both our theoretical and empirical results provide new and critical outlooks on the adversarial reinforcement learning architecture from a rigorous robust control perspective.