Author name cluster

Han Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

34 papers

2 author rows

EAAI Journal 2026 Journal Article

An enhanced you only look once model for multi-class apple detection in natural orchard environments

Xiaohang Liu
Zhao Zhang
Jiangfan Yu
Wanjia Hua
Xu Li
Han Li
Man Zhang
Chayan Kumer Saha

Multi-class apple detection can improve automatic apple-picking robots' efficiency. Existing studies classified apples into four occlusion types but struggled with clustered fruits and could not balance precision, speed, and model size. A robust Apple State You Only Look Once version 8 medium (AS-YOLOv8m) model was thus proposed for detecting apples into 11 classes according to the apples’ occlusion and clustering conditions. Core innovations included: (i) A cross-stage partial bottleneck module with the deformable convolution was designed to enhance feature extraction and geometric transformation modeling capabilities; (ii) the space-to-depth convolution module was embedded in the backbone network to improve small target detection; (iii) the large-target detection head was removed to lighten the model size; and (iv) the wise intersection over union box loss function was used to balance the loss of high- and low-quality anchor boxes. The model was trained (5,845 images), validated (1,948 images), and tested (1,950 images) using 9,743 apple images, which were augmented from 1,149 original captures collected from commercial orchards under diverse lighting conditions. Results showed that AS-YOLOv8m achieved a higher mean average precision of 95.8% in 11 classes than that of 95.4% in 4 classes, which also outperformed other comparison models (<95.1%) and prior research results (<91.3%). The detection speed was 76.9 frames per second, and the model size was 36.2 megabytes. With its real-time capability, small model size, and high detection precision, the AS-YOLOv8m model stands as a promising multi-class apple detection method for the further improvement of robot picking effect and efficiency.
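The abstract mentions a space-to-depth convolution module for small targets but does not give its details. As a rough illustration only, the generic space-to-depth rearrangement can be sketched as below; the function name `space_to_depth`, the single-channel list-of-rows input, and the block size of 2 are assumptions for this sketch, not the paper's implementation.

```python
def space_to_depth(feature_map, block=2):
    """Rearrange one 2D map (a list of rows) into block*block smaller maps.

    Hypothetical helper: no pixel is discarded, spatial resolution is
    traded for extra channels.
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % block or width % block:
        raise ValueError("height and width must be divisible by block")
    channels = []
    for dy in range(block):
        for dx in range(block):
            # Keep every `block`-th pixel, starting at offset (dy, dx).
            sub = [row[dx::block] for row in feature_map[dy::block]]
            channels.append(sub)
    return channels


fm = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]
out = space_to_depth(fm)  # four 2x2 maps instead of one 4x4 map
```

Unlike strided convolution or pooling, this rearrangement keeps every input value, which is the usual argument for using it when objects occupy only a few pixels.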

AAAI Conference 2026 Conference Paper

CommitMoE: Efficient Fallback-Free MoE Inference with Offloading Under GPU Memory Constraints

Han Li
Jingwei Sun
Junqing Lin
Guangzhong Sun

Mixture of Experts (MoE) models have emerged as a promising approach to scale language models efficiently by activating only a subset of parameters for each input. However, deploying these models under GPU memory constraints remains challenging, as existing offloading strategies incur significant overhead from CPU-GPU data transfers. While prior work has explored prefetching techniques to mitigate this bottleneck, these methods require costly fallback mechanisms when predictions fail. Since expert transfers cannot be canceled once initiated, the correct experts need to be loaded on demand sequentially, introducing additional latency. To address this, we present CommitMoE, a novel approach featuring a Commit Router that makes execution decisions based on expert predictions without fallback mechanisms. Our key insight reveals that router certainty strongly correlates with prediction accuracy, while in low-certainty scenarios, the model output demonstrates inherent robustness to expert selection. Leveraging this insight to design a systems-level solution, CommitMoE achieves 1.3× to 9.4× faster inference across different environments and datasets compared to state-of-the-art offloading frameworks while maintaining model quality.

AAAI Conference 2026 Conference Paper

CroPS: Improving Dense Retrieval with Cross-Perspective Positive Samples in Short-Video Search

Ao Xie
Jiahui Chen
Quanzhi Zhu
Xiaoze Jiang
Zhiheng Qin
Enyun Yu
Han Li

Dense retrieval has become a foundational paradigm in modern search systems, especially on short-video platforms. However, most industrial systems adopt a self-reinforcing training pipeline that relies on historically exposed user interactions for supervision. This paradigm inevitably leads to a filter bubble effect, where potentially relevant but previously unseen content is excluded from the training signal, biasing the model toward narrow and conservative retrieval. In this paper, we present CroPS (Cross-Perspective Positive Samples), a novel retrieval data engine designed to alleviate this problem by introducing diverse and semantically meaningful positive examples from multiple perspectives. CroPS enhances training with positive signals derived from user query reformulation behavior (query-level), engagement data in recommendation streams (system-level), and world knowledge synthesized by large language models (knowledge-level). To effectively utilize these heterogeneous signals, we introduce a Hierarchical Label Assignment (HLA) strategy and a corresponding H-InfoNCE loss that together enable fine-grained, relevance-aware optimization. Extensive experiments conducted on Kuaishou Search, a large-scale commercial short-video search platform, demonstrate that CroPS significantly outperforms strong baselines both offline and in live A/B tests, achieving superior retrieval performance and reducing query reformulation rates. CroPS is now fully deployed in Kuaishou Search, serving hundreds of millions of users daily.
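The H-InfoNCE loss is not specified in the abstract. For orientation, the plain InfoNCE loss it presumably builds on can be sketched as follows; the function name `info_nce`, the scalar similarity scores, and the default temperature are assumptions for this sketch, and the hierarchical label handling of CroPS is not reproduced here.

```python
import math


def info_nce(pos_score, neg_scores, temperature=1.0):
    """Plain InfoNCE for one query: -log softmax of the positive's score.

    pos_score: similarity between the query and its positive item.
    neg_scores: similarities between the query and the negative items.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


loss_tied = info_nce(0.0, [0.0])      # positive indistinguishable from negative
loss_separated = info_nce(5.0, [0.0])  # positive scored well above negative
```

The loss shrinks as the positive score pulls away from the negatives, so where the positives come from directly shapes what the retriever learns.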

AAAI Conference 2026 Conference Paper

DIMM: Decoupled Multi-hierarchy Kalman Filter via Reinforcement Learning

Jirong Zha
Yuxuan Fan
Kai Li
Han Li
Chen Gao
Xinlei Chen

State estimation is challenging for target tracking with high maneuverability, as the target's state transition function changes rapidly and irregularly and is unknown to the estimator. Existing work based on interacting multiple model (IMM) achieves more accurate estimation than single-filter approaches through model combination, aligning appropriate models for different motion modes of the target over time. However, two limitations of conventional IMM remain unsolved. First, the solution space of the model combination is constrained as the target's diverse kinematic properties in different directions are ignored. Second, the model combination weights calculated by the observation likelihood are not accurate enough due to the measurement uncertainty. In this paper, we propose a novel framework, DIMM, to effectively combine estimates from different motion models in each direction, thus increasing the target tracking accuracy. First, DIMM extends the model combination solution space of conventional IMM from a hyperplane to a hypercube by designing a 3D-decoupled multi-hierarchy filter bank, which describes the target's motion with various-order linear models. Second, DIMM generates more reliable combination weight matrices through a differentiable adaptive fusion network for importance allocation rather than solely relying on the observation likelihood; it contains an attention-based twin delayed deep deterministic policy gradient (TD3) method with a hierarchical reward. Experiments demonstrate that DIMM significantly improves the tracking accuracy of existing state estimation methods by 31.61% to 99.23%.
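To make the "hyperplane versus hypercube" remark concrete, the sketch below contrasts the conventional likelihood-based IMM weight update (one scalar weight per model, summing to one) with a per-axis combination in which each direction has its own weights. All function names are hypothetical, and the per-axis weights are simply given as input here; DIMM's learned fusion network is not reproduced.

```python
def imm_weights(prior, likelihoods):
    """Conventional IMM update: weight_j is proportional to prior_j * likelihood_j."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def combine(estimates, weights):
    """Blend per-model state vectors with one scalar weight per model."""
    dim = len(estimates[0])
    return [sum(w * est[d] for w, est in zip(weights, estimates))
            for d in range(dim)]


def combine_per_axis(estimates, weights_per_axis):
    """Blend with a separate weight vector for every axis (assumed given)."""
    return [sum(w * est[d] for w, est in zip(weights_per_axis[d], estimates))
            for d in range(len(estimates[0]))]


estimates = [[0.0, 0.0, 0.0], [4.0, 8.0, 12.0]]  # two models, 3D position
shared = imm_weights([0.5, 0.5], [1.0, 3.0])
blended = combine(estimates, shared)
decoupled = combine_per_axis(estimates, [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

With shared weights every axis is blended in the same proportion, whereas the per-axis form can trust a different model on each axis.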

AAAI Conference 2026 Conference Paper

DS-ProGen: A Dual-Structure Deep Language Model for Functional Protein Design

Yanting Li
Zikang Wang
Jiyue Jiang
Ziqian Lin
Dongchen He
Yuheng Shan
Yanruisheng Shao
Jiayi Li

Inverse Protein Folding (IPF) is a critical subtask in the field of protein design, aiming to engineer amino acid sequences capable of folding correctly into a specified three-dimensional (3D) conformation. Although substantial progress has been achieved in recent years, existing methods generally rely on either backbone coordinates or molecular surface features alone, which restricts their ability to fully capture the complex chemical and geometric constraints necessary for precise sequence prediction. To address this limitation, we present DS-ProGen, a dual-structure deep language model for functional protein design, which integrates both backbone geometry and surface-level representations. By incorporating backbone coordinates as well as surface chemical and geometric descriptors into a next-amino-acid prediction paradigm, DS-ProGen is able to generate functionally relevant and structurally stable sequences while satisfying both global and local conformational constraints. On the PRIDE dataset, DS-ProGen attains the current state-of-the-art recovery rate of 61.47%, demonstrating the synergistic advantage of multi-modal structural encoding in protein design. Furthermore, DS-ProGen excels in predicting interactions with a variety of biological partners, including ligands, ions, and RNA, confirming its robust functional retention capabilities.

AAAI Conference 2026 Conference Paper

OneSug: The Unified End-to-End Generative Framework for E-commerce Query Suggestion

Xian Guo
Ben Chen
Siyuan Wang
Ying Yang
Mingyue Cheng
Chenyi Lei
Yuqing Ding
Han Li

Query suggestion plays a crucial role in enhancing user experience in e-commerce search systems by providing relevant query recommendations that align with users' initial input. This module helps users navigate towards personalized preference needs and reduces typing effort, thereby improving search experience. Traditional query suggestion modules usually adopt multi-stage cascading architectures to strike a good trade-off between system response time and business conversion, but they often suffer from inefficiencies and suboptimal performance due to inconsistent optimization objectives across stages. To address these issues, we propose OneSug, the first end-to-end generative framework for e-commerce query suggestion. OneSug incorporates a prefix2query representation enhancement module to enrich prefixes using semantically and interactively related queries to bridge content and business characteristics, an encoder-decoder generative model that unifies the query suggestion process, and a reward-weighted ranking strategy with behavior-level weights to capture fine-grained user preferences. Extensive evaluations on large-scale industry datasets demonstrate OneSug's ability for effective and efficient query suggestion. Furthermore, OneSug has been successfully deployed for the entire traffic on the e-commerce search engine in TEST platform for over 1 month, with statistically significant improvements in user top click position (-9.33%), CTR (+2.01%), Order (+2.04%), and Revenue (+1.69%) over the online multi-stage strategy, showing great potential in e-commerce conversion.

ICML Conference 2025 Conference Paper

Adversarial Robust Generalization of Graph Neural Networks

Chang Cao
Han Li
Yulong Wang
Rui Wu
Hong Chen

While Graph Neural Networks (GNNs) have shown outstanding performance in node classification tasks, they are vulnerable to adversarial attacks, which are imperceptible changes to input samples. Adversarial training, as a widely used tool to enhance the adversarial robustness of GNNs, has presented remarkable effectiveness in node classification tasks. However, the generalization properties that explain their behaviors are still not well understood from the theoretical viewpoint. To fill this gap, we develop a high probability generalization bound of general GNNs in adversarial learning through covering number analysis. We estimate the covering number of the GNN model class based on the entire perturbed feature matrix by constructing a cover for the perturbation set. Our results are generally applicable to a series of GNNs. We demonstrate their applicability by investigating the generalization performance of several popular GNN models under adversarial attacks, which reveals the architecture-related factors influencing the generalization gap. Our experimental results on benchmark datasets provide evidence that supports the established theoretical findings.

IJCAI Conference 2025 Conference Paper

Adversarial Training for Graph Convolutional Networks: Stability and Generalization Analysis

Chang Cao
Han Li
Yulong Wang
Rui Wu
Hong Chen

Recently, numerous methods have been proposed to enhance the robustness of Graph Convolutional Networks (GCNs), given their vulnerability to adversarial attacks. Despite their empirical success, a significant gap remains in understanding GCNs' adversarial robustness from the theoretical perspective. This paper addresses this gap by analyzing generalization against both node and structure attacks for multi-layer GCNs through the framework of uniform stability. Under the smoothness assumption of the loss function, we establish the first adversarial generalization bound of GCNs in expectation. Our theoretical analysis contributes to a deeper understanding of how adversarial perturbations and graph architectures influence generalization performance, which provides meaningful insights for designing robust models. Experimental results on benchmark datasets confirm the validity of our theoretical findings, highlighting their practical significance.

ICML Conference 2025 Conference Paper

CaDA: Cross-Problem Routing Solver with Constraint-Aware Dual-Attention

Han Li
Fei Liu 0044
Zhi Zheng 0009
Yu Zhang 0226
Zhenkun Wang 0001

Vehicle routing problems (VRPs) are significant combinatorial optimization problems (COPs) holding substantial practical importance. Recently, neural combinatorial optimization (NCO), which involves training deep learning models on extensive data to learn vehicle routing heuristics, has emerged as a promising approach due to its efficiency and the reduced need for manual algorithm design. However, applying NCO across diverse real-world scenarios with various constraints necessitates cross-problem capabilities. Current cross-problem NCO methods for VRPs typically employ a constraint-unaware model, limiting their cross-problem performance. Furthermore, they rely solely on global connectivity, which fails to focus on key nodes and leads to inefficient representation learning. This paper introduces a Constraint-Aware Dual-Attention Model (CaDA), designed to address these limitations. CaDA incorporates a constraint prompt that efficiently represents different problem variants. Additionally, it features a dual-attention mechanism with a global branch for capturing broader graph-wide information and a sparse branch that selectively focuses on the key node connections. We comprehensively evaluate our model on 16 different VRPs and compare its performance against existing cross-problem VRP solvers. CaDA achieves state-of-the-art results across all tested VRPs. Our ablation study confirms that each component contributes to its cross-problem learning performance. The source code for CaDA is publicly available at https://github.com/CIAM-Group/CaDA.

NeurIPS Conference 2025 Conference Paper

Fading to Grow: Growing Preference Ratios via Preference Fading Discrete Diffusion for Recommendation

Guoqing Hu
An Zhang
Shuchang Liu
Wenyu Mao
Jiancan Wu
Xun Yang
Xiang Li
Lantao Hu

Recommenders aim to rank items from a discrete item corpus in line with user interests, yet suffer from extremely sparse user preference data. Recent advances in diffusion models have inspired diffusion-based recommenders, which alleviate sparsity by injecting noise during a forward process to prevent collapse of perturbed preference distributions. However, current diffusion-based recommenders predominantly rely on continuous Gaussian noise, which is intrinsically mismatched with the discrete nature of user preference data in recommendation. In this paper, building upon recent advances in discrete diffusion, we propose PreferGrow, a discrete diffusion-based recommender modeling preference ratios by fading and growing user preferences over the discrete item corpus. PreferGrow differs from existing diffusion-based recommenders in three core aspects: (1) Discrete modeling of preference ratios: PreferGrow models relative preference ratios between two items, where a positive value indicates that one item is preferred over the other. This formulation aligns naturally with the discrete and ranking-oriented nature of recommendation tasks. (2) Perturbing via preference fading: Instead of injecting continuous noise, PreferGrow fades user preferences by replacing the preferred item with alternatives (physically akin to negative sampling), thereby eliminating the need for any prior noise assumption. (3) Preference reconstruction via growing: PreferGrow reconstructs user preferences by iteratively growing the preference signal from the estimated ratios. We further provide theoretical analysis showing that PreferGrow preserves key properties of discrete diffusion processes. PreferGrow provides a well-defined matrix-based formulation for discrete diffusion-based recommendation and empirically outperforms existing diffusion-based recommenders across five benchmark datasets, underscoring its superior effectiveness. Our code is available at https://anonymous.4open.science/r/PreferGrow_Commit-2259/.

AAAI Conference 2025 Conference Paper

LEARN: Knowledge Adaptation from Large Language Model to Recommendation for Practical Industrial Application

Jian Jia
Yipei Wang
Yan Li
Honggang Chen
Xuehan Bai
Zhaocheng Liu
Jian Liang
Quan Chen

Contemporary recommendation systems predominantly rely on ID embedding to capture latent associations among users and items. However, this approach overlooks the wealth of semantic information embedded within textual descriptions of items, leading to suboptimal performance and poor generalization. Leveraging the capability of large language models to comprehend and reason about textual content presents a promising avenue for advancing recommendation systems. To achieve this, we propose an Llm-driven knowlEdge Adaptive RecommeNdation (LEARN) framework that synergizes open-world knowledge with collaborative knowledge. We address computational complexity concerns by utilizing pretrained LLMs as item encoders and freezing LLM parameters to avoid catastrophic forgetting and preserve open-world knowledge. To bridge the gap between the open-world and collaborative domains, we design a twin-tower structure supervised by the recommendation task and tailored for practical industrial application. Through experiments on a real large-scale industrial dataset and online A/B tests, we demonstrate the efficacy of our approach in industrial applications. We also achieve state-of-the-art performance on six Amazon Review datasets to verify the superiority of our method.

ICML Conference 2025 Conference Paper

Noise Conditional Variational Score Distillation

Xinyu Peng
Ziyang Zheng
Yaoming Wang
Han Li
Nuowen Kan
Wenrui Dai
Chenglin Li
Junni Zou

We propose Noise Conditional Variational Score Distillation (NCVSD), a novel method for distilling pretrained diffusion models into generative denoisers. We achieve this by revealing that the unconditional score function implicitly characterizes the score function of denoising posterior distributions. By integrating this insight into the Variational Score Distillation (VSD) framework, we enable scalable learning of generative denoisers capable of approximating samples from the denoising posterior distribution across a wide range of noise levels. The proposed generative denoisers exhibit desirable properties that allow fast generation while preserving the benefit of iterative refinement: (1) fast one-step generation through sampling from pure Gaussian noise at high noise levels; (2) improved sample quality by scaling the test-time compute with multi-step sampling; and (3) zero-shot probabilistic inference for flexible and controllable sampling. We evaluate NCVSD through extensive experiments, including class-conditional image generation and inverse problem solving. By scaling the test-time compute, our method outperforms teacher diffusion models and is on par with consistency models of larger sizes. Additionally, with significantly fewer NFEs than diffusion-based methods, we achieve record-breaking LPIPS on inverse problems.

ICLR Conference 2025 Conference Paper

On Disentangled Training for Nonlinear Transform in Learned Image Compression

Han Li
Shaohui Li
Wenrui Dai
Maida Cao
Nuowen Kan
Chenglin Li
Junni Zou
Hongkai Xiong

Learned image compression (LIC) has demonstrated superior rate-distortion (R-D) performance compared to traditional codecs, but is challenged by training inefficiency: training a state-of-the-art model from scratch can take more than two weeks. Existing LIC methods overlook the slow convergence caused by compacting energy in learning nonlinear transforms. In this paper, we first reveal that such energy compaction consists of two components, i.e., feature decorrelation and uneven energy modulation. On this basis, we propose a linear auxiliary transform (AuxT) to disentangle energy compaction in training nonlinear transforms. The proposed AuxT obtains a coarse approximation to achieve efficient energy compaction such that distribution fitting with the nonlinear transforms can be simplified to fine details. We then develop wavelet-based linear shortcuts (WLSs) for AuxT that leverage wavelet-based downsampling and orthogonal linear projection for feature decorrelation and subband-aware scaling for uneven energy modulation. AuxT is lightweight and plug-and-play, so it can be integrated into diverse LIC models to address the slow convergence issue. Experimental results demonstrate that the proposed approach can accelerate training of LIC models by 2 times and simultaneously achieve an average 1% BD-rate reduction. To the best of our knowledge, this is one of the first successful attempts to significantly improve the convergence of LIC with comparable or superior rate-distortion performance.
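The abstract does not say which wavelet the shortcuts use. As an illustration of wavelet-based downsampling in general, the sketch below applies one level of the orthonormal 2D Haar transform, which halves the resolution and splits the signal into four subbands without changing total energy. The function name `haar_downsample` and the choice of Haar are assumptions for this sketch.

```python
def haar_downsample(image):
    """One level of the orthonormal 2D Haar transform on a list-of-rows image.

    Returns four half-resolution subbands: (low, col_diff, row_diff, diag).
    """
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    low, col_diff, row_diff, diag = [], [], [], []
    for y in range(0, height, 2):
        rows = ([], [], [], [])
        for x in range(0, width, 2):
            a, b = image[y][x], image[y][x + 1]
            c, d = image[y + 1][x], image[y + 1][x + 1]
            rows[0].append((a + b + c + d) / 2)  # local average (coarse part)
            rows[1].append((a - b + c - d) / 2)  # left column minus right column
            rows[2].append((a + b - c - d) / 2)  # top row minus bottom row
            rows[3].append((a - b - c + d) / 2)  # diagonal detail
        low.append(rows[0])
        col_diff.append(rows[1])
        row_diff.append(rows[2])
        diag.append(rows[3])
    return low, col_diff, row_diff, diag


subbands = haar_downsample([[1, 2], [3, 4]])
energy_in = 1 + 4 + 9 + 16
energy_out = sum(band[0][0] ** 2 for band in subbands)
```

For natural images most of the energy typically lands in the low subband, which is why a fixed linear transform of this kind can already perform a coarse form of energy compaction.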

ICLR Conference 2025 Conference Paper

Towards Generalization Bounds of GCNs for Adversarially Robust Node Classification

Wen Wen 0013
Han Li
Tieliang Gong
Hong Chen 0004

Adversarially robust generalization of Graph Convolutional Networks (GCNs) has garnered significant attention in various security-sensitive application areas, driven by intrinsic adversarial vulnerability. Albeit remarkable empirical advancement, theoretical understanding of the generalization behavior of GCNs subjected to adversarial attacks remains elusive. To make progress on the mystery, we establish unified high-probability generalization bounds for GCNs in the context of node classification, by leveraging adversarial Transductive Rademacher Complexity (TRC) and developing a novel contraction technique on graph convolution. Our bounds capture the interaction between generalization error and adversarial perturbations, revealing the importance of key quantities in mitigating the negative effects of perturbations, such as low-dimensional feature projection, perturbation-dependent norm regularization, normalized graph matrix, proper number of network layers, etc. Furthermore, we provide TRC-based bounds of popular GCNs with $\ell_r$-norm-additive perturbations for arbitrary $r\geq 1$. A comparison of theoretical results demonstrates that specific network architectures (e.g., residual connection) can help alleviate the cumulative effect of perturbations during the forward propagation of deep GCNs. Experimental results on benchmark datasets validate our theoretical findings.

ECAI Conference 2025 Conference Paper

Upright Adjustment of Panoramic Images Based on 3D Coordinate Mapping Matrix

Han Li
Yilin Guo
Lei Zhong
Jianfeng Li 0003

Non-upright panoramic images often suffer from distortion due to camera tilt, which compromises the accuracy of downstream tasks. We propose a novel panoramic upright adjustment method based on 3D coordinate mapping estimation, which fundamentally reformulates the task from a 2D projection problem to a 3D unit spherical mapping problem. Our method employs an end-to-end neural network to directly generate an upright panoramic image from a non-upright input. The key innovation of our approach lies in the use of a 3D Coordinate Mapping Matrix (3D CMMatrix) instead of the traditional 2D CMMatrix. By leveraging the inherent 3D structure of panoramic images, our method effectively captures the spatial continuity of the entire spherical space, eliminating the discontinuity issues that arise at the edges of non-upright panoramic images when using 2D coordinate mapping. The network consists of an encoder that extracts tilt features from the non-upright image and transforms them into a 3D CMMatrix, and a decoder that gradually upsamples the 3D CMMatrix to match the resolution of the original image. This 3D-based approach not only resolves edge artifacts but also significantly improves the overall quality of the upright image. Experimental results demonstrate that our proposed method achieves state-of-the-art performance, outperforming existing methods.
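The edge discontinuity mentioned in the abstract can be seen in a small sketch: in 2D pixel coordinates the left and right borders of an equirectangular panorama are a full image width apart, while on the unit sphere they coincide. The function name `equirect_to_sphere` and the axis convention are assumptions for this sketch; the paper's 3D CMMatrix layout is not given in the abstract.

```python
import math


def equirect_to_sphere(u, v, width, height):
    """Map pixel (u, v) of an equirectangular image to a unit 3D vector.

    Assumed convention: u in [0, width] spans longitude -pi..pi and
    v in [0, height] spans latitude +pi/2..-pi/2.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - v / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


width, height = 1024, 512
center = equirect_to_sphere(width / 2, height / 2, width, height)
left = equirect_to_sphere(0, height / 2, width, height)
right = equirect_to_sphere(width, height / 2, width, height)

gap_2d = width - 0                # horizontal pixel distance between the borders
gap_3d = math.dist(left, right)   # distance between the same points on the sphere
```

A mapping regressed in 3D therefore has no seam at the image border, whereas a 2D coordinate map jumps by the full width there.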

NeurIPS Conference 2025 Conference Paper

Who You Are Matters: Bridging Interests and Social Roles via LLM-Enhanced Logic Recommendation

Qing Yu
Xiaobei Wang
Shuchang Liu
Xiaoyu Yang
Xueliang Wang
Chang Meng
Shanshan Wu
Bin Wen

Recommender systems filter contents/items valuable to users by inferring preferences from user features and historical behaviors. Mainstream approaches follow the learning-to-rank paradigm, which focuses on discovering and modeling item topics (e.g., categories), and capturing user preferences on these topics based on historical interactions. However, this paradigm often neglects the modeling of user characteristics and their social roles, which are logical confounders influencing the correlated interest and user preference transition. To bridge this gap, we introduce the user role identification task and the behavioral logic modeling task that aim to explicitly model user roles and learn the logical relations between item topics and user social roles. We show that it is possible to explicitly solve these tasks through an efficient integration framework of Large Language Model (LLM) and recommendation systems, for which we propose TagCF. On the one hand, TagCF exploits the (Multi-modal) LLM's world knowledge and logic inference ability to extract realistic tag-based virtual logic graphs that reveal dynamic and expressive knowledge of users, refining our understanding of user behaviors. On the other hand, TagCF presents empirically effective integration modules that take advantage of the extracted tag-logic information, augmenting the recommendation performance. We conduct both online experiments and offline experiments with industrial and public datasets as verification of TagCF's effectiveness, and we empirically show that the user role modeling strategy is potentially a better choice than the modeling of item topics. Additionally, we provide evidence that the extracted logic graphs are empirically a general and transferable knowledge that can benefit a wide range of recommendation tasks. Our code is available at https://github.com/Code2Q/TagCF.

EAAI Journal 2024 Journal Article

Building thermal dynamics modeling with deep transfer learning using a large residential smart thermostat dataset

Han Li
Giuseppe Pinto
Marco Savino Piscitelli
Alfonso Capozzoli
Tianzhen Hong

Understanding thermal dynamics and obtaining the computational model of residential buildings enable scaled application in energy retrofits, control optimization and decarbonization. In this paper, we present a deep learning approach to model building thermal dynamics with smart thermostat data collected from residential buildings, with the goal of investigating model generalizability. In the first stage, we developed and compared different Deep Learning architectures including Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) models and CNN-LSTM to predict indoor air temperature in a multi-step time horizon. In the second stage, we implemented a Transfer Learning (TL) process, which aims to improve the prediction performance on a new set of buildings (targets), exploiting the knowledge of related or similar buildings (sources). Different TL strategies and source model identification methods were investigated. The study showed that the CNN-LSTM performed the best among the architectures compared, with an average Mean Absolute Error (MAE) of 0.26 °C for one-hour-ahead (twelve 5-min future steps) predictions. Furthermore, the results showed that freezing the LSTM layer and fine-tuning the other layers of the CNN-LSTM achieved the best performance among four TL strategies, which further improved the performance with respect to a machine learning approach by 10%, proving the effectiveness and generalizability of the proposed approach. A comparison of three different source model identification methods showed that randomly selecting source models constrained by similar building characteristics can provide good TL performance while retaining simplicity compared with other quantitative source identification methods.

ICLR Conference 2024 Conference Paper

Frequency-Aware Transformer for Learned Image Compression

Han Li
Shaohui Li
Wenrui Dai
Chenglin Li
Junni Zou
Hongkai Xiong

Learned image compression (LIC) has gained traction as an effective solution for image storage and transmission in recent years. However, existing LIC methods are redundant in latent representation due to limitations in capturing anisotropic frequency components and preserving directional details. To overcome these challenges, we propose a novel frequency-aware transformer (FAT) block that for the first time achieves multiscale directional analysis for LIC. The FAT block comprises frequency-decomposition window attention (FDWA) modules to capture multiscale and directional frequency components of natural images. Additionally, we introduce a frequency-modulation feed-forward network (FMFFN) to adaptively modulate different frequency components, improving rate-distortion performance. Furthermore, we present a transformer-based channel-wise autoregressive (T-CA) model that effectively exploits channel dependencies. Experiments show that our method achieves state-of-the-art rate-distortion performance compared to existing LIC methods, and evidently outperforms the latest standardized codec VTM-12.1 by 14.5%, 15.1%, and 13.0% in BD-rate on the Kodak, Tecnick, and CLIC datasets, respectively.

ICLR Conference 2024 Conference Paper

Interpretable Diffusion via Information Decomposition

Xianghao Kong
Ollie Liu
Han Li
Dani Yogatama
Greg Ver Steeg

Denoising diffusion models enable conditional generation and density modeling of complex relationships like images and text. However, the nature of the learned relationships is opaque, making it difficult to understand precisely what relationships between words and parts of an image are captured, or to predict the effect of an intervention. We illuminate the fine-grained relationships learned by diffusion models by noticing a precise relationship between diffusion and information decomposition. Exact expressions for mutual information and conditional mutual information can be written in terms of the denoising model. Furthermore, pointwise estimates can be easily computed as well, allowing us to ask questions about the relationships between specific images and captions. Decomposing information even further to understand which variables in a high-dimensional space carry information is a long-standing problem. For diffusion models, we show that a natural non-negative decomposition of mutual information emerges, allowing us to quantify informative relationships between words and pixels in an image. We exploit these new relations to measure the compositional understanding of diffusion models, to do unsupervised localization of objects in images, and to measure effects when selectively editing images through prompt interventions.

IJCAI Conference 2024 Conference Paper

Optimal Auction Design with User Coupons in Advertising Systems

Xiaodong Liu
Zhikang Fan
Yiming Ding
Yuan Guo
Lihua Zhang
Changcheng Li
Dongying Kong
Han Li

Online advertising is a major revenue source for most Internet companies. The advertising opportunities are usually sold to advertisers through auctions that take into account the bids of the advertisers and the click-through rates (CTRs) and the conversion rates (CVRs) of the users. Standard auction design theory perceives both the CTRs and the CVRs as constants. We consider a new auction mechanism that offers coupons to users when displaying the ads. Such coupons allow the user to buy the advertisers' products or services at a lower price, which increases both the CTRs and the CVRs of the ads. In this paper, we formulate the problem mathematically and perform a systematic analysis. We characterize the set of individually rational and incentive compatible mechanisms in our setting. Based on the characterization, we identify the optimal strategy of offering coupons that maximizes the platform's expected revenue. We also conduct extensive experiments on both synthetic data and industrial data. Our experiment results show that our mechanism significantly improves both the revenue and welfare of the platform, thereby creating a win-win situation for all parties including the platform, the advertisers, and the user.

ICRA Conference 2024 Conference Paper

RadarCam-Depth: Radar-Camera Fusion for Depth Estimation with Learned Metric Scale

Han Li
Yukai Ma
Yaqing Gu
Kewei Hu
Yong Liu 0007
Xingxing Zuo 0001

We present a novel approach for metric dense depth estimation based on the fusion of a single-view image and a sparse, noisy Radar point cloud. The direct fusion of heterogeneous Radar and image data, or their encodings, tends to yield dense depth maps with significant artifacts, blurred boundaries, and suboptimal accuracy. To circumvent this issue, we learn to augment versatile and robust monocular depth prediction with the dense metric scale induced from sparse and noisy Radar data. We propose a Radar-Camera framework for highly accurate and fine-detailed dense depth estimation with four stages, including monocular depth prediction, global scale alignment of monocular depth with sparse Radar points, quasi-dense scale estimation through learning the association between Radar points and image patches, and local scale refinement of dense depth using a scale map learner. Our proposed method significantly outperforms the state-of-the-art Radar-Camera depth estimation methods by reducing the mean absolute error (MAE) of depth estimation by 25.6% and 40.2% on the challenging nuScenes dataset and our self-collected ZJU-4DRadarCam dataset, respectively. Our code and dataset will be released at https://github.com/MMOCKING/RadarCam-Depth.
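
As a rough illustration of the global scale alignment stage mentioned in this abstract, the sketch below fits a single scale factor that maps relative monocular depths onto sparse radar depths by least squares. It is a minimal sketch under assumptions: the helper names `fit_global_scale` and `apply_scale`, the closed-form least-squares objective, and the sample values are all hypothetical and are not taken from the paper, whose actual alignment procedure may differ.

```python
from typing import List, Sequence


def fit_global_scale(mono_depth: Sequence[float], radar_depth: Sequence[float]) -> float:
    """Return the scale s that minimizes sum((s * m_i - r_i) ** 2).

    Hypothetical helper: mono_depth holds monocular depth predictions sampled
    at the pixels where radar points project; radar_depth holds radar ranges.
    """
    if len(mono_depth) != len(radar_depth) or not mono_depth:
        raise ValueError("need two non-empty sequences of equal length")
    num = sum(m * r for m, r in zip(mono_depth, radar_depth))
    den = sum(m * m for m in mono_depth)
    if den == 0.0:
        raise ValueError("monocular depths are all zero")
    return num / den


def apply_scale(depth_map: List[List[float]], scale: float) -> List[List[float]]:
    """Multiply every pixel of a row-major depth map by the fitted scale."""
    return [[scale * d for d in row] for row in depth_map]


# Illustrative values only: relative depths and matching radar ranges in metres.
mono = [0.5, 1.0, 2.0]
radar = [5.0, 10.0, 20.0]
scale = fit_global_scale(mono, radar)
metric = apply_scale([[0.5, 1.0], [2.0, 4.0]], scale)
```

A single global factor cannot correct errors that vary across the image, which is consistent with the abstract following this stage with quasi-dense and local scale refinement.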

JBHI Journal 2024 Journal Article

Time-Frequency-Space EEG Decoding Model Based on Dense Graph Convolutional Network for Stroke

Jiancai Leng
Han Li
Weiyou Shi
Licai Gao
Chengyan Lv
Chen Wang
Fangzhou Xu
Yang Zhang

Stroke, a sudden cerebrovascular ailment resulting from brain tissue damage, has prompted the use of motor imagery (MI)-based Brain-Computer Interface (BCI) systems in stroke rehabilitation. However, analyzing EEG signals from stroke patients is challenging because of their low signal-to-noise ratio and high variability. Therefore, we propose a novel approach that combines the modified S-transform (MST) and a dense graph convolutional network (DenseGCN) algorithm to enhance the MI-BCI performance across time, frequency, and space domains. MST is a time-frequency analysis method that efficiently concentrates energy in EEG signals, while DenseGCN is a deep learning model that uses EEG feature maps from each layer as inputs for subsequent layers, facilitating feature reuse and hyper-parameter optimization. Our approach outperforms conventional networks, achieving a peak classification accuracy of 90.22% and an average information transfer rate (ITR) of 68.52 bits per minute. Moreover, we conduct an in-depth analysis of the event-related desynchronization/event-related synchronization (ERD/ERS) phenomenon in the deep-level EEG features of stroke patients. Our experimental results confirm the feasibility and efficacy of the proposed approach for MI-BCI rehabilitation systems.

AAAI Conference 2024 Conference Paper

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

Kaibin Tian
Yanhua Cheng
Yi Liu
Xinglin Hou
Quan Chen
Han Li

In recent years, text-to-video retrieval methods based on CLIP have experienced rapid development. The primary direction of evolution is to exploit the much wider gamut of visual and textual cues to achieve alignment. Concretely, those methods with impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, regardless of the prohibitive computation complexity. Nevertheless, these approaches are not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, ensuring the model's comprehensiveness in capturing visual content features spanning from abstract to detailed levels during the training phase. To better leverage the multi-granularity features, we devise a two-stage retrieval architecture in the retrieval phase. This solution balances the coarse and fine granularity of retrieval content, and it also balances retrieval effectiveness and efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson Constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations. Extensive experiments on four benchmarks demonstrate the efficiency and effectiveness. Notably, our method achieves comparable performance with the current state-of-the-art methods while being nearly 50 times faster.
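
The two-stage retrieval idea in this abstract (fast recall of top-k candidates with coarse representations, then reranking with fine-grained ones) can be sketched roughly as follows. This is only an illustrative sketch: the function `two_stage_retrieve`, the use of cosine similarity, the "best matching frame" fine score, and the toy vectors are assumptions made here, and they stand in for the paper's learned representations and text-gated interaction block rather than reproducing them.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def two_stage_retrieve(
    text_vec: Sequence[float],
    coarse_vecs: List[Sequence[float]],
    fine_vecs: List[List[Sequence[float]]],
    k: int,
    top_n: int,
) -> List[int]:
    """Return video indices: coarse recall of k candidates, then fine rerank."""
    # Stage 1: cheap recall using one coarse vector per video.
    ranked = sorted(
        range(len(coarse_vecs)),
        key=lambda i: cosine(text_vec, coarse_vecs[i]),
        reverse=True,
    )
    candidates = ranked[:k]

    # Stage 2: rerank only the candidates with per-frame vectors.
    # Assumption: the fine score is the best frame-level cosine similarity.
    def fine_score(i: int) -> float:
        return max(cosine(text_vec, frame) for frame in fine_vecs[i])

    return sorted(candidates, key=fine_score, reverse=True)[:top_n]


# Toy data: three videos, 2-D embeddings (hypothetical values).
text = [1.0, 0.0]
coarse = [[1.0, 0.2], [1.0, 0.5], [0.0, 1.0]]
fine = [
    [[1.0, 0.3], [1.0, 0.4]],  # video 0
    [[1.0, 0.0], [0.0, 1.0]],  # video 1 has one frame matching the text exactly
    [[1.0, 0.0]],              # video 2 is never reranked: it fails coarse recall
]
result = two_stage_retrieve(text, coarse, fine, k=2, top_n=2)
```

The expensive fine-grained scoring runs on only k videos instead of the whole collection, which is where the efficiency gain of a coarse-to-fine design comes from; the cost is that a video missed by the coarse stage cannot be recovered.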

IJCAI Conference 2024 Conference Paper

Towards Sharper Generalization Bounds for Adversarial Contrastive Learning

Wen Wen
Han Li
Tieliang Gong
Hong Chen

Recently, enhancing the adversarial robustness of machine learning algorithms has gained significant attention across various application domains. Given the widespread label scarcity issue in real-world data, adversarial contrastive learning (ACL) has been proposed to adversarially train robust models using unlabeled data. Despite the empirical success, its generalization behavior remains poorly understood and far from being well-characterized. This paper aims to address this issue from a learning theory perspective. We establish novel high-probability generalization bounds for general Lipschitz loss functions. The derived bounds scale as O(log(k)) with respect to the number of negative samples k, which improves the existing bounds with linear dependency. Our results are generally applicable to many prediction models, including linear models and deep neural networks. In particular, we obtain an optimistic generalization bound of O(1/n) in the sample size n under the smoothness assumption on the loss function. To the best of our knowledge, this is the first fast-rate bound valid for ACL. Empirical evaluations on real-world datasets verify our theoretical findings.

IJCAI Conference 2023 Conference Paper

Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation

Shaofei Huang
Han Li
Yuqing Wang
Hongji Zhu
Jiao Dai
Jizhong Han
Wenge Rong
Si Liu

Audio visual segmentation (AVS) aims to segment the sounding objects for each frame of a given video. To distinguish the sounding objects from silent ones, both audio-visual semantic correspondence and temporal interaction are required. The previous method applies multi-frame cross-modal attention to conduct pixel-level interactions between audio features and visual features of multiple frames simultaneously, which is both redundant and implicit. In this paper, we propose an Audio-Queried Transformer architecture, AQFormer, where we define a set of object queries conditioned on audio information and associate each of them to particular sounding objects. Explicit object-level semantic correspondence between audio and visual modalities is established by gathering object information from visual features with predefined audio queries. Besides, an Audio-Bridged Temporal Interaction module is proposed to exchange sounding object-relevant information among multiple frames with the bridge of audio features. Extensive experiments are conducted on two AVS benchmarks to show that our method achieves state-of-the-art performances, especially 7.1% M_J and 7.6% M_F gains on the MS3 setting.

IJCAI Conference 2023 Conference Paper

Generalization Bounds for Adversarial Metric Learning

Wen Wen
Han Li
Hong Chen
Rui Wu
Lingjuan Wu
Liangxuan Zhu

Recently, adversarial metric learning has been proposed to enhance the robustness of the learned distance metric against adversarial perturbations. Despite rapid progress in validating its effectiveness empirically, theoretical guarantees on adversarial robustness and generalization are far less understood. To fill this gap, this paper focuses on unveiling the generalization properties of adversarial metric learning by developing the uniform convergence analysis techniques. Based on the capacity estimation of covering numbers, we establish the first high-probability generalization bounds with order O(n^{-1/2}) for adversarial metric learning with pairwise perturbations and general losses, where n is the number of training samples. Moreover, we obtain the refined generalization bounds with order O(n^{-1}) for the smooth loss by using local Rademacher complexity, which is faster than the previous result of adversarial pairwise learning, e.g., adversarial bipartite ranking. Experimental evaluation on real-world datasets validates our theoretical findings.

AAAI Conference 2023 Conference Paper

Pose-Oriented Transformer with Uncertainty-Guided Refinement for 2D-to-3D Human Pose Estimation

Han Li
Bowen Shi
Wenrui Dai
Hongwei Zheng
Botao Wang
Yu Sun
Min Guo
Chenglin Li

There has been a recent surge of interest in introducing transformers to 3D human pose estimation (HPE) due to their powerful capabilities in modeling long-term dependencies. However, existing transformer-based methods treat body joints as equally important inputs and ignore the prior knowledge of human skeleton topology in the self-attention mechanism. To tackle this issue, in this paper, we propose a Pose-Oriented Transformer (POT) with uncertainty-guided refinement for 3D HPE. Specifically, we first develop a novel pose-oriented self-attention mechanism and a distance-related position embedding for POT to explicitly exploit the human skeleton topology. The pose-oriented self-attention mechanism explicitly models the topological interactions between body joints, whereas the distance-related position embedding encodes the distance of joints to the root joint to distinguish groups of joints with different difficulties in regression. Furthermore, we present an Uncertainty-Guided Refinement Network (UGRN) to refine pose predictions from POT, especially for the difficult joints, by considering the estimated uncertainty of each joint with an uncertainty-guided sampling strategy and a self-attention mechanism. Extensive experiments demonstrate that our method significantly outperforms the state-of-the-art methods with reduced model parameters on 3D HPE benchmarks such as Human3.6M and MPI-INF-3DHP.

ICRA Conference 2023 Conference Paper

RoLM: Radar on LiDAR Map Localization

Yukai Ma
Xiangrui Zhao
Han Li
Yaqing Gu
Xiaolei Lang
Yong Liu 0007

Multi-sensor fusion-based localization technology has achieved high accuracy in autonomous systems. How to improve the robustness is the main challenge at present. The most commonly used LiDAR and camera are weather-sensitive, while the FMCW radar has strong adaptability but suffers from noise and ghost effects. In this paper, we propose a heterogeneous localization method of Radar on LiDAR Map (RoLM), which can eliminate the accumulated error of radar odometry in real-time to achieve higher localization accuracy without dependence on loop closures. We embed the two sensor modalities into a density map and calculate the spatial vector similarity with offset to seek the corresponding place index in the candidates and calculate the rotation and translation. We use the ICP to pursue perfect matching on the LiDAR submap based on the coarse alignment. Extensive experiments on Mulran Radar Dataset, Oxford Radar RobotCar Dataset, and our data verify the feasibility and effectiveness of our approach.

IROS Conference 2023 Conference Paper

Towards Safe and Aggressive Motion Generation for Dynamic Targets Pick-and-Place

Jun Shao
Jianfeng Liao
Han Li
Haoyang Zhang
Shiqiang Zhu
Wei Song 0008
Yinchun Huang

In this paper, we present a framework to generate time-optimal trajectories for dynamic target pick-and-place tasks. We develop an optimization-based trajectory generation method for manipulators, which can conduct spatial-temporal deformation under user-defined requirements. We formulate the problem of dynamic target pick-and-place, in which the trajectory duration and jerk are optimized and terminal states are adjusted instead of being fixed. The motions are constrained to stay within the mechanical limits and to avoid collisions. Constraint transcription is adopted to convert constraints to weighted penalties. Then the problem can be solved based on the trajectory generation method with a high-level optimizer. We integrate the proposed method with online perception into a robot arm platform, in which a conveyor belt is used to transport the objects. Simulations and real-world experiments are conducted under a range of object speeds. Results show that the proposed method achieves online grasping at object velocities up to 0.5 m/s with an average computing time of 190 ms.

NeurIPS Conference 2021 Conference Paper

Pareto Domain Adaptation

Fangrui Lv
Jian Liang
Kaixiong Gong
Shuang Li
Chi Harold Liu
Han Li
Di Liu
Guoren Wang

Domain adaptation (DA) attempts to transfer the knowledge from a labeled source domain to an unlabeled target domain that follows a different distribution from the source. To achieve this, DA methods include a source classification objective to extract the source knowledge and a domain alignment objective to diminish the domain shift, ensuring knowledge transfer. Typically, former DA methods adopt some weight hyper-parameters to linearly combine the training objectives to form an overall objective. However, the gradient directions of these objectives may conflict with each other due to domain shift. Under such circumstances, the linear optimization scheme might decrease the overall objective value at the expense of damaging one of the training objectives, leading to restricted solutions. In this paper, we rethink the optimization scheme for DA from a gradient-based perspective. We propose a Pareto Domain Adaptation (ParetoDA) approach to control the overall optimization direction, aiming to cooperatively optimize all training objectives. Specifically, to reach a desirable solution on the target domain, we design a surrogate loss mimicking target classification. To improve target-prediction accuracy to support the mimicking, we propose a target-prediction refining mechanism which exploits domain labels via Bayes’ theorem. On the other hand, since prior knowledge of weighting schemes for objectives is often unavailable to guide optimization to approach the optimal solution on the target domain, we propose a dynamic preference mechanism to dynamically guide our cooperative optimization by the gradient of the surrogate loss on a held-out unlabeled target dataset. Our theoretical analyses show that the held-out data can guide but will not be over-fitted by the optimization. Extensive experiments on image classification and semantic segmentation benchmarks demonstrate the effectiveness of ParetoDA.

EAAI Journal 2020 Journal Article

An adaptive switchover hybrid particle swarm optimization algorithm with local search strategy for constrained optimization problems

Zhao Liu
Zhiwei Qin
Ping Zhu
Han Li

Practical engineering optimization problems are almost always constrained optimization problems and are difficult to solve effectively. How to handle these problems has therefore attracted more and more attention. Particle swarm optimization (PSO) is one of the most popular algorithms for solving complicated optimization problems due to its relatively strong global optimization capability and low requirement for computing resources. However, like other swarm intelligence algorithms, PSO tends to converge prematurely due to the loss of diversity among particles. This article proposes an adaptive switchover hybrid PSO framework with local search process (ASHPSO), which adaptively switches the optimization searching process between the standard PSO and the differential evolution (DE) modified by a full dimension crossover strategy to avoid the premature convergence problem. Moreover, a local search strategy is employed to improve the boundary search capability of PSO in consideration of the characteristics of engineering problems. Experiments on 28 well-known benchmark functions, 5 engineering problems and a full vehicle multi-disciplinary optimization problem demonstrate the effectiveness of the proposed algorithm compared with other hybrid variants.
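
For readers unfamiliar with the baseline that this abstract builds on, the sketch below shows a plain global-best PSO loop on an unconstrained test function. It is not ASHPSO: it has no switchover to differential evolution, no local search, and no constraint handling. The function names, the parameter values (`w`, `c1`, `c2`), and the sphere objective are assumptions chosen here for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple


def sphere(x: Sequence[float]) -> float:
    """Simple test objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


def pso(
    f: Callable[[Sequence[float]], float],
    dim: int,
    bounds: Tuple[float, float],
    n_particles: int = 20,
    iters: int = 100,
    w: float = 0.7,
    c1: float = 1.5,
    c2: float = 1.5,
    seed: int = 0,
) -> Tuple[List[float], float, List[float]]:
    """Standard global-best PSO; returns best position, best value, history."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    history = [gbest_val]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (
                    w * vel[i][d]
                    + c1 * r1 * (pbest[i][d] - pos[i][d])
                    + c2 * r2 * (gbest[d] - pos[i][d])
                )
                # Clamp the new position to the box bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
        history.append(gbest_val)
    return gbest, gbest_val, history


best_x, best_val, history = pso(sphere, dim=2, bounds=(-5.0, 5.0))
```

Because every particle is pulled toward the same global best, the swarm can collapse onto one region early, which is the loss of diversity and premature convergence that the abstract cites as motivation for a hybrid scheme.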

IJCAI Conference 2020 Conference Paper

Learning to Accelerate Heuristic Searching for Large-Scale Maximum Weighted b-Matching Problems in Online Advertising

Xiaotian Hao
Junqi Jin
Jianye Hao
Jin Li
Weixun Wang
Yi Ma
Zhenzhe Zheng
Han Li

Bipartite b-matching is fundamental in algorithm design, and has been widely applied in diverse applications, such as economic markets, labor markets, etc. These practical problems usually exhibit two distinct features: they are large-scale and dynamic, which requires the matching algorithm to be repeatedly executed at regular intervals. However, existing exact and approximate algorithms usually fail in such settings because they require either intolerable running time or too many computation resources. To address this issue, based on a key observation that the matching instances do not vary much, we propose NeuSearcher, which leverages the knowledge learned from previous instances to solve new problem instances. Specifically, we design a multichannel graph neural network to predict the threshold of the matched edges, by which the search region could be significantly reduced. We further propose a parallel heuristic search algorithm to iteratively improve the solution quality until convergence. Experiments on both open and industrial datasets demonstrate that NeuSearcher can speed up 2 to 3 times while achieving exactly the same matching solution compared with the state-of-the-art approximation approaches.

NeurIPS Conference 2019 Conference Paper

Joint Optimization of Tree-based Index and Deep Model for Recommender Systems

Han Zhu
Daqing Chang
Ziru Xu
Pengye Zhang
Xiang Li
Jie He
Han Li
Jian Xu

Large-scale industrial recommender systems are usually confronted with computational problems due to the enormous corpus size. To retrieve and recommend the most relevant items to users under response time limits, resorting to an efficient index structure is an effective and practical solution. The previous work Tree-based Deep Model (TDM) (Zhu et al., 2018) greatly improves recommendation accuracy using a tree index. By indexing items in a tree hierarchy and training a user-node preference prediction model satisfying a max-heap-like property in the tree, TDM provides logarithmic computational complexity w.r.t. the corpus size, enabling the use of arbitrary advanced models in candidate retrieval and recommendation. In tree-based recommendation methods, the quality of both the tree index and the user-node preference prediction model determines the recommendation accuracy for the most part. We argue that the learning of the tree index and that of the preference model are interdependent. Our purpose, in this paper, is to develop a method to jointly learn the index structure and user preference prediction model. In our proposed joint optimization framework, the learning of the index and the user preference prediction model is carried out under a unified performance measure. Besides, we come up with a novel hierarchical user preference representation utilizing the tree index hierarchy. Experimental evaluations with two large-scale real-world datasets show that the proposed method improves recommendation accuracy significantly. Online A/B test results at a display advertising platform also demonstrate the effectiveness of the proposed method in production environments.

IROS Conference 2006 Conference Paper

Avoiding Static and Dynamic Objects in Navigation

Han Li
Yili Fu
He Xu
Yulin Ma

Real-time collision-free path planning involves avoidance of static as well as dynamic objects in an unknown environment. Strategies suitable for navigation in a static environment are not necessarily suitable for a dynamic environment. This paper describes behavior-based control combined with fuzzy control for avoiding dynamic and static obstacles. Behavior-based control helps the robot get through a complex static environment or avoid dynamic objects according to the collision situation. Double-layered fuzzy logic control helps determine the velocity and steering angle of the robot based on uncertain information. The method has been tested effectively in simulation with a mobile robot navigating amidst multiple static and dynamic obstacles.