Arrow Research search

Author name cluster

Hao Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

116 papers
2 author rows

Possible papers (116)

AAAI Conference 2026 Conference Paper

Agentmandering: A Game-Theoretic Framework for Fair Redistricting via Large Language Model Agents

  • Hao Li
  • Haotian Chen
  • Ruoyuan Gong
  • Juanjuan Wang
  • Hao Jiang

Redistricting plays a central role in shaping how votes are translated into political power. While existing computational methods primarily aim to generate large ensembles of legally valid districting plans, they often neglect the strategic dynamics involved in the selection process. This oversight creates opportunities for partisan actors to cherry-pick maps that, while technically compliant, are politically advantageous. Simply satisfying formal constraints does not ensure fairness when the selection process itself can be manipulated. We propose Agentmandering, a framework that reimagines redistricting as a turn-based negotiation between two agents representing opposing political interests. Drawing inspiration from game-theoretic ideas, particularly the Choose-and-Freeze protocol, our method embeds strategic interaction into the redistricting process via large language model (LLM) agents. Agents alternate between selecting and freezing districts from a small set of candidate maps, gradually partitioning the state through constrained and interpretable choices. Evaluation on post-2020 U.S. Census data across all states shows that Agentmandering significantly reduces partisan bias and unfairness, while achieving 2 to 3 orders of magnitude lower variance than standard baselines. These results demonstrate both fairness and stability, especially in swing-state scenarios.
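
The turn-based protocol lends itself to a compact sketch. The toy loop below illustrates the alternating choose-and-freeze structure described above; `propose_maps` and the agent policies are invented placeholders for the paper's redistricting sampler and LLM agents, not its actual implementation.

```python
import random

def choose_and_freeze(num_districts, propose_maps, agent_policies):
    """Toy version of a turn-based choose-and-freeze negotiation.

    propose_maps(frozen) -> small list of candidate districts for the
    remaining territory; agent_policies[i](candidates) -> chosen index.
    Both are placeholders for the sampler and LLM agents in the paper.
    """
    frozen = []
    turn = 0
    while len(frozen) < num_districts:
        candidates = propose_maps(frozen)            # small candidate set
        pick = agent_policies[turn % 2](candidates)  # agents alternate turns
        frozen.append(candidates[pick])              # the choice is frozen
        turn += 1
    return frozen

# Hypothetical usage: districts are just labels, agents pick at random.
random.seed(0)
plan = choose_and_freeze(
    num_districts=4,
    propose_maps=lambda frozen: [f"district-{len(frozen)}-opt{i}" for i in range(3)],
    agent_policies=[lambda c: random.randrange(len(c))] * 2,
)
print(plan)
```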

AAAI Conference 2026 Conference Paper

CADTrack: Learning Contextual Aggregation with Deformable Alignment for Robust RGBT Tracking

  • Hao Li
  • Yuhao Wang
  • Xiantao Hu
  • Wenning Hao
  • Pingping Zhang
  • Dong Wang
  • Huchuan Lu

RGB-Thermal (RGBT) tracking aims to exploit visible and thermal infrared modalities for robust all-weather object tracking. However, existing RGBT trackers struggle to resolve modality discrepancies, which poses great challenges for robust feature representation. This limitation hinders effective cross-modal information propagation and fusion, which significantly reduces tracking accuracy. To address this limitation, we propose a novel Contextual Aggregation with Deformable Alignment framework, called CADTrack, for RGBT tracking. Specifically, we first deploy a Mamba-based Feature Interaction (MFI) module that establishes efficient feature interaction via state space models. This interaction module operates with linear complexity, reducing computational cost and improving feature discrimination. Then, we propose the Contextual Aggregation Module (CAM), which dynamically activates backbone layers through sparse gating based on a Mixture-of-Experts (MoE) design and encodes complementary contextual information from cross-layer features. Finally, we propose the Deformable Alignment Module (DAM), which integrates deformable sampling and temporal propagation to mitigate spatial misalignment and localization drift. With these components, CADTrack achieves robust and accurate tracking in complex scenarios. Extensive experiments on five RGBT tracking benchmarks verify the effectiveness of the proposed method.
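
The sparse-gating idea behind the CAM can be shown generically. Below is a minimal top-k mixture-of-experts gate in NumPy; the random gating matrix and identity-scaled experts are stand-ins for illustration, not CADTrack's learned components.

```python
import numpy as np

def sparse_moe_gating(x, expert_weights, k=2, seed=0):
    """Toy top-k sparse gating in the spirit of an MoE-based module.

    x: (d,) feature vector; expert_weights: list of (d, d) expert matrices.
    Only the k experts with the highest gate scores are evaluated, and their
    outputs are combined with renormalized softmax weights. The gating
    matrix is random here; a real module would learn it.
    """
    rng = np.random.default_rng(seed)
    gate = rng.standard_normal((len(expert_weights), x.shape[0])) @ x  # gate logits
    top = np.argsort(gate)[-k:]                 # indices of the k active experts
    w = np.exp(gate[top] - gate[top].max())
    w /= w.sum()                                # renormalized softmax over top-k
    return sum(wi * (expert_weights[i] @ x) for wi, i in zip(w, top))

d = 8
experts = [np.eye(d) * (i + 1) for i in range(4)]   # placeholder experts
print(sparse_moe_gating(np.ones(d), experts).shape)  # (8,)
```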

AAAI Conference 2026 Conference Paper

DcSplat: Dual-Constraint Human Gaussian Splatting with Latent Multi-View Consistency

  • Tengfei Xiao
  • Yue Wu
  • Zhigang Gao
  • Yongzhe Yuan
  • Can Qin
  • Hao Li
  • Mingyang Zhang

Human Novel View Synthesis (HNVS) aims to synthesize photorealistic human images from novel viewpoints given observations from known views. Despite significant advances achieved by existing methods such as NeRF, diffusion models, and 3DGS, they still face substantial challenges in achieving stable modeling from a single image. In this paper, we introduce Dual-Constraint Human Gaussian Splatting (DcSplat), a novel, simple, and efficient 3D Gaussian-based framework for single-view 3D human reconstruction. To address occlusion-induced texture missing and depth ambiguities, we introduce two key components: a Latent Multi-View Consistency Constraint Mechanism and a Geometric Constraint Module. The former employs a Latent-space Appearance Transformer (LatentFormer) to learn semantically coherent, view-consistent appearance priors via SMPL-guided pseudo-view fusion. The latter refines noisy SMPL-based depth through a U-Net-like structure conditioned on latent appearance features. These two modules are jointly optimized to generate high-quality Gaussian parameters in a unified latent space. Extensive experiments demonstrate that DcSplat outperforms existing SOTA methods in both geometry and texture quality, while achieving fast inference and lower computational cost.

AAAI Conference 2026 Conference Paper

FDP: A Frequency-Decomposition Preprocessing Pipeline for Unsupervised Anomaly Detection in Brain MRI

  • Hao Li
  • Zhenfeng Zhuang
  • Jingyu Lin
  • Yu Liu
  • Yifei Chen
  • Qiong Peng
  • Lequan Yu
  • Liansheng Wang

Due to the diversity of brain anatomy and the scarcity of annotated data, supervised anomaly detection for brain MRI remains challenging, driving the development of unsupervised anomaly detection (UAD) approaches. Current UAD methods typically utilize synthetically generated noise perturbations on healthy MRIs to train generative models for normal anatomy reconstruction, enabling anomaly detection via residual maps. However, such simulated anomalies lack the biophysical fidelity and morphological complexity characteristic of true clinical lesions. To advance UAD in brain MRI, we conduct the first systematic frequency-domain analysis of pathological signatures, revealing two key properties: (1) anomalies exhibit unique frequency patterns distinguishable from normal anatomy, and (2) low-frequency signals maintain consistent representations across healthy scans. These insights motivate our Frequency-Decomposition Preprocessing (FDP) framework—the first UAD method to leverage frequency-domain reconstruction for simultaneous pathology suppression and anatomical preservation. FDP can integrate seamlessly with existing anomaly simulation techniques, consistently enhancing detection performance across diverse architectures while maintaining diagnostic fidelity. Experimental results demonstrate that FDP consistently improves anomaly detection performance when integrated with existing methods. Notably, FDP achieves a 17.63% increase in DICE score with LDM while maintaining robust improvements across multiple baselines.
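
The frequency split at the heart of FDP can be approximated with a plain Fourier low-pass. The sketch below separates a 2D slice into low- and high-frequency bands with a radial mask; the paper's actual filter design and cutoff may differ.

```python
import numpy as np

def frequency_decompose(image, radius=8):
    """Split a 2D slice into low- and high-frequency components with a
    radial mask in Fourier space. A generic illustration of frequency
    decomposition, not the paper's exact preprocessing pipeline.
    """
    f = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    low = np.fft.ifft2(np.fft.ifftshift(f * mask)).real
    high = image - low
    return low, high

slice_ = np.random.rand(64, 64)          # stand-in for a brain MRI slice
low, high = frequency_decompose(slice_)
print(np.allclose(low + high, slice_))   # True: the two bands sum to the input
```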

JBHI Journal 2026 Journal Article

Few-Shot Class-Incremental Learning With Dynamic Prototype Refinement for Brain Activity Classification

  • Lei Cao
  • Hao Li
  • Yilin Dong
  • Tianyu Liu
  • Jie Li

The brain-computer interface (BCI) system facilitates efficient communication and control, with Electroencephalography (EEG) signals as a vital component. Traditional EEG signal classification, based on static deep-learning models, presents a challenge when new classes of the subject’s brain activity emerge. The goal is to develop a model that can recognize new few-shot classes while preserving its ability to discriminate between existing ones. This scenario is referred to as Few-Shot Class-Incremental Learning (FSCIL). This work introduces IncrementEEG, a novel framework meticulously designed to tackle the distinct challenges of FSCIL in EEG-based brain activity classification, focusing specifically on emotion recognition and steady-state visual evoked potential (SSVEP). Our work analyzes the role of additive angular margin loss in improving the model’s discrimination capabilities. The proposed method is designed to demonstrate robustness in open-world conditions and adaptability to new tasks. Furthermore, we introduce a prototype refinement module comprising a prototype augmentation block and an update block. The prototype augmentation block in the deep feature space preserves the decision boundary for prior tasks, and the prototype update block utilizes a shared embedding space to compute the relation matrix for bootstrapping prototype updates. Extensive experiments conducted across multiple datasets show the superior performance of the IncrementEEG framework compared to state-of-the-art methods. The proposed method advances FSCIL brain activity classification, offering promising potential for applications in Brain-Computer Interface systems.
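
For readers unfamiliar with the additive angular margin loss the abstract analyzes, a minimal NumPy version of the logit computation is shown below (ArcFace-style); the margin and scale values are illustrative defaults, not the paper's settings.

```python
import numpy as np

def angular_margin_logits(features, class_centers, labels, margin=0.5, scale=30.0):
    """Additive angular margin logits. Rows of both inputs are L2-normalized
    so their dot products are cosines; the margin is added to the angle of
    each sample's ground-truth class before rescaling. The result is fed
    into a standard softmax cross-entropy loss.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = class_centers / np.linalg.norm(class_centers, axis=1, keepdims=True)
    cos = np.clip(f @ c.T, -1.0, 1.0)
    idx = np.arange(len(labels))
    cos[idx, labels] = np.cos(np.arccos(cos[idx, labels]) + margin)
    return scale * cos

rng = np.random.default_rng(0)
logits = angular_margin_logits(rng.standard_normal((4, 16)),
                               rng.standard_normal((3, 16)),
                               labels=np.array([0, 1, 2, 0]))
print(logits.shape)  # (4, 3)
```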

AAAI Conference 2026 Conference Paper

From Sampling to Cognition: Modeling Internal Cognitive Confidence in Language Models for Robust Uncertainty Calibration

  • Hao Li
  • Tao He
  • Jiafeng Liang
  • Zheng Chu
  • Ming Liu

Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet they generally lack self-awareness, often displaying overconfidence when confronted with questions beyond their knowledge boundaries. This limitation severely hinders their trustworthiness in high-stakes scenarios. Existing calibration methods typically rely on sampling accuracy, derived from multiple outputs, as a proxy for model confidence. However, this coarse-grained metric fails to capture the model's internal cognitive states, such as confusion, hallucination, or persistent belief in false knowledge. To address this, we propose CogConf (Cognitive Confidence), a cognitively grounded uncertainty signal that extends sampling accuracy by incorporating the semantic diversity of incorrect answers and the model's abstention behaviors. By shifting the focus from sampling-based to cognition-oriented uncertainty modeling, CogConf offers a more faithful reflection of the model's internal beliefs. Building on this signal, we introduce CogAlign, a simple yet effective alignment framework that explicitly aligns the model's verbalized confidence with CogConf, thereby producing uncertainty estimates that better reflect the model's internal cognition. Experimental results on six knowledge-intensive in-domain and out-of-domain QA datasets demonstrate that CogConf robustly characterizes the model's internal uncertainty. Building on this foundation, CogAlign guides the model's expression to significantly enhance the trustworthiness and utility of its uncertainty calibration without compromising its underlying QA capabilities, while also demonstrating strong cross-task generalization and output stability, offering a new pathway toward building more trustworthy LLMs.
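
As a loose illustration of the idea (not the paper's formula), the sketch below extends plain sampling accuracy with the two extra signals the abstract names: the diversity of incorrect answers and the abstention rate. All weights are arbitrary assumptions.

```python
def cognitive_confidence(samples, reference, abstain="[IDK]"):
    """Illustrative cognition-oriented confidence: start from sampling
    accuracy, down-weight it when wrong answers are concentrated (a sign of
    entrenched false belief), and treat abstentions as explicit uncertainty
    rather than ordinary errors. Weights below are made up.
    """
    n = len(samples)
    acc = sum(s == reference for s in samples) / n
    wrong = [s for s in samples if s not in (reference, abstain)]
    # diversity in [0, 1]: 1 = every wrong answer distinct (confusion),
    # 0 = one repeated wrong answer (persistent false knowledge).
    diversity = (len(set(wrong)) - 1) / (len(wrong) - 1) if len(wrong) > 1 else 1.0
    abstain_rate = sum(s == abstain for s in samples) / n
    return acc * (0.5 + 0.5 * diversity) + 0.25 * abstain_rate * (1 - acc)

print(cognitive_confidence(["Paris", "Paris", "Lyon", "[IDK]"], "Paris"))
```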

AAAI Conference 2026 Conference Paper

Hybrid Vector-Occupancy Field for Robust Implicit 3D Surface Reconstruction

  • Yue Wu
  • Zhigang Gao
  • Tengfei Xiao
  • Can Qin
  • Yongzhe Yuan
  • Hao Li
  • Kaiyuan Feng
  • Wenping Ma

We introduce the Hybrid Vector-Occupancy Field (HVOF), a new implicit 3D representation for reconstructing both open and closed surfaces from sparse point clouds. Existing approaches face severe limitations: occupancy fields and signed distance fields struggle with open surfaces, while unsigned distance fields and neural vector fields exhibit directional instability in complex topologies and ridge regions. HVOF addresses these challenges by incorporating a smoothly decaying occupancy field around the surface while capturing precise local geometry using truncated displacement vectors, naturally mitigating direction-field ambiguities near ridge regions. This unified design forms a robust hybrid representation that leverages both occupancy and vector fields. To realize it, we design a Hybrid Field variational autoencoder comprising a hierarchical cross-attention encoder and a dual-branch decoder that jointly learn the occupancy and vector fields through continuous weighting. Extensive experiments demonstrate that HVOF consistently outperforms state-of-the-art methods across the ShapeNet, ABC, and MGN datasets, accurately reconstructing both open and closed surfaces while preserving fine geometric details in complex regions.

AAAI Conference 2026 Conference Paper

ICL-Router: In-Context Learned Model Representations for LLM Routing

  • Chenxu Wang
  • Hao Li
  • Yiqun Zhang
  • Linyao Chen
  • Jianhao Chen
  • Ping Jian
  • Qiaosheng Zhang
  • Shuyue Hu

Large language models (LLMs) often exhibit complementary strengths. Model routing harnesses these strengths by dynamically directing each query to the most suitable model, given a candidate model pool. However, routing performance relies on accurate model representations, and adding new models typically requires retraining, limiting scalability. To address these challenges, we propose a novel routing method using in-context vectors to represent model capabilities. The method proceeds in two stages. First, queries are embedded and projected into vectors, with a projector and LLM-based router trained to reconstruct the original queries, aligning vector representations with the router’s semantic space. Second, each candidate model is profiled on a query set, and the router learns---based on in-context vectors of query and model performance---to predict whether each model can correctly answer new queries. Extensive experiments demonstrate that our method achieves state-of-the-art routing performance in both in-distribution and out-of-distribution tasks. Moreover, our method allows for seamless integration of new models without retraining the router.

AAAI Conference 2026 Conference Paper

Identity-Aware Vision-Language Model for Explainable Face Forgery Detection

  • Junhao Xu
  • Jingjing Chen
  • Yang Jiao
  • Jiacheng Zhang
  • Zhiyu Tan
  • Hao Li
  • Yu-Gang Jiang

Recent advances in generative artificial intelligence have enabled the creation of highly realistic image forgeries, raising significant concerns about digital media authenticity. While existing detection methods demonstrate promising results on benchmark datasets, they face critical limitations in real-world applications. First, existing detectors typically fail to detect semantic inconsistencies with the person’s identity, such as implausible behaviors or incompatible environmental contexts in given images. Second, these methods rely heavily on low-level visual cues, making them effective for known forgeries but less reliable against new or unseen manipulation techniques. To address these challenges, we present a novel personalized vision-language model (VLM) that integrates low-level visual artifact analysis and high-level semantic inconsistency detection. Unlike previous VLM-based methods, our approach avoids resource-intensive supervised fine-tuning that often struggles to preserve distinct identity characteristics. Instead, we employ a lightweight method that dynamically encodes identity-specific information into specialized identifier tokens. This design enables the model to learn distinct identity characteristics while maintaining robust generalization capabilities. We further enhance detection capabilities through a lightweight detection adapter that extracts fine-grained information from shallow features of the vision encoder, preserving critical low-level evidence. Comprehensive experiments demonstrate that our approach achieves 94.25% accuracy and 94.08% F1 score, outperforming both traditional forgery detectors and general VLMs while requiring only 10 extra tokens.

AAAI Conference 2026 Conference Paper

MetaDiT: Enabling Fine-grained Constraints in High-Degree-of-Freedom Metasurface Design

  • Hao Li
  • Andrey Bogdanov

Metasurfaces are ultrathin, engineered materials composed of nanostructures that manipulate light in ways unattainable by natural materials. Recent advances have leveraged computational optimization, machine learning, and deep learning to automate their design. However, existing approaches exhibit two fundamental limitations: (1) they often restrict the model to generating only a subset of design parameters, and (2) they rely on heavily downsampled spectral targets, which compromises both the novelty and accuracy of the resulting structures. The core challenge lies in developing a generative model capable of exploring a large, unconstrained design space while precisely capturing the intricate physical relationships between material parameters and their high-resolution spectral responses. In this paper, we introduce MetaDiT, a novel framework for high-fidelity metasurface design that addresses these limitations. Our approach leverages a robust spectrum encoder pretrained with contrastive learning, providing strong conditional guidance to a Diffusion Transformer-based backbone. Experiments demonstrate that MetaDiT outperforms existing baselines in spectral accuracy, and we further validate our method through extensive ablation studies.

JBHI Journal 2026 Journal Article

Multimodal Integration of a Novel Gait State Time Interval Signal Generation Method and Insole Sensor Data-based Body Intelligence: Application in Parkinson's Disease

  • Hao Li
  • Illa Baryskievic
  • Anatoliy Baryskievic
  • Viktar Tsviatkou

Multidomain and multimodal identification of walking gait-cycle states is important for detecting and monitoring locomotion disorders such as Parkinson's disease (PD). We propose a novel multizonal clustering and multi-level thresholding method, based on analyzing the multizonal plantar load distribution, for generating a discrete gait-state time-interval (GSTI) signal to improve PD diagnosis accuracy and the effectiveness of rehabilitation through personalized strategies. Multidomain analysis of the GSTI signal reveals a novel coupled I. Baryskievic-H. Li bio-oscillator, interpreted as a GSTI-derived signal-level oscillatory signature that may be associated with central nervous system (CNS)-related locomotor rhythm organization. The bio-oscillator consists of two interconnected oscillations with distinct resonant spectral peaks at specific natural frequencies and phase coupling (nonlinearity) between the two frequency components. We propose a multidomain feature level of a layered Integrative Body Intelligence (IBI) framework to identify lower- and higher-order interactions between gait-cycle states. The proposed multimodal data level of IBI involves acoustic and visual biofeedback based on a novel acoustic harmonic plantar pressure model and a 3D gait-state portrait of the GSTI signal, used for walking gait monitoring and personalized rehabilitation assessment in PD. Experiments on a publicly available PD plantar-insole dataset show that a Multilayer Perceptron (MLP) model based on the selected multidomain (time-interval, spectral, and bispectral) feature subset achieves a classification accuracy of 94.44% and offers a trade-off between model complexity and performance for PD recognition. This result suggests that early-stage PD can be diagnosed accurately merely by testing a patient's GSTI signal.

AAAI Conference 2026 Conference Paper

The Avengers: A Routing Recipe for Collective Intelligence in Language Models

  • Yiqun Zhang
  • Hao Li
  • Chenxu Wang
  • Linyao Chen
  • Qiaosheng Zhang
  • Peng Ye
  • Shi Feng
  • Xinrun Wang

Proprietary models are increasingly dominating the race for ever-larger language models. Can open-source, smaller models remain competitive across a broad range of tasks? In this paper, we present the Avengers, a lightweight framework that leverages the collective intelligence of these smaller models. The Avengers builds upon four lightweight operations: (i) embedding: encode queries using a text embedding model; (ii) clustering: group queries based on their semantic similarity; (iii) scoring: score each model's performance within each cluster; and (iv) voting: improve outputs via repeated sampling and voting. At inference time, each query is embedded and assigned to its nearest cluster. The top-performing model(s) within that cluster are selected to generate the response with repeated sampling. Remarkably, with 10 open-source models (~7B parameters each), the Avengers surpasses GPT-4o, 4.1, and 4.5 in average performance across 15 diverse datasets spanning mathematics, coding, logical reasoning, general knowledge, and affective tasks. In particular, it surpasses GPT-4.1 on mathematics tasks by 18.21% and on code tasks by 7.46%. Furthermore, the Avengers delivers superior out-of-distribution generalization, and remains robust across various embedding models, clustering algorithms, ensemble strategies, data efficiency, and values of its sole parameter, the number of clusters.
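
The four operations map onto a very small amount of code. The sketch below uses random vectors in place of a text embedding model and a random score table in place of measured per-cluster accuracy, and assumes scikit-learn for the clustering step.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy illustration of embed -> cluster -> score -> route. Real usage would
# embed queries with a text embedding model and fill model_scores with each
# candidate model's measured accuracy per cluster.
rng = np.random.default_rng(0)
train_emb = rng.standard_normal((200, 32))                      # (i) embeddings
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(train_emb)  # (ii)
model_scores = rng.random((8, 3))                               # (iii) 3 models

def route(query_emb):
    """Assign the query to its nearest cluster and pick the best-scoring
    model there; step (iv) would repeatedly sample that model and vote."""
    cluster = kmeans.predict(query_emb[None])[0]
    return int(np.argmax(model_scores[cluster]))

print(route(rng.standard_normal(32)))   # index of the selected model
```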

AAAI Conference 2026 Conference Paper

Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling

  • Hao Li
  • Shuai Yang
  • Yilun Chen
  • Xinyi Chen
  • Xiaoda Yang
  • Yang Tian
  • Hanqing Wang
  • Tai WANG

Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR, showing the promise of efficient multi-frame adaptation for real-world VLA deployment.

AAAI Conference 2025 Conference Paper

AdvDisplay: Adversarial Display Assembled by Thermoelectric Cooler for Fooling Thermal Infrared Detectors

  • Hao Li
  • Fanggao Wan
  • Yue Su
  • Yue Wu
  • Mingyang Zhang
  • Maoguo Gong

When current physical adversarial patches fail to deceive thermal infrared detectors, existing techniques must re-implement the attack from scratch: digital patch generation, material production, and physical deployment. Moreover, it is difficult for such patches to finely regulate infrared radiation. To address these issues, this paper designs an adversarial thermal display (AdvDisplay) by assembling thermoelectric coolers (TECs) into an array. Specifically, to reduce the gap between patches in the physical and digital worlds and to decrease the power consumption of the AdvDisplay device, a heat transfer loss and an electric power loss are designed to guide the patch optimization. In addition, a precise temperature control scheme for AdvDisplay is proposed based on proportional-integral-derivative (PID) control. Owing to the accurate temperature regulation and the reusability of AdvDisplay, our method improves both the attack success rate and the efficiency of physical deployment. Extensive experimental results indicate that the proposed method possesses superior adversarial effectiveness compared to other methods and demonstrates strong robustness in physical attacks.
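
The temperature-control component is standard PID. A minimal discrete-time loop is sketched below against a crude first-order thermal model; the gains and plant constants are illustrative, not the paper's tuned values.

```python
def pid_step(setpoint, measured, state, kp=2.0, ki=0.5, kd=0.1, dt=0.1):
    """One update of a discrete PID loop, of the kind used for per-TEC
    temperature regulation. state = (integral, previous_error)."""
    integral, prev_err = state
    err = setpoint - measured
    integral += err * dt
    derivative = (err - prev_err) / dt
    u = kp * err + ki * integral + kd * derivative   # drive signal for the TEC
    return u, (integral, err)

# Toy plant: temperature relaxes toward 25 C ambient plus the control input.
temp, state = 25.0, (0.0, 0.0)
for _ in range(100):
    u, state = pid_step(setpoint=40.0, measured=temp, state=state)
    temp += 0.1 * (u - 0.2 * (temp - 25.0))          # crude first-order model
print(round(temp, 2))                                # approaches the 40.0 setpoint
```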

JBHI Journal 2025 Journal Article

Cognitive Load Prediction From Multimodal Physiological Signals Using Multiview Learning

  • Yingxin Liu
  • Yang Yu
  • Hong Tao
  • Zeqi Ye
  • Si Wang
  • Hao Li
  • Dewen Hu
  • Zongtan Zhou

Predicting cognitive load is a crucial issue in the emerging field of human-computer interaction and holds significant practical value, particularly in flight scenarios. Although previous studies have realized efficient cognitive load classification, new research is still needed to adapt the current state-of-the-art multimodal fusion methods. Here, we proposed a feature selection framework based on multiview learning to address the challenges of information redundancy and reveal the common physiological mechanisms underlying cognitive load. Specifically, multimodal signal features [electroencephalogram (EEG), electrodermal activity (EDA), electrocardiogram (ECG), electrooculogram (EOG), and eye movements] at three cognitive load levels were estimated during multiattribute task battery (MATB) tasks performed by 22 healthy participants and fed into a feature selection-multiview classification with cohesion and diversity (FS-MCCD) framework. The optimized feature set was extracted from the original feature set by integrating the weight of each view and the feature weights to formulate the ranking criteria. The cognitive load prediction model, evaluated using real-time classification results, achieved an average accuracy of 81.08% and an average F1-score of 80.94% for three-class classification among 22 participants. Furthermore, the weights of the physiological signal features revealed the physiological mechanisms related to cognitive load. Specifically, heightened cognitive load was linked to amplified $\delta$ and $\theta$ power in the frontal lobe, reduced $\alpha$ power in the parietal lobe, and an increase in pupil diameter. Thus, the proposed multimodal feature fusion framework emphasizes the effectiveness and efficiency of using these features to predict cognitive load.

AAAI Conference 2025 Conference Paper

Deconfound Semantic Shift and Incompleteness in Incremental Few-shot Semantic Segmentation

  • Yirui Wu
  • Yuhang Xia
  • Hao Li
  • Lixin Yuan
  • Junyang Chen
  • Jun Liu
  • Tong Lu
  • Shaohua Wan

Incremental few-shot semantic segmentation (IFSS) expands the segmentation capacity of a trained model to new-class images with only a few samples. However, semantic meanings may shift from background to object class, or vice versa, during incremental learning. Moreover, new-class samples often lack representative attribute features when the new class differs greatly from the pre-learned old classes. In this paper, we propose a causal framework to analyze the causes of semantic shift and incompleteness in IFSS, and we deconfound the revealed causal effects from two aspects. First, we propose a Causal Intervention Module (CIM) to resist semantic shift. CIM progressively and adaptively updates old-class prototypes and removes the confounder in an intervention manner. Second, a Prototype Refinement Module (PRM) is proposed to complete the missing semantics. In PRM, knowledge gained from the episode learning scheme assists in fusing features of new-class and old-class prototypes. Experiments on both the PASCAL-VOC 2012 and ADE20k benchmarks demonstrate the outstanding performance of our method.

NeurIPS Conference 2025 Conference Paper

DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents

  • Hao Li
  • Xiaogeng Liu
  • CHIU Chun
  • Dianqi Li
  • Ning Zhang
  • Chaowei Xiao

Large Language Models (LLMs) are increasingly central to agentic systems due to their strong reasoning and planning capabilities. By interacting with external environments through predefined tools, these agents can carry out complex user tasks. Nonetheless, this interaction also introduces the risk of prompt injection attacks, where malicious inputs from external sources can mislead the agent's behavior, potentially resulting in economic loss, privacy leakage, or system compromise. System-level defenses have recently shown promise by enforcing static or predefined policies, but they still face two key challenges: the ability to dynamically update security rules and the need for memory stream isolation. To address these challenges, we propose DRIFT, a Dynamic Rule-based Isolation Framework for Trustworthy agentic systems, which enforces both control- and data-level constraints. A Secure Planner first constructs a minimal function trajectory and a JSON-schema-style parameter checklist for each function node based on the user query. A Dynamic Validator then monitors deviations from the original plan, assessing whether changes comply with privilege limitations and the user's intent. Finally, an Injection Isolator detects and masks any instructions that may conflict with the user query from the memory stream to mitigate long-term risks. We empirically validate the effectiveness of DRIFT on the AgentDojo and ASB benchmarks, demonstrating strong security performance while maintaining high utility across diverse models, showcasing both its robustness and adaptability. The code is released at https://github.com/SaFoLab-WISC/DRIFT.
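
The "JSON-schema-style parameter checklist" can be pictured as ordinary schema validation applied before each tool call runs. The sketch below assumes the `jsonschema` package; the tool name, pinned values, and limits are invented examples, not DRIFT's actual policies.

```python
from jsonschema import ValidationError, validate

# Hypothetical checklist derived from a user query such as
# "pay alice at most $100": arguments outside the plan are rejected.
checklist = {
    "send_payment": {
        "type": "object",
        "properties": {
            "recipient": {"const": "alice@example.com"},  # pinned by the plan
            "amount": {"type": "number", "maximum": 100},
        },
        "required": ["recipient", "amount"],
        "additionalProperties": False,
    }
}

def validate_call(tool, args):
    """Reject any tool call that deviates from the planned checklist."""
    try:
        validate(instance=args, schema=checklist[tool])
        return True
    except (KeyError, ValidationError):
        return False

print(validate_call("send_payment", {"recipient": "alice@example.com", "amount": 50}))  # True
print(validate_call("send_payment", {"recipient": "mallory@evil.com", "amount": 50}))   # False
```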

NeurIPS Conference 2025 Conference Paper

Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

  • Xiangyu Zhao
  • Peiyuan Zhang
  • Kexian Tang
  • Xiaorong Zhu
  • Hao Li
  • Wenhao Chai
  • Zicheng Zhang
  • Renqiu Xia

Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To study this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose a robust evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and the LMM-as-a-judge approach. We conducted experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models. The evaluation results demonstrate that current models face significant challenges in reasoning-based editing tasks. Even the most powerful model evaluated, GPT-image-1, achieves an accuracy of merely 28.8%. RISEBench effectively highlights the limitations of contemporary editing models, provides valuable insights, and indicates potential future directions for the field of reasoning-aware visual editing. Our code and data have been released at https://github.com/PhoenixZ810/RISEBench.

NeurIPS Conference 2025 Conference Paper

EverybodyDance: Bipartite Graph–Based Identity Correspondence for Multi-Character Animation

  • Haotian Ling
  • Zequn Chen
  • Qiuying Chen
  • Donglin Di
  • Yongjia Ma
  • Hao Li
  • Chen Wei
  • Zhulin Tao

Consistent pose‐driven character animation has achieved remarkable progress in single‐character scenarios. However, extending these advances to multi‐character settings is non‐trivial, especially when position swap is involved. Beyond mere scaling, the core challenge lies in enforcing correct Identity Correspondence (IC) between characters in reference and generated frames. To address this, we introduce EverybodyDance, a systematic solution targeting IC correctness in multi-character animation. EverybodyDance is built around the Identity Matching Graph (IMG), which models characters in the generated and reference frames as two node sets in a weighted complete bipartite graph. Edge weights, computed via our proposed Mask–Query Attention (MQA), quantify the affinity between each pair of characters. Our key insight is to formalize IC correctness as a graph structural metric and to optimize it during training. We also propose a series of targeted strategies tailored for multi-character animation, including identity-embedded guidance, a multi-scale matching strategy, and pre-classified sampling, which work synergistically. Finally, to evaluate IC performance, we curate the Identity Correspondence Evaluation benchmark, dedicated to multi‐character IC correctness. Extensive experiments demonstrate that EverybodyDance substantially outperforms state‐of‐the‐art baselines in both IC and visual fidelity.
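
Once the MQA affinities are computed, recovering identity correspondence is a maximum-weight bipartite matching problem, which the standard Hungarian solver handles directly. The affinity values below are made up for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: characters in the reference frame; columns: characters in the
# generated frame. Entries stand in for Mask-Query Attention affinities.
affinity = np.array([
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.3],
    [0.1, 0.3, 0.7],
])
rows, cols = linear_sum_assignment(affinity, maximize=True)
print(list(zip(rows.tolist(), cols.tolist())))   # [(0, 0), (1, 1), (2, 2)]
# IC correctness can then be scored as the fraction of matched pairs whose
# assignment agrees with the ground-truth identity labels.
```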

JBHI Journal 2025 Journal Article

Explicit Abnormality Extraction for Unsupervised Motion Artifact Reduction in Magnetic Resonance Imaging

  • Yusheng Zhou
  • Hao Li
  • Jianan Liu
  • Zhengmin Kong
  • Tao Huang
  • Euijoon Ahn
  • Zhihan Lv
  • Jinman Kim

Motion artifacts compromise the quality of magnetic resonance imaging (MRI) and pose challenges to achieving diagnostic outcomes and image-guided therapies. In recent years, supervised deep learning approaches have emerged as successful solutions for motion artifact reduction (MAR). One disadvantage of these methods is their dependency on acquiring paired sets of motion artifact-corrupted (MA-corrupted) and motion artifact-free (MA-free) MR images for training purposes. Obtaining such image pairs is difficult and therefore limits the application of supervised training. In this paper, we propose a novel UNsupervised Abnormality Extraction Network (UNAEN) to alleviate this problem. Our network is capable of working with unpaired MA-corrupted and MA-free images. It converts MA-corrupted images to MA-reduced images by extracting abnormalities from the MA-corrupted images using a proposed artifact extractor, which explicitly intercepts the residual artifact maps from the MA-corrupted MR images, and a reconstructor to restore the original input from the MA-reduced images. The performance of UNAEN was assessed on various publicly available MRI datasets and compared with state-of-the-art methods. The quantitative evaluation demonstrates the superiority of UNAEN over alternative MAR methods, and it visually exhibits fewer residual artifacts. Our results substantiate the potential of UNAEN as a promising solution applicable in real-world clinical environments, with the capability to enhance diagnostic accuracy and facilitate image-guided therapies.

NeurIPS Conference 2025 Conference Paper

FuXi-Ocean: A Global Ocean Forecasting System with Sub-Daily Resolution

  • Qiusheng Huang
  • Yuan Niu
  • Xiaohui Zhong
  • Lei Chen
  • Dianjun Zhang
  • Xuefeng Zhang
  • Hao Li

Accurate, high-resolution ocean forecasting is crucial for maritime operations and environmental monitoring. While traditional numerical models are capable of producing sub-daily, eddy-resolving forecasts, they are computationally intensive and face challenges in maintaining accuracy at fine spatial and temporal scales. In contrast, recent data-driven approaches offer improved computational efficiency and emerging potential, yet typically operate at daily resolution and struggle with sub-daily predictions due to error accumulation over time. We introduce FuXi-Ocean, the first data-driven global ocean forecasting model achieving six-hourly predictions at eddy-resolving 1/12° spatial resolution, reaching depths of up to 1500 meters. The model architecture integrates a context-aware feature extraction module with a predictive network employing stacked attention blocks. The core innovation is the Mixture-of-Time (MoT) module, which adaptively integrates predictions from multiple temporal contexts by learning variable-specific reliability, mitigating cumulative errors in sequential forecasting. Through comprehensive experimental evaluation, FuXi-Ocean demonstrates superior skill in predicting key variables, including temperature, salinity, and currents, across multiple depths.

NeurIPS Conference 2025 Conference Paper

GoT: Unleashing Reasoning Capability of MLLM for Visual Generation and Editing

  • Rongyao Fang
  • Chengqi Duan
  • Kun Wang
  • Linjiang Huang
  • Hao Li
  • Hao Tian
  • Shilin Yan
  • Weihao Yu

Current image generation and editing methods primarily process textual prompts as direct inputs without explicit reasoning about visual composition or operational steps. We present Generation Chain-of-Thought (GoT), a novel paradigm that empowers a Multimodal Large Language Model (MLLM) to first generate an explicit, structured reasoning chain in natural language—detailing semantic relationships, object attributes, and, crucially, precise spatial coordinates—before any image synthesis occurs. This intermediate reasoning output directly guides the subsequent visual generation or editing process. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. We will release our datasets and models to facilitate future research.

AAAI Conference 2025 Conference Paper

GRPose: Learning Graph Relations for Human Image Generation with Pose Priors

  • Xiangchen Yin
  • Donglin Di
  • Lei Fan
  • Hao Li
  • Wei Chen
  • Gouxiaofei
  • Yang Song
  • Xiao Sun

Recent methods using diffusion models have made significant progress in human image generation with various control signals such as pose priors. However, existing efforts are still struggling to generate high-quality images with consistent pose alignment, resulting in unsatisfactory output. In this paper, we propose a framework that delves into the graph relations of pose priors to provide control information for human image generation. The main idea is to establish a graph topological structure between the pose priors and latent representation of diffusion models to capture the intrinsic associations between different pose parts. A Progressive Graph Integrator (PGI) is designed to learn the spatial relationships of the pose priors with the graph structure, adopting a hierarchical strategy within an Adapter to gradually propagate information across different pose parts. Besides, a pose perception loss is introduced based on a pretrained pose estimation network to minimize the pose differences. Extensive qualitative and quantitative experiments conducted on the Human-Art and LAION-Human datasets clearly demonstrate that our model can achieve significant performance improvement over the latest benchmark models.

AAAI Conference 2025 Conference Paper

HieraFashDiff: Hierarchical Fashion Design with Multi-stage Diffusion Models

  • Zhifeng Xie
  • Hao Li
  • Huiming Ding
  • Mengtian Li
  • Xinhan Di
  • Ying Cao

Fashion design is a challenging and complex process. Recent works on fashion generation and editing are all agnostic of the actual fashion design process, which limits their usage in practice. In this paper, we propose a novel hierarchical diffusion-based framework tailored for fashion design, coined HieraFashDiff. Our model is designed to mimic the practical fashion design workflow by unraveling the denoising process into two successive stages: 1) an ideation stage that generates design proposals given high-level concepts and 2) an iteration stage that continuously refines the proposals using low-level attributes. Our model supports fashion design generation and fine-grained local editing in a single framework. To train our model, we contribute a new dataset of full-body fashion images annotated with hierarchical text descriptions. Extensive evaluations show that, compared to prior approaches, our method can generate fashion designs and edited results with higher fidelity and better prompt adherence, showing its promising potential to augment the practical fashion design workflow.

NeurIPS Conference 2025 Conference Paper

Learning Crossmodal Interaction Patterns via Attributed Bipartite Graphs for Single-Cell Omics

  • Xiaotang Wang
  • Xuanwei Lin
  • Yun Zhu
  • Hao Li
  • Yongqi Zhang

Crossmodal matching in single-cell omics is essential for explaining biological regulatory mechanisms and enhancing downstream analyses. However, current single-cell crossmodal models often suffer from three limitations: sparse modality signals, underutilization of biological attributes, and insufficient modeling of regulatory interactions. These challenges hinder generalization in data-scarce settings and restrict the ability to uncover fine-grained, biologically meaningful crossmodal relationships. Here, we present a novel framework that reformulates crossmodal matching as a graph classification task on Attributed Bipartite Graphs (ABGs). It models single-cell ATAC-RNA data as an ABG, where each expressed ATAC and RNA is treated as a distinct node with a unique ID and biological features. To model crossmodal interaction patterns on the constructed ABG, we propose $\text{Bi}^2\text{Former}$, a biologically driven bipartite graph transformer that learns interpretable attention over ATAC-RNA pairs. This design enables the model to effectively learn and explain biological regulatory relationships between the ATAC and RNA modalities. Extensive experiments demonstrate that $\text{Bi}^2\text{Former}$ achieves state-of-the-art performance in crossmodal matching across diverse datasets, remains robust under sparse training data, generalizes to unseen cell types and datasets, and reveals biologically meaningful regulatory patterns. This work pioneers an ABG-based approach for single-cell crossmodal matching, offering a powerful framework for uncovering regulatory interactions in single-cell omics. Our code is available at: https://github.com/wangxiaotang0906/Bi2Former.

NeurIPS Conference 2025 Conference Paper

MIRA: Medical Time Series Foundation Model for Real-World Health Data

  • Hao Li
  • Bowen Deng
  • Chang Xu
  • ZhiYuan Feng
  • Viktor Schlegel
  • Yu-Hao Huang
  • Yizheng Sun
  • Jingyuan Sun

A unified foundation model for medical time series—pretrained on open access and ethically reviewed medical corpora—offers the potential to reduce annotation burdens, minimize model customization, and enable robust transfer across clinical institutions, modalities, and tasks, particularly in data-scarce or privacy-constrained environments. However, existing time series foundation models struggle to handle medical time series data due to its inherent challenges, including irregular intervals, heterogeneous sampling rates, and frequent missingness. To address these challenges, we introduce MIRA, a unified foundation model specifically designed for medical time series forecasting. MIRA incorporates a Continuous-Time Rotary Positional Encoding that enables fine-grained modeling of variable time intervals, a frequency-specific mixture-of-experts layer that routes computation across latent frequency regimes to further promote temporal specialization, and a Continuous Dynamics Extrapolation Block based on Neural ODEs that models the continuous trajectory of latent states, enabling accurate forecasting at arbitrary target timestamps. Pretrained on a large-scale and diverse medical corpus comprising over 454 billion time points collected from publicly available datasets, MIRA achieves reductions in forecasting errors by an average of 8% and 6% in out-of-distribution and in-distribution scenarios, respectively. We also introduce a comprehensive benchmark spanning multiple downstream clinical tasks, establishing a foundation for future research in medical time series modeling.
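
One plausible reading of the Continuous-Time Rotary Positional Encoding is ordinary RoPE with the integer position replaced by a real-valued timestamp, so that irregular sampling intervals map to proportional rotation angles. The sketch below implements that reading; MIRA's exact parameterization may differ.

```python
import numpy as np

def continuous_time_rope(x, t, base=10000.0):
    """Rotary positional encoding evaluated at a real-valued timestamp t.
    x: (d,) with d even; each feature pair is rotated by an angle
    t * omega_i, so uneven time gaps produce proportional rotations.
    """
    d = x.shape[0]
    omega = base ** (-np.arange(0, d, 2) / d)   # per-pair frequencies
    cos, sin = np.cos(t * omega), np.sin(t * omega)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

x = np.ones(8)
# Unevenly sampled measurements (t = 0.0, 0.7, 3.2 hours) get encodings that
# depend only on elapsed time, not on a sample index.
print([continuous_time_rope(x, t)[:2].round(3).tolist() for t in (0.0, 0.7, 3.2)])
```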

AAAI Conference 2025 Conference Paper

MUCD: Unsupervised Point Cloud Change Detection via Masked Consistency

  • Yue Wu
  • Zhipeng Wang
  • Yongzhe Yuan
  • Maoguo Gong
  • Hao Li
  • Mingyang Zhang
  • Wenping Ma
  • Qiguang Miao

3D Change Detection (3DCD) has gradually become another research hotspot after image change detection. Recent works focus on using artificial labels for supervised or weakly-supervised training of siamese networks to segment changed points. However, labeling every point of multi-temporal point clouds is very expensive and time-consuming. In addition, these works lack effective self-supervised signals, and existing self-supervised signals often fail to capture sufficiently rich change information. To solve this problem, we assume that a powerful representation of 3D objects should model the consistency information of unchanged regions and distinguish different objects. Based on this assumption, we propose a new unsupervised framework called MUCD that learns change information of multi-temporal point clouds through bidirectional optimization of a change segmentor and a feature extractor. The training of the network is divided into two stages. We first design a foreknowledge point contrastive loss, based on the characteristics of the 3DCD task, to initialize the feature extractor, and then propose a masked consistency loss to further learn the shared geometric information of unchanged regions in the multi-temporal point clouds, utilizing it as a free and powerful supervised signal to train the change segmentor. In the inference stage, only the segmentor is used: it takes multi-temporal point clouds as input and produces the change segmentation result. Extensive experiments on SLPCCD and Urb3DCD, two real-world datasets of streets and urban buildings, verify that our unsupervised method is highly competitive and even outperforms supervised methods in scenes where semantic changes occur, exhibiting better generalization ability and robustness.

NeurIPS Conference 2025 Conference Paper

NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

  • Changyao Tian
  • Hao Li
  • Gen Luo
  • Xizhou Zhu
  • Weijie Su
  • Hanming Deng
  • Jinguo Zhu
  • Jie Shao

Compositional training has been the de-facto paradigm in existing Multimodal Large Language Models (MLLMs), where pre-trained vision encoders are connected with pre-trained LLMs through continuous multimodal pre-training. However, the multimodal scaling property of this paradigm remains difficult to explore due to the separated training. In this paper, we focus on the native training of MLLMs in an end-to-end manner and systematically study its design space and scaling properties under a practical setting, i.e., data constraints. Through careful study of various choices in MLLM design, we obtain the optimal meta-architecture that best balances performance and training cost. After that, we further explore the scaling properties of native MLLMs and identify a positively correlated scaling relationship between visual encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. Beyond that, our findings and results provide in-depth insights for the future study of native MLLMs.

NeurIPS Conference 2025 Conference Paper

Omni-Mol: Multitask Molecular Model for Any-to-any Modalities

  • Chengxin Hu
  • Hao Li
  • Yihe Yuan
  • Zezheng Song
  • Chenyang Zhao
  • Haixin Wang

In the molecular domain, numerous studies have explored the use of multimodal large language models (LLMs) to construct a general-purpose, multi-task molecular model. However, these efforts are still far from achieving a truly universal molecular model. We identify three key challenges in this endeavor: (1) Existing molecular task datasets are typically small in scale and lack comprehensive domain coverage. (2) Tasks from different molecular subfields are difficult to effectively learn jointly through LLMs due to significant distributional shifts and competition among tasks, which introduces instability in the learning process. (3) Both inter-task and intra-task molecular representations demand different intrinsic dimensions in the language space, making it challenging to balance between redundancy and insufficiency in language model representations. To address these challenges, we innovatively categorize existing small-molecule tasks into four types: Mol2Mol, Mol2Text, Mol2Num, and Text2Mol. We then collect a dataset encompassing over 16 tasks with more than 1.4 million samples, making it the largest molecular instruction-tuning dataset to date. Leveraging the extensive pretraining of LLMs on existing chemical literature, we propose a novel multimodal LLM framework, named Omni-Mol, which unifies all small-molecule tasks and supports both molecular generation and understanding. The core of Omni-Mol is our proposed MoGE, which dynamically adapts to the intrinsic rank of different tasks. This mixture-of-experts architecture enhances the model's ability to handle diverse tasks and modalities effectively. Our model achieves unified instruction tuning across 16 tasks and attains state-of-the-art performance on 13 of them. Extensive experiments further demonstrate the scalability and versatility of Omni-Mol.

AAAI Conference 2025 Conference Paper

Partial Point Cloud Registration with Multi-view 2D Image Learning

  • Yue Zhang
  • Yue Wu
  • Wenping Ma
  • Maoguo Gong
  • Hao Li
  • Biao Hou

Learning representations from large-scale 2D image data has shown promising performance, yet very few works apply these representations to point cloud registration. In this paper, we explore how to leverage 2D information to assist point cloud registration, and propose IAPReg, an Image-Assisted Partial 3D point cloud Registration framework built on multi-view images generated from the input point cloud. The goal is to enrich 3D information with 2D knowledge and to leverage that knowledge during registration. Specifically, we create multi-view depth maps by projecting the input point cloud from several specific views, and then extract 2D and 3D features using well-established models. To fuse the information learned from the 2D and 3D modalities, an inter-modality multi-view learning module is proposed to enhance geometric information and complement semantic information. Weighted SVD is a common method for reducing the impact of inaccurate correspondences on registration; however, determining the correspondence weights is not trivial. Therefore, we design a 2D-weighted SVD method, in which 2D knowledge is employed to provide the weight of each correspondence. Extensive experiments show that our method outperforms state-of-the-art methods without additional 2D training data.
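
Weighted SVD itself is the classical weighted Kabsch solver; what is new in IAPReg is the source of the weights. A self-contained NumPy version, with the weights passed in rather than predicted from 2D features, looks like this:

```python
import numpy as np

def weighted_svd_transform(src, dst, w):
    """Weighted Kabsch/SVD alignment: recover rotation R and translation t
    minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2. In IAPReg the
    weights would come from 2D knowledge; here they are simply given.
    """
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(0)
    mu_d = (w[:, None] * dst).sum(0)
    H = (src - mu_s).T @ (w[:, None] * (dst - mu_d))    # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])  # guard against reflection
    R = Vt.T @ S @ U.T
    return R, mu_d - R @ mu_s

# Check on synthetic correspondences with one down-weighted outlier.
rng = np.random.default_rng(0)
src = rng.standard_normal((50, 3))
a = 0.3
R_true = np.array([[np.cos(a), -np.sin(a), 0],
                   [np.sin(a),  np.cos(a), 0],
                   [0, 0, 1]])
dst = src @ R_true.T + np.array([1.0, -2.0, 0.5])
dst[0] += 10.0                           # corrupt one correspondence...
w = np.ones(50); w[0] = 1e-6             # ...and give it near-zero weight
R, t = weighted_svd_transform(src, dst, w)
print(np.allclose(R, R_true, atol=1e-4), np.round(t, 3))  # True [ 1. -2.  0.5]
```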

AAAI Conference 2025 Conference Paper

Pioneer: Physics-informed Riemannian Graph ODE for Entropy-increasing Dynamics

  • Li Sun
  • Ziheng Zhang
  • Zixi Wang
  • Yujie Wang
  • Qiqi Wan
  • Hao Li
  • Hao Peng
  • Philip S. Yu

Dynamic interacting system modeling is important for understanding and simulating real-world systems, e.g., meteorology and the spread of COVID. The system is typically described as a graph, where multiple objects dynamically interact with each other and evolve over time. In recent years, graph Ordinary Differential Equations (ODEs) have received increasing research attention. While achieving encouraging results, existing solutions prioritize the traditional Euclidean space and neglect the intrinsic geometry of the system and the laws of physics, e.g., the principle of increasing entropy. These limitations motivate us to rethink system dynamics from a fresh perspective of Riemannian geometry, and to pose a more realistic problem of physics-informed dynamic system modeling that, for the first time, considers the underlying geometry and physics laws. In this paper, we present a novel physics-informed Riemannian graph ODE for a wide range of entropy-increasing dynamic systems (termed Pioneer). In particular, we formulate a differential system on the Riemannian manifold, where a manifold-valued graph ODE is governed by the proposed constrained Ricci flow, together with a manifold-preserving Gyro-transform aware of system geometry. Theoretically, we prove that our formulation is entropy non-decreasing, obeying the physics laws. Empirical results show the superiority of Pioneer on real datasets.

NeurIPS Conference 2025 Conference Paper

PointTruss: K-Truss for Point Cloud Registration

  • Yue Wu
  • Jun Jiang
  • Yongzhe Yuan
  • Maoguo Gong
  • Qiguang Miao
  • Hao Li
  • Mingyang Zhang
  • Wenping Ma

Point cloud registration is a fundamental task in 3D computer vision. Recent advances have shown that graph-based methods are effective for outlier rejection in this context. However, existing clique-based methods impose overly strict constraints and are NP-hard, making it difficult to achieve both robustness and efficiency. The k-core relaxation reduces computational complexity but considers only node degree, ignoring higher-order topological structures such as triangles, which limits its effectiveness in complex scenarios. To overcome these limitations, we introduce the $k$-truss from graph theory into point cloud registration, leveraging triangle support as a constraint for inlier selection. We further propose a consensus-voting-based low-scale sampling strategy to efficiently extract the structural skeleton of the point cloud prior to $k$-truss decomposition. Additionally, we design a spatial distribution score that balances coverage and uniformity of inliers, preventing selections that concentrate on sparse local clusters. Extensive experiments on KITTI, 3DMatch, and 3DLoMatch demonstrate that our method consistently outperforms both traditional and learning-based approaches in various indoor and outdoor scenarios, achieving state-of-the-art results.
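
The k-truss constraint is easy to demonstrate on a toy compatibility graph, where nodes are putative correspondences and edges connect geometrically consistent pairs. The sketch below uses networkx's built-in decomposition; the graph is hand-made for illustration.

```python
import networkx as nx

# Dense cluster {0,1,2,3}: mutually consistent correspondences (likely inliers).
# Nodes 4 and 5 attach sparsely, with no triangle support (likely outliers).
G = nx.Graph()
G.add_edges_from([
    (0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3),
    (4, 5), (2, 4),
])
# k = 3: every surviving edge must lie in at least k - 2 = 1 triangle.
truss = nx.k_truss(G, 3)
print(sorted(truss.nodes()))   # [0, 1, 2, 3]; nodes 4 and 5 are pruned
```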

AAAI Conference 2025 Conference Paper

Political Actor Agent: Simulating Legislative System for Roll Call Votes Prediction with Large Language Models

  • Hao Li
  • Ruoyuan Gong
  • Hao Jiang

Predicting roll call votes through modeling political actors has emerged as a focus in quantitative political science and computer science. Widely used embedding-based methods generate vectors for legislators from diverse data sets to predict legislative behaviors. However, these methods often contend with challenges such as the need for manually predefined features, reliance on extensive training data, and a lack of interpretability. Achieving more interpretable predictions under flexible conditions remains an unresolved issue. This paper introduces the Political Actor Agent (PAA), a novel agent-based framework that utilizes Large Language Models to overcome these limitations. By employing role-playing architectures and simulating the legislative system, PAA provides a scalable and interpretable paradigm for predicting roll-call votes. Our approach not only enhances the accuracy of predictions but also offers multi-view, human-understandable decision reasoning, providing new insights into political actor behaviors. We conducted comprehensive experiments using voting records from the 117th-118th U.S. House of Representatives, validating the superior performance and interpretability of PAA. This study demonstrates not only PAA's effectiveness but also its potential in political science research.

IROS Conference 2025 Conference Paper

Robust Stabilization of an Autonomous Underwater Vehicle in Specified Finite-time with Disturbance Rejection

  • Hongjiao Niu
  • Zhiyong Geng
  • Zhiyu Li
  • Hao Li
  • Xiayang Li

This study investigates the robust finite-time stabilization of an autonomous underwater vehicle (AUV) with disturbance rejection, where the finite time can be predetermined. The AUV is modeled as a rigid body moving within fluids, and the system dynamics involve uncertain parameters arising from the hydrodynamic coupling between the AUV and the fluid, along with unknown external disturbances. Direct compensation for the system's dynamics, a common approach in controller design, is ineffective for AUVs with uncertainties. Therefore, a robust anti-disturbance control method without compensation for unknown dynamics is proposed. To begin, a time-rescaling method is introduced to convert the specified finite-time stabilization of the system into asymptotic stabilization of a time-rescaled system. Then, an exponential PID controller is designed for the time-rescaled system to handle unknown constant disturbances, while a high-gain control strategy is used to suppress the uncertain dynamics, which also enhances robustness against them. The specified finite-time stabilization control law is ultimately derived from the designed exponential control law of the time-rescaled system. Numerical simulations are conducted to verify the results.

ICML Conference 2025 Conference Paper

STAR: Learning Diverse Robot Skill Abstractions through Rotation-Augmented Vector Quantization

  • Hao Li
  • Qi Lv 0001
  • Rui Shao 0001
  • Xiang Deng 0002
  • Yinchuan Li
  • Jianye Hao
  • Liqiang Nie

Transforming complex actions into discrete skill abstractions has demonstrated strong potential for robotic manipulation. Existing approaches mainly leverage latent variable models, e.g., VQ-VAE, to learn skill abstractions through learned vectors (codebooks), but they suffer from codebook collapse and struggle to model the causal relationships between learned skills. To address these limitations, we present Skill Training with Augmented Rotation (STAR), a framework that advances both skill learning and composition to complete complex behaviors. Specifically, to prevent codebook collapse, we devise rotation-augmented residual skill quantization (RaRSQ). It encodes relative angles between encoder outputs into the gradient flow via a rotation-based gradient mechanism: points within the same skill code are pushed apart or pulled closer together depending on gradient directions. Further, to capture the causal relationships between skills, we present the causal skill transformer (CST), which explicitly models dependencies between skill representations through an autoregressive mechanism for coherent action generation. Extensive experiments demonstrate the superiority of STAR on both the LIBERO benchmark and real-world tasks, with around 12% improvement over the baselines.

NeurIPS Conference 2025 Conference Paper

STRIDER: Navigation via Instruction-Aligned Structural Decision Space Optimization

  • Diqi He
  • Xuehao Gao
  • Hao Li
  • Junwei Han
  • Dingwen Zhang

The Zero-shot Vision-and-Language Navigation in Continuous Environments (VLN-CE) task requires agents to navigate previously unseen 3D environments using natural language instructions, without any scene-specific training. A critical challenge in this setting lies in ensuring that agents' actions align with both spatial structure and task intent over long-horizon execution. Existing methods often fail to achieve robust navigation due to a lack of structured decision-making and insufficient integration of feedback from previous actions. To address these challenges, we propose STRIDER (Instruction-Aligned Structural Decision Space Optimization), a novel framework that systematically optimizes the agent's decision space by integrating spatial layout priors and dynamic task feedback. Our approach introduces two key innovations: 1) a Structured Waypoint Generator that constrains the action space through spatial structure, and 2) a Task-Alignment Regulator that adjusts behavior based on task progress, ensuring semantic alignment throughout navigation. Extensive experiments on the R2R-CE and RxR-CE benchmarks demonstrate that STRIDER significantly outperforms strong state-of-the-art methods across key metrics; in particular, it improves Success Rate (SR) from 29% to 35%, a relative gain of 20.7%. Such results highlight the importance of spatially constrained decision-making and feedback-guided execution in improving navigation fidelity for zero-shot VLN-CE.
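
The relative gain follows directly from the absolute success rates:

\[
\frac{35\% - 29\%}{29\%} = \frac{6}{29} \approx 20.7\%.
\]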

NeurIPS Conference 2025 Conference Paper

T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

  • Dongzhi JIANG
  • Ziyu Guo
  • Renrui Zhang
  • Zhuofan Zong
  • Hao Li
  • Le Zhuo
  • Shilin Yan
  • Pheng-Ann Heng

Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generated CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. All training code is included in the supplementary material and will be made public.

JBHI Journal 2025 Journal Article

TKR-FSOD: Fetal Anatomical Structure Few-Shot Detection Utilizing Topological Knowledge Reasoning

  • Xi Li
  • Ying Tan
  • Bocheng Liang
  • Bin Pu
  • Jiewen Yang
  • Lei Zhao
  • Yanqing Kong
  • Lixian Yang

Fetal multi-anatomical structure detection in ultrasound (US) images can clearly present the relationships and influences between anatomical structures, providing more comprehensive information about fetal organs and assisting sonographers in making more accurate diagnoses; it is widely used in structure evaluation. Recently, deep learning methods have shown superior performance in detecting various anatomical structures in ultrasound images, but they still leave room for improvement in categories where samples are difficult to obtain, such as rare diseases. Few-shot learning has attracted considerable attention in medical image analysis due to its ability to address data scarcity. However, existing few-shot learning research in medical image analysis focuses on classification and segmentation, while object detection has been neglected. In this paper, we propose TKR-FSOD, a novel few-shot detection method for fetal anatomical structures in ultrasound images, which learns topological knowledge through a Topological Knowledge Reasoning Module to help the model reason about and detect anatomical structures. Furthermore, we propose a Discriminate Ability Enhanced Feature Learning Module that extracts abundant discriminative features to enhance the model's discriminative ability. Experimental results demonstrate that our method outperforms state-of-the-art baseline methods, exceeding the second-best method by a maximum margin of 4.8% in the 5-shot setting of split 1 under the four-chamber cardiac view.

NeurIPS Conference 2025 Conference Paper

Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference

  • Jiayi Yuan
  • Hao Li
  • Xinheng Ding
  • Wenya Xie
  • Yu-Jhe Li
  • Wentian Zhao
  • Kun Wan
  • Jing Shi

Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing the system configuration, such as evaluation batch size, GPU count, and GPU version, can introduce significant differences in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and a 9,000-token difference in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision, while critical for reproducibility, is often neglected in evaluation practices. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.
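
A self-contained demonstration of the root cause described above; the values are chosen to make the rounding visible in float32 (illustrative, not the paper's experiment):

```python
import numpy as np

# Floating-point addition is not associative: with limited precision, the
# summation order (which varies with batch size or GPU count when reductions
# are parallelized differently) changes the result.
a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 -- the 1.0 is absorbed: the ULP near 1e8 is 8 in float32

# LayerCast-style idea (sketch): keep weights in 16 bits for memory,
# but upcast to FP32 before the actual computation.
w16 = np.random.randn(256, 256).astype(np.float16)
x16 = np.random.randn(1, 256).astype(np.float16)
y = x16.astype(np.float32) @ w16.astype(np.float32)  # accumulate in FP32
```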

AAAI Conference 2025 Conference Paper

VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence

  • Hao Li
  • Hao Fei
  • Zechao Hu
  • Zhengwei Yang
  • Zheng Wang

Social Intelligence Queries (Social-IQ) serve as the primary multimodal benchmark for evaluating a model's social intelligence. While current solutions achieve impressive multiple-choice question (MCQ) accuracy, increasing evidence shows that they are largely, and in some cases entirely, dependent on the language modality, overlooking visual context. Additionally, the closed-set nature of the benchmark prevents exploration of whether, and to what extent, the reasoning path behind a selection is correct. To address these limitations, we propose the Visually Explainable and Grounded Artificial Social Intelligence (VEGAS) model. As a generative multimodal model, VEGAS leverages open-ended answering to provide explainable responses, which enhances the clarity and evaluation of reasoning paths. To enable visually grounded answering, we propose a novel sampling strategy that provides the model with more relevant visual frames. We then enhance the model's interpretation of these frames through Generalist Instruction Fine-Tuning (GIFT), which aims to: i) learn multimodal language transformations for fundamental emotional social traits, and ii) establish multimodal joint reasoning capabilities. Extensive experiments, comprising modality ablations, open-ended assessments, and supervised MCQ evaluations, consistently show that VEGAS effectively utilizes visual information in reasoning to produce correct and credible answers. We expect this work to offer a new perspective on Social-IQ and advance the development of human-like social AI.

IJCAI Conference 2025 Conference Paper

Wave-wise Discriminative Tracking by Phase-Amplitude Separation, Augmentation and Mixture

  • Huibin Tan
  • Mingyu Cao
  • Kun Hu
  • Xihuai He
  • Zhe Wang
  • Hao Li
  • Long Lan
  • Mengzhu Wang

Distinguishing key features in complex visual tasks is challenging. A novel approach treats image patches (tokens) as waves. By using both phase and amplitude, it captures richer semantics and specific invariances compared to pixel-based methods, and allows feature fusion across regions for a holistic image representation. Based on this, we propose the Wave-wise Discriminative Transformer Tracker (WDT). During tracking, WDT represents features via phase-amplitude separation, augmentation, and mixture. First, we design a Mutually Exclusive Phase-Amplitude Extractor (MEPAE) to separate phase and amplitude features with distinct semantics, representing spatial target information and background brightness, respectively. Then, wave-wise feature augmentation is carried out with two submodules: Phase-Amplitude Feature Augmentation and Mixture. The augmentation module perturbs the separated features within the same batch, and the mixture module recombines them to generate positive and negative waves; the original features are aggregated into the original wave. Positive waves share the same phase but differ in amplitude, while negative waves have different phase components. Finally, self-supervised and tracking-supervised losses guide global and local representation learning for the original, positive, and negative waves, enhancing wave-level discrimination. Experiments on five benchmarks prove the effectiveness of our method.
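
A minimal sketch of phase-amplitude separation and recombination over token features, using torch.fft (the shapes and the recombination policy are illustrative assumptions, not the paper's exact modules):

```python
import torch

feats = torch.randn(8, 196, 384)             # hypothetical batch of patch-token features
spec = torch.fft.fft(feats, dim=1)           # treat the token sequence as a wave
amplitude, phase = spec.abs(), spec.angle()  # separate the two components

# Recombining amplitude and phase -- e.g. pairing one sample's phase with
# another's amplitude -- yields positive/negative-wave style mixtures.
shuffled_amp = amplitude[torch.randperm(feats.size(0))]
mixed = torch.fft.ifft(shuffled_amp * torch.exp(1j * phase), dim=1).real
```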

TMLR Journal 2024 Journal Article

Adaptively Robust and Sparse $K$-means Clustering

  • Hao Li
  • Shonosuke Sugasawa
  • Shota Katayama

While $K$-means is a standard clustering algorithm, its performance may be compromised by the presence of outliers and high-dimensional noisy variables. This paper proposes adaptively robust and sparse $K$-means clustering (ARSK) to address these practical limitations of the standard $K$-means algorithm. For robustness, we introduce a redundant error component for each observation, and this additional parameter is penalized using a group-sparse penalty. To accommodate the impact of high-dimensional noisy variables, the objective function is modified by incorporating weights and a penalty that controls the sparsity of the weight vector. The tuning parameters controlling robustness and sparsity are selected by the Gap statistic. Through simulation experiments and real data analysis, we demonstrate the proposed method's superiority over existing algorithms in simultaneously identifying clusters, excluding outliers, and selecting informative variables.
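
A schematic objective consistent with this description (the exact penalties and constraints in ARSK are assumptions here): each observation $x_i$ gets an error term $e_i$ with a group-sparse penalty, and each variable $j$ a weight $w_j$ with a sparsity constraint,

\[
\min_{\{\mu_k\},\, E,\, w}\; \sum_{i=1}^{n} \sum_{j=1}^{p} w_j \left( x_{ij} - \mu_{c_i, j} - e_{ij} \right)^2 + \lambda \sum_{i=1}^{n} \lVert e_i \rVert_2
\quad \text{s.t.}\;\; \lVert w \rVert_1 \le s,\; w_j \ge 0,
\]

where $c_i$ is the cluster assignment of $x_i$; observations with $e_i \neq 0$ are flagged as outliers, and variables with $w_j = 0$ are discarded as uninformative.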

NeurIPS Conference 2024 Conference Paper

Clustering then Propagation: Select Better Anchors for Knowledge Graph Embedding

  • Ke Liang
  • Yue Liu
  • Hao Li
  • Lingyuan Meng
  • Suyuan Liu
  • Siwei Wang
  • Sihang Zhou
  • Xinwang Liu

Traditional knowledge graph embedding (KGE) models map entities and relations to unique embedding vectors in a shallow lookup manner. As the scale of data becomes larger, this manner incurs unaffordable computational costs. Anchor-based strategies have been treated as effective ways to alleviate such efficiency problems by propagating over representative entities instead of the whole graph. However, most existing anchor-based KGE models select anchors in a primitive manner, which limits their performance. To this end, we propose a novel anchor-based strategy for KGE, i.e., a relational clustering-based anchor selection strategy (RecPiece), which leverages two characteristics: (1) the representative ability of cluster centroids and (2) the descriptive ability of relation types in KGs. Specifically, we first perform clustering over features of factual triplets instead of entities, where the cluster number is naturally set to the number of relation types, since each fact can be characterized by its relation. Then, representative triplets are selected around the cluster centroids and further mapped to corresponding anchor entities. Extensive experiments on six datasets show that RecPiece achieves higher performance with comparable or even fewer parameters compared to previous anchor-based KGE models, indicating that our model selects better anchors in a more scalable way.
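
A minimal sketch of the selection step under stated assumptions (the feature dimensions, dataset, and triplet-to-anchor mapping below are illustrative, not RecPiece's exact pipeline):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
triplet_feats = rng.standard_normal((10000, 128))  # one feature per (h, r, t) fact
num_relations = 237                                # e.g. FB15k-237 has 237 relations

km = KMeans(n_clusters=num_relations, n_init=10, random_state=0).fit(triplet_feats)
dists = np.linalg.norm(triplet_feats - km.cluster_centers_[km.labels_], axis=1)

# For each cluster, take the triplet nearest its centroid as the representative;
# its entities then serve as anchors for propagation.
reps = [int(np.where(km.labels_ == c)[0][np.argmin(dists[km.labels_ == c])])
        for c in range(num_relations)]
```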

IJCAI Conference 2024 Conference Paper

Expressiveness is Effectiveness: Self-supervised Fashion-aware CLIP for Video-to-Shop Retrieval

  • Likai Tian
  • Zhengwei Yang
  • Zechao Hu
  • Hao Li
  • Yifang Yin
  • Zheng Wang

The rise of online shopping and social media has spurred the Video-to-Shop Retrieval (VSR) task, which involves identifying fashion items (e.g., clothing) in videos and matching them with identical products provided by stores. In real-world scenarios, human movement in dynamic video scenes can cause substantial morphological alterations of fashion items through occlusion, shifting viewpoints (parallax), and partial visibility (truncation). As a result, the few high-quality frames are overwhelmed by a vast number of redundant ones, which makes retrieval less effective. To this end, this paper introduces a framework, named Self-supervised Fashion-aware CLIP (SF-CLIP), for effective VSR. SF-CLIP discovers salient frames with high fashion expressiveness by generating pseudo-labels that assess three key aspects of fashion expressiveness: occlusion, parallax, and truncation. With such pseudo-labels, the ability of CLIP is extended to facilitate the discovery of salient frames. Furthermore, to build comprehensive representations across salient frames, a dual-branch graph-based fusion module is proposed to extract and integrate inter-frame features. Extensive experiments demonstrate the superiority of SF-CLIP over state-of-the-art methods.

JBHI Journal 2024 Journal Article

GKE-TUNet: Geometry-Knowledge Embedded TransUNet Model for Retinal Vessel Segmentation Considering Anatomical Topology

  • Yunlong Qiu
  • Haifeng Zhang
  • Chonghui Song
  • Xiaolong Zhao
  • Hao Li
  • Xianbo Wang

Automated retinal vessel segmentation is crucial for computer-aided clinical diagnosis and retinopathy screening. However, deep learning faces challenges in extracting complex intertwined structures and subtle small vessels from densely vascularized regions. To address these issues, we propose a novel segmentation model, called Geometry-Knowledge Embedded TransUNet (GKE-TUNet), which explicitly embeds topological features of retinal vessel anatomy. In the proposed GKE-TUNet model, a skeleton extraction network is pre-trained to extract the anatomical topology of retinal vessels from refined segmentation labels. During vessel segmentation, the dense skeleton graph is sampled as a graph of keypoints and connections and is incorporated into the skip-connection layer of TransUNet. The graph vertices serve as node features and correspond to positions in the low-level feature maps. A graph attention network (GAT) is used as the graph convolution backbone to capture the shape semantics of vessels and the interaction of key locations along the topological direction. Finally, the node features obtained by graph convolution are read out as a sparse feature map based on their spatial coordinates. To address the sparsity of this feature map, we employ convolution operators to fuse it with the low-level dense feature maps; the fusion is weighted and connected to the deep feature maps. Experimental results on the DRIVE, CHASE-DB1, and STARE datasets demonstrate the competitiveness of our method compared to existing ones.

AAAI Conference 2024 Conference Paper

Gradual Residuals Alignment: A Dual-Stream Framework for GAN Inversion and Image Attribute Editing

  • Hao Li
  • Mengqi Huang
  • Lei Zhang
  • Bo Hu
  • Yi Liu
  • Zhendong Mao

GAN-based image attribute editing first leverages GAN inversion to project real images into the latent space of a GAN and then manipulates the corresponding latent codes. Recent inversion methods mainly utilize additional high-bit features to improve the preservation of image details, as low-bit codes cannot faithfully reconstruct source images, leading to the loss of details. However, during editing, existing works fail to accurately complement the lost details and suffer from poor editability. The main reason is that they inject all the lost details indiscriminately at one time, which inherently causes the position and quantity of details to overfit the source images, resulting in inconsistent content and artifacts in edited images. This work argues that details should be gradually injected into both the reconstruction and editing processes in a multi-stage, coarse-to-fine manner for better detail preservation and high editability. Therefore, a novel dual-stream framework is proposed to accurately complement details at each stage. The Reconstruction Stream embeds coarse-to-fine lost details into residual features and then adaptively adds them to the GAN generator. In the Editing Stream, residual features are accurately aligned by our Selective Attention mechanism and then injected into the editing process in a multi-stage manner. Extensive experiments show the superiority of our framework in both reconstruction accuracy and editing quality compared with existing methods.

NeurIPS Conference 2024 Conference Paper

Grasp as You Say: Language-guided Dexterous Grasp Generation

  • Yi-Lin Wei
  • Jian-Jian Jiang
  • Chengyi Xing
  • Xian-Tuo Tan
  • Xiao-ming Wu
  • Hao Li
  • Mark Cutkosky
  • Wei-Shi Zheng

This paper explores a novel task, "Dexterous Grasp as You Say" (DexGYS), enabling robots to perform dexterous grasping based on human commands expressed in natural language. However, the development of this field is hindered by the lack of datasets with natural human guidance; thus, we propose a language-guided dexterous grasp dataset, named DexGYSNet, offering high-quality dexterous grasp annotations along with flexible and fine-grained human language guidance. Our dataset construction is cost-efficient, thanks to a carefully designed hand-object interaction retargeting strategy and an LLM-assisted language guidance annotation system. Equipped with this dataset, we introduce the DexGYSGrasp framework for generating dexterous grasps from human language instructions, capable of producing grasps that are intent-aligned, high-quality, and diverse. To achieve this, our framework decomposes the complex learning process into two manageable, progressive objectives and introduces two components to realize them. The first component learns the grasp distribution with a focus on intention alignment and generation diversity, and the second refines grasp quality while maintaining intention consistency. Extensive experiments are conducted on DexGYSNet and in real-world environments for validation.

IJCAI Conference 2024 Conference Paper

MICRO: Model-Based Offline Reinforcement Learning with a Conservative Bellman Operator

  • Xiao-Yin Liu
  • Xiao-Hu Zhou
  • Guotao Li
  • Hao Li
  • Mei-Jiang Gui
  • Tian-Yu Xiang
  • De-Xing Huang
  • Zeng-Guang Hou

Offline reinforcement learning (RL) faces the significant challenge of distribution shift. Model-free offline RL penalizes the Q value for out-of-distribution (OOD) data or constrains the policy to stay close to the behavior policy, but this inhibits exploration of the OOD region. Model-based offline RL, which uses a trained environment model to generate more OOD data and performs conservative policy optimization within that model, has become an effective approach to this problem. However, current model-based algorithms rarely consider agent robustness when incorporating conservatism into the policy. Therefore, we propose MICRO, a new model-based offline algorithm with a conservative Bellman operator. This method trades off performance and robustness by introducing the robust Bellman operator into the algorithm. Compared with previous model-based algorithms with robust adversarial models, MICRO significantly reduces computation cost by choosing only the minimal Q value within the state uncertainty set. Extensive experiments demonstrate that MICRO outperforms prior RL algorithms on offline RL benchmarks and is considerably robust to adversarial perturbations.
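
One plausible form of such an operator, written here only to fix ideas (the paper's exact definition may differ): given a state uncertainty set $\mathcal{U}(s,a)$ produced by the learned model ensemble,

\[
(\mathcal{T}Q)(s, a) = r(s, a) + \gamma \min_{s' \in \mathcal{U}(s, a)} \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\!\left[ Q(s', a') \right],
\]

so conservatism enters through a single $\min$ over candidate next states rather than through a fully adversarial model, which is what keeps the computation cheap.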

AAAI Conference 2024 Conference Paper

Negative Pre-aware for Noisy Cross-Modal Matching

  • Xu Zhang
  • Hao Li
  • Mang Ye

Cross-modal noise-robust learning is a challenging task, since noisy correspondence is hard to recognize and rectify. Due to the cumulative and unavoidable negative impact of unresolved noise, existing methods cannot maintain stable performance as the noise increases. In this paper, we present a novel Negative Pre-aware Cross-modal (NPC) matching solution for fine-tuning large visual-language models on noisy downstream tasks. It is featured in two aspects: (1) For noise recognition and resistance, whereas previous methods usually filter out a noisy subset directly, we propose to estimate the negative impact of each sample. This requires no additional correction mechanism that may yield unreliable corrections and thus self-reinforcing errors. We assign a confidence weight to each sample according to its negative impact during training, adaptively adjusting each sample's contribution to avoid noise accumulation. (2) To maintain stable performance with increasing noise, we exploit the memorization effect of DNNs by maintaining a memory bank. Specifically, we apply a GMM to select high-confidence clean samples as memory entries, which are then used to estimate the negative impact of each sample. Since clean samples are more easily distinguished by the GMM as noise increases, the memory bank retains high quality even at high noise ratios. Compared to correction mechanisms that focus on noisy samples, memory-bank-based estimation is more robust, which keeps model performance stable on noisy datasets. Extensive experiments demonstrate that our method significantly improves matching accuracy and performance stability at increasing noise ratios, and surpasses state-of-the-art methods by a large margin. The code is available at: https://github.com/ZhangXu0963/NPC.
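
A common recipe for the GMM-based selection step, sketched under stated assumptions (fitting a two-component GMM on per-sample losses is standard practice in noisy-label learning; the inputs and threshold here are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical per-sample matching losses; clean pairs tend to have low loss.
losses = np.abs(np.random.randn(50000, 1))

gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)
clean_comp = int(np.argmin(gmm.means_.ravel()))     # low-mean mode = "clean"
p_clean = gmm.predict_proba(losses)[:, clean_comp]  # per-sample clean probability

memory_bank = np.where(p_clean > 0.9)[0]            # high-confidence clean entries
```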

NeurIPS Conference 2024 Conference Paper

Parameter-Inverted Image Pyramid Networks

  • Xizhou Zhu
  • Xue Yang
  • Zhaokai Wang
  • Hao Li
  • Wenhan Dou
  • Junqi Ge
  • Lewei Lu
  • Yu Qiao

Image pyramids are commonly used in modern computer vision tasks to obtain multi-scale features for precise understanding of images. However, image pyramids process multiple resolutions of images using the same large-scale model, which requires significant computational cost. To overcome this issue, we propose a novel network architecture known as the Parameter-Inverted Image Pyramid Networks (PIIP). Our core idea is to use models with different parameter sizes to process different resolution levels of the image pyramid, thereby balancing computational efficiency and performance. Specifically, the input to PIIP is a set of multi-scale images, where higher resolution images are processed by smaller networks. We further propose a feature interaction mechanism to allow features of different resolutions to complement each other and effectively integrate information from different spatial scales. Extensive experiments demonstrate that the PIIP achieves superior performance in tasks such as object detection, segmentation, and image classification, compared to traditional image pyramid methods and single-branch networks, while reducing computational cost. Notably, when applying our method on a large-scale vision foundation model InternViT-6B, we improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation. These results validate the effectiveness of the PIIP approach and provide a new technical direction for future vision computing tasks.

ICML Conference 2024 Conference Paper

RoboMP2: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models

  • Qi Lv 0001
  • Hao Li
  • Xiang Deng 0002
  • Rui Shao 0001
  • Michael Yu Wang
  • Liqiang Nie

Multimodal Large Language Models (MLLMs) have shown impressive reasoning abilities and general intelligence in various domains. This inspires researchers to train end-to-end MLLMs or to use large models to generate policies with human-selected prompts for embodied agents. However, these methods exhibit limited generalization on unseen tasks or scenarios and overlook the multimodal environment information that is critical for robots' decision-making. In this paper, we introduce RoboMP$^2$, a novel Robotic Multimodal Perception-Planning framework for robotic manipulation, which consists of a Goal-Conditioned Multimodal Preceptor (GCMP) and a Retrieval-Augmented Multimodal Planner (RAMP). Specifically, GCMP captures environment states by employing a tailored MLLM for embodied agents with semantic reasoning and localization abilities. RAMP uses a coarse-to-fine retrieval method to find the $k$ most relevant policies as in-context demonstrations to enhance the planner. Extensive experiments demonstrate the superiority of RoboMP$^2$ on both the VIMA benchmark and real-world tasks, with around 10% improvement over the baselines.
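
A single-stage simplification of the retrieval step (RAMP is coarse-to-fine; this sketch shows only cosine-similarity top-$k$, with all names and shapes assumed):

```python
import numpy as np

def topk_demos(task_emb: np.ndarray, demo_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k demonstrations most similar to the task embedding."""
    demo_embs = demo_embs / np.linalg.norm(demo_embs, axis=1, keepdims=True)
    task_emb = task_emb / np.linalg.norm(task_emb)
    return np.argsort(demo_embs @ task_emb)[::-1][:k]

# Usage with random stand-in embeddings:
demos = np.random.randn(500, 768)
task = np.random.randn(768)
print(topk_demos(task, demos))  # indices of the 3 most relevant policy demos
```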

AAAI Conference 2024 Conference Paper

Robustly Train Normalizing Flows via KL Divergence Regularization

  • Kun Song
  • Ruben Solozabal
  • Hao Li
  • Martin Takáč
  • Lu Ren
  • Fakhri Karray

In this paper, we find that the training of Normalizing Flows (NFs) is easily affected by outliers and by a small number (or high dimensionality) of training samples. To solve this problem, we propose a Kullback-Leibler (KL) divergence regularization on the Jacobian matrix of NFs. We prove that this regularization is equivalent to adding a set of samples whose covariance matrix is the identity matrix to the training set. Thus, it simultaneously reduces the negative influence of outliers and of the small sample size on the estimation of the covariance matrix. Our regularization therefore makes the training of NFs robust. Finally, we evaluate the performance of NFs on out-of-distribution (OoD) detection tasks. The excellent results demonstrate the effectiveness of the proposed regularization term. For example, with the proposed regularization, the OoD detection score increases by up to 30% compared with the unregularized counterpart.

AAAI Conference 2024 Conference Paper

Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition

  • Qianrui Zhou
  • Hua Xu
  • Hao Li
  • Hanlei Zhang
  • Xiaohan Zhang
  • Yifan Wang
  • Kai Gao

Multimodal intent recognition aims to leverage diverse modalities, such as expressions, body movements, and tone of speech, to comprehend a user's intent, constituting a critical task for understanding human language and behavior in real-world multimodal scenarios. Nevertheless, the majority of existing methods ignore potential correlations among different modalities and have limitations in effectively learning semantic features from nonverbal modalities. In this paper, we introduce a token-level contrastive learning method with modality-aware prompting (TCL-MAP) to address these challenges. To establish an optimal multimodal semantic environment for the text modality, we develop a modality-aware prompting module (MAP), which effectively aligns and fuses features from the text, video, and audio modalities with similarity-based modality alignment and a cross-modality attention mechanism. Based on the modality-aware prompt and ground-truth labels, the proposed token-level contrastive learning framework (TCL) constructs augmented samples and employs the NT-Xent loss on the label token. Specifically, TCL capitalizes on the optimal textual semantic insights derived from intent labels to guide the learning of the other modalities in return. Extensive experiments show that our method achieves remarkable improvements over state-of-the-art methods. Additionally, ablation analyses demonstrate the superiority of the modality-aware prompt over handcrafted prompts, which holds substantial significance for multimodal prompt learning. The code is released at https://github.com/thuiar/TCL-MAP.
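
For reference, a standard NT-Xent implementation in the SimCLR style (how TCL builds the two views of the label token is not shown here; the pairing below is the generic form):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """z1[i], z2[i]: embeddings of two views of sample i. Returns the NT-Xent loss."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)  # (2n, d), unit norm
    sim = z @ z.t() / tau                        # scaled cosine similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)         # positive = the other view
```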

AAAI Conference 2023 Conference Paper

Hybrid CNN-Transformer Feature Fusion for Single Image Deraining

  • Xiang Chen
  • Jinshan Pan
  • Jiyang Lu
  • Zhentao Fan
  • Hao Li

Since rain streaks exhibit diverse geometric appearances and irregular overlap, these complex characteristics challenge the design of an effective single-image deraining model. Rich local-global representations are therefore increasingly indispensable for satisfactory rain removal. In this paper, we propose a lightweight Hybrid CNN-Transformer Feature Fusion Network (dubbed HCT-FFN) built in a stage-by-stage progressive manner, which harmonizes the two architectures to aid image restoration by leveraging their individual learning strengths. Specifically, we stack a sequence of degradation-aware mixture-of-experts (DaMoE) modules in the CNN-based stage, where appropriate local experts adaptively enable the model to emphasize spatially varying rain distribution features. In the Transformer-based stage, a background-aware vision Transformer (BaViT) module is employed to complement spatially long feature dependencies, achieving global texture recovery while preserving the required structure. Considering the indeterminate knowledge discrepancy between CNN and Transformer features, we introduce an interactive fusion branch at adjacent stages to further facilitate the reconstruction of high-quality deraining results. Extensive evaluations show the effectiveness and extensibility of our HCT-FFN. The source code is available at https://github.com/cschenxiang/HCT-FFN.

JAAMAS Journal 2023 Journal Article

IOB: integrating optimization transfer and behavior transfer for multi-policy reuse

  • Siyuan Li
  • Hao Li
  • Chongjie Zhang

Humans have the ability to reuse previously learned policies to solve new tasks quickly, and reinforcement learning (RL) agents can do the same by transferring knowledge from source policies to a related target task. Transfer RL methods can reshape the policy optimization objective (optimization transfer) or influence the behavior policy (behavior transfer) using source policies. However, selecting the appropriate source policy with limited samples to guide target policy learning has been a challenge. Previous methods introduce additional components, such as hierarchical policies or estimates of source policies' value functions, which can lead to non-stationary policy optimization or heavy sampling costs, diminishing transfer effectiveness. To address this challenge, we propose a novel transfer RL method that selects the source policy without training extra components. Our method utilizes the Q function in the actor-critic framework to guide policy selection, choosing the source policy with the largest one-step improvement over the current target policy. We integrate optimization transfer and behavior transfer (IOB) by regularizing the learned policy to mimic the guidance policy and combining them as the behavior policy. This integration significantly enhances transfer effectiveness, surpassing state-of-the-art transfer RL baselines on benchmark tasks and improving final performance and knowledge transferability in continual learning scenarios. Additionally, we show that our optimization transfer technique is guaranteed to improve target policy learning.
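
The selection rule described above can be written compactly (notation assumed): with source policies $\pi_1, \dots, \pi_n$, current target policy $\pi_{\text{tgt}}$, and critic $Q$,

\[
g(s) = \pi_{i^*}(s), \qquad i^* = \arg\max_{i \in \{1, \dots, n, \text{tgt}\}} Q\big(s, \pi_i(s)\big),
\]

i.e., at each state the guidance policy $g$ follows whichever policy's action the critic currently scores highest, which is exactly a one-step improvement over $\pi_{\text{tgt}}$.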

NeurIPS Conference 2023 Conference Paper

JourneyDB: A Benchmark for Generative Image Understanding

  • Keqiang Sun
  • Junting Pan
  • Yuying Ge
  • Hao Li
  • Haodong Duan
  • Xiaoshi Wu
  • Renrui Zhang
  • Aojun Zhou

While recent advancements in vision-language models have had a transformative impact on multi-modal comprehension, the extent to which these models can comprehend generated images remains uncertain. Synthetic images, in comparison to real data, encompass a higher level of diversity in both content and style, presenting significant challenges for the models to fully grasp. In light of this challenge, we introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images within the context of multi-modal visual understanding. Our meticulously curated dataset comprises 4 million distinct and high-quality generated images, each paired with the text prompts that were employed in their creation. Furthermore, we introduce an external subset with results from another 22 text-to-image generative models, which makes JourneyDB a comprehensive benchmark for evaluating the comprehension of generated images. On our dataset, we have devised four benchmarks to assess generated-image comprehension in relation to both content and style interpretation: prompt inversion, style retrieval, image captioning, and visual question answering. Lastly, we evaluate the performance of state-of-the-art multi-modal models on the JourneyDB dataset, providing a comprehensive analysis of their strengths and limitations in comprehending generated content. We anticipate that the proposed dataset and benchmarks will facilitate further research in the field of generative content understanding. The dataset is publicly available at https://journeydb.github.io.

AAAI Conference 2023 Conference Paper

Point-Teaching: Weakly Semi-supervised Object Detection with Point Annotations

  • Yongtao Ge
  • Qiang Zhou
  • Xinlong Wang
  • Chunhua Shen
  • Zhibin Wang
  • Hao Li

Point annotations are considerably more time-efficient than bounding-box annotations. However, how to use cheap point annotations to boost the performance of semi-supervised object detection remains an open question. In this work, we present Point-Teaching, a weakly and semi-supervised object detection framework that fully utilizes point annotations. Specifically, we propose a Hungarian-based point-matching method to generate pseudo labels for point-annotated images. We further propose multiple-instance learning (MIL) approaches at the image and point levels to supervise the object detector with point annotations. Finally, we propose a simple data augmentation, named Point-Guided Copy-Paste, to reduce the impact of unmatched points. Experiments demonstrate the effectiveness of our method on several datasets and across various data regimes. In particular, Point-Teaching outperforms the previous best method, Group R-CNN, by 3.1 AP with 5% fully labeled data and by 2.3 AP with 30% fully labeled data on the MS COCO dataset. We believe the proposed framework can largely lower the bar for learning accurate object detectors and pave the way for broader applications. The code is available at https://github.com/YongtaoGe/Point-Teaching.
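
A minimal sketch of Hungarian point-to-proposal matching (the cost here is plain Euclidean distance between annotated points and predicted box centers; the paper's cost presumably also includes classification terms):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

points = np.random.rand(5, 2)        # annotated (x, y) points, one per object
box_centers = np.random.rand(20, 2)  # centers of predicted proposals

# Cost matrix: distance from each annotated point to each proposal center.
cost = np.linalg.norm(points[:, None, :] - box_centers[None, :, :], axis=-1)

rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
# The matched proposals (cols) become pseudo boxes for the point-annotated image.
```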

NeurIPS Conference 2023 Conference Paper

Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval

  • Hao Li
  • Jingkuan Song
  • Lianli Gao
  • Xiaosu Zhu
  • Hengtao Shen

Cross-modal retrieval methods build similarity relations between the vision and language modalities by jointly learning a common representation space. However, the predictions are often unreliable due to aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts. In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arising from inherent data ambiguity. Concretely, we first construct a set of learnable prototypes for each modality to represent the entire semantic subspace. Then, Dempster-Shafer Theory and Subjective Logic Theory are utilized to build an evidential theoretical framework by associating evidence with Dirichlet distribution parameters. The PAU model yields accurate uncertainty estimates and reliable predictions for cross-modal retrieval. Extensive experiments are performed on four major benchmark datasets, MSR-VTT, MSVD, DiDeMo, and MS-COCO, demonstrating the effectiveness of our method. The code is accessible at https://github.com/leolee99/PAU.
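
In the standard evidential formulation that this framework builds on (Subjective Logic with a Dirichlet prior), evidence $e_k \ge 0$ for each of $K$ classes is mapped to Dirichlet parameters, beliefs, and an uncertainty mass:

\[
\alpha_k = e_k + 1, \qquad S = \sum_{k=1}^{K} \alpha_k, \qquad b_k = \frac{e_k}{S}, \qquad u = \frac{K}{S},
\]

so little total evidence (small $S$) yields high uncertainty $u$; how PAU derives the evidence from prototypes is not reproduced here.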

IJCAI Conference 2023 Conference Paper

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment

  • Peng Jin
  • Hao Li
  • Zesen Cheng
  • Jinfa Huang
  • Zhennan Wang
  • Li Yuan
  • Chang Liu
  • Jie Chen

Text-video retrieval is a challenging cross-modal task that aims to align visual entities with natural language descriptions. Current methods either fail to leverage local details or are computationally expensive; worse, they fail to leverage the heterogeneous concepts in the data. In this paper, we propose Disentangled Conceptualization and Set-to-set Alignment (DiCoSA) to simulate the conceptualizing and reasoning process of human beings. For disentangled conceptualization, we divide coarse features into multiple latent factors related to semantic concepts. For set-to-set alignment, where a set of visual concepts corresponds to a set of textual concepts, we propose an adaptive pooling method that aggregates semantic concepts to address partial matching. In particular, since we encode concepts independently in only a few dimensions, DiCoSA is superior in efficiency and granularity, ensuring fine-grained interactions at a computational complexity similar to that of coarse-grained alignment. Extensive experiments on five datasets, including MSR-VTT, LSMDC, MSVD, ActivityNet, and DiDeMo, demonstrate that our method outperforms existing state-of-the-art methods.

IJCAI Conference 2023 Conference Paper

TG-VQA: Ternary Game of Video Question Answering

  • Hao Li
  • Peng Jin
  • Zesen Cheng
  • Songyang Zhang
  • Kai Chen
  • Zhennan Wang
  • Chang Liu
  • Jie Chen

Video question answering aims to answer questions about video content by reasoning over the alignment semantics between them. However, because they rely heavily on human instructions, i.e., annotations or priors, current contrastive-learning-based VideoQA methods still struggle to perform fine-grained visual-linguistic alignment. In this work, we innovatively resort to game theory, which can simulate complicated relationships among multiple players with specific interaction strategies, e.g., video, question, and answer as ternary players, to achieve fine-grained alignment for the VideoQA task. Specifically, we carefully design a VideoQA-specific interaction strategy tailored to the characteristics of VideoQA, which can mathematically generate fine-grained visual-linguistic alignment labels without label-intensive effort. Our TG-VQA outperforms the existing state-of-the-art by a large margin (more than 5%) on long-term and short-term VideoQA datasets, verifying its effectiveness and generalization ability. Thanks to the guidance of game-theoretic interaction, our model converges impressively well on limited data (10^4 videos), surpassing most models pre-trained on large-scale data (10^7 videos).

IJCAI Conference 2023 Conference Paper

WiCo: Win-win Cooperation of Bottom-up and Top-down Referring Image Segmentation

  • Zesen Cheng
  • Peng Jin
  • Hao Li
  • Kehan Li
  • Siheng Li
  • Xiangyang Ji
  • Chang Liu
  • Jie Chen

Top-down and bottom-up methods are the two mainstream approaches to referring segmentation, and each has its own intrinsic weaknesses. Top-down methods are chiefly disturbed by Polar Negative (PN) errors owing to the lack of fine-grained cross-modal alignment. Bottom-up methods are mainly perturbed by Inferior Positive (IP) errors due to the lack of prior object information. Nevertheless, we discover that the two types of methods are highly complementary, each restraining the other's weaknesses, while a direct average combination leads to harmful interference. In this context, we build Win-win Cooperation (WiCo) to exploit the complementary nature of the two types of methods at both the interaction and integration levels to achieve a win-win improvement. At the interaction level, Complementary Feature Interaction (CFI) introduces prior object information to the bottom-up branch and provides fine-grained information to the top-down branch for complementary feature enhancement. At the integration level, Gaussian Scoring Integration (GSI) models the Gaussian performance distributions of the two branches and integrates their results by weighting with confidence scores sampled from these distributions. With our WiCo, several prominent bottom-up and top-down combinations achieve remarkable improvements on three common datasets at reasonable extra cost, which justifies the effectiveness and generality of our method.

NeurIPS Conference 2022 Conference Paper

A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval

  • Hao Li
  • Jingkuan Song
  • Lianli Gao
  • Pengpeng Zeng
  • Haonan Zhang
  • Gongfu Li

Cross-modal retrieval aims to build correspondence between multiple modalities by learning a common representation space. Typically, an image can match multiple texts semantically, and vice versa, which significantly increases the difficulty of this task. To address this problem, probabilistic embedding is proposed to quantify these many-to-many relationships. However, existing datasets (e.g., MS-COCO) and metrics (e.g., Recall@K) cannot fully represent these diverse correspondences due to non-exhaustive annotations. Based on this observation, we utilize semantic correlations computed by CIDEr to find the potential correspondences. We then present an effective metric, named Average Semantic Precision (ASP), which measures the ranking precision of semantic correlation for retrieval sets. Additionally, we introduce a novel and concise objective, coined Differentiable ASP Approximation (DAA). Concretely, DAA optimizes ASP directly by making the ranking function of ASP differentiable through a sigmoid function. To verify the effectiveness of our approach, extensive experiments are conducted on MS-COCO, CUB Captions, and Flickr30K, which are commonly used in cross-modal retrieval. The results show that our approach obtains superior performance over state-of-the-art approaches on all metrics. The code and trained models are released at https://github.com/leolee99/2022-NeurIPS-DAA.
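
The sigmoid relaxation of a ranking function is a standard construction; a minimal version is sketched below (how DAA plugs this into ASP is not shown, and the temperature value is an assumption):

```python
import torch

def soft_rank(scores: torch.Tensor, tau: float = 0.01) -> torch.Tensor:
    # Hard rank (descending): rank_i = 1 + sum_{j != i} 1[s_j > s_i].
    # Replacing the indicator with a sigmoid makes the rank -- and any
    # precision-style metric built on it -- differentiable in the scores.
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)   # diff[i, j] = s_j - s_i
    return 0.5 + torch.sigmoid(diff / tau).sum(dim=1)  # the j == i term adds 0.5

scores = torch.tensor([0.9, 0.1, 0.5], requires_grad=True)
print(soft_rank(scores))  # approximately [1., 3., 2.]
```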

NeurIPS Conference 2022 Conference Paper

Entropy-Driven Mixed-Precision Quantization for Deep Network Design

  • Zhenhong Sun
  • Ce Ge
  • Junyan Wang
  • Ming Lin
  • Hesen Chen
  • Hao Li
  • Xiuyu Sun

Deploying deep convolutional neural networks on Internet-of-Things (IoT) devices is challenging due to limited computational resources, such as limited SRAM and Flash storage. Previous works redesign a small network for IoT devices and then compress the network size via mixed-precision quantization. This two-stage procedure cannot jointly optimize the architecture and the corresponding quantization, leading to sub-optimal tiny deep models. In this work, we propose a one-stage solution that optimizes both jointly and automatically. The key idea of our approach is to cast the joint architecture design and quantization as an entropy maximization process. In particular, our algorithm automatically designs a tiny deep model such that: 1) its representation capacity, measured by entropy, is maximized under the given computational budget; 2) each layer is assigned a proper quantization precision; 3) the overall design loop runs on CPU, with no GPU required. More impressively, our method can directly search high-expressiveness architectures for IoT devices in less than half a CPU hour. Extensive experiments on three widely adopted benchmarks, ImageNet, VWW, and WIDER FACE, demonstrate that our method achieves state-of-the-art performance in the tiny-deep-model regime. Code and pre-trained models are available at https://github.com/alibaba/lightweight-neural-architecture-search.

IJCAI Conference 2022 Conference Paper

ER-SAN: Enhanced-Adaptive Relation Self-Attention Network for Image Captioning

  • Jingyu Li
  • Zhendong Mao
  • Shancheng Fang
  • Hao Li

Image captioning (IC), bringing vision to language, has drawn extensive attention. Precisely describing the visual relations between image objects is a key challenge in IC. We argue that visual relations, that is, geometric positions (i.e., distance and size) and semantic interactions (i.e., actions and possessives), indicate the mutual correlations between objects. Existing Transformer-based methods typically resort to geometric positions to enhance the representation of visual relations, yet shallow geometric information alone cannot precisely cover the complex, action-related correlations. In this paper, we propose to enhance the correlations between objects from a comprehensive view that jointly considers explicit semantic and geometric relations, generating plausible captions with accurate relationship predictions. Specifically, we propose a novel Enhanced-Adaptive Relation Self-Attention Network (ER-SAN). We design direction-sensitive semantic-enhanced attention, which considers attention from content objects to semantic relations and from semantic relations to content objects, to learn explicit semantics-aware relations. Further, we devise an adaptive re-weighting relation module that determines how much semantic and geometric attention should be allocated to each relation feature. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of our ER-SAN, improving CIDEr from 128.6% to 135.3% and achieving state-of-the-art performance. The code will be released at \url{https://github.com/CrossmodalGroup/ER-SAN}.

NeurIPS Conference 2022 Conference Paper

Improved Fine-Tuning by Better Leveraging Pre-Training Data

  • Ziquan Liu
  • Yi Xu
  • Yuanhong Xu
  • Qi Qian
  • Hao Li
  • Xiangyang Ji
  • Antoni Chan
  • Rong Jin

As a dominant paradigm, fine-tuning a pre-trained model on the target data is widely used in many deep learning applications, especially with small data sets. However, recent studies have empirically shown that, in some vision tasks, training from scratch achieves final performance no worse than this pre-training strategy once the number of training samples is increased. In this work, we revisit this phenomenon from the perspective of generalization analysis, using the excess risk bound, which is popular in learning theory. The result reveals that the excess risk bound may have a weak dependency on the pre-trained model. This observation inspires us to leverage pre-training data for fine-tuning, since this data is also available at fine-tuning time. The generalization result shows that the excess risk bound on a target task can be improved when appropriate pre-training data is included in fine-tuning. With this theoretical motivation, we propose a novel selection strategy that selects a subset of the pre-training data to help improve generalization on the target task. Extensive experimental results for image classification on 8 benchmark data sets verify the effectiveness of the proposed data-selection-based fine-tuning pipeline. Our code is available at https://github.com/ziquanliu/NeurIPS2022_UOT_fine_tuning.

IROS Conference 2022 Conference Paper

LiDAR-Aided Visual-Inertial Localization with Semantic Maps

  • Hao Li
  • Liangliang Pan
  • Ji Zhao 0001

Accurate and robust localization is an essential task for autonomous driving systems. In this paper, we propose a novel 3D LiDAR-aided visual-inertial localization method. Our method fully exploits the complementarity of visual and LiDAR observations. On the one hand, the association between semantic features in images and a given semantic map constrains the absolute pose. On the other hand, LiDAR odometry (LO) provides an accurate and robust 6DOF relative pose. The Error State Kalman Filter (ESKF) framework is exploited to estimate the vehicle pose relative to the semantic map, fusing the global constraints between the image and the semantic map, the relative pose from the LO, and the raw IMU data. The method achieves centimeter-level localization accuracy in a variety of challenging scenarios. We validate the robustness and accuracy of our method in real-world scenes over 50 km. The experimental results show that the proposed method achieves an average lateral accuracy of 0.059 m and a longitudinal accuracy of 0.158 m, demonstrating the practicality of the proposed system in autonomous driving applications.

JBHI Journal 2022 Journal Article

MDADP: A Webserver Integrating Database and Prediction Tools for Microbe-Disease Associations

  • Lei Wang
  • Hao Li
  • Yuqi Wang
  • Yihong Tan
  • Zhiping Chen
  • Tingrui Pei
  • Quan Zou

More and more evidence has demonstrated that microbiota play important roles in the life processes of the human body. In recent years, various computational methods have been proposed to identify potentially disease-associated microbes and save the cost of traditional biological experiments. However, the prediction performance of these methods is generally limited by outdated and incomplete datasets. Moreover, few existing studies provide visual, interactive tools for inferring possible microbe-disease associations (MDAs). Hence, in this manuscript, we propose a novel webserver called MDADP to identify latent MDAs, which combines a new MDA database with interactive MDA prediction tools. For the newly constructed database, 2,019 known MDAs between 58 diseases and 703 microbes were first manually collected. Then, by adopting the average ranking method and the co-confidence method, eight representative computational models were integrated to identify potential disease-related microbes. As a result, MDADP provides not only interactive features for users to access and explore MDA entries, but also effective tools to identify candidate microbes for different diseases. To our knowledge, MDADP is the first online platform that incorporates a new MDA database with comprehensive MDA prediction tools. We therefore believe it will be a valuable source of information for research in microbiology and disease-related fields. MDADP can be accessed at http://mdadp.leelab2997.cn.

AAAI Conference 2022 Conference Paper

PMAL: Open Set Recognition via Robust Prototype Mining

  • Jing Lu
  • Yunlu Xu
  • Hao Li
  • Zhanzhan Cheng
  • Yi Niu

Open Set Recognition (OSR) is an emerging topic: besides recognizing predefined classes, the system must also reject unknowns. Prototype learning is a promising way to handle the problem, as its ability to improve the intra-class compactness of representations is much needed for discriminating between knowns and unknowns. In this work, we propose a novel Prototype Mining And Learning (PMAL) framework. It adds a prototype mining mechanism before the phase of optimizing the embedding space, explicitly considering two crucial properties of the prototype set: high quality and diversity. Concretely, a set of high-quality candidates is first extracted from the training samples based on data uncertainty learning, avoiding interference from unexpected noise. Considering the multifarious appearance of objects even within a single category, a diversity-based strategy for prototype set filtering is proposed. Accordingly, the embedding space can be better optimized to discriminate among the predefined classes and between knowns and unknowns. Extensive experiments verify the two desired characteristics (i.e., high quality and diversity) embraced by prototype mining and show the remarkable performance of the proposed framework compared to the state of the art.

AAAI Conference 2022 Conference Paper

Scaled ReLU Matters for Training Vision Transformers

  • Pichao Wang
  • Xue Wang
  • Hao Luo
  • Jingkai Zhou
  • Zhipeng Zhou
  • Fan Wang
  • Hao Li
  • Rong Jin

Vision transformers (ViTs) have become an alternative design paradigm to convolutional neural networks (CNNs). However, training ViTs is much harder than training CNNs, as it is sensitive to training parameters such as learning rate, optimizer, and warmup epochs. The reasons for this training difficulty were empirically analysed in the paper Early Convolutions Help Transformers See Better, whose authors conjecture that the issue lies with the patchify stem of ViT models. In this paper, we further investigate this problem and extend the above conclusion: early convolutions alone do not ensure stable training; rather, the scaled ReLU operation in the convolutional stem (conv-stem) is what matters. We verify, both theoretically and empirically, that scaled ReLU in the conv-stem not only improves training stabilization but also increases the diversity of patch tokens, boosting peak performance by a large margin while adding few parameters and FLOPs. In addition, extensive experiments demonstrate that previous ViTs are far from well trained, further showing that ViTs have great potential to be a better substitute for CNNs.
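
One plausible reading of such a conv-stem in code, where each ReLU is "scaled" by the affine BatchNorm preceding it (the channel widths and depth below are illustrative, not the paper's exact stem):

```python
import torch.nn as nn

def conv_bn_relu(c_in: int, c_out: int) -> nn.Sequential:
    # BatchNorm's learnable affine scale provides the scaling around ReLU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

# A 4-layer conv-stem replacing the ViT patchify stem (16x total downsampling).
stem = nn.Sequential(
    conv_bn_relu(3, 48), conv_bn_relu(48, 96),
    conv_bn_relu(96, 192), conv_bn_relu(192, 384),
)
```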

AAAI Conference 2022 Conference Paper

TransZero: Attribute-Guided Transformer for Zero-Shot Learning

  • Shiming Chen
  • Ziming Hong
  • Yang Liu
  • Guo-Sen Xie
  • Baigui Sun
  • Hao Li
  • Qinmu Peng
  • Ke Lu

Zero-shot learning (ZSL) aims to recognize novel classes by transferring semantic knowledge from seen classes to unseen ones. Semantic knowledge is learned from attribute descriptions shared between different classes, which act as strong priors for localizing object attributes that represent discriminative region features, enabling significant visual-semantic interaction. Although some attention-based models have attempted to learn such region features in a single image, the transferability and discriminative attribute localization of visual features are typically neglected. In this paper, we propose an attribute-guided Transformer network, termed TransZero, to refine visual features and learn attribute localization for discriminative visual embedding representations in ZSL. Specifically, TransZero uses a feature augmentation encoder to alleviate the cross-dataset bias between ImageNet and ZSL benchmarks, and improves the transferability of visual features by reducing the entangled relative geometric relationships among region features. To learn locality-augmented visual features, TransZero employs a visual-semantic decoder that localizes the image regions most relevant to each attribute in a given image, under the guidance of semantic attribute information. The locality-augmented visual features and semantic vectors are then used to conduct effective visual-semantic interaction in a visual-semantic embedding network. Extensive experiments show that TransZero achieves a new state of the art on three ZSL benchmarks. The code is available at https://github.com/shiming-chen/TransZero.

NeurIPS Conference 2022 Conference Paper

VTC-LFC: Vision Transformer Compression with Low-Frequency Components

  • Zhenyu Wang
  • Hao Luo
  • Pichao Wang
  • Feng Ding
  • Fan Wang
  • Hao Li

Although Vision transformers (ViTs) have recently come to dominate many vision tasks, deploying ViT models on resource-limited devices remains challenging. To address this challenge, several methods have been proposed to compress ViTs. Most of them borrow experience from convolutional neural networks (CNNs) and mainly focus on the spatial domain. However, compression in the spatial domain alone suffers from a dramatic performance drop without fine-tuning and is not robust to noise, since noise in the spatial domain can easily confuse the pruning criteria and lead to some parameters/channels being pruned incorrectly. Inspired by recent findings that self-attention is a low-pass filter and that low-frequency signals/components are more informative to ViTs, this paper proposes compressing ViTs with low-frequency components. Two metrics, low-frequency sensitivity (LFS) and low-frequency energy (LFE), are proposed for better channel pruning and token pruning. Additionally, a bottom-up cascade pruning scheme is applied to compress different dimensions jointly. Extensive experiments demonstrate that the proposed method can save 40%-60% of the FLOPs in ViTs, significantly increasing throughput on practical devices with less than a 1% performance drop on ImageNet-1K.
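The precise definitions of LFS and LFE are given in the paper; the sketch below only conveys the underlying idea of scoring channels by how much of their spectral energy sits in low frequencies, with a hypothetical frequency cutoff.

import torch

def low_frequency_energy(feat, cutoff=0.25):
    """Illustrative low-frequency energy ratio per channel.

    feat: (B, C, H, W) feature maps. Returns the per-channel fraction
    of spectral energy inside a centered low-frequency square."""
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    energy = spec.abs() ** 2
    B, C, H, W = feat.shape
    h, w = int(H * cutoff), int(W * cutoff)
    ch, cw = H // 2, W // 2
    low = energy[..., ch - h:ch + h, cw - w:cw + w].sum(dim=(-2, -1))
    total = energy.sum(dim=(-2, -1))
    return (low / total).mean(dim=0)  # (C,) scores, e.g. for pruning

scores = low_frequency_energy(torch.randn(2, 8, 14, 14))
print(scores.shape)  # torch.Size([8])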

IROS Conference 2021 Conference Paper

DT-Loc: Monocular Visual Localization on HD Vector Map Using Distance Transforms of 2D Semantic Detections

  • Chi Zhang 0069
  • Hao Liu 0007
  • Hao Li
  • Kun Guo
  • Kuiyuan Yang
  • Rui Cai 0002
  • Zhiwei Li 0006

Localizing a vehicle on a prebuilt HD vector map is a prerequisite for many autonomous driving applications. Existing visual localization approaches usually require a separate local-feature layer to function. Such a separate localization layer inherits the robustness issues of the local features, and it can be difficult to create a feature layer that aligns perfectly with an existing vector map. In this paper, we propose a monocular visual localization method that exploits the vector map directly as the localization layer. The method detects semantic traffic elements in the images and matches them with the vectors in the map. To mitigate the problem of false matches, we propose to align the vector map to the distance transforms of the semantic detections, which enables a non-explicit and differentiable data association process. The system achieves centimeter and sub-meter accuracies in the lateral and longitudinal directions, respectively.
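A toy version of the matching idea, assuming scipy is available: the distance transform turns a binary detection mask into a smooth cost surface, so projected map points can be scored by sampling it. The real system optimizes a full vehicle pose; this sketch only scores fixed 2D projections.

import numpy as np
from scipy.ndimage import distance_transform_edt

def alignment_cost(detection_mask, projected_pts):
    """Score how well projected map points align with detections.

    detection_mask: binary (H, W) semantic detection image.
    projected_pts: (N, 2) integer (x, y) pixel coordinates of map
    vectors under a candidate pose. Lower cost = better alignment."""
    # Distance from every pixel to the nearest detected pixel.
    dt = distance_transform_edt(~detection_mask.astype(bool))
    rows, cols = projected_pts[:, 1], projected_pts[:, 0]
    return dt[rows, cols].mean()

# Toy example: a vertical lane marking detected at column 50.
mask = np.zeros((100, 100), dtype=np.uint8)
mask[:, 50] = 1
pts_good = np.stack([np.full(100, 51), np.arange(100)], axis=1)
pts_bad = np.stack([np.full(100, 70), np.arange(100)], axis=1)
print(alignment_cost(mask, pts_good), alignment_cost(mask, pts_bad))  # 1.0 20.0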

NeurIPS Conference 2021 Conference Paper

HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning

  • Shiming Chen
  • Guosen Xie
  • Yang Liu
  • Qinmu Peng
  • Baigui Sun
  • Hao Li
  • Xinge You
  • Ling Shao

Zero-shot learning (ZSL) tackles the unseen-class recognition problem by transferring semantic knowledge from seen classes to unseen ones. Typically, to guarantee desirable knowledge transfer, a common (latent) space is adopted for associating the visual and semantic domains in ZSL. However, existing common space learning methods align the semantic and visual domains by merely mitigating distribution disagreement through one-step adaptation. This strategy is usually ineffective due to the heterogeneous nature of the feature representations in the two domains, which intrinsically contain both distribution and structure variations. To address this and advance ZSL, we propose a novel hierarchical semantic-visual adaptation (HSVA) framework. Specifically, HSVA aligns the semantic and visual domains through a hierarchical two-step adaptation, i.e., structure adaptation followed by distribution adaptation. In the structure adaptation step, we use two task-specific encoders to encode the source data (visual domain) and the target data (semantic domain) into a structure-aligned common space. To this end, a supervised adversarial discrepancy (SAD) module is proposed to adversarially minimize the discrepancy between the predictions of two task-specific classifiers, making the visual and semantic feature manifolds more closely aligned. In the distribution adaptation step, we directly minimize the Wasserstein distance between the latent multivariate Gaussian distributions to align the visual and semantic distributions using a common encoder. Finally, the structure and distribution adaptation are derived in a unified framework under two partially aligned variational autoencoders. Extensive experiments on four benchmark datasets demonstrate that HSVA achieves superior performance on both conventional and generalized ZSL. The code is available at https://github.com/shiming-chen/HSVA.
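For diagonal Gaussian posteriors, the Wasserstein distance mentioned above has a simple closed form; the sketch below shows that special case and is not the authors' code.

import torch

def w2_diagonal_gaussians(mu1, logvar1, mu2, logvar2):
    """Closed-form squared 2-Wasserstein distance between diagonal
    Gaussians N(mu1, diag(var1)) and N(mu2, diag(var2)):
        W2^2 = ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2
    the standard VAE-friendly special case of the general formula."""
    sigma1 = torch.exp(0.5 * logvar1)
    sigma2 = torch.exp(0.5 * logvar2)
    return ((mu1 - mu2) ** 2).sum(-1) + ((sigma1 - sigma2) ** 2).sum(-1)

# Toy usage with batched latent parameters from two encoders.
mu_v, lv_v = torch.zeros(4, 64), torch.zeros(4, 64)
mu_s, lv_s = torch.ones(4, 64), torch.zeros(4, 64)
print(w2_diagonal_gaussians(mu_v, lv_v, mu_s, lv_s))  # four values of 64.0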

NeurIPS Conference 2020 Conference Paper

Fully Convolutional Mesh Autoencoder using Efficient Spatially Varying Kernels

  • Yi Zhou
  • Chenglei Wu
  • Zimo Li
  • Chen Cao
  • Yuting Ye
  • Jason Saragih
  • Hao Li
  • Yaser Sheikh

Learning latent representations of registered meshes is useful for many 3D tasks. Techniques have recently shifted to neural mesh autoencoders. Although they demonstrate higher precision than traditional methods, they remain unable to capture fine-grained deformations. Furthermore, these methods can only be applied to a template-specific surface mesh and are not applicable to more general meshes, such as tetrahedral and non-manifold meshes. While more general graph convolution methods can be employed, they fall short in reconstruction precision and require more memory. In this paper, we propose a non-template-specific fully convolutional mesh autoencoder for arbitrary registered mesh data. It is enabled by our novel convolution and (un)pooling operators, learned with globally shared weights and locally varying coefficients, which can efficiently capture the spatially varying content presented by irregular mesh connections. Our model outperforms state-of-the-art methods on reconstruction accuracy. In addition, the latent codes of our network are fully localized thanks to the fully convolutional structure, and thus have much higher interpolation capability than many traditional 3D mesh generation models.

ICLR Conference 2020 Conference Paper

Rethinking the Hyperparameters for Fine-tuning

  • Hao Li
  • Pratik Chaudhari
  • Hao Yang 0043
  • Michael Lam
  • Avinash Ravichandran
  • Rahul Bhotika
  • Stefano Soatto

Fine-tuning from pre-trained ImageNet models has become the de-facto standard for various computer vision tasks. Current practices for fine-tuning typically involve selecting an ad-hoc choice of hyperparameters and keeping them fixed at values normally used for training from scratch. This paper re-examines several common practices of setting hyperparameters for fine-tuning. Our findings are based on extensive empirical evaluation of fine-tuning on various transfer learning benchmarks. (1) While prior works have thoroughly investigated learning rate and batch size, momentum for fine-tuning is a relatively unexplored parameter. We find that the value of momentum also affects fine-tuning performance and connect it with previous theoretical findings. (2) Optimal hyperparameters for fine-tuning, in particular the effective learning rate, are not only dataset dependent but also sensitive to the similarity between the source and target domains. This is in contrast to hyperparameters for training from scratch. (3) Reference-based regularization that keeps models close to the initial model does not necessarily apply to "dissimilar" datasets. Our findings challenge common practices of fine-tuning and encourage deep learning practitioners to rethink the hyperparameters for fine-tuning.
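On point (2), the effective learning rate for SGD with momentum m is commonly approximated as lr / (1 - m); the snippet below only illustrates how two different (lr, momentum) pairs can be equivalent under this view.

def effective_lr(lr, momentum):
    """Effective learning rate of SGD with heavy-ball momentum:
    in steady state, momentum rescales the step size by 1/(1 - m)."""
    return lr / (1.0 - momentum)

# Two settings with the same effective learning rate of 0.1:
print(effective_lr(0.01, 0.9))   # 0.1
print(effective_lr(0.05, 0.5))   # 0.1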

NeurIPS Conference 2019 Conference Paper

Learning to Infer Implicit Surfaces without 3D Supervision

  • Shichen Liu
  • Shunsuke Saito
  • Weikai Chen
  • Hao Li

Recent advances in 3D deep learning have shown that it is possible to train highly effective deep models for 3D shape generation directly from 2D images. This is particularly interesting since the availability of 3D models is still limited compared to the massive amount of accessible 2D images, which is invaluable for training. The representation of 3D surfaces itself is a key factor in the quality and resolution of the 3D output. While explicit representations, such as point clouds and voxels, can span a wide range of shape variations, their resolutions are often limited. Mesh-based representations are more efficient but are limited in their ability to handle varying topologies. Implicit surfaces, however, can robustly handle complex shapes and topologies, and also provide flexible resolution control. We address the fundamental problem of learning implicit surfaces for shape inference without the need for 3D supervision. Despite their advantages, it remains nontrivial to (1) formulate a differentiable connection between implicit surfaces and their 2D renderings, which is needed for image-based supervision; and (2) ensure precise geometric properties and control, such as local smoothness. In particular, densely sampling implicit surfaces is known to be a computationally demanding and very slow operation. To this end, we propose a novel ray-based field probing technique for efficient image-to-field supervision, as well as a general geometric regularizer for implicit surfaces, which provides natural shape priors in unconstrained regions. We demonstrate the effectiveness of our framework on the task of single-view image-based 3D shape digitization and show that we outperform state-of-the-art techniques both quantitatively and qualitatively.

AAAI Conference 2019 Conference Paper

Robust Optimization over Multiple Domains

  • Qi Qian
  • Shenghuo Zhu
  • Jiasheng Tang
  • Rong Jin
  • Baigui Sun
  • Hao Li

In this work, we study the problem of learning a single model for multiple domains. Unlike the conventional machine learning scenario, in which each domain has its own model, multiple domains (i.e., applications/users) may share the same machine learning model due to maintenance loads in cloud computing services. For example, a digit-recognition model should be applicable to handwritten digits, house numbers, car plates, etc. An ideal model for cloud computing therefore has to perform well on every applicable domain. To address this new challenge from cloud computing, we develop a framework for robust optimization over multiple domains. In lieu of minimizing the empirical risk, we aim to learn a model optimized against the adversarial distribution over the multiple domains. Hence, we propose to learn the model and the adversarial distribution simultaneously with a stochastic algorithm for efficiency. Theoretically, we analyze the convergence rate for convex and non-convex models. To the best of our knowledge, this is the first study of the convergence rate of learning a robust non-convex model with a practical algorithm. Furthermore, we demonstrate that the robustness of the framework and the convergence rate can be further enhanced by appropriate regularizers over the adversarial distribution. An empirical study on real-world fine-grained visual categorization and digit recognition tasks verifies the effectiveness and efficiency of the proposed framework.
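One standard way to realize an adversarial-distribution update is a multiplicative-weights (mirror ascent) step, sketched below on fixed toy losses; the actual algorithm updates the model and the distribution jointly and stochastically.

import numpy as np

def adversarial_distribution_step(p, domain_losses, eta=0.1):
    """One multiplicative-weights update of the adversarial distribution
    over domains: domains with higher loss receive more probability
    mass, pushing the model to improve on its worst domains."""
    p = p * np.exp(eta * domain_losses)
    return p / p.sum()

p = np.ones(3) / 3                  # uniform over 3 domains
losses = np.array([0.2, 0.9, 0.4])  # per-domain empirical losses
for _ in range(50):
    p = adversarial_distribution_step(p, losses)
print(p.round(3))  # ~[0.027, 0.899, 0.074]: mass concentrates on the hardest domain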

AAAI Conference 2018 Conference Paper

Extremely Low Bit Neural Network: Squeeze the Last Bit Out With ADMM

  • Cong Leng
  • Zesheng Dou
  • Hao Li
  • Shenghuo Zhu
  • Rong Jin

Although deep learning models are highly effective for various learning tasks, their high computational cost prohibits deployment in scenarios where either memory or computational resources are limited. In this paper, we focus on compressing and accelerating deep models whose network weights are represented with very small numbers of bits, referred to as extremely low-bit neural networks. We model this problem as a discretely constrained optimization problem. Borrowing the idea of the Alternating Direction Method of Multipliers (ADMM), we decouple the continuous parameters from the discrete constraints of the network and cast the original hard problem into several subproblems. We propose to solve these subproblems using extragradient and iterative quantization algorithms, which lead to considerably faster convergence than conventional optimization methods. Extensive experiments on image recognition and object detection verify that the proposed algorithm is more effective than state-of-the-art approaches when it comes to extremely low-bit neural networks.
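ADMM-style low-bit training alternates a continuous update with a projection onto the discrete weight set; for the scaled-binary case that projection has the closed form below (a sketch of one subproblem only, not the full algorithm).

import numpy as np

def project_binary(w):
    """Euclidean projection of weights onto the scaled binary set
    {-a, +a}^n used in extremely low-bit networks: the optimal code is
    q = sign(w) with scale a = mean(|w|)."""
    a = np.abs(w).mean()
    return a * np.sign(w), a

w = np.array([0.3, -1.2, 0.8, -0.1])
wq, scale = project_binary(w)
print(wq, scale)  # [ 0.6 -0.6  0.6 -0.6] 0.6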

NeurIPS Conference 2018 Conference Paper

MacNet: Transferring Knowledge from Machine Comprehension to Sequence-to-Sequence Models

  • Boyuan Pan
  • Yazheng Yang
  • Hao Li
  • Zhou Zhao
  • Yueting Zhuang
  • Deng Cai
  • Xiaofei He

Machine Comprehension (MC) is one of the core problems in natural language processing, requiring both understanding of natural language and knowledge about the world. Rapid progress has been made since the release of several benchmark datasets, and recently state-of-the-art models have even surpassed human performance on the well-known SQuAD evaluation. In this paper, we transfer knowledge learned from machine comprehension to sequence-to-sequence tasks to deepen the understanding of the text. We propose MacNet, a novel supplementary encoder-decoder architecture for the widely used attention-based sequence-to-sequence models. Experiments on neural machine translation (NMT) and abstractive text summarization show that our proposed framework can significantly improve the performance of the baseline models, and our method for abstractive text summarization achieves state-of-the-art results on the Gigaword dataset.

ICRA Conference 2018 Conference Paper

Robust and Precise Vehicle Localization Based on Multi-Sensor Fusion in Diverse City Scenes

  • Guowei Wan
  • Xiaolong Yang
  • Renlan Cai
  • Hao Li
  • Yao Zhou
  • Hao Wang
  • Shiyu Song

We present a robust and precise localization system that achieves centimeter-level accuracy in disparate city scenes. Our system adaptively uses information from complementary sensors such as GNSS, LiDAR, and IMU to achieve high localization accuracy and resilience in challenging scenes, such as urban downtowns, highways, and tunnels. Rather than relying only on LiDAR intensity or 3D geometry, we make innovative use of LiDAR intensity and altitude cues to significantly improve the accuracy and robustness of the localization system. Our GNSS RTK module utilizes the multi-sensor fusion framework and achieves a better ambiguity resolution success rate. An error-state Kalman filter is applied to fuse the localization measurements from different sources with novel uncertainty estimation. We validate the effectiveness of our approaches in detail, achieving 5-10 cm RMS accuracy and outperforming previous state-of-the-art systems. Importantly, our system, while deployed in a large autonomous driving fleet, made our vehicles fully autonomous on crowded city streets despite road construction occurring from time to time. A dataset including more than 60 km of real traffic driving on various urban roads is used to comprehensively test our system.

NeurIPS Conference 2018 Conference Paper

Visualizing the Loss Landscape of Neural Nets

  • Hao Li
  • Zheng Xu
  • Gavin Taylor
  • Christoph Studer
  • Tom Goldstein

Neural network training relies on our ability to find "good" minimizers of highly non-convex loss functions. It is well known that certain network architecture designs (e.g., skip connections) produce loss functions that are easier to train, and that well-chosen training parameters (batch size, learning rate, optimizer) produce minimizers that generalize better. However, the reasons for these differences, and their effect on the underlying loss landscape, are not well understood. In this paper, we explore the structure of neural loss functions, and the effect of loss landscapes on generalization, using a range of visualization methods. First, we introduce a simple "filter normalization" method that helps us visualize loss function curvature and make meaningful side-by-side comparisons between loss functions. Then, using a variety of visualizations, we explore how network architecture affects the loss landscape, and how training parameters affect the shape of minimizers.
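A minimal sketch of the filter normalization step, assuming a PyTorch model: each filter of a random direction is rescaled to match the norm of the corresponding model filter. The plotting machinery is omitted.

import torch

def filter_normalized_direction(model):
    """Draw a random direction in parameter space and rescale it
    filter-wise so each filter of the direction has the same norm as
    the corresponding filter of the model ('filter normalization')."""
    direction = []
    for p in model.parameters():
        d = torch.randn_like(p)
        if p.dim() > 1:  # conv/linear weights: normalize per output filter
            d_flat = d.view(d.size(0), -1)
            p_flat = p.view(p.size(0), -1)
            d_flat.mul_(p_flat.norm(dim=1, keepdim=True)
                        / (d_flat.norm(dim=1, keepdim=True) + 1e-10))
        direction.append(d)
    return direction

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU(),
                            torch.nn.Conv2d(8, 4, 3))
d = filter_normalized_direction(model)
# The 1-D loss curve is then traced as loss(theta + t * d) over t.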

AAAI Conference 2017 Conference Paper

Learning Latent Sentiment Scopes for Entity-Level Sentiment Analysis

  • Hao Li
  • Wei Lu

In this paper, we focus on the task of extracting named entities together with their associated sentiment information in a joint manner. Our key observation in this entity-level sentiment analysis (a.k.a. targeted sentiment analysis) task is that each named entity is embedded within a sentiment scope that largely decides the sentiment information associated with the entity. However, such sentiment scopes are typically not explicitly annotated in the data, and their lengths can be unbounded. Motivated by this, and unlike traditional approaches that cast this problem as a simple sequence labeling task, we propose a novel approach that explicitly models the latent sentiment scopes. Our experiments on the standard datasets demonstrate that our approach achieves better results than existing approaches based on conventional conditional random fields (CRFs) and a more recent approach based on neural networks.

IJCAI Conference 2017 Conference Paper

Self-paced Convolutional Neural Networks

  • Hao Li
  • Maoguo Gong

Convolutional neural networks (CNNs) have achieved breakthrough performance in many pattern recognition tasks. In order to distinguish reliable data from noisy and confusing data, we augment CNNs with self-paced learning (SPL) to enhance their learning robustness. In the proposed self-paced convolutional network (SPCN), each sample is assigned a weight that reflects its easiness. A dynamic self-paced function is then incorporated into the learning objective of the CNN to jointly learn the CNN parameters and the latent weight variable. SPCN learns samples from easy to complex, and the sample weights dynamically control the learning rates so that training converges to better values. To gain more insight into SPCN, theoretical studies are conducted to show that SPCN converges to a stationary solution and is robust to noisy and confusing data. Experimental results on the MNIST and rectangles datasets demonstrate that the proposed method outperforms baseline methods.
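For reference, the classic hard self-paced weighting that SPCN's dynamic self-paced function generalizes looks like this (an illustrative special case, not the paper's exact function):

import numpy as np

def self_paced_weights(losses, lam):
    """Hard self-paced weighting: samples with loss below the pace
    parameter lam are treated as 'easy' (weight 1), others are held
    out (weight 0). Training alternates between fitting the weighted
    samples and recomputing the weights while lam grows."""
    return (losses < lam).astype(float)

losses = np.array([0.1, 0.5, 1.2, 2.0])
for lam in (0.3, 1.0, 3.0):  # pace parameter increases over epochs
    print(lam, self_paced_weights(losses, lam))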

NeurIPS Conference 2017 Conference Paper

Training Quantized Nets: A Deeper Understanding

  • Hao Li
  • Soham De
  • Zheng Xu
  • Christoph Studer
  • Hanan Samet
  • Tom Goldstein

Currently, deep neural networks are deployed on low-power portable devices by first training a full-precision model using powerful hardware, and then deriving a corresponding low-precision model for efficient inference on such systems. However, training models directly with coarsely quantized weights is a key step towards learning on embedded platforms that have limited computing resources, memory capacity, and power consumption. Numerous recent publications have studied methods for training quantized networks, but these studies have mostly been empirical. In this work, we investigate training methods for quantized neural networks from a theoretical viewpoint. We first explore accuracy guarantees for training methods under convexity assumptions. We then look at the behavior of these algorithms for non-convex problems, and show that training algorithms that exploit high-precision representations have an important greedy search phase that purely quantized training methods lack, which explains the difficulty of training using low-precision arithmetic.
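The role of high-precision representations can be seen in a BinaryConnect-style step, sketched below with a toy quadratic objective: small gradients accumulate in the full-precision copy until a quantized weight flips sign, an effect purely quantized training lacks. Names and the objective are illustrative.

import torch

def quantized_sgd_step(w_full, grad_fn, lr=0.1):
    """One BinaryConnect-style step: quantize a high-precision copy for
    the forward/backward pass, but accumulate the update in the
    high-precision weights. Purely quantized training would discard
    w_full and lose the small gradient signals."""
    w_q = torch.sign(w_full)  # quantize for this iteration
    g = grad_fn(w_q)          # gradient evaluated at the quantized point
    return w_full - lr * g    # update kept in full precision

w = torch.tensor([0.05, -0.3, 0.7])
target = torch.tensor([1.0, -1.0, -1.0])
grad = lambda wq: 2 * (wq - target)  # toy objective ||wq - target||^2
for _ in range(3):
    w = quantized_sgd_step(w, grad)
print(w, torch.sign(w))  # the third weight's sign has flipped to match the target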

IJCAI Conference 2017 Conference Paper

What to Do Next: Modeling User Behaviors by Time-LSTM

  • Yu Zhu
  • Hao Li
  • Yikang Liao
  • Beidou Wang
  • Ziyu Guan
  • Haifeng Liu
  • Deng Cai

Recently, Recurrent Neural Network (RNN) solutions for recommender systems (RS) have become increasingly popular. The insight is that there exist intrinsic patterns in the sequence of users' actions, and RNNs have proven to perform excellently when modeling sequential data. In traditional tasks such as language modeling, RNN solutions usually only consider the sequential order of objects, without the notion of intervals. In RS, however, time intervals between users' actions are of significant importance in capturing the relations between users' actions, and traditional RNN architectures are not good at modeling them. In this paper, we propose a new LSTM variant, Time-LSTM, to model users' sequential actions. Time-LSTM equips LSTM with time gates to model time intervals. These time gates are specifically designed so that, compared to traditional RNN solutions, Time-LSTM better captures both users' short-term and long-term interests, improving the recommendation performance. Experimental results on two real-world datasets show the superiority of the recommendation method using Time-LSTM over traditional methods.
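A simplified, hypothetical time-gated cell in PyTorch; the paper proposes several Time-LSTM variants with differently coupled gates, and this sketch shows only the core idea of letting the elapsed interval modulate the cell update.

import torch
import torch.nn as nn

class TimeLSTMCell(nn.Module):
    """Illustrative time-gated LSTM cell (not the paper's exact variants)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        self.time_gate = nn.Linear(input_size + 1, hidden_size)

    def forward(self, x, dt, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], -1)).chunk(4, -1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        # Time gate: conditioned on the input and the elapsed interval dt,
        # so recent actions can influence the cell more than stale ones.
        T = torch.sigmoid(self.time_gate(torch.cat([x, dt], -1)))
        c = f * c + i * T * g
        h = o * torch.tanh(c)
        return h, c

cell = TimeLSTMCell(8, 16)
h = c = torch.zeros(4, 16)
h, c = cell(torch.randn(4, 8), torch.rand(4, 1), h, c)
print(h.shape)  # torch.Size([4, 16])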

IJCAI Conference 2016 Conference Paper

Content-Driven Detection of Cyberbullying on the Instagram Social Network

  • Haoti Zhong
  • Hao Li
  • Anna Squicciarini
  • Sarah Rajtmajer
  • Christopher Griffin
  • David Miller
  • Cornelia Caragea

We study detection of cyberbullying in photo-sharing networks, with an eye on developing early warning mechanisms for the prediction of posted images vulnerable to attacks. Given the overwhelming increase in media accompanying text in online social networks, we investigate use of posted images and captions for improved detection of bullying in response to shared content. We validate our approaches on a dataset of over 3000 images along with peer-generated comments posted on the Instagram photo-sharing network, running comprehensive experiments using a variety of classifiers and feature sets. In addition to standard image and text features, we leverage several novel features including topics determined from image captions and a pretrained convolutional neural network on image pixels. We identify the importance of these advanced features in assisting detection of cyberbullying in posted comments. We also provide results on classification of images and captions themselves as potential targets for cyberbullies.

AAAI Conference 2016 Conference Paper

Multi-Objective Self-Paced Learning

  • Hao Li
  • Maoguo Gong
  • Deyu Meng
  • Qiguang Miao

Current self-paced learning (SPL) regimes adopt a greedy strategy, obtaining the solution with a gradually increasing pace parameter, yet it is difficult to determine where to optimally terminate this increasing process. Moreover, most SPL implementations are very sensitive to initialization and lack a theoretical result clarifying where SPL converges as the pace parameter increases. In this paper, we propose a novel multi-objective self-paced learning (MOSPL) method to address these issues. Specifically, we decompose the objective function into two terms, the loss and the self-paced regularizer, and treat the problem as a compromise between these two objectives. This naturally reformulates SPL as a standard multi-objective optimization problem. A multi-objective evolutionary algorithm is used to optimize the two objectives simultaneously, facilitating the rational selection of a proper pace parameter. The proposed technique is able to improve a set of solutions across a range of pace parameters by finely trading these solutions off against one another, making them perform robustly even under bad initialization. A good solution can then be naturally obtained from this set using off-the-shelf tools from multi-objective optimization. Experimental results on matrix factorization and action recognition demonstrate the superiority of the proposed method in addressing the existing issues in current SPL research.

IROS Conference 2014 Conference Paper

A robot system design for low-cost multi-robot manipulation

  • James McLurkin
  • Adam McMullen
  • Nick Robbins
  • Golnaz Habibi
  • Aaron T. Becker
  • Alvin Chou
  • Hao Li
  • Meagan John

Multi-robot manipulation allows for scalable environmental interaction, which is critical for multi-robot systems to have an impact on our world. A successful manipulation model requires cost-effective robots, robust hardware, and proper system feedback and control. This paper details the key sensing and manipulator capabilities of the r-one robot, an advanced, open-source, low-cost platform for multi-robot manipulation and sensing that meets all of these requirements. The parts cost is around $250 per robot. The r-one has a rich sensor suite, including a flexible IR communication/localization/obstacle-detection system, high-precision quadrature encoders, a gyroscope, an accelerometer, an integrated bump sensor, and light sensors. Two years of working with these robots inspired the development of an external manipulator that gives the robots the ability to interact with their environment. This paper presents an overview of the r-one, the r-one manipulator, and basic manipulation experiments that illustrate the efficacy of our design. The advanced design, low cost, and small size can support university research with large populations of robots, as well as multi-robot curricula in computer science, electrical engineering, and mechanical engineering. We conclude with remarks on the future implementation of the manipulators and the expected work to follow.

ICRA Conference 2011 Conference Paper

Design optimization of parallel manipulators with required pose resolution

  • Hao Li
  • Yuru Zhang
  • Jian S. Dai 0001

The performance of a parallel manipulator relies heavily on its position/orientation resolution; without good resolution, it is difficult for the manipulator to achieve high stiffness. How to obtain the required resolution is therefore a basic issue in designing a parallel manipulator, and this paper presents a method to address it. First, a mathematical definition of position/orientation resolution is given. We then show how to compute these resolutions using the Rayleigh quotient, and formulate the design optimization problem. The fundamental concepts of the method are illustrated on a 3-RRR planar parallel manipulator, and the design process for achieving an optimized task space with the required pose resolution is demonstrated in the same example.
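One way to make the Rayleigh-quotient connection concrete, under simplifying assumptions (a square Jacobian and homogeneous units, both my assumptions rather than the paper's setup): the extreme eigenvalues of J^T J bound the pose increment produced by a given joint increment.

import numpy as np

def pose_resolution_bounds(J, dq):
    """Illustrative Rayleigh-quotient bounds: for dx = J dq, the ratio
    ||dx||^2 / ||dq||^2 = (dq^T J^T J dq) / (dq^T dq) lies between the
    extreme eigenvalues of J^T J, so the smallest and largest pose steps
    caused by a joint step of size dq follow from those eigenvalues."""
    eigvals = np.linalg.eigvalsh(J.T @ J)  # ascending order
    return np.sqrt(eigvals[0]) * dq, np.sqrt(eigvals[-1]) * dq

J = np.array([[1.0, 0.2, 0.0],
              [0.1, 0.8, 0.3],
              [0.0, 0.4, 1.1]])  # toy 3-DoF Jacobian
print(pose_resolution_bounds(J, dq=1e-3))  # (lower, upper) pose step bounds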

ICRA Conference 2011 Conference Paper

Virtual prototyping for drive chain optimization in an industrial robot

  • Bojun Ma
  • Hao Li
  • Said Zahrai
  • Hui Zhang

The cost, performance, and energy efficiency of a robot system depend strongly on its drive chain, i.e., the combination of the drive, the motors, and the gears. A model is presented that allows accurate simulation of the drive chain in an industrial robot, offering a high degree of optimization in the design process. Simulation results are compared with final data from the developed units, and excellent agreement is found.

IROS Conference 2003 Conference Paper

Real world implementation of fuzzy anti-swing control for behavior-based intelligent crane system

  • Jiaming Wang
  • Hao Li
  • Fakhri Karray
  • Otman A. Basir

There exist several industrial applications for large crane systems, and most of them experience serious problems with load swing. This paper presents a fuzzy-logic-based control scheme that minimizes load swing for crane systems while maintaining continuous payload transportation. The control system of the crane is built using behavior-based approaches: each module generates behaviors, and the performance of the system is improved by adding new modules. To develop the anti-swing module, a fuzzy logic controller is applied using information extracted from potentiometers; the fuzzy controller provides a mechanism for dealing with imprecise sensor data. The anti-swing behaviors are successfully implemented by formulating a set of fuzzy rules. The performance of the developed system is illustrated by both simulations and experiments, which show that the system remains stable under several operating situations.

IROS Conference 2001 Conference Paper

Real-time planning and control of robots using shunting neural networks

  • Simon X. Yang
  • Xiaobu Yuan
  • Max Q.-H. Meng
  • Guangfeng Yuan
  • Hao Li

In this paper, shunting neural networks are proposed for the dynamic planning and control of robots. The dynamic environment is represented by the neural activity landscape of a topologically organized neural network, where each neuron is characterized by a shunting equation derived from Hodgkin and Huxley's (1952) biological membrane equation. A collision-free path is generated in real time from the activity landscape, without any explicit search procedure and without any prior knowledge of the dynamic environment. The real-time tracking control that lets robots follow the planned dynamic path is designed using the shunting equation as well. The effectiveness and efficiency of the proposed approach are demonstrated through simulation and comparison studies. Simulations in several computer-synthesized virtual environments further demonstrate the advantages of the proposed approach, with encouraging experimental results.
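A toy one-dimensional sketch of the shunting dynamics (parameters and the neighbor-coupling scheme are chosen only for illustration): activity injected at a target spreads and decays across the grid while an obstacle stays locally negative, yielding a landscape a robot can ascend toward the target.

import numpy as np

def shunting_step(x, excite, inhibit, A=10.0, B=1.0, D=1.0, dt=0.01):
    """One Euler step of the shunting equation on a neuron grid:
        dx/dt = -A*x + (B - x)*S_e - (D + x)*S_i
    Activity is automatically bounded within [-D, B]."""
    return x + dt * (-A * x + (B - x) * excite - (D + x) * inhibit)

x = np.zeros(10)
target = np.zeros(10); target[9] = 100.0    # excitation at the goal cell
obstacle = np.zeros(10); obstacle[4] = 100.0  # inhibition at the obstacle
for _ in range(500):
    padded = np.pad(np.maximum(x, 0), 1)   # only positive activity spreads
    neighbor = padded[:-2] + padded[2:]    # excitation from both neighbors
    x = shunting_step(x, target + neighbor, obstacle)
print(x.round(3))  # activity decays away from the target; the obstacle cell is negative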