Arrow Research search

Author name cluster

Haoyu Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

47 papers
2 author rows

Possible papers

47

AAAI Conference 2026 Conference Paper

Brownian Bridge Augmented Surrogate Simulation and Injection Planning for Geological CO2 Storage

  • Haoyue Bai
  • Guodong Chen
  • Wangyang Ying
  • Xinyuan Wang
  • Nanxu Gong
  • Sixun Dong
  • Giulia Pedrielli
  • Haoyu Wang

Geological CO2 storage (GCS) involves injecting captured CO2 into deep subsurface formations to support climate goals. The effective management of GCS relies on adaptive injection planning to dynamically control injection rates and well pressures to balance both storage safety and efficiency. Prior literature, including numerical optimization methods and surrogate-optimization methods, is limited by real-world GCS requirements of smooth state transitions and goal-directed planning within limited time. To address these limitations, we propose a Brownian Bridge–augmented framework for surrogate simulation and injection planning in GCS and develop two insights: (i) the Brownian bridge as a smooth state regularizer for a better surrogate simulator; (ii) the Brownian bridge as goal-time-conditioned planning guidance for better injection planning. Our method has three stages: (i) learning deep Brownian bridge representations with contrastive and reconstructive losses from historical reservoir and utility trajectories, (ii) incorporating Brownian bridge-based next-state interpolation for simulator regularization, and (iii) guiding injection planning with Brownian utility-conditioned trajectories to generate high-quality injection plans. Experimental results across multiple datasets collected from diverse GCS settings demonstrate that our framework consistently improves simulation fidelity and planning effectiveness while maintaining low computational overhead.
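
The bridge interpolation idea can be made concrete with a minimal NumPy sketch (illustrative, not the authors' code): a Brownian bridge pinned to a start state and a goal state has a closed-form conditional mean and variance, which is what makes intermediate states smooth and goal-directed.

```python
import numpy as np

def brownian_bridge(x0, xT, T=1.0, n_steps=10, sigma=1.0, rng=None):
    """Sample a Brownian bridge from state x0 at t=0 to goal xT at t=T.

    Given the previous point x_prev at t_prev, the next point at t is
    Gaussian with mean x_prev + (t - t_prev)/(T - t_prev) * (xT - x_prev)
    and variance sigma^2 * (t - t_prev)(T - t)/(T - t_prev), so the path
    interpolates smoothly toward the goal with calibrated noise.
    """
    rng = np.random.default_rng(rng)
    ts = np.linspace(0.0, T, n_steps + 1)
    path = [np.asarray(x0, dtype=float)]
    for t_prev, t in zip(ts[:-1], ts[1:]):
        x_prev = path[-1]
        mean = x_prev + (t - t_prev) / (T - t_prev) * (np.asarray(xT) - x_prev)
        var = sigma**2 * (t - t_prev) * (T - t) / (T - t_prev)
        path.append(rng.normal(mean, np.sqrt(max(var, 0.0))))
    return np.array(path)
```

At the final step the variance collapses to zero, so the sampled trajectory hits the goal state exactly.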

JBHI Journal 2026 Journal Article

EEG-VLM: A Hierarchical Vision-Language Model With Multi-Level Feature Alignment and Visually Enhanced Language-Guided Reasoning for EEG Image-Based Sleep Stage Prediction

  • Xihe Qiu
  • Gengchen Ma
  • Haoyu Wang
  • Chen Zhan
  • Xiaoyu Tan
  • Shuo Li

Sleep stage classification based on electroencephalography (EEG) is fundamental for assessing sleep quality and diagnosing sleep-related disorders. However, most traditional machine learning methods rely heavily on prior knowledge and handcrafted features, while existing deep learning models still struggle to jointly capture fine-grained time–frequency patterns and achieve clinical interpretability. Recently, vision–language models (VLMs) have made significant progress in the medical domain, yet their performance remains constrained when applied to physiological waveform data, especially EEG signals, due to their limited visual understanding and insufficient reasoning capability. To address these challenges, we propose EEG-VLM, a hierarchical vision–language framework that integrates multi-level feature alignment with visually enhanced language-guided reasoning for interpretable EEG-based sleep stage classification. Specifically, a specialized visual enhancement module constructs high-level visual tokens from intermediate-layer features to extract rich semantic representations of EEG images. These tokens are further aligned with low-level CLIP features through a multi-level alignment mechanism, enhancing the VLM's image-processing capability. In addition, a Chain-of-Thought (CoT) reasoning strategy decomposes complex medical inference into interpretable logical steps, effectively simulating expert-like decision-making. Experimental results demonstrate that the proposed method significantly improves both the accuracy and interpretability of VLMs in EEG-based sleep stage classification, showing promising potential for automated and explainable EEG analysis in clinical settings.

AAAI Conference 2026 Conference Paper

Emotion-Conditioned Motion Sub-spaces with Flow Matching for Real-Time Audio-Driven Talking Heads

  • Haoyu Wang
  • Xiaozhe Xin
  • Xiaoyu Qin
  • Meiguang Jin
  • Junfeng Ma
  • Dan Xu
  • Jia Jia

Recent advances in audio-driven talking-head synthesis have brought lip-sync precision close to human perception, yet emotional fidelity and real-time inference remain open challenges. Existing pipelines typically disentangle lip articulation, facial expression, and head pose in latent space; this rigid factorization ignores the intrinsic coupling between articulation and affect — e.g., downward lip corners when sad — thus limiting expressiveness. We cast speech-conditioned facial motion as a sample from an emotion-conditioned distribution in a motion latent space. Concretely, we (i) learn a motion dictionary of orthogonal bases with an autoencoder via self-supervision, (ii) construct emotion-conditioned sub-spaces within the latent space, and (iii) design a layer-progressive cross-attention fusion module that modulates a flow-matching sampler with both audio and emotion signals. Only ten reverse ODE steps are required to generate a motion-latent trajectory, enabling real-time end-to-end latency. Extensive experiments on MEAD and RAVDESS show that our method outperforms recent GAN- and diffusion-based baselines in emotion accuracy while running at around 75 FPS on a single desktop GPU. The proposed framework delivers the first emotionally expressive Audio2Face system that simultaneously achieves lip-sync accuracy, affective realism, and real-time performance.
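
The ten reverse ODE steps mentioned above amount to few-step numerical integration of a learned velocity field, which is what makes flow-matching samplers fast. A minimal sketch, where `velocity_fn` is a hypothetical stand-in for the trained audio/emotion-conditioned network (not the paper's actual model):

```python
import numpy as np

def flow_matching_sample(velocity_fn, x0, n_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler.

    x0 is a noise sample in the motion latent space; the result is the
    generated motion latent. Ten steps of a cheap ODE solver replace
    the hundreds of denoising steps a vanilla diffusion sampler needs.
    """
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x
```

Higher-order solvers (e.g. midpoint) trade one extra network call per step for lower discretization error, but plain Euler is the usual baseline at this step count.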

AAAI Conference 2026 Conference Paper

JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion

  • Haoyu Wang
  • Lei Zhang
  • Wenrui Liu
  • Dengyang Jiang
  • Wei Wei
  • Chen Ding

Given the inherently costly and time-intensive nature of pixel-level annotation, the generation of synthetic datasets comprising sufficiently diverse synthetic images paired with ground-truth pixel-level annotations has garnered increasing attention recently for training high-performance semantic segmentation models. However, existing methods must either predict pseudo annotations after image generation or generate images conditioned on manual annotation masks, which incurs image-annotation semantic inconsistency or scalability problems, respectively. To mitigate both problems at once, we present a novel dataset generative diffusion framework for semantic segmentation, termed JoDiffusion. Firstly, given a standard latent diffusion model, JoDiffusion incorporates an independent annotation variational auto-encoder (VAE) network to map annotation masks into the latent space shared by images. Then, the diffusion model is tailored to capture the joint distribution of each image and its annotation mask conditioned on a text prompt. By doing so, JoDiffusion enables simultaneously generating paired images and semantically consistent annotation masks solely conditioned on text prompts, thereby demonstrating superior scalability. Additionally, a mask optimization strategy is developed to mitigate the annotation noise produced during generation. Experiments on Pascal VOC, COCO, and ADE20K datasets show that the annotated dataset generated by JoDiffusion yields substantial performance improvements in semantic segmentation compared to existing methods.

JBHI Journal 2026 Journal Article

MedSegAgent: A Universal and Scalable Multi-Agent System for Instructive Medical Image Segmentation

  • Ziyan Huang
  • Haoyu Wang
  • Jin Ye
  • Yuanfeng Ji
  • Xiaowei Hu
  • Lihao Liu
  • Zhikai Yang
  • Wei Li

Medical image segmentation is vital for clinical diagnosis and treatment; however, current solutions face three major limitations: (1) the lack of a universal framework capable of handling diverse modalities and anatomical targets, (2) the limited scalability to adapt to evolving clinical needs and new datasets, and (3) the lack of instructive interfaces that make models usable for non-expert users. To address these challenges, this paper presents MedSegAgent, a universal and scalable multi-agent system for instructive medical image segmentation. Specifically, MedSegAgent comprises five agents: one query parsing agent that processes natural language requests, three coarse-to-fine filtering agents (modality filtering, anatomical filtering, and label selection) for identifying relevant datasets and label values, and one execution agent responsible for model inference and result integration. Based on this framework, MedSegAgent utilizes 23 diverse datasets and pre-trained models to perform 343 types of segmentation across various modalities and anatomical targets. Experimental results demonstrate that MedSegAgent simplifies model selection while maintaining high performance, accurately identifying matching datasets and labels in 94.27% of queries and locating at least one suitable match in 99.03% of queries. MedSegAgent offers a universal and scalable solution for diverse medical image segmentation tasks, bridging the gap between user-friendly queries and the complexities of model selection and deployment. Our code is publicly available at https://github.com/uni-medical/MedSegAgent.

AAAI Conference 2026 Conference Paper

MMIFEvol: Towards Evolutionary Multimodal Instruction Following

  • Haoyu Wang
  • Sihang Jiang
  • Xiangru Zhu
  • Yuyan Chen
  • Xiaojun Meng
  • Jiansheng Wei
  • Yitong Wang
  • Yanghua Xiao

Multimodal Instruction Following serves as a fundamental capability of multimodal language models, involving accurate comprehension and execution of user-provided instructions. However, existing multimodal instruction-following datasets and benchmarks face the shortcomings outlined below: (a) Lack of Difficulty Stratification: they collect diverse instruction categories but neglect the stratification of difficulty levels across these categories, which leads to overlap, bias, and low interpretability. (b) Lack of Fine-Grained Metrics: they conflate the model's ability to "solve tasks" and "follow constraints" into a single metric, which fails to accurately reflect its instruction-following capability. (c) Lack of Multi-Task Instructions: they overlook the fact that real-world user instructions often consist of multiple combined tasks. This paper proposes MMIFEvol, a framework for multimodal instruction evolving and benchmarking. First, we define the essential components of a carefully curated multimodal instruction set and establish corresponding difficulty levels, based on which we synthesize diverse instruction data. Next, we decouple the evaluation criteria for instruction following into three different metrics to construct a high-quality benchmark and assess existing models. Experimental results demonstrate that current models still struggle with following complex instructions, while fine-tuning using MMIFEvol data effectively improves models' responsiveness to multimodal instructions.

AAAI Conference 2026 Conference Paper

OmniBench: A Comprehensive Benchmark Integrating Real-World, Time-sensitive, and Multi-Hop Questions with a Multi-Dimensional Hybrid Evaluation Framework

  • Wenjie Wang
  • Yufeng Jiang
  • Ge Sun
  • Chenghang Dong
  • Zheng Jun
  • Li Mengjie
  • Lixin Chen
  • Huan Wang

Recently, with the increasing capabilities of Large Language Models (LLMs), AI applications have gradually emerged to solve various problems in people's daily lives, so accurately measuring their performance and reliability is paramount. However, existing benchmarks predominantly rely on closed-ended, multiple-choice or short-answer question formats. While useful for assessment, these formats exhibit a significant gap compared to the diverse and open-ended nature of questions posed by real-world users. To bridge this gap, we present OmniBench, a comprehensive open-domain benchmark. OmniBench is uniquely composed of authentic, user-generated questions harvested from real-world interactions on various websites and applications, covering 16 rigorously defined knowledge domains and 5 crucial user intents derived from a large-scale analysis of a massive corpus. Crucially, we propose three automated data construction pipelines that enable the continuous and periodic updating of the benchmark dataset. This approach not only ensures that the questions can keep up with current events, but also effectively mitigates the critical issue of data contamination prevalent in static benchmarks. Moreover, a multi-dimensional hybrid evaluation framework named OmniEval is proposed for evaluating the responses. This framework combines diverse metrics and evaluation methods to capture nuanced aspects of answer performance. Extensive validation demonstrates that this evaluation framework exhibits strong alignment with human judgments, ensuring the reliability of the benchmark results.

JBHI Journal 2026 Journal Article

Radar HRV Monitoring With Physiological Prior Inspired Deep Neural Networks

  • Haoyu Wang
  • Jinbo Chen
  • Dongheng Zhang
  • Zhi Lu
  • Yang Hu
  • Qibin Sun
  • Yan Chen

Radar sensing has emerged as a promising solution for the contactless monitoring of Heart Rate Variability (HRV), a crucial indicator of the cardiovascular and autonomic nervous systems. However, due to signal noise and interference that easily obscure heartbeat details, along with variations in heartbeat across different physiological conditions, existing methods remain restricted to laboratory settings with healthy subjects and fail in real-world scenarios involving more complex physiological conditions. In this study, we propose a physiological prior-inspired deep learning framework for robust radar-based HRV monitoring. Specifically, we leverage the prior that internal heartbeats drive movements across the entire torso surface and design a hybrid deep neural network to model the spatio-temporal relationship between full-body radio reflections and heartbeats, effectively mitigating interference. Then, we incorporate the cardiac motion's self-similarity prior to establish a signal augmentation strategy, effectively remodeling the HRV distribution and enhancing performance across diverse physiological conditions. We build and validate our method on a large-scale dataset comprising 7,150 outpatients with complex physiological conditions in real-world scenarios. The experimental results demonstrate that our method achieves a mean IBI error of 19.21 ms, an RMSSD error of 16.23 ms, an SDSD error of 16.70 ms, and a pNN50 error of 7.28%. We further validate the performance by classifying five common cardiac conditions based on HRV results, demonstrating performance comparable to ECG-based methods. These results highlight the great potential of our approach for accurate, contactless HRV monitoring in real-world applications.
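
The HRV metrics reported above have standard time-domain definitions; the following sketch computes them from an inter-beat-interval (IBI) series (textbook definitions, not the paper's implementation):

```python
import numpy as np

def hrv_metrics(ibi_ms):
    """Standard time-domain HRV metrics from inter-beat intervals in ms.

    RMSSD: root mean square of successive IBI differences.
    SDSD:  standard deviation of successive IBI differences.
    pNN50: percentage of successive differences exceeding 50 ms.
    """
    ibi = np.asarray(ibi_ms, dtype=float)
    d = np.diff(ibi)  # successive differences between adjacent beats
    rmssd = float(np.sqrt(np.mean(d**2)))
    sdsd = float(np.std(d, ddof=1))
    pnn50 = float(np.mean(np.abs(d) > 50.0) * 100.0)
    return rmssd, sdsd, pnn50
```

Because each metric is a nonlinear function of the IBI sequence, small per-beat IBI errors can compound, which is why the per-metric errors above are reported separately from the mean IBI error.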

AAAI Conference 2026 Conference Paper

Scaling Law for Large Wireless Models

  • Ziheng Liu
  • Jiayi Zhang
  • Haoyu Wang
  • Bokai Xu
  • Chen Zhang
  • Yiyang Zhu
  • Enyu Shi

Emerging from recent advances in foundation models, Large Wireless Models (LWMs) represent a new paradigm of general-purpose intelligence for wireless communications that transcends task-specific engineering. The success of foundation models is critically underpinned by scaling laws, which provide a predictable roadmap for how performance scales with resources. However, established scaling laws from language and vision, charting performance as a power-law of model and dataset sizes, are ill-suited for the wireless domain, as their core formulations cannot model the structured nature of the physical channel. To address this, we propose a novel wireless scaling law that extends the classical formulation by modeling two wireless-native factors: channel heterogeneity and discretization granularity. These two factors reshape scaling behavior via nested linear and power-law relationships, recasting the scaling law's parameters (notably the scaling exponent and irreducible loss) from universal constants into dynamic variables dictated by the physical environment. Our physics-aware formulation reveals two key insights: first, that compute-optimal scaling is not dictated by a fixed model-data ratio but is instead a dynamic function of heterogeneity and granularity, and second, that this dependency is particularly sensitive to granularity, allowing significant performance to be unlocked from existing data simply by refining its resolution. Crucially, this establishes a reliable roadmap for designing powerful yet resource-efficient LWMs, translating theoretical insights into actionable engineering principles. Extensive experiments validate our wireless scaling law, showing a 32.31% prediction accuracy improvement over classical laws in diverse wireless scenarios where they fail.
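
The classical formulation the abstract extends is commonly written (in the Chinchilla style; the exact form used in the paper is not given in the abstract) as

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
```

where $N$ is the number of model parameters, $D$ the dataset size, $E$ the irreducible loss, and $\alpha, \beta$ the scaling exponents. On this reading, the proposed wireless scaling law promotes $\alpha$ and $E$ from universal constants to functions of channel heterogeneity and discretization granularity via nested linear and power-law relationships.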

AAAI Conference 2026 Conference Paper

SEED: Spectral Entropy-Guided Evaluation of Spatial-Temporal Dependencies for Multivariate Time Series Forecasting

  • Feng Xiong
  • Zongxia Xie
  • Yanru Sun
  • Haoyu Wang
  • Jianhong Lin

Effective multivariate time series forecasting often benefits from accurately modeling complex inter-variable dependencies. However, existing attention- or graph-based methods face three key issues: (a) strong temporal self-dependencies are often disrupted by irrelevant variables; (b) softmax normalization ignores and reverses negative correlations; (c) variables struggle to perceive their temporal positions. To address these, we propose SEED, a Spectral Entropy-guided evaluation framework for spatial-temporal dependency modeling. SEED introduces a Dependency Evaluator, a key innovation that leverages spectral entropy to dynamically provide a preliminary evaluation of the spatial and temporal dependencies of each variable, enabling the model to adaptively balance Channel Independence (CI) and Channel Dependence (CD) strategies. To account for temporal regularities originating from the influence of other variables rather than intrinsic dynamics, we propose a Spectral Entropy-based Fuser to further refine the evaluated dependency weights, effectively separating this part. Moreover, to preserve negative correlations, we introduce a Signed Graph Constructor that enables signed edge weights, overcoming the limitations of softmax. Finally, to help variables perceive their temporal positions and thereby construct more comprehensive spatial features, we introduce the Context Spatial Extractor, which leverages local contextual windows to extract spatial features. Extensive experiments on 12 real-world datasets from various application domains demonstrate that SEED achieves state-of-the-art performance, validating its effectiveness and generality.
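
Spectral entropy, the quantity SEED uses to score each variable's temporal regularity, has a standard definition: the Shannon entropy of the normalized power spectrum. A minimal NumPy sketch (illustrative, not the paper's implementation):

```python
import numpy as np

def spectral_entropy(x):
    """Normalized spectral entropy of a 1-D series, in [0, 1].

    Low values mean energy concentrated in a few frequencies (strong
    regular temporal structure, favoring channel independence); values
    near 1 mean a near-white spectrum (weak intrinsic dynamics).
    """
    x = np.asarray(x, dtype=float)
    psd = np.abs(np.fft.rfft(x - x.mean())) ** 2  # one-sided power spectrum
    p = psd / psd.sum()                            # spectral distribution
    p = p[p > 0]                                   # drop empty bins
    return float(-(p * np.log(p)).sum() / np.log(len(psd)))
```

A pure sinusoid scores near 0 while white noise scores near 1, which is the contrast a dependency evaluator can exploit.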

NeurIPS Conference 2025 Conference Paper

AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play

  • Ran Xu
  • Yuchen Zhuang
  • Zihan Dong
  • Ruiyu Wang
  • Yue Yu
  • Joyce Ho
  • Linjun Zhang
  • Haoyu Wang

Search-augmented LLMs often struggle with complex reasoning tasks due to ineffective multi-hop retrieval and limited reasoning ability. We propose AceSearcher, a cooperative self-play framework that trains a single large language model (LLM) to alternate between two roles: a decomposer that breaks down complex queries and a solver that integrates retrieved contexts for answer generation. AceSearcher couples supervised fine-tuning on a diverse mixture of search, reasoning, and decomposition tasks with reinforcement fine-tuning optimized for final answer accuracy, eliminating the need for intermediate annotations. Extensive experiments on three reasoning-intensive tasks across 10 datasets show that AceSearcher outperforms state-of-the-art baselines, achieving an average exact match improvement of 7.6%. Remarkably, on document-level finance reasoning tasks, AceSearcher-32B matches the performance of the giant DeepSeek-V3 model using less than 5% of its parameters. Even at smaller scales (1.5B and 8B), AceSearcher often surpasses existing search-augmented LLMs with up to 9× more parameters, highlighting its exceptional efficiency and effectiveness in tackling complex reasoning tasks.

TMLR Journal 2025 Journal Article

Are Large Language Models Really Robust to Word-Level Perturbations?

  • Haoyu Wang
  • Guozheng Ma
  • Cong Yu
  • Ning Gui
  • Linrui Zhang
  • Zhiqi Huang
  • Suwei Ma
  • Yongzhe Chang

The swift advancement in the scales and capabilities of Large Language Models (LLMs) positions them as promising tools for a variety of downstream tasks. In addition to the pursuit of better performance and the avoidance of violent feedback on a certain prompt, to ensure the responsibility of LLMs, much attention is drawn to their robustness. However, existing evaluation methods mostly rely on traditional question answering datasets with predefined supervised labels, potentially ignoring the superior generation capabilities of contemporary LLMs. To investigate the robustness of LLMs while using their generation ability, we propose a novel rational evaluation pipeline that leverages reward models as diagnostic tools to evaluate the long conversations generated from more challenging open questions by LLMs, which we refer to as the Reward Model for Reasonable Robustness Evaluation (TREvaL). Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions, a capability not entirely encompassed by individual words or letters. Our extensive empirical experiments demonstrate that TREvaL identifies the lack of robustness of today's LLMs. Notably, we are surprised to discover that robustness tends to decrease as fine-tuning (SFT and RLHF) is conducted, calling for more attention to robustness during the alignment process.

NeurIPS Conference 2025 Conference Paper

DCI: Dual-Conditional Inversion for Boosting Diffusion-Based Image Editing

  • Zixiang Li
  • Haoyu Wang
  • Wei Wang
  • Chuangchuang Tan
  • Yunchao Wei
  • Yao Zhao

Diffusion models have achieved remarkable success in image generation and editing tasks. Inversion within these models aims to recover the latent noise representation for a real or generated image, enabling reconstruction, editing, and other downstream tasks. However, to date, most inversion approaches suffer from an intrinsic trade-off between reconstruction accuracy and editing flexibility. This limitation arises from the difficulty of maintaining both semantic alignment and structural consistency during the inversion process. In this work, we introduce Dual-Conditional Inversion (DCI), a novel framework that jointly conditions on the source prompt and reference image to guide the inversion process. Specifically, DCI formulates the inversion process as a dual-condition fixed-point optimization problem, minimizing both the latent noise gap and the reconstruction error under the joint guidance. This design anchors the inversion trajectory in both semantic and visual space, leading to more accurate and editable latent representations. Our novel setup brings new understanding to the inversion process. Extensive experiments demonstrate that DCI achieves state-of-the-art performance across multiple editing tasks, significantly improving both reconstruction quality and editing precision. Furthermore, we also demonstrate that our method achieves strong results in reconstruction tasks, implying a degree of robustness and generalizability approaching the ultimate goal of the inversion process. Our codes are available at: https://github.com/Lzxhh/Dual-Conditional-Inversion

IROS Conference 2025 Conference Paper

Eagle-Scale Flapping-Wing Robot with Aggressive Roll Maneuverability: Bio-Inspired Actuation, Fluid-Structure Interaction Simulation and Flight Experiment

  • Haoyu Wang
  • Zhenkun Gong
  • Erzhen Pan
  • Wenfu Xu

Large flapping-wing aerial vehicles (FWAVs) face dual challenges in aerodynamic and structural design, with long-standing technical bottlenecks, particularly in roll maneuvers. In this study, by reverse-engineering the biomechanical mechanisms of raptor flight, we propose a bio-inspired wing-shoulder torsional mechanism and successfully developed an eagle-inspired flapping-wing aerial vehicle with a wingspan of 1.87 m and a takeoff weight of 1,260 g. A nonlinear explicit dynamics-lattice Boltzmann fluid-structure interaction (FSI) numerical model was innovatively established, comprehensively revealing the interaction mechanism between unsteady flapping flow fields and flexible wing deformations. Numerical simulations demonstrate that at a cruising speed of 8 m/s, the proposed mechanism generates a high-purity roll torque of 3.3 N·m (with a residual yaw torque of 0.2 N·m, torque purity ratio 16.5:1), while lift and thrust losses are below 1.5%. Flight experiments validate the exceptional performance of this mechanism in 3D maneuvers: a 360° barrel roll is completed in 2.6 seconds (average roll rate 136°/s). This study provides a theoretical framework and technological prototype for next-generation bio-inspired aerial vehicles that integrate efficient cruising with high maneuverability, marking the first instance where FWAVs surpass traditional aircraft in specific 3D maneuverability metrics.

ICLR Conference 2025 Conference Paper

Emergent Orientation Maps – Mechanisms, Coding Efficiency and Robustness

  • Haixin Zhong
  • Haoyu Wang
  • Wei P. Dai
  • Yuchao Huang
  • Mingyi Huang
  • Rubin Wang
  • Anna Wang Roe
  • Yuguo Yu

Extensive experimental studies have shown that in lower mammals, neuronal orientation preference in the primary visual cortex is organized in disordered "salt-and-pepper" organizations. In contrast, higher-order mammals display a continuous variation in orientation preference, forming pinwheel-like structures. Despite these observations, the spiking mechanisms underlying the emergence of these distinct topological structures and their functional roles in visual processing remain poorly understood. To address this, we developed a self-evolving spiking neural network model with Hebbian plasticity, trained using physiological parameters characteristic of rodents, cats, and primates, including retinotopy, neuronal morphology, and connectivity patterns. Our results identify critical factors, such as the degree of input visual field overlap, neuronal connection range, and the balance between localized connectivity and long-range competition, that determine the emergence of either salt-and-pepper or pinwheel-like topologies. Furthermore, we demonstrate that pinwheel structures exhibit lower wiring costs and enhanced sparse coding capabilities compared to salt-and-pepper organizations. They also maintain greater coding robustness against noise in naturalistic visual stimuli. These findings suggest that such topological structures confer significant computational advantages in visual processing and highlight their potential application in the design of brain-inspired deep learning networks and algorithms.

IJCAI Conference 2025 Conference Paper

Evaluating and Mitigating Linguistic Discrimination in Large Language Models: Perspectives on Safety Equity and Knowledge Equity

  • Guoliang Dong
  • Haoyu Wang
  • Jun Sun
  • Xinyu Wang

Large language models (LLMs) typically provide multilingual support and demonstrate remarkable capabilities in solving tasks described in different languages. However, LLMs can exhibit linguistic discrimination due to the uneven distribution of training data across languages. That is, LLMs struggle to maintain consistency when handling the same task in different languages, compromising both safety equity and knowledge equity. In this paper, we first systematically evaluate the linguistic discrimination of LLMs from two aspects: safety and quality, using a form of metamorphic testing. The metamorphic relationship we examine is that LLMs are expected to deliver outputs with similar semantics when prompted with inputs that have the same meaning. We conduct this evaluation with two datasets based on four representative LLMs. The results show that LLMs exhibit stronger human alignment capabilities with queries in English, French, Russian, and Spanish compared to queries in Bengali, Georgian, Nepali and Maithili. Moreover, for queries in English, Danish, Czech and Slovenian, LLMs tend to produce responses with a higher quality compared to the other languages. Based on these findings, we propose LDFighter, a similarity-based voting method, to mitigate the linguistic discrimination in LLMs. We comprehensively evaluate LDFighter against a spectrum of queries including benign, harmful, and adversarial prompts. The results show that LDFighter significantly reduces jailbreak success rates and improves response quality. All code, data, and the technical appendix are publicly available at: https://github.com/dgl-prc/ldfighter.
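
The similarity-based voting idea can be sketched in a few lines: gather candidate responses (e.g. from translated variants of the same query) and return the one most consistent with the rest, so an outlier such as a jailbroken or low-quality answer loses the vote. The `jaccard` similarity below is a toy lexical stand-in for illustration; a real system would use semantic embeddings, and none of these names come from the LDFighter codebase.

```python
def jaccard(a, b):
    """Toy word-overlap similarity in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def similarity_vote(responses, sim=jaccard):
    """Return the response with the highest total similarity to the others."""
    scores = [sum(sim(r, other) for other in responses if other is not r)
              for r in responses]
    return responses[scores.index(max(scores))]
```

With three candidates where two paraphrase the same answer and one refuses, the paraphrased answer wins the vote.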

IROS Conference 2025 Conference Paper

Exoskeleton Gait Adaptation Framework via Hm-DMP and PI² Optimization for Dynamic Patient Mobility Matching

  • Qiaohuan Cao
  • Dewei Liu
  • Hamza Azam
  • Haoyu Wang
  • Wenzhu Xu
  • Jiongjie Fang
  • Wei Yang 0045

Repetitive gait training with lower-limb exoskeletons enhances neuroplasticity and reduces muscle atrophy by promoting patient engagement in active rehabilitation training. Importantly, the therapeutic efficacy of such engagement critically depends on providing patients with task difficulty levels matching their real-time walking capacities. To address this, a closed-loop Mobility-Matching Framework is proposed, integrating Hybrid Multi-attractor Dynamic Movement Primitives (Hm-DMP) with Policy Improvement with Path Integrals (PI²) optimization, which achieves real-time trajectory adaptation. The Hm-DMP module preserves critical kinematic invariants of normative gait patterns during trajectory deformation through constrained multi-attractor modulation. Simultaneously, the PI²-driven optimizer iteratively adjusts joint trajectory keypoints of Hm-DMP by optimizing a hybrid cost function, enabling dynamic matching between training trajectories and patients' real-time mobility. Experimental trials on the WEI-EXO platform demonstrate the proposed framework's robustness in detecting and responding to real-time changes in patients' ambulatory capacity by optimizing assistance trajectories while preserving normative gait kinematics. This closed-loop adaptation process facilitates personalized gait rehabilitation with exoskeletons, enhancing training efficacy and maintaining comfort across patients with diverse mobility levels.

NeurIPS Conference 2025 Conference Paper

GeoLink: Empowering Remote Sensing Foundation Model with OpenStreetMap Data

  • Lubin Bai
  • Xiuyuan Zhang
  • Siqi Zhang
  • Zepeng Zhang
  • Haoyu Wang
  • Wei Qin
  • Shihong Du

Integrating ground-level geospatial data with rich geographic context, like OpenStreetMap (OSM), into remote sensing (RS) foundation models (FMs) is essential for advancing geospatial intelligence and supporting a broad spectrum of tasks. However, the modality gap between RS and OSM data, including differences in data structure, content, and spatial granularity, makes effective synergy highly challenging, and most existing RS FMs focus on imagery alone. To this end, this study presents GeoLink, a multimodal framework that leverages OSM data to enhance the RS FM during both the pretraining and downstream task stages. Specifically, GeoLink enhances RS self-supervised pretraining using multi-granularity learning signals derived from OSM data, guided by cross-modal spatial correlations for information interaction and collaboration. It also introduces image mask-reconstruction to enable sparse input for efficient pretraining. For downstream tasks, GeoLink generates both unimodal and multimodal fine-grained encodings to support a wide range of applications, from common RS interpretation tasks like land cover classification to more comprehensive geographic tasks like urban function zone mapping. Extensive experiments show that incorporating OSM data during pretraining enhances the performance of the RS image encoder, while fusing RS and OSM data in downstream tasks improves the FM's adaptability to complex geographic scenarios. These results underscore the potential of multimodal synergy in advancing high-level geospatial artificial intelligence. Moreover, we find that spatial correlation plays a crucial role in enabling effective multimodal geospatial data integration. Code, checkpoints, and usage examples are released at GitHub.

NeurIPS Conference 2025 Conference Paper

Graph-KV: Breaking Sequence via Injecting Structural Biases into Large Language Models

  • Haoyu Wang
  • Peihao Wang
  • Mufei Li
  • Shikun Liu
  • Siqi Miao
  • Zhangyang "Atlas" Wang
  • Pan Li

Modern large language models (LLMs) are inherently auto-regressive, requiring input to be serialized into flat sequences regardless of their structural dependencies. This serialization hinders the model’s ability to leverage structural inductive biases, especially in tasks such as retrieval-augmented generation (RAG) and reasoning on data with native graph structures, where inter-segment dependencies are crucial. We introduce Graph-KV with the potential to overcome this limitation. Graph-KV leverages the KV-cache of text segments as condensed representations and governs their interaction through structural inductive biases. In this framework, ''target'' segments selectively attend only to the KV-caches of their designated ''source'' segments, rather than all preceding segments in a serialized sequence. This approach induces a graph-structured block mask, sparsifying attention and enabling a message-passing-like step within the LLM. Furthermore, strategically allocated positional encodings for source and target segments reduce positional bias and context window consumption. We evaluate Graph-KV across three scenarios: (1) seven RAG benchmarks spanning direct inference, multi-hop reasoning, and long-document understanding; (2) Arxiv-QA, a novel academic paper QA task with full-text scientific papers structured as citation ego-graphs; and (3) paper topic classification within a citation network. By effectively reducing positional bias and harnessing structural inductive biases, Graph-KV substantially outperforms baselines, including standard costly sequential encoding, across various settings.
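The graph-structured block mask described in this abstract can be pictured with a small sketch (not the authors' implementation; the segment-length and edge representations, and the omission of within-segment causal masking, are simplifying assumptions for illustration):

```python
import numpy as np

def graph_block_mask(seg_lens, edges):
    """Illustrative block attention mask: token i may attend to token j
    only if j lies in the same segment or in a designated source segment.

    seg_lens: token count per text segment
    edges: (source_segment, target_segment) pairs
    """
    starts = np.cumsum([0] + list(seg_lens))  # segment token boundaries
    n = int(starts[-1])
    mask = np.zeros((n, n), dtype=bool)
    allowed = {t: {t} for t in range(len(seg_lens))}  # each segment sees itself
    for s, t in edges:
        allowed[t].add(s)                             # target attends to its sources
    for t, sources in allowed.items():
        for s in sources:
            mask[starts[t]:starts[t + 1], starts[s]:starts[s + 1]] = True
    return mask

# Two source segments (0, 1) feeding one target segment (2):
m = graph_block_mask([2, 2, 3], edges=[(0, 2), (1, 2)])
```

Applying such a mask inside attention sparsifies it along the graph, which is the message-passing-like behavior the abstract describes.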

NeurIPS Conference 2025 Conference Paper

Lifelong Safety Alignment for Language Models

  • Haoyu Wang
  • Yifei Zhao
  • Zeyu Qin
  • Chao Du
  • Min Lin
  • Xueqian Wang
  • Tianyu Pang

LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. While many existing defenses focus on known types of attacks, it is more critical to prepare LLMs for unseen attacks that may arise during deployment. To address this, we propose a lifelong safety alignment framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. Our framework introduces a competitive setup between two components: a Meta-Attacker, trained to actively discover novel jailbreaking strategies, and a Defender, trained to resist them. To effectively warm up the Meta-Attacker, we first leverage the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. Through iterative training, the first-iteration Meta-Attacker achieves a 73% attack success rate (ASR) on RR and a 57% transfer ASR on LAT using only single-turn attacks. Meanwhile, the Defender progressively improves its robustness and ultimately reduces the Meta-Attacker's success rate to just 7%, enabling safer and more reliable deployment of LLMs in open-ended environments.

AAAI Conference 2025 Conference Paper

MV-VTON: Multi-View Virtual Try-On with Diffusion Models

  • Haoyu Wang
  • Zhilu Zhang
  • Donglin Di
  • Shiliang Zhang
  • Wangmeng Zuo

The goal of image-based virtual try-on is to generate an image of the target person naturally wearing the given clothing. However, existing methods focus solely on frontal try-on using frontal clothing. When the views of the clothing and person are significantly inconsistent, particularly when the person's view is non-frontal, the results are unsatisfactory. To address this challenge, we introduce Multi-View Virtual Try-ON (MV-VTON), which aims to reconstruct the dressing results from multiple views using the given clothes. Given that single-view clothes provide insufficient information for MV-VTON, we instead employ two images, i.e., the frontal and back views of the clothing, to encompass the complete view as much as possible. Moreover, we adopt diffusion models, which have demonstrated superior abilities, to perform our MV-VTON. In particular, we propose a view-adaptive selection method where hard-selection and soft-selection are applied to global and local clothing feature extraction, respectively. This ensures that the clothing features roughly fit the person's view. Subsequently, we suggest joint attention blocks to align and fuse clothing features with person features. Additionally, we collect an MV-VTON dataset, MVG, in which each person has multiple photos with diverse views and poses. Experiments show that the proposed method not only achieves state-of-the-art results on the MV-VTON task using our MVG dataset, but also shows superiority on the frontal-view virtual try-on task using the VITON-HD and DressCode datasets.

NeurIPS Conference 2025 Conference Paper

PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation

  • Wei Zhou
  • Guoliang Li
  • Haoyu Wang
  • Yuxing Han
  • Xufei Wu
  • Fan Wu
  • Xuanhe Zhou

Large language models (LLMs) have shown increasing effectiveness in Text-to-SQL tasks. However, another closely related problem, Cross-System SQL Translation (a.k.a. SQL-to-SQL), which adapts a query written for one database system (e.g., MySQL) into its equivalent for another system (e.g., ClickHouse), is of great practical importance but remains underexplored. Existing SQL benchmarks are not well-suited for SQL-to-SQL evaluation, as they (1) focus on a limited set of database systems (often just SQLite) and (2) cannot capture many system-specific SQL dialects (e.g., customized functions, data types, and syntax rules). Thus, in this paper, we introduce PARROT, a Practical And Realistic BenchmaRk for CrOss-System SQL Translation. PARROT comprises 598 translation pairs from 38 open-source benchmarks and real-world business services, specifically prepared to challenge system-specific SQL understanding (e.g., LLMs achieve lower than 38.53% accuracy on average). We also provide multiple benchmark variants, including PARROT-Diverse with 28,003 translations (for extensive syntax testing) and PARROT-Simple with 5,306 representative samples (for focused stress testing), covering 22 production-grade database systems. To promote future research, we release a public leaderboard and source code at https://code4db.github.io/parrot-bench/.

ICLR Conference 2025 Conference Paper

Perplexity Trap: PLM-Based Retrievers Overrate Low Perplexity Documents

  • Haoyu Wang
  • Sunhao Dai
  • Haiyuan Zhao
  • Liang Pang 0001
  • Xiao Zhang 0034
  • Gang Wang 0056
  • Zhenhua Dong
  • Jun Xu 0001

Previous studies have found that PLM-based retrieval models exhibit a preference for LLM-generated content, assigning higher relevance scores to these documents even when their semantic quality is comparable to human-written ones. This phenomenon, known as source bias, threatens the sustainable development of the information access ecosystem. However, the underlying causes of source bias remain unexplored. In this paper, we explain the process of information retrieval with a causal graph and discover that PLM-based retrievers learn perplexity features for relevance estimation, causing source bias by ranking the documents with low perplexity higher. Theoretical analysis further reveals that the phenomenon stems from the positive correlation between the gradients of the loss functions in language modeling task and retrieval task. Based on the analysis, a causal-inspired inference-time debiasing method is proposed, called Causal Diagnosis and Correction (CDC). CDC first diagnoses the bias effect of the perplexity and then separates the bias effect from the overall estimated relevance score. Experimental results across three domains demonstrate the superior debiasing effectiveness of CDC, emphasizing the validity of our proposed explanatory framework. Source codes are available at https://github.com/WhyDwelledOnAi/Perplexity-Trap.
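The diagnose-then-subtract pattern described above can be pictured with a toy sketch. The linear perplexity-to-score fit and the example numbers below are illustrative assumptions, not the paper's actual estimator:

```python
import numpy as np

def cdc_style_correct(scores, perplexities):
    """Illustrative inference-time correction in the spirit of CDC:
    estimate how strongly perplexity predicts the relevance score
    (the diagnosed bias effect), then subtract that component.
    A linear fit stands in for the paper's causal estimator."""
    ppl = np.asarray(perplexities, dtype=float)
    s = np.asarray(scores, dtype=float)
    slope, _ = np.polyfit(ppl, s, 1)   # diagnose: perplexity -> score effect
    return s - slope * ppl             # correct: remove the bias component

# Toy setup where low-perplexity (LLM-like) docs get inflated scores:
scores = [0.9, 0.85, 0.6, 0.55]
ppl = [10.0, 12.0, 40.0, 45.0]
corrected = cdc_style_correct(scores, ppl)
```

After correction, the score gap driven purely by perplexity shrinks, which is the intended debiasing behavior.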

IJCAI Conference 2025 Conference Paper

Prompt-Free Conditional Diffusion for Multi-object Image Augmentation

  • Haoyu Wang
  • Lei Zhang
  • Wei Wei
  • Chen Ding
  • Yanning Zhang

Diffusion models have underpinned many recent advances in dataset augmentation for various computer vision tasks. However, when generating multi-object images as in real scenarios, most existing methods either rely entirely on the text condition, resulting in a deviation between the generated objects and the original data, or rely too much on the original images, resulting in a lack of diversity in the generated images, which is of limited help to downstream tasks. To mitigate both problems at once, we propose a prompt-free conditional diffusion framework for multi-object image augmentation. Specifically, we introduce a local-global semantic fusion strategy to extract semantics from images to replace text, and inject knowledge into the diffusion model through LoRA to alleviate the category deviation between the original model and the target dataset. In addition, we design a reward-model-based counting loss to assist the traditional reconstruction loss during model training. By constraining the object counts of each category instead of imposing pixel-by-pixel constraints, it bridges the quantity deviation between the generated data and the original data while improving the diversity of the generated data. Experimental results demonstrate the superiority of the proposed method over several representative state-of-the-art baselines and showcase strong downstream task gains and out-of-domain generalization capabilities. Code is available at https://github.com/00why00/PFCD.

ICLR Conference 2025 Conference Paper

Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model

  • Jiarui Jin
  • Haoyu Wang
  • Hongyan Li 0002
  • Jun Li
  • Jiahui Pan
  • Shenda Hong

Electrocardiogram (ECG) is essential for the clinical diagnosis of arrhythmias and other heart diseases, but deep learning methods based on ECG often face limitations due to the need for high-quality annotations. Although previous ECG self-supervised learning (eSSL) methods have made significant progress in representation learning from unannotated ECG data, they typically treat ECG signals as ordinary time-series data, segmenting the signals using fixed-size and fixed-step time windows, which often ignore the form and rhythm characteristics and latent semantic relationships in ECG signals. In this work, we introduce a novel perspective on ECG signals, treating heartbeats as words and rhythms as sentences. Based on this perspective, we first design the QRS-Tokenizer, which generates semantically meaningful ECG sentences from the raw ECG signals. Building on this, we then propose HeartLang, a novel self-supervised learning framework for ECG language processing, learning general representations at the form and rhythm levels. Additionally, we construct the largest heartbeat-based ECG vocabulary to date, which will further advance the development of ECG language processing. We evaluated HeartLang across six public ECG datasets, where it demonstrated robust competitiveness against other eSSL methods. Our data and code are publicly available at https://github.com/PKUDigitalHealth/HeartLang.

NeurIPS Conference 2025 Conference Paper

The Dual Nature of Plasticity Loss in Deep Continual Learning: Dissection and Mitigation

  • Haoyu Wang
  • Wei Dai
  • Jiawei Zhang
  • Jialun Ma
  • Mingyi Huang
  • Yuguo Yu

Loss of plasticity (LoP) is, alongside cell loss, a primary cause of cognitive decline in normally aging brains. Recent works show that similar LoP also plagues neural networks during deep continual learning (DCL). While it has been shown that random perturbations of learned weights can alleviate LoP, its underlying mechanisms remain insufficiently understood. Here we offer a unique view of LoP and dissect its mechanisms through the lens of an innovative framework combining the theory of neural collapse and finite-time Lyapunov exponent (FTLE) analysis. We show that LoP actually consists of two contrasting types: (i) type-1 LoP is characterized by highly negative FTLEs, where the network is prevented from learning due to the collapse of representations; (ii) type-2 LoP is characterized by excessively positive FTLEs, where the network can train well but its increasingly chaotic behavior reduces test accuracy. Based on these insights, we introduce Generalized Mixup, designed to relax the representation space for prolonged DCL, and demonstrate its superior efficacy versus existing methods.

AAAI Conference 2025 Short Paper

UACOF: A USV-AUV Collaboration Framework for Underwater Tasks Under Extreme Sea Conditions (Student Abstract)

  • Jingzehua Xu
  • Guanwen Xie
  • Yimian Ding
  • Yongming Zeng
  • Haoyu Wang
  • Shuai Zhang

Ocean exploration requires effective collaboration between unmanned surface vehicles (USVs) and autonomous underwater vehicles (AUVs). We propose UACOF, a USV-AUV collaboration framework that enhances multi-AUV performance under extreme sea conditions. The framework includes high-precision multi-AUV localization via USV path planning with Fisher information matrix optimization, and reinforcement learning training for cooperative tasks. Experimental results show UACOF's superior feasibility, performance, coordination, and robustness in extreme conditions.

NeurIPS Conference 2025 Conference Paper

Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis

  • Tianrui Wang
  • Haoyu Wang
  • Meng Ge
  • Cheng Gong
  • Chunyu Qiang
  • Ziyang Ma
  • Zikang Huang
  • Guanrou Yang

While emotional text-to-speech (TTS) has made significant progress, most existing research remains limited to utterance-level emotional expression and fails to support word-level control. Achieving word-level expressive control poses fundamental challenges, primarily due to the complexity of modeling multi-emotion transitions and the scarcity of annotated datasets that capture intra-sentence emotional and prosodic variation. In this paper, we propose WeSCon, the first self-training framework that enables word-level control of both emotion and speaking rate in a pretrained zero-shot TTS model, without relying on datasets containing intra-sentence emotion or speed transitions. Our method introduces a transition-smoothing strategy and a dynamic speed control mechanism to guide the pretrained TTS model in performing word-level expressive synthesis through a multi-round inference process. To further simplify the inference, we incorporate a dynamic emotional attention bias mechanism and fine-tune the model via self-training, thereby activating its ability for word-level expressive control in an end-to-end manner. Experimental results show that WeSCon effectively overcomes data scarcity, achieving state-of-the-art performance in word-level emotional expression control while preserving the strong zero-shot synthesis capabilities of the original TTS model.

IROS Conference 2024 Conference Paper

Arm-Constrained Curriculum Learning for Loco-Manipulation of a Wheel-Legged Robot

  • Zifan Wang
  • Yufei Jia
  • Lu Shi
  • Haoyu Wang
  • Haizhou Zhao
  • Xueyang Li
  • Jinni Zhou
  • Jun Ma 0008

Incorporating a robotic manipulator into a wheel-legged robot enhances its agility and expands its potential for practical applications. However, the presence of potential instability and uncertainties presents additional challenges for control objectives. In this paper, we introduce an arm-constrained curriculum learning architecture to tackle the issues introduced by adding the manipulator. First, we develop an arm-constrained reinforcement learning algorithm to ensure safety and reliability in control performance after equipping the manipulator. Additionally, to address discrepancies in reward settings between the arm and the base, we propose a reward-aware curriculum learning method. The policy is first trained in Isaac Gym and transferred to the physical robot to complete grasping tasks, including the door-opening task, the fan-twitching task, and the relay-baton-picking-and-following task. The results demonstrate that our proposed approach effectively controls the arm-equipped wheel-legged robot to master grasping abilities, including dynamic grasping skills, allowing it to chase and catch a moving object while in motion. Please refer to our website (https://acodedog.github.io/wheel-legged-loco-manipulation/) for the code and supplemental videos.

NeurIPS Conference 2024 Conference Paper

Certified Machine Unlearning via Noisy Stochastic Gradient Descent

  • Eli Chien
  • Haoyu Wang
  • Ziang Chen
  • Pan Li

"The right to be forgotten" ensured by laws for user data privacy becomes increasingly important. Machine unlearning aims to efficiently remove the effect of certain data points on the trained model parameters so that the model is approximately the same as if one had retrained it from scratch. We propose to leverage projected noisy stochastic gradient descent for unlearning and establish its first approximate unlearning guarantee under the convexity assumption. Our approach exhibits several benefits, including provable complexity savings compared to retraining, and support for sequential and batch unlearning. Both of these benefits are closely related to our new results on the infinite Wasserstein distance tracking of the adjacent (un)learning processes. Extensive experiments show that our approach achieves similar utility under the same privacy constraint while using 2% and 10% of the gradient computations compared with the state-of-the-art gradient-based approximate unlearning methods for mini-batch and full-batch settings, respectively.
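The projected noisy gradient step at the heart of this approach can be sketched as follows; the quadratic toy loss, step size, noise scale, and projection radius are illustrative choices, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def pnsgd_step(w, grad, lr=0.1, sigma=0.05, radius=1.0):
    """One projected noisy (S)GD step: gradient descent step plus
    Gaussian noise, followed by projection onto an L2 ball."""
    w = w - lr * grad(w) + sigma * rng.standard_normal(w.shape)
    norm = np.linalg.norm(w)
    if norm > radius:
        w = w * (radius / norm)  # project back into the constraint set
    return w

# Toy quadratic loss 0.5 * ||w - c||^2; "unlearning" would continue these
# same noisy dynamics on the loss with the deleted point's term removed.
c = np.array([0.6, -0.2])
w = np.zeros(2)
for _ in range(200):
    w = pnsgd_step(w, grad=lambda w: w - c)
```

The injected noise is what makes the learning and unlearning processes statistically close, which underlies the approximate unlearning guarantee the abstract refers to.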

NeurIPS Conference 2024 Conference Paper

Cross-video Identity Correlating for Person Re-identification Pre-training

  • Jialong Zuo
  • Ying Nie
  • Hanyu Zhou
  • Huaxin Zhang
  • Haoyu Wang
  • Tianyu Guo
  • Nong Sang
  • Changxin Gao

Recent research has proven that pre-training on large-scale person images extracted from internet videos is an effective way to learn better representations for person re-identification. However, this research is mostly confined to pre-training at the instance level or single-video tracklet level. It ignores the identity-invariance in images of the same person across different videos, which is a key focus in person re-identification. To address this issue, we propose a Cross-video Identity-cOrrelating pre-traiNing (CION) framework. Defining a noise concept that comprehensively considers both intra-identity consistency and inter-identity discrimination, CION seeks the identity correlation from cross-video images by modeling it as a progressive multi-level denoising problem. Furthermore, an identity-guided self-distillation loss is proposed to implement better large-scale pre-training by mining the identity-invariance within person images. We conduct extensive experiments to verify the superiority of our CION in terms of efficiency and performance. CION achieves significantly leading performance with even fewer training samples. For example, compared with the previous state-of-the-art ISR, CION with the same ResNet50-IBN achieves higher mAP of 93.3% and 74.3% on Market1501 and MSMT17, while utilizing only 8% of the training samples. Finally, with CION demonstrating superior model-agnostic ability, we contribute a model zoo named ReIDZoo to meet diverse research and application needs in this field. It contains a series of CION pre-trained models with spanning structures and parameters, totaling 32 models with 10 different structures, including GhostNet, ConvNext, RepViT, FastViT, and so on. The code and models will be open-sourced.

NeurIPS Conference 2024 Conference Paper

Langevin Unlearning: A New Perspective of Noisy Gradient Descent for Machine Unlearning

  • Eli Chien
  • Haoyu Wang
  • Ziang Chen
  • Pan Li

Machine unlearning has raised significant interest with the adoption of laws ensuring the "right to be forgotten". Researchers have provided a probabilistic notion of approximate unlearning under a definition similar to Differential Privacy (DP), where privacy is defined as statistical indistinguishability from retraining from scratch. We propose Langevin unlearning, an unlearning framework based on noisy gradient descent with privacy guarantees for approximate unlearning problems. Langevin unlearning unifies the DP learning process and the privacy-certified unlearning process with many algorithmic benefits. These include approximate certified unlearning for non-convex problems, complexity savings compared to retraining, and sequential and batch unlearning for multiple unlearning requests.

NeurIPS Conference 2024 Conference Paper

NoiseGPT: Label Noise Detection and Rectification through Probability Curvature

  • Haoyu Wang
  • Zhuo Huang
  • Zhiwei Lin
  • Tongliang Liu

Machine learning craves high-quality data, which is a major bottleneck in realistic deployment, as collecting and labeling data takes abundant resources and massive human labor. Unfortunately, label noise, where image data is mismatched with incorrect labels, exists ubiquitously in all kinds of datasets, significantly degrading the learning performance of deep networks. Learning with Label Noise (LNL) has been a common strategy for mitigating the influence of noisy labels. However, existing LNL methods either rely on the memorization effect during training to separate clean data from noisy data, or rely on dataset assumptions that cannot extend to various scenarios. Thanks to the development of Multimodal Large Language Models (MLLMs), which possess massive knowledge and hold In-Context Learning (ICL) ability, this paper proposes NoiseGPT to effectively leverage MLLMs as a knowledge expert for label noise detection and rectification. Specifically, we observe a probability curvature effect of MLLMs, where clean and noisy examples reside on curvatures with different smoothness, further enabling the detection of label noise. By designing a token-wise Mix-of-Feature (MoF) technique to produce the curvature, we propose an In-Context Discrepancy (ICD) measure to determine the authenticity of an image-label pair. Subsequently, we repeat this process to find the best matching pairs and complete our label rectification. Through extensive experiments, we demonstrate the effectiveness of NoiseGPT at detecting and cleansing dataset noise; on ILSVRC12 in particular, the AUROC of NoiseGPT exceeds 0.92. By integrating NoiseGPT with existing methods, classification performance on noisy datasets can be significantly improved, typically by 22.8% on 80% symmetric CIFAR-10 with M-correction. Source code: https://github.com/drunkerWang/NoiseGPT

ICRA Conference 2024 Conference Paper

Ospreys-inspired Self-takeoff Strategy of An Eagle-scale Flapping-wing Robot: System Design and Flight Experiments

  • Haoyu Wang
  • Wenfu Xu
  • Linpo Hou
  • Erzhen Pan

In this work, we achieved a self-takeoff of an eagle-scale flapping-wing robot for the first time. Inspired by the takeoff process of ospreys, we propose a bio-inspired takeoff strategy, then discuss the dynamic model and the requirements for self-takeoff. Based on the requirements of the flight strategy, we designed a system with two parts: a flapping-wing aircraft with a wingspan of 1.8 m and a takeoff weight of 870 g, and an auxiliary platform with an initial pitch-angle adjustment function. To explore the differences in the takeoff process under different conditions, we conduct flight experiments under different time-averaged thrust-to-weight ratios (0.745-0.876) and launch angles (45°-90°). The results of the flight experiments confirmed the theoretical analysis that the flapping-wing robot can achieve self-takeoff with no potential-energy cost and maintain high maneuverability (the video shows a rapid climb immediately after takeoff) even when the time-averaged thrust-to-weight ratio is smaller than 1. This is significantly different from conventional rotary-wing and vertical take-off and landing (VTOL) UAVs. This work solves the challenge of self-takeoff for large-scale flapping-wing robots using a designable method and demonstrates the superior performance potential of flapping-wing robots compared to conventional UAVs.

ICLR Conference 2024 Conference Paper

Self-Supervised High Dynamic Range Imaging with Multi-Exposure Images in Dynamic Scenes

  • Zhilu Zhang 0001
  • Haoyu Wang
  • Shuai Liu 0009
  • Xiaotao Wang
  • Lei Lei
  • Wangmeng Zuo

Merging multi-exposure images is a common approach for obtaining high dynamic range (HDR) images, with the primary challenge being the avoidance of ghosting artifacts in dynamic scenes. Recent methods have proposed using deep neural networks for deghosting. However, the methods typically rely on sufficient data with HDR ground-truths, which are difficult and costly to collect. In this work, to eliminate the need for labeled data, we propose SelfHDR, a self-supervised HDR reconstruction method that only requires dynamic multi-exposure images during training. Specifically, SelfHDR learns a reconstruction network under the supervision of two complementary components, which can be constructed from multi-exposure images and focus on HDR color as well as structure, respectively. The color component is estimated from aligned multi-exposure images, while the structure one is generated through a structure-focused network that is supervised by the color component and an input reference (e.g., medium-exposure) image. During testing, the learned reconstruction network is directly deployed to predict an HDR image. Experiments on real-world images demonstrate our SelfHDR achieves superior results against the state-of-the-art self-supervised methods, and comparable performance to supervised ones. Codes are available at https://github.com/cszhilu1998/SelfHDR

NeurIPS Conference 2024 Conference Paper

Visual Pinwheel Centers Act as Geometric Saliency Detectors

  • Haixin Zhong
  • Mingyi Huang
  • Wei P. Dai
  • Haoyu Wang
  • Anna W. Roe
  • Yuguo Yu

During natural evolution, the primary visual cortex (V1) of lower mammals typically forms salt-and-pepper organizations, while higher mammals and primates develop pinwheel structures with distinct topological properties. Despite the general belief that V1 neurons primarily serve as edge detectors, the functional advantages of pinwheel structures over salt-and-pepper organizations are not well recognized. To this end, we propose a two-dimensional self-evolving spiking neural network that integrates Hebbian-like plasticity and empirical morphological data. Through extensive exposure to image data, our network evolves from salt-and-pepper organizations to pinwheel structures, with neurons becoming localized bandpass filters responsive to various orientations. This transformation is accompanied by an increase in visual field overlap. Our findings indicate that neurons in pinwheel centers (PCs) respond more effectively to complex spatial textures in natural images, exhibiting quicker responses than those in salt-and-pepper organizations. PCs act as first-order stage processors with heightened sensitivity and reduced latency to intricate contours, while adjacent iso-orientation domains serve as second-order stage processors that refine edge representations for clearer perception. This study presents the first theoretical evidence that pinwheel structures function as crucial detectors of spatial contour saliency in the visual cortex.

NeurIPS Conference 2023 Conference Paper

Learning Better with Less: Effective Augmentation for Sample-Efficient Visual Reinforcement Learning

  • Guozheng Ma
  • Linrui Zhang
  • Haoyu Wang
  • Lu Li
  • Zilin Wang
  • Zhen Wang
  • Li Shen
  • Xueqian Wang

Data augmentation (DA) is a crucial technique for enhancing the sample efficiency of visual reinforcement learning (RL) algorithms. Notably, employing simple observation transformations alone can yield outstanding performance without extra auxiliary representation tasks or pre-trained encoders. However, it remains unclear which attributes of DA account for its effectiveness in achieving sample-efficient visual RL. To investigate this issue and further explore the potential of DA, this work conducts comprehensive experiments to assess the impact of DA's attributes on its efficacy and provides the following insights and improvements: (1) For individual DA operations, we reveal that both ample spatial diversity and slight hardness are indispensable. Building on this finding, we introduce Random PadResize (Rand PR), a new DA operation that offers abundant spatial diversity with minimal hardness. (2) For multi-type DA fusion schemes, the increased DA hardness and unstable data distribution result in the current fusion schemes being unable to achieve higher sample efficiency than their corresponding individual operations. Taking the non-stationary nature of RL into account, we propose a RL-tailored multi-type DA fusion scheme called Cycling Augmentation (CycAug), which performs periodic cycles of different DA operations to increase type diversity while maintaining data distribution consistency. Extensive evaluations on the DeepMind Control suite and CARLA driving simulator demonstrate that our methods achieve superior sample efficiency compared with the prior state-of-the-art methods.
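The CycAug idea of periodically cycling through augmentation operations can be sketched as a simple scheduler; the operation names and the period below are placeholders, not the paper's configuration:

```python
import itertools

def cyc_aug(ops, period):
    """Illustrative CycAug-style scheduler: cycle through data-augmentation
    operations, switching to the next one every `period` training steps.
    This keeps type diversity high while each phase sees a consistent
    data distribution."""
    for op in itertools.cycle(ops):
        for _ in range(period):
            yield op

# Which operation is applied at each of the first 7 steps:
sched = cyc_aug(["pad_resize", "shift", "rotate"], period=2)
seq = [next(sched) for _ in range(7)]
# seq -> pad_resize, pad_resize, shift, shift, rotate, rotate, pad_resize
```

Contrast this with fusion schemes that mix several operations at every step: cycling changes only one attribute (the operation type) per phase, so the data distribution within a phase stays stationary.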

AAAI Conference 2023 Conference Paper

SimFair: A Unified Framework for Fairness-Aware Multi-Label Classification

  • Tianci Liu
  • Haoyu Wang
  • Yaqing Wang
  • Xiaoqian Wang
  • Lu Su
  • Jing Gao

Recent years have witnessed increasing concerns over unfair decisions made by machine learning algorithms. To improve fairness in model decisions, various fairness notions have been proposed and many fairness-aware methods developed. However, most existing definitions and methods focus only on single-label classification. Fairness for multi-label classification, where each instance is associated with more than one label, has yet to be established. To fill this gap, we study fairness-aware multi-label classification in this paper. We start by extending Demographic Parity (DP) and Equalized Opportunity (EOp), two popular fairness notions, to multi-label classification scenarios. Through a systematic study, we show that on multi-label data, because of unevenly distributed labels, EOp usually fails to construct a reliable estimate on labels with few instances. We then propose a new framework named Similarity-s-induced Fairness (sγ-SimFair). This new framework utilizes data that have similar labels when estimating fairness on a particular label group for better stability, and can unify DP and EOp. Theoretical analysis and experimental results on real-world datasets together demonstrate the advantage of sγ-SimFair over existing methods on multi-label classification tasks.

ICRA Conference 2021 Conference Paper

REGNet: REgion-based Grasp Network for End-to-end Grasp Detection in Point Clouds

  • Binglei Zhao
  • Hanbo Zhang
  • Xuguang Lan
  • Haoyu Wang
  • Zhiqiang Tian
  • Nanning Zheng 0001

Reliable robotic grasping in unstructured environments is a crucial but challenging task. The main problem is to generate the optimal grasp of novel objects from partial noisy observations. This paper presents an end-to-end grasp detection network taking a single-view point cloud as input to tackle the problem. Our network includes three stages: Score Network (SN), Grasp Region Network (GRN), and Refine Network (RN). Specifically, SN regresses point grasp confidence and selects positive points with high confidence. Then GRN conducts grasp proposal prediction on the selected positive points. RN generates more accurate grasps by refining the proposals predicted by GRN. To further improve the performance, we propose a grasp anchor mechanism, in which grasp anchors with assigned gripper orientations are introduced to generate grasp proposals. Experiments demonstrate that REGNet achieves a success rate of 79.34% and a completion rate of 96% in real-world clutter, which significantly outperforms several state-of-the-art point-cloud-based methods, including GPD, PointNetGPD, and S4G. The code is available at https://github.com/zhaobinglei/REGNet for 3D Grasping.

AAAI Conference 2019 Conference Paper

Adversarial Binary Collaborative Filtering for Implicit Feedback

  • Haoyu Wang
  • Nan Shao
  • Defu Lian

Fast item recommendation based on implicit feedback is vital in practical scenarios due to data abundance, but challenging because of the lack of negative samples and the large number of recommended items. Recent adversarial methods unifying generative and discriminative models are promising, since the generative model, as a negative sampler, gradually improves as iteration continues. However, binary-valued generative models are still unexplored within the min-max framework, yet important for accelerating item recommendation. Optimizing binary-valued models is difficult because they are non-smooth and non-differentiable. To this end, we propose two novel methods to relax the binarization based on the error function and the Gumbel trick, so that the generative model can be optimized by many popular solvers, such as SGD and ADMM. The binary-valued generative model is then evaluated within the min-max framework on four real-world datasets and shown to be superior to competing hashing-based recommendation algorithms. In addition, our proposed framework can approximate discrete variables precisely and be applied to solve other discrete optimization problems.
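The Gumbel-trick relaxation mentioned in the abstract can be sketched for a single {-1, +1} code bit. This is a generic Gumbel-softmax-style relaxation, not the paper's exact formulation:

```python
import numpy as np

def gumbel_sign(logits, temperature=1.0, rng=None):
    """Relax a binary {-1, +1} code with the Gumbel trick: perturb the two
    class logits (+l for +1, -l for -1) with Gumbel noise, then take a
    temperature-controlled softmax. As temperature -> 0 the output
    approaches a hard sign, while staying differentiable for SGD."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # Standard Gumbel(0, 1) samples via inverse transform.
    g_pos = -np.log(-np.log(rng.random(logits.shape)))
    g_neg = -np.log(-np.log(rng.random(logits.shape)))
    a = (logits + g_pos) / temperature
    b = (-logits + g_neg) / temperature
    m = np.maximum(a, b)                        # stabilize the softmax
    p_pos = np.exp(a - m) / (np.exp(a - m) + np.exp(b - m))
    return 2.0 * p_pos - 1.0                    # soft code in [-1, 1]
```

With confident logits and a small temperature the relaxed codes saturate near ±1, which is what lets a binary generative model sit inside a gradient-based min-max loop.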

IJCAI Conference 2019 Conference Paper

Binarized Collaborative Filtering with Distilling Graph Convolutional Network

  • Haoyu Wang
  • Defu Lian
  • Yong Ge

The efficiency of top-K item recommendation based on implicit feedback is vital to recommender systems in the real world, but it is very challenging due to the lack of negative samples and the large number of candidate items. To address the challenges, we first introduce an improved Graph Convolutional Network (GCN) model with high-order feature interaction considered. Then we distill the ranking information derived from GCN into binarized collaborative filtering, which makes use of binary representations to improve the efficiency of online recommendation. However, binary codes are not only hard to optimize but also likely to incur information loss during training. Therefore, we propose a novel framework to convert the binary constrained optimization problem into an equivalent continuous optimization problem with a stochastic penalty. The binarized collaborative filtering model is then easily optimized by many popular solvers like SGD and Adam. The proposed algorithm is finally evaluated on three real-world datasets and shown to be superior to the competing baselines.
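The conversion from a binary-constrained problem to a continuous one can be illustrated with a penalty that vanishes exactly on binary vectors. This uses one common deterministic choice, (b² - 1)², purely as an illustration; the paper's stochastic penalty differs in its details:

```python
import numpy as np

def penalized_loss(b, base_loss, lam):
    """Continuous surrogate of a binary-constrained objective:
    instead of requiring b in {-1, +1}^d, add a penalty that is zero
    exactly when every coordinate of b is +/-1 and positive otherwise,
    so any smooth solver (SGD, Adam) can be applied to b directly."""
    return base_loss(b) + lam * np.sum((b * b - 1.0) ** 2)
```

As the penalty weight grows, minimizers are pushed onto the binary hypercube vertices, recovering feasible hash codes in the limit.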

AAAI Conference 2019 Conference Paper

Fast Incremental SVDD Learning Algorithm with the Gaussian Kernel

  • Hansi Jiang
  • Haoyu Wang
  • Wenhao Hu
  • Deovrat Kakde
  • Arin Chaudhuri

Support vector data description (SVDD) is a machine learning technique that is used for single-class classification and outlier detection. The idea of SVDD is to find a set of support vectors that defines a boundary around data. When dealing with online or large data, existing batch SVDD methods have to be rerun in each iteration. We propose a fast incremental learning algorithm (FISVDD) for SVDD that uses the Gaussian kernel. This algorithm builds on the observation that all support vectors on the boundary have the same distance to the center of the sphere in the higher-dimensional feature space as mapped by the Gaussian kernel function. Each iteration involves only the existing support vectors and the new data point. Moreover, the algorithm is based solely on matrix manipulations; the support vectors and their corresponding Lagrange multipliers αi are automatically selected and determined in each iteration. It can be seen that the complexity of our algorithm in each iteration is only O(k²), where k is the number of support vectors. Experimental results on some real data sets indicate that FISVDD demonstrates significant gains in efficiency with almost no loss in either outlier detection accuracy or objective function value.
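The equal-distance observation the algorithm builds on follows from the kernel expansion of the distance to the SVDD center. A minimal sketch of that computation, assuming a standard Gaussian kernel and given support vectors with their multipliers (not the incremental update itself):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel; note K(x, x) = 1 for any x."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def svdd_dist2(x, svs, alphas, sigma=1.0):
    """Squared feature-space distance of x to the SVDD center
    a = sum_i alpha_i phi(x_i). Because K(x, x) = 1 for the Gaussian
    kernel, ||phi(x) - a||^2 = 1 - 2 sum_i alpha_i K(x, x_i) + C,
    where C = sum_ij alpha_i alpha_j K(x_i, x_j) is the same for
    every x -- which is why boundary support vectors share one distance."""
    cross = sum(a * gaussian_kernel(x, s, sigma) for a, s in zip(alphas, svs))
    const = sum(ai * aj * gaussian_kernel(si, sj, sigma)
                for ai, si in zip(alphas, svs)
                for aj, sj in zip(alphas, svs))
    return 1.0 - 2.0 * cross + const
```

Since only the cross term depends on x, scoring a new point against k support vectors costs O(k) kernel evaluations, consistent with the O(k²) per-iteration bound quoted in the abstract.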