Arrow Research search

Author name cluster

Hongyu Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers

16

JBHI Journal 2026 Journal Article

ECG-AuxNet: A Dual-Branch Spatial-Temporal Feature Fusion Framework with Auxiliary Learning for Enhanced Cardiac Disease Diagnosis

  • Ruiqi Shen
  • Yanan Wang
  • Chunge Cao
  • Shuaicong Hu
  • Jia Liu
  • Hongyu Wang
  • Gaoyan Zhong
  • Cuiwei Yang

Objective: Multiple limitations exist in current automated ECG analysis, including insufficient feature integration across leads, limited interpretability, poor generalization, and inadequate handling of class imbalance. To address these challenges, we develop a novel dual-branch framework that comprehensively captures spatial-temporal features for cardiac disease diagnosis. Methods: ECG-AuxNet combines a Multi-scale Transformer Attention CNN for spatial feature extraction and a GRU network for temporal dependency modeling. A Dual-stage Cross-Attention Fusion module integrates features from both branches, while a Feature Space Reconstruction (FSR) auxiliary task is introduced as a manifold regularizer to enhance feature discrimination. The framework was evaluated on PTB-XL (15,709 ECGs) and validated in real-world clinical scenarios (SXMU-2k, 1,673 ECGs). Results: For class-imbalanced disease recognition (NORM, CD, MI, STTC), ECG-AuxNet attained 78.34% F1-score on PTB-XL and 82.63% F1-score on SXMU-2k, outperforming 9 baseline models. FSR significantly improved feature discrimination by 11.7%, enhancing class boundary clarity and classification accuracy. Grad-CAM analysis revealed attention patterns that precisely match cardiologists' diagnostic focus areas. Conclusion: ECG-AuxNet effectively integrates spatial-temporal features through auxiliary learning, achieving robust generalizability in cardiac disease diagnosis with interpretability aligned with clinical expertise.

TMLR Journal 2026 Journal Article

Scaling Large Language Models with Fully Sparse Activations

  • Hongyu Wang
  • Shuming Ma
  • Ruiping Wang
  • Furu Wei

Activation sparsity can reduce the inference cost of large language models (LLMs) by lowering both compute and memory traffic. Yet most existing approaches sparsify only FFN intermediate states, leaving substantial portions of inference effectively dense. We study how to scale fully sparsely activated LLMs, in which every activation participating in linear transformations is sparse. We focus on two questions: how to train such models effectively, and how activation sparsity affects model quality as scale increases. We develop a pre-training recipe that enables effective training of fully sparsely activated LLMs from scratch, including squared ReLU as the activation function, top-K sparsification, and a straight-through estimator for the remaining linear layers. Extensive experiments spanning model sizes, training-token budgets, and target sparsity levels reveal that the performance gap to dense baselines narrows with model scale, increases nonlinearly with sparsity, and remains largely insensitive to the training-token budget. Finally, we investigate post-training activation sparsification of pre-trained dense models via both training-free techniques and supervised fine-tuning, and observe a trend similar to the pre-training experiments: larger models are more robust to sparsification and exhibit increasingly sparse activation patterns. Overall, our results provide practical training recipes and empirical guidance for building and scaling LLMs with fully sparse activations.
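As a hedged illustration of the recipe described in this abstract (function names and values are my own, not the paper's code), squared ReLU followed by top-K sparsification can be sketched in plain Python. At training time, a straight-through estimator would let gradients pass through the top-K mask as if it were the identity; only the forward pass is shown here.

```python
# Illustrative sketch: squared-ReLU activation plus top-K sparsification.
# Only the k largest-magnitude activations are kept; the rest are zeroed,
# so downstream linear layers see a sparse input vector.

def squared_relu(x):
    """Squared ReLU: max(0, v) ** 2 applied elementwise."""
    return [max(0.0, v) ** 2 for v in x]

def topk_sparsify(x, k):
    """Keep the k largest-magnitude entries of x, zero out the rest."""
    if k >= len(x):
        return list(x)
    # k-th largest magnitude acts as the keep threshold
    threshold = sorted((abs(v) for v in x), reverse=True)[k - 1]
    out, kept = [], 0
    for v in x:
        if abs(v) >= threshold and kept < k:
            out.append(v)
            kept += 1
        else:
            out.append(0.0)
    return out

hidden = squared_relu([-1.0, 0.5, 2.0, -0.2, 1.5])  # [0.0, 0.25, 4.0, 0.0, 2.25]
sparse = topk_sparsify(hidden, 2)                   # [0.0, 0.0, 4.0, 0.0, 2.25]
```

In a real model the same top-K step would be applied per token to the inputs of every linear layer, which is what makes the activations "fully" rather than FFN-only sparse.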

AAAI Conference 2026 Conference Paper

USE: A Unified Model for Universal Sound Separation and Extraction

  • Hongyu Wang
  • Chenda Li
  • Xin Zhou
  • Shuai Wang
  • Yanmin Qian

Sound separation (SS) and target sound extraction (TSE) are fundamental techniques for addressing complex acoustic scenarios. While existing SS methods struggle with determining the unknown number of sound sources, TSE approaches require precisely specified clues to achieve optimal performance. This paper proposes a unified framework that synergistically combines SS and TSE to overcome their individual limitations. Our architecture employs two complementary components: 1) an Encoder-Decoder Attractor (EDA) network that automatically infers both the source count and corresponding acoustic clues for SS, and 2) a multi-modal fusion network that precisely interprets diverse user-provided clues (acoustic, semantic, or visual) for TSE. Through joint training with cross-task consistency constraints, we establish a unified latent space that bridges both paradigms. During inference, the system adaptively operates in either fully autonomous SS mode or clue-driven TSE mode. Experiments demonstrate strong performance on both tasks, with a 1.4 dB SDR improvement in SS over the baseline and 86% TSE accuracy.

JBHI Journal 2025 Journal Article

A Multi-Sequence MRI-Based Hierarchical Expert Diagnostic Method for the Molecular Subtype of Breast Cancer

  • Hongyu Wang
  • Yanfang Hao
  • Pingping Wang
  • Erjuan Wang
  • Songtao Ding
  • Baoying Chen

The molecular subtype of breast cancer is significant for patients' treatment and prognosis. The application of multi-sequence MRI technology provides a new non-invasive diagnostic method, which can more accurately assess the vascular status of tumors and reveal fine structures. However, providing interpretable classification results remains a challenge. Although many convolutional neural network (CNN) and fine-grained classification methods based on MRI inputs have recently been proposed, most of them operate as a "black box" without a detailed explanation of the intermediate processes, resulting in a lack of interpretability in the breast cancer classification process. To address this problem, we propose a multi-sequence MRI-based hierarchical expert diagnostic method for the molecular subtype of breast cancer. With the strong differentiation module, this method first identifies enhanced features in breast tumors, ensuring that the subsequent classification process is precisely focused on the lesion features. In addition, inspired by the co-diagnosis of multiple experts in clinical practice, we set up a mechanism of collaborative diagnostic corrective learning by hierarchical experts to provide an interpretable classification process. Compared with previous studies, the framework learns features with a strong distinguishing ability for breast tumor classification. Specifically, multiple experts correct each other's learning to give more accurate and interpretable classification results, significantly improving the practical value for clinical diagnosis. We conducted extensive experiments on a breast dataset and compared our method quantitatively with others, achieving the best performance in terms of accuracy (0.889) and F1 score (0.893). We make the code public on GitHub: https://github.com/yanfangHao/HED.

JMLR Journal 2025 Journal Article

BitNet: 1-bit Pre-training for Large Language Models

  • Hongyu Wang
  • Shuming Ma
  • Lingxiao Ma
  • Lei Wang
  • Wenhui Wang
  • Li Dong
  • Shaohan Huang
  • Huaijie Wang

The increasing size of large language models (LLMs) has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption. Previous research typically applies quantization after pre-training. While these methods avoid the need for model retraining, they often cause notable accuracy loss at extremely low bit-widths. In this work, we explore the feasibility and scalability of 1-bit pre-training. We introduce BitNet b1 and BitNet b1.58, scalable and stable 1-bit Transformer architectures designed for LLMs. Specifically, we introduce BitLinear as a drop-in replacement for the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results show that BitNet b1 achieves competitive performance compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. With ternary weights, BitNet b1.58 matches the half-precision Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, BitNet defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. It enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
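A rough sketch of the ternary-weight idea behind BitNet b1.58 (an illustration under my own naming, not the authors' BitLinear implementation): a weight vector can be scaled by its mean absolute value and each entry rounded into {-1, 0, +1}.

```python
# Illustrative absmean ternarization: scale weights by their mean
# absolute value, round, and clip to the ternary set {-1, 0, +1}.
# The returned scale would be used to rescale activations at inference.

def absmean_ternarize(weights, eps=1e-8):
    """Map each weight to -1, 0, or +1 using the absmean scale."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale

q, s = absmean_ternarize([0.9, -0.05, -1.2, 0.4])
# scale is approx 0.6375; 0.9 -> 1, -0.05 -> 0, -1.2 -> -1 (clipped), 0.4 -> 1
```

Because the quantized matrix contains only -1, 0, and +1, matrix multiplication reduces to additions and subtractions, which is the source of the latency and energy savings the abstract describes.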

EAAI Journal 2025 Journal Article

Continuous spatio-temporal prompts for visual tracking

  • Meng Sun
  • Xiaotao Liu
  • Yifan Li
  • Hongyu Wang
  • Dian Yuan
  • Jing Liu

Currently, visual single-object tracking methods utilize online template updates to incorporate temporal information. However, these methods rely on confidence scores to evaluate the reliability of the current template, which may result in a template not being updated for an extended period. Moreover, advanced trackers select bounding boxes based solely on the similarity between the template and the search area, which can lead to tracking drift when encountering deformable or similar targets. To alleviate these limitations, we propose a Spatio-Temporal Prompt Tracker (STPTrack), which utilizes the prior that the object state changes little between successive frames. Unlike previous tracking methods that mainly rely on templates and similarity scores, STPTrack transfers the object position and shape information of the previous frame to the current frame as a continuous spatio-temporal prompt for the first time, and realizes efficient fusion of spatio-temporal information through the prompt encoder and the fusion decoder module. Specifically, it encodes the bounding-box coordinates or mask information of the previous frame and the response points of the current frame as prompt features, and then combines prompt tokens with search tokens through the fusion decoder to provide the potential location of the object for the search feature map. STPTrack sets a new state-of-the-art performance on six tracking benchmark datasets.

ICRA Conference 2025 Conference Paper

GARAD-SLAM: 3D Gaussian Splatting for Real-Time Anti Dynamic SLAM

  • Mingrui Li
  • Weijian Chen
  • Na Cheng
  • Jingyuan Xu
  • Dong Li
  • Hongyu Wang

The 3D Gaussian Splatting (3DGS)-based SLAM system has garnered widespread attention due to its excellent performance in real-time high-fidelity rendering. However, in real-world environments with dynamic objects, existing 3DGS-based SLAM systems often face mapping errors and tracking drift issues. To address these problems, we propose GARAD-SLAM, a real-time 3DGS-based SLAM system tailored for dynamic scenes. In terms of tracking, unlike traditional methods, we directly perform dynamic segmentation on Gaussians and map them back to the front-end to obtain dynamic point labels through a Gaussian pyramid network, achieving precise dynamic removal and robust tracking. For mapping, we impose rendering penalties on dynamically labeled Gaussians, which are updated through the network, to avoid irreversible erroneous removal caused by simple pruning. Our results on real-world datasets demonstrate that our method is competitive in tracking compared to baseline methods, generating fewer artifacts and higher-quality reconstructions in rendering.

TMLR Journal 2025 Journal Article

Pruning Feature Extractor Stacking for Cross-domain Few-shot Learning

  • Hongyu Wang
  • Eibe Frank
  • Bernhard Pfahringer
  • Geoff Holmes

Combining knowledge from source domains to learn efficiently from a few labelled instances in a target domain is a transfer learning problem known as cross-domain few-shot learning (CDFSL). Feature extractor stacking (FES) is a state-of-the-art CDFSL method that maintains a collection of source domain feature extractors instead of a single universal extractor. FES uses stacked generalisation to build an ensemble from extractor snapshots saved during target domain fine-tuning. It outperforms several contemporary universal model-based CDFSL methods in the Meta-Dataset benchmark. However, it incurs higher storage cost because it saves a snapshot for every fine-tuning iteration for every extractor. In this work, we propose a bidirectional snapshot selection strategy for FES, leveraging its cross-validation process and the ordered nature of its snapshots, and demonstrate that a 95% snapshot reduction can be achieved while retaining the same level of accuracy.

IROS Conference 2025 Conference Paper

Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation

  • Senwei Xie
  • Hongyu Wang
  • Zhanqi Xiao
  • Yun-Ru Wang
  • Xilin Chen 0001

Zero-shot generalization across various robots, tasks and environments remains a significant challenge in robotic manipulation. Policy code generation methods use executable code to connect high-level task descriptions and low-level action sequences, leveraging the generalization capabilities of large language models and atomic skill libraries. In this work, we propose Robotic Programmer (RoboPro), a robotic foundation model, enabling the capability of perceiving visual information and following free-form instructions to perform robotic manipulation with policy code in a zero-shot manner. To address the low efficiency and high cost of collecting runtime code data for robotic tasks, we devise Video2Code to synthesize executable code from extensive in-the-wild videos with an off-the-shelf vision-language model and a code-domain large language model. Extensive experiments show that RoboPro achieves state-of-the-art zero-shot performance on robotic manipulation in both simulators and real-world environments. Specifically, the zero-shot success rate of RoboPro on RLBench surpasses Code-as-Policies equipped with the state-of-the-art model GPT-4o by 11.6%. Furthermore, RoboPro is robust to variations in API formats and skill sets. Our website can be found at https://video2code.github.io/RoboPro-website/.

IJCAI Conference 2025 Conference Paper

ST-TAR: An Efficient Spatio-Temporal Learning Framework for Traffic Accident Risk Forecasting

  • Hongyu Wang
  • Lisi Chen
  • Shuo Shang
  • Peng Han
  • Christian S. Jensen

Traffic accidents represent a significant concern due to their devastating consequences. The ability to predict future traffic accident risks is of key importance to accident prevention activities in transportation systems. Although existing studies have made substantial efforts to model spatio-temporal correlations, they fall short when it comes to addressing the zero-inflated data issue and capturing spatio-temporal heterogeneity, which reduces their predictive abilities. In addition, improving efficiency is an urgent requirement for traffic accident forecasting. To overcome these limitations, we propose an efficient Spatio-Temporal learning framework for Traffic Accident Risk forecasting (ST-TAR). Taking long-term and short-term data as separate inputs, the ST-TAR model integrates hierarchical multi-view GCN and long short-term cross-attention mechanism to encode spatial dependencies and temporal patterns. We leverage long-term periodicity and short-term proximity for spatio-temporal contrastive learning to capture spatio-temporal heterogeneity. A tailored adaptive risk-level weighted loss function based on efficient locality-sensitive hashing is introduced to alleviate the zero-inflated issue. Extensive experiments on two real-world datasets offer evidence that ST-TAR is capable of advancing state-of-the-art forecasting accuracy with improved efficiency. This makes ST-TAR suitable for applications that require accurate real-time forecasting.

IJCAI Conference 2025 Conference Paper

TESTN: A Triad-Enhanced Spatio-Temporal Network for Multi-Temporal POI Relationship Inference

  • Hongyu Wang
  • Lisi Chen
  • Shuo Shang

Multi-temporal Point-of-Interest (POI) relationship inference aims to identify evolving relationships among locations over time, providing critical insights for location-based services. While existing studies have made substantial efforts to model relationships with custom-designed graph neural networks, they face the challenge of leveraging POI contextual information characterized by spatial dependencies and temporal dynamics, as well as capturing the heterogeneity of multi-type relationships. To address these challenges, we propose a Triad-Enhanced Spatio-Temporal Network (TESTN), which conceptualizes triads as interactions between relationships for capturing potential interplay. Specifically, TESTN incorporates the spatial 2-hop aggregation layer to capture geographical and semantic information beyond first-order neighbors and the temporal context extractor to integrate relational dynamics within adjacent time segments. Furthermore, we introduce a self-supervised pairwise neighboring relation consistency detection scheme to preserve the heterogeneity of multi-type relationships. Extensive experiments on three real-world datasets demonstrate the superior performance of our TESTN framework.

AAAI Conference 2024 Conference Paper

PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine

  • Chenrui Zhang
  • Lin Liu
  • Chuyuan Wang
  • Xiao Sun
  • Hongyu Wang
  • Jinpeng Wang
  • Mingchen Cai

As an effective tool for eliciting the power of Large Language Models (LLMs), prompting has recently demonstrated unprecedented abilities across a variety of complex tasks. To further improve the performance, prompt ensemble has attracted substantial interest for tackling the hallucination and instability of LLMs. However, existing methods usually adopt a two-stage paradigm, which requires a pre-prepared set of prompts with substantial manual effort, and is unable to perform directed optimization for different weak learners. In this paper, we propose a simple, universal, and automatic method named PREFER (Prompt Ensemble learning via Feedback-Reflect-Refine) to address the stated limitations. Specifically, given the fact that weak learners are supposed to focus on hard examples during boosting, PREFER builds a feedback mechanism for reflecting on the inadequacies of existing weak learners. Based on this, the LLM is required to automatically synthesize new prompts for iterative refinement. Moreover, to enhance stability of the prompt effect evaluation, we propose a novel prompt bagging method involving forward and backward thinking, which is superior to majority voting and is beneficial for both feedback and weight calculation in boosting. Extensive experiments demonstrate that our PREFER achieves state-of-the-art performance in multiple types of tasks by a significant margin. We have made our code publicly available.

AAAI Conference 2024 Conference Paper

Temporal Adaptive RGBT Tracking with Modality Prompt

  • Hongyu Wang
  • Xiaotao Liu
  • Yifan Li
  • Meng Sun
  • Dian Yuan
  • Jing Liu

RGBT tracking has been widely used in various fields such as robotics, surveillance processing, and autonomous driving. Existing RGBT trackers fully explore the spatial information between the template and the search region and locate the target based on the appearance matching results. However, these RGBT trackers have very limited exploitation of temporal information, either ignoring temporal information or exploiting it through online sampling and training. The former struggles to cope with object state changes, while the latter neglects the correlation between spatial and temporal information. To alleviate these limitations, we propose a novel Temporal Adaptive RGBT Tracking framework, named TATrack. TATrack has a spatio-temporal two-stream structure and captures temporal information via an online updated template, where the two-stream structure refers to multi-modal feature extraction and cross-modal interaction for the initial template and the online updated template, respectively. TATrack comprehensively exploits spatio-temporal and multi-modal information for target localization. In addition, we design a spatio-temporal interaction (STI) mechanism that bridges the two branches and enables cross-modal interaction to span longer time scales. Extensive experiments on three popular RGBT tracking benchmarks show that our method achieves state-of-the-art performance while running at real-time speed.

EAAI Journal 2023 Journal Article

Joint image enhancement learning for marine object detection in natural scene

  • Na Cheng
  • Hongye Xie
  • Xuanbing Zhu
  • Hongyu Wang

Marine object detection has received an increasing amount of attention due to its enormous application potential in the fields of marine engineering, Remotely Operated Vehicles, and Autonomous Underwater Vehicles. Generic object detection has made substantial progress with the prevalent trend of deep learning in the past few years. However, marine object detection in natural scenes remains an unsolved problem. The challenges stem from low visibility, small size, serious occlusion, and dense distribution. In this article, we attempt to address the marine object detection problem by presenting a joint attention-guided dual-subnet network (JADSNet) that can jointly learn the image enhancement and object detection tasks for end-to-end training. JADSNet attains significant performance gains by comprising two subnetworks: an image enhancement subnet and a marine object detection subnet. Essentially, the marine object detection subnet is an extended feature pyramid network with a dual attention-guided module and a multi-term loss function. It takes RetinaNet as a backbone and is responsible for classifying and locating objects. In the image enhancement subnet, feature extraction layers are shared with the marine object detection subnet and a feature enhancement module is used. A multi-term loss function is introduced to reduce false detections and missed detections caused by the mutual occlusion of marine objects. We build a new Marine Object Detection (MOD) dataset that contains more than 25,000 train-val and 3,000 test underwater images. The experimental findings demonstrate that JADSNet achieves notable performance, reaching 74.41% mAP on the MOD dataset. We also verify that JADSNet can be applied to object detection in foggy weather, achieving 49.54% mAP on the foggy dataset.

JBHI Journal 2023 Journal Article

Two Path Gland Segmentation Algorithm of Colon Pathological Image Based on Local Semantic Guidance

  • Songtao Ding
  • Hongyu Wang
  • Hu Lu
  • Michele Nappi
  • Shaohua Wan

Colonic adenocarcinoma is a disease severely endangering human life, caused by mucosal epidermal carcinogenesis. The segmentation of potentially cancerous glands is key to the detection and diagnosis of colonic adenocarcinoma. Cancerous tissue varies in appearance across colon pathological images, and it is impossible to accurately segment the changes of glands from benign to malignant using a single network. Given these issues, a two-path gland segmentation algorithm for colon pathological images based on local semantic guidance is proposed in this paper. An improved candidate-region search algorithm is adopted to expand the original image dataset and generate sub-datasets sensitive to specific features. Then, a semantic feature-guided model is employed to extract local adenocarcinoma features and acts on the backbone network together with context feature extraction based on an attention mechanism. In this way, a larger receptive field and more local feature information are obtained, the learning ability of the network with respect to the morphological features of glands is enhanced, and the performance of automatic gland segmentation is ultimately improved. The algorithm is verified on the Warwick-QU dataset. Compared with currently popular segmentation algorithms, our algorithm performs well in terms of Dice coefficient, F1 score, and Hausdorff distance on different types of test sets.

JBHI Journal 2020 Journal Article

Thorax-Net: An Attention Regularized Deep Neural Network for Classification of Thoracic Diseases on Chest Radiography

  • Hongyu Wang
  • Haozhe Jia
  • Le Lu
  • Yong Xia

Deep learning techniques have been increasingly used to provide more accurate and more accessible diagnosis of thorax diseases on chest radiographs. However, due to the lack of dense annotation of large-scale chest radiograph data, this computer-aided diagnosis task is intrinsically a weakly supervised learning problem and remains challenging. In this paper, we propose a novel deep convolutional neural network called Thorax-Net to diagnose 14 thorax diseases using chest radiography. Thorax-Net consists of a classification branch and an attention branch. The classification branch serves as a uniform feature extraction-classification network to free users from troublesome hand-crafted feature extraction and classifier construction. The attention branch exploits the correlation between class labels and the locations of pathological abnormalities by analyzing the feature maps learned by the classification branch. Feeding a chest radiograph to the trained Thorax-Net, a diagnosis is obtained by averaging and binarizing the outputs of the two branches. The proposed Thorax-Net model has been evaluated against three state-of-the-art deep learning models using the patient-wise official split of the ChestX-ray14 dataset and against five other deep learning models using the image-wise random data split. Our results show that Thorax-Net achieves an average per-class area under the receiver operating characteristic curve (AUC) of 0.7876 and 0.896 in the two experiments, respectively, which are higher than the AUC values obtained by the other deep models when all were trained with no external data.
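A minimal sketch of the two-branch fusion step this abstract describes, assuming per-class probabilities from each branch and a 0.5 decision threshold (the branch outputs and threshold here are illustrative, not values from the paper):

```python
# Illustrative fusion for a multi-label classifier: average the per-class
# probabilities from the classification and attention branches, then
# binarize each class against a threshold to get the final diagnosis.

def fuse_branches(cls_probs, attn_probs, threshold=0.5):
    """Average two branches' per-class probabilities and binarize."""
    averaged = [(c + a) / 2 for c, a in zip(cls_probs, attn_probs)]
    return [1 if p >= threshold else 0 for p in averaged]

# e.g. three of the 14 thorax-disease classes:
labels = fuse_branches([0.8, 0.3, 0.6], [0.6, 0.2, 0.5])
# averaged probabilities are [0.7, 0.25, 0.55], giving labels [1, 0, 1]
```

Because each class is thresholded independently, a radiograph can be assigned several disease labels at once, which matches the multi-label nature of the ChestX-ray14 task.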