Author name cluster

Siyu Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers

2 author rows

AAAI Conference 2026 Conference Paper

ICM-Fusion: In-Context Meta-Optimized LoRA Fusion for Multi-Task Adaptation

Yihua Shao
Xiaofeng Lin
Xinwei Long
Siyu Chen
Minxi Yan
Yang Liu
Ziyang Yan
Ao Ma

Enabling multi-task adaptation in pre-trained Low-Rank Adaptation (LoRA) models is crucial for enhancing their generalization capabilities. Most existing pre-trained LoRA fusion methods decompose weight matrices, sharing similar parameters while fusing divergent ones. However, this paradigm inevitably induces inter-weight conflicts and leads to catastrophic domain forgetting. While incremental learning enables adaptation to multiple tasks, it struggles to achieve generalization in few-shot scenarios. Consequently, when the weight data follows a long-tailed distribution, it can lead to forgetting in the fused weights. To address this issue, we propose In-Context Meta LoRA Fusion (ICM-Fusion), a novel framework that synergizes meta-learning with in-context adaptation. The key innovation lies in our task vector arithmetic, which dynamically balances conflicting optimization directions across domains through learned manifold projections. ICM-Fusion obtains the optimal task vector orientation for the fused model in the latent space by adjusting the orientation of the task vectors. Subsequently, the fused LoRA is reconstructed by a self-designed Fusion VAE (F-VAE) to realize multi-task LoRA generation. We have conducted extensive experiments on visual and linguistic tasks, and the experimental results demonstrate that ICM-Fusion can be adapted to a wide range of architectural models and applied to various tasks. Compared to current pre-trained LoRA fusion methods, the LoRA fused by ICM-Fusion significantly reduces the multi-task loss and can even achieve task enhancement in few-shot scenarios.


AAAI Conference 2026 Conference Paper

Piercing the Fog: Disentangling Key Features for Vision Models in Multi-Degradation Scenarios

Siyu Chen
Shiqiang Ma
Fei Guo

In natural scenarios, vision models often encounter complex degradations (e.g., rain, snow, fog, or motion blur). These degradations severely corrupt image features, causing existing models to treat rarely seen or unseen degraded images as "unfamiliar", thereby losing their inherent recognition and perception capabilities. To address this challenge, we propose a novel degradation disentanglement model (DDM) aimed at precisely disentangling degraded features from the image. The model enhances its perception of various degradations by controlling the matching of features across different degradation types and further strengthens the cross-correlation of target features by introducing a degradation suppression module. This enables the model to re-identify and re-localize targets while removing degradations. We validated the effectiveness of our method on the more challenging few-shot segmentation datasets Degraded-Pascal and Degraded-COCO, where it outperforms the SOTA by 3.71% and 3.69%, respectively. The experimental results show that our method significantly improves the performance of vision models in various degradation scenarios and provides new ideas and solutions for visual understanding tasks in complex environments.


AAAI Conference 2026 Conference Paper

SGS-3D: High-Fidelity 3D Instance Segmentation via Reliable Semantic Mask Splitting and Growing

Chaolei Wang
Yang Luo
Jing Du
Siyu Chen
Yiping Chen
Ting Han

Accurate 3D instance segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, 3D instance segmentation methods based on 2D-to-3D lifting struggle to produce precise instance-level segmentation, due to accumulated errors introduced during the lifting process by ambiguous semantic guidance and insufficient depth constraints. To tackle these challenges, we propose Splitting and Growing reliable Semantic mask for high-fidelity 3D instance segmentation (SGS-3D), a novel "split-then-grow" framework that first purifies and splits ambiguous lifted masks using geometric primitives, and then grows them into complete instances within the scene. Unlike existing approaches that directly rely on raw lifted masks and sacrifice segmentation accuracy, SGS-3D serves as a training-free refinement method that jointly fuses semantic and geometric information, enabling effective cooperation between the two levels of representation. Specifically, for semantic guidance, we introduce a mask filtering strategy that leverages the co-occurrence of 3D geometry primitives to identify and remove ambiguous masks, thereby ensuring more reliable semantic consistency with the 3D object instances. For the geometric refinement, we construct fine-grained object instances by exploiting both spatial continuity and high-level features, particularly in the case of semantic ambiguity between distinct objects. Experimental results on ScanNet200, ScanNet++, and KITTI-360 demonstrate that SGS-3D substantially improves segmentation accuracy and robustness against inaccurate masks from pre-trained models, yielding high-fidelity object instances while maintaining strong generalization across diverse indoor and outdoor environments.


AAAI Conference 2026 Conference Paper

TR-DQ: Time-Rotation Diffusion Quantization

Yihua Shao
Deyang Lin
Minxi Yan
Siyu Chen
Fanhu Zeng
Minwen Liao
Ao Ma
Ziyang Yan

Diffusion models have been widely adopted in image and video generation. However, their complex network architectures lead to high inference overhead during generation. Existing diffusion quantization methods primarily focus on the quantization of the model structure while ignoring the impact of time-step variation during sampling. At the same time, most current approaches fail to account for significant activations that cannot be eliminated, resulting in substantial performance degradation after quantization. To address these issues, we propose Time-Rotation Diffusion Quantization (TR-DQ), a novel quantization method incorporating time-step and rotation-based optimization. TR-DQ first divides the sampling process based on time-steps and applies a rotation matrix to smooth activations and weights dynamically. For different time-steps, a dedicated hyperparameter is introduced for adaptive timing modeling, which enables dynamic quantization across different time-steps. Additionally, we explore the compression potential of Classifier-Free Guidance (CFG-wise) to establish a foundation for subsequent work. TR-DQ achieves state-of-the-art (SOTA) performance on image generation and video generation tasks, with a 1.38-1.89× speedup and a 1.97-2.58× memory reduction in inference compared to existing quantization methods.
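The rotation idea in this abstract rests on a generic identity: for an orthogonal matrix Q, (xQ)(QᵀW) = xW, so rotating activations and weights together leaves a layer's output unchanged while spreading an outlier activation across channels. The following is a minimal pure-Python sketch of that identity only. The 2×2 rotation, the toy values, and the helper names (`matmul`, `transpose`, `rotation`) are hypothetical; TR-DQ's actual time-step-dependent rotations and hyperparameters are not described in the abstract and are not reproduced here.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]

def rotation(theta):
    """2x2 rotation matrix (orthogonal, so its transpose is its inverse)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

# Toy activation row with one outlier channel, and a toy weight matrix.
x = [[10.0, 0.0]]
w = [[0.5, -1.0], [2.0, 0.25]]

q = rotation(math.pi / 4)
x_rot = matmul(x, q)             # rotated activations: outlier is spread out
w_rot = matmul(transpose(q), w)  # counter-rotated weights

y_ref = matmul(x, w)             # original layer output
y_rot = matmul(x_rot, w_rot)     # output computed from the rotated pair
```

Because the peak activation magnitude is smaller after rotation, a uniform quantizer applied to `x_rot` would need a narrower range than one applied to `x`, which is the usual motivation for rotation-based smoothing.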


NeurIPS Conference 2025 Conference Paper

An Optimized Franz-Parisi Criterion and its Equivalence with SQ Lower Bounds

Siyu Chen
Theodor Misiakiewicz
Ilias Zadik
Peiyuan Zhang

Bandeira et al. (2022) introduced the Franz-Parisi (FP) criterion for characterizing the computationally hard phases in statistical detection problems. The FP criterion, based on an annealed version of the celebrated Franz-Parisi potential from statistical physics, was shown to be equivalent to low-degree polynomial (LDP) lower bounds for Gaussian additive models, thereby connecting two distinct approaches to understanding the computational hardness in statistical inference. In this paper, we propose a refined FP criterion that aims to better capture the geometric "overlap" structure of statistical models. Our main result establishes that this optimized FP criterion is equivalent to Statistical Query (SQ) lower bounds, another foundational framework in the computational complexity of statistical inference. Crucially, this equivalence holds under a mild, verifiable assumption satisfied by a broad class of statistical models, including Gaussian additive models, planted sparse models, as well as non-Gaussian component analysis (NGCA), single-index (SI) models, and convex truncation detection settings. For instance, in the case of convex truncation tasks, the assumption is equivalent to the Gaussian correlation inequality (Royen, 2014) from convex geometry. In addition, our equivalence not only unifies and simplifies the derivation of several known SQ lower bounds, such as for the NGCA model (Diakonikolas et al., 2017) and the SI model (Damian et al., 2024), but also yields new SQ lower bounds of independent interest, including for the computational gaps in mixed sparse linear regression (Arpino et al., 2023) and convex truncation (De et al., 2023).


IROS Conference 2025 Conference Paper

FGS-SLAM: Fourier-based Gaussian Splatting for Real-time SLAM with Sparse and Dense Map Fusion

Yansong Xu
Junlin Li
Wei Zhang 0071
Siyu Chen
Shengyong Zhang
Yuquan Leng
Weijia Zhou

3D Gaussian splatting has advanced simultaneous localization and mapping (SLAM) technology by enabling real-time positioning and the construction of high-fidelity maps. However, the uncertainty in Gaussian position and initialization parameters introduces challenges, often requiring extensive iterative convergence and resulting in redundant or insufficient Gaussian representations. To address this, we introduce a novel adaptive densification method based on Fourier frequency domain analysis to establish Gaussian priors for rapid convergence. Additionally, we propose constructing independent and unified sparse and dense maps, where a sparse map supports efficient tracking via Generalized Iterative Closest Point (GICP) and a dense map creates high-fidelity visual representations. This is the first SLAM system leveraging frequency domain analysis to achieve high-quality Gaussian mapping in real time. Experimental results demonstrate an average frame rate of 36 FPS on Replica and TUM RGB-D datasets, achieving competitive accuracy in both localization and mapping. The source code is publicly available at https://github.com/3DV-Coder/FGS-SLAM.


JBHI Journal 2025 Journal Article

Identifying Acute Thoracolumbar Vertebral Compression Fractures From Low-Quality Small-Sample X-Ray Images: A Transfer Learning-Based Approach

Yilin Wang
Weijun Li
Siyu Chen
Yang Yang
Aidi Fan
Chenhao Lei
Yuhui Kou
Na Han

Timely and accurate diagnosis of acute thoracolumbar vertebral compression fractures in X-ray images is critical for initiating prompt and effective treatment, preventing potential neurological damage and long-term disability. Recent advancements in artificial intelligence (AI) have significantly improved medical imaging analysis, providing sophisticated tools to assist clinicians in diagnosing acute thoracolumbar vertebral compression fractures. Nonetheless, detecting these fractures through imaging remains challenging due to the complex overlapping of bony structures in the thoracolumbar region, variability in fracture patterns, and often subtle nature of these injuries. Additionally, the limited availability and sometimes poor quality of medical images further complicate accurate AI-based detection. Addressing these challenges, this study introduces a transfer learning model optimized for recognizing acute thoracolumbar vertebral compression fractures from a small set of low-quality X-ray images. The model starts with a feature extraction model that analyzes multiple texture features of X-ray images. It then employs a Vision Transformer Detector (ViTDet) combined with a faster region-based convolutional neural network (Faster R-CNN) to recognize fractures efficiently. To enhance its performance on small datasets, the model employs a transfer learning approach for training. Extensive experiments with a large dataset of real-world images have shown that this model can effectively recognize acute thoracolumbar vertebral compression fractures from low-quality images, outperforming professionals with specialized knowledge in some cases.


ICML Conference 2025 Conference Paper

In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention

Jianliang He
Xintian Pan
Siyu Chen
Zhuoran Yang

We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Through extensive empirical experiments and rigorous theoretical analysis, we demystify the emergence of elegant attention patterns: a diagonal and homogeneous pattern in the key-query weights, and a last-entry-only and zero-sum pattern in the output-value weights. Remarkably, these patterns consistently appear from gradient-based training starting from random initialization. Our analysis reveals that such emergent structures enable multi-head attention to approximately implement a debiased gradient descent predictor, one that outperforms single-head attention and nearly achieves Bayesian optimality up to a proportional factor. We also extend our study to scenarios with anisotropic covariates and multi-task linear regression. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution, paving the way for deeper understanding and broader applications of in-context learning.
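For orientation, the plain one-step gradient descent predictor often used as a reference point in the in-context linear regression literature can be sketched as follows. This is an assumption-laden illustration, not the paper's construction: the abstract does not specify the debiased form, and the function name `one_step_gd_predict`, the step size `eta`, and the toy data are hypothetical.

```python
def one_step_gd_predict(xs, ys, x_query, eta):
    """Predict y for x_query after one gradient step from w = 0.

    The loss is (1 / 2n) * sum_i (w . x_i - y_i)^2. Its gradient at w = 0 is
    -(1 / n) * sum_i y_i * x_i, so one step gives w = (eta / n) * sum_i y_i * x_i.
    """
    n = len(xs)
    dim = len(x_query)
    w = [eta / n * sum(y * x[j] for x, y in zip(xs, ys)) for j in range(dim)]
    return sum(wj * qj for wj, qj in zip(w, x_query))

# Toy in-context prompt: two (x, y) examples generated by a linear rule.
w_true = [2.0, -3.0]
xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [sum(a * b for a, b in zip(w_true, x)) for x in xs]

pred = one_step_gd_predict(xs, ys, [1.0, 1.0], eta=2.0)
```

With orthonormal examples and `eta` equal to the number of examples, the single step recovers `w_true` exactly in this toy case; with general covariates it is only an approximation, which is where a debiasing correction would matter.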


IJCAI Conference 2025 Conference Paper

In-Context Meta LoRA Generation

Yihua Shao
Minxi Yan
Yang Liu
Siyu Chen
Wenjie Chen
Xinwei Long
Ziyang Yan
Lei Li

Low-Rank Adaptation (LoRA) has demonstrated remarkable capabilities for task-specific fine-tuning. However, in scenarios that involve multiple tasks, training a separate LoRA model for each one results in considerable inefficiency in terms of storage and inference. Moreover, existing parameter generation methods fail to capture the correlations among these tasks, making multi-task LoRA parameter generation challenging. To address these limitations, we propose In-Context Meta LoRA (ICM-LoRA), a novel approach that efficiently achieves task-specific customization of large language models (LLMs). Specifically, we use training data from all tasks to train a tailored generator, Conditional Variational Autoencoder (CVAE). CVAE takes task descriptions as inputs and produces task-aware LoRA weights as outputs. These LoRA weights are then merged with LLMs to create task-specialized models without the need for additional fine-tuning. Furthermore, we utilize in-context meta-learning for knowledge enhancement and task mapping, to capture the relationship between tasks and parameter distributions. As a result, our method achieves more accurate LoRA parameter generation for diverse tasks using CVAE. ICM-LoRA enables more accurate LoRA parameter reconstruction than current parameter reconstruction methods and is useful for implementing task-specific enhancements of LoRA parameters. At the same time, our method occupies 283MB, only 1% storage compared with the original LoRA. The code is available at https://github.com/YihuaJerry/ICM-LoRA.
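The "merged with LLMs" step mentioned above can be illustrated with the standard LoRA merge, W' = W + (alpha / r) · B · A, where B and A are the low-rank factors. The sketch below is generic pure Python and is not taken from the ICM-LoRA code: the function name `merge_lora`, the toy matrices, and the scaling values are hypothetical, and the CVAE that would generate B and A is not shown.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * B @ A without modifying the inputs."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

# Toy example: a 2x2 base weight and a rank-1 update.
w_base = [[1.0, 0.0], [0.0, 1.0]]
b_factor = [[1.0], [2.0]]    # shape (2, rank)
a_factor = [[3.0, 4.0]]      # shape (rank, 2)

w_merged = merge_lora(w_base, b_factor, a_factor, alpha=1.0, rank=1)
```

After merging, inference uses `w_merged` directly, so the adapter adds no extra matrix multiplications at run time.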


NeurIPS Conference 2025 Conference Paper

Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation

Siyu Chen
Ting Han
Chengzheng Fu
Changshe Zhang
Chaolei Wang
Jinhe Su
Guorong Cai
Meiliu Wu

Open-Vocabulary semantic segmentation (OVSS) and domain generalization in semantic segmentation (DGSS) highlight a subtle complementarity that motivates Open-Vocabulary Domain-Generalized Semantic Segmentation (OV-DGSS). OV-DGSS aims to generate pixel-level masks for unseen categories while maintaining robustness across unseen domains, a critical capability for real-world scenarios such as autonomous driving in adverse conditions. We introduce Vireo, a novel single-stage framework for OV-DGSS that unifies the strengths of OVSS and DGSS for the first time. Vireo builds upon frozen Visual Foundation Models (VFMs) and incorporates scene geometry via Depth VFMs to extract domain-invariant structural features. To bridge the gap between visual and textual modalities under domain shift, we propose three key components: (1) GeoText Query, which aligns geometric features with language cues and progressively refines VFM encoder representations; (2) Coarse Mask Prior Embedding (CMPE), which enhances gradient flow for faster convergence and stronger textual influence; and (3) the Domain-Open-Vocabulary Vector Embedding Head (DOV-VEH), which fuses refined structural and semantic features for robust prediction. Comprehensive evaluation of these components demonstrates the effectiveness of our designs. Our proposed Vireo achieves state-of-the-art performance and surpasses existing methods by a large margin in both domain generalization and open-vocabulary recognition, offering a unified and scalable solution for robust visual understanding in diverse and dynamic environments. Code is available at https://github.com/SY-Ch/Vireo.


IJCAI Conference 2025 Conference Paper

Multi-Scale Temporal Neural Network for Stock Trend Prediction Enhanced by Temporal Hyperedge Learning

Lingyun Song
Haodong Li
Siyu Chen
Xinbiao Gan
Binze Shi
Jie Ma
Yudai Pan
Xiaoqi Wang

Existing research in Stock Trend Prediction (STP) focuses on temporal features extracted from a temporal sequence of stock data with a look-back window, which frequently leads to the omission of important periodic patterns, such as weekly and monthly variations in stock prices. Furthermore, these methods examine stocks individually, ignoring the temporal variation patterns among stocks that share higher-order relationships, like those within the same industry. These relationships typically provide contextual insights into market investments influencing stock price fluctuations. To tackle these issues, we propose a Multi-Scale Temporal Neural Network (MSTNN) framework tailored for STP. This architecture explores the periodic fluctuation behaviors of individual stocks through an innovative 3D convolutional neural network, alongside examining temporal variation patterns of stocks linked to specific industries via a temporal hypergraph attention mechanism. Empirical results from two real-world benchmark datasets show that MSTNN significantly outperforms prior state-of-the-art STP methods. The code of our MSTNN is available at https://github.com/sunlitsong/MSTNN.


TAAS Journal 2025 Journal Article

Road Surface State Change Detection Based on Binocular Vision for Autonomous Driving System

Liangtian Zhao
Xiangmin Xu
Shanshan Pei
Siyu Chen
Xiyuan Hu
Qiwei Xie

Road surface condition monitoring is crucial for enhancing transportation safety and efficiency, with applications in autonomous driving and urban infrastructure management. Existing methods often rely on single-camera setups or manual inspections, which are either insufficient for real-time monitoring or labor-intensive. This system focuses on two critical factors: road slope and surface damage, both significantly impacting driving safety and experience, highlighting the need for timely detection. To ensure accuracy and robustness, the system employs a binocular camera for detailed road environment insights and integrates urban sensing techniques. Its hardware deployment processes stereo vision data on embedded platforms, ensuring compatibility with urban IoT networks. This approach surpasses single-camera systems in detecting road surface variations. The research motivation stems from the pressing need to enhance road safety and driving conditions in urban areas. By analyzing binocular camera data and urban sensing technologies, the system offers real-time road condition analysis for effective decision-making. Regarding results, the system showed robust performance in detecting both road slope and surface damage. Slope detection achieved high accuracy with minimal error, and road damage detection reached an overall accuracy of 84%. The system remained stable across diverse conditions, including adverse weather and varying lighting.


IJCAI Conference 2025 Conference Paper

Unlocking Dark Vision Potential for Medical Image Segmentation

Hongpeng Yang
Xiangyu Hu
Yingxin Chen
Siyu Chen
Srihari Nelakuditi
Yan Tong
Shiqiang Ma
Fei Guo

Accurate segmentation of lesions is crucial for disease diagnosis and treatment planning. However, blurring and low contrast in the imaging process can affect segmentation results. We have observed that noninvasive medical imaging shares considerable similarities with natural images under low-light conditions and that nocturnal animals possess extremely strong night vision capabilities. Inspired by the dark vision of these nocturnal animals, we propose a novel plug-and-play dark vision network (DVNet) to enhance the model's perception of low-contrast medical images. Specifically, by employing the wavelet transform, we decompose medical images into subbands of varying frequencies, mimicking the sensitivity of photoreceptor cells to different light intensities. To simulate the antagonistic receptive fields of horizontal cells and bipolar cells, we design a Mamba-Enhanced Fusion Module to achieve global information correlation and enhance contrast between lesions and surrounding healthy tissues. Extensive experiments demonstrate that DVNet achieves SOTA performance in various medical image segmentation tasks.
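The subband decomposition mentioned above can be illustrated with a single level of the Haar wavelet transform on a 1D signal, which splits the input into a low-frequency (approximation) band and a high-frequency (detail) band. This is a simplified sketch: the abstract does not say which wavelet DVNet uses, images would require a 2D transform, and the function name `haar_step` and the sample values are hypothetical.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length list.

    Returns (approximation, detail): scaled pairwise sums capture the
    low-frequency content, scaled pairwise differences the high-frequency content.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    root2 = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    return approx, detail

# Toy intensity profile along one image row.
row = [4.0, 6.0, 10.0, 12.0]
approx, detail = haar_step(row)

# The transform is orthonormal, so total energy is preserved across subbands.
energy_in = sum(v * v for v in row)
energy_out = sum(v * v for v in approx) + sum(v * v for v in detail)
```

Applying the same step along rows and then columns of an image yields the familiar four subbands of a 2D decomposition.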


NeurIPS Conference 2024 Conference Paper

Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers

Siyu Chen
Heejune Sheen
Tianhao Wang
Zhuoran Yang

In-context learning (ICL) is a cornerstone of large language model (LLM) functionality, yet its theoretical foundations remain elusive due to the complexity of transformer architectures. In particular, most existing work only theoretically explains how the attention mechanism facilitates ICL under certain data models. It remains unclear how the other building blocks of the transformer contribute to ICL. To address this question, we study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data, where each token in the Markov chain statistically depends on the previous $n$ tokens. We analyze a sophisticated transformer model featuring relative positional embedding, multi-head softmax attention, and a feed-forward layer with normalization. We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model that performs a generalized version of the "induction head" mechanism with a learned feature, resulting from the congruous contribution of all the building blocks. Specifically, the first attention layer acts as a copier, copying past tokens within a given window to each position, and the feed-forward network with normalization acts as a selector that generates a feature vector by only looking at informationally relevant parents from the window. Finally, the second attention layer is a classifier that compares these features with the feature at the output position, and uses the resulting similarity scores to generate the desired output. Our theory is further validated by simulation experiments.
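The copier, selector, and classifier roles described above can be mimicked by a hard-matching lookup: find earlier positions whose preceding n tokens equal the current last n tokens, and vote on the token that followed. The sketch below is only an illustration of that idea. The trained model in the paper uses soft attention and a learned feature over a subset of relevant parents, and the function name `induction_predict` and the toy sequence are hypothetical.

```python
from collections import Counter

def induction_predict(tokens, n):
    """Predict the next token by matching the last n tokens against the past.

    Returns the most frequent token that followed an earlier occurrence of the
    current n-token context, or None if that context has not occurred before.
    """
    context = tuple(tokens[-n:])
    votes = Counter()
    for i in range(n, len(tokens)):
        # tokens[i - n:i] is the n-token history that preceded tokens[i].
        if tuple(tokens[i - n:i]) == context:
            votes[tokens[i]] += 1
    return votes.most_common(1)[0][0] if votes else None

sequence = list("abcabcab")
next_token = induction_predict(sequence, n=2)
```

In this toy sequence, the context "ab" appeared twice before and was followed by "c" both times, so the lookup returns "c".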


JBHI Journal 2022 Journal Article

Flexible Dual-Channel Digital Auscultation Patch With Active Noise Reduction for Bowel Sound Monitoring and Application

Gang Wang
Yingyun Yang
Siyu Chen
Ji Fu
Dong Wu
Aiming Yang
Yinji Ma
Xue Feng

Bowel sounds (BSs) have important clinical value in the auxiliary diagnosis of digestive diseases, but due to the inconvenience of long-term monitoring and too much interference from environmental noise, they have not been well studied. Most of the current electronic stethoscopes are hard and bulky without the function of noise reduction, and their application for long-term wearable monitoring of BS in noisy clinical environments is very limited. In this paper, a flexible dual-channel digital auscultation patch with active noise reduction is designed and developed, which is wireless, wearable, and conformably attached to abdominal skin to record BS more accurately. The ambient noise can be greatly reduced through active noise reduction based on the adaptive filter. At the same time, some nonstationary noises appearing intermittently (e.g., frictional noise) can also be removed from BS by the cross validation of multichannel simultaneous acquisition. Then, two kinds of typical BS signals are taken as examples, and the feature parameters of the BS in the time domain and frequency domain are extracted through the time-frequency analysis algorithm. Furthermore, based on the short-term energy ratio between the four channels of dual patches, the two-dimensional localization of BS on the abdomen mapping plane is realized. Finally, the continuous wearable monitoring of BS for patients with postoperative ileus (POI) in the noisy ward from pre-operation (POD0) to postoperative Day 7 (POD7) was carried out. The obtained change curve of the occurrence frequency of BS provides guidance for doctors to choose a reasonable feeding time for patients after surgery and accelerate their recovery. Therefore, flexible dual-channel digital auscultation patches with active noise reduction will have promising applications in the clinical auxiliary diagnosis of digestive diseases.
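Dual-channel adaptive noise cancellation of the kind mentioned above is commonly illustrated with a least-mean-squares (LMS) filter: one channel records signal plus noise, the other records a noise reference, and the filter learns to predict the noise so it can be subtracted. The abstract does not say which adaptive algorithm the patch uses, so LMS is an assumption here, and the function name `lms_cancel`, the step size `mu`, and the synthetic signals are hypothetical.

```python
import math

def lms_cancel(primary, reference, taps=4, mu=0.01):
    """LMS noise canceller.

    primary:   samples of (wanted signal + noise)
    reference: samples of a noise-only channel correlated with that noise
    Returns (cleaned, weights), where cleaned[k] is primary[k] minus the
    filter's current estimate of the noise.
    """
    weights = [0.0] * taps
    cleaned = []
    for k in range(len(primary)):
        # Most recent `taps` reference samples, zero-padded at the start.
        window = [reference[k - j] if k - j >= 0 else 0.0 for j in range(taps)]
        noise_estimate = sum(w * x for w, x in zip(weights, window))
        error = primary[k] - noise_estimate
        # Gradient-descent update on the squared error.
        weights = [w + mu * error * x for w, x in zip(weights, window)]
        cleaned.append(error)
    return cleaned, weights

# Synthetic demo: a slow "bowel sound" tone plus scaled ambient noise.
steps = 2000
ambient = [math.sin(0.3 * k) for k in range(steps)]
wanted = [0.5 * math.sin(0.05 * k) for k in range(steps)]
recorded = [s + 0.8 * a for s, a in zip(wanted, ambient)]

cleaned, weights = lms_cancel(recorded, ambient, taps=1, mu=0.01)
```

The step size trades convergence speed against residual jitter in the weights; too large a value makes the update unstable.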


NeurIPS Conference 2021 Conference Paper

Wasserstein Flow Meets Replicator Dynamics: A Mean-Field Analysis of Representation Learning in Actor-Critic

Yufeng Zhang
Siyu Chen
Zhuoran Yang
Michael Jordan
Zhaoran Wang

Actor-critic (AC) algorithms, empowered by neural networks, have had significant empirical success in recent years. However, most of the existing theoretical support for AC algorithms focuses on the case of linear function approximations, or linearized neural networks, where the feature representation is fixed throughout training. Such a limitation fails to capture the key aspect of representation learning in neural AC, which is pivotal in practical problems. In this work, we take a mean-field perspective on the evolution and convergence of feature-based neural AC. Specifically, we consider a version of AC where the actor and critic are represented by overparameterized two-layer neural networks and are updated with two-timescale learning rates. The critic is updated by temporal-difference (TD) learning with a larger stepsize while the actor is updated via proximal policy optimization (PPO) with a smaller stepsize. In the continuous-time and infinite-width limiting regime, when the timescales are properly separated, we prove that neural AC finds the globally optimal policy at a sublinear rate. Additionally, we prove that the feature representation induced by the critic network is allowed to evolve within a neighborhood of the initial one.


AAAI Conference 2018 Conference Paper

Predicting Aesthetic Score Distribution Through Cumulative Jensen-Shannon Divergence

Xin Jin
Le Wu
Xiaodong Li
Siyu Chen
Siwei Peng
Jingying Chi
Shiming Ge
Chenggen Song

Aesthetic quality prediction is a challenging task in the computer vision community because of the complex interplay with semantic contents and photographic technologies. Recent studies on deep-learning-based aesthetic quality assessment usually use a binary high-low label or a numerical score to represent the aesthetic quality. However, the scalar representation cannot adequately describe the underlying variety of human perception of aesthetics. In this work, we propose to predict the aesthetic score distribution (i.e., a score distribution vector of the ordinal basic human ratings) using a Deep Convolutional Neural Network (DCNN). Conventional DCNNs, which aim to minimize the difference between the predicted scalar numbers or vectors and the ground truth, cannot be directly used for the ordinal basic rating distribution. Thus, a novel CNN based on the Cumulative distribution with Jensen-Shannon divergence (CJS-CNN) is presented to predict the aesthetic score distribution of human ratings, with a new reliability-sensitive learning method based on the kurtosis of the score distribution, which eliminates the requirement of the original full data of human ratings (without normalization). Experimental results on a large-scale aesthetic dataset demonstrate the effectiveness of our introduced CJS-CNN in this task.
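One simple way to combine cumulative distributions with a Jensen-Shannon style comparison is to apply the per-bin Jensen-Shannon term to the cumulative sums of the two histograms rather than to the histograms themselves. The sketch below follows that reading and is an assumption: the exact loss used by CJS-CNN is not given in the abstract and may differ, and the function names `cumulative` and `cumulative_js` and the toy histograms are hypothetical.

```python
import math

def cumulative(p):
    """Running sums of a histogram (its discrete cumulative distribution)."""
    total, out = 0.0, []
    for v in p:
        total += v
        out.append(total)
    return out

def _term(a, m):
    """a * log(a / m), with the convention 0 * log(0) = 0."""
    return a * math.log(a / m) if a > 0 else 0.0

def cumulative_js(p, q):
    """Jensen-Shannon style divergence computed on cumulative distributions."""
    value = 0.0
    for a, b in zip(cumulative(p), cumulative(q)):
        m = 0.5 * (a + b)
        value += 0.5 * (_term(a, m) + _term(b, m))
    return value

# Ordinal ratings over three score bins.
target = [1.0, 0.0, 0.0]
near = [0.0, 1.0, 0.0]   # all mass one bin away from the target
far = [0.0, 0.0, 1.0]    # all mass two bins away from the target

d_near = cumulative_js(target, near)
d_far = cumulative_js(target, far)
```

Unlike the plain Jensen-Shannon divergence, which assigns the same value to any two histograms with disjoint support, this cumulative variant grows with how far the mass has moved along the ordinal scale.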
