Arrow Research search

Author name cluster

Wei Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

52 papers
2 author rows

Possible papers

52

AAAI Conference 2026 Conference Paper

Audio-Assisted Face Video Restoration with Temporal and Identity Complementary Learning

  • Yuqin Cao
  • Yixuan Gao
  • Wei Sun
  • Xiaohong Liu
  • Yulun Zhang
  • Xiongkuo Min

Face videos accompanied by audio have become integral to our daily lives, but they often suffer from complex degradations. Most face video restoration methods neglect the intrinsic correlations between visual and audio features, particularly in the mouth region. Several audio-aided face video restoration methods have been proposed, but they focus only on compression artifact removal. In this paper, we propose a General Audio-assisted face Video restoration Network (GAVN) to address various types of streaming video distortions via identity and temporal complementary learning. Specifically, GAVN first captures inter-frame temporal features in the low-resolution space to restore frames coarsely and save computational cost. Then, GAVN extracts intra-frame identity features in the high-resolution space with the assistance of audio signals and face landmarks to restore more facial details. Finally, the reconstruction module integrates temporal features and identity features to generate high-quality face videos. Experimental results demonstrate that GAVN outperforms existing state-of-the-art methods on face video compression artifact removal, deblurring, and super-resolution.

JBHI Journal 2026 Journal Article

CapsFormer: A Dual-Stream Causal-Aware Capsule-Transformer Network for EMG Signal Representation Learning

  • Pengpai Wang
  • Tiantian Xie
  • Yueying Zhou
  • Peiliang Gong
  • Wei Sun
  • Xiaorui Zhang
  • Rosa H. M. Chan

Electromyography (EMG) signals are widely applied in prosthetic control, rehabilitation training, and human-machine interaction. This places stringent requirements on gesture recognition algorithms to balance long-range temporal modeling with local pose invariance. Existing approaches typically trade off between local and global features and lack causally consistent interpretability, resulting in insufficient generalization across subjects and sessions. To address these shortcomings, we propose a novel dual-stream causal Capsule-Transformer network (CapsFormer). In the Transformer stream, we employ causal attention in the self-attention mechanism to explicitly block all future information, ensuring that each time-step representation depends solely on itself and prior signals; in the Capsule stream, we leverage dynamic routing to capture local part-whole pose vectors, enhancing robustness against electrode shifts and muscle deformations. The two streams' features are then integrated in a fusion module and trained end-to-end. To validate the model's effectiveness, we evaluate it on a multi-subject dataset; results demonstrate that CapsFormer outperforms state-of-the-art models in recognition accuracy, cross-subject robustness, and interpretability. This work not only offers a new paradigm for efficient EMG signal representation but also supports causally consistent temporal signal analysis and interpretable deep learning methods, bearing significant implications for intelligent prosthetic control and human-machine interfaces.

AAAI Conference 2026 Conference Paper

FedBRICK: Structural Bias Aware Heterogeneous Foundation Model Federated Tuning

  • Yuhang Zhang
  • Xianda Wang
  • Wei Sun
  • Jiaxuan Chen
  • Fangxin Wang

Model-heterogeneous federated tuning (MHFT) enables the privacy-preserving fine-tuning of foundation models in heterogeneous systems by allowing clients and the server to adopt different model architectures. Depth partial training—where each client updates only a subset of the model's layers—alleviates system heterogeneity but exacerbates client drift, which stems from clients optimizing different objectives and therefore degrades overall performance. Beyond the well-known statistical bias—where non-IID data leads to client drift—we identify a structural bias arising from clients deploying only partial layers of the global model, which serves as an important cause of drift. We further provide a theoretical analysis showing that the possible range of structural bias expands linearly with the number of missing layers. To counter this effect, we introduce FedBRICK (Federated Bias Recovery via Inserted Calibrative Kernels), which inserts tiny BRICKs into each client’s subnetwork. We employ a dual-end layer-wise distillation scheme to train these blocks using both client-side local data and a small public proxy set on the server. This design effectively mitigates the structural bias caused by layer dropping, reduces client drift, and remains practical for storage-constrained devices. Extensive experiments on federated learning benchmarks confirm that FedBRICK delivers up to a 5% average accuracy gain while requiring no more than 1.44% extra storage per client.

AAAI Conference 2026 Conference Paper

Self-Enhanced Image Clustering with Cross-Modal Semantic Consistency

  • Zihan Li
  • Wei Sun
  • Jing Hu
  • Jianhua Yin
  • Xing Wang
  • Erwei Yin
  • Jianlong Wu

While large language-image pre-trained models like CLIP offer powerful generic features for image clustering, existing methods typically freeze the encoder. This creates a fundamental mismatch between the model's task-agnostic representations and the demands of a specific clustering task, imposing a ceiling on performance. To break this ceiling, we propose a self-enhanced framework based on cross-modal semantic consistency for efficient image clustering. Our framework first builds a strong foundation via Cross-Modal Semantic Consistency and then specializes the encoder through Self-Enhancement. In the first stage, we focus on Cross-Modal Semantic Consistency. By mining consistency between generated image-text pairs at the instance, cluster assignment, and cluster center levels, we train lightweight clustering heads to align with the rich semantics of the pre-trained model. This alignment process is bolstered by a novel method for generating higher-quality cluster centers and a dynamic balancing regularizer to ensure well-distributed assignments. In the second stage, we introduce a Self-Enhanced fine-tuning strategy. The well-aligned model from the first stage acts as a reliable pseudo-label generator. These self-generated supervisory signals then drive the efficient joint optimization of the vision encoder and clustering heads, unlocking their full potential. Extensive experiments on six mainstream datasets show that our method outperforms existing deep clustering methods by significant margins. Notably, our ViT-B/32 model already matches or even surpasses the accuracy of state-of-the-art methods built upon the far larger ViT-L/14.

AAAI Conference 2026 Conference Paper

VQAThinker: Exploring Generalizable and Explainable Video Quality Assessment via Reinforcement Learning

  • Linhan Cao
  • Wei Sun
  • Weixia Zhang
  • Xiangyang Zhu
  • Jun Jia
  • Kaiwei Zhang
  • Dandan Zhu
  • Guangtao Zhai

Video quality assessment (VQA) aims to objectively quantify perceptual quality degradation in alignment with human visual perception. Despite recent advances, existing VQA models still suffer from two critical limitations: poor generalization to out-of-distribution (OOD) videos and limited explainability, which restrict their applicability in real-world scenarios. To address these challenges, we propose VQAThinker, a reasoning-based VQA framework that leverages large multimodal models (LMMs) with reinforcement learning to jointly model video quality understanding and scoring, emulating human perceptual decision-making. Specifically, we adopt group relative policy optimization (GRPO), a rule-guided reinforcement learning algorithm that enables reasoning over video quality under score-level supervision, and introduce three VQA-specific rewards: (1) a bell-shaped regression reward that increases rapidly as the prediction error decreases and becomes progressively less sensitive near the ground truth; (2) a pairwise ranking reward that guides the model to correctly determine the relative quality between video pairs; and (3) a temporal consistency reward that encourages the model to prefer temporally coherent videos over their perturbed counterparts. Extensive experiments demonstrate that VQAThinker achieves state-of-the-art performance on both in-domain and OOD VQA benchmarks, showing strong generalization for video quality scoring. Furthermore, evaluations on video quality understanding tasks validate its superiority in distortion attribution and quality description compared to existing explainable VQA models and LMMs. These findings demonstrate that reinforcement learning offers an effective pathway toward building generalizable and explainable VQA models solely with score-level supervision.

ICLR Conference 2025 Conference Paper

API Pack: A Massive Multi-Programming Language Dataset for API Call Generation

  • Zhen Guo
  • Adriana Meza Soria
  • Wei Sun
  • Yikang Shen
  • Rameswar Panda

We introduce API Pack, a massive multi-programming language dataset containing over one million instruction-API calls for improving the API call generation capabilities of large language models. Our evaluation highlights three key findings: First, fine-tuning on API Pack enables open-source models to outperform GPT-3.5 and GPT-4 in generating code for entirely new API calls. We show this by fine-tuning CodeLlama-13B on 20,000 Python instances from API Pack. Second, fine-tuning on a large dataset in one language, combined with smaller datasets from others, improves API generation accuracy across multiple languages. Third, we confirm the benefits of larger datasets for API generalization, as increasing fine-tuning data to one million instances enhances generalization to new APIs. To support further research, we open-source the API Pack dataset, trained model, and code at https://github.com/zguo0525/API-Pack.

NeurIPS Conference 2025 Conference Paper

Causal LLM Routing: End-to-End Regret Minimization from Observational Data

  • Asterios Tsiourvas
  • Wei Sun
  • Georgia Perakis

LLM routing aims to select the most appropriate model for each query, balancing competing performance metrics such as accuracy and cost across a pool of language models. Prior approaches typically adopt a decoupled strategy, where the metrics are first predicted and the model is then selected based on these estimates. This setup is prone to compounding errors and often relies on full-feedback data, where each query is evaluated by all candidate models, which is costly to obtain and maintain in practice. In contrast, we learn from observational data, which records only the outcome of the model actually deployed. We propose a causal end-to-end framework that learns routing policies by minimizing decision-making regret from observational data. To enable efficient optimization, we introduce two theoretically grounded surrogate objectives: a classification-based upper bound, and a softmax-weighted regret approximation shown to recover the optimal policy at convergence. We further extend our framework to handle heterogeneous cost preferences via an interval-conditioned architecture. Experiments on public benchmarks show that our method outperforms existing baselines, achieving state-of-the-art performance across different embedding models.

ICRA Conference 2025 Conference Paper

Design of a Bioinspired Jumping Mechanism for Self-Takeoff of Flapping Robot

  • Erzhen Pan
  • Wei Sun
  • Wenfu Xu

Most birds in nature rely on jumping for takeoff. Flapping-wing robots can flap and fly like birds but require an operator to take off: they are unable to generate sufficient lift to maintain flight at low airspeed and must accelerate to take-off speed in a short time, which poses a challenge for the design of the jumping mechanism. Inspired by the jump-takeoff of birds, this study designs a simple and lightweight jumping leg capable of storing and releasing energy with only one degree of freedom. In addition, a prototype with a wingspan of 2 meters and a mass of 1.6 kilograms was developed and tested, accelerating to 4 m/s in 52 ms by jumping and achieving jumping take-off from the ground.

JBHI Journal 2025 Journal Article

DoctorPupil: A Virtual Reality System for Parkinson's Diagnosis Through Task-Evoked Pupil Response

  • Xucheng Zhang
  • Zhirong Wan
  • Jing Zhao
  • Xinjin Li
  • Anfeng Liu
  • Xiangmin Fan
  • Wei Sun
  • Feng Tian

Parkinson's Disease (PD) is one of the most critical neurodegenerative diseases, yet there is no cure for it, and the state-of-the-art treatment is to slow its progression. Thus, the earlier a patient with PD is recognized, the better they can be treated. Our project joins the research effort that aims to support early PD diagnosis by designing a Virtual Reality (VR)-based system to monitor pupil diameter patterns as new biomarkers (e.g., Pupil Light Reflex and Task-Evoked Pupil Response) and provide early warning of potential PD onset. A follow-up experiment with 55 participants shows that the accuracy of recognizing early PD from healthy controls could reach 0.8942. Our study shows early results of a promising research direction that leverages VR-based technology to non-intrusively recognize patterns and provide alerts to early PD patients who would otherwise not know their symptoms until much later.

ICRA Conference 2025 Conference Paper

Doppler Former: Velocity Supervision of Raw Radar Data

  • Shuo Zhao
  • Wei Sun
  • Huadong Li
  • Zhaoying Jiang

Thanks to the high robustness of 4D millimeter-wave radar in various environments, it has been widely applied in the field of autonomous driving. Recent research has increasingly focused on utilizing raw data as a substitute for sparse and noisy point cloud data. However, these approaches have not fully exploited the Doppler features present in the raw data. In this paper, we introduce the Doppler Former (DPF) module to efficiently extract velocity information from the target environment. DPF can be seamlessly integrated into most radar perception backbones and enhance their performance in downstream tasks. Additionally, we propose a new backbone, the Fully Complex Convolutional Network (FCCN), which is better suited to raw data. By incorporating the DPF module into FCCN, we achieve state-of-the-art (SOTA) performance on the RADIal dataset, with code available at https://github.com/coconut-zs/Fvidar-DopplerFormer.

NeurIPS Conference 2025 Conference Paper

KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning

  • Wei Sun
  • Wen Yang
  • Pu Jian
  • Qianlong Du
  • Fuwei Cui
  • Shuo Ren
  • Jiajun Zhang

Recent advances have demonstrated that integrating reinforcement learning with rule-based rewards can significantly enhance the reasoning capabilities of large language models (LLMs), even without supervised fine-tuning (SFT). However, prevalent reinforcement learning algorithms such as GRPO and its variants like DAPO suffer from a coarse granularity issue when computing the advantage. Specifically, they compute rollout-level advantages that assign identical values to every token within a sequence, failing to capture token-specific contributions. To address this limitation, we propose Key-token Advantage Estimation (KTAE), a novel algorithm that estimates fine-grained, token-level advantages without introducing additional models. KTAE leverages the correctness of sampled rollouts and applies statistical analysis to quantify the importance of individual tokens within a sequence to the final outcome. This quantified token-level importance is then combined with the rollout-level advantage to obtain a more fine-grained token-level advantage estimation. Empirical results show that models trained with GRPO+KTAE and DAPO+KTAE outperform baseline methods across five mathematical reasoning benchmarks. Notably, they achieve higher accuracy with shorter responses and even surpass R1-Distill-Qwen-1.5B using the same base model.

ICRA Conference 2025 Conference Paper

LamPro: Multi-Prototype Representation Learning for Enhanced Visual Pattern Recognition

  • Ji Qi
  • Wei Sun
  • Qihe Huang
  • Zhengyang Zhou
  • Yang Wang 0015

Visual pattern recognition plays an important role in the robotics and automation community, where pattern recognition relies on representation learning. Existing representation learning often neglects two important issues: the diversity of intra-class representation and under-exploited label utilization, especially negative feedback during the training process. Fortunately, prototype learning potentially raises label utilization and encourages intra-class diversity. In this paper, we investigate intra-class diversity and effective updates in prototype learning for enhanced visual pattern recognition. Specifically, we propose Label-aware multi-Prototype learning, LamPro, which incorporates label awareness into both prototype formation and update to improve representation quality. Firstly, we design a supervised contrastive learning scheme to achieve class-discriminative representations. Secondly, we randomly initialize multiple prototypes and update the nearest prototype upon the arrival of an instance, to preserve intra-class diversity. Thirdly, we propose a novel Label-guided Adaptive Updating. We separate the prototype updates from the representation optimization and exploit the label indexes to directly implement the prediction feedback. To correct the model optimization directions, we identify the negative feedback and correct the prototype updates via queries of labels. Finally, we design a memory-based counter to alternately update these deviated prototypes. Experiments verify the effectiveness of our label-aware and joint multi-prototype updating strategies.

ICLR Conference 2025 Conference Paper

Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models

  • Linh Tran
  • Wei Sun
  • Stacy Patterson
  • Ana L. Milanova

Multimodal Large Language Models (LLMs) are pivotal in revolutionizing customer support and operations by integrating multiple modalities such as text, images, and audio. Federated Prompt Learning (FPL) is a recently proposed approach that combines pre-trained multimodal LLMs such as vision-language models with federated learning to create personalized, privacy-preserving AI systems. However, balancing the competing goals of personalization, generalization, and privacy remains a significant challenge. Over-personalization can lead to overfitting, reducing generalizability, while stringent privacy measures, such as differential privacy, can hinder both personalization and generalization. In this paper, we propose a Differentially Private Federated Prompt Learning (DP-FPL) approach to tackle this challenge by leveraging a low-rank factorization scheme to capture generalization while maintaining a residual term that preserves expressiveness for personalization. To ensure privacy, we introduce a novel method where we apply local differential privacy to the two low-rank components of the local prompt, and global differential privacy to the global prompt. Our approach mitigates the impact of privacy noise on the model performance while balancing the tradeoff between personalization and generalization. Extensive experiments demonstrate the effectiveness of our approach over other benchmarks.

ICLR Conference 2025 Conference Paper

Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving

  • Xiang Li 0205
  • Pengfei Li 0007
  • Yupeng Zheng
  • Wei Sun
  • Yan Wang
  • Yilun Chen

Understanding world dynamics is crucial for planning in autonomous driving. Recent methods attempt to achieve this by learning a 3D occupancy world model that forecasts future surrounding scenes based on current observation. However, 3D occupancy labels are still required to produce promising results. Considering the high annotation cost for 3D outdoor scenes, we propose a semi-supervised vision-centric 3D occupancy world model, PreWorld, to leverage the potential of 2D labels through a novel two-stage training paradigm: the self-supervised pre-training stage and the fully-supervised fine-tuning stage. Specifically, during the pre-training stage, we utilize an attribute projection head to generate different attribute fields of a scene (e.g., RGB, density, semantic), thus enabling temporal supervision from 2D labels via volume rendering techniques. Furthermore, we introduce a simple yet effective state-conditioned forecasting module to recursively forecast future occupancy and ego trajectory in a direct manner. Extensive experiments on the nuScenes dataset validate the effectiveness and scalability of our method, and demonstrate that PreWorld achieves competitive performance across 3D occupancy prediction, 4D occupancy forecasting and motion planning tasks.

ECAI Conference 2024 Conference Paper

ANTIDOTE: ArgumeNtaTIon-Driven explainable artificial intelligence fOr digiTal mEdicine

  • Cristian Cardellino
  • Theo Alkibiades Collias
  • Benjamin Molinet
  • Erwan Hain
  • Wei Sun
  • Rodrigo Agerri
  • Serena Villata
  • Elena Cabrio

The need for transparent AI systems in sensitive domains like medicine has become key. In this paper we present ANTIDOTE, a software suite proposing different tools for argumentation-driven explainable Artificial Intelligence for digital medicine. Our system offers the following functionalities: multilingual argumentative analysis for the medical domain, explanation extraction and generation of clinical diagnoses, multilingual large language models for the medical domain, and the first multilingual benchmark for medical question-answering. Experimental results demonstrate the efficacy of ANTIDOTE across different tasks, highlighting its potential as an asset in medical research and practice and fostering transparency, which is crucial for informed decision-making in healthcare.

IJCAI Conference 2024 Conference Paper

DiffStega: Towards Universal Training-Free Coverless Image Steganography with Diffusion Models

  • Yiwei Yang
  • Zheyuan Liu
  • Jun Jia
  • Zhongpai Gao
  • Yunhao Li
  • Wei Sun
  • Xiaohong Liu
  • Guangtao Zhai

Traditional image steganography focuses on concealing one image within another, aiming to avoid steganalysis by unauthorized entities. Coverless image steganography (CIS) enhances imperceptibility by not using any cover image. Recent works have utilized text prompts as keys in CIS through diffusion models. However, this approach faces three challenges: invalidation when the private prompt is guessed, the difficulty of crafting public prompts for semantic diversity, and the risk of prompt leakage during frequent transmission. To address these issues, we propose DiffStega, an innovative training-free diffusion-based CIS strategy for universal application. DiffStega uses a password-dependent reference image as an image prompt alongside the text, ensuring that only authorized parties can retrieve the hidden information. Furthermore, we develop a Noise Flip technique to further secure the steganography against unauthorized decryption. To comprehensively assess our method across general CIS tasks, we create a dataset comprising various image steganography instances. Experiments indicate substantial improvements in our method over existing ones, particularly in aspects of versatility, password sensitivity, and recovery quality. Codes are available at https://github.com/evtricks/DiffStega.

NeurIPS Conference 2024 Conference Paper

GAIA: Rethinking Action Quality Assessment for AI-Generated Videos

  • Zijian Chen
  • Wei Sun
  • Yuan Tian
  • Jun Jia
  • Zicheng Zhang
  • Jiarui Wang
  • Ru Huang
  • Xiongkuo Min

Assessing action quality is both imperative and challenging due to its significant impact on the quality of AI-generated videos, further complicated by the inherently ambiguous nature of actions within AI-generated video (AIGV). Current action quality assessment (AQA) algorithms predominantly focus on actions from real specific scenarios and are pre-trained with normative action features, thus rendering them inapplicable in AIGVs. To address these problems, we construct GAIA, a Generic AI-generated Action dataset, by conducting a large-scale subjective evaluation from a novel causal reasoning-based perspective, resulting in 971,244 ratings among 9,180 video-action pairs. Based on GAIA, we evaluate a suite of popular text-to-video (T2V) models on their ability to generate visually rational actions, revealing their pros and cons on different categories of actions. We also extend GAIA as a testbed to benchmark the AQA capacity of existing automatic evaluation methods. Results show that traditional AQA methods, action-related metrics in recent T2V benchmarks, and mainstream video quality methods perform poorly with an average SRCC of 0.454, 0.191, and 0.519, respectively, indicating a sizable gap between current models and human action perception patterns in AIGVs. Our findings underscore the significance of action quality as a unique perspective for studying AIGVs and can catalyze progress towards methods with enhanced capacities for AQA in AIGVs.

IROS Conference 2024 Conference Paper

Visual Timing For Sound Source Depth Estimation in the Wild

  • Wei Sun
  • Lili Qiu

Depth estimation enables a wide variety of 3D applications, such as robotics and autonomous driving. Despite significant work on various depth sensors, it is challenging to develop an all-in-one method that meets multiple basic criteria. In this paper, we propose a novel audio-visual learning scheme that integrates semantic features with physical spatial cues to boost monocular depth with only one microphone. Inspired by the flash-to-bang theory, we develop FBDepth, the first passive audio-visual depth estimation framework. It is based on the difference between the time-of-flight (ToF) of light and sound. We formulate sound source depth estimation as an audio-visual event localization task for collision events. To approach decimeter-level depth accuracy, we design a coarse-to-fine pipeline to push the temporal localization accuracy from event-level to millisecond-level by aligning audio-visual correspondence and manipulating optical flow. FBDepth feeds the estimated visual timestamp together with the audio clip and the object's visual features to regress the source depth. We use a mobile phone to collect 3.6K+ video clips with 24 different objects at up to 65 m. FBDepth shows superior performance, especially at long range, compared to monocular and stereo methods.

ICML Conference 2023 Conference Paper

Learning Prescriptive ReLU Networks

  • Wei Sun
  • Asterios Tsiourvas

We study the problem of learning optimal policy from a set of discrete treatment options using observational data. We propose a piecewise linear neural network model that can balance strong prescriptive performance and interpretability, which we refer to as the prescriptive ReLU network, or P-ReLU. We show analytically that this model (i) partitions the input space into disjoint polyhedra, where all instances that belong to the same partition receive the same treatment, and (ii) can be converted into an equivalent prescriptive tree with hyperplane splits for interpretability. We demonstrate the flexibility of the P-ReLU network as constraints can be easily incorporated with minor modifications to the architecture. Through experiments, we validate the superior prescriptive accuracy of P-ReLU against competing benchmarks. Lastly, we present examples of prescriptive trees extracted from trained P-ReLUs using a real-world dataset, for both the unconstrained and constrained scenarios.

IJCAI Conference 2023 Conference Paper

MM-PCQA: Multi-Modal Learning for No-reference Point Cloud Quality Assessment

  • Zicheng Zhang
  • Wei Sun
  • Xiongkuo Min
  • Qiyuan Wang
  • Jun He
  • Quan Zhou
  • Guangtao Zhai

The visual quality of point clouds has been greatly emphasized since the ever-increasing 3D vision applications are expected to provide cost-effective and high-quality experiences for users. Looking back on the development of point cloud quality assessment (PCQA), visual quality is usually evaluated using single-modal information, i.e., extracted from either the 2D projections or the 3D point cloud. The 2D projections contain rich texture and semantic information but are highly dependent on viewpoints, while the 3D point clouds are more sensitive to geometry distortions and invariant to viewpoints. Therefore, to leverage the advantages of both the point cloud and projected image modalities, we propose a novel no-reference Multi-Modal Point Cloud Quality Assessment (MM-PCQA) metric. Specifically, we split the point clouds into sub-models to represent local geometry distortions such as point shift and down-sampling. Then we render the point clouds into 2D image projections for texture feature extraction. To achieve these goals, the sub-models and projected images are encoded with point-based and image-based neural networks. Finally, symmetric cross-modal attention is employed to fuse multi-modal quality-aware information. Experimental results show that our approach outperforms all compared state-of-the-art methods and is far ahead of previous no-reference PCQA methods, which highlights the effectiveness of the proposed method. The code is available at https://github.com/zzc-1998/MM-PCQA.

AAAI Conference 2023 Conference Paper

Scalable Optimal Multiway-Split Decision Trees with Constraints

  • Shivaram Subramanian
  • Wei Sun

There has been a surge of interest in learning optimal decision trees using mixed-integer programs (MIP) in recent years, as heuristic-based methods do not guarantee optimality and find it challenging to incorporate constraints that are critical for many practical applications. However, existing MIP methods that build on an arc-based formulation do not scale well, as the number of binary variables is on the order of 2 to the power of the tree depth times the size of the dataset. Moreover, they can only handle sample-level constraints and linear metrics. In this paper, we propose a novel path-based MIP formulation where the number of decision variables is independent of dataset size. We present a scalable column generation framework to solve the MIP. Our framework produces a multiway-split tree which is more interpretable than typical binary-split trees due to its shorter rules. Our framework is more general as it can handle nonlinear metrics such as F1 score and incorporate a broader class of constraints. We demonstrate its efficacy with extensive experiments. We present results on datasets containing up to 1,008,372 samples, while existing MIP-based decision tree models do not scale well on data beyond a few thousand points. We report superior or competitive results compared to state-of-the-art MIP-based methods with up to a 24X reduction in runtime.

AAAI Conference 2022 Conference Paper

Constrained Prescriptive Trees via Column Generation

  • Shivaram Subramanian
  • Wei Sun
  • Youssef Drissi
  • Markus Ettl

With the abundance of available data, many enterprises seek to implement data-driven prescriptive analytics to help them make informed decisions. These prescriptive policies need to satisfy operational constraints, and proactively eliminate rule conflicts, both of which are ubiquitous in practice. It is also desirable for them to be simple and interpretable, so they can be easily verified and implemented. Existing approaches from the literature center around constructing variants of prescriptive decision trees to generate interpretable policies. However, none of the existing methods are able to handle constraints. In this paper, we propose a scalable method that solves the constrained prescriptive policy generation problem. We introduce a novel path-based mixed-integer program (MIP) formulation which identifies a (near) optimal policy efficiently via column generation. The policy generated can be represented as a multiway-split tree which is more interpretable and informative than a binary-split tree due to its shorter rules. We demonstrate the efficacy of our method with extensive experiments on both synthetic and real datasets.

AAAI Conference 2022 Conference Paper

Enhancing Counterfactual Classification Performance via Self-Training

  • Ruijiang Gao
  • Max Biggs
  • Wei Sun
  • Ligong Han

Unlike traditional supervised learning, in many settings only partial feedback is available. We may only observe outcomes for the chosen actions, but not the counterfactual outcomes associated with other alternatives. Such settings encompass a wide variety of applications including pricing, online marketing and precision medicine. A key challenge is that observational data are influenced by historical policies deployed in the system, yielding a biased data distribution. We approach this task as a domain adaptation problem and propose a self-training algorithm which imputes outcomes with categorical values for finite unseen actions in the observational data to simulate a randomized trial through pseudo-labeling, which we refer to as Counterfactual Self-Training (CST). CST iteratively imputes pseudo-labels and retrains the model. In addition, we show that an input consistency loss can further improve CST performance, a finding supported by recent theoretical analyses of pseudo-labeling. We demonstrate the effectiveness of the proposed algorithms on both synthetic and real datasets.
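The impute-then-retrain loop described in this abstract can be sketched generically. This is an illustrative toy, not the paper's code: `MajorityModel` is a hypothetical stand-in for the outcome classifier (the paper uses neural models), and all names are assumptions.

```python
from collections import defaultdict

class MajorityModel:
    """Toy stand-in for the outcome classifier: predicts the majority
    label observed for each action (the real CST uses neural models)."""
    def fit(self, X, a, y):
        counts = defaultdict(lambda: defaultdict(int))
        for ai, yi in zip(a, y):
            counts[ai][yi] += 1
        self.by_action = {ai: max(c, key=c.get) for ai, c in counts.items()}
    def predict(self, x, a):
        return self.by_action[a]

def counterfactual_self_training(model, X, a_obs, y_obs, actions, rounds=3):
    """Impute pseudo-labels for unobserved (x, action) pairs using the
    model's own predictions, retrain on the augmented data, and repeat,
    simulating a randomized trial over all actions."""
    model.fit(X, a_obs, y_obs)                       # warm start on logged data
    for _ in range(rounds):
        Xa, aa, ya = list(X), list(a_obs), list(y_obs)
        for i, x in enumerate(X):
            for a in actions:
                if a != a_obs[i]:                    # counterfactual action
                    Xa.append(x)
                    aa.append(a)
                    ya.append(model.predict(x, a))   # pseudo-label
        model.fit(Xa, aa, ya)                        # retrain on augmented data
    return model
```

The key design point the abstract highlights is that the augmented dataset covers every action for every sample, which removes the logging policy's action bias at the cost of trusting the model's own pseudo-labels.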

AAAI Conference 2022 Conference Paper

HNO: High-Order Numerical Architecture for ODE-Inspired Deep Unfolding Networks

  • Lin Kong
  • Wei Sun
  • Fanhua Shang
  • Yuanyuan Liu
  • Hongying Liu

Recently, deep unfolding networks (DUNs) based on optimization algorithms have received increasing attention, and their high efficiency has been confirmed by many experimental and theoretical results. Since these networks incorporate model-based traditional optimization algorithms, they are highly interpretable. In addition, ordinary differential equations (ODEs) are often used to explain deep neural networks, and provide some inspiration for designing innovative network models. In this paper, we transform DUNs into first-order ODE forms, and propose a high-order numerical architecture for ODE-inspired deep unfolding networks. To the best of our knowledge, this is the first work to establish the relationship between DUNs and ODEs. Moreover, we take two representative DUNs as examples, apply our architecture to them and design novel DUNs. In theory, we prove the existence and uniqueness of the solution and the convergence of the proposed network, and also prove that our network attains a fast linear convergence rate. Extensive experiments verify the effectiveness and advantages of our architecture.

TIST Journal 2022 Journal Article

Multitask Balanced and Recalibrated Network for Medical Code Prediction

  • Wei Sun
  • Shaoxiong Ji
  • Erik Cambria
  • Pekka Marttinen

Human coders assign standardized medical codes to clinical documents generated during patients' hospitalization, a process that is error prone and labor intensive. Automated medical coding approaches have been developed using machine learning methods, such as deep neural networks. Nevertheless, automated medical coding is still challenging because of complex code association, noise in lengthy documents, and the imbalanced class problem. We propose a novel neural network, called the Multitask Balanced and Recalibrated Neural Network, to solve these issues. Specifically, the multitask learning scheme shares the relationship knowledge between different coding branches to capture code association. A recalibrated aggregation module is developed by cascading convolutional blocks to extract high-level semantic features that mitigate the impact of noise in documents. Also, the cascaded structure of the recalibrated module can benefit learning from lengthy notes. To solve the imbalanced class problem, we deploy focal loss to redistribute the attention on low- and high-frequency medical codes. Experimental results show that our proposed model outperforms competitive baselines on a real-world clinical dataset called the Medical Information Mart for Intensive Care (MIMIC-III).
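The focal loss mentioned above can be sketched in its generic binary form (a standard formulation, not this paper's exact implementation; `gamma` and `alpha` are the usual focusing and balancing hyperparameters):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single label: the (1 - pt)**gamma factor
    down-weights easy, well-classified examples so training attends to
    rare, hard codes. p is the predicted probability of the positive
    class; y is the 0/1 ground truth."""
    pt = p if y == 1 else 1.0 - p          # probability of the true class
    w = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    return -w * (1.0 - pt) ** gamma * math.log(pt)

# A confident correct prediction contributes far less loss than a
# misclassified one, shifting gradient mass toward low-frequency codes:
print(focal_loss(0.95, 1) < focal_loss(0.10, 1))  # True
```

With `gamma=0` and `alpha=1` this reduces to ordinary cross-entropy, which makes the rebalancing effect of the two hyperparameters easy to isolate.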

NeurIPS Conference 2022 Conference Paper

Video-based Human-Object Interaction Detection from Tubelet Tokens

  • Danyang Tu
  • Wei Sun
  • Xiongkuo Min
  • Guangtao Zhai
  • Wei Shen

We present a novel vision Transformer, named TUTOR, which is able to learn tubelet tokens, serving as highly-abstracted spatial-temporal representations, for video-based human-object interaction (V-HOI) detection. The tubelet tokens structurize videos by agglomerating and linking semantically-related patch tokens along spatial and temporal domains, which enjoy two benefits: 1) Compactness: each token is learned by a selective attention mechanism to reduce redundant dependencies from others; 2) Expressiveness: each token is enabled to align with a semantic instance, i.e., an object or a human, thanks to agglomeration and linking. The effectiveness and efficiency of TUTOR are verified by extensive experiments. Results show our method outperforms existing works by large margins, with a relative mAP gain of 16.14% on VidHOI and a 2-point gain on CAD-120, as well as a 4× speedup.

NeurIPS Conference 2021 Conference Paper

DominoSearch: Find layer-wise fine-grained N:M sparse schemes from dense neural networks

  • Wei Sun
  • Aojun Zhou
  • Sander Stuijk
  • Rob Wijnhoven
  • Andrew O. Nelson
  • Hongsheng Li
  • Henk Corporaal

Neural pruning is a widely-used compression technique for Deep Neural Networks (DNNs). Recent innovations in hardware architectures (e.g. Nvidia Ampere Sparse Tensor Core) and N:M fine-grained sparse neural network algorithms (i.e. every M weights contain N non-zero values) reveal a promising research line of neural pruning. However, the existing N:M algorithms only address the challenge of how to train N:M sparse neural networks in a uniform fashion (i.e. every layer has the same N:M sparsity) and suffer from a significant accuracy drop for high sparsity (i.e. when sparsity > 80%). To tackle this problem, we present a novel technique, DominoSearch, to find mixed N:M sparsity schemes from pre-trained dense deep neural networks that achieve higher accuracy than the uniform-sparsity scheme under equivalent complexity constraints (e.g. model size or FLOPs). For instance, for the same model size with 2.1M parameters (87.5% sparsity), our layer-wise N:M sparse ResNet18 outperforms its uniform counterpart by 2.1% top-1 accuracy on the large-scale ImageNet dataset. For the same computational complexity of 227M FLOPs, our layer-wise sparse ResNet18 outperforms the uniform one by 1.3% top-1 accuracy. Furthermore, our layer-wise fine-grained N:M sparse ResNet50 achieves 76.7% top-1 accuracy with 5.0M parameters. This is competitive with the results achieved by layer-wise unstructured sparsity, which is believed to be the upper bound of neural network pruning with respect to the accuracy-sparsity trade-off. We believe that our work can build a strong baseline for further sparse DNN research and encourage future hardware-algorithm co-design work. Our code and models are publicly available at https://github.com/NM-sparsity/DominoSearch.
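The N:M constraint itself (N non-zeros in every group of M consecutive weights) is easy to illustrate with a magnitude-based projection. A minimal sketch, not the paper's search algorithm, which instead finds a per-layer (n, m) scheme:

```python
def project_nm(weights, n=2, m=4):
    """Project a flat list of weights onto an N:M sparse pattern: in
    every group of m consecutive weights, keep the n entries of largest
    magnitude and zero out the rest."""
    out = list(weights)
    for g in range(0, len(out), m):
        group = list(range(g, min(g + m, len(out))))
        # indices of the n largest-magnitude weights in this group
        keep = set(sorted(group, key=lambda i: abs(out[i]), reverse=True)[:n])
        for i in group:
            if i not in keep:
                out[i] = 0.0
    return out

w = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
print(project_nm(w))  # each 4-weight block keeps its 2 largest magnitudes
```

Uniform schemes apply one (n, m) everywhere; the paper's contribution is choosing it per layer under a global size or FLOPs budget, which this sketch does not attempt.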

AAAI Conference 2020 Conference Paper

Fatigue-Aware Bandits for Dependent Click Models

  • Junyu Cao
  • Wei Sun
  • Zuo-Jun (Max) Shen
  • Markus Ettl

As recommender systems send a massive amount of content to keep users engaged, users may experience fatigue, driven by 1) overexposure to irrelevant content and 2) boredom from seeing too many similar recommendations. To address this problem, we consider an online learning setting where a platform learns a policy to recommend content that takes user fatigue into account. We propose an extension of the Dependent Click Model (DCM) to describe users' behavior. We stipulate that for each piece of content, its attractiveness to a user depends on its intrinsic relevance and a discount factor which measures how many similar items have been shown. Users view the recommended content sequentially and click on the ones that they find attractive. Users may leave the platform at any time, and the probability of exiting is higher when they do not like the content. Based on users' feedback, the platform learns the relevance of the underlying content as well as the discounting effect due to content fatigue. We refer to this learning task as the "fatigue-aware DCM Bandit" problem. We consider two learning scenarios depending on whether the discounting effect is known. For each scenario, we propose a learning algorithm which simultaneously explores and exploits, and characterize its regret bound.
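The user model in this abstract can be sketched as a small simulator. All parameter names are hypothetical and the dynamics are simplified relative to the paper's full model:

```python
import random

def simulate_session(relevance, discount, similar_seen, exit_prob, seed=0):
    """Simulate one user session under a fatigue-aware DCM-style model:
    item k's attractiveness is its intrinsic relevance scaled by
    discount**s_k, where s_k counts how many similar items were already
    shown. A user who is not attracted may abandon the session with
    probability exit_prob."""
    rng = random.Random(seed)
    clicks = []
    for k, r in enumerate(relevance):
        attractiveness = r * discount ** similar_seen[k]  # fatigue-discounted
        if rng.random() < attractiveness:
            clicks.append(k)                              # user clicks
        elif rng.random() < exit_prob:
            break                                         # user abandons
    return clicks
```

The bandit algorithms in the paper would interact with such an environment repeatedly, estimating `relevance` (and, in one scenario, `discount`) from observed clicks and exits.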

AAAI Conference 2019 Conference Paper

Dynamic Learning of Sequential Choice Bandit Problem under Marketing Fatigue

  • Junyu Cao
  • Wei Sun

Motivated by the observation that overexposure to unwanted marketing activities leads to customer dissatisfaction, we consider a setting where a platform offers a sequence of messages to its users and is penalized when users abandon the platform due to marketing fatigue. We propose a novel sequential choice model to capture multiple interactions taking place between the platform and its user: Upon receiving a message, a user decides on one of three actions: accept the message, skip and receive the next message, or abandon the platform. Based on user feedback, the platform dynamically learns users' abandonment distribution and their valuations of messages to determine the length of the sequence and the order of the messages, while maximizing the cumulative payoff over a horizon of length T. We refer to this online learning task as the sequential choice bandit problem. For the offline combinatorial optimization problem, we give a polynomial-time algorithm. For the online problem, we propose an algorithm that balances exploration and exploitation, and characterize its regret bound. Lastly, we demonstrate how to extend the model with user contexts to incorporate personalization.

IJCAI Conference 2019 Conference Paper

Learn to Select via Hierarchical Gate Mechanism for Aspect-Based Sentiment Analysis

  • Xiangying Ran
  • Yuanyuan Pan
  • Wei Sun
  • Chongjun Wang

Aspect-based sentiment analysis (ABSA) is a fine-grained task. A Recurrent Neural Network (RNN) armed with an attention mechanism seems a natural fit for this task, and has recently achieved state-of-the-art performance. However, previous attention mechanisms proposed for ABSA may attend to irrelevant words and thus downgrade performance, especially when dealing with long and complex sentences with multiple aspects. In this paper, we propose a novel architecture named Hierarchical Gate Memory Network (HGMN) for ABSA: first, we employ the proposed hierarchical gate mechanism to learn to select the parts related to the given aspect, while preserving the original sequence structure of the sentence. After that, we apply a Convolutional Neural Network (CNN) on the final aspect-specific memory. We conduct extensive experiments on the SemEval 2014 and Twitter datasets, and results demonstrate that our model outperforms attention-based state-of-the-art baselines.

JBHI Journal 2019 Journal Article

Weakly Supervised Biomedical Image Segmentation by Reiterative Learning

  • Qiaokang Liang
  • Yang Nan
  • Gianmarc Coppola
  • Kunglin Zou
  • Wei Sun
  • Dan Zhang
  • Yaonan Wang
  • Guanzhen Yu

Recent advances in deep learning have produced encouraging results for biomedical image segmentation; however, outcomes rely heavily on comprehensive annotation. In this paper, we propose a neural network architecture and a new algorithm, known as overlapped region forecast, for the automatic segmentation of gastric cancer images. To the best of our knowledge, this is the first application of deep learning to the segmentation of gastric cancer images. Moreover, a reiterative learning framework that achieves superior performance without pretraining or further manual annotation is presented to train a simple network on weakly annotated biomedical images. We customize the loss function to make the model converge faster while avoiding becoming trapped in local minima. Patch boundary errors were eliminated by our overlapped region forecast algorithm. By studying the characteristics of the model trained using two different patch extraction methods, we train iteratively and integrate predictions and weak annotations to improve the quality of the training data. Using these methods, a mean Intersection over Union coefficient of 0.883 and a mean accuracy of 91.09% were achieved on the partially labeled dataset, thereby securing a win in the 2017 China Big Data and Artificial Intelligence Innovation and Entrepreneurship Competition.

JAIR Journal 2018 Journal Article

Graphical Model Market Maker for Combinatorial Prediction Markets

  • Kathryn Blackmond Laskey
  • Wei Sun
  • Robin Hanson
  • Charles Twardy
  • Shou Matsumoto
  • Brandon Goldfedder

We describe algorithms for use by prediction markets in forming a crowd consensus joint probability distribution over thousands of related events. Equivalently, we describe market mechanisms to efficiently crowdsource both structure and parameters of a Bayesian network. Prediction markets are among the most accurate methods to combine forecasts; forecasters form a consensus probability distribution by trading contingent securities. A combinatorial prediction market forms a consensus joint distribution over many related events by allowing conditional trades or trades on Boolean combinations of events. Explicitly representing the joint distribution is infeasible, but standard inference algorithms for graphical probability models render it tractable for large numbers of base events. We show how to adapt these algorithms to compute expected assets conditional on a prospective trade, and to find the conditional state where a trader has minimum assets, allowing full asset reuse. We compare the performance of three algorithms: the straightforward algorithm from the DAGGRE (Decomposition-Based Aggregation) prediction market for geopolitical events, the simple block-merge model from the SciCast market for science and technology forecasting, and a more sophisticated algorithm we developed for future markets.

NeurIPS Conference 2018 Conference Paper

Sketching Method for Large Scale Combinatorial Inference

  • Wei Sun
  • Junwei Lu
  • Han Liu

We present computationally efficient algorithms to test various combinatorial structures of large-scale graphical models. In order to test the hypotheses on their topological structures, we propose two adjacency matrix sketching frameworks: neighborhood sketching and subgraph sketching. The neighborhood sketching algorithm is proposed to test the connectivity of graphical models. This algorithm randomly subsamples vertices and conducts neighborhood regression and screening. The subgraph sketching algorithm is proposed to test topological properties requiring exponential computational complexity, especially testing the chromatic number and the maximum clique. This algorithm infers the corresponding property based on the sampled subgraph. Our algorithms are shown to substantially accelerate the computation of existing methods. We validate our theory and method through both synthetic simulations and a real application in neuroscience.

AAAI Conference 2015 Conference Paper

Causal Inference via Sparse Additive Models with Application to Online Advertising

  • Wei Sun
  • Pengyuan Wang
  • Dawei Yin
  • Jian Yang
  • Yi Chang

Advertising effectiveness measurement is a fundamental problem in online advertising. Various causal inference methods have been employed to measure the causal effects of ad treatments. However, existing methods mainly focus on linear logistic regression for univariate and binary treatments and are not well suited for complex multi-dimensional ad treatments, where each dimension could be discrete or continuous. In this paper we propose a novel two-stage causal inference framework for assessing the impact of complex ad treatments. In the first stage, we estimate the propensity parameter via a sparse additive model; in the second stage, a propensity-adjusted regression model is applied for measuring the treatment effect. Our approach is shown to provide an unbiased estimation of ad effectiveness under regularity conditions. To demonstrate the efficacy of our approach, we apply it to a real online advertising campaign to evaluate the impact of three ad treatments: ad frequency, ad channel, and ad size. We show that ad frequency usually has a treatment effect cap when ads are shown on mobile devices. In addition, the strategies for choosing the best ad size are completely different for mobile ads and online ads.

NeurIPS Conference 2015 Conference Paper

Non-convex Statistical Optimization for Sparse Tensor Graphical Model

  • Wei Sun
  • Zhaoran Wang
  • Han Liu
  • Guang Cheng

We consider the estimation of sparse graphical models that characterize the dependency structure of high-dimensional tensor-valued data. To facilitate the estimation of the precision matrix corresponding to each way of the tensor, we assume the data follow a tensor normal distribution whose covariance has a Kronecker product structure. The penalized maximum likelihood estimation of this model involves minimizing a non-convex objective function. In spite of the non-convexity of this estimation problem, we prove that an alternating minimization algorithm, which iteratively estimates each sparse precision matrix while fixing the others, attains an estimator with the optimal statistical rate of convergence as well as consistent graph recovery. Notably, such an estimator achieves estimation consistency with only one tensor sample, which had not been achieved in previous work. Our theoretical results are backed by thorough numerical studies.

JMLR Journal 2013 Journal Article

Consistent Selection of Tuning Parameters via Variable Selection Stability

  • Wei Sun
  • Junhui Wang
  • Yixin Fang

Penalized regression models are popularly used in high-dimensional data analysis to conduct variable selection and model fitting simultaneously. Whereas success has been widely reported in the literature, their performance largely depends on the tuning parameters that balance the trade-off between model fitting and model sparsity. Existing tuning criteria mainly follow the route of minimizing the estimated prediction error or maximizing the posterior model probability, such as cross validation, AIC and BIC. This article introduces a general tuning parameter selection criterion based on variable selection stability. The key idea is to select the tuning parameters so that the resultant penalized regression model is stable in variable selection. The asymptotic selection consistency is established for both fixed and diverging dimensions. Its effectiveness is also demonstrated in a variety of simulated examples as well as an application to the prostate cancer data.
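The stability idea in this last abstract can be sketched as a resampling procedure. This is an illustrative simplification, not the paper's exact criterion: it uses Jaccard overlap as the agreement measure, and `select` stands for any selector, e.g. a lasso fit at a fixed tuning parameter:

```python
import random

def selection_stability(select, data, n_splits=20, seed=0):
    """Estimate variable-selection stability for one tuning-parameter
    value: repeatedly split the data in half, run the selector on each
    half, and average the Jaccard overlap of the two selected variable
    sets. `select` maps a dataset to a set of selected variable indices."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        idx = list(range(len(data)))
        rng.shuffle(idx)
        half = len(idx) // 2
        s1 = select([data[i] for i in idx[:half]])
        s2 = select([data[i] for i in idx[half:]])
        union = s1 | s2
        scores.append(len(s1 & s2) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

# A selector that ignores the sample is perfectly stable:
print(selection_stability(lambda d: {0, 3}, list(range(10))))  # 1.0
```

Tuning then amounts to picking the parameter value whose selected variable set is most reproducible across such splits, rather than the one minimizing estimated prediction error.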