Author name cluster

Wei Shen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

46 papers

2 author rows

AAAI Conference 2026 Conference Paper

A Novel Retrieve-Read-Group Paradigm for Open Knowledge Base Canonicalization

Binhan Yang
Wei Shen
Han Tian

Noun phrases (NPs) in open knowledge bases (OKBs) are not canonicalized, leading to scattered knowledge that necessitates the exploration of the OKB canonicalization task (i.e., clustering synonymous noun phrases into the same group and assigning them a unique identifier). However, existing OKB canonicalization methods typically adhere to a traditional embedding-centered pipeline, which fails to exploit the direct interaction between NPs for pairwise NP similarity calculations, resulting in suboptimal performance and instead relying extensively on external resources. To address these limitations, we introduce a groundbreaking retrieve-read-group paradigm that enables fine-grained pairwise NP similarity calculations by effectively leveraging the direct NP interaction via the reading stage, thereby relieving the reliance on external resources. As an instantiation of this paradigm, we propose DUVK, a novel self-supervised framework that fully integrates the dual-view knowledge involved in OKBs from the relational view and the semantic view. In the retriever component of DUVK, a dual-view cross-training strategy is designed to make two view-specific encoders mutually reinforce each other by capitalizing on the complementary knowledge delivered from both views. Experimental results demonstrate that, even without the need of any external resources, DUVK outperforms all state-of-the-art competitors that rely on such resources.

AAAI Conference 2026 Conference Paper

CP-CLIP: Customized Parameter Generation for Open-vocabulary Semantic Segmentation

Zelin Peng
Zhengqin Xu
Feilong Tang
Wei Shen

Open-vocabulary semantic segmentation aims to assign pixel-level labels to images based on textual descriptions, even for categories beyond predefined closed sets. While vision-language foundation models like CLIP are widely used for this task, fine-tuning them for pixel-level predictions often compromises their generalization capabilities. To address this, we propose a novel fine-tuning strategy, CP-CLIP, which generates customized parameters for CLIP without sacrificing its generalization. Our method employs a customized parameter generator that produces newly added parameters based on random noise, using local visual features from CLIP's image encoder as conditions, enabling generalization to new images from unseen scenarios. Additionally, we introduce an orthogonal adaptation technique to ensure the update direction is orthogonal to the pre-trained weights, largely preserving the initial generalization ability. Extensive experiments demonstrate that CP-CLIP achieves state-of-the-art performance across multiple benchmarks in open-vocabulary semantic segmentation.

AAAI Conference 2026 Conference Paper

Dereflection Any Image with Diffusion Priors and Diversified Data

Jichen Hu
Chen Yang
Zanwei Zhou
Jiemin Fang
Qi Tian
Wei Shen

Reflection removal of a single image remains a highly challenging task due to the complex entanglement between target scenes and unwanted reflections. Despite significant progress, existing methods are hindered by the scarcity of high-quality, diverse data and insufficient restoration priors, resulting in limited generalization across various real-world scenarios. In this paper, we propose Dereflection Any Image, a comprehensive solution with an efficient data preparation pipeline and a generalizable model for robust reflection removal. First, we introduce a dataset named Diverse Reflection Removal (DRR) created by randomly rotating reflective mediums in target scenes, enabling variation of reflection angles and intensities, and setting a new benchmark in scale, quality, and diversity. Second, we propose a diffusion-based framework with one-step diffusion for deterministic outputs and fast inference. To ensure stable learning, we design a three-stage progressive training strategy, including reflection-invariant finetuning to encourage consistent outputs across varying reflection patterns that characterize our dataset. Extensive experiments show that our method achieves SOTA performance on both common benchmarks and challenging in-the-wild images, showing superior generalization across diverse real-world scenes.

AAAI Conference 2026 Conference Paper

Efficient Segmentation with Multimodal Large Language Model via Token Routing

Changsong Wen
Zelin Peng
Yu Huang
Wei Shen

Recent advances in multimodal large language models (MLLMs) have demonstrated strong capabilities in addressing open-world segmentation tasks. However, the substantial computational cost of the LLM components presents a significant challenge, especially in segmentation tasks, where efficiency has long been a central concern. Existing efficient MLLM approaches typically reduce computation cost by pruning visual tokens in the early layers, as they account for the majority of the input sequence. Despite their efficiency, these approaches are incompatible with dense prediction tasks such as segmentation, since removing visual tokens leads to the loss of essential object parts and spatial details. To better understand the roles of visual tokens in segmentation, we analyze the attention weights of both image and mask tokens within the LLM. We find that image tokens are important throughout all layers, whereas mask tokens only attend to image tokens at deeper layers. Based on this observation, we build an efficient MLLM-based segmentation framework by introducing a token routing strategy. This strategy dynamically determines when and how different tokens participate in computation. Mask tokens are inserted only at deeper layers of the LLM to reduce redundant computation, since they rarely attend to image tokens in early layers. For image tokens, only a small number of them, named proxies, are updated via full feedforward network (FFN) computation, while the update of the remaining tokens is guided by these proxies, i.e., efficiently computed through a lightweight projector applied to the difference of the proxies during their update. Our method achieves a 1.5× acceleration over the original LLM process by reducing its FLOPs to 56%, while maintaining the same segmentation performance.

AAAI Conference 2026 Conference Paper

Few-step Flow for 3D Generation via Marginal-Data Transport Distillation

Zanwei Zhou
Taoran Yi
Jiemin Fang
Chen Yang
Lingxi Xie
Xinggang Wang
Wei Shen
Qi Tian

Flow-based 3D generation models typically require dozens of sampling steps during inference. Though few-step distillation methods, particularly Consistency Models (CMs), have achieved substantial advancements in accelerating 2D diffusion models, they remain under-explored for more complex 3D generation tasks. In this study, we propose a novel framework, MDT-dist, for few-step 3D flow distillation. Our approach is built upon a primary objective: distilling the pretrained model to learn the Marginal-Data Transport. Directly learning this objective requires integrating the velocity fields, but this integral is intractable. Therefore, we propose two optimizable objectives, Velocity Matching (VM) and Velocity Distillation (VD), to equivalently convert the optimization target from the transport level to the velocity and the distribution level, respectively. VM learns to stably match the velocity fields between the student and the teacher, but inevitably provides biased gradient estimates. VD further enhances the optimization process by leveraging the learned velocity fields to perform probability density distillation. When evaluated on the pioneering 3D generation framework TRELLIS, our method reduces the sampling steps of each flow transformer from 25 to 1–2, achieving 0.68s (1 step x2) and 0.94s (2 steps x2) latency with 9.0x and 6.5x speedup on A800, while preserving high visual and geometric fidelity. Experiments demonstrate that our method significantly outperforms existing CM distillation methods, and enables TRELLIS to achieve superior performance in few-step 3D generation.

AAAI Conference 2026 Conference Paper

Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

Junshu Pan
Wei Shen
Shulin Huang
Qiji Zhou
Yue Zhang

Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly training on offline preference data to align with human preferences. During DPO training, the reference model serves as a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the absence of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and requires stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that improves preference optimization by introducing a guiding reference model. This reference model provides foresight into the desired policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on the AlpacaEval 2 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.
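
To make the "data weight adjuster" remark concrete, here is a minimal sketch of the standard DPO objective for a single preference pair, written in plain Python. It is not the Pre-DPO implementation: the function names, the scalar log-probability inputs, and the default `beta` are assumptions chosen for illustration. The second function returns the per-sample factor that scales the DPO gradient, which is where the choice of reference model changes how much each sample counts.

```python
import math


def dpo_margin(policy_chosen_logp, policy_rejected_logp,
               ref_chosen_logp, ref_rejected_logp):
    """Log-ratio margin: how much more the policy favors the chosen
    response over the rejected one, relative to the reference model."""
    return ((policy_chosen_logp - ref_chosen_logp)
            - (policy_rejected_logp - ref_rejected_logp))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss, -log(sigmoid(beta * margin)), for one pair.

    Computed as softplus(-beta * margin) for numerical stability.
    """
    z = -beta * dpo_margin(policy_chosen_logp, policy_rejected_logp,
                           ref_chosen_logp, ref_rejected_logp)
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_gradient_weight(policy_chosen_logp, policy_rejected_logp,
                        ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-sample factor sigmoid(-beta * margin) that scales the DPO gradient.

    A different reference model changes the margin and therefore this weight.
    """
    margin = dpo_margin(policy_chosen_logp, policy_rejected_logp,
                        ref_chosen_logp, ref_rejected_logp)
    return 1.0 / (1.0 + math.exp(beta * margin))
```

When the policy and reference are initialized identically, the margin is zero for every pair, so every sample starts with the same weight of 0.5. A reference that differs from the policy yields non-uniform weights from the first step.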

AAAI Conference 2026 Conference Paper

WorldGrow: Generating Infinite 3D World

Sikuang Li
Chen Yang
Jiemin Fang
Taoran Yi
Jia Lu
Jiazhong Cen
Lingxi Xie
Wei Shen

We tackle the challenge of generating the infinitely extendable 3D world -- large, continuous environments with coherent geometry and realistic appearance. Existing methods face key challenges: 2D-lifting approaches suffer from geometric and appearance inconsistencies across views, 3D implicit representations are hard to scale up, and current 3D foundation models are mostly object-centric, limiting their applicability to scene-level generation. Our key insight is leveraging strong generation priors from pre-trained 3D models for structured scene block generation. To this end, we propose WorldGrow, a hierarchical framework for unbounded 3D scene synthesis. Our method features three core components: (1) a data curation pipeline that extracts high-quality scene blocks for training, making the 3D structured latent representations suitable for scene generation; (2) a 3D block inpainting mechanism that enables context-aware scene extension; and (3) a coarse-to-fine generation strategy that ensures both global layout plausibility and local geometric/textural fidelity. Evaluated on the large-scale 3D-FRONT dataset, WorldGrow achieves SOTA performance in geometry reconstruction, while uniquely supporting infinite scene generation with photorealistic and structurally consistent outputs. These results highlight its capability for constructing large-scale virtual environments and potential for building future world models.

NeurIPS Conference 2025 Conference Paper

A Single-Loop First-Order Algorithm for Linearly Constrained Bilevel Optimization

Wei Shen
Jiawei Zhang
Minhui Huang
Cong Shen

We study bilevel optimization problems where the lower-level problems are strongly convex and have coupled linear constraints. To overcome the potential non-smoothness of the hyper-objective and the computational challenges associated with the Hessian matrix, we utilize penalty and augmented Lagrangian methods to reformulate the original problem as a single-level one. In particular, we establish a strong theoretical connection between the reformulated function and the original hyper-objective by characterizing the closeness of their values and derivatives. Based on this reformulation, we propose a single-loop, first-order algorithm for linearly constrained bilevel optimization (SFLCB). We provide rigorous analyses of its non-asymptotic convergence rates, showing an improvement over prior double-loop algorithms, from $O(\epsilon^{-3}\log(\epsilon^{-1}))$ to $O(\epsilon^{-3})$. The experiments corroborate our theoretical findings and demonstrate the practical efficiency of the proposed SFLCB algorithm. Simulation code is provided at https://github.com/ShenGroup/SFLCB.

JBHI Journal 2025 Journal Article

DMformer: Difficulty-adapted Masked Transformer for Semi-Supervised Medical Image Segmentation

Zelin Peng
Guanchun Wang
Zhengqin Xu
Xiaokang Yang
Wei Shen

The shared anatomy among different human bodies can serve as a strong prior for effectively leveraging unlabeled data in semi-supervised medical image segmentation. Inspired by the success of masked image modeling, we notice that this prior can be explicitly realized by incorporating an auxiliary unsupervised gross anatomy reconstruction task into a teacher-student semi-supervised segmentation framework. In this auxiliary task, consistency is maintained between the student's predictions on masked images and the teacher's predictions on the original images. Despite its potential, we observe that the reconstruction difficulties of different organs/tissues can vary significantly and therefore reconstructing them requires tailored learning strategies. To address this issue, we introduce a difficulty-adapted mask mechanism based on the teacher-student framework, wherein the reconstruction difficulty is adapted to facilitate training. Specifically, we control the reconstruction difficulty by modulating two important factors: masked region ratio and masked class ratio. Accordingly, we design two corresponding mask strategies. 1) Region-based masking: randomly masks a fraction of each class according to an automatically computed mask ratio. 2) Class-based masking: masks the entire regions of the specific classes according to the class confidence predicted by the teacher model. During training, a conflict-aware gradient computation strategy is introduced to mitigate potential optimization conflicts arising from modulating the two reconstruction factors simultaneously. By building on vision transformers, we develop a Difficulty-adapted Masked Transformer (DMformer) for semi-supervised medical image segmentation. Extensive experiments demonstrate the superiority of DMformer, which outperforms the previous SOTA by 9.53% and 4.63% in terms of DSC on the ACDC dataset with 5% labeled images and the Synapse dataset with 30% labeled images, respectively.
Code is available at: https://github.com/SJTU-DeepVisionLab/DMformer.

NeurIPS Conference 2025 Conference Paper

Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback

Wei Shen
Guanlin Liu
Yu Yue
Ruofei Zhu
Qingping Yang
Chao Xin
Lin Yan

Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences and values. While recent research has primarily focused on algorithmic advancements, such as reducing computational overhead or strengthening reward models to mitigate reward hacking, the critical role of prompt-data construction and its scalability has received comparatively less attention. In this paper, we address this gap by systematically exploring data-driven bottlenecks that currently hinder RLHF performance scaling, focusing specifically on the challenges posed by reward hacking and decreasing response diversity. To mitigate reward hacking, we introduce a hybrid reward system combining reasoning task verifiers (RTV) and a generative reward model (GenRM). This approach not only exhibits enhanced resistance to reward hacking, but also enables accurate assessment of responses against clearly defined ground-truth solutions. Additionally, in order to ensure response diversity and enhance learning effectiveness, we propose a novel prompt-selection method named Pre-PPO, explicitly identifying training prompts that are inherently challenging and thus less prone to reward hacking. Furthermore, we find that prioritizing mathematical and coding tasks during the early phases of RLHF training significantly boosts performance, given that these tasks naturally encode fine-grained response distinctions and possess clearly defined ground truths. Through comprehensive experiments conducted across two model sizes, we validate the effectiveness and scalability of our proposed methods. Results show that RTV exhibits the strongest resistance to reward hacking, followed by GenRM with ground truth, and finally GenRM relying on SFT Best-of-N responses. Moreover, our proposed strategies enable the model to rapidly capture subtle task-specific distinctions, leading to substantial improvements in overall RLHF performance.
This work underscores the importance of careful data construction and provides practical methodologies to overcome critical performance barriers in RLHF.

AAAI Conference 2025 Conference Paper

FATE: Feature-Adapted Parameter Tuning for Vision-Language Models

Zhengqin Xu
Zelin Peng
Xiaokang Yang
Wei Shen

Following the recent popularity of vision-language models, several attempts, e.g., parameter-efficient fine-tuning (PEFT), have been made to extend them to different downstream tasks. Previous PEFT works motivate their methods from the view of introducing new parameters for adaptation but still need to learn these weights from scratch, i.e., from random initialization. In this paper, we present a novel strategy that incorporates the potential of prompts, e.g., vision features, to facilitate adapting the initial parameter space to new scenarios. We introduce a Feature-Adapted parameTer Efficient tuning paradigm for vision-language models, dubbed FATE, which injects informative features from the vision encoder into the language encoder's parameter space. Specifically, we extract vision features from the last layer of CLIP's vision encoder and, after projection, treat them as parameters for fine-tuning each layer of CLIP's language encoder. By adjusting these feature-adapted parameters, we can directly enable communication between the vision and language branches, facilitating CLIP's adaptation to different scenarios. Experimental results show that FATE exhibits superior generalization performance on 11 datasets with a very small amount of extra parameters and computation.

NeurIPS Conference 2025 Conference Paper

HPSERec: A Hierarchical Partitioning and Stepwise Enhancement Framework for Long-tailed Sequential Recommendation

Xiaolong Xu
Xudong Zhao
Haolong Xiang
Xuyun Zhang
Wei Shen
Hongsheng Hu
Lianyong Qi

The long-tail problem in sequential recommender systems stems from imbalanced interaction data, resulting in suboptimal model performance for tail users and items. Recent studies have leveraged head data to enhance tail data in order to diminish the impact of the long-tail problem. However, these methods often adopt ad-hoc strategies to distinguish between head and tail data, which fail to capture the underlying distributional characteristics and structural properties of each category. Moreover, because a substantial representational gap exists between head and tail data, head-to-tail enhancement strategies are susceptible to negative transfer, often leading to a decline in overall model performance. To address these issues, we propose a hierarchical partitioning and stepwise enhancement framework, called HPSERec, for long-tailed sequential recommendation. HPSERec partitions the item set into subsets based on a data imbalance metric, assigning an expert network to each subset to capture user-specific local features. Subsequently, we apply knowledge distillation to progressively improve long-tail interest representation, followed by a Sinkhorn optimal transport-based feedback module, which aligns user representations across expert levels through a globally optimal and softly matched mapping. Extensive experiments on three real-world datasets demonstrate that HPSERec consistently outperforms all baseline methods. The implementation code is available at https://anonymous.4open.science/r/HPSERec-2404.

NeurIPS Conference 2025 Conference Paper

HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models

Zelin Peng
Zhengqin Xu
Qingyang Liu
Xiaokang Yang
Wei Shen

Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources (e.g., thousands of GPUs) for training to achieve cross-modal alignment at multi-granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they are widely equipped with, e.g., CLIP and SAM, which lack alignment with language at multi-granularity levels. To address this issue, in this paper, we leverage hyperbolic space, which inherently models hierarchical levels and thus provides a principled framework for bridging the granularity gap between visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed HyperET, which can optimize visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. HyperET employs learnable matrices with Möbius multiplication operations, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, providing a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that HyperET consistently and clearly improves both existing pre-training and fine-tuning MLLMs with less than 1% additional parameters. Code is available at https://github.com/godlin-sjtu/HyperET.
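
As a rough illustration of how a matrix acting through Möbius multiplication can adjust the hyperbolic radius, the sketch below implements the standard Möbius matrix-vector product on the Poincaré ball for a diagonal scaling matrix. It follows the textbook formula from the hyperbolic neural network literature; whether HyperET uses exactly this form is an assumption. The function names and the curvature parameter `c` are hypothetical.

```python
import math


def hyperbolic_radius(x, c=1.0):
    """Distance from the origin to x on the Poincaré ball of curvature -c."""
    norm = math.sqrt(sum(v * v for v in x))
    return 2.0 / math.sqrt(c) * math.atanh(math.sqrt(c) * norm)


def mobius_diag_matvec(diag, x, c=1.0):
    """Möbius matrix-vector product M (x)_c x for a diagonal matrix M.

    `diag` holds the diagonal entries of M; x must lie strictly inside
    the ball, i.e. sqrt(c) * ||x|| < 1.
    """
    mx = [d * xi for d, xi in zip(diag, x)]  # ordinary product M x
    x_norm = math.sqrt(sum(v * v for v in x))
    mx_norm = math.sqrt(sum(v * v for v in mx))
    if x_norm == 0.0 or mx_norm == 0.0:
        return [0.0] * len(x)  # the origin maps to the origin
    sqrt_c = math.sqrt(c)
    # Rescale M x so that its hyperbolic norm is (||Mx|| / ||x||) times that of x.
    scale = math.tanh(mx_norm / x_norm * math.atanh(sqrt_c * x_norm)) / (sqrt_c * mx_norm)
    return [scale * v for v in mx]
```

With a uniform diagonal such as `[2.0, 2.0]`, the direction of `x` is unchanged and its hyperbolic radius doubles, while the result stays inside the ball.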

AAAI Conference 2025 Conference Paper

Label-Free Backdoor Attacks in Vertical Federated Learning

Wei Shen
Wenke Huang
Guancheng Wan
Mang Ye

Vertical Federated Learning (VFL) involves multiple clients collaborating to train a global model, with distributed features of shared samples. While it has become a critical privacy-preserving learning paradigm, its security can be significantly compromised by backdoor attacks, where a malicious client injects a target backdoor by manipulating local data. Existing attack methods in VFL rely on the assumption that the malicious client can obtain additional knowledge about task labels, which is not applicable in VFL. In this work, we investigate a new backdoor attack paradigm in VFL, Label-Free Backdoor Attacks (LFBA), which does not require any additional task label information and is feasible in VFL settings. Specifically, while existing methods assume access to task labels or target-class samples, we demonstrate that the gradients of local embeddings reflect the semantic information of labels and can be utilized to construct the target poison sample set. In addition, we find that backdoor triggers tend to be ignored and under-fitted due to the learning of original features, which hinders backdoor task optimization. To address this, we propose selectively switching poison samples to disrupt feature learning, promoting backdoor task learning while maintaining accuracy on clean data. Extensive experiments demonstrate the effectiveness of our method in various settings.

ICLR Conference 2025 Conference Paper

Learning LLM-as-a-Judge for Preference Alignment

Ziyi Ye
Xiangsheng Li
Qiuchi Li
Qingyao Ai
Yujia Zhou 0002
Wei Shen
Dong Yan
Yiqun Liu 0001

Learning from preference feedback is a common practice for aligning large language models (LLMs) with human values. Conventionally, preference data is learned and encoded into a scalar reward model that connects a value head with an LLM to produce a scalar score as preference. However, scalar models lack interpretability and are known to be susceptible to biases in datasets. This paper investigates leveraging the LLM itself to learn from such preference data and serve as a judge to address both limitations in one shot. Specifically, we prompt the pre-trained LLM to generate initial judgment pairs with contrastive preference in natural language form. The self-generated contrastive judgment pairs are used to train the LLM-as-a-Judge with Direct Preference Optimization (DPO) and incentivize its reasoning capability as a judge. This proposal of learning the LLM-as-a-Judge using self-generated Contrastive judgments (Con-J) ensures natural interpretability through the generated rationales supporting the judgments, and demonstrates higher robustness against bias compared to scalar models. Experimental results show that Con-J outperforms the scalar reward model trained on the same collection of preference data, and outperforms a series of open-source and closed-source generative LLMs. We open-source the training process and model weights of Con-J at https://github.com/YeZiyi1998/Con-J.

NeurIPS Conference 2025 Conference Paper

MARS-VFL: A Unified Benchmark for Vertical Federated Learning with Realistic Evaluation

Wei Shen
Weiqi Liu
Mingde Chen
Wenke Huang
Mang Ye

Vertical Federated Learning (VFL) has emerged as a critical privacy-preserving learning paradigm, enabling collaborative model training by leveraging distributed features across clients. However, due to privacy concerns, there are few publicly available real-world datasets for evaluating VFL methods, which poses significant challenges to related research. To bridge this gap, we propose MARS-VFL, a unified benchmark for realistic VFL evaluation. It integrates data from practical applications involving collaboration across different features, maintaining compatibility with the VFL setting. Based on this, we standardize the evaluation of VFL methods from the mainstream aspects of efficiency, robustness, and security. We conduct comprehensive experiments to assess different VFL approaches, providing references for unified evaluation. Furthermore, we are the first to unify the evaluation of robustness challenges in VFL and introduce a new method for addressing robustness challenges, establishing standard baselines for future research.

ICML Conference 2025 Conference Paper

On the Training Convergence of Transformers for In-Context Classification of Gaussian Mixtures

Wei Shen
Ruida Zhou
Jing Yang 0002
Cong Shen 0001

Although transformers have demonstrated impressive capabilities for in-context learning (ICL) in practice, theoretical understanding of the underlying mechanism that allows transformers to perform ICL is still in its infancy. This work aims to theoretically study the training dynamics of transformers for in-context classification tasks. We demonstrate that, for in-context classification of Gaussian mixtures under certain assumptions, a single-layer transformer trained via gradient descent converges to a globally optimal model at a linear rate. We further quantify the impact of the training and testing prompt lengths on the ICL inference error of the trained transformer. We show that when the lengths of training and testing prompts are sufficiently large, the prediction of the trained transformer approaches the ground truth distribution of the labels. Experimental results corroborate the theoretical findings.

NeurIPS Conference 2025 Conference Paper

OPMapper: Enhancing Open-Vocabulary Semantic Segmentation with Multi-Guidance Information

Xuehui Wang
Chongjie Si
Xue Yang
Yuzhi Zhao
Wenhai Wang
Xiaokang Yang
Wei Shen

Open-vocabulary semantic segmentation assigns every pixel a label drawn from an open-ended, text-defined space. Vision–language models such as CLIP excel at zero-shot recognition, yet their image-level pre-training hinders dense prediction. Current approaches either fine-tune CLIP—at high computational cost—or adopt training-free attention refinements that favor local smoothness while overlooking global semantics. In this paper, we present OPMapper, a lightweight, plug-and-play module that injects both local compactness and global connectivity into attention maps of CLIP. It combines Context-aware Attention Injection, which embeds spatial and semantic correlations, and Semantic Attention Alignment, which iteratively aligns the enriched weights with textual prompts. By jointly modeling token dependencies and leveraging textual guidance, OPMapper enhances visual understanding. OPMapper is highly flexible and can be seamlessly integrated into both training-based and training-free paradigms with minimal computational overhead. Extensive experiments demonstrate its effectiveness, yielding significant improvements across 8 open-vocabulary segmentation benchmarks.

ICLR Conference 2025 Conference Paper

RMB: Comprehensively benchmarking reward models in LLM alignment

Enyu Zhou
Guodong Zheng
Binghai Wang
Zhiheng Xi
Shihan Dou
Rong Bao
Wei Shen
Limao Xiong

Reward models (RMs) guide the alignment of large language models (LLMs), steering them toward behaviors preferred by humans. Evaluating RMs is the key to better aligning LLMs. However, the current evaluation of RMs may not directly correspond to their alignment performance due to the limited distribution of evaluation data and evaluation methods that are not closely related to alignment objectives. To address these limitations, we propose RMB, a comprehensive RM benchmark that covers over 49 real-world scenarios and includes both pairwise and Best-of-N (BoN) evaluations to better reflect the effectiveness of RMs in guiding alignment optimization. We demonstrate a positive correlation between our benchmark and the downstream alignment task performance. Based on our benchmark, we conduct extensive analysis on the state-of-the-art RMs, revealing their generalization defects that were not discovered by previous benchmarks, and highlighting the potential of generative RMs. Furthermore, we delve into open questions in reward models, specifically examining the effectiveness of majority voting for the evaluation of reward models and analyzing the impact factors of generative RMs, including the influence of evaluation criteria and instructing methods. We will release our evaluation code and datasets upon publication.
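
The two evaluation modes named above can be sketched in a few lines of Python. This is an illustrative reading, not the benchmark's released code: the function names and input layout are hypothetical, and the Best-of-N rule used here (the reward model must score the winning response above every losing response) is an assumption about how such an evaluation is typically scored.

```python
def pairwise_accuracy(pairs):
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly."""
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


def best_of_n_accuracy(items):
    """Fraction of items where the winner outscores every loser.

    Each item is (winner_score, [loser_scores]).
    """
    correct = sum(
        1 for winner, losers in items
        if all(winner > loser for loser in losers)
    )
    return correct / len(items)
```

Under this reading, Best-of-N is the stricter test: a single loser scored above the winner fails the whole item, whereas the pairwise mode would still credit the other comparisons.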

AAAI Conference 2025 Conference Paper

Segment Any 3D Gaussians

Jiazhong Cen
Jiemin Fang
Chen Yang
Lingxi Xie
Xiaopeng Zhang
Wei Shen
Qi Tian

This paper presents SAGA (Segment Any 3D GAussians), a highly efficient 3D promptable segmentation method based on 3D Gaussian Splatting (3D-GS). Given 2D visual prompts as input, SAGA can segment the corresponding 3D target represented by 3D Gaussians within 4 ms. This is achieved by attaching a scale-gated affinity feature to each 3D Gaussian to endow it a new property towards multi-granularity segmentation. Specifically, a scale-aware contrastive training strategy is proposed for the scale-gated affinity feature learning. It 1) distills the segmentation capability of the Segment Anything Model (SAM) from 2D masks into the affinity features and 2) employs a soft scale gate mechanism to deal with multi-granularity ambiguity in 3D segmentation through adjusting the magnitude of each feature channel according to a specified 3D physical scale. Evaluations demonstrate that SAGA achieves real-time multi-granularity segmentation with quality comparable to state-of-the-art methods. As one of the first methods addressing promptable segmentation in 3D-GS, the simplicity and effectiveness of SAGA pave the way for future advancements in this field.

YNIMG Journal 2025 Journal Article

Tracking neural activity patterns during rapid high-altitude transitions

Ji-Yu Xie
Yi Zhang
Wei Shen
Liying Wu
Quanhao Yu
Zhen Lyu
Liangyuan Song
Rui Yang

Rapid adaptation to dynamic changes in the environment is critical for human survival. Extensive studies have observed human behavior and brain activity in a stable environment, but there is still a lack of understanding of how our brain's functional activity drives behavioral changes when the natural environment changes. Here, we used a virtual environment platform named the hypobaric hypoxia chamber to investigate how human neural oscillations and related behaviors are affected by changes in barometric pressure and oxygen levels at different altitudes. We found that physiological compensations occurred in the hypobaric hypoxic environment followed by an increase in altitude, resulting in faster response times in working memory tasks. High-density EEG analysis revealed a significant decrease in the alpha band at high altitudes, while delta band activity gradually increased with altitude. Moreover, a predictive model based on differences in brain regions across frequency bands identified the left supramarginal gyrus and left lingual gyrus as two hub regions strongly associated with hypoxia-related behavioral changes, and activations in the pallidum and amygdala could effectively decode the specific altitude at which humans are located. Our study underscores the potential of hypobaric hypoxia chambers as a powerful tool for dynamic high-altitude research and provides novel insights into how altitude-related changes shape human cognition and brain activity.

NeurIPS Conference 2025 Conference Paper

What Do Latent Action Models Actually Learn?

Chuheng Zhang
Tim Pearce
Pushi Zhang
Kaixin Wang
Xiaoyu Chen
Wei Shen
Li Zhao
Jiang Bian

Latent action models (LAMs) aim to learn action-relevant changes from unlabeled videos by compressing changes between frames as latents. However, differences between video frames can be caused by controllable changes as well as exogenous noise, leading to an important concern: do latents capture the changes caused by actions or irrelevant noise? This paper studies this issue analytically, presenting a linear model that encapsulates the essence of LAM learning, while being tractable. This provides several insights, including connections between LAM and principal component analysis (PCA), desiderata of the data-generating policy, and justification of strategies to encourage learning controllable changes using data augmentation, data cleaning, and auxiliary action-prediction. We also provide illustrative results based on numerical simulation, shedding light on the specific structure of observations, actions, and noise in data that influence LAM learning.

ICLR Conference 2024 Conference Paper

Improving Generalization of Alignment with Human Preferences through Group Invariant Learning

Rui Zheng
Wei Shen
Yuan Hua
Wenbin Lai
Shihan Dou
Yuhao Zhou 0005
Zhiheng Xi
Xiao Wang 0001

The success of AI assistants based on large language models (LLMs) hinges crucially on Reinforcement Learning from Human Feedback (RLHF), which enables the generation of responses more aligned with human preferences. As universal AI assistants, they are increasingly expected to perform consistently across various domains. However, previous work shows that Reinforcement Learning (RL) often exploits shortcuts to attain high rewards and overlooks challenging samples. This focus on quick reward gains undermines both the stability in training and the model's ability to generalize to new, unseen data. In this work, we propose a novel approach that can learn a consistent policy via RL across various data groups or domains. Given the challenges associated with acquiring group annotations, our method automatically classifies data into different groups, deliberately maximizing performance variance. Then, we optimize the policy to perform well on challenging groups. Lastly, leveraging the established groups, our approach adaptively adjusts the exploration space, allocating more learning capacity to more challenging data and preventing the model from over-optimizing on simpler data. Experimental results indicate that our approach significantly enhances training stability and model generalization.

EAAI Journal 2024 Journal Article

Intelligent detection of loose fasteners in railway tracks using distributed acoustic sensing and machine learning

Chengjia Han
Shun Wang
Aayush Madan
Chaoyang Zhao
Lipi Mohanty
Yuguang Fu
Wei Shen
Ruihua Liang

Loose fasteners in railway tracks present a potential safety concern for train operations, especially when fasteners at sharp curves become loosened or when multiple consecutive fasteners are loosened. Traditional inspection methods are inefficient due to the large number of fasteners along the rail. This study introduces fiber-optic-based Distributed Acoustic Sensing (DAS) technology for real-time health monitoring of rail track, and proposes a DAS-based framework including both supervised and unsupervised learning methods for detecting loosened fasteners. In the supervised approach, a DAS signal anomaly detection (DSAD) model is proposed to directly predict the torque applied to the fasteners. Conversely, the unsupervised method employs a DAS signal anomaly detection Variational Autoencoder (DSAD-VAE) model, which evaluates the difference between the reconstructed and input signals to quantitatively assess the extent of loosening of rail fasteners. In the laboratory track tests, the DSAD model achieves an average prediction accuracy of about 1 N·m per bolt, while the DSAD-VAE model attains an F1-score of 0.9 for classification. Furthermore, during field tests conducted on a subway track, the DSAD model achieves an F1-score of 0.9917 for fastener loosening classification, while the DSAD-VAE model achieves 100% accuracy in unsupervised monitoring of fastener anomalies.

ICML Conference 2024 Conference Paper

Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback

Songyang Gao
Qiming Ge
Wei Shen
Shihan Dou
Junjie Ye 0005
Xiao Wang 0001
Rui Zheng
Yicheng Zou

The success of AI assistants based on large language models (LLMs) hinges on Reinforcement Learning from Human Feedback (RLHF) to comprehend and align with user intentions. However, traditional alignment algorithms, such as PPO, are hampered by complex annotation and training requirements. This reliance limits the applicability of RLHF and hinders the development of professional assistants tailored to diverse human preferences. In this work, we introduce Linear Alignment, a novel algorithm that aligns language models with human preferences in one single inference step, eliminating the reliance on data annotation and model training. Linear Alignment incorporates a new parameterization for policy optimization under divergence constraints, which enables the extraction of the optimal policy in closed form and facilitates the direct estimation of the aligned response. Extensive experiments on both general and personalized preference datasets demonstrate that Linear Alignment significantly enhances the performance and efficiency of LLM alignment across diverse scenarios.

NeurIPS Conference 2024 Conference Paper

Mitigating Reward Overoptimization via Lightweight Uncertainty Estimation

Xiaoying Zhang
Jean-François Ton
Wei Shen
Hongning Wang
Yang Liu

Reinforcement Learning from Human Feedback (RLHF) has been pivotal in aligning Large Language Models with human values but often suffers from overoptimization due to its reliance on a proxy reward model. To mitigate this limitation, we first propose a lightweight uncertainty quantification method that assesses the reliability of the proxy reward using only the last layer embeddings of the reward model. Enabled by this efficient uncertainty quantification method, we formulate AdvPO, a distributionally robust optimization procedure to tackle the reward overoptimization problem in RLHF. Through extensive experiments on the Anthropic HH and TL;DR summarization datasets, we verify the effectiveness of AdvPO in mitigating the overoptimization problem, resulting in enhanced RLHF performance as evaluated through human-assisted evaluation.

AAAI Conference 2024 Conference Paper

Partial Label Learning with a Partner

Chongjie Si
Zekun Jiang
Xuehui Wang
Yan Wang
Xiaokang Yang
Wei Shen

In partial label learning (PLL), each instance is associated with a set of candidate labels among which only one is ground-truth. Most existing works focus on constructing robust classifiers to estimate the labeling confidence of candidate labels in order to identify the correct one. However, these methods usually struggle to rectify mislabeled samples. To help existing PLL methods identify and rectify mislabeled samples, in this paper, we introduce a novel partner classifier and propose a novel "mutual supervision" paradigm. Specifically, we instantiate the partner classifier predicated on the implicit fact that non-candidate labels of a sample should not be assigned to it, which is inherently accurate and has not been fully investigated in PLL. Furthermore, a novel collaborative term is formulated to link the base classifier and the partner one. During each stage of mutual supervision, both classifiers will blur each other's predictions through a blurring mechanism to prevent overconfidence in a specific label. Extensive experiments demonstrate that the performance and disambiguation ability of several well-established stand-alone and deep-learning based PLL approaches can be significantly improved by coupling with this learning paradigm.

AAAI Conference 2024 Conference Paper

SAM-PARSER: Fine-Tuning SAM Efficiently by Parameter Space Reconstruction

Zelin Peng
Zhengqin Xu
Zhilin Zeng
Xiaokang Yang
Wei Shen

The Segment Anything Model (SAM) has received remarkable attention as it offers a powerful and versatile solution for object segmentation in images. However, fine-tuning SAM for downstream segmentation tasks under different scenarios remains a challenge, as the varied characteristics of different scenarios naturally require diverse model parameter spaces. Most existing fine-tuning methods attempt to bridge the gaps among different scenarios by introducing a set of new parameters to modify SAM's original parameter space. Unlike these works, in this paper, we propose fine-tuning SAM efficiently by parameter space reconstruction (SAM-PARSER), which introduces nearly zero trainable parameters during fine-tuning. In SAM-PARSER, we assume that SAM's original parameter space is relatively complete, so that its bases are able to reconstruct the parameter space of a new scenario. We obtain the bases by matrix decomposition, and fine-tune the coefficients to reconstruct the parameter space tailored to the new scenario by an optimal linear combination of the bases. Experimental results show that SAM-PARSER exhibits superior segmentation performance across various scenarios, while reducing the number of trainable parameters by approximately 290 times compared with current parameter-efficient fine-tuning methods.

ICML Conference 2024 Conference Paper

Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning

Zhiheng Xi
Wenxiang Chen
Boyang Hong
Senjie Jin
Rui Zheng
Wei He 0024
Yiwen Ding
Shichun Liu

In this paper, we propose R$^3$: Learning Reasoning through Reverse Curriculum Reinforcement Learning (RL), a novel method that employs only outcome supervision to achieve the benefits of process supervision for large language models. The core challenge in applying RL to complex reasoning is to identify a sequence of actions that result in positive rewards and provide appropriate supervision for optimization. Outcome supervision provides sparse rewards for final results without identifying error locations, whereas process supervision offers step-wise rewards but requires extensive manual annotation. R$^3$ overcomes these limitations by learning from correct demonstrations. Specifically, R$^3$ progressively slides the start state of reasoning from a demonstration's end to its beginning, facilitating easier model exploration at all stages. Thus, R$^3$ establishes a step-wise curriculum, allowing outcome supervision to offer step-level signals and precisely pinpoint errors. Using Llama2-7B, our method surpasses the RL baseline on eight reasoning tasks by $4.1$ points on average. Notably, in program-based reasoning, 7B-scale models perform comparably to larger models or closed-source models with our R$^3$.
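The sliding start state described in this abstract can be pictured with a small sketch. It only illustrates the curriculum ordering under assumptions made here (a demonstration is a list of step strings, and each stage hands the model a prefix to continue from); the function name is hypothetical and the RL training loop itself is not shown.

```python
from typing import List


def reverse_curriculum(demo_steps: List[str]) -> List[List[str]]:
    """Return the demonstration prefixes given to the model, easiest stage first.

    Stage 1 supplies all but the last step, so the model only has to finish the
    final step; the last stage supplies nothing, so the model reasons from scratch.
    """
    n = len(demo_steps)
    stages = []
    # Slide the start state from the demonstration's end back to its beginning.
    for start in range(n - 1, -1, -1):
        stages.append(demo_steps[:start])
    return stages
```

For a three-step demonstration this yields prefixes of length two, one, and zero, so an outcome reward obtained at each stage is attributable to a progressively longer stretch of the model's own reasoning.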

AAAI Conference 2024 Conference Paper

ViTree: Single-Path Neural Tree for Step-Wise Interpretable Fine-Grained Visual Categorization

Danning Lao
Qi Liu
Jiazi Bu
Junchi Yan
Wei Shen

As computer vision continues to advance and finds widespread applications across various domains, the need for interpretability in deep learning models becomes paramount. Existing methods often resort to post-hoc techniques or prototypes to explain the decision-making process, which can be indirect and lack intrinsic illustration. In this research, we introduce ViTree, a novel approach for fine-grained visual categorization that combines the popular vision transformer as a feature extraction backbone with neural decision trees. By traversing the tree paths, ViTree effectively selects patches from transformer-processed features to highlight informative local regions, thereby refining representations in a step-wise manner. Unlike previous tree-based models that rely on soft distributions or ensembles of paths, ViTree selects a single tree path, offering a clearer and simpler decision-making process. This patch and path selectivity enhances model interpretability of ViTree, enabling better insights into the model's inner workings. Remarkably, extensive experimentation validates that this streamlined approach surpasses various strong competitors and achieves state-of-the-art performance while maintaining exceptional interpretability which is proved by multi-perspective methods. Code can be found at https://github.com/SJTU-DeepVisionLab/ViTree.

AAAI Conference 2023 Conference Paper

Intriguing Findings of Frequency Selection for Image Deblurring

Xintian Mao
Yiming Liu
Fengze Liu
Qingli Li
Wei Shen
Yan Wang

Blur was naturally analyzed in the frequency domain, by estimating the latent sharp image and the blur kernel given a blurry image. Recent progress on image deblurring always designs end-to-end architectures and aims at learning the difference between blurry and sharp image pairs at the pixel level, which inevitably overlooks the importance of blur kernels. This paper reveals an intriguing phenomenon: simply applying a ReLU operation in the frequency domain of a blurry image followed by an inverse Fourier transform, i.e., frequency selection, provides faithful information about the blur pattern (e.g., the blur direction and blur level, which implicitly show the kernel pattern). Based on this observation, we attempt to leverage kernel-level information for image deblurring networks by inserting a Fourier transform, a ReLU operation, and an inverse Fourier transform into the standard ResBlock. A 1 × 1 convolution is further added to let the network modulate flexible thresholds for frequency selection. We term our newly built block the Res FFT-ReLU Block, which takes advantage of both kernel-level and pixel-level features via learning frequency-spatial dual-domain representations. Extensive experiments are conducted to provide a thorough analysis of the insights behind the method. Moreover, after plugging the proposed block into NAFNet, we can achieve 33.85 dB in PSNR on the GoPro dataset. Our method noticeably improves backbone architectures without introducing many parameters, while maintaining low computational complexity. Code is available at https://github.com/DeepMed-Lab/DeepRFT-AAAI2023.
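The frequency-selection step (Fourier transform, ReLU, inverse Fourier transform) can be sketched on a 1D signal using only the standard library. This is a toy illustration, not the authors' block: applying ReLU separately to the real and imaginary parts and keeping only the real part of the inverse transform are assumptions made here, and the learned 1 × 1 convolution is omitted.

```python
import cmath
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: List[complex]) -> List[complex]:
    """Naive O(n^2) inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def relu_complex(z: complex) -> complex:
    """Assumed convention: ReLU applied to real and imaginary parts independently."""
    return complex(max(z.real, 0.0), max(z.imag, 0.0))


def frequency_selection(signal: List[float]) -> List[float]:
    """FFT -> ReLU -> inverse FFT, returning the real part (an assumption)."""
    selected = [relu_complex(z) for z in dft(signal)]
    return [z.real for z in idft(selected)]
```

A constant positive signal passes through almost unchanged because its only nonzero coefficient is a positive real DC term, while a constant negative signal is suppressed to roughly zero.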

AAAI Conference 2023 Conference Paper

Low-Resource Personal Attribute Prediction from Conversations

Yinan Liu
Hu Chen
Wei Shen
Jiaoyan Chen

Personal knowledge bases (PKBs) are crucial for a broad range of applications such as personalized recommendation and Web-based chatbots. A critical challenge in building PKBs is extracting personal attribute knowledge from users' conversation data. Given some users of a conversational system, a personal attribute, and these users' utterances, our goal is to predict the ranking of the given personal attribute values for each user. Previous studies often rely on a relatively large amount of resources such as labeled utterances and external data, yet the attribute knowledge embedded in unlabeled utterances is underutilized and their performance in predicting some difficult personal attributes is still unsatisfactory. In addition, some text classification methods could be employed to resolve this task directly. However, they also do not perform well on those difficult personal attributes. In this paper, we propose a novel framework, PEARL, to predict personal attributes from conversations by leveraging the abundant personal attribute knowledge from utterances under a low-resource setting in which no labeled utterances or external data are utilized. PEARL combines the biterm semantic information with the word co-occurrence information seamlessly by employing the updated prior attribute knowledge to refine the biterm topic model's Gibbs sampling process in an iterative manner. The extensive experimental results show that PEARL outperforms all the baseline methods not only on the task of personal attribute prediction from conversations over two data sets, but also on the more general weakly supervised text classification task over one data set.

AAAI Conference 2023 Conference Paper

RePreM: Representation Pre-training with Masked Model for Reinforcement Learning

Yuanying Cai
Chuheng Zhang
Wei Shen
Xuyun Zhang
Wenjie Ruan
Longbo Huang

Inspired by the recent success of sequence modeling in RL and the use of masked language model for pre-training, we propose a masked model for pre-training in RL, RePreM (Representation Pre-training with Masked Model), which trains the encoder combined with transformer blocks to predict the masked states or actions in a trajectory. RePreM is simple but effective compared to existing representation pre-training methods in RL. It avoids algorithmic sophistication (such as data augmentation or estimating multiple models) with sequence modeling and generates a representation that captures long-term dynamics well. Empirically, we demonstrate the effectiveness of RePreM in various tasks, including dynamic prediction, transfer learning, and sample-efficient RL with both value-based and actor-critic methods. Moreover, we show that RePreM scales well with dataset size, dataset quality, and the scale of the encoder, which indicates its potential towards big RL models.

NeurIPS Conference 2023 Conference Paper

Segment Anything in 3D with NeRFs

Jiazhong Cen
Zanwei Zhou
Jiemin Fang
Chen Yang
Wei Shen
Lingxi Xie
Dongsheng Jiang
Xiaopeng Zhang

Recently, the Segment Anything Model (SAM) emerged as a powerful vision foundation model capable of segmenting anything in 2D images. This paper aims to generalize SAM to segment 3D objects. Rather than replicating the data acquisition and annotation procedure, which is costly in 3D, we design an efficient solution, leveraging the Neural Radiance Field (NeRF) as a cheap and off-the-shelf prior that connects multi-view 2D images to the 3D space. We refer to the proposed solution as SA3D, for Segment Anything in 3D. It is only required to provide a manual segmentation prompt (e.g., rough points) for the target object in a single view, which is used to generate its 2D mask in this view with SAM. Next, SA3D alternately performs mask inverse rendering and cross-view self-prompting across various views to iteratively complete the 3D mask of the target object constructed with voxel grids. The former projects the 2D mask obtained by SAM in the current view onto the 3D mask with guidance of the density distribution learned by the NeRF; the latter extracts reliable prompts automatically as the input to SAM from the NeRF-rendered 2D mask in another view. We show in experiments that SA3D adapts to various scenes and achieves 3D segmentation within minutes. Our research offers a generic and efficient methodology to lift a 2D vision foundation model to 3D, as long as the 2D model can steadily address promptable segmentation across multiple views.

IJCAI Conference 2022 Conference Paper

Community Question Answering Entity Linking via Leveraging Auxiliary Data

Yuhan Li
Wei Shen
Jianbo Gao
Yadong Wang

Community Question Answering (CQA) platforms contain plenty of CQA texts (i.e., questions and answers corresponding to the question) where named entities appear ubiquitously. In this paper, we define a new task of CQA entity linking (CQAEL) as linking the textual entity mentions detected from CQA texts with their corresponding entities in a knowledge base. This task can facilitate many downstream applications including expert finding and knowledge base enrichment. Traditional entity linking methods mainly focus on linking entities in news documents, and are suboptimal over this new task of CQAEL since they cannot effectively leverage various informative auxiliary data involved in the CQA platform to aid entity linking, such as parallel answers and two types of meta-data (i.e., topic tags and users). To remedy this crucial issue, we propose a novel transformer-based framework to effectively harness the knowledge delivered by different kinds of auxiliary data to promote the linking performance. We validate the superiority of our framework through extensive experiments over a newly released CQAEL data set against state-of-the-art entity linking methods.

NeurIPS Conference 2022 Conference Paper

Video-based Human-Object Interaction Detection from Tubelet Tokens

Danyang Tu
Wei Sun
Xiongkuo Min
Guangtao Zhai
Wei Shen

We present a novel vision Transformer, named TUTOR, which is able to learn tubelet tokens, which serve as highly abstracted spatial-temporal representations, for video-based human-object interaction (V-HOI) detection. The tubelet tokens structurize videos by agglomerating and linking semantically related patch tokens along spatial and temporal domains, which brings two benefits: 1) Compactness: each token is learned by a selective attention mechanism to reduce redundant dependencies from others; 2) Expressiveness: each token is enabled to align with a semantic instance, i.e., an object or a human, thanks to agglomeration and linking. The effectiveness and efficiency of TUTOR are verified by extensive experiments. Results show our method outperforms existing works by large margins, with a relative mAP gain of $16.14\%$ on VidHOI and a 2-point gain on CAD-120 as well as a $4 \times$ speedup.

NeurIPS Conference 2021 Conference Paper

Glance-and-Gaze Vision Transformer

Qihang Yu
Yingda Xia
Yutong Bai
Yongyi Lu
Alan L. Yuille
Wei Shen

Recently, a series of vision Transformers has emerged, which show superior performance with a more compact model size than conventional convolutional neural networks, thanks to the strong ability of Transformers to model long-range dependencies. However, the advantages of vision Transformers also come with a price: self-attention, the core part of the Transformer, has quadratic complexity in the input sequence length. This leads to a dramatic increase in computation and memory cost as the sequence length grows, thus introducing difficulties when applying Transformers to vision tasks that require dense predictions based on high-resolution feature maps. In this paper, we propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer), to address the aforementioned issues. It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes, with the ability to efficiently model both long-range dependencies and local context. In GG-Transformer, the Glance and Gaze behavior is realized by two parallel branches: the Glance branch is achieved by performing self-attention on the adaptively-dilated partitions of the input, which leads to linear complexity while still enjoying a global receptive field; the Gaze branch is implemented by a simple depth-wise convolutional layer, which adds local image context to the features obtained by the Glance mechanism. We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers on various vision tasks and benchmarks.

YNICL Journal 2020 Journal Article

Corrigendum to “Multivariate resting-state functional connectivity predicts responses to real and sham acupuncture treatment in chronic low back pain” [Neuroimage Clinical 23 (2019) 101885]

Yiheng Tu
Ana Ortiz
Randy L. Gollub
Jin Cao
Jessica Gerber
Courtney Lang
Joel Park
Georgia Wilson

The authors regret to find several errors that do not influence the main findings or conclusions. Specifically, we have found that the values of pre- and post-treatment clinical sub-scores for 'physical function' and 'sleep' in Fig. 4 contain errors. The corrected Fig. 4 is shown below: As a result, the Results section 3.3 (Page 6, right column): "Real and sham acupuncture significantly reduced PROMIS sub-scores in

YNIMG Journal 2020 Journal Article

Impaired mesocorticolimbic connectivity underlies increased pain sensitivity in chronic low back pain

Siyi Yu
Wen Li
Wei Shen
Robert R. Edwards
Randy L. Gollub
Georgia Wilson
Joel Park
Ana Ortiz

Chronic low back pain (cLBP) is a prevalent disorder. A growing body of evidence linking the pathology of the reward network to chronic pain suggests that pain sensitization may contribute to cLBP chronification via disruptions of mesocortical and mesolimbic circuits in the reward system. Resting-state (RS) functional magnetic resonance imaging (fMRI) data were acquired from 90 patients with cLBP and 74 matched pain-free controls (HCs) at baseline and after a manipulation for back pain intensification. The ventral tegmental area (VTA) was chosen as a seed region to perform RS functional connectivity (FC) analysis. Baseline rsFC of both the mesocortical (between the VTA and bilateral rostral anterior cingulate cortex (rACC) and medial prefrontal cortex (mPFC)) and mesolimbic (between the VTA and bilateral hippocampus/parahippocampus) pathways was reduced in patients with cLBP (vs. HCs). In addition, patients exhibiting higher back pain intensity (compared to the relatively lower back pain intensity condition) also showed increases in both mesocortical and mesolimbic connectivity, implicating these pathways in pain downregulation in cLBP. Mediation analysis further isolated the mesolimbic (VTA-hippocampus/parahippocampus) dysconnectivity as a neural mechanism mediating the association between mechanical pain sensitivity (indexed by P40 pressure) and cLBP severity. In sum, the current study demonstrates deficient mesocorticolimbic connectivity in cLBP, with mesolimbic dysconnectivity potentially mediating the contribution of pain sensitization to pain chronification. These reward network dysfunctions and, purportedly, dopaminergic dysregulations may help us to identify key brain targets of neuromodulation in the treatment of cLBP.

IROS Conference 2020 Conference Paper

Towards Unsupervised Learning for Instrument Segmentation in Robotic Surgery with Cycle-Consistent Adversarial Networks

Daniil Pakhomov
Wei Shen
Nassir Navab

Surgical tool segmentation in endoscopic images is an important problem: it is a crucial step towards full instrument pose estimation and it is used for integration of pre- and intra-operative images into the endoscopic view. While many recent approaches based on convolutional neural networks have shown great results, a key barrier to progress lies in the acquisition of a large number of manually annotated images, which is necessary for an algorithm to generalize and work well in diverse surgical scenarios. Unlike the surgical image data itself, annotations are difficult to acquire and may be of variable quality. On the other hand, synthetic annotations can be automatically generated by using the forward kinematic model of the robot and CAD models of tools and projecting them onto an image plane. Unfortunately, this model is very inaccurate and cannot be used for supervised learning of image segmentation models. Since generated annotations will not directly correspond to endoscopic images due to errors, we formulate the problem as an unpaired image-to-image translation where the goal is to learn the mapping between an input endoscopic image and a corresponding annotation using an adversarial model. Our approach allows training image segmentation models without the need to acquire expensive annotations and can potentially exploit large unlabeled endoscopic image collections outside the annotated distributions of image/annotation data. We test our proposed method on the Endovis 2017 challenge dataset and show that it is competitive with supervised segmentation methods.

YNICL Journal 2019 Journal Article

Corrigendum to ‘Multivariate resting-state functional connectivity predicts responses to real and sham acupuncture treatment in chronic low back pain’ Neuroimage Clinical, 23, 2019, 101885

Yiheng Tu
Ana Ortiz
Randy L. Gollub
Jin Cao
Jessica Gerber
Courtney Lang
Joel Park
Georgia Wilson

The authors regret to find several errors that do not influence the main findings or conclusions. Specifically, we have found that the values of pre- and post-treatment clinical sub-scores for 'physical function' and 'sleep' in Fig. 4 contain errors. The corrected Fig. 4 is shown below: As a result, the Results section 3.3 (Page 6, right column): "Real and sham acupuncture significantly reduced PROMIS sub-scores in

YNICL Journal 2019 Journal Article

Multivariate resting-state functional connectivity predicts responses to real and sham acupuncture treatment in chronic low back pain

Yiheng Tu
Ana Ortiz
Randy L. Gollub
Jin Cao
Jessica Gerber
Courtney Lang
Joel Park
Georgia Wilson

Despite the high prevalence and socioeconomic impact of chronic low back pain (cLBP), treatments for cLBP are often unsatisfactory, and effectiveness varies widely across patients. Recent neuroimaging studies have demonstrated abnormal resting-state functional connectivity (rsFC) of the default mode, salience, central executive, and sensorimotor networks in chronic pain patients, but their role as predictors of treatment responsiveness has not yet been explored. In this study, we used machine learning approaches to test if pre-treatment rsFC can predict responses to both real and sham acupuncture treatments in cLBP patients. Fifty cLBP patients participated in 4 weeks of either real (N = 24, age = 39.0 ± 12.6, 16 females) or sham acupuncture (N = 26, age = 40.0 ± 13.7, 15 females) treatment in a single-blinded trial, and a resting-state fMRI scan prior to treatment was used in data analysis. Both real and sham acupuncture can produce significant pain reduction, with those receiving real treatment experiencing greater pain relief than those receiving sham treatment. We found that pre-treatment rsFC could predict symptom changes with up to 34% and 29% variances for real and sham treatment, respectively, and the rsFC characteristics that were significantly predictive for real and sham treatment differed. These results suggest a potential way to predict treatment responses and may facilitate the development of treatment plans that optimize time, cost, and available resources.

YNICL Journal 2019 Journal Article

Visual network alterations in brain functional connectivity in chronic low back pain: A resting state functional connectivity and machine learning study

Wei Shen
Yiheng Tu
Randy L. Gollub
Ana Ortiz
Vitaly Napadow
Siyi Yu
Georgia Wilson
Joel Park

Chronic low back pain (cLBP) is associated with widespread functional and structural changes in the brain. This study aims to investigate the resting state functional connectivity (rsFC) changes of visual networks in cLBP patients and the feasibility of distinguishing cLBP patients from healthy controls using machine learning methods. cLBP (n = 90) and control individuals (n = 74) were enrolled and underwent resting-state BOLD fMRI scans. Primary, dorsal, and ventral visual networks derived from independent component analysis were used as regions of interest to compare resting state functional connectivity changes between the cLBP patients and healthy controls. We then applied a support vector machine classifier to distinguish the cLBP patients from the control individuals. These results were further verified in a new cohort of subjects. We found that the functional connectivity between the primary visual network and the somatosensory/motor areas was significantly enhanced in cLBP patients. The rsFC between the primary visual network and S1 was negatively associated with duration of cLBP. In addition, we found that the rsFC of the visual network could achieve a classification accuracy of 79.3% in distinguishing cLBP patients from healthy controls, and these results were further validated in an independent cohort of subjects (accuracy = 66.7%). Our results demonstrate significant changes in the rsFC of the visual networks in cLBP patients. We speculate these alterations may represent an adaptation/self-adjustment mechanism and cross-modal interaction between the visual, somatosensory, motor, attention, and salience networks in response to cLBP. Elucidating the role of the visual networks in cLBP may shed light on the pathophysiology and development of the disorder.

IJCAI Conference 2018 Conference Paper

Hi-Fi: Hierarchical Feature Integration for Skeleton Detection

Kai Zhao
Wei Shen
Shanghua Gao
Dandan Li
Ming-Ming Cheng

In natural images, the scales (thickness) of object skeletons may dramatically vary among objects and object parts. Thus, robust skeleton detection requires powerful multi-scale feature integration ability. To address this issue, we present a new convolutional neural network (CNN) architecture by introducing a novel hierarchical feature integration mechanism, named Hi-Fi, to address the object skeleton detection problem. The proposed CNN-based approach intrinsically captures high-level semantics from deeper layers, as well as low-level details from shallower layers. By hierarchically integrating different CNN feature levels with bidirectional guidance, our approach (1) enables mutual refinement across features of different levels, and (2) possesses the strong ability to capture both rich object context and high-resolution details. Experimental results show that our method significantly outperforms the state-of-the-art methods in terms of effectively fusing features from very different scales, as evidenced by a considerable performance improvement on several benchmarks.

NeurIPS Conference 2017 Conference Paper

Label Distribution Learning Forests

Wei Shen
Kai Zhao
Yilu Guo
Alan Yuille

Label distribution learning (LDL) is a general learning framework, which assigns to an instance a distribution over a set of labels rather than a single label or multiple labels. Current LDL methods have either restricted assumptions on the expression form of the label distribution or limitations in representation learning, e.g., to learn deep features in an end-to-end manner. This paper presents label distribution learning forests (LDLFs) - a novel label distribution learning algorithm based on differentiable decision trees, which have several advantages: 1) Decision trees have the potential to model any general form of label distributions by a mixture of leaf node predictions. 2) The learning of differentiable decision trees can be combined with representation learning. We define a distribution-based loss function for a forest, enabling all the trees to be learned jointly, and show that an update function for leaf node predictions, which guarantees a strict decrease of the loss function, can be derived by variational bounding. The effectiveness of the proposed LDLFs is verified on several LDL tasks and a computer vision application, showing significant improvements to the state-of-the-art LDL methods.

IJCAI Conference 2013 Conference Paper

Improving Traffic Prediction with Tweet Semantics

Jingrui He
Wei Shen
Phani Divakaruni
Laura Wynter
Rick Lawrence

Road traffic prediction is a critical component in modern smart transportation systems. It provides the basis for traffic management agencies to generate proactive traffic operation strategies for alleviating congestion. Existing work on near-term traffic prediction (forecasting horizons in the range of 5 minutes to 1 hour) relies on the past and current traffic conditions. However, once the forecasting horizon is beyond 1 hour, i.e., in longer-term traffic prediction, these techniques do not work well since additional factors other than the past and current traffic conditions start to play important roles. To address this problem, in this paper, for the first time, we examine whether it is possible to use the rich information in online social media to improve longer-term traffic prediction. To this end, we first analyze the correlation between traffic volume and tweet counts with various granularities. Then we propose an optimization framework to extract traffic indicators based on tweet semantics using a transformation matrix, and incorporate them into traffic prediction via linear regression. Experimental results using traffic and Twitter data originating from the San Francisco Bay area of California demonstrate the effectiveness of our proposed framework.
