Arrow Research search

Author name cluster

Sen Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

19 papers
2 author rows

Possible papers

19

AAAI Conference 2026 Conference Paper

EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens

  • Ze Feng
  • Sen Yang
  • Boqiang Duan
  • Wankou Yang
  • Jingdong Wang

Efficient Multimodal Large Language Models (MLLMs) compress vision tokens to reduce resource consumption, but the loss of visual information can degrade comprehension capabilities. Although some prior works introduce Knowledge Distillation to enhance student models, they overlook the fundamental differences in fine-grained vision comprehension caused by unbalanced vision tokens between the efficient student and the vanilla teacher. In this paper, we propose EM-KD, a novel paradigm that enhances Efficient MLLMs with Knowledge Distillation. To overcome the challenge of unbalanced vision tokens, we first calculate the Manhattan distance between the vision logits of the teacher and the student, and then align them in the spatial dimension with the Hungarian matching algorithm. After alignment, EM-KD introduces two distillation strategies: 1) Vision-Language Affinity Distillation (VLAD) and 2) Vision Semantic Distillation (VSD). Specifically, VLAD calculates the affinity matrix between text tokens and aligned vision tokens, and minimizes the smooth L1 distance between the student and teacher affinity matrices. Considering the semantic richness of vision logits in the final layer, VSD employs the reverse KL divergence to measure the discrepancy between the discrete probability distributions of the aligned vision logits over the vocabulary space. Comprehensive evaluation on diverse benchmarks demonstrates that the model trained with EM-KD outperforms prior Efficient MLLMs in both accuracy and efficiency by a large margin, validating its effectiveness. EM-KD also outperforms previous distillation methods, which we equip with our proposed vision token matching strategy for a fair comparison.
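The alignment step described in the abstract (Manhattan-distance cost plus Hungarian matching, followed by reverse KL) can be sketched with off-the-shelf tools. This is an illustrative reconstruction, not the authors' implementation; the function name, shapes, and the per-token reduction are assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.special import log_softmax, softmax

def align_and_distill(teacher_logits, student_logits):
    """Pair each student vision token with a teacher token by minimal
    Manhattan (L1) distance, then compute a reverse-KL distillation term."""
    # cost[i, j] = L1 distance between student token i and teacher token j,
    # summed over the vocabulary dimension
    cost = np.abs(student_logits[:, None, :] - teacher_logits[None, :, :]).sum(-1)
    # Hungarian matching: optimal one-to-one assignment of student tokens
    s_idx, t_idx = linear_sum_assignment(cost)
    # reverse KL(student || teacher) over the aligned vocabulary distributions
    q = softmax(student_logits[s_idx], axis=-1)
    log_q = log_softmax(student_logits[s_idx], axis=-1)
    log_p = log_softmax(teacher_logits[t_idx], axis=-1)
    rkl = (q * (log_q - log_p)).sum(-1).mean()
    return t_idx, rkl
```

When the student's (fewer) tokens are exact copies of some teacher tokens, the matching recovers the correspondence and the reverse-KL term vanishes.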

JBHI Journal 2025 Journal Article

Counterfactual Bidirectional Co-Attention Transformer for Integrative Histology-Genomic Cancer Risk Stratification

  • Zheyi Ji
  • Yongxin Ge
  • Chijioke Chukwudi
  • Kaicheng U
  • Sophia Meixuan Zhang
  • Yulong Peng
  • Junyou Zhu
  • Hossam Zaki

Applying deep learning to predict patient prognostic survival outcomes using histological whole-slide images (WSIs) and genomic data is challenging due to the morphological and transcriptomic heterogeneity present in the tumor microenvironment. Existing deep learning-enabled methods often exhibit learning biases, primarily because the genomic knowledge used to guide directional feature extraction from WSIs may be irrelevant or incomplete. This results in a suboptimal and sometimes myopic understanding of the overall pathological landscape, potentially overlooking crucial histological insights. To tackle these challenges, we propose the CounterFactual Bidirectional Co-Attention Transformer framework. By integrating a bidirectional co-attention layer, our framework fosters effective feature interactions between the genomic and histology modalities and ensures consistent identification of prognostic features from WSIs. Using counterfactual reasoning, our model utilizes causality to model unimodal and multimodal knowledge for cancer risk stratification. This approach directly addresses and reduces bias, enables the exploration of 'what-if' scenarios, and offers a deeper understanding of how different features influence survival outcomes. Our framework, validated across eight diverse cancer benchmark datasets from The Cancer Genome Atlas (TCGA), represents a major improvement over current histology-genomic model learning methods. It shows an average 2.5% improvement in c-index performance over 18 state-of-the-art models in predicting patient prognoses across eight cancer types.

ICRA Conference 2025 Conference Paper

Learning Multiple Probabilistic Decisions from Latent World Model in Autonomous Driving

  • Lingyu Xiao
  • Jiang-Jiang Liu 0001
  • Sen Yang
  • Xiaofan Li
  • Xiaoqing Ye
  • Wankou Yang
  • Jingdong Wang 0001

The autoregressive world model exhibits robust generalization capabilities in vectorized scene understanding but encounters difficulties in deriving actions due to insufficient uncertainty modeling and self-delusion. In this paper, we explore the feasibility of deriving decisions from an autoregressive world model by addressing these challenges through the formulation of multiple probabilistic hypotheses. We propose LatentDriver, a framework that models the environment's next states and the ego vehicle's possible actions as a mixture distribution, from which a deterministic control signal is then derived. By incorporating mixture modeling, the stochastic nature of decision-making is captured. Additionally, the self-delusion problem is mitigated by providing intermediate actions sampled from a distribution to the world model. Experimental results on the recently released closed-loop benchmark Waymax demonstrate that LatentDriver surpasses state-of-the-art reinforcement learning and imitation learning methods, achieving expert-level performance. The code and models will be made available at https://github.com/Sephirex-X/LatentDriver.

ICLR Conference 2025 Conference Paper

MGMapNet: Multi-Granularity Representation Learning for End-to-End Vectorized HD Map Construction

  • Jing Yang
  • Minyue Jiang
  • Sen Yang
  • Xiao Tan 0001
  • Yingying Li
  • Errui Ding
  • Jingdong Wang 0001
  • Hanli Wang

The construction of vectorized high-definition maps typically requires capturing both the category and geometry information of map elements. Current state-of-the-art methods often adopt solely either a point-level or an instance-level representation, overlooking the strong intrinsic relationship between points and instances. In this work, we propose a simple yet efficient framework named MGMapNet (multi-granularity map network) to model map elements with a multi-granularity representation, integrating both coarse-grained instance-level and fine-grained point-level queries. Specifically, these two granularities of queries are generated from the multi-scale bird's eye view features using a proposed multi-granularity aggregator. In this module, the instance-level query aggregates features over the entire scope covered by an instance, while the point-level query aggregates features locally. Furthermore, a point-instance interaction module is designed to encourage information exchange between instance-level and point-level queries. Experimental results demonstrate that the proposed MGMapNet achieves state-of-the-art performance, surpassing MapTRv2 by 5.3 mAP on the nuScenes dataset and by 4.4 mAP on the Argoverse2 dataset.

ICLR Conference 2023 Conference Paper

Capturing the Motion of Every Joint: 3D Human Pose and Shape Estimation with Independent Tokens

  • Sen Yang
  • Wen Heng
  • Gang Liu
  • Guozhong Luo
  • Wankou Yang
  • Gang Yu 0002

In this paper we present a novel method to estimate 3D human pose and shape from monocular videos. This task requires directly recovering pixel-aligned 3D human pose and body shape from monocular images or videos, which is challenging due to its inherent ambiguity. To improve precision, existing methods rely heavily on the initialized mean pose and shape as prior estimates and on parameter regression in an iterative error-feedback manner. In addition, video-based approaches model the overall change over the image-level features to temporally enhance the single-frame feature, but fail to capture the rotational motion at the joint level and cannot guarantee local temporal consistency. To address these issues, we propose a novel Transformer-based model with a design of independent tokens. First, we introduce three types of tokens independent of the image feature: joint rotation tokens, a shape token, and a camera token. By progressively interacting with image features through Transformer layers, these tokens learn to encode the prior knowledge of human 3D joint rotations, body shape, and position information from large-scale data, and are updated to estimate SMPL parameters conditioned on a given image. Second, benefiting from the proposed token-based representation, we further use a temporal model to focus on capturing the rotational temporal information of each joint, which is empirically conducive to preventing large jitters in local parts. Despite being conceptually simple, the proposed method attains superior performance on the 3DPW and Human3.6M datasets. Using ResNet-50 and Transformer architectures, it obtains 42.0 mm error on the PA-MPJPE metric of the challenging 3DPW benchmark, outperforming state-of-the-art counterparts by a large margin. Code will be publicly available at https://github.com/yangsenius/INT_HMR_Model.

AAAI Conference 2023 Conference Paper

PINAT: A Permutation INvariance Augmented Transformer for NAS Predictor

  • Shun Lu
  • Yu Hu
  • Peihao Wang
  • Yan Han
  • Jianchao Tan
  • Jixiang Li
  • Sen Yang
  • Ji Liu

Time-consuming performance evaluation is the bottleneck of traditional Neural Architecture Search (NAS) methods. Predictor-based NAS can speed up performance evaluation by directly predicting performance, rather than training a large number of sub-models and then validating their performance. Most predictor-based NAS approaches use a proxy dataset to train model-based predictors efficiently but suffer from performance degradation and generalization problems. We attribute these problems to the poor abilities of existing predictors to characterize the sub-models' structure, specifically the topology information extraction and the node feature representation of the input graph data. To address these problems, we propose a Transformer-like NAS predictor PINAT, consisting of a Permutation INvariance Augmentation module serving as both the token embedding layer and the self-attention head, as well as a Laplacian matrix as the positional encoding. Our design produces more representative features of the encoded architecture and outperforms state-of-the-art NAS predictors on six search spaces: NAS-Bench-101, NAS-Bench-201, DARTS, ProxylessNAS, PPI, and ModelNet. The code is available at https://github.com/ShunLu91/PINAT.

AAAI Conference 2023 Conference Paper

ProxyBO: Accelerating Neural Architecture Search via Bayesian Optimization with Zero-Cost Proxies

  • Yu Shen
  • Yang Li
  • Jian Zheng
  • Wentao Zhang
  • Peng Yao
  • Jixiang Li
  • Sen Yang
  • Ji Liu

Designing neural architectures requires immense manual effort. This has promoted the development of neural architecture search (NAS) to automate the design. Previous NAS methods achieve promising results but run slowly, while zero-cost proxies run extremely fast but are less promising. There is therefore great potential to accelerate NAS via those zero-cost proxies. The existing method that does so has two limitations: unforeseeable reliability and one-shot usage. To address these limitations, we present ProxyBO, an efficient Bayesian optimization (BO) framework that utilizes zero-cost proxies to accelerate neural architecture search. We apply a generalization-ability measurement to estimate the fitness of the proxies on the task during each iteration and design a novel acquisition function to combine BO with zero-cost proxies based on their dynamic influence. Extensive empirical studies show that ProxyBO consistently outperforms competitive baselines on five tasks from three public benchmarks. Concretely, ProxyBO achieves up to 5.41× and 3.86× speedups over the state-of-the-art approaches REA and BRP-NAS, respectively.
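One way to picture the "dynamic influence" idea is to weight each zero-cost proxy by how well it ranks the architectures evaluated so far. This is a hypothetical sketch under that assumption; the function name and the Kendall-tau weighting are mine, not the paper's exact measurement or acquisition design:

```python
import numpy as np
from scipy.stats import kendalltau

def proxy_weights(proxy_scores, observed_perf):
    """Weight each zero-cost proxy by its rank correlation (Kendall tau)
    with the performances observed so far; negative correlations get zero."""
    taus = []
    for scores in proxy_scores:
        tau, _ = kendalltau(scores, observed_perf)
        taus.append(max(tau, 0.0))  # an anti-correlated proxy is uninformative
    taus = np.array(taus)
    if taus.sum() == 0.0:
        return np.full(len(taus), 1.0 / len(taus))  # fall back to uniform
    return taus / taus.sum()
```

As more architectures are evaluated, the weights shift toward proxies whose rankings keep agreeing with the observations, which is the flavor of "dynamic influence" described above.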

NeurIPS Conference 2022 Conference Paper

SCL-WC: Cross-Slide Contrastive Learning for Weakly-Supervised Whole-Slide Image Classification

  • Xiyue Wang
  • Jinxi Xiang
  • Jun Zhang
  • Sen Yang
  • Zhongyi Yang
  • Ming-Hui Wang
  • Jing Zhang
  • Wei Yang

Weakly-supervised whole-slide image (WSI) classification (WSWC) is a challenging task where a large number of unlabeled patches (instances) exist within each WSI (bag) while only a slide-level label is given. Despite recent progress in multiple instance learning (MIL)-based WSI analysis, the major limitation is that it usually focuses on the easy-to-distinguish diagnosis-positive regions while ignoring positives that occupy a small ratio of the entire WSI. To obtain more discriminative features, we propose a novel weakly-supervised classification method based on cross-slide contrastive learning (called SCL-WC), which depends on task-agnostic self-supervised feature pre-extraction and task-specific weakly-supervised feature refinement and aggregation for WSI-level prediction. To enable both intra-WSI and inter-WSI information interaction, we propose a positive-negative-aware module (PNM) and a weakly-supervised cross-slide contrastive learning (WSCL) module, respectively. The WSCL aims to pull WSIs with the same disease types closer and push different WSIs away. The PNM aims to facilitate the separation of tumor-like patches and normal ones within each WSI. Extensive experiments demonstrate state-of-the-art performance of our method in three different classification tasks (e.g., over 2% AUC gain on Camelyon16, 5% F1 score on BRACS, and 3% AUC on DiagSet). Our method also shows superior flexibility and scalability in weakly-supervised localization and semi-supervised classification experiments (e.g., first place in the BRIGHT challenge). Our code will be available at https://github.com/Xiyue-Wang/SCL-WC.
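The pull/push behavior of the WSCL module can be illustrated with a generic supervised contrastive loss over slide-level embeddings. A simplified NumPy sketch follows; the function name, temperature, and exact loss form are assumptions, not the paper's implementation:

```python
import numpy as np

def cross_slide_contrastive(emb, labels, temp=0.5):
    """Supervised contrastive loss over slide embeddings: pull slides with
    the same label together, push all other slides apart."""
    # cosine similarities between slide-level embeddings, temperature-scaled
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T / temp
    loss, n = 0.0, len(labels)
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue
        others = np.delete(sim[i], i)  # exclude self-similarity
        log_denom = np.log(np.exp(others).sum())
        # low loss when positives dominate the similarity mass
        loss += log_denom - np.mean([sim[i, j] for j in pos])
    return loss / n
```

Embeddings clustered by disease type yield a lower loss than the same embeddings with shuffled labels, which is exactly the gradient signal that pulls same-type WSIs closer.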

NeurIPS Conference 2021 Conference Paper

Shifted Chunk Transformer for Spatio-Temporal Representational Learning

  • Xuefan Zha
  • Wentao Zhu
  • Lv Xun
  • Sen Yang
  • Ji Liu

Spatio-temporal representational learning has been widely adopted in various fields such as action recognition, video object segmentation, and action anticipation. Previous spatio-temporal representational learning approaches primarily employ ConvNets or sequential models, e.g., LSTMs, to learn the intra-frame and inter-frame features. Recently, Transformer models have successfully dominated the study of natural language processing (NLP), image classification, etc. However, pure-Transformer-based spatio-temporal learning can be prohibitively costly in memory and computation when extracting fine-grained features from a tiny patch. To tackle the training difficulty and enhance spatio-temporal learning, we construct a shifted chunk Transformer with pure self-attention blocks. Leveraging recent efficient Transformer designs in NLP, this shifted chunk Transformer can learn hierarchical spatio-temporal features from a local tiny patch to a global video clip. Our shifted self-attention can also effectively model complicated inter-frame variances. Furthermore, we build a clip encoder based on the Transformer to model long-term temporal dependencies. We conduct thorough ablation studies to validate each component and hyper-parameter in our shifted chunk Transformer, and it outperforms previous state-of-the-art approaches on Kinetics-400, Kinetics-600, UCF101, and HMDB51.

NeurIPS Conference 2021 Conference Paper

TNASP: A Transformer-based NAS Predictor with a Self-evolution Framework

  • Shun Lu
  • Jixiang Li
  • Jianchao Tan
  • Sen Yang
  • Ji Liu

Predictor-based Neural Architecture Search (NAS) continues to be an important topic because it aims to mitigate the time-consuming search procedure of traditional NAS methods. A promising performance predictor determines the quality of the final searched models in predictor-based NAS methods. Most existing predictor-based methodologies train model-based predictors under a proxy dataset setting, which may suffer from accuracy decline and generalization problems, mainly due to their poor abilities to represent the spatial topology information of the graph-structured data. Besides the poor encoding of spatial topology information, these works did not take advantage of temporal information such as historical evaluations during training. Thus, we propose a Transformer-based NAS performance predictor, associated with a Laplacian matrix based positional encoding strategy, which better represents topology information and achieves better performance than previous state-of-the-art methods on the NAS-Bench-101, NAS-Bench-201, and DARTS search spaces. Furthermore, we also propose a self-evolution framework that can fully utilize temporal information as guidance. This framework iteratively incorporates the evaluations of previously predicted results as constraints into the current optimization iteration, thus further improving the performance of our predictor. The framework is model-agnostic and can thus enhance performance on various backbone structures for the prediction task. Our proposed method helped us rank 2nd among all teams in CVPR 2021 NAS Competition Track 2: Performance Prediction Track.
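Both TNASP and PINAT above describe a Laplacian-matrix positional encoding for architecture graphs. A minimal sketch of one common recipe for such an encoding (the symmetrization of the DAG adjacency and the choice of eigenvectors are my assumptions, not the papers' exact code):

```python
import numpy as np

def laplacian_pe(adj, k):
    """Use the k smallest non-trivial eigenvectors of the graph Laplacian
    as positional encodings for the nodes of an architecture cell."""
    a = np.maximum(adj, adj.T).astype(float)  # symmetrize the DAG adjacency
    lap = np.diag(a.sum(1)) - a               # combinatorial Laplacian L = D - A
    vals, vecs = np.linalg.eigh(lap)          # eigenvalues in ascending order
    return vecs[:, 1:k + 1]                   # drop the trivial constant eigenvector
```

Unlike sequence-position indices, these encodings reflect the graph's connectivity, which is the topology information the predictors above aim to capture.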

IS Journal 2020 Journal Article

Battlefield Image Situational Awareness Application Based on Deep Learning

  • Hui Peng
  • Yifan Zhang
  • Sen Yang
  • Bin Song

With the rapid development of information technology, it has become an important topic to construct a situational awareness system that can independently mine data and information as well as perceive environmental situations by using deep learning. First, this article introduces the structure of convolutional neural networks (CNNs) and the You Only Look Once (YOLO) model. It then analyzes the structure and function of a battlefield situational awareness system and concludes that, within the whole situational awareness system, the discovery, classification, and localization of situational elements, namely object targets, is the foundation and key to realizing its function. On this basis, this article establishes a battlefield situational awareness model based on the YOLO model. Finally, five common objects on the battlefield (helicopter gunship, missile, tank, soldier, and gun) are classified and located. The CNN-based YOLO model is used to process the input image, directly producing the position, category, and corresponding confidence probability of all objects in the image, which realizes end-to-end learning, greatly improves the speed of target detection, and lays a foundation for assessing the battlefield situation.

UAI Conference 2019 Conference Paper

Learning with Non-Convex Truncated Losses by SGD

  • Yi Xu 0008
  • Shenghuo Zhu
  • Sen Yang
  • Chi Zhang 0012
  • Rong Jin 0001
  • Tianbao Yang

Learning with a convex loss function has been a dominating paradigm for many years. It remains an interesting question how non-convex loss functions help improve the generalization of learning with broad applicability. In this paper, we study a family of objective functions formed by truncating traditional loss functions, which is applicable to both shallow learning and deep learning. Truncating loss functions has the potential to be less vulnerable and more robust to large noise in observations, which could be adversarial. More importantly, it is a generic technique that does not assume knowledge of the noise distribution. To justify non-convex learning with truncated losses, we establish excess risk bounds of empirical risk minimization based on truncated losses for heavy-tailed output, and the statistical error of an approximate stationary point found by the stochastic gradient descent (SGD) method. Our experiments on shallow and deep learning for regression with outliers, corrupted data, and heavy-tailed noise further justify the proposed method.
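A concrete instance of the truncation idea is a capped squared loss: beyond a threshold, the penalty is constant, so gross outliers contribute zero gradient. The threshold and the choice of squared loss as the base are illustrative; the paper studies a general family:

```python
import numpy as np

def truncated_sq_loss(pred, target, tau=1.0):
    """Squared loss capped at tau. Inliers see the usual quadratic penalty;
    outliers beyond the cap contribute a constant tau (non-convex, robust)."""
    return np.minimum((pred - target) ** 2, tau)
```

Because the capped region is flat, an SGD step through this loss ignores samples whose residual exceeds sqrt(tau), which is what makes the estimator robust to large, possibly adversarial noise.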

IJCAI Conference 2019 Conference Paper

On the Convergence of (Stochastic) Gradient Descent with Extrapolation for Non-Convex Minimization

  • Yi Xu
  • Zhuoning Yuan
  • Sen Yang
  • Rong Jin
  • Tianbao Yang

Extrapolation is a well-known technique for solving convex optimization problems and variational inequalities, and it has recently attracted attention for non-convex optimization. Several recent works have empirically shown its success in some machine learning tasks. However, it has not been analyzed for non-convex minimization, and a gap remains between theory and practice. In this paper, we analyze gradient descent and stochastic gradient descent with extrapolation for finding an approximate first-order stationary point in smooth non-convex optimization problems. Our convergence upper bounds show that the algorithms with extrapolation converge faster than their counterparts without extrapolation.
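The extrapolation scheme can be sketched in a few lines: evaluate the gradient at an extrapolated point y = x + beta*(x - x_prev) rather than at x itself. This is a toy deterministic version on a smooth objective; the step sizes are illustrative, not the paper's analyzed constants:

```python
def gd_extrapolation(grad, x0, eta=0.1, beta=0.5, steps=200):
    """Gradient descent with extrapolation: the gradient is taken at the
    extrapolated point y = x + beta * (x - x_prev)."""
    x_prev, x = x0, x0
    for _ in range(steps):
        y = x + beta * (x - x_prev)       # extrapolation using the last move
        x_prev, x = x, y - eta * grad(y)  # gradient step at the extrapolated point
    return x
```

On the quadratic f(x) = x^2 this converges to the stationary point x = 0, with the extrapolation term acting like momentum.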

ICML Conference 2019 Conference Paper

On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization

  • Hao Yu
  • Rong Jin 0001
  • Sen Yang

Recent developments in large-scale distributed machine learning applications, e.g., deep neural networks, benefit enormously from advances in distributed non-convex optimization techniques, e.g., distributed Stochastic Gradient Descent (SGD). A series of recent works study the linear speedup property of distributed SGD variants with reduced communication. The linear speedup property enables us to scale out computing capability by adding more computing nodes to our system. Reduced communication complexity is desirable since communication overhead is often the performance bottleneck in distributed systems. Recently, momentum methods have been more and more widely adopted by practitioners to train machine learning models, since they often converge faster and generalize better. However, it remains unclear whether any distributed momentum SGD possesses the same linear speedup property as distributed SGD and has reduced communication complexity. This paper fills the gap by considering a distributed communication-efficient momentum SGD method and proving its linear speedup property.

AAAI Conference 2019 Conference Paper

Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning

  • Hao Yu
  • Sen Yang
  • Shenghuo Zhu

In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up the training process by using multiple workers. It uses multiple workers to sample local stochastic gradients in parallel, aggregates all gradients in a single server to obtain the average, and updates each worker's local model using an SGD update with the averaged gradient. Ideally, parallel mini-batch SGD can achieve a linear speed-up of the training time (with respect to the number of workers) compared with SGD over a single worker. However, such linear scalability in practice is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which periodically averages individual models trained over parallel workers, is another common practice used for distributed training of deep neural networks (Zinkevich et al. 2010; McDonald, Hall, and Mann 2010). Compared with parallel mini-batch SGD, the communication overhead of model averaging is significantly reduced. Impressively, extensive experimental work has verified that model averaging can still achieve a good speed-up of the training time as long as the averaging interval is carefully controlled. However, it remains a mystery in theory why such a simple heuristic works so well. This paper provides a thorough and rigorous theoretical study on why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.
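The model-averaging scheme studied here can be mimicked on toy objectives: each worker runs several local gradient steps on its own quadratic, and models are averaged once per communication round. The function name, the per-worker quadratics, and the constants are illustrative, not the paper's setup:

```python
import numpy as np

def parallel_restarted_sgd(centers, lr=0.1, local_steps=5, rounds=50):
    """Each worker i minimizes f_i(w) = (w - centers[i])**2 with local
    gradient steps; every `local_steps` steps the models are averaged
    (one communication round). Returns the final (shared) model."""
    workers = np.zeros_like(centers, dtype=float)  # all workers start at w = 0
    for _ in range(rounds):
        for _ in range(local_steps):
            workers -= lr * 2.0 * (workers - centers)  # local gradient steps
        workers[:] = workers.mean()                    # periodic model averaging
    return workers[0]
```

With averaging every few steps instead of every step, the number of communication rounds drops by a factor of `local_steps`, yet the iterates still converge to the minimizer of the average objective (the mean of the workers' centers).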

NeurIPS Conference 2012 Conference Paper

Multi-task Vector Field Learning

  • Binbin Lin
  • Sen Yang
  • Chiyuan Zhang
  • Jieping Ye
  • Xiaofei He

Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously and identifying the shared information among tasks. Most existing MTL methods focus on learning linear models under the supervised setting. We propose a novel semi-supervised and nonlinear approach for MTL using vector fields. A vector field is a smooth mapping from the manifold to its tangent spaces, which can be viewed as a directional derivative of functions on the manifold. We argue that vector fields provide a natural way to exploit the geometric structure of data as well as the shared differential structure of tasks, both of which are crucial for semi-supervised multi-task learning. In this paper, we develop multi-task vector field learning (MTVFL), which learns the prediction functions and the vector fields simultaneously. MTVFL has the following key properties: (1) the vector fields we learn are close to the gradient fields of the prediction functions; (2) within each task, the vector field is required to be as parallel as possible and is expected to span a low-dimensional subspace; (3) the vector fields from all tasks share a low-dimensional subspace. We formalize our idea in a regularization framework and provide a convex relaxation method to solve the original non-convex problem. Experimental results on synthetic and real data demonstrate the effectiveness of our proposed approach.