Arrow Research search

Author name cluster

Yao Lu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

32 papers
2 author rows

Possible papers

32

AAAI Conference 2026 Conference Paper

SepPrune: Structured Pruning for Efficient Deep Speech Separation

  • Yuqi Li
  • Kai Li
  • Xin Yin
  • Zhifei Yang
  • Zeyu Dong
  • Zhengtao Yao
  • Haoyan Xu
  • Yingli Tian

Although deep learning has substantially advanced speech separation in recent years, most existing studies continue to prioritize separation quality while overlooking computational efficiency, an essential factor for low-latency speech processing in real-time applications. In this paper, we propose SepPrune, the first structured pruning framework specifically designed to compress deep speech separation models and reduce their computational cost. SepPrune begins by analyzing the computational structure of a given model to identify layers with the highest computational burden. It then introduces a differentiable masking strategy to enable gradient-driven channel selection. Based on the learned masks, SepPrune prunes redundant channels and fine-tunes the remaining parameters to recover performance. Extensive experiments demonstrate that this learnable pruning paradigm yields substantial advantages for channel pruning in speech separation models, outperforming existing methods. Notably, a model pruned with SepPrune can recover 85% of the performance of a pre-trained model (trained over hundreds of epochs) with only one epoch of fine-tuning, and achieves convergence 36x faster than training from scratch.
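The gradient-driven channel selection described in this abstract can be illustrated with a generic toy, not the authors' SepPrune code: per-channel logits pass through a sigmoid to form soft gates, a hypothetical channel-utility score (the random vector `importance` below) rewards keeping useful channels, an L1-style penalty `lam` pushes gates toward zero, and channels whose learned gate stays low are pruned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_channels = 8
logits = np.zeros(n_channels)               # learnable per-channel mask logits
importance = rng.uniform(0, 1, n_channels)  # hypothetical per-channel utility
lam = 0.3                                   # sparsity penalty weight

# Gradient ascent on the toy per-channel objective importance*g - lam*g,
# where g = sigmoid(logits); d/dlogits = (importance - lam) * g * (1 - g).
for _ in range(500):
    g = sigmoid(logits)
    logits += 1.0 * (importance - lam) * g * (1 - g)

gates = sigmoid(logits)
keep = gates > 0.5   # prune channels whose learned gate collapsed toward zero
```

In a real pruning pass the gates would multiply feature maps inside the network and the utility signal would come from the task loss; in this sketch the kept set simply converges to the channels whose utility exceeds the penalty.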

AAAI Conference 2026 Conference Paper

TimeCAP: A Channel-Aware Pre-Training Framework for Multivariate Time Series Forecasting

  • Chuanru Ren
  • Yao Lu
  • Tianjin Huang
  • Haowen Zheng
  • Hengde Zhu
  • Yunyin Li
  • Hengxiao Li
  • Lu Liu

Amid recent advances in multivariate time series forecasting, self-supervised learning has emerged as a promising paradigm for deriving transferable knowledge from multi-domain data. Despite its effectiveness, existing approaches exhibit two critical limitations: (1) underestimating the significance of multivariate dependencies in learning generalizable representations and (2) failing to reconcile the complementary strengths of autoregressive and one-shot generative paradigms. In this work, we propose TimeCAP, a novel channel-aware pre-training framework that internalizes latent causal relationships among variables inherent in multi-domain data and effectively transfers the acquired knowledge to downstream applications. Technically, we present a flexible channel-grouping learning approach, complemented by an adaptive meta-routing mechanism, enabling TimeCAP to recognize intra-group local patterns in parallel while maintaining global coherence. Intra- and inter-group multivariate dependencies are captured through self- and cross-attention with a channel-aware mask, which strictly confines interactions to time-aligned, fine-grained multivariate tokens. To seamlessly unify the two advanced generative paradigms, we propose a novel dynamic dual-head decoding and optimization strategy, empowering TimeCAP to leverage critical dependencies in the output series while avoiding cumulative errors over time. In the few-shot evaluation, TimeCAP achieves average MSE and MAE reductions of 11.8% and 6% over leading baselines, while also outperforming state-of-the-art models in full-shot and zero-shot settings by large margins.
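A channel-aware attention mask of the kind this abstract describes can be sketched generically (this is an illustrative construction, not the TimeCAP implementation): tokens attend freely within their own channel group, while cross-group attention is restricted to time-aligned tokens.

```python
import numpy as np

def channel_aware_mask(groups, n_time):
    """Boolean attention mask for tokens ordered (channel, time).

    groups : group id of each channel, e.g. [0, 0, 1]
    n_time : number of time steps per channel
    True means attention between the two tokens is allowed.
    """
    n = len(groups) * n_time
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        gi, ti = divmod(i, n_time)
        for j in range(n):
            gj, tj = divmod(j, n_time)
            # intra-group: unrestricted; inter-group: time-aligned tokens only
            mask[i, j] = (groups[gi] == groups[gj]) or (ti == tj)
    return mask
```

In attention, positions where the mask is False would be set to -inf before the softmax, so cross-group interactions only occur between tokens at the same time step.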

NeurIPS Conference 2025 Conference Paper

A Set of Generalized Components to Achieve Effective Poison-only Clean-label Backdoor Attacks with Collaborative Sample Selection and Triggers

  • Zhixiao Wu
  • Yao Lu
  • Jie Wen
  • Hao Sun
  • Qi Zhou
  • Guangming Lu

Poison-only Clean-label Backdoor Attacks (PCBAs) aim to covertly inject attacker-desired behavior into DNNs by merely poisoning the dataset without changing the labels. To effectively implant a backdoor, multiple triggers have been proposed for various attack requirements of Attack Success Rate (ASR) and stealthiness. Additionally, sample selection enhances clean-label backdoor attacks' ASR by meticulously selecting "hard" samples to poison instead of random ones. Current methods, however, 1) usually handle sample selection and triggers in isolation, leading to severely limited improvements in both ASR and stealthiness. Consequently, attacks exhibit unsatisfactory performance on evaluation metrics when converted to PCBAs via a mere stacking of methods. We therefore explore the bi-directional collaborative relations between sample selection and triggers to address this dilemma. 2) Owing to the strong specificity of triggers, a simple combination of sample selection and triggers fails to substantially enhance both evaluation metrics while preserving generalization across various attacks. We therefore propose a set of components that significantly improve both stealthiness and ASR based on the commonalities of attacks. Specifically, Component A ascertains two critical selection factors and combines them appropriately based on the trigger scale to select more reasonable "hard" samples, improving ASR. Component B selects samples similar to those implanted with the relevant trigger to promote stealthiness. Component C reassigns trigger poisoning intensity across RGB colors, exploiting the human visual system's distinct sensitivity to each color for higher ASR, with stealthiness ensured by sample selection including Component B. Furthermore, all components can be strategically integrated into diverse PCBAs, enabling tailored solutions that balance ASR and stealthiness enhancement for specific attack requirements. Extensive experiments demonstrate the superiority of our components in stealthiness, ASR, and generalization. Our code will be released as soon as possible.

AAAI Conference 2025 Conference Paper

ALRMR-GEC: Adjusting Learning Rate Based on Memory Rate to Optimize the Edit Scorer for Grammatical Error Correction

  • Zhixiao Wu
  • Yao Lu
  • Jie Wen
  • Guangming Lu

Edit-based approaches for Grammatical Error Correction (GEC) have attracted considerable attention due to their clear explanations of the correction process and rapid inference. By exploring the characteristics of generalized and specific knowledge learning for GEC, we discover that efficiently training GEC systems with satisfactory generalization capacity favors more generalized knowledge over specific knowledge. Current gradient-based methods for training GEC systems, however, usually prioritize minimizing training loss over generalization loss. This paper proposes the strategy of Adjusting Learning Rate Based on Memory Rate to optimize the edit-based GEC scorer (ALRMR-GEC). Specifically, we introduce the memory rate, a novel metric, to provide an explicit indicator of the model's state of learning generalized and specific knowledge, which can effectively guide the GEC system to adjust the learning rate in a timely manner. Extensive experiments, conducted by optimizing the published edit scorer on the BEA2019 dataset, show that our ALRMR-GEC significantly enhances model generalization with stable and satisfactory performance nearly irrespective of the initial learning rate selection. Moreover, our method can accelerate training more than tenfold in certain cases. Finally, the experiments indicate that the memory rate introduced in ALRMR-GEC guides the GEC edit scorer to learn more generalized knowledge.

NeurIPS Conference 2025 Conference Paper

ArchPower: Dataset for Architecture-Level Power Modeling of Modern CPU Design

  • Qijun Zhang
  • Yao Lu
  • Mengming Li
  • Shang Liu
  • Zhiyao Xie

Power is the primary design objective of large-scale integrated circuits (ICs), especially for complex modern processors (i.e., CPUs). Accurate CPU power evaluation requires designers to go through the whole time-consuming IC implementation process, easily taking months. At the early design stage (e.g., architecture-level), classical power models are notoriously inaccurate. Recently, ML-based architecture-level power models have been proposed to boost accuracy, but the data availability is a severe challenge. Currently, there is no open-source dataset for this important ML application. A typical dataset generation process involves correct CPU design implementation and repetitive execution of power simulation flows, requiring significant design expertise, engineering effort, and execution time. Even private in-house datasets often fail to reflect realistic CPU design scenarios. In this work, we propose ArchPower, the first open-source dataset for architecture-level processor power modeling. We go through complex and realistic design flows to collect the CPU architectural information as features and the ground-truth simulated power as labels. Our dataset includes 200 CPU data samples, collected from 25 different CPU configurations when executing 8 different workloads. There are more than 100 architectural features in each data sample, including both hardware and event parameters. The label of each sample provides fine-grained power information, including the total design power and the power for each of the 11 components. Each power value is further decomposed into four fine-grained power groups: combinational logic power, sequential logic power, memory power, and clock power. ArchPower is available at https://github.com/hkust-zhiyao/ArchPower.

TMLR Journal 2025 Journal Article

AutoAnnotator: A Collaborative Annotation Framework for Large and Small Language Models

  • Yao Lu
  • Ji Zhaiyuan
  • Jiawei Du
  • Yu Shanqing
  • Qi Xuan
  • Joey Tianyi Zhou

Although the annotation paradigm based on Large Language Models (LLMs) has made significant breakthroughs in recent years, its practical deployment still faces two core bottlenecks: first, the cost of calling commercial APIs for large-scale annotation is very high; second, in scenarios that require fine-grained semantic understanding, such as sentiment classification and toxicity classification, the annotation accuracy of LLMs is even lower than that of Small Language Models (SLMs) dedicated to the field. To address these problems, we propose a new paradigm of multi-model cooperative annotation and, based on it, design a fully automatic annotation framework, AutoAnnotator. Specifically, AutoAnnotator consists of two layers. The upper meta-controller layer uses the generation and reasoning capabilities of LLMs to select SLMs for annotation, automatically generate annotation code, and verify difficult samples; the lower task-specialist layer consists of multiple SLMs that perform annotation through multi-model voting. In addition, we use the difficult samples identified by the meta-controller layer's secondary review as a reinforcement learning set and fine-tune the SLMs in stages through a continual learning strategy, thereby improving their generalization. Extensive experiments show that AutoAnnotator outperforms existing open-source/API LLMs in zero-shot, one-shot, CoT, and majority-voting settings. Notably, AutoAnnotator reduces annotation cost by 74.15% compared to directly annotating with GPT-3.5-turbo, while still improving accuracy by 6.21%. The code is available at https://github.com/Zhaiyuan-Ji/AutoAnnotator.

NeurIPS Conference 2025 Conference Paper

Benford’s Curse: Tracing Digit Bias to Numerical Hallucination in LLMs

  • Jiandong Shao
  • Yao Lu
  • Jianfei Yang

Large Language Models (LLMs) exhibit impressive performance on complex reasoning tasks, yet they frequently fail on basic numerical problems, producing incorrect outputs. Inspired by Benford's Law, a statistical pattern in which lower digits occur more frequently as leading digits, we hypothesize that the skewed digit distributions of web-collected corpora may be learned by LLMs during pretraining, leading to biased numerical generation. To investigate this hypothesis, we first examine whether digit frequencies in a pretraining corpus (OLMo2) follow Benford's law. We then construct an evaluation benchmark in which the ground-truth digits are uniformly distributed within each of seven numerical reasoning tasks. Our evaluation results demonstrate that leading open-source LLMs show a consistent pattern of digit bias that resembles Benford's law. Through logit-lens tracing and neuron-level dissection, we identify that this bias arises predominantly from a small subset of highly digit-selective feed-forward network (FFN) neurons in the deeper layers. Finally, we demonstrate that pruning these neurons mitigates imbalanced overgeneration and partially corrects erroneous outputs, providing causal evidence that fine-grained pretraining digit bias can propagate into model behavior. Our findings reveal a fundamental connection between corpus-level statistics and symbolic failure modes in LLMs, offering a new lens for diagnosing and mitigating hallucinations in numerical tasks.
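Benford's law itself is easy to state and check: the probability of leading digit d is log10(1 + 1/d), so digit 1 leads roughly 30.1% of the time while digit 9 leads under 5%. A small sketch (not the paper's benchmark code) compares these probabilities against the leading digits of a classic Benford-distributed sequence, the powers of 2:

```python
import math

# Benford's law: P(d) = log10(1 + 1/d) for leading digit d in 1..9
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Powers of 2 are a textbook Benford-distributed sequence; measure the
# empirical leading-digit frequencies over the first 1000 powers.
leading = [int(str(2 ** n)[0]) for n in range(1, 1001)]
freq = {d: leading.count(d) / len(leading) for d in range(1, 10)}
```

The empirical frequencies track the log10(1 + 1/d) curve closely, which is the kind of skew the paper argues LLMs absorb from web corpora.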

NeurIPS Conference 2025 Conference Paper

Federated Dialogue-Semantic Diffusion for Emotion Recognition under Incomplete Modalities

  • Xihang Qiu
  • Jiarong Cheng
  • Yuhao Fang
  • Wanpeng Zhang
  • Yao Lu
  • Ye Zhang
  • Chun Li

Multimodal Emotion Recognition in Conversations (MERC) enhances emotional understanding through the fusion of multimodal signals. However, unpredictable modality absence in real-world scenarios significantly degrades the performance of existing methods. Conventional missing-modality recovery approaches, which depend on training with complete multimodal data, often suffer from semantic distortion under extreme data distributions, such as fixed-modality absence. To address this, we propose the Federated Dialogue-guided and Semantic-Consistent Diffusion (FedDISC) framework, pioneering the integration of federated learning into missing-modality recovery. By federated aggregation of modality-specific diffusion models trained on clients and broadcasting them to clients missing corresponding modalities, FedDISC overcomes single-client reliance on modality completeness. Additionally, the DISC-Diffusion module ensures consistency in context, speaker identity, and semantics between recovered and available modalities, using a Dialogue Graph Network to capture conversational dependencies and a Semantic Conditioning Network to enforce semantic alignment. We further introduce a novel Alternating Frozen Aggregation strategy, which cyclically freezes recovery and classifier modules to facilitate collaborative optimization. Extensive experiments on the IEMOCAP, CMUMOSI, and CMUMOSEI datasets demonstrate that FedDISC achieves superior emotion classification performance across diverse missing modality patterns, outperforming existing approaches.

NeurIPS Conference 2025 Conference Paper

Proper Hölder-Kullback Dirichlet Diffusion: A Framework for High Dimensional Generative Modeling

  • Wanpeng Zhang
  • Yuhao Fang
  • Xihang Qiu
  • Jiarong Cheng
  • Jialong Hong
  • Bin Zhai
  • Qing Zhou
  • Yao Lu

Diffusion-based generative models have long depended on Gaussian priors, with little exploration of alternative distributions. We introduce a Proper Hölder-Kullback Dirichlet framework that uses time-varying multiplicative transformations to define both forward and reverse diffusion processes. Moving beyond conventional reweighted evidence lower bounds (ELBO) or Kullback–Leibler upper bounds (KLUB), we propose two novel divergence measures: the Proper Hölder Divergence (PHD) and the Proper Hölder–Kullback (PHK) divergence, the latter designed to restore symmetry missing in existing formulations. When optimizing our Dirichlet diffusion model with PHK, we achieve a Fréchet Inception Distance (FID) of 2.78 on unconditional CIFAR-10. Comprehensive experiments on natural-image datasets validate the generative strengths of the model and confirm PHK's effectiveness in model training. These contributions expand the diffusion-model family with principled non-Gaussian processes and effective optimization tools, offering new avenues for versatile, high-fidelity generative modeling.

NeurIPS Conference 2025 Conference Paper

Scaling RL to Long Videos

  • Yukang Chen
  • Wei Huang
  • Baifeng Shi
  • Qinghao Hu
  • Hanrong Ye
  • Ligeng Zhu
  • Zhijian Liu
  • Pavlo Molchanov

We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.1% and 71.1% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-7B across multiple benchmarks. Moreover, LongVILA-R1-7B supports processing up to 8,192 video frames per video with configurable FPS settings. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. In addition, we publicly release our training system, which supports RL training on various modalities (video, text, and audio), various models (the VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames). Code and models are available at https://github.com/NVlabs/Long-RL

TMLR Journal 2025 Journal Article

Wolf: Dense Video Captioning with a World Summarization Framework

  • Boyi Li
  • Ligeng Zhu
  • Ran Tian
  • Shuhan Tan
  • Yuxiao Chen
  • Yao Lu
  • Yin Cui
  • Sushant Veer

We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, in comparison with GPT-4V, Wolf improves CapScore (caption quality) by 55.6% and CapScore (caption similarity) by 77.4% on challenging driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment.

NeurIPS Conference 2025 Conference Paper

WorldModelBench: Judging Video Generation Models As World Models

  • Dacheng Li
  • Yunhao Fang
  • Yukang Chen
  • Shuo Yang
  • Shiyi Cao
  • Justin Wong
  • Michael Luo
  • Xiaolong Wang

Video generation models have rapidly progressed, positioning themselves as video world models capable of supporting decision-making applications like robotics and autonomous driving. However, current benchmarks fail to rigorously evaluate these claims, focusing only on general video quality and ignoring factors important to world models, such as physics adherence. To bridge this gap, we propose WorldModelBench, a benchmark designed to evaluate the world modeling capabilities of video generation models in application-driven domains. WorldModelBench offers two key advantages: (1) Sensitivity to nuanced world modeling violations: by incorporating instruction-following and physics-adherence dimensions, WorldModelBench detects subtle violations, such as irregular changes in object size that breach the law of mass conservation; such issues are overlooked by prior benchmarks. (2) Alignment with large-scale human preferences: we crowd-source 67K human labels to accurately measure 14 frontier models. Using our high-quality human labels, we further fine-tune an accurate 2B-parameter judger to automate the evaluation procedure, achieving 9.9% lower error than GPT-4o in predicting world modeling violations. In addition, we demonstrate that training to align with human annotations by maximizing the rewards from the judger noticeably improves world modeling capability. The dataset is hosted on HuggingFace at https://huggingface.co/datasets/Efficient-Large-Model/worldmodelbench. The code to run the evaluation is available at https://github.com/WorldModelBench-Team/WorldModelBench.

IJCAI Conference 2024 Conference Paper

Implicit Prompt Learning for Image Denoising

  • Yao Lu
  • Bo Jiang
  • Guangming Lu
  • Bob Zhang

Recently, various deep denoising methods have been proposed to solve the insufficient feature problem in image denoising. These methods fall mainly into two categories: (1) injecting learnable tensors into the denoising backbone to supplement features, which is effective to some extent but may cause serious over-fitting; (2) using diverse natural images from large image datasets to synthesize noisy images and pre-train denoising models, which improves generalization but requires a large model size and expensive training. To address these issues, this paper proposes the Implicit Prompt Learning for Image Denoising (IPLID) method to flexibly generate adaptive prompts without meticulously designing them. Specifically, we first introduce an efficient Linear Prompt (LP) block with ultra-few parameters to produce dynamic prompts for both different stages and different samples in the denoising procedure. We further propose an efficient Compact Feature Fusion (CFF) block that processes the previous multi-level prompted denoising features to reconstruct the denoised images. Finally, to produce satisfactory prompts and denoising performance efficiently and effectively, a Gradient Accumulation (GA) learning scheme is proposed. Experiments on multiple benchmarks show that the proposed IPLID achieves competitive results with only 1 percent of the pre-trained backbone parameters, outperforming classical denoising methods in both efficiency and quality of restored images.
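Gradient accumulation, the idea behind the GA learning scheme, is a standard trick: gradients from several micro-batches are averaged before a single optimizer step, emulating a larger effective batch under a fixed memory budget. A minimal sketch on a toy least-squares problem (illustrative only, not the IPLID training code):

```python
import numpy as np

def grad(w, x, y):
    # gradient of the per-sample loss 0.5 * (w*x - y)^2 with respect to w
    return (w * x - y) * x

rng = np.random.default_rng(1)
x = rng.normal(size=32)
y = 3.0 * x                    # toy regression data with true weight w = 3
w, lr, k = 0.0, 0.1, 4         # k micro-batches of 8 samples per update

for _ in range(200):
    acc = 0.0
    for i in range(k):         # accumulate mean gradients over k micro-batches
        xb, yb = x[i * 8:(i + 1) * 8], y[i * 8:(i + 1) * 8]
        acc += grad(w, xb, yb).mean()
    w -= lr * (acc / k)        # one optimizer step per k micro-batches
```

The averaged update is identical to a single step on the full 32-sample batch, which is why the trick trades extra forward/backward passes for memory rather than accuracy.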

ICRA Conference 2024 Conference Paper

Open X-Embodiment: Robotic Learning Datasets and RT-X Models: Open X-Embodiment Collaboration

  • Abby O'Neill
  • Abdul Rehman
  • Abhiram Maddukuri
  • Abhishek Gupta 0004
  • Abhishek Padalkar
  • Abraham Lee
  • Acorn Pooley
  • Agrim Gupta

Large, high-capacity models trained on diverse datasets have shown remarkable success in efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a "generalist" X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160,266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. The project website is robotics-transformer-x.github.io.

IJCAI Conference 2024 Conference Paper

QFormer: An Efficient Quaternion Transformer for Image Denoising

  • Bo Jiang
  • Yao Lu
  • Guangming Lu
  • Bob Zhang

Since Deep Convolutional Neural Networks (DCNNs) and Vision Transformers perform well in learning generalizable image priors from large-scale data, these models have been widely used in image denoising tasks. However, vanilla DCNNs and Transformers suffer from two problems. First, they only accumulate the output along the channel axis, ignoring the internal relationships among channels. This results in severely inadequate color structure representation retrieved from color images. Second, DCNN- or Transformer-based image denoising models usually have a large number of parameters, high computational complexity, and slow inference speed. To resolve these issues, this paper proposes a highly efficient Quaternion Transformer (QFormer) for image denoising. Specifically, the proposed Quaternion Transformer Block (QTB) simplifies the typical Transformer from a multi-branch structure to an elaborately sequential structure built mainly on quaternion transformations, alternately capturing both long-range dependencies and local contextual features with color structure information. Furthermore, the proposed QTB avoids the considerable element-wise multiplications of computing the self-attention matrices. Thus, our QTB can significantly reduce computational complexity, and its sequential structure can further improve practical inference speed. Comprehensive experiments demonstrate that the proposed QFormer produces state-of-the-art results in both denoising performance and efficiency. We hope that our work will encourage further research exploring the Quaternion Transformer architecture for image denoising tasks.
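The quaternion transformations underlying such layers rest on the Hamilton product, which mixes all four components of each operand and thereby couples channels (e.g., the RGB planes plus one extra component) instead of treating them independently. A minimal sketch of the product itself, not QFormer's actual implementation:

```python
def qmul(a, b):
    """Hamilton product of quaternions a, b given as (w, x, y, z) tuples."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,   # real part
            w1*x2 + x1*w2 + y1*z2 - z1*y2,   # i component
            w1*y2 - x1*z2 + y1*w2 + z1*x2,   # j component
            w1*z2 + x1*y2 - y1*x2 + z1*w2)   # k component
```

Because every output component depends on all four input components, a quaternion-valued linear layer built from this product shares and mixes information across channels with a quarter of the free parameters of an unconstrained real-valued layer of the same size.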

NeurIPS Conference 2024 Conference Paper

UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-World Document Analysis

  • Yulong Hui
  • Yao Lu
  • Huanchen Zhang

The use of Retrieval-Augmented Generation (RAG) has improved how Large Language Models (LLMs) collaborate with external data, yet significant challenges remain in real-world scenarios. In areas such as academic literature and finance question answering, data are often found as raw text and tables in HTML or PDF formats, which can be lengthy and highly unstructured. In this paper, we introduce a benchmark suite, Unstructured Document Analysis (UDA), that comprises 2,965 real-world documents and 29,590 expert-annotated Q&A pairs. We revisit popular LLM- and RAG-based solutions for document analysis and evaluate the design choices and answer quality across multiple document domains and diverse query types. Our evaluation yields interesting findings and highlights the importance of data parsing and retrieval. We hope our benchmark can shed light on and better serve real-world document analysis applications. The benchmark suite and code can be found at https://github.com/qinchuanhui/UDA-Benchmark

NeurIPS Conference 2023 Conference Paper

Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents

  • Wenlong Huang
  • Fei Xia
  • Dhruv Shah
  • Danny Driess
  • Andy Zeng
  • Yao Lu
  • Pete Florence
  • Igor Mordatch

Recent progress in large language models (LLMs) has demonstrated the ability to learn and leverage Internet-scale knowledge through pre-training with autoregressive models. Unfortunately, applying such models to settings with embodied agents, such as robots, is challenging due to their lack of experience with the physical world, inability to parse non-language observations, and ignorance of rewards or safety constraints that robots may require. On the other hand, language-conditioned robotic policies that learn from interaction data can provide the necessary grounding that allows the agent to be correctly situated in the real world, but such policies are limited by the lack of high-level semantic understanding due to the limited breadth of the interaction data available for training them. Thus, if we want to make use of the semantic knowledge in a language model while still situating it in an embodied setting, we must construct an action sequence that is both likely according to the language model and also realizable according to grounded models of the environment. We frame this as a problem similar to probabilistic filtering: decode a sequence that both has high probability under the language model and high probability under a set of grounded model objectives. We demonstrate how such grounded models can be obtained across three simulation and real-world domains, and that the proposed decoding strategy is able to solve complex, long-horizon embodiment tasks in a robotic setting by leveraging the knowledge of both models.

NeurIPS Conference 2023 Conference Paper

Waymax: An Accelerated, Data-Driven Simulator for Large-Scale Autonomous Driving Research

  • Cole Gulino
  • Justin Fu
  • Wenjie Luo
  • George Tucker
  • Eli Bronstein
  • Yiren Lu
  • Jean Harb
  • Xinlei Pan

Simulation is an essential tool to develop and benchmark autonomous vehicle planning software in a safe and cost-effective manner. However, realistic simulation requires accurate modeling of multi-agent interactive behaviors to be trustworthy, behaviors which can be highly nuanced and complex. To address these challenges, we introduce Waymax, a new data-driven simulator for autonomous driving in multi-agent scenes, designed for large-scale simulation and testing. Waymax uses publicly-released, real-world driving data (e.g., the Waymo Open Motion Dataset) to initialize or play back a diverse set of multi-agent simulated scenarios. It runs entirely on hardware accelerators such as TPUs/GPUs and supports in-graph simulation for training, making it suitable for modern large-scale, distributed machine learning workflows. To support online training and evaluation, Waymax includes several learned and hard-coded behavior models that allow for realistic interaction within simulation. To supplement Waymax, we benchmark a suite of popular imitation and reinforcement learning algorithms with ablation studies on different design decisions, where we highlight the effectiveness of routes as guidance for planning agents and the ability of RL to overfit against simulated agents.

AAAI Conference 2022 Conference Paper

Detail-Preserving Transformer for Light Field Image Super-resolution

  • Shunzhou Wang
  • Tianfei Zhou
  • Yao Lu
  • Huijun Di

Recently, numerous algorithms have been developed to tackle the problem of light field super-resolution (LFSR), i.e., super-resolving low-resolution light fields to gain high-resolution views. Despite delivering encouraging results, these approaches are all convolution-based and are naturally weak in the global relation modeling of sub-aperture images necessary to characterize the inherent structure of light fields. In this paper, we put forth a novel formulation built upon Transformers, treating LFSR as a sequence-to-sequence reconstruction task. In particular, our model regards the sub-aperture images of each vertical or horizontal angular view as a sequence and establishes long-range geometric dependencies within each sequence via a spatial-angular locally-enhanced self-attention layer, which also maintains the locality of each sub-aperture image. Additionally, to better recover image details, we propose a detail-preserving Transformer (termed DPT) that leverages gradient maps of the light field to guide sequence learning. DPT consists of two branches, each associated with a Transformer for learning from an original or gradient image sequence. The two branches are finally fused to obtain comprehensive feature representations for reconstruction. Evaluations are conducted on a number of light field datasets, including real-world scenes and synthetic data. The proposed method achieves superior performance compared with other state-of-the-art schemes. Our code is publicly available at: https://github.com/BITszwang/DPT.

JBHI Journal 2022 Journal Article

Interpreting Depression From Question-Wise Long-Term Video Recording of SDS Evaluation

  • Wanqing Xie
  • Lizhong Liang
  • Yao Lu
  • Chen Wang
  • Jihong Shen
  • Hui Luo
  • Xiaofeng Liu

The Self-Rating Depression Scale (SDS) questionnaire has frequently been used for efficient preliminary depression screening. However, this uncontrolled self-administered measure can easily be affected by careless or deceptive answering, producing results that differ from the clinician-administered Hamilton Depression Rating Scale (HDRS) and from the final diagnosis. Clinically, facial expressions (FE) and actions play a vital role in clinician-administered evaluation, yet they remain underexplored in self-administered evaluations. In this work, we collect a novel dataset of 200 subjects to examine the validity of self-rating questionnaires using their corresponding question-wise video recordings. To automatically interpret depression from the SDS evaluation and the paired video, we propose an end-to-end hierarchical framework for long-term, variable-length video, which is also conditioned on the questionnaire results and the answering time. Specifically, we resort to a hierarchical model that utilizes a 3D CNN for local temporal pattern exploration and a redundancy-aware self-attention (RAS) scheme for question-wise global feature aggregation. Targeting redundant long-term FE video processing, our RAS effectively exploits the correlations among the video clips within a question set to emphasize discriminative information and eliminate redundancy based on pairwise feature affinity. The question-wise video feature is then concatenated with the questionnaire scores for final depression detection. Our thorough evaluations show the validity of fusing the SDS evaluation with its video recording, and the superiority of our framework over conventional state-of-the-art temporal modeling methods.

AAAI Conference 2021 Conference Paper

A Global Occlusion-Aware Approach to Self-Supervised Monocular Visual Odometry

  • Yao Lu
  • Xiaoli Xu
  • Mingyu Ding
  • Zhiwu Lu
  • Tao Xiang

Self-supervised monocular visual odometry (VO) is often cast into a view synthesis problem based on depth and camera pose estimation. One of the key challenges is to accurately and robustly estimate depth with occlusions and moving objects in the scene. Existing methods simply detect and mask out regions of occlusions locally by several convolutional layers, and then perform only partial view synthesis in the rest of the image. However, occlusion and moving object detection is an unsolved problem itself which requires global layout information. Inaccurate detection inevitably results in incorrect depth as well as pose estimation. In this work, instead of locally detecting and masking out occlusions and moving objects, we propose to alleviate their negative effects on monocular VO implicitly but more effectively from two global perspectives. First, a multi-scale non-local attention module, consisting of both intra-stage augmented attention and cascaded across-stage attention, is proposed for robust depth estimation given occlusions, alleviating the impacts of occlusions via global attention modeling. Second, adversarial learning is introduced in view synthesis for monocular VO. Unlike existing methods that use pixel-level losses on the quality of synthesized views, we enforce the synthetic view to be indistinguishable from the real one at the scene level. Such a global constraint again helps cope with occluded and moving regions. Extensive experiments on the KITTI dataset show that our approach achieves a new state of the art in both pose estimation and depth recovery.

NeurIPS Conference 2021 Conference Paper

Delayed Gradient Averaging: Tolerate the Communication Latency for Federated Learning

  • Ligeng Zhu
  • Hongzhou Lin
  • Yao Lu
  • Yujun Lin
  • Song Han

Federated learning is an emerging direction in distributed machine learning that enables jointly training a model without sharing the data. Since the data is distributed across many edge devices through wireless or long-distance connections, federated learning suffers from inevitably high communication latency. However, latency issues are underestimated in the current literature [15], and existing approaches such as FedAvg [27] become less efficient as latency increases. To overcome this problem, we propose Delayed Gradient Averaging (DGA), which delays the averaging step to improve efficiency and allows local computation to proceed in parallel with communication. We theoretically prove that DGA attains a convergence rate similar to FedAvg, and empirically show that our algorithm can tolerate high network latency without compromising accuracy. Specifically, we benchmark the training speed on various vision (CIFAR, ImageNet) and language (Shakespeare) tasks, with both IID and non-IID partitions, and show that DGA brings a 2.55× to 4.07× speedup. Moreover, we build a 16-node Raspberry Pi cluster and show that DGA consistently speeds up real-world federated learning applications.
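The core mechanism of DGA, applying the globally averaged gradient several steps after it was computed so that communication overlaps with local computation, can be illustrated with a minimal toy simulation on per-worker quadratic losses. This is a sketch of the idea only, not the paper's implementation; all names and hyperparameters are illustrative:

```python
import numpy as np
from collections import deque

def dga_simulation(num_workers=4, delay=3, steps=200, lr=0.1, dim=5, seed=0):
    """Toy Delayed Gradient Averaging on per-worker quadratic losses
    f_k(w) = 0.5 * ||w - c_k||^2, whose global optimum is mean(c_k)."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(num_workers, dim))    # per-worker optima
    w = [np.zeros(dim) for _ in range(num_workers)]  # per-worker models
    in_flight = deque()                              # gradients awaiting the average

    for _ in range(steps):
        grads = [w[k] - centers[k] for k in range(num_workers)]
        for k in range(num_workers):                 # local step applied at once,
            w[k] -= lr * grads[k]                    # without waiting for communication
        in_flight.append(grads)
        if len(in_flight) > delay:                   # the average "arrives" `delay` steps late:
            stale = in_flight.popleft()              # swap the stale local gradient for the
            g_avg = np.mean(stale, axis=0)           # global one (the DGA correction term)
            for k in range(num_workers):
                w[k] -= lr * (g_avg - stale[k])
    return np.mean(w, axis=0), centers.mean(axis=0)

final_w, optimum = dga_simulation()
```

Averaged over workers, the correction terms cancel, so the mean iterate follows plain gradient descent on the average loss and converges to the shared optimum despite the delayed communication.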

JBHI Journal 2021 Journal Article

Multicenter Privacy-Preserving Cox Analysis Based on Homomorphic Encryption

  • Yao Lu
  • Yu Tian
  • Tianshu Zhou
  • Shiqiang Zhu
  • Jingsong Li

The Cox proportional hazards model is one of the most widely used methods for analyzing survival data. Data from multiple data providers are required to improve the generalizability and confidence of the results of Cox analysis; however, such data sharing may result in leakage of sensitive information, leading to financial fraud, social discrimination or unauthorized data abuse. Some privacy-preserving Cox regression protocols have been proposed in past years, but they lack either security or functionality. In this paper, we propose a privacy-preserving Cox regression protocol for multiple data providers and researchers. The proposed protocol allows researchers to train models on horizontally or vertically partitioned datasets while providing privacy protection for both the sensitive data and the trained models. Our protocol utilizes threshold homomorphic encryption to guarantee security. Experimental results demonstrate that with the proposed protocol, Cox regression model training over 9 variables in a dataset of 113,035 samples takes approximately 44 min, and the trained model is almost the same as that obtained with the original nonsecure Cox regression protocol; therefore, our protocol is a potential candidate for practical real-world applications in multicenter medical research.
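For reference, the quantity at the heart of any Cox analysis is the partial log-likelihood. Below is a plain, non-secure numpy sketch of it under the Breslow convention with no tie handling; the protocol described above would evaluate terms like these under threshold homomorphic encryption, which is not reproduced here, and all names and data are illustrative:

```python
import numpy as np

def cox_partial_loglik(beta, X, times, events):
    """Cox partial log-likelihood (Breslow convention, no tie handling).
    X: (n, p) covariates; times: (n,) follow-up; events: (n,) 1 = event."""
    risk = X @ beta
    order = np.argsort(-times)            # sort by descending follow-up time
    risk, ev = risk[order], events[order]
    # log of sum(exp(risk)) over the risk set {j : t_j >= t_i}, computed
    # as a running log-sum-exp down the sorted array
    log_cum = np.logaddexp.accumulate(risk)
    return float(np.sum((risk - log_cum)[ev == 1]))

X = np.array([[0.1, 1.0], [0.5, -0.2], [1.2, 0.3], [-0.4, 0.8], [0.0, 0.0]])
times = np.array([5.0, 3.0, 8.0, 1.0, 4.0])
events = np.array([1, 0, 1, 1, 0])
ll = cox_partial_loglik(np.array([0.3, -0.1]), X, times, events)
```

Each event's term subtracts a log-sum-exp over its own risk set (which includes the subject itself), so the partial log-likelihood is always non-positive.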

JBHI Journal 2020 Journal Article

3D Shape-Based Body Composition Inference Model Using a Bayesian Network

  • Yao Lu
  • James K Hahn
  • Xiaoke Zhang

Body composition can be assessed in many different ways. High-end medical equipment, such as Dual-energy X-ray Absorptiometry (DXA), Computed Tomography (CT), and Magnetic Resonance Imaging (MRI), offers high-fidelity pixel/voxel-level assessment, but is prohibitive in cost. In the case of DXA and CT, the approach exposes users to ionizing radiation. Whole-body air displacement plethysmography (BOD POD) can accurately estimate body density, but the assessment is limited to the whole-body fat percentage. Optical three-dimensional (3D) scan and reconstruction techniques, such as using depth cameras, have brought new opportunities for improving body composition assessment by intelligently analyzing body shape features. In this paper, we present a novel supervised inference model to predict pixel-level body composition and percentage of body fat using 3D geometry features and body density. First, we use body density to model a fat distribution base prediction. Then, we use a Bayesian network to infer the probability of the base prediction bias from 3D geometry features. Finally, we correct the bias using non-parametric regression. We use DXA assessment as the ground truth in model training and validation. We compare our method, in terms of pixel-level body composition assessment, with the current state-of-the-art prediction models. Our method outperforms those prediction models by 52.69% on average. We also compare our method, in terms of whole-body fat percentage assessment, with medical-grade equipment (the BOD POD). Our method outperforms the BOD POD by 23.28%.

JBHI Journal 2020 Journal Article

Sequential Saliency Guided Deep Neural Network for Joint Mitosis Identification and Localization in Time-Lapse Phase Contrast Microscopy Images

  • Yao Lu
  • An-An Liu
  • Mei Chen
  • Wei-Zhi Nie
  • Yu-Ting Su

The analysis of cell mitotic behavior plays an important role in many biomedical research and medical diagnostic applications. To improve the accuracy of mitosis detection in automated analysis systems, this paper proposes the sequential saliency guided deep neural network (SSG-DNN) to jointly identify and localize mitotic events in time-lapse phase contrast microscopy images. It consists of three key modules. First, the visual context learning module extracts static visual features and dynamic visual transitions within individual volumetric cell regions. Second, with this information, the sequential saliency modeling module discovers the saliency distribution over all successive frames in each volumetric region. Finally, the sequence structure modeling module leverages both the visual context and the saliency distribution for mitosis identification and localization. SSG-DNN jointly realizes visual feature learning and sequential structure modeling in an end-to-end framework. Moreover, the proposed method is independent of complicated preconditioning methods for mitotic candidate extraction and can be applied to mitosis detection in a one-shot manner. To our knowledge, it is the first weakly supervised work to realize joint mitosis identification and localization with only sequence-wise labels. In our experiments, we evaluate its performance on both tasks on the popular C3H10 dataset and on a novel large-scale dataset, C2C12-16, which contains many more mitotic events and is more challenging owing to diverse cell culture conditions. Experimental results demonstrate the superiority of the proposed method.

AAAI Conference 2019 Short Paper

A Multi-Task Learning Framework for Abstractive Text Summarization

  • Yao Lu
  • Linqing Liu
  • Zhile Jiang
  • Min Yang
  • Randy Goebel

We propose a Multi-task learning approach for Abstractive Text Summarization (MATS), motivated by the fact that humans have no difficulty performing such a task because they possess capabilities spanning multiple domains. Specifically, MATS consists of three components: (i) a text categorization model that learns rich category-specific text representations using a bi-LSTM encoder; (ii) a syntax labeling model that learns to improve the syntax-aware LSTM decoder; and (iii) an abstractive text summarization model that shares its encoder and decoder with the text categorization and the syntax labeling tasks, respectively. In particular, the abstractive text summarization model benefits significantly from the additional text categorization and syntax knowledge. Our experimental results show that MATS outperforms the competing methods.

AAAI Conference 2019 Conference Paper

Super Sparse Convolutional Neural Networks

  • Yao Lu
  • Guangming Lu
  • Bob Zhang
  • Yuanrong Xu
  • Jinxing Li

To construct small mobile networks without performance loss and to address the over-fitting issues caused by less abundant training datasets, this paper proposes a novel super sparse convolutional (SSC) kernel, whose corresponding network is called SSC-Net. In an SSC kernel, every spatial kernel has only one non-zero parameter, and these non-zero spatial positions are all different. The SSC kernel can effectively select pixels from the feature maps according to its non-zero positions and operate on them. Therefore, SSC can preserve the general characteristics of the geometry and the channels' differences, preserving the quality of the retrieved features and meeting general accuracy requirements. Furthermore, SSC can be entirely implemented by "shift" and "group point-wise" convolutional operations without any spatial kernels (e.g., "3 × 3"). Therefore, SSC is the first method to remove parameter redundancy from both the spatial and channel extents, largely decreasing the parameters and FLOPs as well as further reducing the img2col and col2img operations implemented by the low-level libraries. Meanwhile, SSC-Net can improve sparsity and overcome over-fitting more effectively than other mobile networks. Comparative experiments were performed on the less abundant CIFAR and low-resolution ImageNet datasets. The results showed that SSC-Nets can significantly decrease the parameters and computational FLOPs without any performance loss. Additionally, SSC-Net can also better address the over-fitting problem on the more challenging, less abundant datasets.
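The claim that an SSC kernel reduces to "shift" plus "group point-wise" operations can be sketched directly: since each input channel carries exactly one non-zero spatial tap, applying the kernel amounts to shifting that channel by the tap's offset and then mixing channels with a 1 × 1 convolution. A minimal numpy sketch under that reading (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def ssc_layer(x, offsets, pointwise_w):
    """One SSC stage as shift + point-wise convolution: each input channel
    has a single non-zero spatial tap at its own (dy, dx) position, so the
    spatial part is a per-channel shift, and a 1x1 convolution then mixes
    channels. x: (C, H, W); offsets: C pairs; pointwise_w: (C_out, C)."""
    shifted = np.empty_like(x)
    for c, (dy, dx) in enumerate(offsets):
        # np.roll stands in for a zero-padded shift to keep the sketch short
        shifted[c] = np.roll(np.roll(x[c], dy, axis=0), dx, axis=1)
    # group/point-wise step: per-pixel channel mixing, no spatial kernel
    return np.einsum('oc,chw->ohw', pointwise_w, shifted)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))
offsets = [(0, 1), (1, 0), (-1, -1)]      # one distinct tap per channel
w = rng.normal(size=(4, 3))               # point-wise (1x1) weights
out = ssc_layer(x, offsets, w)
```

Because no dense spatial kernel is ever materialized, the parameter count is just that of the 1 × 1 mixing weights, which is the source of the savings the abstract describes.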

IJCAI Conference 2018 Conference Paper

AAR-CNNs: Auto Adaptive Regularized Convolutional Neural Networks

  • Yao Lu
  • Guangming Lu
  • Yuanrong Xu
  • Bob Zhang

In order to address the overfitting problem caused by small or simple training datasets and large model sizes in Convolutional Neural Networks (CNNs), a novel Auto Adaptive Regularization (AAR) method is proposed in this paper; the resulting networks are called AAR-CNNs. AAR is the first method to use the "abstraction extent" (predicted by the AE net) and a tiny learnable module (the SE net) to auto-adaptively predict more accurate and individualized regularization information. The AAR module can be directly inserted into every stage of any popular network and trained end to end to improve the network's flexibility. This method not only regularizes the network in both the forward and backward passes during training, but also regularizes it at a more refined level (channel or pixel level), depending on the form of the abstraction extent. Comparative experiments are performed on the low-resolution ImageNet, CIFAR, and SVHN datasets. Experimental results show that AAR-CNNs achieve state-of-the-art performance on these datasets.

AAAI Conference 2018 Short Paper

Generative Adversarial Network for Abstractive Text Summarization

  • Linqing Liu
  • Yao Lu
  • Min Yang
  • Qiang Qu
  • Jia Zhu
  • Hongyan Li

In this paper, we propose an adversarial process for abstractive text summarization, in which we simultaneously train a generative model G and a discriminative model D. In particular, we build the generator G as an agent of reinforcement learning, which takes the raw text as input and predicts the abstractive summarization. We also build a discriminator which attempts to distinguish the generated summary from the ground truth summary. Extensive experiments demonstrate that our model achieves ROUGE scores competitive with the state-of-the-art methods on the CNN/Daily Mail dataset. Qualitatively, we show that our model is able to generate more abstractive, readable and diverse summaries.

AAAI Conference 2017 Conference Paper

Closing the Loop for Edge Detection and Object Proposals

  • Yao Lu
  • Linda Shapiro

Edge grouping and object perception are unified procedures in perceptual organization. However, the computer vision literature treats them as independent tasks. In this paper, we argue that edge detection and object proposals should benefit one another. To achieve this, we go beyond bounding boxes and extract closed contours that represent potential objects within. A novel objectness metric is proposed to score and rank the proposal boxes by considering the sizes and edge intensities of the closed contours. To improve the edge detector given the top-down object proposals, we group local closed contours and construct global object hierarchies and segmentations. The edge detector is retrained and enhanced using these hierarchical segmentations as additional feature channels. In the experiments we show that by closing the loop for edge detection and object proposals, we observe improvements for both tasks. Unifying edges and object proposals is valid and useful.

IJCAI Conference 2016 Conference Paper

Unsupervised Learning on Neural Network Outputs: With Application in Zero-Shot Learning

  • Yao Lu

The outputs of a trained neural network contain much richer information than a one-hot classification. For example, a neural network might give an image of a dog a one-in-a-million probability of being a cat, yet that probability is still much larger than the probability of being a car. To reveal the hidden structure in these outputs, we apply two unsupervised learning algorithms, PCA and ICA, to the outputs of a deep Convolutional Neural Network trained on ImageNet with 1,000 classes. The PCA/ICA embedding of the object classes reveals their visual similarity, and the PCA/ICA components can be interpreted as common visual features shared by similar object classes. As an application, we propose a new zero-shot learning method in which the visual features learned by PCA/ICA are employed. Our zero-shot learning method achieves state-of-the-art results on ImageNet with over 20,000 classes.
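The PCA step described here is easy to reproduce in spirit: treat the matrix of network outputs (images × classes) as data and read a class embedding off the right singular vectors, so that classes whose probabilities co-vary across images land near each other. A minimal SVD-based sketch on synthetic softmax outputs (not the author's code; names are illustrative):

```python
import numpy as np

def class_embedding(outputs, n_components=2):
    """PCA over network outputs: rows are images, columns are classes.
    Returns an (n_classes, n_components) embedding in which classes with
    correlated output probabilities are placed close together."""
    X = outputs - outputs.mean(axis=0, keepdims=True)   # center each class column
    _, S, Vt = np.linalg.svd(X, full_matrices=False)    # SVD-based PCA
    # rows of Vt are principal directions in class space; scale by the
    # singular values to weight directions by explained variance
    return Vt[:n_components].T * S[:n_components]

rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 10))                     # stand-in network outputs
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
emb = class_embedding(probs, n_components=2)
```

With real network outputs in place of the random stand-ins, the rows of `emb` give the low-dimensional class coordinates whose neighborhoods reflect visual similarity, as the abstract describes.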