Arrow Research search

Author name cluster

Yan Lu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

43 papers
2 author rows

Possible papers (43)

NeurIPS Conference 2025 Conference Paper

Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

  • Xiaoyi Zhang
  • Zhaoyang Jia
  • Zongyu Guo
  • Jiahao Li
  • Bin Li
  • Houqiang Li
  • Yan Lu

Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long-context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery (DVD) agent, which leverages an agentic search strategy over segmented video clips. Unlike previous video agents that manually design a rigid workflow, our approach emphasizes the autonomous nature of agents. Given a set of search-centric tools over a multi-granular video database, our DVD agent leverages the advanced reasoning capability of the LLM to plan on its current observation state and strategically selects tools to orchestrate an adaptive workflow for different queries in light of the gathered information. We perform a comprehensive evaluation on multiple long-video understanding benchmarks that demonstrates our advantage. Our DVD agent achieves state-of-the-art performance on the challenging LVBench dataset, reaching an accuracy of 74.2%, which substantially surpasses all prior works, and further improves to 76.0% with transcripts.
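
As a rough illustration of the agentic-search idea described in this abstract, the sketch below shows a minimal planner-and-tools loop over segmented clips. The planner interface, the Action structure, and the tool names are assumptions for illustration only, not the DVD implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str                      # tool to call, or "answer" to stop
    arguments: dict = field(default_factory=dict)

def agentic_video_search(question, planner, tools, max_steps=8):
    """Minimal agentic-search loop: an LLM-backed planner inspects the gathered
    observations and either calls a search-centric tool over the clip database
    or emits a final answer (all interfaces here are hypothetical)."""
    observations = []
    for _ in range(max_steps):
        action = planner(question, observations)            # returns an Action
        if action.name == "answer":
            return action.arguments.get("text", "")
        result = tools[action.name](**action.arguments)     # e.g. a clip-level search
        observations.append((action.name, result))
    # fall back to whatever answer the planner can give after the step budget
    return planner(question, observations).arguments.get("text", "")
```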

NeurIPS Conference 2025 Conference Paper

FuncGenFoil: Airfoil Generation and Editing Model in Function Space

  • Jinouwen Zhang
  • Junjie Ren
  • Ma Qianhong
  • Jianyu Wu
  • Aobo Yang
  • Yan Lu
  • Lu Chen
  • Hairun Xie

Aircraft manufacturing is the jewel in the crown of industry, in which generating high-fidelity airfoil geometries with controllable and editable representations remains a fundamental challenge. Existing deep learning methods, which typically rely on predefined parametric representations (e.g., Bézier curves) or discrete point sets, face an inherent trade-off between expressive power and resolution adaptability. To tackle this challenge, we introduce FuncGenFoil, a novel function-space generative model that directly reconstructs airfoil geometries as function curves. Our method inherits the advantages of arbitrary-resolution sampling and smoothness from parametric functions, as well as the strong expressiveness of discrete point-based representations. Empirical evaluations demonstrate that FuncGenFoil improves upon state-of-the-art methods in airfoil generation, achieving a relative 74.4% reduction in label error and a 23.2% increase in diversity on the AF-200K dataset. Our results highlight the advantages of function-space modeling for aerodynamic shape optimization, offering a powerful and flexible framework for high-fidelity airfoil design.

NeurIPS Conference 2025 Conference Paper

Image as a World: Generating Interactive World from Single Image via Panoramic Video Generation

  • Dongnan Gui
  • Xun Guo
  • Wengang Zhou
  • Yan Lu

Generating an interactive visual world from a single image is both challenging and practically valuable, as single-view inputs are easy to acquire and align well with prompt-driven applications such as gaming and virtual reality. This paper introduces a novel unified framework, Image as a World (IaaW), which synthesizes high-quality 360-degree videos from a single image that are both controllable and temporally continuable. Our framework consists of three stages: world initialization, which jointly synthesizes spatially complete and temporally dynamic scenes from a single view; world exploration, which supports user-specified viewpoint rotation; and world continuation, which extends the generated scene forward in time with temporal consistency. To support this pipeline, we design a visual world model based on generative diffusion models modulated with spherical 3D positional encoding and multi-view composition to represent geometry and view semantics. Additionally, a vision-language model (IaaW-VLM) is fine-tuned to produce both global and view-specific prompts, improving semantic alignment and controllability. Extensive experiments demonstrate that our method produces panoramic videos with superior visual quality, minimal distortion and seamless continuation in both qualitative and quantitative evaluations. To the best of our knowledge, this is the first work to generate a controllable, consistent, and temporally expandable 360-degree world from a single image.

AAAI Conference 2025 Conference Paper

MEATRD: Multimodal Anomalous Tissue Region Detection Enhanced with Spatial Transcriptomics

  • Kaichen Xu
  • Qilong Wu
  • Yan Lu
  • Yinan Zheng
  • Wenlin Li
  • Xingjie Tang
  • Jun Wang
  • Xiaobo Sun

The detection of anomalous tissue regions (ATRs) within affected tissues is crucial in clinical diagnosis and pathological studies. Conventional automated ATR detection methods, primarily based on histology images alone, falter in cases where ATRs and normal tissues have subtle visual differences. The recent spatial transcriptomics (ST) technology profiles gene expressions across tissue regions, offering a molecular perspective for detecting ATRs. However, there is a dearth of ATR detection methods that effectively harness complementary information from both histology images and ST. To address this gap, we propose MEATRD, a novel ATR detection method that integrates histology image and ST data. MEATRD is trained to reconstruct image patches and gene expression profiles of normal tissue spots (inliers) from their multimodal embeddings, followed by learning a one-class classification AD model based on latent multimodal reconstruction errors. This strategy harmonizes the strengths of reconstruction-based and one-class classification approaches. At the heart of MEATRD is an innovative masked graph dual-attention transformer (MGDAT) network, which not only facilitates cross-modality and cross-node information sharing but also addresses the model over-generalization issue commonly seen in reconstruction-based AD methods. Additionally, we demonstrate that modality-specific, task-relevant information is collated and condensed in multimodal bottleneck encoding generated in MGDAT, marking the first theoretical analysis of the informational properties of multimodal bottleneck encoding. Extensive evaluations across eight real ST datasets reveal MEATRD's superior performance in ATR detection, surpassing various state-of-the-art AD methods. Remarkably, MEATRD also proves adept at discerning ATRs that only show slight visual deviations from normal tissues.

NeurIPS Conference 2025 Conference Paper

Omnidirectional 3D Scene Reconstruction from Single Image

  • Ren Yang
  • Jiahao Li
  • Yan Lu

Reconstruction of 3D scenes from a single image is a crucial step towards enabling next-generation AI-powered immersive experiences. However, existing diffusion-based methods often struggle with reconstructing omnidirectional scenes due to geometric distortions and inconsistencies across the generated novel views, hindering accurate 3D recovery. To overcome this challenge, we propose Omni3D, an approach designed to enhance the geometric fidelity of diffusion-generated views for robust omnidirectional reconstruction. Our method leverages priors from pose estimation techniques, such as MASt3R, to iteratively refine both the generated novel views and their estimated camera poses. Specifically, we minimize the 3D reprojection errors between paired views to optimize the generated images, and simultaneously, correct the pose estimation based on the refined views. This synergistic optimization process yields geometrically consistent views and accurate poses, which are then used to build an explicit 3D Gaussian Splatting representation capable of omnidirectional rendering. Experimental results validate the effectiveness of Omni3D, demonstrating significantly advanced 3D reconstruction quality in the omnidirectional space, compared to previous state-of-the-art methods. Project page: https://omni3d-neurips.github.io.

NeurIPS Conference 2025 Conference Paper

One-Step Diffusion-Based Image Compression with Semantic Distillation

  • Naifu Xue
  • Zhaoyang Jia
  • Jiahao Li
  • Bin Li
  • Yuan Zhang
  • Yan Lu

While recent diffusion-based generative image codecs have shown impressive performance, their iterative sampling process introduces unpleasant latency. In this work, we revisit the design of a diffusion-based codec and argue that multi-step sampling is not necessary for generative compression. Based on this insight, we propose OneDC, a One-step Diffusion-based generative image Codec that integrates a latent compression module with a one-step diffusion generator. Recognizing the critical role of semantic guidance in one-step diffusion, we propose using the hyperprior as a semantic signal, overcoming the limitations of text prompts in representing complex visual content. To further enhance the semantic capability of the hyperprior, we introduce a semantic distillation mechanism that transfers knowledge from a pretrained generative tokenizer to the hyperprior codec. Additionally, we adopt a hybrid pixel- and latent-domain optimization to jointly enhance both reconstruction fidelity and perceptual realism. Extensive experiments demonstrate that OneDC achieves SOTA perceptual quality even with one-step generation, offering over 39% bitrate reduction and 20× faster decoding compared to prior multi-step diffusion-based codecs. Project: https://onedc-codec.github.io/

NeurIPS Conference 2025 Conference Paper

PRING: Rethinking Protein-Protein Interaction Prediction from Pairs to Graphs

  • Xinzhe Zheng
  • Hao Du
  • Fanding Xu
  • Jinzhe Li
  • Zhiyuan Liu
  • Wenkang Wang
  • Tao Chen
  • Wanli Ouyang

Deep learning-based computational methods have achieved promising results in predicting protein-protein interactions (PPIs). However, existing benchmarks predominantly focus on isolated pairwise evaluations, overlooking a model's capability to reconstruct biologically meaningful PPI networks, which is crucial for biology research. To address this gap, we introduce PRING, the first comprehensive benchmark that evaluates PRotein-protein INteraction prediction from a Graph-level perspective. PRING curates a high-quality, multi-species PPI network dataset comprising 21,484 proteins and 186,818 interactions, with well-designed strategies to address both data redundancy and leakage. Building on this gold-standard dataset, we establish two complementary evaluation paradigms: (1) topology-oriented tasks, which assess intra- and cross-species PPI network construction, and (2) function-oriented tasks, including protein complex pathway prediction, GO module analysis, and essential protein justification. These evaluations not only reflect the model's capability to understand the network topology but also facilitate protein function annotation, biological module detection, and even disease mechanism analysis. Extensive experiments on four representative model categories, consisting of sequence similarity-based, naive sequence-based, protein language model-based, and structure-based approaches, demonstrate that current PPI models have potential limitations in recovering both structural and functional properties of PPI networks, highlighting the gap in supporting real-world biological applications. We believe PRING provides a reliable platform to guide the development of more effective PPI prediction models for the community. The dataset and source code of PRING are available at https://github.com/SophieSarceau/PRING.

NeurIPS Conference 2025 Conference Paper

Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

  • Yuhao Zhou
  • Yiheng Wang
  • Xuming He
  • Ruoyao Xiao
  • Zhiwei Li
  • Qiantai Feng
  • Zijie Guo
  • Yuejin Yang

Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that the current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.

NeurIPS Conference 2025 Conference Paper

STAR: A Benchmark for Astronomical Star Fields Super-Resolution

  • WU KUO-CHENG
  • Guohang Zhuang
  • Jinyang Huang
  • Xiang Zhang
  • Wanli Ouyang
  • Yan Lu

Super-resolution (SR) advances astronomical imaging by enabling cost-effective high-resolution capture, crucial for detecting faraway celestial objects and precise structural analysis. However, existing datasets for astronomical SR (ASR) exhibit three critical limitations: flux inconsistency, object-crop setting, and insufficient data diversity, significantly impeding ASR development. We propose STAR, a large-scale astronomical SR dataset containing 54,738 flux-consistent star field image pairs covering wide celestial regions. These pairs combine Hubble Space Telescope high-resolution observations with physically faithful low-resolution counterparts generated through a flux-preserving data generation pipeline, enabling systematic development of field-level ASR models. To further empower the ASR community, STAR provides a novel Flux Error (FE) metric to evaluate SR models from a physical perspective. Leveraging this benchmark, we propose a Flux-Invariant Super Resolution (FISR) model that can accurately infer flux-consistent high-resolution images from input photometry, surpassing several state-of-the-art SR methods by 24.84% on a newly designed flux consistency metric and showing the value of our method for astrophysics. Extensive experiments demonstrate the effectiveness of our proposed method and the value of our dataset. Code and models are available at https://github.com/GuoCheng12/STAR.

IJCAI Conference 2025 Conference Paper

Towards Anytime Retrieval: A Benchmark for Anytime Person Re-Identification

  • Xulin Li
  • Yan Lu
  • Bin Liu
  • Jiaze Li
  • Qinhong Yang
  • Tao Gong
  • Qi Chu
  • Mang Ye

In real applications, person re-identification (ReID) expects to retrieve the target person at any time, including both daytime and nighttime, ranging from short-term to long-term. However, existing ReID tasks and datasets cannot meet this requirement, as they are constrained by available time and only provide training and evaluation for specific scenarios. Therefore, we investigate a new task called Anytime Person Re-identification (AT-ReID), which aims to achieve effective retrieval in multiple scenarios based on variations in time. To address the AT-ReID problem, we collect the first large-scale dataset, AT-USTC, which contains 135k images of individuals wearing multiple clothes captured by RGB and IR cameras. Our data collection spans an entire year, and 270 volunteers were photographed on average 29.1 times across different dates or scenes, 4-15 times more than in current datasets, providing conditions for follow-up investigations in AT-ReID. Further, to tackle the new challenge of multi-scenario retrieval, we propose a unified model named Uni-AT, which comprises a multi-scenario ReID (MS-ReID) framework for scenario-specific feature learning, a Mixture-of-Attribute-Experts (MoAE) module to alleviate inter-scenario interference, and a Hierarchical Dynamic Weighting (HDW) strategy to ensure balanced training across all scenarios. Extensive experiments show that our model leads to satisfactory results and exhibits excellent generalization to all scenarios.
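
A minimal sketch of the mixture-of-experts gating idea behind a module like MoAE, assuming linear experts mixed by a per-sample softmax gate; the shapes and weight names are illustrative placeholders rather than the Uni-AT design.

```python
import numpy as np

def moae_forward(x, expert_weights, gate_weights):
    """Toy mixture-of-attribute-experts layer: each expert is a linear map and a
    softmax gate mixes their outputs per sample.
    Shapes: x (N, D), expert_weights (E, D, D_out), gate_weights (D, E)."""
    logits = x @ gate_weights                                  # (N, E)
    gates = np.exp(logits - logits.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)                  # softmax over experts
    expert_out = np.einsum("nd,edo->neo", x, expert_weights)   # every expert's output
    return np.einsum("ne,neo->no", gates, expert_out)          # gated combination

x = np.random.randn(4, 8)
out = moae_forward(x, np.random.randn(3, 8, 16), np.random.randn(8, 3))
print(out.shape)  # (4, 16)
```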

ICLR Conference 2025 Conference Paper

UniGEM: A Unified Approach to Generation and Property Prediction for Molecules

  • Shikun Feng
  • Yuyan Ni
  • Yan Lu
  • Zhiming Ma
  • Wei-Ying Ma
  • Yanyan Lan

Molecular generation and molecular property prediction are both crucial for drug discovery, but they are often developed independently. Inspired by recent studies demonstrating that diffusion models, a prominent generative approach, can learn meaningful data representations that enhance predictive tasks, we explore the potential for developing a unified generative model in the molecular domain that effectively addresses both molecular generation and property prediction tasks. However, the integration of these tasks is challenging due to inherent inconsistencies, making simple multi-task learning ineffective. To address this, we propose UniGEM, the first unified model to successfully integrate molecular generation and property prediction, delivering superior performance in both tasks. Our key innovation lies in a novel two-phase generative process, where predictive tasks are activated in the later stages, after the molecular scaffold is formed. We further enhance task balance through innovative training strategies. Rigorous theoretical analysis and comprehensive experiments demonstrate our significant improvements in both tasks. The principles behind UniGEM hold promise for broader applications, including natural language processing and computer vision.

NeurIPS Conference 2025 Conference Paper

VideoVLA: Video Generators Can Be Generalizable Robot Manipulators

  • Yichao Shen
  • Fangyun Wei
  • Zhiying Du
  • Yaobo Liang
  • Yan Lu
  • Jiaolong Yang
  • Nanning Zheng
  • Baining Guo

Generalization in robot manipulation is essential for deploying robots in open-world environments and advancing toward artificial general intelligence. While recent Vision-Language-Action (VLA) models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators. Given a language instruction and an image, VideoVLA predicts an action sequence as well as the future visual outcomes. Built on a multi-modal Diffusion Transformer, VideoVLA jointly models video, language, and action modalities, using pre-trained video generative models for joint visual and action forecasting. Our experiments show that high-quality imagined futures correlate with reliable action predictions and task success, highlighting the importance of visual imagination in manipulation. VideoVLA demonstrates strong generalization, including imitating other embodiments' skills and handling novel objects. This dual-prediction strategy—forecasting both actions and their visual consequences—explores a paradigm shift in robot learning and unlocks generalization capabilities in manipulation systems.

AAAI Conference 2024 Conference Paper

Arbitrary-Scale Video Super-resolution Guided by Dynamic Context

  • Cong Huang
  • Jiahao Li
  • Lei Chu
  • Dong Liu
  • Yan Lu

We propose a Dynamic Context-Guided Upsampling (DCGU) module for video super-resolution (VSR) that leverages temporal context guidance to achieve efficient and effective arbitrary-scale VSR. While most VSR research focuses on backbone design, the importance of the upsampling part is often overlooked. Existing methods rely on pixelshuffle-based upsampling, which has limited capabilities in handling arbitrary upsampling scales. Recent attempts to replace pixelshuffle-based modules with implicit neural function-based and filter-based approaches suffer from slow inference speeds and limited representation capacity, respectively. To overcome these limitations, our DCGU module predicts non-local sampling locations and content-dependent filter weights, enabling efficient and effective arbitrary-scale VSR. Our proposed multi-granularity location search module efficiently identifies non-local sampling locations across the entire low-resolution grid, and the temporal bilateral filter modulation module integrates content information with the filter weights to enhance texture details. Extensive experiments demonstrate the superiority of our method in terms of performance and speed on arbitrary-scale VSR.

NeurIPS Conference 2024 Conference Paper

Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement

  • Tao Yang
  • Cuiling Lan
  • Yan Lu
  • Nanning Zheng

Disentangled representation learning strives to extract the intrinsic factors within the observed data. Factoring these representations in an unsupervised manner is notably challenging and usually requires tailored loss functions or specific structural designs. In this paper, we introduce a new perspective and framework, demonstrating that diffusion models with cross-attention can themselves serve as a powerful inductive bias to facilitate the learning of disentangled representations. We propose to encode an image into a set of concept tokens and treat them as the condition of the latent diffusion model for image reconstruction, where cross-attention over the concept tokens is used to bridge the encoder and the U-Net of the diffusion model. We show that the diffusion process inherently possesses time-varying information bottlenecks. Such information bottlenecks and cross-attention act as strong inductive biases for promoting disentanglement. Without any regularization term in the loss function, this framework achieves superior disentanglement performance on the benchmark datasets, surpassing all previous methods with intricate designs. We have conducted comprehensive ablation studies and visualization analyses, shedding light on the functioning of this model. We anticipate that our findings will inspire further investigation into diffusion models for disentangled representation learning toward more sophisticated data analysis and understanding.

AAAI Conference 2024 Conference Paper

MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators

  • Yaqi Zhang
  • Di Huang
  • Bin Liu
  • Shixiang Tang
  • Yan Lu
  • Lu Chen
  • Lei Bai
  • Qi Chu

Generating realistic human motion from given action descriptions has experienced significant advancements because of the emerging requirement of digital humans. While recent works have achieved impressive results in generating motion directly from textual action descriptions, they often support only a single modality of the control signal, which limits their application in the real digital human industry. This paper presents a Motion General-Purpose generaTor (MotionGPT) that can use multimodal control signals, e.g., text and single-frame poses, for generating consecutive human motions by treating multimodal signals as special input tokens in large language models (LLMs). Specifically, we first quantize multimodal control signals into discrete codes and then formulate them in a unified prompt instruction to ask the LLMs to generate the motion answer. Our MotionGPT demonstrates a unified human motion generation model with multimodal control signals by tuning a mere 0.4% of LLM parameters. To the best of our knowledge, MotionGPT is the first method to generate human motion by multimodal control signals, which we hope can shed light on this new direction. Visit our webpage at https://qiqiapink.github.io/MotionGPT/.
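
The sketch below illustrates how quantized control signals could be formatted as special tokens inside a unified instruction prompt, as the abstract describes; the <motion_k> token naming and the prompt wording are assumptions for illustration, not the MotionGPT template.

```python
def build_motion_prompt(text_instruction, pose_codes=None):
    """Format quantized control signals as special tokens inside a single
    instruction prompt for the LLM; token names here are illustrative."""
    parts = [f"Instruction: {text_instruction}"]
    if pose_codes:  # e.g. codes from a VQ tokenizer applied to a key pose
        pose_tokens = " ".join(f"<motion_{c}>" for c in pose_codes)
        parts.append(f"Initial pose: {pose_tokens}")
    parts.append("Answer with the motion token sequence:")
    return "\n".join(parts)

print(build_motion_prompt("a person walks forward and waves", pose_codes=[17, 342, 9]))
```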

NeurIPS Conference 2024 Conference Paper

Slot-VLM: Object-Event Slots for Video-Language Modeling

  • Jiaqi Xu
  • Cuiling Lan
  • Wenxuan Xie
  • Xuejin Chen
  • Yan Lu

Video-Language Models (VLMs), powered by the advancements in Large Language Models (LLMs), are charting new frontiers in video understanding. A pivotal challenge is the development of an effective method to encapsulate video content into a set of representative tokens to align with LLMs. In this work, we introduce Slot-VLM, a new framework designed to generate semantically decomposed video tokens, in terms of object-wise and event-wise visual representations, to facilitate LLM inference. Particularly, we design an Object-Event Slots module, i.e., OE-Slots, that adaptively aggregates the dense video tokens from the vision encoder to a set of representative slots. In order to take into account both the spatial object details and the varied temporal dynamics, we build OE-Slots with two branches: the Object-Slots branch and the Event-Slots branch. The Object-Slots branch focuses on extracting object-centric slots from features of high spatial resolution but low frame sample rate, emphasizing detailed object information. The Event-Slots branch is engineered to learn event-centric slots from high temporal sample rate but low spatial resolution features. These complementary slots are combined to form the vision context, serving as the input to the LLM for effective video reasoning. Our experimental results demonstrate the effectiveness of our Slot-VLM, which achieves the state-of-the-art performance on video question-answering.

AAAI Conference 2024 Conference Paper

Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification

  • Zhiwei Zhao
  • Bin Liu
  • Yan Lu
  • Qi Chu
  • Nenghai Yu

Text-to-Image person re-identification (TI-ReID) aims to retrieve the images of a target identity according to the given textual description. Existing methods in TI-ReID focus on aligning the visual and textual modalities through contrastive feature alignment or reconstructive masked language modeling (MLM). However, these methods parameterize the image/text instances as deterministic embeddings and do not explicitly consider the inherent uncertainty in pedestrian images and their textual descriptions, leading to limited image-text relationship expression and semantic alignment. To address the above problem, in this paper, we propose a novel method that unifies multi-modal uncertainty modeling and semantic alignment for TI-ReID. Specifically, we model the image and textual feature vectors of pedestrians as Gaussian distributions, where the multi-granularity uncertainty of the distribution is estimated by incorporating batch-level and identity-level feature variances for each modality. The multi-modal uncertainty modeling acts as a feature augmentation and provides a richer image-text semantic relationship. Then we present a bi-directional cross-modal circle loss to more effectively align the probabilistic features between image and text in a self-paced manner. To further promote more comprehensive image-text semantic alignment, we design a task that complements the masked language modeling, focusing on the cross-modality semantic recovery of the global masked token after cross-modal interaction. Extensive experiments conducted on three TI-ReID datasets highlight the effectiveness and superiority of our method over state-of-the-art methods.

AAAI Conference 2023 Conference Paper

Active Token Mixer

  • Guoqiang Wei
  • Zhizheng Zhang
  • Cuiling Lan
  • Yan Lu
  • Zhibo Chen

The three existing dominant network families, i.e., CNNs, Transformers and MLPs, differ from each other mainly in the ways of fusing spatial contextual information, leaving designing more effective token-mixing mechanisms at the core of backbone architecture development. In this work, we propose an innovative token-mixer, dubbed Active Token Mixer (ATM), to actively incorporate contextual information from other tokens in the global scope into the given query token. This fundamental operator actively predicts where to capture useful contexts and learns how to fuse the captured contexts with the query token at channel level. In this way, the spatial range of token-mixing can be expanded to a global scope with limited computational complexity, where the way of token-mixing is reformed. We take ATMs as the primary operators and assemble them into a cascade architecture, dubbed ATMNet. Extensive experiments demonstrate that ATMNet is generally applicable and comprehensively surpasses different families of SOTA vision backbones by a clear margin on a broad range of vision tasks, including visual recognition and dense prediction tasks. Code is available at https://github.com/microsoft/ActiveMLP.

NeurIPS Conference 2023 Conference Paper

DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models

  • Tao Yang
  • Yuwang Wang
  • Yan Lu
  • Nanning Zheng

Aiming to understand the underlying explainable factors behind observations and model the conditional generation process on these factors, we connect disentangled representation learning to diffusion probabilistic models (DPMs) to take advantage of the remarkable modeling ability of DPMs. We propose a new task, disentanglement of DPMs: given a pre-trained DPM, without any annotations of the factors, the task is to automatically discover the inherent factors behind the observations and disentangle the gradient fields of the DPM into sub-gradient fields, each conditioned on the representation of one discovered factor. With disentangled DPMs, those inherent factors can be automatically discovered, explicitly represented, and clearly injected into the diffusion process via the sub-gradient fields. To tackle this task, we devise an unsupervised approach, named DisDiff, which for the first time achieves disentangled representation learning in the framework of DPMs. Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness of DisDiff.

NeurIPS Conference 2023 Conference Paper

Learning Trajectories are Generalization Indicators

  • Jingwen Fu
  • Zhizheng Zhang
  • Dacheng Yin
  • Yan Lu
  • Nanning Zheng

This paper explores the connection between learning trajectories of Deep Neural Networks (DNNs) and their generalization capabilities when optimized using (stochastic) gradient descent algorithms. Instead of concentrating solely on the generalization error of the DNN post-training, we present a novel perspective for analyzing generalization error by investigating the contribution of each update step to the change in generalization error. This perspective enables a more direct comprehension of how the learning trajectory influences generalization error. Building upon this analysis, we propose a new generalization bound that incorporates more extensive trajectory information. Our proposed generalization bound depends on the complexity of the learning trajectory and the ratio between the bias and diversity of the training set. Experimental observations reveal that our method effectively captures the generalization error throughout the training process. Furthermore, our approach can also track changes in generalization error when adjustments are made to learning rates and label noise levels. These results demonstrate that learning trajectory information is a valuable indicator of a model's generalization capabilities.

AAAI Conference 2023 Conference Paper

Multi-View Domain Adaptive Object Detection on Camera Networks

  • Yan Lu
  • Zhun Zhong
  • Yuanchao Shu

In this paper, we study a new domain adaptation setting on camera networks, namely Multi-View Domain Adaptive Object Detection (MVDA-OD), in which labeled source data is unavailable in the target adaptation process and target data is captured from multiple overlapping cameras. In such a challenging context, existing methods including adversarial training and self-training fall short due to multi-domain data shift and the lack of source data. To tackle this problem, we propose a novel training framework consisting of two stages. First, we pre-train the backbone using self-supervised learning, in which a multi-view association is developed to construct an effective pretext task. Second, we fine-tune the detection head using robust self-training, where a tracking-based single-view augmentation is introduced to achieve weak-hard consistency learning. By doing so, an object detection model can take advantage of informative samples generated by multi-view association and single-view augmentation to learn discriminative backbones as well as robust detection classifiers. Experiments on two real-world multi-camera datasets demonstrate significant advantages of our approach over the state-of-the-art domain adaptive object detection methods.

NeurIPS Conference 2022 Conference Paper

Alignment-guided Temporal Attention for Video Action Recognition

  • Yizhou Zhao
  • Zhenyang Li
  • Xun Guo
  • Yan Lu

Temporal modeling is crucial for various video learning tasks. Most recent approaches employ either factorized (2D+1D) or joint (3D) spatial-temporal operations to extract temporal contexts from the input frames. While the former is more efficient in computation, the latter often obtains better performance. In this paper, we attribute this to a dilemma between the sufficiency and the efficiency of interactions among various positions in different frames. These interactions affect the extraction of task-relevant information shared among frames. To resolve this issue, we prove that frame-by-frame alignments have the potential to increase the mutual information between frame representations, thereby including more task-relevant information to boost effectiveness. Then we propose Alignment-guided Temporal Attention (ATA) to extend 1-dimensional temporal attention with parameter-free patch-level alignments between neighboring frames. It can act as a general plug-in for image backbones to conduct the action recognition task without any model-specific design. Extensive experiments on multiple benchmarks demonstrate the superiority and generality of our module.
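
A minimal numpy sketch of the parameter-free patch-level alignment described above, using cosine-similarity matching between neighboring frames; the actual ATA module then applies 1D temporal attention along the aligned token trajectories, which is omitted here, and the feature shapes are toy assumptions.

```python
import numpy as np

def align_patches(prev_tokens, cur_tokens):
    """Parameter-free alignment: for every patch token in the current frame,
    pick the most similar patch in the previous frame (cosine similarity)."""
    a = prev_tokens / np.linalg.norm(prev_tokens, axis=1, keepdims=True)
    b = cur_tokens / np.linalg.norm(cur_tokens, axis=1, keepdims=True)
    match = (b @ a.T).argmax(axis=1)     # best-matching previous patch per current patch
    return prev_tokens[match]            # previous-frame tokens re-ordered to align

frames = np.random.randn(4, 49, 64)      # (T, patches, dim) toy patch features
aligned_prev = [align_patches(frames[t - 1], frames[t]) for t in range(1, 4)]
print(aligned_prev[0].shape)             # (49, 64), aligned to frame 1's patch order
```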

AAAI Conference 2022 Conference Paper

Hybrid Instance-Aware Temporal Fusion for Online Video Instance Segmentation

  • Xiang Li
  • Jinglu Wang
  • Xiao Li
  • Yan Lu

Recently, transformer-based image segmentation methods have achieved notable success against previous solutions. While for video domains, how to effectively model temporal context with the attention of object instances across frames remains an open problem. In this paper, we propose an online video instance segmentation framework with a novel instance-aware temporal fusion method. We first leverage the representation, i.e., a latent code in the global context (instance code) and CNN feature maps, to represent instance- and pixel-level features. Based on this representation, we introduce a cropping-free temporal fusion approach to model the temporal consistency between video frames. Specifically, we encode global instance-specific information in the instance code and build up inter-frame contextual fusion with hybrid attentions between the instance codes and CNN feature maps. Inter-frame consistency between the instance codes is further enforced with order constraints. By leveraging the learned hybrid temporal consistency, we are able to directly retrieve and maintain instance identities across frames, eliminating the complicated frame-wise instance matching in prior methods. Extensive experiments have been conducted on popular VIS datasets, i.e., YouTube-VIS-19/21. Our model achieves the best performance among all online VIS methods. Notably, our model also eclipses all offline methods when using the ResNet-50 backbone.

NeurIPS Conference 2022 Conference Paper

Mask-based Latent Reconstruction for Reinforcement Learning

  • Tao Yu
  • Zhizheng Zhang
  • Cuiling Lan
  • Yan Lu
  • Zhibo Chen

For deep reinforcement learning (RL) from pixels, learning effective state representations is crucial for achieving high performance. However, in practice, limited experience and high-dimensional inputs prevent effective representation learning. To address this, motivated by the success of mask-based modeling in other research fields, we introduce mask-based reconstruction to promote state representation learning in RL. Specifically, we propose a simple yet effective self-supervised method, Mask-based Latent Reconstruction (MLR), to predict complete state representations in the latent space from the observations with spatially and temporally masked pixels. MLR enables better use of context information when learning state representations to make them more informative, which facilitates the training of RL agents. Extensive experiments show that our MLR significantly improves the sample efficiency in RL and outperforms the state-of-the-art sample-efficient RL methods on multiple continuous and discrete control benchmarks. Our code is available at https://github.com/microsoft/Mask-based-Latent-Reconstruction.

AAAI Conference 2022 Conference Paper

Reliable Propagation-Correction Modulation for Video Object Segmentation

  • Xiaohao Xu
  • Jinglu Wang
  • Xiao Li
  • Yan Lu

Error propagation is a general but crucial problem in online semi-supervised video object segmentation. We aim to suppress error propagation through a correction mechanism with high reliability. The key insight is to disentangle the correction from the conventional mask propagation process with reliable cues. We introduce two modulators, propagation and correction modulators, to separately perform channel-wise re-calibration on the target frame embeddings according to local temporal correlations and reliable references respectively. Specifically, we assemble the modulators with a cascaded propagation-correction scheme. This avoids overriding the effects of the reliable correction modulator by the propagation modulator. Although the reference frame with the ground truth label provides reliable cues, it could be very different from the target frame and introduce uncertain or incomplete correlations. We augment the reference cues by supplementing reliable feature patches to a maintained pool, thus offering more comprehensive and expressive object representations to the modulators. In addition, a reliability filter is designed to retrieve reliable patches and pass them in subsequent frames. Our model achieves state-of-the-art performance on YouTube-VOS18/19 and DAVIS17-Val/Test benchmarks. Extensive experiments demonstrate that the correction mechanism provides considerable performance gain by fully utilizing reliable guidance.

NeurIPS Conference 2022 Conference Paper

Visual Concepts Tokenization

  • Tao Yang
  • Yuwang Wang
  • Yan Lu
  • Nanning Zheng

Obtaining the human-like perception ability of abstracting visual concepts from concrete pixels has always been a fundamental and important target in machine learning research fields such as disentangled representation learning and scene decomposition. Towards this goal, we propose an unsupervised transformer-based Visual Concepts Tokenization framework, dubbed VCT, to perceive an image into a set of disentangled visual concept tokens, with each concept token responding to one type of independent visual concept. Particularly, to obtain these concept tokens, we only use cross-attention to extract visual information from the image tokens layer by layer, without self-attention between concept tokens, preventing information leakage across concept tokens. We further propose a Concept Disentangling Loss to encourage different concept tokens to represent independent visual concepts. The cross-attention and disentangling loss play the roles of induction and mutual exclusion for the concept tokens, respectively. Extensive experiments on several popular datasets verify the effectiveness of VCT on the tasks of disentangled representation learning and scene decomposition. VCT achieves state-of-the-art results by a large margin.
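
A toy numpy version of the cross-attention-only extraction described above: concept tokens query the image tokens, and the absence of self-attention between concept tokens prevents information from leaking across them. Dimensions, weights, and token counts are random placeholders, not the VCT architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def concept_cross_attention(concept_tokens, image_tokens, wq, wk, wv):
    """One cross-attention layer: concept tokens act only as queries over the
    image tokens; with no self-attention among concept tokens, each token can
    gather visual information without exchanging it with the other tokens."""
    q = concept_tokens @ wq                                    # (C, d)
    k = image_tokens @ wk                                      # (N, d)
    v = image_tokens @ wv                                      # (N, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)    # (C, N)
    return concept_tokens + attn @ v                           # residual update

d = 32
concepts = np.random.randn(10, d)      # 10 learnable concept tokens (toy)
patches = np.random.randn(196, d)      # image tokens from a ViT-like encoder (toy)
out = concept_cross_attention(concepts, patches, *(np.random.randn(d, d) for _ in range(3)))
print(out.shape)                       # (10, 32)
```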

NeurIPS Conference 2021 Conference Paper

Deep Contextual Video Compression

  • Jiahao Li
  • Bin Li
  • Yan Lu

Most of the existing neural video compression methods adopt the predictive coding framework, which first generates the predicted frame and then encodes its residue with the current frame. However, as for compression ratio, predictive coding is only a sub-optimal solution as it uses simple subtraction operation to remove the redundancy across frames. In this paper, we propose a deep contextual video compression framework to enable a paradigm shift from predictive coding to conditional coding. In particular, we try to answer the following questions: how to define, use, and learn condition under a deep video compression framework. To tap the potential of conditional coding, we propose using feature domain context as condition. This enables us to leverage the high dimension context to carry rich information to both the encoder and the decoder, which helps reconstruct the high-frequency contents for higher video quality. Our framework is also extensible, in which the condition can be flexibly designed. Experiments show that our method can significantly outperform the previous state-of-the-art (SOTA) deep video compression methods. When compared with x265 using veryslow preset, we can achieve 26.0% bitrate saving for 1080P standard test videos.
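
An interface-level sketch of the shift from residual to conditional coding that the abstract describes: instead of coding the subtraction x - x_pred, the encoder receives the current frame together with a feature-domain context. The toy encoder and quantizer below are placeholders under that assumption, not the DCVC network.

```python
import numpy as np

def residual_coding(frame, prediction, quantize):
    """Predictive coding: subtract the prediction and code the residue."""
    return quantize(frame - prediction)

def conditional_coding(frame, context, encoder, quantize):
    """Conditional coding: the encoder sees the current frame together with a
    feature-domain context, so redundancy removal is learned end-to-end rather
    than fixed as subtraction. Interfaces here are purely illustrative."""
    return quantize(encoder(frame, context))

quantize = np.round
frame, prediction = np.random.randn(8, 8), np.random.randn(8, 8)
res_symbols = residual_coding(frame, prediction, quantize)
# toy "encoder": a stand-in for the learned network that consumes frame + context
cond_symbols = conditional_coding(frame, prediction, lambda x, c: x - 0.9 * c, quantize)
```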

AAAI Conference 2021 Conference Paper

Interactive Speech and Noise Modeling for Speech Enhancement

  • Chengyu Zheng
  • Xiulian Peng
  • Yuan Zhang
  • Sriram Srinivasan
  • Yan Lu

Speech enhancement is challenging because of the diversity of background noise types. Most of the existing methods are focused on modeling the speech rather than the noise. In this paper, we propose a novel idea to model speech and noise simultaneously in a two-branch convolutional neural network, namely SN-Net. In SN-Net, the two branches predict speech and noise, respectively. Instead of information fusion only at the final output layer, interaction modules are introduced at several intermediate feature domains between the two branches to benefit each other. Such an interaction can leverage features learned from one branch to counteract the undesired part and restore the missing component of the other, and thus enhance their discrimination capabilities. We also design a feature extraction module, namely residual-convolution-and-attention (RA), to capture the correlations along temporal and frequency dimensions for both the speech and the noise. Evaluations on public datasets show that the interaction module plays a key role in simultaneous modeling and that SN-Net outperforms the state of the art by a large margin on various evaluation metrics. The proposed SN-Net also shows superior performance for speaker separation.

AAAI Conference 2021 Conference Paper

Joint Color-irrelevant Consistency Learning and Identity-aware Modality Adaptation for Visible-infrared Cross Modality Person Re-identification

  • Zhiwei Zhao
  • Bin Liu
  • Qi Chu
  • Yan Lu
  • Nenghai Yu

Visible-infrared cross-modality person re-identification (VI-ReID) is a core but challenging technology in 24-hour intelligent surveillance systems. How to eliminate the large modality gap lies at the heart of VI-ReID. Conventional methods mainly focus on directly aligning the heterogeneous modalities into the same space. However, due to the unbalanced color information between the visible and infrared images, the features of visible images tend to overfit the clothing color information, which is harmful to modality alignment. Besides, these methods mainly align the heterogeneous feature distributions at the dataset level while ignoring the valuable identity information, which may cause feature misalignment for some identities and weaken the discrimination of features. To tackle the above problems, we propose a novel approach for VI-ReID. It learns color-irrelevant features through color-irrelevant consistency learning (CICL) and aligns the identity-level feature distributions by identity-aware modality adaptation (IAMA). CICL and IAMA are integrated into a joint learning framework and can promote each other. Extensive experiments on two popular datasets, SYSU-MM01 and RegDB, demonstrate the superiority and effectiveness of our approach against state-of-the-art methods.

AAAI Conference 2021 Conference Paper

Weakly-supervised Temporal Action Localization by Uncertainty Modeling

  • Pilhyeon Lee
  • Jinglu Wang
  • Yan Lu
  • Hyeran Byun

Weakly-supervised temporal action localization aims to learn to detect temporal intervals of action classes with only video-level labels. To this end, it is crucial to separate frames of action classes from the background frames (i.e., frames not belonging to any action classes). In this paper, we present a new perspective on background frames, where they are modeled as out-of-distribution samples regarding their inconsistency. Then, background frames can be detected by estimating the probability of each frame being out-of-distribution, known as uncertainty, but it is infeasible to directly learn uncertainty without frame-level labels. To realize uncertainty learning in the weakly-supervised setting, we leverage the multiple instance learning formulation. Moreover, we further introduce a background entropy loss to better discriminate background frames by encouraging their in-distribution (action) probabilities to be uniformly distributed over all action classes. Experimental results show that our uncertainty modeling is effective at alleviating the interference of background frames and brings a large performance gain without bells and whistles. We demonstrate that our model significantly outperforms state-of-the-art methods on the benchmarks THUMOS’14 and ActivityNet (1.2 & 1.3). Our code is available at https://github.com/Pilhyeon/WTAL-Uncertainty-Modeling.
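
A small numpy sketch of the background entropy idea mentioned above: encouraging the action-class probabilities of (pseudo-)background frames to be uniform by maximizing their entropy. The exact formulation in the paper may differ; this is only an illustrative stand-in.

```python
import numpy as np

def background_entropy_loss(action_probs):
    """Encourage background frames' in-distribution (action) probabilities to be
    uniform over all action classes by maximizing their entropy, written here as
    a loss to minimize. action_probs: (num_bg_frames, num_classes)."""
    p = np.clip(action_probs, 1e-8, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)   # per-frame entropy
    return -entropy.mean()                   # minimizing this maximizes entropy

probs = np.random.dirichlet(np.ones(20), size=16)  # toy softmax outputs for 16 frames
print(background_entropy_loss(probs))
```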

IROS Conference 2020 Conference Paper

Model Quality Aware RANSAC: A Robust Camera Motion Estimator

  • Shu-Hao Yeh
  • Yan Lu
  • Dezhen Song

Robust estimation of camera motion under the presence of outlier noise is a fundamental problem in robotics and computer vision. Despite existing efforts that focus on detecting motion and scene degeneracies, the best existing approach that builds on Random Sample Consensus (RANSAC) still has a non-negligible failure rate. Since a single failure can lead to the failure of the entire visual simultaneous localization and mapping, it is important to further improve the robust estimation algorithm. We propose a new robust camera motion estimator (RCME) by incorporating two main changes: a model-sample consistency test at the model instantiation step and an inlier set quality test that verifies model-inlier consistency using differential entropy. We have implemented our RCME algorithm and tested it under many public datasets. The results have shown a consistent reduction in failure rate when compared to the RANSAC-based Gold Standard approach and two recent variations of RANSAC methods.

AAAI Conference 2019 Conference Paper

MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization

  • Zengyi Qin
  • Jinglu Wang
  • Yan Lu

Localizing objects in the real 3D space, which plays a crucial role in scene understanding, is particularly challenging given only a single RGB image due to the geometric information loss during imagery projection. We propose MonoGRNet for the amodal 3D object localization from a monocular RGB image via geometric reasoning in both the observed 2D projection and the unobserved depth dimension. MonoGRNet is a single, unified network composed of four task-specific subnetworks, responsible for 2D object detection, instance depth estimation (IDE), 3D localization and local corner regression. Unlike the pixel-level depth estimation that needs per-pixel annotations, we propose a novel IDE method that directly predicts the depth of the targeting 3D bounding box’s center using sparse supervision. The 3D localization is further achieved by estimating the position in the horizontal and vertical dimensions. Finally, MonoGRNet is jointly learned by optimizing the locations and poses of the 3D bounding boxes in the global context. We demonstrate that MonoGRNet achieves state-of-the-art performance on challenging datasets.

AAAI Conference 2019 Conference Paper

MVPNet: Multi-View Point Regression Networks for 3D Object Reconstruction from A Single Image

  • Jinglu Wang
  • Bo Sun
  • Yan Lu

In this paper, we address the problem of reconstructing an object’s surface from a single image using generative networks. First, we represent a 3D surface with an aggregation of dense point clouds from multiple views. Each point cloud is embedded in a regular 2D grid aligned on an image plane of a viewpoint, making the point cloud convolution-favored and ordered so as to fit into deep network architectures. The point clouds can be easily triangulated by exploiting connectivities of the 2D grids to form mesh-based surfaces. Second, we propose an encoder-decoder network that generates such kind of multiple view-dependent point clouds from a single image by regressing their 3D coordinates and visibilities. We also introduce a novel geometric loss that is able to interpret discrepancy over 3D surfaces as opposed to 2D projective planes, resorting to the surface discretization on the constructed meshes. We demonstrate that the multi-view point regression network outperforms state-of-the-art methods with a significant improvement on challenging datasets.

IROS Conference 2016 Conference Paper

Visual programming for mobile robot navigation using high-level landmarks

  • Joseph Lee
  • Yan Lu
  • Yiliang Xu
  • Dezhen Song

We propose a visual programming system that allows users to specify navigation tasks for mobile robots using high-level landmarks in a virtual reality (VR) environment constructed from the output of visual simultaneous localization and mapping (vSLAM). The VR environment provides a Google Street View-like interface for users to familiarize themselves with the robot's working environment, specify high-level landmarks, and determine task-level motion commands related to each landmark. Our system builds a roadmap by using the pose graph from the vSLAM outputs. Based on the roadmap, the high-level landmarks, and task-level motion commands, our system generates an output path for the robot to accomplish the navigation task. We present data structures, architecture, interface, and algorithms for our system and show that, given n_s search-type motion commands, our system generates a path in O(n_s(n_r log n_r + m_r)) time, where n_r and m_r are the number of roadmap nodes and edges, respectively. We have implemented our system and tested it on real-world data.
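
The stated complexity corresponds to running a heap-based shortest-path search on the roadmap once per search-type command, which costs O(n_r log n_r + m_r) each time. A generic Dijkstra sketch is shown below as an assumption-level illustration; the authors' roadmap construction and command semantics are not reproduced here.

```python
import heapq

def shortest_path(adj, start, goal):
    """Dijkstra over a roadmap graph; with a binary heap each query runs in
    O(n_r log n_r + m_r), so n_s search-type commands cost O(n_s(n_r log n_r + m_r)).
    adj maps a node to a list of (neighbor, edge_cost) pairs."""
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue                              # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [], goal
    while node != start:                          # walk back along predecessors
        path.append(node)
        node = prev[node]
    return [start] + path[::-1]

roadmap = {"a": [("b", 1.0)], "b": [("c", 2.0)], "c": []}
print(shortest_path(roadmap, "a", "c"))           # ['a', 'b', 'c']
```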

IROS Conference 2015 Conference Paper

Robustness to lighting variations: An RGB-D indoor visual odometry using line segments

  • Yan Lu
  • Dezhen Song

Large lighting variation challenges all visual odometry methods, even with RGB-D cameras. Here we propose a line segment-based RGB-D indoor odometry algorithm robust to lighting variation. We know line segments are abundant indoors and less sensitive to lighting change than point features. However, depth data are often noisy, corrupted or even missing for line segments which are often found on object boundaries where significant depth discontinuities occur. Our algorithm samples depth data along line segments, and uses a random sample consensus approach to identify correct depth and estimate 3D line segments. We analyze 3D line segment uncertainties and estimate camera motion by minimizing the Mahalanobis distance. In experiments we compare our method with two state-of-the-art methods including a keypoint-based approach and a dense visual odometry algorithm, under both constant and varying lighting. Our method demonstrates superior robustness to lighting change by outperforming the competing methods on 6 out of 8 long indoor sequences under varying lighting. Meanwhile our method also achieves improved accuracy even under constant lighting when tested using public data.
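
A simplified numpy stand-in for the depth-sampling step described in this abstract: RANSAC-style fitting of a 3D line to depth samples taken along a detected 2D segment, so corrupted depths near boundaries are rejected. The thresholds and sampling scheme are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def ransac_line_3d(points, iters=200, inlier_thresh=0.02, rng=None):
    """Fit a 3D line to sampled depth points by repeatedly proposing a line from
    two random samples and keeping the proposal with the most inliers."""
    rng = rng or np.random.default_rng(0)
    best_inliers = None
    for _ in range(iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        p, d = points[i], points[j] - points[i]
        if np.linalg.norm(d) < 1e-9:
            continue
        d = d / np.linalg.norm(d)
        diff = points - p
        dist = np.linalg.norm(diff - (diff @ d)[:, None] * d, axis=1)  # point-to-line
        inliers = dist < inlier_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return points[best_inliers]            # depth samples consistent with one 3D line

# toy data: points along a 3D segment with small noise plus gross depth outliers
pts = np.random.randn(50, 3) * 0.01 + np.outer(np.linspace(0, 1, 50), [1.0, 0.5, 2.0])
pts[::10] += 0.5
print(len(ransac_line_3d(pts)))
```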

ICRA Conference 2014 Conference Paper

High level landmark-based visual navigation using unsupervised geometric constraints in local bundle adjustment

  • Yan Lu
  • Dezhen Song
  • Jingang Yi

We present a high level landmark-based visual navigation approach for a monocular mobile robot. We utilize heterogeneous features, such as points, line segments, lines, planes, and vanishing points, and their inner geometric constraints as the integrated high level landmarks. This is managed through a multilayer feature graph (MFG). Our method extends local bundle adjustment (LBA)-based framework by explicitly exploiting different features and their geometric relationships in an unsupervised manner. The algorithm takes a video stream as input, initializes and incrementally updates MFG based on extracted key frames; it also refines localization and MFG landmarks through the LBA. Physical experiments show that our method can reduce the absolute trajectory error of a traditional point landmark-based LBA method by up to 63.9%.

IROS Conference 2014 Conference Paper

Planar building facade segmentation and mapping using appearance and geometric constraints

  • Joseph Lee
  • Yan Lu
  • Dezhen Song

Segmentation and mapping of planar building facades (PBFs) can increase a robot's ability of scene understanding and localization in urban environments which are often quasi-rectilinear and GPS-challenged. PBFs are basic components of the quasi-rectilinear environment. We propose a passive vision-based PBF segmentation and mapping algorithm by combining both appearance and geometric constraints. We propose a rectilinear index which allows us to segment out planar regions using appearance data. Then we combine geometric constraints such as reprojection errors, orientation constraints, and coplanarity constraints in an optimization process to improve the mapping of PBFs. We have implemented the algorithm and tested it in comparison with the state of the art. The results show that our method can reduce the angular error of scene structure by an average of 82.82%.

ICRA Conference 2012 Conference Paper

A two-view based multilayer feature graph for robot navigation

  • Haifeng Li 0008
  • Dezhen Song
  • Yan Lu
  • Jingtai Liu

To facilitate scene understanding and robot navigation in a modern urban area, we design a multilayer feature graph (MFG) based on two views from an on-board camera. The nodes of an MFG are features such as scale invariant feature transformation (SIFT) feature points, line segments, lines, and planes while edges of the MFG represent different geometric relationships such as adjacency, parallelism, collinearity, and coplanarity. MFG also connects the features in two views and the corresponding 3D coordinate system. Building on SIFT feature points and line segments, MFG is constructed using feature fusion which incrementally, iteratively, and extensively verifies the aforementioned geometric relationships using random sample consensus (RANSAC) framework. Physical experiments show that MFG can be successfully constructed in urban area and the construction method is demonstrated to be very robust in identifying feature correspondence.

IROS Conference 2012 Conference Paper

Simplified markov random fields for efficient semantic labeling of 3D point clouds

  • Yan Lu
  • Christopher Rasmussen

In this paper, we focus on 3D point cloud classification by assigning semantic labels to each point in the scene. We propose to use simplified Markov networks to model the contextual relations between points, where the node potentials are calculated from point-wise classification results using off-the-shelf classifiers, such as Random Forest and Support Vector Machines, and the edge potentials are set by the physical distance between points. Our experimental results show that this approach yields comparable if not better results with improved speed compared with state-of-the-art methods. We also propose a novel robust neighborhood filtering method to exclude outliers in the neighborhood of points, in order to reduce noise in local geometric statistics when extracting features and also to reduce the number of false edges when constructing Markov networks. We show that applying robust neighborhood filtering improves the results when classifying point clouds with more object categories.
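
A toy version of the simplified-MRF idea: unary costs taken from an off-the-shelf point classifier plus a Potts-style penalty on neighboring points, solved here with iterated conditional modes. The inference scheme and the uniform edge weight are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def icm_labeling(unary, edges, edge_weight=1.0, iters=5):
    """Simplified MRF inference by iterated conditional modes: unary[i, c] is the
    cost (e.g. negative log-probability) of point i taking class c from a
    point-wise classifier, and each edge (i, j) penalizes label disagreement."""
    labels = unary.argmin(axis=1)
    n_classes = unary.shape[1]
    for _ in range(iters):
        for i in range(len(labels)):
            costs = unary[i].copy()
            neighbors = [b for a, b in edges if a == i] + [a for a, b in edges if b == i]
            for j in neighbors:
                costs += edge_weight * (np.arange(n_classes) != labels[j])
            labels[i] = costs.argmin()
    return labels

unary = np.random.rand(6, 3)              # 6 points, 3 semantic classes (toy costs)
edges = [(0, 1), (1, 2), (3, 4), (4, 5)]  # toy k-NN graph edges
print(icm_labeling(unary, edges))
```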

IROS Conference 2011 Conference Paper

Integrating stereo structure for omnidirectional trail following

  • Christopher Rasmussen
  • Yan Lu
  • Mehmet Kemal Kocamaz

We describe a system which follows “trails” for autonomous outdoor robot navigation. Through a combination of appearance and structural cues derived from stereo omnidirectional color cameras, the algorithm is able to detect and track rough paths despite widely varying tread material, border vegetation, and illumination conditions. The approaching trail region is modeled as a circular arc segment of constant width. Using likelihood formulations which measure color, brightness, and/or height contrast between a hypothetical region and flanking areas, the tracker performs a robust randomized search for the most likely trail region and robot pose relative to it with no a priori appearance model. The addition of the structural information, which is derived from a semi-global dense stereo algorithm with ground-plane fitting, is shown to improve trail segmentation accuracy and provide an additional layer of safety beyond solely ladar-based obstacle avoidance. Our system's ability to follow a variety of trails is demonstrated through live runs as well as analysis of offline runs on several long sequences with diverse appearance and structural characteristics using ground-truth segmentations.

IROS Conference 2010 Conference Paper

Trail following with omnidirectional vision

  • Christopher Rasmussen
  • Yan Lu
  • Mehmet Kemal Kocamaz

We describe a system which follows “trails” for autonomous outdoor robot navigation. Through a combination of visual cues provided by stereo omnidirectional color cameras and ladar-based structural information, the algorithm is able to detect and track rough paths despite widely varying tread material, border vegetation, and illumination conditions. The approaching trail region is simply modeled as a circular arc of constant width. Using an adaptive measure of color and brightness contrast between a hypothetical region and flanking areas, the tracker performs a robust randomized search for the most likely trail region and robot pose relative to it with no a priori appearance model. Stereo visual odometry improves tracker dynamics on uneven terrain and permits local obstacle map maintenance. A motion planner is also described which takes the trail shape estimate and local map to plan smooth trajectories around in-trail and near-trail hazards. Our system's performance is analyzed on several long sequences with diverse appearance and structural characteristics using ground-truth segmentations.

IROS Conference 2009 Conference Paper

Appearance contrast for fast, robust trail-following

  • Christopher Rasmussen
  • Yan Lu
  • Mehmet Kemal Kocamaz

We describe a framework for finding and tracking “trails” for autonomous outdoor robot navigation. Through a combination of visual cues and ladar-derived structural information, the algorithm is able to follow paths which pass through multiple zones of terrain smoothness, border vegetation, tread material, and illumination conditions. Our shape-based visual trail tracker assumes that the approaching trail region is approximately triangular under perspective. It generates region hypotheses from a learned distribution of expected trail width and curvature variation, and scores them using a robust measure of color and brightness contrast with flanking regions. The structural component analogously rewards hypotheses which correspond to empty or low-density regions in a groundstrike-filtered ladar obstacle map. Our system's performance is analyzed on several long sequences with diverse appearance and structural characteristics. Ground-truth segmentations are used to quantify performance where available, and several alternative algorithms are compared on the same data.