Arrow Research search

Author name cluster

Wanli Ouyang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

99 papers
2 author rows

Possible papers

99

AAAI Conference 2026 Conference Paper

ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction

  • Pengze Li
  • Jiaqi Liu
  • Junchi Yu
  • Lihao Liu
  • Mingyu Ding
  • Wanli Ouyang
  • Shixiang Tang
  • Xi Chen

Large language models (LLMs) are increasingly used in scientific domains. While they can produce reasoning-like content via methods such as chain-of-thought prompting, these outputs are typically unstructured and informal, obscuring whether models truly understand the fundamental reasoning paradigms that underpin scientific inference. To address this, we introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT). In an RLT, all reasoning steps are explicitly categorized as one of three variants of Peirce’s fundamental inference modes: deduction, induction, or abduction. To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, including more than 1,900 references and 38,000 viewpoints. We propose two logic-aware evaluation metrics: Entity Coverage (EC) for content completeness and Reasoning Edge Accuracy (REA) for step-by-step logical validity. Evaluations on 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and none are yet able to extract a complete and standard reasoning chain. These findings highlight a substantial gap between the abilities of current reasoning models and the rigor required for scientific argumentation.

AAAI Conference 2026 Conference Paper

Mitigating Low-Quality Reasoning in MLLMs: Self-Driven Refined Multimodal CoT with Selective Thinking and Step-wise Visual Enhancement

  • Chongjun Tu
  • Peng Ye
  • Dongzhan Zhou
  • Tao Chen
  • Wanli Ouyang

Current Multimodal Chain-of-Thought (MCoT) methods suffer from low-quality multimodal reasoning, characterized by overthinking on simple queries and inefficient utilization of visual information, resulting in substantial wasted and ineffective computation. In this paper, we discover that Multimodal Large Language Models (MLLMs) possess inherent capabilities to distinguish between simple and difficult queries and enhance task-related visual information, which remain underutilized by existing approaches. Based on this insight, we propose Self-Driven Refined Multimodal CoT (SDR-MCoT), a training-free framework that mitigates these issues through two self-driven modules. First, our selective thinking module employs entropy-based confidence estimation to determine whether queries require detailed reasoning, preventing overthinking on simple questions. Second, our step-wise visual enhancement module strengthens attention to relevant visual regions at each reasoning step without inserting additional tokens, achieving fine-grained visual grounding and enhancement with minimal overhead. Moreover, SDR-MCoT can be seamlessly integrated into various MLLMs, offering a practical solution for improving multimodal reasoning. Comprehensive experiments across eight benchmarks from diverse domains (multimodal reasoning, visual understanding, hallucination, and mathematical reasoning) demonstrate that SDR-MCoT consistently outperforms existing MCoT methods on four different base models with reduced overhead. For instance, on Qwen2-VL-7B, our method improves average accuracy by over 6% while reducing token consumption by approximately 60% compared to zero-shot CoT.
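
The entropy-based selective thinking described in the abstract can be illustrated with a minimal sketch. The threshold and the toy next-token distributions below are illustrative assumptions, not the paper's actual implementation:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def needs_detailed_reasoning(answer_probs, threshold=0.5):
    """High entropy means the model is uncertain about a direct answer,
    so full chain-of-thought reasoning is triggered; low entropy means
    the model can answer directly, avoiding overthinking."""
    return token_entropy(answer_probs) > threshold

# A peaked distribution (confident) skips detailed reasoning;
# a flat distribution (uncertain) triggers step-by-step reasoning.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
```

In this sketch the decision is made once per query; the paper's module presumably applies such confidence signals inside the model's own decoding process.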

ICLR Conference 2025 Conference Paper

A CLIP-Powered Framework for Robust and Generalizable Data Selection

  • Suorong Yang
  • Peng Ye 0006
  • Wanli Ouyang
  • Dongzhan Zhou
  • Furao Shen

Large-scale datasets have been pivotal to the advancements of deep learning models in recent years, but training on such large datasets inevitably incurs substantial storage and computational overhead. Meanwhile, real-world datasets often contain redundant and noisy data, negatively impacting training efficiency and model performance. Data selection has shown promise in identifying the most representative samples from the entire dataset, aiming to minimize the performance gap at reduced training cost. Existing works typically rely on single-modality information to assign importance scores for individual samples, which may lead to inaccurate assessments, especially when dealing with noisy or corrupted samples. To address this limitation, we propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection. Specifically, our framework consists of three key modules—dataset adaptation, sample scoring, and selection optimization—that together harness extensive pre-trained multimodal knowledge to comprehensively assess sample influence and optimize the selection results through multi-objective optimization. Extensive experiments demonstrate that our approach consistently outperforms existing state-of-the-art baselines on various benchmark datasets. Notably, our method effectively removes noisy or damaged samples from the dataset, enabling it to achieve even higher performance with less data. This indicates that it is not only a way to accelerate training but can also improve overall data quality. The implementation is available at https://github.com/Jackbrocp/clip-powered-data-selection.

NeurIPS Conference 2025 Conference Paper

Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models

  • Xiaoyu Zhan
  • Wenxuan Huang
  • Hao Sun
  • Xinyu Fu
  • Changfeng Ma
  • Shaosheng Cao
  • Bohan Jia
  • Shaohui Lin

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved 2D visual understanding, prompting interest in their application to complex 3D reasoning tasks. However, it remains unclear whether these models can effectively capture the detailed spatial information required for robust real-world performance, especially cross-view consistency, a key requirement for accurate 3D reasoning. Considering this issue, we introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy: first, foundational knowledge is injected into the baseline MLLM via Supervised Fine-Tuning (SFT) on Viewpoint-100K, resulting in significant improvements across multiple tasks; second, generalization is enhanced through Reinforcement Learning using the Group Relative Policy Optimization (GRPO) algorithm on a broader set of questions. Additionally, we introduce a hybrid cold-start initialization method designed to simultaneously learn viewpoint representations and maintain coherent reasoning. Experimental results show that our approach significantly activates the spatial reasoning ability of MLLMs, improving performance on both in-domain and out-of-domain reasoning tasks. Our findings highlight the value of developing foundational spatial skills in MLLMs, supporting future progress in robotics, autonomous systems, and 3D scene understanding.

NeurIPS Conference 2025 Conference Paper

Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression

  • Xiaohui Wang
  • Peng Ye
  • Chenyu Huang
  • Shenghe Zheng
  • Bo Zhang
  • Lei Bai
  • Wanli Ouyang
  • Tao Chen

With the rise of the pretraining-then-fine-tuning paradigm, storing numerous fine-tuned models for multi-tasking creates significant storage overhead. Delta compression alleviates this by storing only the pretrained model and the highly compressed delta weights (the differences between fine-tuned and pretrained model weights). However, existing methods fail to maintain both high compression and performance, and often rely on data. To address these challenges, we propose UltraDelta, the first data-free delta compression pipeline that achieves both ultra-high compression and strong performance. UltraDelta is designed to minimize redundancy, maximize information, and stabilize performance across inter-layer, intra-layer, and global dimensions, using three key components: (1) Variance-Based Mixed Sparsity Allocation assigns sparsity based on variance, giving lower sparsity to high-variance layers to preserve inter-layer information. (2) Distribution-Aware Compression applies uniform quantization and then groups parameters by value, followed by group-wise pruning, to better preserve intra-layer distribution. (3) Trace-Norm-Guided Rescaling uses the trace norm of delta weights to estimate a global rescaling factor, improving model stability under higher compression. Extensive experiments across (a) large language models (fine-tuned on LLaMA-2 7B and 13B) with up to 50× compression, (b) general NLP models (RoBERTa-base, T5-base) with up to 224× compression, (c) vision models (ViT-B/32, ViT-L/14) with up to 132× compression, and (d) multi-modal models (BEiT-3) with 18× compression, demonstrate that UltraDelta consistently outperforms existing methods, especially under ultra-high compression. Code is available at https://github.com/xiaohuiwang000/UltraDelta.
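
Component (1), variance-based sparsity allocation, can be sketched as follows. The inverse mapping from layer variance to keep-ratio is an illustrative assumption; the paper's exact allocation rule is not given in the abstract:

```python
def allocate_sparsity(layer_variances, avg_sparsity=0.9):
    """Assign per-layer pruning sparsity so that high-variance layers
    (carrying more delta information) are pruned less, while the mean
    sparsity stays at the requested budget.

    Illustrative rule: each layer's keep-ratio is proportional to its
    share of the total variance across layers."""
    n = len(layer_variances)
    total = sum(layer_variances)
    keep_budget = n * (1.0 - avg_sparsity)  # total keep-ratio to distribute
    keeps = [keep_budget * v / total for v in layer_variances]
    # Clamp keep-ratios to [0, 1] and convert back to sparsities.
    return [1.0 - min(max(k, 0.0), 1.0) for k in keeps]
```

For example, with two layers of variance 1.0 and 3.0 and a 90% average budget, the high-variance layer receives the lower sparsity, preserving the inter-layer information the abstract describes.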

AAAI Conference 2025 Conference Paper

ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area

  • Junxian Li
  • Di Zhang
  • Xunzhi Wang
  • Zeying Hao
  • Jingdi Lei
  • Qian Tan
  • Cai Zhou
  • Wei Liu

Large Language Models (LLMs) have achieved remarkable success and have been applied across various scientific fields, including chemistry. However, many chemical tasks require the processing of visual information, which cannot be successfully handled by existing chemical LLMs. This brings a growing need for models capable of integrating multimodal information in the chemical domain. In this paper, we introduce ChemVLM, an open-source chemical multimodal large language model specifically designed for chemical applications. ChemVLM is trained on a carefully curated bilingual multimodal dataset that enhances its ability to understand both textual and visual chemical information, including molecular structures, reactions, and chemistry examination questions. We develop three datasets for comprehensive evaluation, tailored to Chemical Optical Character Recognition (OCR), Multimodal Chemical Reasoning (MMCR), and Multimodal Molecule Understanding tasks. We benchmark ChemVLM against a range of open-source and proprietary multimodal large language models on various tasks. Experimental results demonstrate that ChemVLM achieves competitive performance across all evaluated tasks.

NeurIPS Conference 2025 Conference Paper

CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming

  • Han Deng
  • Yuan Meng
  • Shixiang Tang
  • Wanli Ouyang
  • Xinzhu Ma

Competitive programming is widely used to evaluate the coding and reasoning abilities of large language models. However, the growing presence of duplicate or highly similar problems raises concerns not only about competition fairness, but also about the validity of competitive programming as a benchmark for model evaluation. We introduce a retrieval-oriented benchmark suite for competitive programming, covering four retrieval tasks—two code-centric (Text-to-Code, Code-to-Code) and two newly proposed problem-centric tasks (Problem-to-Duplicate, Simplified-to-Full)—built from a combination of automatically crawled problem–solution data and manually curated annotations. Our contribution includes both high-quality training data and temporally separated test sets for reliable evaluation. We develop two task-specialized retrievers based on this dataset: CPRetriever-Code, trained with a novel Group-InfoNCE loss for problem–code alignment, and CPRetriever-Prob, fine-tuned for problem-level similarity. Both models achieve strong results and are open-sourced for local use. Finally, we analyze LiveCodeBench and find that high-similarity problems inflate model pass rates and reduce differentiation, underscoring the need for similarity-aware evaluation in future benchmarks.
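
The Group-InfoNCE loss named above builds on the standard InfoNCE contrastive objective. As a point of reference only, here is a minimal single-query sketch of vanilla InfoNCE (the paper's group-wise generalization over multiple correct solutions per problem is not reproduced here):

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.07):
    """Vanilla InfoNCE loss for one query: cross-entropy of the
    positive's similarity against the positive plus all negatives.
    `sim_pos` / `sim_negs` are similarity scores (e.g. cosine)."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    # Numerically stable log-sum-exp for the partition function.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_z)
```

Raising the positive's similarity relative to the negatives drives the loss toward zero, which is the alignment pressure used to train the problem-code retriever.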

NeurIPS Conference 2025 Conference Paper

CSBrain: A Cross-scale Spatiotemporal Brain Foundation Model for EEG Decoding

  • Yuchen Zhou
  • Jiamin Wu
  • Zichen Ren
  • Zhouheng Yao
  • Weiheng Lu
  • Kunyu Peng
  • Qihao Zheng
  • Chunfeng Song

Understanding and decoding human brain activity from electroencephalography (EEG) signals is a fundamental problem in neuroscience and artificial intelligence, with applications ranging from cognition and emotion recognition to clinical diagnosis and brain–computer interfaces. While recent EEG foundation models have made progress in generalized brain decoding by leveraging unified architectures and large-scale pretraining, they inherit a scale-agnostic dense modeling paradigm from NLP and vision. This design overlooks an intrinsic property of neural activity—cross-scale spatiotemporal structure. Different EEG task patterns span a broad range of temporal and spatial scales, from brief neural activations to slow-varying rhythms, and from localized cortical activations to large-scale distributed interactions. Ignoring this diversity may lead to suboptimal representations and weakened generalization ability. To address these limitations, we propose CSBrain, a Cross-scale Spatiotemporal Brain foundation model for generalized EEG decoding. CSBrain introduces two key components: (i) Cross-scale Spatiotemporal Tokenization (CST), which aggregates multi-scale features within localized temporal windows and anatomical brain regions into compact scale-aware token representations; and (ii) Structured Sparse Attention (SSA), which models cross-window and cross-region dependencies for diverse decoding tasks, further enriching scale diversity while eliminating spurious dependencies. CST and SSA are alternately stacked to progressively integrate cross-scale spatiotemporal dependencies. Extensive experiments across 11 representative EEG tasks and 16 datasets demonstrate that CSBrain consistently outperforms both task-specific models and strong foundation baselines. These results establish cross-scale modeling as a key inductive bias for generalized EEG decoding and highlight CSBrain as a robust backbone for future brain–AI research.

ICLR Conference 2025 Conference Paper

Depth Any Video with Scalable Synthetic Data

  • Honghui Yang
  • Di Huang
  • Wei Yin 0006
  • Chunhua Shen
  • Haifeng Liu 0001
  • Xiaofei He 0001
  • Binbin Lin
  • Wanli Ouyang

Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse virtual environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates—even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency. The code and model weights are open-sourced.

NeurIPS Conference 2025 Conference Paper

Flow-GRPO: Training Flow Matching Models via Online RL

  • Jie Liu
  • Gongye Liu
  • Jiajun Liang
  • Yangguang Li
  • Jiaheng Liu
  • Xintao Wang
  • Pengfei Wan
  • Di Zhang

We propose Flow-GRPO, the first method to integrate online policy gradient reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original number of inference steps, significantly improving sampling efficiency without sacrificing performance. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For compositional generation, RL-tuned SD3.5-M generates nearly perfect object counts, spatial relations, and fine-grained attributes, increasing GenEval accuracy from 63% to 95%. In visual text rendering, accuracy improves from 59% to 92%, greatly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, very little reward hacking occurred, meaning rewards did not increase at the cost of appreciable image quality or diversity degradation.

NeurIPS Conference 2025 Conference Paper

FuncGenFoil: Airfoil Generation and Editing Model in Function Space

  • Jinouwen Zhang
  • Junjie Ren
  • Ma Qianhong
  • Jianyu Wu
  • Aobo Yang
  • Yan Lu
  • Lu Chen
  • Hairun Xie

Aircraft manufacturing is the jewel in the crown of industry, in which generating high-fidelity airfoil geometries with controllable and editable representations remains a fundamental challenge. Existing deep learning methods, which typically rely on predefined parametric representations (e.g., Bézier curves) or discrete point sets, face an inherent trade-off between expressive power and resolution adaptability. To tackle this challenge, we introduce FuncGenFoil, a novel function-space generative model that directly reconstructs airfoil geometries as function curves. Our method inherits the advantages of arbitrary-resolution sampling and smoothness from parametric functions, as well as the strong expressiveness of discrete point-based representations. Empirical evaluations demonstrate that FuncGenFoil improves upon state-of-the-art methods in airfoil generation, achieving a relative 74.4% reduction in label error and a 23.2% increase in diversity on the AF-200K dataset. Our results highlight the advantages of function-space modeling for aerodynamic shape optimization, offering a powerful and flexible framework for high-fidelity airfoil design.

AAAI Conference 2025 Conference Paper

GigaGS: 3D Gaussian Based Planar Representation for Large-Scene Surface Reconstruction

  • Junyi Chen
  • Weicai Ye
  • Yifan Wang
  • Danpeng Chen
  • Di Huang
  • Wanli Ouyang
  • Guofeng Zhang
  • Yu Qiao

3D Gaussian Splatting (3DGS) has shown promising performance in novel view synthesis. Previous methods adapt it to obtain surfaces of either individual 3D objects or within limited scenes. In this paper, we make the first attempt to tackle the challenging task of large-scale scene surface reconstruction. This task is particularly difficult due to the high GPU memory consumption, different levels of detail for geometric representation, and noticeable inconsistencies in appearance. To this end, we propose GigaGS, the first work for high-quality surface reconstruction for large-scale scenes using 3DGS. GigaGS first applies a partitioning strategy based on the mutual visibility of spatial regions, effectively grouping cameras for parallel processing. To enhance the quality of the surface, we also propose novel multi-view photometric and geometric consistency constraints based on Level-of-Detail representation. In doing so, our method can reconstruct detailed surface structures. Comprehensive experiments are conducted on various datasets. The consistent improvement demonstrates the superiority of GigaGS.

ICLR Conference 2025 Conference Paper

HiSplat: Hierarchical 3D Gaussian Splatting for Generalizable Sparse-View Reconstruction

  • Shengji Tang
  • Weicai Ye
  • Peng Ye 0006
  • Weihao Lin 0002
  • Yang Zhou
  • Tao Chen 0003
  • Wanli Ouyang

Reconstructing 3D scenes from multiple viewpoints is a fundamental task in stereo vision. Recently, advances in generalizable 3D Gaussian Splatting have enabled high-quality novel view synthesis for unseen scenes from sparse input views by feed-forward predicting per-pixel Gaussian parameters without extra optimization. However, existing methods typically generate single-scale 3D Gaussians, which lack representation of both large-scale structure and texture details, resulting in mislocation and artefacts. In this paper, we propose a novel framework, HiSplat, which introduces a hierarchical manner in generalizable 3D Gaussian Splatting to construct hierarchical 3D Gaussians via a coarse-to-fine strategy. Specifically, HiSplat generates large coarse-grained Gaussians to capture large-scale structures, followed by fine-grained Gaussians to enhance delicate texture details. To promote inter-scale interactions, we propose an Error Aware Module for Gaussian compensation and a Modulating Fusion Module for Gaussian repair. Our method achieves joint optimization of hierarchical representations, allowing for novel view synthesis using only two-view reference images. Comprehensive experiments on various datasets demonstrate that HiSplat significantly enhances reconstruction quality and cross-dataset generalization compared to prior single-scale methods. The corresponding ablation study and analysis of different-scale 3D Gaussians reveal the mechanism behind the effectiveness. Code is at https://github.com/Open3DVLab/HiSplat.

IJCAI Conference 2025 Conference Paper

Human-Centric Foundation Models: Perception, Generation and Agentic Modeling

  • Shixiang Tang
  • Yizhou Wang
  • Lu Chen
  • Yuan Wang
  • Sida Peng
  • Dan Xu
  • Wanli Ouyang

Human understanding and generation are critical for modeling digital humans and humanoid embodiments. Recently, Human-centric Foundation Models (HcFMs)—inspired by the success of generalist models such as large language and vision models—have emerged to unify diverse human-centric tasks into a single framework, surpassing traditional task-specific approaches. In this survey, we present a comprehensive overview of HcFMs by proposing a taxonomy that categorizes current approaches into four groups: (1) Human-centric Perception Foundation Models that capture fine-grained features for multi-modal 2D and 3D understanding; (2) Human-centric AIGC Foundation Models that generate high-fidelity, diverse human-related content; (3) Unified Perception and Generation Models that integrate these capabilities to enhance both human understanding and synthesis; and (4) Human-centric Agentic Foundation Models that extend beyond perception and generation to learn human-like intelligence and interactive behaviors for humanoid embodied tasks. We review state-of-the-art techniques and discuss emerging challenges and future research directions. This survey aims to serve as a roadmap for researchers and practitioners working towards more robust, versatile, and intelligent digital human and embodiment modeling. Website: https://github.com/HumanCentricModels/Awesome-Human-Centric-Foundation-Models/

NeurIPS Conference 2025 Conference Paper

Improving Video Generation with Human Feedback

  • Jie Liu
  • Gongye Liu
  • Jiajun Liang
  • Ziyang Yuan
  • Xiaokun Liu
  • Mingwu Zheng
  • Xiele Wu
  • Qiulin Wang

Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multiple dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs.

NeurIPS Conference 2025 Conference Paper

LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents

  • Rui Li
  • Zixuan Hu
  • Wenxi Qu
  • Jinouwen Zhang
  • Zhenfei Yin
  • Sha Zhang
  • Xuantuo Huang
  • Hanqing Wang

Scientific embodied agents play a crucial role in modern laboratories by automating complex experimental workflows. Compared to typical household environments, laboratory settings impose significantly higher demands on perception of physical-chemical transformations and long-horizon planning, making them an ideal testbed for advancing embodied intelligence. However, its development has long been hampered by the lack of suitable simulators and benchmarks. In this paper, we address this gap by introducing LabUtopia, a comprehensive simulation and benchmarking suite designed to facilitate the development of generalizable, reasoning-capable embodied agents in laboratory settings. Specifically, it integrates i) LabSim, a high-fidelity simulator supporting multi-physics and chemically meaningful interactions; ii) LabScene, a scalable procedural generator for diverse scientific scenes; and iii) LabBench, a hierarchical benchmark spanning five levels of complexity from atomic actions to long-horizon mobile manipulation. LabUtopia supports 30 distinct tasks and includes more than 200 scene and instrument assets, enabling large-scale training and principled evaluation in high-complexity environments. We demonstrate that LabUtopia offers a powerful platform for advancing the integration of perception, planning, and control in scientific-purpose agents and provides a rigorous testbed for exploring the practical capabilities and generalization limits of embodied intelligence in future research. Project web page: https://rui-li023.github.io/labutopia-site/

ICRA Conference 2025 Conference Paper

Learning to Predict the Future from Monocular Vision for Efficient Human-Aware Navigation

  • Yushuang Huang
  • Hao Jiang
  • Zihan Liu
  • Wanli Ouyang
  • Zhaoqi Wang

Human-aware navigation (HAN) aims to build autonomous agents that robustly and naturally navigate in human-centered environments. Due to the complex and dynamic nature of this task, existing approaches typically rely on sophisticated pipelines that separately process perception and decision-making to solve it. In this work, we propose an Obstruction Distance Vector based End-to-End Model (ODVEEM), using monocular vision for navigation around humans. The Obstruction Distance Vector (ODV) is an intermediate representation in our model, leveraged to describe the Obstruction Distance to the first future collision in all possible directions in the horizontal field of view. As ODV cannot be calculated directly in the real world, we design a neural network for ODV estimation, formulating it as a classification problem with auxiliary proxy tasks, which play a key role in effectively predicting the implicit future motion of nearby humans. Taking advantage of ODV, ODVEEM supervised by human behavioral heuristics is employed to guide the agent to reach a goal efficiently and avoid potential collisions. Several challenging experiments show our method's substantial improvement over a number of baseline methods, attaining solid performance with zero-shot transfer to unseen simulated and real-world environments.

ICML Conference 2025 Conference Paper

MindAligner: Explicit Brain Functional Alignment for Cross-Subject Visual Decoding from Limited fMRI Data

  • Yuqin Dai
  • Zhouheng Yao
  • Chunfeng Song
  • Qihao Zheng
  • Weijian Mai
  • Kunyu Peng
  • Shuai Lu
  • Wanli Ouyang

Brain decoding aims to reconstruct the visual perception of a human subject from fMRI signals, which is crucial for understanding the brain's perception mechanisms. Existing methods are confined to the single-subject paradigm due to substantial brain variability, which leads to weak generalization across individuals and incurs high training costs, exacerbated by the limited availability of fMRI data. To address these challenges, we propose MindAligner, an explicit functional alignment framework for cross-subject brain decoding from limited fMRI data. The proposed MindAligner enjoys several merits. First, we learn a Brain Transfer Matrix (BTM) that projects the brain signals of an arbitrary new subject to one of the known subjects, enabling seamless use of pre-trained decoding models. Second, to facilitate reliable BTM learning, a Brain Functional Alignment module is proposed to perform soft cross-subject brain alignment under different visual stimuli with a multi-level brain alignment loss, uncovering fine-grained functional correspondences with high interpretability. Experiments indicate that MindAligner not only outperforms existing methods in visual decoding under data-limited conditions, but also provides valuable neuroscience insights in cross-subject functional analysis. The code will be made publicly available.

NeurIPS Conference 2025 Conference Paper

MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search

  • Zonglin Yang
  • Wanhao Liu
  • Ben Gao
  • Yujie Liu
  • Wei Li
  • Tong Xie
  • Lidong Bing
  • Wanli Ouyang

Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the new task of fine-grained scientific hypothesis discovery, which entails generating detailed, experimentally actionable hypotheses from coarse initial research directions. We frame this as a combinatorial optimization problem and investigate the upper limits of LLMs' capacity to solve it when maximally leveraged. Specifically, we explore four foundational questions: (1) how to best harness an LLM's internal heuristics to formulate the fine-grained hypothesis it itself would judge as the most promising among all the possible hypotheses it might generate, based on its own internal scoring, thus defining a latent reward landscape over the hypothesis space; (2) whether such LLM-judged better hypotheses exhibit stronger alignment with ground-truth hypotheses; (3) whether shaping the reward landscape using an ensemble of diverse LLMs of similar capacity yields better outcomes than defining it with repeated instances of the strongest LLM among them; and (4) whether an ensemble of identical LLMs provides a more reliable reward landscape than a single LLM. To address these questions, we propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis, progressing from general concepts to specific experimental configurations. We show that this hierarchical process smooths the reward landscape and enables more effective optimization. Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent literature show that our method consistently outperforms strong baselines.

ICLR Conference 2025 Conference Paper

MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses

  • Zonglin Yang 0001
  • Wanhao Liu
  • Ben Gao
  • Tong Xie
  • Yuqiang Li
  • Wanli Ouyang
  • Soujanya Poria
  • Erik Cambria

Scientific discovery contributes greatly to the prosperity of human society, and recent progress shows that LLMs could potentially catalyze the process. However, it is still unclear whether LLMs can discover novel and valid hypotheses in chemistry. In this work, we investigate the main research question: can LLMs automatically discover novel and valid chemistry research hypotheses given only a research question? Through extensive discussions with chemistry experts, we adopt the assumption that a majority of chemistry hypotheses can result from a research background question and several inspirations. With this key insight, we break the main question into three smaller fundamental questions. In brief, they are: (1) given a background question, whether LLMs can retrieve good inspirations; (2) with background and inspirations, whether LLMs can derive a hypothesis; and (3) whether LLMs can identify good hypotheses and rank them higher. To investigate these questions, we construct a benchmark consisting of 51 chemistry papers published in Nature or venues of a similar level in 2024 (all papers are only available online since 2024). Every paper is divided by chemistry PhD students into three components: background, inspirations, and hypothesis. The goal is to rediscover the hypothesis given only the background and a large chemistry literature corpus containing the ground-truth inspiration papers, using LLMs trained with data up to 2023. We also develop an LLM-based multi-agent framework that leverages the assumption, consisting of three stages reflecting the three smaller questions. The proposed method can rediscover many hypotheses with very high similarity to the ground-truth ones, covering the main innovations.

AAAI Conference 2025 Conference Paper

Multi-Modal Latent Variables for Cross-Individual Primary Visual Cortex Modeling and Analysis

  • Yu Zhu
  • Bo Lei
  • Chunfeng Song
  • Wanli Ouyang
  • Shan Yu
  • Tiejun Huang

Elucidating the functional mechanisms of the primary visual cortex (V1) remains a fundamental challenge in systems neuroscience. Current computational models face two critical limitations, namely the challenge of cross-modal integration between partial neural recordings and complex visual stimuli, and the inherent variability in neural characteristics across individuals, including differences in neuron populations and firing patterns. To address these challenges, we present a multi-modal identifiable variational autoencoder (miVAE) that employs a two-level disentanglement strategy to map neural activity and visual stimuli into a unified latent space. This framework enables robust identification of cross-modal correlations through refined latent space modeling. We complement this with a novel score-based attribution analysis that traces latent variables back to their origins in the source data space. Evaluation on a large-scale mouse V1 dataset demonstrates that our method achieves state-of-the-art performance in cross-individual latent representation and alignment, without requiring subject-specific fine-tuning, and exhibits improved performance with increasing data size. Significantly, our attribution algorithm successfully identifies distinct neuronal subpopulations characterized by unique temporal patterns and stimulus discrimination properties, while simultaneously revealing stimulus regions that show specific sensitivity to edge features and luminance variations. This scalable framework offers promising applications not only for advancing V1 research but also for broader investigations in neuroscience.

NeurIPS Conference 2025 Conference Paper

Native-Resolution Image Synthesis

  • Zidong Wang
  • Lei Bai
  • Xiangyu Yue
  • Wanli Ouyang
  • Yiyuan Zhang

We introduce native-resolution image synthesis, a novel paradigm in generative modeling capable of synthesizing images at arbitrary resolutions and aspect ratios. This approach overcomes the limitations of standard fixed-resolution, square-image methods by inherently handling variable-length visual tokens, a core challenge for conventional techniques. To this end, we propose the Native-resolution diffusion Transformer (NiT), an architecture that explicitly models varying resolutions and aspect ratios within its denoising process. Unconstrained by fixed formats, NiT learns intrinsic visual distributions from images encompassing a wide range of resolutions and aspect ratios. Notably, a single NiT model simultaneously achieves state-of-the-art performance on both the ImageNet-256x256 and 512x512 benchmarks. Surprisingly, akin to the robust zero-shot capabilities seen in advanced Large Language Models, NiT, pretrained solely on ImageNet, demonstrates excellent zero-shot generalization performance. It successfully generates high-fidelity images at previously unseen high resolutions (e.g., 1024x1024, 1536x1536) and diverse aspect ratios (e.g., 16:9, 3:1, 4:3), as shown in Figure 1. These findings indicate the significant potential of native-resolution modeling as a bridge between visual generative modeling and advanced LLM methodologies.

ICML Conference 2025 Conference Paper

Neural Representational Consistency Emerges from Probabilistic Neural-Behavioral Representation Alignment

  • Yu Zhu
  • Chunfeng Song
  • Wanli Ouyang
  • Shan Yu
  • Tiejun Huang 0003

Individual brains exhibit striking structural and physiological heterogeneity, yet neural circuits can generate remarkably consistent functional properties across individuals, an apparent paradox in neuroscience. While recent studies have observed preserved neural representations in motor cortex through manual alignment across subjects, the zero-shot validation of such preservation and its generalization to more cortices remain unexplored. Here we present PNBA (Probabilistic Neural-Behavioral Representation Alignment), a new framework that leverages probabilistic modeling to address hierarchical variability across trials, sessions, and subjects, with generative constraints preventing representation degeneration. By establishing reliable cross-modal representational alignment, PNBA reveals robust preserved neural representations in monkey primary motor cortex (M1) and dorsal premotor cortex (PMd) through zero-shot validation. We further establish similar representational preservation in mouse primary visual cortex (V1), reflecting a general neural basis. These findings resolve the paradox of neural heterogeneity by establishing zero-shot preserved neural representations across cortices and species, enriching neural coding insights and enabling zero-shot behavior decoding.

ICLR Conference 2025 Conference Paper

PostCast: Generalizable Postprocessing for Precipitation Nowcasting via Unsupervised Blurriness Modeling

  • Junchao Gong
  • Siwei Tu
  • Weidong Yang 0001
  • Ben Fei
  • Kun Chen 0004
  • Wenlong Zhang
  • Xiaokang Yang 0001
  • Wanli Ouyang

Precipitation nowcasting plays a pivotal role in socioeconomic sectors, especially in severe convective weather warnings. Although notable progress has been achieved by approaches mining the spatiotemporal correlations with deep learning, these methods still suffer from severe blurriness as the lead time increases, which hampers accurate predictions for extreme precipitation. To alleviate blurriness, researchers have explored generative methods conditioned on blurry predictions. However, pairs of blurry predictions and corresponding ground truth need to be given in advance, making the training pipeline cumbersome and limiting the generality of generative models to the blurriness modes that appear in the training data. By rethinking the blurriness in precipitation nowcasting as a blur kernel acting on predictions, we propose an unsupervised postprocessing method that eliminates blurriness without requiring training on pairs of blurry predictions and corresponding ground truth. Specifically, we utilize blurry predictions to guide the generation process of a pre-trained unconditional denoising diffusion probabilistic model (DDPM) to obtain high-fidelity predictions with eliminated blurriness. A zero-shot blur kernel estimation mechanism and an auto-scale denoise guidance strategy are introduced to adapt the unconditional DDPM to any blurriness mode varying across datasets and lead times in precipitation nowcasting. Extensive experiments are conducted on 7 precipitation radar datasets, demonstrating the generality and superiority of our method.

NeurIPS Conference 2025 Conference Paper

PRING: Rethinking Protein-Protein Interaction Prediction from Pairs to Graphs

  • Xinzhe Zheng
  • Hao Du
  • Fanding Xu
  • Jinzhe Li
  • Zhiyuan Liu
  • Wenkang Wang
  • Tao Chen
  • Wanli Ouyang

Deep learning-based computational methods have achieved promising results in predicting protein-protein interactions (PPIs). However, existing benchmarks predominantly focus on isolated pairwise evaluations, overlooking a model's capability to reconstruct biologically meaningful PPI networks, which is crucial for biology research. To address this gap, we introduce PRING, the first comprehensive benchmark that evaluates PRotein-protein INteraction prediction from a Graph-level perspective. PRING curates a high-quality, multi-species PPI network dataset comprising 21,484 proteins and 186,818 interactions, with well-designed strategies to address both data redundancy and leakage. Building on this gold-standard dataset, we establish two complementary evaluation paradigms: (1) topology-oriented tasks, which assess intra- and cross-species PPI network construction, and (2) function-oriented tasks, including protein complex pathway prediction, GO module analysis, and essential protein justification. These evaluations not only reflect a model's capability to understand network topology but also facilitate protein function annotation, biological module detection, and even disease mechanism analysis. Extensive experiments on four representative model categories, consisting of sequence similarity-based, naive sequence-based, protein language model-based, and structure-based approaches, demonstrate that current PPI models have potential limitations in recovering both the structural and functional properties of PPI networks, highlighting the gap in supporting real-world biological applications. We believe PRING provides a reliable platform to guide the development of more effective PPI prediction models for the community. The dataset and source code of PRING are available at https://github.com/SophieSarceau/PRING.

NeurIPS Conference 2025 Conference Paper

Retrieval is Not Enough: Enhancing RAG through Test-Time Critique and Optimization

  • Jiaqi Wei
  • Hao Zhou
  • Xiang Zhang
  • Di Zhang
  • Zijie Qiu
  • Noah Wei
  • Jinzhe Li
  • Wanli Ouyang

Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs). However, standard RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions. In this work, we reinterpret RAG as Retrieval-Augmented Reasoning and identify a central but underexplored problem: Reasoning Misalignment, the divergence between an LLM's internal reasoning trajectory and the evidential constraints provided by retrieval. To address this issue, we propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA). We further introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations. At the heart of AlignRAG lies a contrastive critique synthesis mechanism that generates retrieval-sensitive critiques while mitigating self-bias. This mechanism trains a dedicated retrieval-augmented Critic Language Model (CLM) using labeled critiques that distinguish between evidence-aligned and misaligned reasoning. Empirical evaluations show that our approach significantly improves reasoning fidelity. Our 8B-parameter CLM improves performance over the Self-Refine baseline by 12.1% on out-of-domain tasks and outperforms a standard 72B-parameter CLM by 2.2%. Furthermore, AlignRAG-auto achieves this state-of-the-art performance while dynamically determining the optimal number of refinement steps, enhancing efficiency and usability. AlignRAG remains compatible with existing RAG architectures as a plug-and-play module and demonstrates strong robustness under both informative and noisy retrieval scenarios. Overall, AlignRAG offers a principled solution for aligning model reasoning with retrieved evidence, substantially improving the factual reliability and robustness of RAG systems. Our source code is provided at https://github.com/upup-wei/RAG-ReasonAlignment.

IROS Conference 2025 Conference Paper

RH20T-P: A Primitive-Level Robotic Manipulation Dataset towards Composable Generalization Agents in Real-world Scenarios

  • Zeren Chen
  • Zhelun Shi
  • Xiaoya Lu
  • Lehan He
  • Sucheng Qian
  • Enshen Zhou
  • Zhenfei Yin
  • Wanli Ouyang

Achieving generalizability in solving out-of-distribution tasks is one of the ultimate goals of learning robotic manipulation. Recent progress in Vision-Language Models (VLMs) has shown that VLM-based task planners can alleviate the difficulty of solving novel tasks by decomposing compounded tasks into a plan of sequentially executed primitive-level skills that have already been mastered. It is also promising for robotic manipulation to adopt such composable generalization ability, in the form of composable generalization agents (CGAs). However, the community lacks a reliable design of primitive skills and a sufficient amount of primitive-level data annotations. Therefore, we propose RH20T-P, a primitive-level robotic manipulation dataset, which contains about 38k video clips covering 67 diverse manipulation tasks in real-world scenarios. Each clip is manually annotated according to a set of meticulously designed primitive skills that are common in robotic manipulation. Furthermore, we standardize a plan-execute CGA paradigm and implement an exemplar baseline called RA-P on RH20T-P, whose positive performance on unseen tasks validates that the proposed dataset can offer composable generalization ability to robotic manipulation agents. Project homepage: https://sites.google.com/view/rh20t-primitive/main.

NeurIPS Conference 2025 Conference Paper

scMRDR: A scalable and flexible framework for unpaired single-cell multi-omics data integration

  • Jianle Sun
  • Chaoqi Liang
  • Ran Wei
  • Peng Zheng
  • Lei Bai
  • Wanli Ouyang
  • Hongliang Yan
  • Peng Ye

Advances in single-cell sequencing have enabled high-resolution profiling of diverse molecular modalities, while integrating unpaired multi-omics single-cell data remains challenging. Existing approaches either rely on paired information or prior correspondences, or require computing a global pairwise coupling matrix, limiting their scalability and flexibility. In this paper, we introduce a scalable and flexible generative framework called single-cell Multi-omics Regularized Disentangled Representations (scMRDR) for unpaired multi-omics integration. Specifically, we disentangle each cell's latent representations into modality-shared and modality-specific components using a well-designed β-VAE architecture, augmented with isometric regularization to preserve intra-omics biological heterogeneity, an adversarial objective to encourage cross-modal alignment, and a masked reconstruction loss to address missing features across modalities. Our method achieves excellent performance on benchmark datasets in terms of batch correction, modality alignment, and biological signal preservation. Crucially, it scales effectively to large-scale datasets and supports the integration of more than two omics, offering a powerful and flexible solution for large-scale multi-omics data integration and downstream biological discovery.

NeurIPS Conference 2025 Conference Paper

ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

  • Hongbo Liu
  • Jingwen He
  • Yi Jin
  • Dian Zheng
  • Yuhao Dong
  • Fan Zhang
  • Ziqi Huang
  • Yinan He

Recent Vision-Language Models (VLMs) have shown strong performance in general-purpose visual understanding and reasoning, but their ability to comprehend the visual grammar of movie shots remains underexplored and insufficiently evaluated. To bridge this gap, we present ShotBench, a dedicated benchmark for assessing VLMs' understanding of cinematic language. ShotBench includes 3,049 still images and 500 video clips drawn from more than 200 films, with each sample annotated by trained annotators or curated from professional cinematography resources, resulting in 3,608 high-quality question-answer pairs. We conduct a comprehensive evaluation of over 20 state-of-the-art VLMs across eight core cinematography dimensions. Our analysis reveals clear limitations in the fine-grained perception and cinematic reasoning of current VLMs. To improve VLMs' capability in cinematography understanding, we construct a large-scale multimodal dataset, named ShotQA, which contains about 70k question-answer pairs derived from movie shots. We further propose ShotVL, a VLM trained with a two-stage strategy integrating both supervised fine-tuning and Group Relative Policy Optimization (GRPO). Experimental results demonstrate that our model achieves substantial improvements, surpassing all existing open-source and proprietary models evaluated on ShotBench and establishing a new state-of-the-art performance.

NeurIPS Conference 2025 Conference Paper

STAR: A Benchmark for Astronomical Star Fields Super-Resolution

  • WU KUO-CHENG
  • Guohang Zhuang
  • Jinyang Huang
  • Xiang Zhang
  • Wanli Ouyang
  • Yan Lu

Super-resolution (SR) advances astronomical imaging by enabling cost-effective high-resolution capture, crucial for detecting faraway celestial objects and precise structural analysis. However, existing datasets for astronomical SR (ASR) exhibit three critical limitations: flux inconsistency, an object-crop setting, and insufficient data diversity, significantly impeding ASR development. We propose STAR, a large-scale astronomical SR dataset containing 54,738 flux-consistent star field image pairs covering wide celestial regions. These pairs combine Hubble Space Telescope high-resolution observations with physically faithful low-resolution counterparts generated through a flux-preserving data generation pipeline, enabling systematic development of field-level ASR models. To further empower the ASR community, STAR provides a novel Flux Error (FE) metric to evaluate SR models from a physical perspective. Leveraging this benchmark, we propose a Flux-Invariant Super Resolution (FISR) model that can accurately infer flux-consistent high-resolution images from input photometry, surpassing several state-of-the-art SR methods by 24.84% on a newly designed flux consistency metric and demonstrating the suitability of our method for astrophysics. Extensive experiments demonstrate the effectiveness of our proposed method and the value of our dataset. Code and models are available at https://github.com/GuoCheng12/STAR.

NeurIPS Conference 2025 Conference Paper

SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning

  • Weijian Mai
  • Jiamin Wu
  • Yu Zhu
  • Zhouheng Yao
  • Dongzhan Zhou
  • Andrew Luo
  • Qihao Zheng
  • Wanli Ouyang

Deciphering how visual stimuli are transformed into cortical responses is a fundamental challenge in computational neuroscience. This visual-to-neural mapping is inherently a one-to-many relationship, as identical visual inputs reliably evoke variable hemodynamic responses across trials, contexts, and subjects. However, existing deterministic methods struggle to simultaneously model this biological variability while capturing the underlying functional consistency that encodes stimulus information. To address these limitations, we propose SynBrain, a generative framework that simulates the transformation from visual semantics to neural responses in a probabilistic and biologically interpretable manner. SynBrain introduces two key components: (i) BrainVAE models neural representations as continuous probability distributions via probabilistic learning while maintaining functional consistency through visual semantic constraints; (ii) a Semantic-to-Neural Mapper acts as a semantic transmission pathway, projecting visual semantics into the neural response manifold to facilitate high-fidelity fMRI synthesis. Experimental results demonstrate that SynBrain surpasses state-of-the-art methods in subject-specific visual-to-fMRI encoding performance. Furthermore, SynBrain adapts efficiently to new subjects with few-shot data and synthesizes high-quality fMRI signals that are effective in improving data-limited fMRI-to-image decoding performance. Beyond that, SynBrain reveals functional consistency across trials and subjects, with synthesized signals capturing interpretable patterns shaped by biological neural variability. Our code is available at https://github.com/MichaelMaiii/SynBrain.

AAAI Conference 2025 Conference Paper

Towards Efficient and Intelligent Laser Weeding: Method and Dataset for Weed Stem Detection

  • Dingning Liu
  • Jinzhe Li
  • Haoyang Su
  • Bei Cui
  • Zhihui Wang
  • Qingbo Yuan
  • Wanli Ouyang
  • Nanqing Dong

Weed control is a critical challenge in modern agriculture, as weeds compete with crops for essential nutrient resources, significantly reducing crop yield and quality. Traditional weed control methods, including chemical and mechanical approaches, have real-life limitations such as environmental impact and low efficiency. An emerging yet effective approach is laser weeding, which uses a laser beam as the stem cutter. Although there have been studies that use deep learning in weed recognition, its application in intelligent laser weeding still requires a comprehensive understanding. Thus, this study serves as the first empirical study on weed recognition for laser weeding. To increase the efficiency of the laser beam cut and avoid damaging the crops of interest, the laser beam shall be aimed directly at the weed root. Yet, weed stem detection remains an under-explored problem. We integrate the detection of crops and weeds with the localization of weed stems into one end-to-end system. To train and validate the proposed system in a real-life scenario, we curate and construct a high-quality weed stem detection dataset with human annotations. The dataset consists of 7,161 high-resolution pictures collected in the field with annotations of 11,151 instances of weed. The dataset will be released upon acceptance. Experimental results show that, in contrast to seminal weed recognition systems, the proposed system can efficiently improve the weeding accuracy by 5.05% and reduce the energy cost by 32.3%.

NeurIPS Conference 2025 Conference Paper

Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation

  • Xiaoyu Yue
  • Zidong Wang
  • Yuqing Wang
  • Wenlong Zhang
  • Xihui Liu
  • Wanli Ouyang
  • Lei Bai
  • Luping Zhou

Recent studies have demonstrated the importance of high-quality visual representations in image generation and have highlighted the limitations of generative models in image understanding. As a generative paradigm originally designed for natural language, autoregressive models face similar challenges. In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency. We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR). Without relying on pre-trained representation models, ST-AR significantly enhances the image understanding ability of autoregressive models and leads to improved generation quality. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.

NeurIPS Conference 2025 Conference Paper

Venus-MAXWELL: Efficient Learning of Protein-Mutation Stability Landscapes using Protein Language Models

  • Yuanxi Yu
  • Fan Jiang
  • Xinzhu Ma
  • Liang Zhang
  • Bozitao Zhong
  • Wanli Ouyang
  • Guisheng Fan
  • Huiqun Yu

In-silico prediction of protein mutant stability, measured by the difference in Gibbs free energy change (ΔΔG), is fundamental for protein engineering. Current sequence-to-label methods typically employ two-stage pipelines: (i) encoding mutant sequences using neural networks (e.g., transformers), followed by (ii) ΔΔG regression from the latent representations. Although these methods have demonstrated promising performance, their dependence on specialized neural network encoders significantly increases complexity. Additionally, the requirement to compute latent representations individually for each mutant sequence reduces computational efficiency and poses a risk of overfitting. This work proposes the Venus-MAXWELL framework, which reformulates mutation ΔΔG prediction as a sequence-to-landscape task. In Venus-MAXWELL, the mutations of a protein and their corresponding ΔΔG values are organized into a landscape matrix, allowing our framework to learn the ΔΔG landscape of a protein with a single forward and backward pass during training. To this end, we curated a new ΔΔG benchmark dataset with strict controls on data leakage and redundancy to ensure robust evaluation. Leveraging the zero-shot scoring capability of protein language models (PLMs), Venus-MAXWELL effectively utilizes the evolutionary patterns learned by PLMs during pre-training. More importantly, Venus-MAXWELL is compatible with multiple protein language models. For example, when integrated with ESM-IF, Venus-MAXWELL achieves higher accuracy than ThermoMPNN with 10x faster inference speed (despite having 50x more parameters than ThermoMPNN). The training code, model weights, and datasets are publicly available at https://github.com/ai4protein/Venus-MAXWELL.

ICLR Conference 2025 Conference Paper

WeatherGFM: Learning a Weather Generalist Foundation Model via In-context Learning

  • Xiangyu Zhao
  • Zhiwang Zhou
  • Wenlong Zhang
  • Yihao Liu
  • Xiangyu Chen
  • Junchao Gong
  • Hao Chen 0045
  • Ben Fei

The Earth's weather system involves intricate weather data modalities and diverse weather understanding tasks, which hold significant value to human life. Existing data-driven models focus on single weather understanding tasks (e.g., weather forecasting). While these models have achieved promising results, they fail to tackle various complex tasks within a single and unified model. Moreover, the paradigm that relies on limited real observations for a single scenario hinders the model's performance upper bound. Inspired by the in-context learning paradigm of visual foundation models and large language models, in this paper we introduce the first weather generalist foundation model (WeatherGFM) to address weather understanding tasks in a unified manner. Specifically, we first unify the representation and definition of diverse weather understanding tasks. Subsequently, we design weather prompt formats to handle different weather data modalities, including single, multiple, and temporal modalities. Finally, we adopt a visual prompting question-answering paradigm for the training of unified weather understanding tasks. Extensive experiments indicate that our WeatherGFM can effectively handle up to 12 weather understanding tasks, including weather forecasting, super-resolution, weather image translation, and post-processing. Our method also showcases generalization ability on unseen tasks. The source code is available at https://github.com/xiangyu-mm/WeatherGFM.

ICLR Conference 2025 Conference Paper

Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction

  • Junyi Chen
  • Di Huang
  • Weicai Ye
  • Wanli Ouyang
  • Tong He 0001

Spatial intelligence is the ability of a machine to perceive, reason, and act in three dimensions within space and time. Recent advancements in large-scale auto-regressive models have demonstrated remarkable capabilities across various reasoning tasks. However, these models often struggle with fundamental aspects of spatial reasoning, particularly in answering questions like "Where am I?" and "What will I see?". While some attempts have been made, existing approaches typically treat them as separate tasks, failing to capture their interconnected nature. In this paper, we present the Generative Spatial Transformer (GST), a novel auto-regressive framework that jointly addresses spatial localization and view prediction. Our model simultaneously estimates the camera pose from a single image and predicts the view from a new camera pose, effectively bridging the gap between spatial awareness and visual prediction. The proposed innovative camera tokenization method enables the model to learn the joint distribution of 2D projections and their corresponding spatial perspectives in an auto-regressive manner. This unified training paradigm demonstrates that joint optimization of pose estimation and novel view synthesis leads to improved performance in both tasks, highlighting for the first time the inherent relationship between spatial awareness and visual prediction.

AAAI Conference 2024 Conference Paper

A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

  • Yinmin Zhang
  • Jie Liu
  • Chuming Li
  • Yazhe Niu
  • Yaodong Yang
  • Yu Liu
  • Wanli Ouyang

Offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of an offline pretrained policy using only a few online samples. Built on offline RL algorithms, most O2O methods focus on the balance between the RL objective and pessimism, or on the utilization of offline and online samples. In this paper, from a novel perspective, we systematically study the challenges that remain in O2O RL and identify that the slow performance improvement and the instability of online finetuning stem from the inaccurate Q-value estimation inherited from offline pretraining. Specifically, we demonstrate that the estimation bias and the inaccurate rank of Q-values provide a misleading signal for the policy update, making standard offline RL algorithms, such as CQL and TD3-BC, ineffective in online finetuning. Based on this observation, we address the problem of Q-value estimation with two techniques: (1) perturbed value updates and (2) an increased frequency of Q-value updates. The first technique smooths out biased Q-value estimates with sharp peaks, preventing early-stage policy exploitation of sub-optimal actions. The second alleviates the estimation bias inherited from offline pretraining by accelerating learning. Extensive experiments on the MuJoCo and Adroit environments demonstrate that the proposed method, named SO2, significantly alleviates Q-value estimation issues and consistently improves performance against state-of-the-art methods by up to 83.1%.

NeurIPS Conference 2024 Conference Paper

AFBench: A Large-scale Benchmark for Airfoil Design

  • Jian Liu
  • Jianyu Wu
  • Hairun Xie
  • Guoqing Zhang
  • Jing Wang
  • Wei Liu
  • Wanli Ouyang
  • Junjun Jiang

Data-driven generative models have emerged as promising approaches towards achieving efficient mechanical inverse design. However, due to the prohibitively high cost in time and money, there is still a lack of open-source, large-scale benchmarks in this field. This is especially the case for airfoil inverse design, which requires generating and editing diverse geometrically and aerodynamically qualified airfoils following multimodal instructions, i.e., dragging points and physical parameters. This paper presents an open-source endeavor in airfoil inverse design, AFBench, including a large-scale dataset with 200 thousand airfoils and high-quality aerodynamic and geometric labels, two novel and practical airfoil inverse design tasks, i.e., conditional generation on multimodal physical parameters and controllable editing, and comprehensive metrics to evaluate various existing airfoil inverse design methods. Our aim is to establish AFBench as an ecosystem for training and evaluating airfoil inverse design methods, with a specific focus on data-driven controllable inverse design models driven by multimodal instructions, capable of bridging the gap between ideas and execution, and between academic research and industrial applications. We provide baseline models, comprehensive experimental observations, and analysis to accelerate future research. Our baseline model is trained on an RTX 3090 GPU within 16 hours. The codebase, datasets and benchmarks will be available at https://hitcslj.github.io/afbench/.

IJCAI Conference 2024 Conference Paper

An Embarrassingly Simple Approach to Enhance Transformer Performance in Genomic Selection for Crop Breeding

  • Renqi Chen
  • Wenwei Han
  • Haohao Zhang
  • Haoyang Su
  • Zhefan Wang
  • Xiaolei Liu
  • Hao Jiang
  • Wanli Ouyang

Genomic selection (GS), as a critical crop breeding strategy, plays a key role in enhancing food production and addressing the global hunger crisis. The predominant approaches in GS currently revolve around employing statistical methods for prediction. However, statistical methods often come with two main limitations: strong statistical priors and linear assumptions. A recent trend is to capture the non-linear relationships between markers with deep learning. However, as crop datasets are commonly long sequences with limited samples, the robustness of deep learning models, especially Transformers, remains a challenge. In this work, to unleash the unexplored potential of the attention mechanism for this task, we propose a simple yet effective Transformer-based framework that enables end-to-end training on the whole sequence. Through experiments on the rice3k and wheat3k datasets, we show that, with simple tricks such as k-mer tokenization and random masking, Transformers can achieve overall superior performance against seminal methods on the GS tasks of interest.
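The "simple tricks" named in this abstract, k-mer tokenization and random masking, are generic sequence-preprocessing techniques rather than anything specific to this paper. A minimal illustrative sketch (not the authors' code; function names are hypothetical) might look like:

```python
import random

def kmer_tokenize(seq, k=3, stride=3):
    """Split a nucleotide sequence into k-mer tokens (non-overlapping when stride == k)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

def random_mask(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with a mask token, BERT-style."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < mask_rate else t for t in tokens]

tokens = kmer_tokenize("ACGTACGTACGT", k=3, stride=3)
# non-overlapping 3-mers: ['ACG', 'TAC', 'GTA', 'CGT']
masked = random_mask(tokens, mask_rate=0.5, seed=1)
```

Using stride 1 instead yields overlapping k-mers, trading a more redundant vocabulary for denser positional context.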

NeurIPS Conference 2024 Conference Paper

BEACON: Benchmark for Comprehensive RNA Tasks and Language Models

  • Yuchen Ren
  • Zhiyuan Chen
  • Lifeng Qiao
  • Hongtai Jing
  • Yuchen Cai
  • Sheng Xu
  • Peng Ye
  • Xinzhu Ma

RNA plays a pivotal role in translating genetic instructions into functional outcomes, underscoring its importance in biological processes and disease mechanisms. Despite the emergence of numerous deep learning approaches for RNA, particularly universal RNA language models, there remains a significant lack of standardized benchmarks to assess the effectiveness of these methods. In this study, we introduce the first comprehensive RNA benchmark, BEACON (BEnchmArk for COmprehensive RNA tasks and language models). First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications, enabling a comprehensive assessment of the performance of methods on various RNA understanding tasks. Second, we examine a range of models, including traditional approaches like CNNs, as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performance of these models. Third, we investigate the vital RNA language model components from the tokenizer and positional encoding aspects. Notably, our findings emphasize the superiority of single-nucleotide tokenization and the effectiveness of Attention with Linear Biases (ALiBi) over traditional positional encoding methods. Based on these insights, a simple yet strong baseline called BEACON-B is proposed, which can achieve outstanding performance with limited data and computational resources. The datasets and source code of our benchmark are available at https://github.com/terry-r123/RNABenchmark.
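ALiBi, which this benchmark finds effective, replaces learned positional embeddings with a head-specific penalty on attention logits that grows linearly with token distance. A minimal sketch of a symmetric (bidirectional) ALiBi bias, assuming a power-of-two head count, is:

```python
import numpy as np

def alibi_slopes(n_heads):
    # Standard ALiBi slopes for a power-of-two head count:
    # a geometric sequence starting at 2^(-8/n_heads).
    start = 2.0 ** (-8.0 / n_heads)
    return np.array([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads, seq_len):
    # Bias added to attention logits: -slope * |i - j| per head.
    dist = np.abs(np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None])
    return -alibi_slopes(n_heads)[:, None, None] * dist[None, :, :]

bias = alibi_bias(n_heads=8, seq_len=16)  # shape (8, 16, 16)
```

The original ALiBi formulation is causal (penalizing only earlier positions); the symmetric variant above is the natural choice for bidirectional encoders such as RNA language models.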

AAAI Conference 2024 Conference Paper

Boosting Residual Networks with Group Knowledge

  • Shengji Tang
  • Peng Ye
  • Baopu Li
  • Weihao Lin
  • Tao Chen
  • Tong He
  • Chong Yu
  • Wanli Ouyang

Recent research views residual networks from the new perspective of an implicit ensemble model. From this view, previous methods such as stochastic depth and stimulative training have further improved the performance of residual networks by sampling and training their subnets. However, they both use the same supervision for all subnets of different capacities and neglect the valuable knowledge generated by subnets during training. In this manuscript, we mitigate the significant knowledge distillation gap caused by using the same kind of supervision and advocate leveraging the subnets to provide diverse knowledge. Based on this motivation, we propose a group-knowledge-based training framework for boosting the performance of residual networks. Specifically, we implicitly divide all subnets into hierarchical groups by subnet-in-subnet sampling, aggregate the knowledge of different subnets in each group during training, and exploit upper-level group knowledge to supervise the lower-level subnet groups. Meanwhile, we also develop a subnet sampling strategy that naturally samples larger subnets, which are found to be more helpful than smaller subnets in boosting the performance of hierarchical groups. Compared with typical subnet training and other methods, our method achieves the best efficiency and performance trade-offs on multiple datasets and network structures. The code is at https://github.com/tsj-001/AAAI24-GKT.

ICML Conference 2024 Conference Paper

CasCast: Skillful High-resolution Precipitation Nowcasting via Cascaded Modelling

  • Junchao Gong
  • Lei Bai 0001
  • Peng Ye 0006
  • Wanghan Xu
  • Na Liu
  • Jianhua Dai
  • Xiaokang Yang 0001
  • Wanli Ouyang

Precipitation nowcasting based on radar data plays a crucial role in extreme weather prediction and has broad implications for disaster management. Despite the progress made with deep learning, two key challenges of precipitation nowcasting remain unsolved: (i) modeling the evolution of complex precipitation systems at different scales, and (ii) accurate forecasts for extreme precipitation. In this work, we propose CasCast, a cascaded framework composed of a deterministic and a probabilistic part to decouple the predictions of mesoscale precipitation distributions and small-scale patterns. We then train the cascaded framework at high resolution and conduct the probabilistic modeling in a low-dimensional latent space with a frame-wise-guided diffusion transformer, enhancing the optimization of extreme events while reducing computational costs. Extensive experiments on three benchmark radar precipitation datasets show that CasCast achieves competitive performance. In particular, CasCast significantly surpasses the baseline (by up to +91.8%) for regional extreme-precipitation nowcasting.

AAAI Conference 2024 Conference Paper

ContraNovo: A Contrastive Learning Approach to Enhance De Novo Peptide Sequencing

  • Zhi Jin
  • Sheng Xu
  • Xiang Zhang
  • Tianze Ling
  • Nanqing Dong
  • Wanli Ouyang
  • Zhiqiang Gao
  • Cheng Chang

De novo peptide sequencing from mass spectrometry (MS) data is a critical task in proteomics research. Traditional de novo algorithms have encountered a bottleneck in accuracy due to the inherent complexity of proteomics data. While deep learning-based methods have shown progress, they reduce the problem to a translation task, potentially overlooking critical nuances between spectra and peptides. In our research, we present ContraNovo, a pioneering algorithm that leverages contrastive learning to extract the relationship between spectra and peptides and incorporates the mass information into peptide decoding, aiming to address these intricacies more efficiently. Through rigorous evaluations on two benchmark datasets, ContraNovo consistently outshines contemporary state-of-the-art solutions, underscoring its promising potential in enhancing de novo peptide sequencing.

NeurIPS Conference 2024 Conference Paper

Dense Connector for MLLMs

  • Huanjin Yao
  • Wenhao Wu
  • Taojiannan Yang
  • YuXin Song
  • Mengxi Zhang
  • Haocheng Feng
  • Yifan Sun
  • Zhiheng Li

Do we fully leverage the potential of visual encoder in Multimodal Large Language Models (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side. We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs. Yet, scant attention has been directed towards the visual signals utilized by MLLMs, often assumed to be the final high-level features extracted by a frozen visual encoder. In this paper, we introduce the Dense Connector - a simple, effective, and plug-and-play vision-language connector that significantly enhances existing MLLMs by leveraging multi-layer visual features, with minimal additional computational overhead. Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens. Furthermore, our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well. Experimental results across various vision encoders, image resolutions, training dataset scales, varying sizes of LLMs (2.7B→70B), and diverse architectures of MLLMs (e.g., LLaVA-v1.5, LLaVA-NeXT and Mini-Gemini) validate the versatility and scalability of our approach, achieving state-of-the-art performance across 19 image and video benchmarks. We hope that this work will provide valuable experience and serve as a basic module for future MLLM development. Code is available at https://github.com/HJYao00/DenseConnector.

NeurIPS Conference 2024 Conference Paper

DiffPano: Scalable and Consistent Text to Panorama Generation with Spherical Epipolar-Aware Diffusion

  • Weicai Ye
  • Chenhao Ji
  • Zheng Chen
  • Junyao Gao
  • Xiaoshui Huang
  • Song-Hai Zhang
  • Wanli Ouyang
  • Tong He

Diffusion-based methods have achieved remarkable results in 2D image and 3D object generation; however, the generation of 3D scenes and even 360° images remains constrained, due to the limited number of scene datasets, the complexity of 3D scenes themselves, and the difficulty of generating consistent multi-view images. To address these issues, we first establish a large-scale panoramic video-text dataset containing millions of consecutive panoramic keyframes with corresponding panoramic depths, camera poses, and text descriptions. We then propose a novel text-driven panoramic generation framework, termed DiffPano, to achieve scalable, consistent, and diverse panoramic scene generation. Specifically, benefiting from the powerful generative capabilities of stable diffusion, we fine-tune a single-view text-to-panorama diffusion model with LoRA on the established panoramic video-text dataset. We further design a spherical epipolar-aware multi-view diffusion model to ensure the multi-view consistency of the generated panoramic images. Extensive experiments demonstrate that DiffPano can generate scalable, consistent, and diverse panoramic images for unseen text descriptions and camera poses.

NeurIPS Conference 2024 Conference Paper

Empowering and Assessing the Utility of Large Language Models in Crop Science

  • Hang Zhang
  • Jiawei Sun
  • Renqi Chen
  • Wei Liu
  • Zhonghang Yuan
  • Xinzhe Zheng
  • Zhefan Wang
  • Zhiyuan Yang

Large language models (LLMs) have demonstrated remarkable efficacy across knowledge-intensive tasks. Nevertheless, their untapped potential in crop science presents an opportunity for advancement. To narrow this gap, we introduce CROP, which includes a novel instruction tuning dataset specifically designed to enhance LLMs’ professional capabilities in the crop science sector, along with a benchmark that serves as a comprehensive evaluation of LLMs’ understanding of the domain knowledge. The CROP dataset is curated through a task-oriented and LLM-human integrated pipeline, comprising 210,038 single-turn and 1,871 multi-turn dialogues related to crop science scenarios. The CROP benchmark includes 5,045 multiple-choice questions covering three difficulty levels. Our experiments based on the CROP benchmark demonstrate notable enhancements in crop science-related tasks when LLMs are fine-tuned with the CROP dataset. To the best of our knowledge, the CROP dataset is the first instruction tuning dataset in the crop science domain. We anticipate that CROP will accelerate the adoption of LLMs in the domain of crop science, ultimately contributing to global food production.

NeurIPS Conference 2024 Conference Paper

EMR-Merging: Tuning-Free High-Performance Model Merging

  • Chenyu Huang
  • Peng Ye
  • Tao Chen
  • Tong He
  • Xiangyu Yue
  • Wanli Ouyang

The success of the pretrain-finetune paradigm has brought about the release of numerous model weights. In this context, merging models finetuned on different tasks to obtain a single model with multi-task capabilities is gaining increasing attention for its practicality. Existing model merging methods usually suffer from (1) significant performance degradation or (2) requiring tuning with additional data or training. In this paper, we rethink and analyze the existing model merging paradigm. We discover that a single model's weights can hardly simulate all the models' performance. To tackle this issue, we propose Elect, Mask & Rescale-Merging (EMR-Merging). We first (a) elect a unified model from all the model weights and then (b) generate extremely lightweight task-specific modulators, including masks and rescalers, to align the direction and magnitude between the unified model and each specific model, respectively. EMR-Merging is tuning-free, requiring no data or additional training, while showing impressive performance. We find that EMR-Merging outperforms existing merging methods under different classical and newly-established settings, including merging different numbers of vision models (up to 30), NLP models, PEFT models, and multi-modal models.
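The elect/mask/rescale steps can be sketched on flattened task vectors (per-task deltas from the shared pretrained weights). This is an illustrative reading of the abstract, not the authors' implementation; the election rule chosen here (sign by elementwise sum, magnitude by the largest sign-agreeing entry) is one plausible instantiation:

```python
import numpy as np

def emr_merge(task_vectors):
    """Sketch of elect-mask-rescale merging over flattened task vectors."""
    T = np.stack(task_vectors)          # (n_tasks, n_params)
    sign = np.sign(T.sum(axis=0))       # elect a unified sign per parameter
    agree = (np.sign(T) == sign)        # which entries agree with the elected sign
    mags = np.where(agree, np.abs(T), 0.0)
    unified = sign * mags.max(axis=0)   # unified task vector (largest agreeing magnitude)
    masks, scales = [], []
    for t in T:
        m = (np.sign(t) == sign)        # task-specific direction mask
        masked = m * unified
        # rescaler aligns the average magnitude of the masked unified vector
        lam = np.abs(t).mean() / (np.abs(masked).mean() + 1e-12)
        masks.append(m)
        scales.append(lam)
    return unified, masks, scales

unified, masks, scales = emr_merge([np.array([1.0, -2.0]), np.array([3.0, 1.0])])
# unified = [3., -2.]: elected signs (+, -), largest agreeing magnitudes (3, 2)
```

At inference for task i, one would apply `pretrained + scales[i] * masks[i] * unified`, which is what makes the per-task modulators so lightweight (a boolean mask and a scalar).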

ICML Conference 2024 Conference Paper

FiT: Flexible Vision Transformer for Diffusion Model

  • Zeyu Lu
  • Zidong Wang 0004
  • Di Huang
  • Chengyue Wu
  • Xihui Liu
  • Wanli Ouyang
  • Lei Bai 0001

Existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To overcome this limitation, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. Unlike traditional methods that perceive images as static-resolution grids, FiT conceptualizes images as sequences of dynamically-sized tokens. This perspective enables a flexible training strategy that effortlessly adapts to diverse aspect ratios during both training and inference phases, thus promoting resolution generalization and eliminating biases induced by image cropping. Enhanced by a meticulously adjusted network structure and the integration of training-free extrapolation techniques, FiT exhibits remarkable flexibility in resolution extrapolation generation. Comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions. Repository available at https://github.com/whlzy/FiT.

NeurIPS Conference 2024 Conference Paper

FNP: Fourier Neural Processes for Arbitrary-Resolution Data Assimilation

  • Kun Chen
  • Peng Ye
  • Hao Chen
  • Kang Chen
  • Tao Han
  • Wanli Ouyang
  • Tao Chen
  • Lei Bai

Data assimilation is a vital component in modern global medium-range weather forecasting systems to obtain the best estimation of the atmospheric state by combining the short-term forecast and observations. Recently, AI-based data assimilation approaches have attracted increasing attention for their significant advantages over traditional techniques in terms of computational consumption. However, existing AI-based data assimilation methods can only handle observations with a specific resolution, lacking the compatibility and generalization ability to assimilate observations with other resolutions. Considering that complex real-world observations often have different resolutions, we propose the Fourier Neural Processes (FNP) for arbitrary-resolution data assimilation in this paper. Leveraging the efficiency of the designed modules and flexible structure of neural processes, FNP achieves state-of-the-art results in assimilating observations with varying resolutions, and also exhibits increasing advantages over the counterparts as the resolution and the amount of observations increase. Moreover, our FNP trained on a fixed resolution can directly handle the assimilation of observations with out-of-distribution resolutions and the observational information reconstruction task without additional fine-tuning, demonstrating its excellent generalization ability across data resolutions as well as across tasks. Code is available at https://github.com/OpenEarthLab/FNP.

AAAI Conference 2024 Conference Paper

Frozen CLIP Transformer Is an Efficient Point Cloud Encoder

  • Xiaoshui Huang
  • Zhou Huang
  • Sheng Li
  • Wentao Qu
  • Tong He
  • Yuenan Hou
  • Yifan Zuo
  • Wanli Ouyang

The pretrain-finetune paradigm has achieved great success in NLP and 2D image fields because of the high-quality representation ability and transferability of their pretrained models. However, pretraining such a strong model is difficult in the 3D point cloud field due to the limited amount of point cloud sequences. This paper introduces Efficient Point Cloud Learning (EPCL), an effective and efficient point cloud learner for directly training high-quality point cloud models with a frozen CLIP transformer. Our EPCL connects the 2D and 3D modalities by semantically aligning the image features and point cloud features without paired 2D-3D data. Specifically, the input point cloud is divided into a series of local patches, which are converted to token embeddings by the designed point cloud tokenizer. These token embeddings are concatenated with a task token and fed into the frozen CLIP transformer to learn point cloud representation. The intuition is that the proposed point cloud tokenizer projects the input point cloud into a unified token space that is similar to the 2D images. Comprehensive experiments on 3D detection, semantic segmentation, classification and few-shot learning demonstrate that the CLIP transformer can serve as an efficient point cloud encoder and our method achieves promising performance on both indoor and outdoor benchmarks. In particular, performance gains brought by our EPCL are 19.7 AP50 on ScanNet V2 detection, 4.4 mIoU on S3DIS segmentation and 1.2 mIoU on SemanticKITTI segmentation compared to contemporary pretrained models. Code is available at https://github.com/XiaoshuiHuang/EPCL.

NeurIPS Conference 2024 Conference Paper

Generalizing Weather Forecast to Fine-grained Temporal Scales via Physics-AI Hybrid Modeling

  • Wanghan Xu
  • Fenghua Ling
  • Wenlong Zhang
  • Tao Han
  • Hao Chen
  • Wanli Ouyang
  • Lei Bai

Data-driven artificial intelligence (AI) models have made significant advancements in weather forecasting, particularly in medium-range forecasting and nowcasting. However, most data-driven weather forecasting models are black-box systems that focus on learning data mappings rather than fine-grained physical evolution in the time dimension. Consequently, the limited temporal resolution of datasets prevents these models from forecasting at finer time scales. This paper proposes a physics-AI hybrid model (i.e., WeatherGFT) which generalizes weather forecasts to finer-grained temporal scales beyond the training dataset. Specifically, we employ a carefully designed PDE kernel to simulate physical evolution on a small time scale (e.g., 300 seconds) and use parallel neural networks with a learnable router for bias correction. Furthermore, we introduce a lead-time-aware training framework to promote the generalization of the model across different lead times. A weight analysis of the physics-AI modules indicates that physics conducts the major evolution while AI performs corrections adaptively. Extensive experiments show that WeatherGFT, trained on an hourly dataset, effectively generalizes forecasts across multiple time scales, including 30 minutes, which is finer than the dataset's temporal resolution.

NeurIPS Conference 2024 Conference Paper

Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

  • Le Zhuo
  • Ruoyi Du
  • Han Xiao
  • Yangguang Li
  • Dongyang Liu
  • Rongjie Huang
  • Wenze Liu
  • Xiangyang Zhu

Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers (Flag-DiT) that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D RoPE, and propose Frequency- and Time-Aware Scaled RoPE tailored for diffusion transformers. Additionally, we introduce a sigmoid time discretization schedule for diffusion sampling, which achieves high-quality generation in 5-10 steps combined with higher-order ODE solvers. Thanks to these improvements, Lumina-Next not only improves the basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities as well as multilingual generation using decoder-based LLMs as the text encoder, all in a zero-shot manner. To further validate Lumina-Next as a versatile generative framework, we instantiate it on diverse tasks including visual recognition, multi-views, audio, music, and point cloud generation, showcasing strong performance across these domains. By releasing all codes and model weights at https://github.com/Alpha-VLLM/Lumina-T2X, we aim to advance the development of next-generation generative AI capable of universal modeling.

TMLR Journal 2024 Journal Article

MaskMA: Towards Zero-Shot Multi-Agent Decision Making with Mask-Based Collaborative Learning

  • Jie Liu
  • Yinmin Zhang
  • Chuming Li
  • Zhiyuan You
  • Zhanhui Zhou
  • Chao Yang
  • Yaodong Yang
  • Yu Liu

Building a single generalist agent with strong zero-shot capability has recently sparked significant advancements. However, extending this capability to multi-agent decision making scenarios presents challenges. Most current works struggle with zero-shot transfer, due to two challenges particular to the multi-agent settings: (a) a mismatch between centralized training and decentralized execution; and (b) difficulties in creating generalizable representations across diverse tasks due to varying agent numbers and action spaces. To overcome these challenges, we propose a Mask-Based collaborative learning framework for Multi-Agent decision making (MaskMA). Firstly, we randomly mask part of the units and collaboratively learn the policies of unmasked units to handle the mismatch. In addition, MaskMA integrates a generalizable action representation by dividing the action space into intrinsic actions solely related to the unit itself and interactive actions involving interactions with other units. This flexibility allows MaskMA to tackle tasks with varying agent numbers and thus different action spaces. Extensive experiments in SMAC reveal MaskMA, with a single model trained on 11 training maps, can achieve an impressive 77.8% average zero-shot win rate on 60 unseen test maps by decentralized execution, while also performing effectively on other types of downstream tasks (e.g., varied policies collaboration, ally malfunction, and ad hoc team play).

NeurIPS Conference 2024 Conference Paper

Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA

  • Lifeng Qiao
  • Peng Ye
  • Yuchen Ren
  • Weiqiang Bai
  • Chaoqi Liang
  • Xinzhu Ma
  • Nanqing Dong
  • Wanli Ouyang

Foundation models have made significant strides in understanding the genomic language of DNA sequences. However, previous models typically adopt tokenization methods designed for natural language, which are unsuitable for DNA sequences due to their unique characteristics. In addition, the optimal approach to tokenizing DNA remains largely under-explored, and may not be intuitively understood by humans even if discovered. To address these challenges, we introduce MxDNA, a novel framework where the model autonomously learns an effective DNA tokenization strategy through gradient descent. MxDNA employs a sparse Mixture of Convolution Experts coupled with a deformable convolution to model the tokenization process, with the discontinuous, overlapping, and ambiguous nature of meaningful genomic segments explicitly considered. On the Nucleotide Transformer Benchmarks and Genomic Benchmarks, MxDNA demonstrates superior performance to existing methods with less pretraining data and time, highlighting its effectiveness. Finally, we show that MxDNA learns a unique tokenization strategy distinct from those of previous methods and captures genomic functionalities at a token level during self-supervised pretraining. Our MxDNA aims to provide a new perspective on DNA tokenization, potentially offering broad applications in various domains and yielding profound insights. Code is available at https://github.com/qiaoqiaoLF/MxDNA.

AAAI Conference 2024 Conference Paper

MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators

  • Yaqi Zhang
  • Di Huang
  • Bin Liu
  • Shixiang Tang
  • Yan Lu
  • Lu Chen
  • Lei Bai
  • Qi Chu

Generating realistic human motion from given action descriptions has experienced significant advancements because of the emerging requirement of digital humans. While recent works have achieved impressive results in generating motion directly from textual action descriptions, they often support only a single modality of the control signal, which limits their application in the real digital human industry. This paper presents a Motion General-Purpose generaTor (MotionGPT) that can use multimodal control signals, e.g., text and single-frame poses, for generating consecutive human motions by treating multimodal signals as special input tokens in large language models (LLMs). Specifically, we first quantize multimodal control signals into discrete codes and then formulate them in a unified prompt instruction to ask the LLMs to generate the motion answer. Our MotionGPT demonstrates a unified human motion generation model with multimodal control signals by tuning a mere 0.4% of LLM parameters. To the best of our knowledge, MotionGPT is the first method to generate human motion by multimodal control signals, which we hope can shed light on this new direction. Visit our webpage at https://qiqiapink.github.io/MotionGPT/.

NeurIPS Conference 2024 Conference Paper

NeuRodin: A Two-stage Framework for High-Fidelity Neural Surface Reconstruction

  • Yifan Wang
  • Di Huang
  • Weicai Ye
  • Guofeng Zhang
  • Wanli Ouyang
  • Tong He

Signed Distance Function (SDF)-based volume rendering has demonstrated significant capabilities in surface reconstruction. Although promising, SDF-based methods often fail to capture detailed geometric structures, resulting in visible defects. By comparing SDF-based volume rendering to density-based volume rendering, we identify two main factors within the SDF-based approach that degrade surface quality: SDF-to-density representation and geometric regularization. These factors introduce challenges that hinder the optimization of the SDF field. To address these issues, we introduce NeuRodin, a novel two-stage neural surface reconstruction framework that not only achieves high-fidelity surface reconstruction but also retains the flexible optimization characteristics of density-based methods. NeuRodin incorporates innovative strategies that facilitate transformation of arbitrary topologies and reduce artifacts associated with density bias. Extensive evaluations on the Tanks and Temples and ScanNet++ datasets demonstrate the superiority of NeuRodin, showing strong reconstruction capabilities for both indoor and outdoor environments using solely posed RGB captures. Project website: https://open3dvlab.github.io/NeuRodin/

ICLR Conference 2024 Conference Paper

Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE

  • Zeren Chen
  • Ziqin Wang
  • Zhen Wang 0003
  • Huayang Liu
  • Zhenfei Yin
  • Si Liu 0001
  • Lu Sheng
  • Wanli Ouyang

Recent studies have demonstrated that Large Language Models (LLMs) can extend their zero-shot generalization capabilities to multimodal learning through instruction tuning. As more modalities and downstream tasks are introduced, negative conflicts and interference may increasingly degrade performance. This phenomenon has been overlooked in previous work; we propose a novel and extensible framework, called Octavius, for comprehensive studies and experimentation on multimodal learning with Multimodal Large Language Models (MLLMs). Specifically, to mitigate the interference, we combine the concept of Mixture-of-Experts (MoE) with LoRA and design a multimodal LoRA-MoE decoder for task- and modality-specific learning. To the best of our knowledge, ours is one of the pioneering efforts to introduce MoE into MLLMs to address this problem. The experimental results (about 20% improvement) show the effectiveness and versatility of our design in various 2D and 3D downstream tasks. Code and the corresponding dataset will be available soon.

NeurIPS Conference 2024 Conference Paper

Point Cloud Matters: Rethinking the Impact of Different Observation Spaces on Robot Learning

  • Haoyi Zhu
  • Yating Wang
  • Di Huang
  • Weicai Ye
  • Wanli Ouyang
  • Tong He

In robot learning, the observation space is crucial due to the distinct characteristics of different modalities, which can potentially become a bottleneck alongside policy design. In this study, we explore the influence of various observation spaces on robot learning, focusing on three predominant modalities: RGB, RGB-D, and point cloud. We introduce OBSBench, a benchmark comprising two simulators and 125 tasks, along with standardized pipelines for various encoders and policy baselines. Extensive experiments on diverse contact-rich manipulation tasks reveal a notable trend: point cloud-based methods, even those with the simplest designs, frequently outperform their RGB and RGB-D counterparts. This trend persists in both scenarios: training from scratch and utilizing pre-training. Furthermore, our findings demonstrate that point cloud observations often yield better policy performance and significantly stronger generalization capabilities across various geometric and visual conditions. These outcomes suggest that the 3D point cloud is a valuable observation modality for intricate robotic tasks. We also suggest that incorporating both appearance and coordinate information can enhance the performance of point cloud methods. We hope our work provides valuable insights and guidance for designing more generalizable and robust robotic models.

NeurIPS Conference 2024 Conference Paper

ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention

  • Mingchen Li
  • Yang Tan
  • Xinzhu Ma
  • Bozitao Zhong
  • Huiqun Yu
  • Ziyi Zhou
  • Wanli Ouyang
  • Bingxin Zhou

Protein language models (PLMs) have shown remarkable capabilities in various protein function prediction tasks. However, while protein function is intricately tied to structure, most existing PLMs do not incorporate protein structure information. To address this issue, we introduce ProSST, a Transformer-based protein language model that seamlessly integrates both protein sequences and structures. ProSST incorporates a structure quantization module and a Transformer architecture with disentangled attention. The structure quantization module translates a 3D protein structure into a sequence of discrete tokens by first serializing the protein structure into residue-level local structures and then embedding them into a dense vector space. These vectors are then quantized into discrete structure tokens by a pre-trained clustering model. These tokens serve as an effective protein structure representation. Furthermore, ProSST explicitly learns the relationship between protein residue token sequences and structure token sequences through the sequence-structure disentangled attention. We pre-train ProSST on millions of protein structures using a masked language model objective, enabling it to learn comprehensive contextual representations of proteins. To evaluate the proposed ProSST, we conduct extensive experiments on zero-shot mutation effect prediction and several supervised downstream tasks, where ProSST achieves state-of-the-art performance among all baselines. Our code and pre-trained models are publicly available.
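The structure-quantization step can be sketched as nearest-centroid assignment against a pre-trained codebook. The toy centroids and embeddings below are illustrative stand-ins, not ProSST's actual clustering model:

```python
import numpy as np

def quantize_structures(embeddings, centroids):
    """Assign each residue-level structure embedding to its nearest
    codebook centroid, yielding one discrete structure token per residue."""
    d2 = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)  # token id = index of the nearest centroid

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])       # toy 2-token codebook
emb = np.array([[0.1, -0.2], [9.8, 10.1], [0.3, 0.0]])  # 3 residue embeddings
tokens = quantize_structures(emb, centroids)            # → array([0, 1, 0])
```

The resulting token sequence can then be consumed by the Transformer alongside the residue token sequence.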

AAAI Conference 2024 Conference Paper

Semi-supervised 3D Object Detection with PatchTeacher and PillarMix

  • Xiaopei Wu
  • Liang Peng
  • Liang Xie
  • Yuenan Hou
  • Binbin Lin
  • Xiaoshui Huang
  • Haifeng Liu
  • Deng Cai

Semi-supervised learning aims to leverage numerous unlabeled data to improve the model performance. Current semi-supervised 3D object detection methods typically use a teacher to generate pseudo labels for a student, and the quality of the pseudo labels is essential for the final performance. In this paper, we propose PatchTeacher, which focuses on partial scene 3D object detection to provide high-quality pseudo labels for the student. Specifically, we divide a complete scene into a series of patches and feed them to our PatchTeacher sequentially. PatchTeacher leverages the low memory consumption advantage of partial scene detection to process point clouds with a high-resolution voxelization, which can minimize the information loss of quantization and extract more fine-grained features. However, it is non-trivial to train a detector on fractions of the scene. Therefore, we introduce three key techniques, i.e., Patch Normalizer, Quadrant Align, and Fovea Selection, to improve the performance of PatchTeacher. Moreover, we devise PillarMix, a strong data augmentation strategy that mixes truncated pillars from different LiDAR scans to generate diverse training samples and thus help the model learn more general representation. Extensive experiments conducted on Waymo and ONCE datasets verify the effectiveness and superiority of our method and we achieve new state-of-the-art results, surpassing existing methods by a large margin. Codes are available at https://github.com/LittlePey/PTPM.

ICML Conference 2024 Conference Paper

Towards a Self-contained Data-driven Global Weather Forecasting Framework

  • Yi Xiao
  • Lei Bai 0001
  • Wei Xue 0003
  • Hao Chen 0045
  • Kun Chen 0004
  • Kang Chen
  • Tao Han 0002
  • Wanli Ouyang

Data-driven weather forecasting models are advancing rapidly, yet they rely on initial states (i.e., analysis states) typically produced by traditional data assimilation algorithms. Four-dimensional variational assimilation (4DVar) is one of the most widely adopted data assimilation algorithms in numerical weather prediction centers; it is accurate but computationally expensive. In this paper, we aim to couple the AI forecasting model, FengWu, with 4DVar to build a self-contained data-driven global weather forecasting framework, FengWu-4DVar. To achieve this, we propose an AI-embedded 4DVar algorithm that includes three components: (1) a 4DVar objective function embedded with the FengWu forecasting model and its error representation to enhance efficiency and accuracy; (2) a spherical-harmonic-transform-based (SHT-based) approximation strategy for capturing the horizontal correlation of background error; and (3) an auto-differentiation (AD) scheme for determining the optimal analysis fields. Experimental results show that under the ERA5 simulated observational data with varying proportions and noise levels, FengWu-4DVar can generate accurate analysis fields; remarkably, it has achieved stable self-contained global weather forecasts for an entire year for the first time, demonstrating its potential for real-world applications. Additionally, our framework is approximately 100 times faster than the traditional 4DVar algorithm under similar experimental conditions, highlighting its significant computational efficiency.
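The model-embedded 4DVar objective can be illustrated with a toy linear forecast model in place of FengWu. Here B and R (background and observation error covariances) are taken as identity for brevity, and plain gradient descent stands in for the auto-differentiation scheme; everything below is an assumption-laden sketch, not the FengWu-4DVar implementation:

```python
import numpy as np

M = np.array([[0.9, 0.1], [0.0, 0.95]])   # toy linear "forecast model"
H = np.eye(2)                              # observe the full state
xb = np.array([1.0, -1.0])                 # background (first-guess) state
xt = np.array([1.2, -0.8])                 # "truth" used to simulate observations
obs = [H @ np.linalg.matrix_power(M, t) @ xt for t in range(1, 4)]

def cost(x0):
    """4DVar cost: background misfit + forecast-observation misfits."""
    j = 0.5 * (x0 - xb) @ (x0 - xb)
    for t, y in enumerate(obs, start=1):
        r = H @ np.linalg.matrix_power(M, t) @ x0 - y
        j += 0.5 * r @ r
    return j

def grad(x0):
    """Analytic gradient of the cost (the adjoint of the linear model)."""
    g = x0 - xb
    for t, y in enumerate(obs, start=1):
        Mt = np.linalg.matrix_power(M, t)
        g = g + Mt.T @ H.T @ (H @ Mt @ x0 - y)
    return g

x0 = xb.copy()
for _ in range(200):                       # plain gradient descent on J(x0)
    x0 = x0 - 0.1 * grad(x0)
```

After the loop, `x0` is the analysis state: it fits the window of observations far better than the background while staying close to it.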

AAAI Conference 2023 Conference Paper

ACE: Cooperative Multi-Agent Q-learning with Bidirectional Action-Dependency

  • Chuming Li
  • Jie Liu
  • Yinmin Zhang
  • Yuhong Wei
  • Yazhe Niu
  • Yaodong Yang
  • Yu Liu
  • Wanli Ouyang

Multi-agent reinforcement learning (MARL) suffers from the non-stationarity problem: the learning targets change at every iteration because multiple agents update their policies simultaneously. Starting from first principles, in this paper we manage to solve the non-stationarity problem by proposing bidirectional action-dependent Q-learning (ACE). Central to the development of ACE is the sequential decision making process wherein only one agent is allowed to take an action at a time. Within this process, each agent maximizes its value function given the actions taken by the preceding agents at the inference stage. In the learning phase, each agent minimizes the TD error that depends on how the subsequent agents have reacted to its chosen action. Given the design of bidirectional dependency, ACE effectively turns a multi-agent MDP into a single-agent MDP. We implement the ACE framework by identifying the proper network representation to formulate the action dependency, so that the sequential decision process is computed implicitly in one forward pass. To validate ACE, we compare it with strong baselines on two MARL benchmarks. Empirical experiments demonstrate that ACE outperforms the state-of-the-art algorithms on Google Research Football and StarCraft Multi-Agent Challenge by a large margin. In particular, on SMAC tasks, ACE achieves 100% success rate on almost all the hard and super hard maps. We further study extensive research problems regarding ACE, including extension, generalization and practicability.
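The sequential decision process can be sketched as agents choosing actions one at a time, each conditioned on the actions already chosen by preceding agents. The toy Q-function below is an illustrative stand-in, not ACE's learned network:

```python
import numpy as np

def sequential_actions(q_fn, n_agents, n_actions):
    """Agents act sequentially; agent i maximises a Q-value conditioned
    on the actions of agents 0..i-1, turning the joint step into a
    sequence of single-agent choices."""
    chosen = []
    for i in range(n_agents):
        vals = [q_fn(i, tuple(chosen), a) for a in range(n_actions)]
        chosen.append(int(np.argmax(vals)))
    return chosen

# Toy Q: agent i prefers action i, unless the previous agent just took it.
def toy_q(agent, prev_actions, action):
    penalty = 1.0 if prev_actions and prev_actions[-1] == action else 0.0
    return (1.0 if action == agent % 3 else 0.0) - penalty

acts = sequential_actions(toy_q, 3, 3)   # → [0, 1, 2]
```

Because each choice conditions on earlier ones, the other agents' behaviour is no longer a moving target during each agent's own maximisation.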

ICLR Conference 2023 Conference Paper

Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation

  • Martin Zong
  • Zengyu Qiu
  • Xinzhu Ma
  • Kunlin Yang
  • Chunya Liu
  • Jun Hou
  • Shuai Yi
  • Wanli Ouyang

Knowledge distillation (KD) has shown very promising capabilities in transferring learning representations from large models (teachers) to small models (students). However, as the capacity gap between students and teachers becomes larger, existing KD methods fail to achieve better results. Our work shows that 'prior knowledge' is vital to KD, especially when applying large teachers. In particular, we propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before the feature distillation. This means that our method also takes the teacher's features as 'input', not just 'target'. Besides, we dynamically adjust the ratio of the prior knowledge during the training phase according to the feature gap, thus guiding the student at an appropriate difficulty. To evaluate the proposed method, we conduct extensive experiments on two image classification benchmarks (i.e., CIFAR100 and ImageNet) and an object detection benchmark (i.e., MS COCO). The results demonstrate the superiority of our method in performance under varying settings. Besides, our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers. More importantly, DPK provides a fast solution for teacher model selection for any given model. Our codes will be publicly available for reproducibility.
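The "teacher features as input" idea can be sketched as gap-dependent feature mixing: the larger the measured feature gap, the more teacher features are injected as prior. The ratio rule below is an illustrative stand-in for the paper's dynamic adjustment:

```python
import numpy as np

def dynamic_prior_mix(student_feat, teacher_feat, max_ratio=0.9):
    """Mix teacher features into the student's features as an 'input'
    prior. The mixing ratio grows with the feature gap (illustrative
    rule; the paper adapts the ratio over training)."""
    gap = float(np.mean((student_feat - teacher_feat) ** 2))
    ratio = min(gap / (gap + 1.0), max_ratio)      # larger gap -> more prior
    mask = np.random.default_rng(0).random(student_feat.shape) < ratio
    mixed = np.where(mask, teacher_feat, student_feat)
    return mixed, ratio
```

When the student catches up (gap near zero) the ratio vanishes and the student's own features pass through unchanged.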

NeurIPS Conference 2023 Conference Paper

CluB: Cluster Meets BEV for LiDAR-Based 3D Object Detection

  • Yingjie Wang
  • Jiajun Deng
  • Yuenan Hou
  • Yao Li
  • Yu Zhang
  • Jianmin Ji
  • Wanli Ouyang
  • Yanyong Zhang

Currently, LiDAR-based 3D detectors are broadly categorized into two groups, namely, BEV-based detectors and cluster-based detectors. BEV-based detectors capture the contextual information from the Bird's Eye View (BEV) and fill their center voxels via feature diffusion with a stack of convolution layers, which, however, weakens the capability of presenting an object with the center point. On the other hand, cluster-based detectors exploit the voting mechanism and aggregate the foreground points into object-centric clusters for further prediction. In this paper, we explore how to effectively combine these two complementary representations into a unified framework. Specifically, we propose a new 3D object detection framework, referred to as CluB, which incorporates an auxiliary cluster-based branch into the BEV-based detector by enriching the object representation at both feature and query levels. Technically, CluB is comprised of two steps. First, we construct a cluster feature diffusion module to establish the association between cluster features and BEV features in a subtle and adaptive fashion. Based on that, an imitation loss is introduced to distill object-centric knowledge from the cluster features to the BEV features. Second, we design a cluster query generation module to leverage the voting centers directly from the cluster branch, thus enriching the diversity of object queries. Meanwhile, a direction loss is employed to encourage a more accurate voting center for each cluster. Extensive experiments are conducted on Waymo and nuScenes datasets, and our CluB achieves state-of-the-art performance on both benchmarks.

ICLR Conference 2023 Conference Paper

Cycle-consistent Masked AutoEncoder for Unsupervised Domain Generalization

  • Haiyang Yang
  • Xiaotong Li
  • Shixiang Tang
  • Feng Zhu 0006
  • Yizhou Wang 0007
  • Meilin Chen
  • Lei Bai 0001
  • Rui Zhao 0001

Self-supervised learning methods undergo undesirable performance drops when there exists a significant domain gap between training and testing scenarios. Therefore, unsupervised domain generalization (UDG) is proposed to tackle the problem, which requires the model to be trained on several different domains without supervision and generalize well on unseen test domains. Existing methods rely either on cross-domain, semantically consistent image pairs in contrastive methods or on reconstruction pairs in generative methods, while such image pairs are not available without semantic labels. In this paper, we propose a cycle cross-domain reconstruction task for unsupervised domain generalization in the absence of paired images. The cycle cross-domain reconstruction task converts a masked image from one domain to another domain and then reconstructs the original image from the converted images. To preserve the divergent domain knowledge of decoders in the cycle reconstruction task, we propose a novel domain-contrastive loss to regularize the domain information in reconstructed images encoded with the desirable domain style. Quantitative results on extensive datasets show that our method improves the state-of-the-art unsupervised domain generalization methods by an average of +5.59%, +4.52%, +4.22%, and +7.02% on 1%, 5%, 10%, and 100% PACS, and +5.08%, +6.49%, +1.79%, and +0.53% on 1%, 5%, 10%, and 100% DomainNet, respectively.

AAAI Conference 2023 Conference Paper

Fine-Grained Retrieval Prompt Tuning

  • Shijie Wang
  • Jianlong Chang
  • Zhihui Wang
  • Haojie Li
  • Wanli Ouyang
  • Qi Tian

Fine-grained object retrieval aims to learn discriminative representation to retrieve visually similar objects. However, existing top-performing works usually impose pairwise similarities on the semantic embedding spaces or design a localization sub-network to continually fine-tune the entire model in limited data scenarios, thus resulting in convergence to suboptimal solutions. In this paper, we develop Fine-grained Retrieval Prompt Tuning (FRPT), which steers a frozen pre-trained model to perform the fine-grained retrieval task from the perspectives of sample prompting and feature adaptation. Specifically, FRPT only needs to learn fewer parameters in the prompt and adaptation instead of fine-tuning the entire model, thus solving the issue of convergence to suboptimal solutions caused by fine-tuning the entire model. Technically, a discriminative perturbation prompt (DPP) is introduced and deemed as a sample prompting process, which amplifies and even exaggerates some discriminative elements contributing to category prediction via a content-aware inhomogeneous sampling operation. In this way, DPP can make the fine-grained retrieval task aided by the perturbation prompts close to the solved task during the original pre-training. Thereby, it preserves the generalization and discrimination of representation extracted from input samples. Besides, a category-specific awareness head is proposed and regarded as feature adaptation, which removes species discrepancies in the features extracted by the pre-trained model using category-guided instance normalization. Thus, the optimized features include only the discrepancies among subcategories. Extensive experiments demonstrate that our FRPT with fewer learnable parameters achieves state-of-the-art performance on three widely-used fine-grained datasets.

NeurIPS Conference 2023 Conference Paper

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

  • Zhenfei Yin
  • Jiong Wang
  • Jianjian Cao
  • Zhelun Shi
  • Dingning Liu
  • Mukai Li
  • Xiaoshui Huang
  • Zhiyong Wang

Large language models have emerged as a promising approach towards achieving general-purpose AI agents. The thriving open-source LLM community has greatly accelerated the development of agents that support human-machine dialogue interaction through natural language processing. However, human interaction with the world extends beyond only text as a modality, and other modalities such as vision are also crucial. Recent works on multi-modal large language models, such as GPT-4V and Bard, have demonstrated their effectiveness in handling visual modalities. However, the transparency of these works is limited and insufficient to support academic research. To the best of our knowledge, we present one of the very first open-source endeavors in the field, LAMM, encompassing a Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark. Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs, with a specific focus on facilitating AI agents capable of bridging the gap between ideas and execution, thereby enabling seamless human-AI interaction. Our main contribution is three-fold: 1) We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision. Extensive experiments validate the effectiveness of our dataset and benchmark. 2) We outline the detailed methodology of constructing multi-modal instruction tuning datasets and benchmarks for MLLMs, enabling rapid scaling and extension of MLLM research to diverse domains, tasks, and modalities. 3) We provide a primary but potential MLLM training framework optimized for modality extension. We also provide baseline models, comprehensive experimental observations, and analysis to accelerate future research. Our baseline model can be trained within 24 A100 GPU hours, and the framework also supports training on V100 and RTX 3090 GPUs thanks to the open-source community. Codes and data are now available at https://openlamm.github.io.

NeurIPS Conference 2023 Conference Paper

Learning to Parameterize Visual Attributes for Open-set Fine-grained Retrieval

  • Shijie Wang
  • Jianlong Chang
  • Haojie Li
  • Zhihui Wang
  • Wanli Ouyang
  • Qi Tian

Open-set fine-grained retrieval is an emerging challenging task that allows retrieving unknown categories beyond the training set. The best solution for handling unknown categories is to represent them using a set of visual attributes learnt from known categories, as widely used in zero-shot learning. Though important, attribute modeling usually requires significant manual annotation and is thus labor-intensive. Therefore, it is worth investigating how to transform retrieval models trained with image-level supervision from category semantic extraction to attribute modeling. To this end, we propose a novel Visual Attribute Parameterization Network (VAPNet) to learn visual attributes from known categories and parameterize them into the retrieval model, without the involvement of any attribute annotations. In this way, VAPNet could utilize its parameters to parse a set of visual attributes from unknown categories and precisely represent them. Technically, VAPNet explicitly attains some semantics with rich details via making use of local image patches and distills the visual attributes from these discovered semantics. Additionally, it integrates the online refinement of these visual attributes into the training process to iteratively enhance their quality. Simultaneously, VAPNet treats these attributes as supervisory signals to tune the retrieval models, thereby achieving attribute parameterization. Extensive experiments on open-set fine-grained retrieval datasets validate the superior performance of our VAPNet over existing solutions.

AAAI Conference 2023 Conference Paper

Multi-Scale Control Signal-Aware Transformer for Motion Synthesis without Phase

  • Lintao Wang
  • Kun Hu
  • Lei Bai
  • Yu Ding
  • Wanli Ouyang
  • Zhiyong Wang

Synthesizing controllable motion for a character using deep learning has been a promising approach due to its potential to learn a compact model without laborious feature engineering. To produce dynamic motion from weak control signals such as desired paths, existing methods often require auxiliary information such as phases for alleviating motion ambiguity, which limits their generalisation capability. As past poses often contain useful auxiliary hints, in this paper we propose a task-agnostic deep learning method, namely Multi-scale Control Signal-aware Transformer (MCS-T), with an attention based encoder-decoder architecture to discover the auxiliary information implicitly for synthesizing controllable motion without explicitly requiring auxiliary information such as phase. Specifically, an encoder is devised to adaptively formulate the motion patterns of a character's past poses with multi-scale skeletons, and a decoder driven by control signals is devised to further synthesize and predict the character's state by paying context-specialised attention to the encoded past motion patterns. As a result, it helps alleviate the issues of low responsiveness and slow transition which often happen in conventional methods not using auxiliary information. Both qualitative and quantitative experimental results on an existing biped locomotion dataset, which involves diverse types of motion transitions, demonstrate the effectiveness of our method. In particular, MCS-T is able to successfully generate motions comparable to those generated by the methods using auxiliary information.

AAAI Conference 2023 Conference Paper

Revisiting Classifier: Transferring Vision-Language Models for Video Recognition

  • Wenhao Wu
  • Zhun Sun
  • Wanli Ouyang

Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research. Along with the growth of computational capacity, we now have open-source vision-language pre-trained models in large scales of the model architecture and amount of data. In this study, we focus on transferring knowledge for video classification tasks. Conventional methods randomly initialize the linear classifier head for vision classification, but they leave the usage of the text encoder for downstream visual recognition tasks unexplored. In this paper, we revisit the role of the linear classifier and replace the classifier with different knowledge from the pre-trained model. We utilize the well-pretrained language model to generate good semantic targets for efficient transfer learning. The empirical study shows that our method improves both the performance and the training speed of video classification, with a negligible change in the model. Our simple yet effective tuning paradigm achieves state-of-the-art performance and efficient training on various video recognition scenarios, i.e., zero-shot, few-shot, and general recognition. In particular, our paradigm achieves the state-of-the-art accuracy of 87.8% on Kinetics-400, and also surpasses previous methods by 20~50% absolute top-1 accuracy under zero-shot, few-shot settings on five video datasets. Code and models are available at https://github.com/whwu95/Text4Vis.
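Replacing a randomly initialised classifier head with frozen text-encoder embeddings of the class names can be sketched as cosine-similarity logits. The toy vectors below are illustrative stand-ins for actual text-encoder outputs:

```python
import numpy as np

def text_classifier_logits(image_feats, class_text_embs):
    """Logits are cosine similarities between image features and
    per-class text embeddings, so the 'classifier weights' come from
    the pre-trained text encoder instead of random initialisation."""
    v = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    t = class_text_embs / np.linalg.norm(class_text_embs, axis=-1, keepdims=True)
    return v @ t.T

img = np.array([[1.0, 0.0], [0.0, 1.0]])   # two image features
txt = np.array([[0.9, 0.1], [0.1, 0.9]])   # stand-ins for class-name embeddings
logits = text_classifier_logits(img, txt)
preds = logits.argmax(axis=1)              # → array([0, 1])
```

Because the head carries semantic structure from the start, training converges faster and zero-shot transfer falls out of the same formulation.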

NeurIPS Conference 2023 Conference Paper

Seeing is not always believing: Benchmarking Human and Model Perception of AI-Generated Images

  • Zeyu Lu
  • Di Huang
  • Lei Bai
  • Jingjing Qu
  • Chengyue Wu
  • Xihui Liu
  • Wanli Ouyang

Photos serve as a way for humans to record what they experience in their daily lives, and they are often regarded as trustworthy sources of information. However, there is a growing concern that the advancement of artificial intelligence (AI) technology may produce fake photos, which can create confusion and diminish trust in photographs. This study aims to comprehensively evaluate agents for distinguishing state-of-the-art AI-generated visual content. Our study benchmarks both human capability and cutting-edge fake image detection AI algorithms, using a newly collected large-scale fake image dataset, Fake2M. In our human perception evaluation, titled HPBench, we discovered that humans struggle significantly to distinguish real photos from AI-generated ones, with a misclassification rate of 38.7%. We also evaluate model capability for AI-generated image detection in MPBench; the top-performing model in MPBench achieves a 13% failure rate under the same setting used in the human evaluation. We hope that our study can raise awareness of the potential risks of AI-generated images and facilitate further research to prevent the spread of false information. More information is available at https://github.com/Inf-imagine/Sentry.

PRL Workshop 2023 Workshop Paper

Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning

  • Chuming Li
  • Ruonan Jia
  • Jiawei Yao
  • Jie Liu
  • Yinmin Zhang
  • Yazhe Niu
  • Yaodong Yang
  • Yu Liu

Model-based reinforcement learning (RL) has demonstrated remarkable successes on a range of continuous control tasks due to its high sample efficiency. To save the computation cost of conducting planning online, recent practices tend to distill optimized action sequences into an RL policy during the training phase. Although the distillation can incorporate both the foresight of planning and the exploration ability of RL policies, the theoretical understanding of these methods is yet unclear. In this paper, we extend the policy improvement step of Soft Actor-Critic (SAC) by developing an approach to distill from model-based planning to the policy. We then demonstrate that such an approach of policy improvement has a theoretical guarantee of monotonic improvement and convergence to the maximum value defined in SAC. We discuss effective design choices and implement our theory as a practical algorithm, Model-based Planning Distilled to Policy (MPDP), that updates the policy jointly over multiple future time steps. Extensive experiments show that MPDP achieves better sample efficiency and asymptotic performance than both model-free and model-based planning algorithms on six continuous control benchmark tasks in MuJoCo.

ECAI Conference 2023 Conference Paper

Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning

  • Chuming Li
  • Ruonan Jia
  • Jie Liu 0047
  • Yinmin Zhang
  • Yazhe Niu
  • Yaodong Yang 0001
  • Yu Liu 0015
  • Wanli Ouyang

Model-based reinforcement learning (RL) has demonstrated remarkable successes on a range of continuous control tasks due to its high sample efficiency. To save the computation cost of conducting planning online, recent practices tend to distill optimized action sequences into an RL policy during the training phase. Although the distillation can incorporate both the foresight of planning and the exploration ability of RL policies, the theoretical understanding of these methods is yet unclear. In this paper, we extend the policy improvement of Soft Actor-Critic (SAC) by developing an approach to distill from model-based planning to the policy. We then demonstrate that such an approach of policy improvement has a theoretical guarantee of monotonic improvement and convergence to the maximum value defined in SAC. We discuss effective design choices and implement our theory as a practical algorithm—Model-based Planning Distilled to Policy (MPDP)—that updates the policy jointly over multiple future time steps. Extensive experiments show that MPDP achieves better sample efficiency and asymptotic performance than both model-free and model-based planning algorithms on six continuous control benchmark tasks in MuJoCo.

AAAI Conference 2022 Conference Paper

Category-Specific Nuance Exploration Network for Fine-Grained Object Retrieval

  • Shijie Wang
  • Zhihui Wang
  • Haojie Li
  • Wanli Ouyang

Employing additional prior knowledge to model local features as a final fine-grained object representation has become a trend for fine-grained object retrieval (FGOR). A potential limitation of these methods is that they only focus on common parts across the dataset (e.g., head, body, or even leg) by introducing additional prior knowledge, but the retrieval of a fine-grained object may rely on category-specific nuances that contribute to category prediction. To handle this limitation, we propose an end-to-end Category-specific Nuance Exploration Network (CNENet) that elaborately discovers category-specific nuances that contribute to category prediction, and semantically aligns these nuances grouped by subcategory without any additional prior knowledge, to directly emphasize the discrepancy among subcategories. Specifically, we design a Nuance Modelling Module that adaptively predicts a group of category-specific response (CARE) maps via implicitly digging into category-specific nuances, specifying the locations and scales for category-specific nuances. Upon this, two nuance regularizations are proposed: 1) semantic discrete loss that forces each CARE map to attend to different spatial regions to capture diverse nuances; 2) semantic alignment loss that constructs a consistent semantic correspondence for each CARE map of the same order with the same subcategory via guaranteeing each instance and its transformed counterpart to be spatially aligned. Moreover, we propose a Nuance Expansion Module, which exploits context appearance information of discovered nuances and refines the prediction of current nuance by its similar neighbors, leading to further improvement on nuance consistency and completeness. Extensive experiments validate that our CNENet consistently yields the best performance under the same settings against most competitive approaches on CUB Birds, Stanford Cars, and FGVC Aircraft datasets.

ICLR Conference 2022 Conference Paper

MonoDistill: Learning Spatial Features for Monocular 3D Object Detection

  • Zhiyu Chong
  • Xinzhu Ma
  • Hong Zhang 0011
  • Yuxin Yue
  • Haojie Li
  • Zhihui Wang 0001
  • Wanli Ouyang

3D object detection is a fundamental and challenging task for 3D scene understanding, and the monocular-based methods can serve as an economical alternative to the stereo-based or LiDAR-based methods. However, accurately locating objects in the 3D space from a single image is extremely difficult due to the lack of spatial cues. To mitigate this issue, we propose a simple and effective scheme to introduce the spatial information from LiDAR signals to the monocular 3D detectors, without introducing any extra cost in the inference phase. In particular, we first project the LiDAR signals into the image plane and align them with the RGB images. After that, we use the resulting data to train a 3D detector (LiDAR Net) using the same architecture as the baseline model. Finally, this LiDAR Net can serve as the teacher to transfer the learned knowledge to the baseline model. Experimental results show that the proposed method can significantly boost the performance of the baseline model and ranks first among all monocular-based methods on the KITTI benchmark. Besides, extensive ablation studies are conducted, which further prove the effectiveness of each part of our designs and illustrate what the baseline model has learned from the LiDAR Net.

ICLR Conference 2022 Conference Paper

Pseudo-Labeled Auto-Curriculum Learning for Semi-Supervised Keypoint Localization

  • Can Wang
  • Sheng Jin 0007
  • Yingda Guan
  • Wentao Liu 0002
  • Chen Qian 0006
  • Ping Luo 0002
  • Wanli Ouyang

Localizing keypoints of an object is a basic visual problem. However, supervised learning of a keypoint localization network often requires a large amount of data, which is expensive and time-consuming to obtain. To remedy this, there is an ever-growing interest in semi-supervised learning (SSL), which leverages a small set of labeled data along with a large set of unlabeled data. Among these SSL approaches, pseudo-labeling (PL) is one of the most popular. PL approaches apply pseudo-labels to unlabeled data, and then train the model with a combination of the labeled and pseudo-labeled data iteratively. The key to the success of PL is the selection of high-quality pseudo-labeled samples. Previous works mostly select training samples by manually setting a single confidence threshold. We propose to automatically select reliable pseudo-labeled samples with a series of dynamic thresholds, which constitutes a learning curriculum. Extensive experiments on five keypoint localization benchmark datasets demonstrate that the proposed approach significantly outperforms the previous state-of-the-art SSL approaches.
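The dynamic-threshold curriculum can be sketched as a schedule that relaxes the pseudo-label confidence threshold across training rounds, so early rounds keep only the most reliable samples and later rounds admit more. The linear decay and the `hi`/`lo` values below are illustrative, not the paper's learned thresholds:

```python
import numpy as np

def select_pseudo_labels(confidences, round_idx, n_rounds, hi=0.95, lo=0.6):
    """Return indices of pseudo-labeled samples whose confidence exceeds
    a threshold that decays from hi to lo over training rounds."""
    frac = round_idx / max(n_rounds - 1, 1)
    threshold = hi - (hi - lo) * frac
    return np.nonzero(confidences >= threshold)[0], threshold

conf = np.array([0.99, 0.9, 0.7, 0.5])
idx0, t0 = select_pseudo_labels(conf, 0, 4)   # strict round: threshold 0.95
idx3, t3 = select_pseudo_labels(conf, 3, 4)   # lenient round: threshold 0.60
```

A single fixed threshold would either starve early training of data or flood it with noisy labels; the schedule trades the two off over time.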

IJCAI Conference 2022 Conference Paper

RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training

  • Luya Wang
  • Feng Liang
  • Yangguang Li
  • Honggang Zhang
  • Wanli Ouyang
  • Jing Shao

Recently, self-supervised vision transformers have attracted unprecedented attention for their impressive representation learning ability. However, the dominant method, contrastive learning, mainly relies on an instance discrimination pretext task, which learns a global understanding of the image. This paper incorporates local feature learning into self-supervised vision transformers via Reconstructive Pre-training (RePre). Our RePre extends contrastive frameworks by adding a branch for reconstructing raw image pixels in parallel with the existing contrastive objective. RePre is equipped with a lightweight convolution-based decoder that fuses the multi-hierarchy features from the transformer encoder. The multi-hierarchy features provide rich supervision from low to high semantic levels, which is crucial for our RePre. Our RePre brings decent improvements on various contrastive frameworks with different vision transformer architectures. Transfer performance in downstream tasks outperforms supervised pre-training and state-of-the-art (SOTA) self-supervised counterparts.

AAAI Conference 2022 Conference Paper

SepFusion: Finding Optimal Fusion Structures for Visual Sound Separation

  • Dongzhan Zhou
  • Xinchi Zhou
  • Di Hu
  • Hang Zhou
  • Lei Bai
  • Ziwei Liu
  • Wanli Ouyang

Multiple modalities can provide rich semantic information, and exploiting such information will normally lead to better performance compared with the single-modality counterpart. However, it is not easy to devise an effective cross-modal fusion structure due to the variations of feature dimensions and semantics, especially when the inputs even come from different sensors, as in the field of audio-visual learning. In this work, we propose SepFusion, a novel framework that can smoothly produce optimal fusion structures for visual-sound separation. The framework is composed of two components, namely the model generator and the evaluator. To construct the generator, we devise a lightweight architecture space that can adapt to different input modalities. In this way, we can easily obtain audio-visual fusion structures according to our demands. For the evaluator, we adopt the idea of neural architecture search to select superior networks effectively. This automatic process can significantly save human effort while achieving competitive performance. Moreover, since our SepFusion provides a series of strong models, we can utilize the model family for broader applications, such as further promoting performance via model assembly, or providing suitable architectures for the separation of certain instrument classes. These potential applications further enhance the competitiveness of our approach.

NeurIPS Conference 2022 Conference Paper

Stimulative Training of Residual Networks: A Social Psychology Perspective of Loafing

  • Peng Ye
  • Shengji Tang
  • Baopu Li
  • Tao Chen
  • Wanli Ouyang

Residual networks have shown great success and become indispensable in today’s deep models. In this work, we aim to re-investigate the training process of residual networks from a novel social psychology perspective of loafing, and further propose a new training strategy to strengthen the performance of residual networks. As residual networks can be viewed as ensembles of relatively shallow networks (i.e., the unraveled view) in prior works, we also start from such a view and consider that the final performance of a residual network is co-determined by a group of sub-networks. Inspired by the social loafing problem of social psychology, we find that residual networks invariably suffer from a similar problem, where sub-networks in a residual network are prone to exert less effort when working as part of the group compared to working alone. We define this previously overlooked problem as network loafing. As social loafing will ultimately cause low individual productivity and reduced overall performance, network loafing will also hinder the performance of a given residual network and its sub-networks. Referring to the solutions of social psychology, we propose stimulative training, which randomly samples a residual sub-network and calculates the KL-divergence loss between the sampled sub-network and the given residual network, to act as extra supervision for sub-networks and make the overall goal consistent. Comprehensive empirical results and theoretical analyses verify that stimulative training can well handle the loafing problem, and improve the performance of a residual network by improving the performance of its sub-networks. The code is available at https://github.com/Sunshine-Ye/NIPS22-ST.
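The core step of stimulative training described above — sample a residual sub-network, then penalize the KL divergence between its output distribution and the full network's — can be sketched with a toy residual stack. The 0.5 keep probability, the linear toy blocks, and the helper names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-12):
    """Mean KL(p || q) over a batch of distributions."""
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1).mean()

def stimulative_step(blocks, x, rng):
    """Run the full residual stack, then a randomly sampled sub-network
    (each block kept with probability 0.5), and return the KL divergence
    between their output distributions as extra supervision."""
    def forward(keep):
        h = x
        for block, k in zip(blocks, keep):
            if k:
                h = h + block(h)      # residual connection
        return softmax(h)
    full = forward([True] * len(blocks))
    keep = rng.random(len(blocks)) < 0.5   # sampled sub-network
    sub = forward(keep)
    return kl_div(full, sub)

rng = np.random.default_rng(0)
weights = [0.1 * rng.standard_normal((4, 4)) for _ in range(3)]
blocks = [lambda h, W=W: h @ W for W in weights]   # toy residual blocks
x = rng.standard_normal((2, 4))
loss = stimulative_step(blocks, x, rng)            # non-negative KL penalty
```

In actual training this KL term would be backpropagated through the sampled sub-network so that every sub-network is pulled toward the ensemble's behavior.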

ICLR Conference 2022 Conference Paper

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

  • Yangguang Li 0001
  • Feng Liang
  • Lichen Zhao
  • Yufeng Cui
  • Wanli Ouyang
  • Jing Shao
  • Fengwei Yu
  • Junjie Yan

Recently, large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, thereby restricting its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently. Instead of using the single image-text contrastive supervision, we fully exploit data potential through the use of (1) self-supervision within each modality; (2) multi-view supervision across modalities; (3) nearest-neighbor supervision from other similar pairs. Benefiting from intrinsic supervision, our DeCLIP-ResNet50 can achieve 60.4% zero-shot top-1 accuracy on ImageNet, which is 0.8% above the CLIP-ResNet50 while using 7.1× fewer data. Our DeCLIP-ResNet50 outperforms its counterpart in 8 out of 11 visual datasets when transferred to downstream tasks. Moreover, scaling up the model and compute also works well in our framework.

NeurIPS Conference 2022 Conference Paper

Unsupervised Object Detection Pretraining with Joint Object Priors Generation and Detector Learning

  • Yizhou Wang
  • Meilin Chen
  • Shixiang Tang
  • Feng Zhu
  • Haiyang Yang
  • Lei Bai
  • Rui Zhao
  • Yunfeng Yan

Unsupervised pretraining methods for object detection aim to learn object discrimination and localization ability from large amounts of images. Typically, recent works design pretext tasks that supervise the detector to predict the defined object priors. They normally leverage heuristic methods to produce object priors, e.g., selective search, which separates the prior generation and detector learning and leads to sub-optimal solutions. In this work, we propose a novel object detection pretraining framework that could generate object priors and learn detectors jointly by generating accurate object priors from the model itself. Specifically, region priors are extracted by attention maps from the encoder, which highlights foregrounds. Instance priors are the selected high-quality output bounding boxes of the detection decoder. By assuming objects as instances in the foreground, we can generate object priors with both region and instance priors. Moreover, our object priors are jointly refined along with the detector optimization. With better object priors as supervision, the model could achieve better detection capability, which in turn promotes the object priors generation. Our method improves the competitive approaches by +1.3 AP and +1.7 AP in the 1% and 10% COCO low-data regimes, respectively.

NeurIPS Conference 2021 Conference Paper

A Continuous Mapping For Augmentation Design

  • Keyu Tian
  • Chen Lin
  • Ser Nam Lim
  • Wanli Ouyang
  • Puneet Dokania
  • Philip Torr

Automated data augmentation (ADA) techniques have played an important role in boosting the performance of deep models. Such techniques mostly aim to optimize a parameterized distribution over a discrete augmentation space, and are thus restricted by the discretization of the search space, which is normally handcrafted. To overcome these limitations, we take the first step toward constructing a continuous mapping from $\mathbb{R}^d$ to image transformations (an augmentation space). Using this mapping, we take a novel approach where 1) we pose ADA as a continuous optimization problem over the parameters of the augmentation distribution; and 2) use Stochastic Gradient Langevin Dynamics to learn and sample augmentations. This allows us to potentially explore the space of infinitely many possible augmentations, which otherwise was not possible due to the discretization of the space. This view of ADA is radically different from the standard discretization-based view, and it opens avenues for utilizing the vast array of efficient gradient-based algorithms available for continuous optimization problems. Results over multiple benchmarks demonstrate the efficiency improvement of this work compared with previous methods.

ICML Conference 2021 Conference Paper

AutoSampling: Search for Effective Data Sampling Schedules

  • Ming Sun 0008
  • Haoxuan Dou
  • Baopu Li
  • Junjie Yan
  • Wanli Ouyang
  • Lei Cui

Data sampling plays a pivotal role in training deep learning models. However, an effective sampling schedule is difficult to learn due to its inherent high dimensionality as a hyper-parameter. In this paper, we propose an AutoSampling method to automatically learn sampling schedules for model training, which consists of a multi-exploitation step aiming for optimal local sampling schedules and an exploration step for the ideal sampling distribution. More specifically, we achieve sampling schedule search with a shortened exploitation cycle to provide enough supervision. In addition, we periodically estimate the sampling distribution from the learned sampling schedules and perturb it to search in the distribution space. The combination of the two searches allows us to learn a robust sampling schedule. We apply our AutoSampling method to a variety of image classification tasks, illustrating the effectiveness of the proposed method.

AAAI Conference 2021 Conference Paper

Dynamic Position-aware Network for Fine-grained Image Recognition

  • Shijie Wang
  • Haojie Li
  • Zhihui Wang
  • Wanli Ouyang

Most weakly supervised fine-grained image recognition (WFGIR) approaches predominantly focus on learning discriminative details which contain the visual variances and position clues. The position clues can be indirectly learnt by utilizing context information of discriminative visual content. However, this causes the selected discriminative regions to contain some non-discriminative information introduced by the position clues. This analysis motivates us to directly introduce position clues into visual content so as to focus only on the visual variances, achieving more precise discriminative region localization. Though important, position modelling usually requires significant pixel/region annotations and is therefore labor-intensive. To address this issue, we propose an end-to-end Dynamic Position-aware Network (DP-Net) to directly incorporate the position clues into visual content and dynamically align them without extra annotations, which eliminates the effect of position information on discriminative variances among subcategories. In particular, the DP-Net consists of: 1) a Position Encoding Module, which learns a set of position-aware parts by directly adding the learnable position information into the horizontal/vertical visual content of images; 2) a Position-vision Aligning Module, which dynamically aligns both visual content and learnable position information via performing graph convolution on position-aware parts; 3) a Position-vision Reorganization Module, which projects the aligned position clues and visual content into the Euclidean space to construct position-aware feature maps. Finally, the position-aware feature maps, which implicitly combine the aligned visual content and position clues, are used for more accurate discriminative region localization. Extensive experiments verify that DP-Net yields the best performance under the same settings as the most competitive approaches on the CUB Bird, Stanford-Cars, and FGVC Aircraft datasets.

AAAI Conference 2021 Conference Paper

Gradient Regularized Contrastive Learning for Continual Domain Adaptation

  • Shixiang Tang
  • Peng Su
  • Dapeng Chen
  • Wanli Ouyang

Human beings can quickly adapt to environmental changes by leveraging learning experience. However, adapting deep neural networks to dynamic environments by machine learning algorithms remains a challenge. To better understand this issue, we study the problem of continual domain adaptation, where the model is presented with a labelled source domain and a sequence of unlabelled target domains. The obstacles in this problem are both domain shift and catastrophic forgetting. We propose Gradient Regularized Contrastive Learning (GRCL) to overcome these obstacles. At the core of our method, gradient regularization plays two key roles: (1) enforcing the gradient not to harm the discriminative ability of source features, which can, in turn, benefit the adaptation ability of the model to target domains; (2) constraining the gradient not to increase the classification loss on old target domains, which enables the model to preserve its performance on old target domains when adapting to an incoming target domain. Experiments on the Digits, DomainNet and Office-Caltech benchmarks demonstrate the strong performance of our approach when compared to the state-of-the-art.
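The second constraint above — an update must not increase the loss on old domains — is often realized as a gradient projection of the kind used in gradient episodic memory. The sketch below shows that generic construction, not necessarily GRCL's exact update: if the current gradient conflicts with a reference gradient from an old domain (negative inner product), the conflicting component is removed.

```python
import numpy as np

def regularize_gradient(g, g_ref):
    """If g conflicts with the reference gradient (negative inner
    product, i.e. the update would increase the reference loss to
    first order), project out the conflicting component."""
    dot = g @ g_ref
    if dot >= 0:
        return g                              # no conflict: keep as-is
    return g - (dot / (g_ref @ g_ref)) * g_ref

g_ref = np.array([1.0, 0.0])                  # gradient on an old domain
g_conflict = np.array([-1.0, 1.0])            # opposes g_ref
g_fixed = regularize_gradient(g_conflict, g_ref)
# after projection the update is orthogonal to g_ref, so it no longer
# increases the old-domain loss (to first order)
```

The same projection applied against the source-domain gradient would implement the first role: never harming the discriminative source features while adapting.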

AAAI Conference 2020 Conference Paper

Channel Pruning Guided by Classification Loss and Feature Importance

  • Jinyang Guo
  • Wanli Ouyang
  • Dong Xu

In this work, we propose a new layer-by-layer channel pruning method called Channel Pruning guided by classification Loss and feature Importance (CPLI). In contrast to the existing layer-by-layer channel pruning approaches that only consider how to reconstruct the features from the next layer, our approach additionally takes the classification loss into account in the channel pruning process. We also observe that some reconstructed features will be removed at the next pruning stage, so it is unnecessary to reconstruct these features. To this end, we propose a new strategy to suppress the influence of unimportant features (i.e., features that will be removed at the next pruning stage). Our comprehensive experiments on three benchmark datasets, i.e., CIFAR-10, ImageNet, and UCF-101, demonstrate the effectiveness of our CPLI method.

ICLR Conference 2020 Conference Paper

Computation Reallocation for Object Detection

  • Feng Liang
  • Chen Lin 0003
  • Ronghao Guo
  • Ming Sun 0008
  • Wei Wu 0021
  • Junjie Yan
  • Wanli Ouyang

The allocation of computation resources in the backbone is a crucial issue in object detection. However, the allocation pattern used for classification is usually adopted directly for object detectors, which has proved to be sub-optimal. In order to reallocate the engaged computation resources in a more efficient way, we present CR-NAS (Computation Reallocation Neural Architecture Search), which can learn computation reallocation strategies across different feature resolutions and spatial positions directly on the target detection dataset. A two-level reallocation space is proposed for both stage and spatial reallocation. A novel hierarchical search procedure is adopted to cope with the complex search space. We apply CR-NAS to multiple backbones and achieve consistent improvements. Our CR-ResNet50 and CR-MobileNetV2 outperform the baseline by 1.9% and 1.7% COCO AP respectively without any additional computation budget. The models discovered by CR-NAS can be equipped with other powerful detection necks/heads and easily transferred to other datasets, e.g., PASCAL VOC, and other vision tasks, e.g., instance segmentation. Our CR-NAS can thus be used as a plugin to improve the performance of various networks.

AAAI Conference 2020 Conference Paper

DASOT: A Unified Framework Integrating Data Association and Single Object Tracking for Online Multi-Object Tracking

  • Qi Chu
  • Wanli Ouyang
  • Bin Liu
  • Feng Zhu
  • Nenghai Yu

In this paper, we propose an online multi-object tracking (MOT) approach that integrates data association and single object tracking (SOT) with a unified convolutional network (ConvNet), named DASOTNet. The intuition behind integrating data association and SOT is that they can complement each other. Following the Siamese network architecture, DASOTNet consists of the shared feature ConvNet, the data association branch and the SOT branch. Data association is treated as a special re-identification task and solved by learning discriminative features for different targets in the data association branch. To handle the problem that the computational cost of SOT grows intolerably as the number of tracked objects increases, we propose an efficient two-stage tracking method in the SOT branch, which utilizes the merits of correlation features and can simultaneously track all the existing targets within one forward propagation. With feature sharing and the interaction between them, the data association branch and the SOT branch learn to better complement each other. Using a multi-task objective, the whole network can be trained end-to-end. Compared with state-of-the-art online MOT methods, our method is much faster while maintaining comparable performance.

AAAI Conference 2020 Conference Paper

Hierarchical Online Instance Matching for Person Search

  • Di Chen
  • Shanshan Zhang
  • Wanli Ouyang
  • Jian Yang
  • Bernt Schiele

Person search is a challenging task which requires retrieving a person’s image and the corresponding position from an image dataset. It consists of two sub-tasks: pedestrian detection and person re-identification (re-ID). One of the key challenges is to properly combine the two sub-tasks into a unified framework. Existing works usually adopt a straightforward strategy by concatenating a detector and a re-ID model directly, either into an integrated model or into separated models. We argue that simply concatenating detection and re-ID is a sub-optimal solution, and we propose a Hierarchical Online Instance Matching (HOIM) loss which exploits the hierarchical relationship between detection and re-ID to guide the learning of our network. Our novel HOIM loss function harmonizes the objectives of the two sub-tasks and encourages better feature learning. In addition, we improve the loss update policy by introducing Selective Memory Refreshment (SMR) for unlabeled persons, which takes advantage of the potential discrimination power of unlabeled data. From the experiments on two standard person search benchmarks, i.e., CUHK-SYSU and PRW, we achieve state-of-the-art performance, which justifies the effectiveness of our proposed HOIM loss on learning robust features.

NeurIPS Conference 2020 Conference Paper

Improving Auto-Augment via Augmentation-Wise Weight Sharing

  • Keyu Tian
  • Chen Lin
  • Ming Sun
  • Luping Zhou
  • Junjie Yan
  • Wanli Ouyang

The recent progress on automatically searching augmentation policies has boosted the performance substantially for various tasks. A key component of automatic augmentation search is the evaluation process for a particular augmentation policy, which is utilized to return reward and usually runs thousands of times. A plain evaluation process, which includes full model training and validation, would be time-consuming. To achieve efficiency, many choose to sacrifice evaluation reliability for speed. In this paper, we dive into the dynamics of augmented training of the model. This inspires us to design a powerful and efficient proxy task based on the Augmentation-Wise Weight Sharing (AWS) to form a fast yet accurate evaluation process in an elegant way. Comprehensive analysis verifies the superiority of this approach in terms of effectiveness and efficiency. The augmentation policies found by our method achieve superior accuracies compared with existing auto-augmentation search methods. On CIFAR-10, we achieve a top-1 error rate of 1.24%, which is currently the best performing single model without extra training data. On ImageNet, we get a top-1 error rate of 20.36% for ResNet-50, which leads to a 3.34% absolute error rate reduction over the baseline augmentation.

ICRA Conference 2020 Conference Paper

Navigation Command Matching for Vision-based Autonomous Driving

  • Yuxin Pan
  • Jianru Xue
  • Pengfei Zhang 0005
  • Wanli Ouyang
  • Jianwu Fang
  • Xingyu Chen

Learning an optimal policy for the autonomous driving task in complex environments is a long-studied challenge. Imitative reinforcement learning is accepted as a promising approach to learn a robust driving policy through expert demonstrations and interactions with environments. However, this model utilizes non-smooth rewards, which have a negative impact on the matching between navigation commands and trajectory (state-action pairs), and degrade the generalizability of an agent. Smooth rewards are crucial to discriminate actions generated from a sub-optimal policy. In this paper, we propose a navigation command matching (NCM) model to address this issue. There are two key components in NCM: 1) a matching measurer produces smooth navigation rewards that measure the matching between navigation commands and trajectory; 2) an attention-guided agent performs actions given states where salient regions in RGB images (i.e., roadsides, lane markings and dynamic obstacles) are highlighted to amplify their influence on the final model. We obtain navigation rewards and store transitions to a replay buffer after an episode, so NCM is able to discriminate actions generated from a sub-optimal policy. Experiments on the CARLA driving benchmark show our proposed NCM outperforms previous state-of-the-art models on various tasks in terms of the percentage of successfully completed episodes. Moreover, our model improves the generalizability of the agent and obtains good performance even in unseen scenarios.

AAAI Conference 2020 Conference Paper

Part-Level Graph Convolutional Network for Skeleton-Based Action Recognition

  • Linjiang Huang
  • Yan Huang
  • Wanli Ouyang
  • Liang Wang

Recently, graph convolutional networks have achieved remarkable performance for skeleton-based action recognition. In this work, we identify a problem posed by the GCNs for skeleton-based action recognition, namely part-level action modeling. To address this problem, a novel Part-Level Graph Convolutional Network (PL-GCN) is proposed to capture part-level information of skeletons. Different from previous methods, the partition of body parts is learnable rather than manually defined. We propose two part-level blocks, namely Part Relation block (PR block) and Part Attention block (PA block), which are achieved by two differentiable operations, namely graph pooling operation and graph unpooling operation. The PR block aims at learning high-level relations between body parts while the PA block aims at highlighting the important body parts in the action. Integrating the original GCN with the two blocks, the PL-GCN can learn both part-level and joint-level information of the action. Extensive experiments on two benchmark datasets show the state-of-the-art performance on skeleton-based action recognition and demonstrate the effectiveness of the proposed method.

AAAI Conference 2020 Conference Paper

Relational Prototypical Network for Weakly Supervised Temporal Action Localization

  • Linjiang Huang
  • Yan Huang
  • Wanli Ouyang
  • Liang Wang

In this paper, we propose a weakly supervised temporal action localization method on untrimmed videos based on prototypical networks. We observe two challenges posed by weak supervision, namely action-background separation and action relation construction. Unlike previous methods, we propose to achieve action-background separation using only the original videos. To achieve this, a clustering loss is adopted to separate actions from backgrounds and learn intra-compact features, which helps in detecting complete action instances. Besides, a similarity weighting module is devised to further separate actions from backgrounds. To effectively identify actions, we propose to construct relations among actions for prototype learning. A GCN-based prototype embedding module is introduced to generate relational prototypes. Experiments on the THUMOS14 and ActivityNet1.2 datasets show that our method outperforms the state-of-the-art methods.

IJCAI Conference 2018 Conference Paper

Crowd Counting using Deep Recurrent Spatial-Aware Network

  • Lingbo Liu
  • Hongjun Wang
  • Guanbin Li
  • Wanli Ouyang
  • Liang Lin

Crowd counting from unconstrained scene images is a crucial task in many real-world applications like urban surveillance and management, but it is greatly challenged by the camera’s perspective that causes huge appearance variations in people’s scales and rotations. Conventional methods address such challenges by resorting to fixed multi-scale architectures that are often unable to cover the largely varied scales while ignoring the rotation variations. In this paper, we propose a unified neural network framework, named Deep Recurrent Spatial-Aware Network, which adaptively addresses the two issues in a learnable spatial transform module with a region-wise refinement process. Specifically, our framework incorporates a Recurrent Spatial-Aware Refinement (RSAR) module iteratively conducting two components: i) a Spatial Transformer Network that dynamically locates an attentional region from the crowd density map and transforms it to the suitable scale and rotation for optimal crowd estimation; ii) a Local Refinement Network that refines the density map of the attended region with residual learning. Extensive experiments on four challenging benchmarks show the effectiveness of our approach. Specifically, comparing with the existing best-performing methods, we achieve an improvement of 12% on the largest dataset WorldExpo’10 and 22.8% on the most challenging dataset UCF_CC_50.

NeurIPS Conference 2018 Conference Paper

FishNet: A Versatile Backbone for Image, Region, and Pixel Level Prediction

  • Shuyang Sun
  • Jiangmiao Pang
  • Jianping Shi
  • Shuai Yi
  • Wanli Ouyang

The basic principles in designing convolutional neural network (CNN) structures for predicting objects on different levels, e.g., image-level, region-level, and pixel-level, are diverging. Generally, network structures designed specifically for image classification are directly used as the default backbone structure for other tasks including detection and segmentation, but there are few backbone structures designed with the goal of unifying the advantages of networks designed for pixel-level or region-level prediction tasks, which may require very deep features with high resolution. Towards this goal, we design a fish-like network, called FishNet. In FishNet, the information of all resolutions is preserved and refined for the final task. Besides, we observe that existing works still cannot directly propagate the gradient information from deep layers to shallow layers. Our design can better handle this problem. Extensive experiments have been conducted to demonstrate the remarkable performance of FishNet. In particular, on ImageNet-1k, the accuracy of FishNet is able to surpass the performance of DenseNet and ResNet with fewer parameters. FishNet was applied as one of the modules in the winning entry of the COCO Detection 2018 challenge. The code is available at https://github.com/kevin-ssy/FishNet.

NeurIPS Conference 2017 Conference Paper

Learning Deep Structured Multi-Scale Features using Attention-Gated CRFs for Contour Prediction

  • Dan Xu
  • Wanli Ouyang
  • Xavier Alameda-Pineda
  • Elisa Ricci
  • Xiaogang Wang
  • Nicu Sebe

Recent works have shown that exploiting multi-scale representations deeply learned via convolutional neural networks (CNN) is of tremendous importance for accurate contour detection. This paper presents a novel approach for predicting contours which advances the state of the art in two fundamental aspects, i.e., multi-scale feature generation and fusion. Different from previous works directly considering multi-scale feature maps obtained from the inner layers of a primary CNN architecture, we introduce a hierarchical deep model which produces richer and more complementary representations. Furthermore, to refine and robustly fuse the representations learned at different scales, novel Attention-Gated Conditional Random Fields (AG-CRFs) are proposed. The experiments run on two publicly available datasets (BSDS500 and NYUDv2) demonstrate the effectiveness of the latent AG-CRF model and of the overall hierarchical framework.

NeurIPS Conference 2016 Conference Paper

CRF-CNN: Modeling Structured Information in Human Pose Estimation

  • Xiao Chu
  • Wanli Ouyang
  • Hongsheng Li
  • Xiaogang Wang

Deep convolutional neural networks (CNN) have achieved great success. On the other hand, modeling structural information has been proved critical in many vision problems. It is of great interest to integrate them effectively. In a classical neural network, there is no message passing between neurons in the same layer. In this paper, we propose a CRF-CNN framework which can simultaneously model structural information in both output and hidden feature layers in a probabilistic way, and it is applied to human pose estimation. A message passing scheme is proposed, so that in various layers each body joint receives messages from all the others in an efficient way. Such message passing can be implemented with convolution between feature maps in the same layer, and it is also integrated with feedforward propagation in neural networks. Finally, a neural network implementation of end-to-end learning CRF-CNN is provided. Its effectiveness is demonstrated through experiments on two benchmark datasets.

ICML Conference 2016 Conference Paper

Multi-Bias Non-linear Activation in Deep Neural Networks

  • Hongyang Li 0001
  • Wanli Ouyang
  • Xiaogang Wang 0001

As a widely used non-linear activation, Rectified Linear Unit (ReLU) separates noise and signal in a feature map by learning a threshold or bias. However, we argue that the classification of noise and signal not only depends on the magnitude of responses, but also the context of how the feature responses would be used to detect more abstract patterns in higher layers. In order to output multiple response maps with magnitude in different ranges for a particular visual pattern, existing networks employing ReLU and its variants have to learn a large number of redundant filters. In this paper, we propose a multi-bias non-linear activation (MBA) layer to explore the information hidden in the magnitudes of responses. It is placed after the convolution layer to decouple the responses to a convolution kernel into multiple maps by multi-thresholding magnitudes, thus generating more patterns in the feature space at a low computational cost. It provides great flexibility of selecting responses to different visual patterns in different magnitude ranges to form rich representations in higher layers. Such a simple and yet effective scheme achieves the state-of-the-art performance on several benchmarks.
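The decoupling step described above — one convolution response expanded into several maps by multi-thresholding magnitudes — can be sketched with a toy MBA layer. The bias values and the helper name `mba_layer` are illustrative assumptions; the paper learns the biases jointly with the network.

```python
import numpy as np

def mba_layer(feature_map, biases):
    """Multi-bias activation sketch: replicate each input channel once
    per bias, shift by that bias, then apply ReLU, so a single
    convolution response yields multiple maps thresholded at
    different magnitudes."""
    outs = [np.maximum(feature_map + b, 0.0) for b in biases]
    return np.concatenate(outs, axis=0)   # channel count grows by len(biases)

x = np.array([[-0.5, 0.2, 1.0]])          # one channel, three positions
y = mba_layer(x, biases=[0.0, 0.5])
# y has two channels: a plain ReLU map, and a map where the +0.5 bias
# lets weaker responses (e.g. the -0.5 entry's neighborhood) pass through
```

Because the shift-and-ReLU expansion is nearly free compared to learning extra convolution filters, higher layers can select among the magnitude ranges at low computational cost.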