Arrow Research

Author name cluster

Ping Luo

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

75 papers
2 author rows

Possible papers (75)

AAAI Conference 2026 Conference Paper

FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation

  • Shilong Zhang
  • Wenbo Li
  • Shoufa Chen
  • Chongjian Ge
  • Peize Sun
  • Yifu Zhang
  • Yi Jiang
  • Zehuan Yuan

Diffusion Transformer (DiT) models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high-resolution outputs, further amplifying computational demands—especially for single-stage DiT models. To address these challenges, we propose a novel two-stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized through a low-resolution generation process that uses a large-parameter model and sufficient NFEs while keeping computation affordable. The second stage achieves a nearly straight ODE trajectory between low and high resolutions via flow matching, effectively generating fine details and fixing artifacts with minimal NFEs. To ensure a seamless connection between the two independently trained stages during inference, we carefully design degradation strategies during second-stage training. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high-resolution video generation with superior computational efficiency. Additionally, the two-stage design enables users to preview the initial output and adjust the prompt accordingly before committing to full-resolution generation, thereby significantly reducing computational costs and wait times as well as enhancing commercial viability.

AAAI Conference 2026 Conference Paper

Laytrol: Preserving Pretrained Knowledge in Layout Control for Multimodal Diffusion Transformers

  • Sida Huang
  • Siqi Huang
  • Ping Luo
  • Hongyuan Zhang

With the development of diffusion models, enhancing spatial controllability in text-to-image generation has become a vital challenge. As a representative task for addressing this challenge, layout-to-image generation aims to generate images that are spatially consistent with the given layout condition. Existing layout-to-image methods typically introduce the layout condition by integrating adapter modules into the base generative model. However, the generated images often exhibit low visual quality and stylistic inconsistency with the base model, indicating a loss of pretrained knowledge. To alleviate this issue, we construct the Layout Synthesis (LaySyn) dataset, which leverages images synthesized by the base model itself to mitigate the distribution shift from the pretraining data. Moreover, we propose the Layout Control (Laytrol) Network, in which parameters are inherited from MM-DiT to preserve the pretrained knowledge of the base model. To effectively activate the copied parameters and avoid disturbance from unstable control conditions, we adopt a dedicated initialization scheme for Laytrol. In this scheme, the layout encoder is initialized as a pure text encoder to ensure that its output tokens remain within the data domain of MM-DiT. Meanwhile, the outputs of the layout control network are initialized to zero. In addition, we apply Object-level Rotary Position Embedding to the layout tokens to provide coarse positional information. Qualitative and quantitative experiments demonstrate the effectiveness of our method.

AAAI Conference 2025 Conference Paper

AnalogCoder: Analog Circuit Design via Training-Free Code Generation

  • Yao Lai
  • Sungyoung Lee
  • Guojin Chen
  • Souradip Poddar
  • Mengkang Hu
  • David Z. Pan
  • Ping Luo

Analog circuit design is a significant task in modern chip technology, focusing on the selection of component types, connectivity, and parameters to ensure proper circuit functionality. Despite advances made by Large Language Models (LLMs) in digital circuit design, the complexity and scarcity of data in analog circuitry pose significant challenges. To mitigate these issues, we introduce AnalogCoder, the first training-free LLM agent for designing analog circuits through Python code generation. Firstly, AnalogCoder incorporates a feedback-enhanced flow with tailored domain-specific prompts, enabling the automated and self-correcting design of analog circuits with a high success rate. Secondly, it provides a circuit tool library that archives successful designs as reusable modular sub-circuits, simplifying composite circuit creation. Thirdly, extensive experiments on a benchmark designed to cover a wide range of analog circuit tasks show that AnalogCoder outperforms other LLM-based methods. It has successfully designed 20 circuits, 5 more than standard GPT-4o. We believe AnalogCoder can significantly improve the labor-intensive chip design process, enabling non-experts to design analog circuits efficiently.
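
As a rough sketch of the feedback-enhanced flow described in the abstract (not the authors' implementation; `call_llm` and `simulate` are hypothetical stand-ins), the loop below generates circuit code, checks it in a simulator, and feeds errors back for self-correction:

```python
# Illustrative feedback-enhanced design loop; the helper functions are
# hypothetical stand-ins, not AnalogCoder's actual API.

def call_llm(prompt: str) -> str:
    """Stand-in for an LLM call that returns Python circuit code."""
    raise NotImplementedError

def simulate(circuit_code: str) -> tuple[bool, str]:
    """Stand-in for running the generated code in a circuit simulator.
    Returns (passed, error_log)."""
    raise NotImplementedError

def design_circuit(task: str, max_rounds: int = 5) -> str | None:
    prompt = f"Write Python circuit code for: {task}"
    for _ in range(max_rounds):
        code = call_llm(prompt)
        passed, log = simulate(code)
        if passed:
            return code  # could now be archived as a reusable sub-circuit
        # Self-correction: feed the simulator error back into the next prompt.
        prompt = f"{prompt}\n\nPrevious attempt failed with:\n{log}\nPlease fix it."
    return None
```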

AAAI Conference 2025 Conference Paper

AutoMMLab: Automatically Generating Deployable Models from Language Instructions for Computer Vision Tasks

  • Zekang Yang
  • Wang Zeng
  • Sheng Jin
  • Chen Qian
  • Ping Luo
  • Wentao Liu

Automated machine learning (AutoML) is a collection of techniques designed to automate the machine learning development process. While traditional AutoML approaches have been successfully applied to several critical steps of model development (e.g., hyperparameter optimization), there is no AutoML system that automates the entire end-to-end model production workflow for computer vision. To fill this gap, we propose a novel request-to-model task, which involves understanding the user's natural language request and executing the entire workflow to output a production-ready model. This empowers non-expert individuals to easily build task-specific models via a user-friendly language interface. To facilitate development and evaluation, we develop a new experimental platform called AutoMMLab and a new benchmark called LAMP for studying key components in the end-to-end request-to-model pipeline. Hyperparameter optimization (HPO) is one of the most important components of AutoML. Traditional approaches mostly rely on trial and error, leading to inefficient parameter search. To solve this problem, we propose a novel LLM-based HPO algorithm, called HPO-LLaMA. Equipped with extensive knowledge and experience in model hyperparameter tuning, HPO-LLaMA achieves a significant improvement in HPO efficiency.

TMLR Journal 2025 Journal Article

Autoregressive Models in Vision: A Survey

  • Jing Xiong
  • Gongye Liu
  • Lun Huang
  • Chengyue Wu
  • Taiqiang Wu
  • Yao Mu
  • Yuan Yao
  • Hui Shen

Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel in producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. However, the representation strategy in computer vision can vary across different levels, i.e., pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse research backgrounds, we start with preliminary sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories, including pixel-based, token-based, and scale-based models, according to the representation strategy. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multifaceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multimodal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges to autoregressive models in vision with suggestions about potential research directions. We have also set up a GitHub repository to organize the papers included in this survey at: https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey.
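
For orientation, the three sub-categories share the standard autoregressive factorization; what changes is the prediction unit (a standard formulation, not notation taken from the survey itself):

```latex
p(x) = \prod_{t=1}^{T} p\left(x_t \mid x_{<t}\right)
```

where x_t is a pixel (pixel-based), a discrete token (token-based), or an entire scale (scale-based), depending on the representation strategy.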

NeurIPS Conference 2025 Conference Paper

DETree: DEtecting Human-AI Collaborative Texts via Tree-Structured Hierarchical Representation Learning

  • Yongxin He
  • Shan Zhang
  • Yixuan Cao
  • Lei Ma
  • Ping Luo

Detecting AI-involved text is essential for combating misinformation, plagiarism, and academic misconduct. However, AI text generation includes diverse collaborative processes (AI-written text edited by humans, human-written text edited by AI, and AI-generated text refined by other AI), where various or even new LLMs could be involved. Texts generated through these varied processes exhibit complex characteristics, presenting significant challenges for detection. Current methods model these processes rather crudely, primarily employing binary classification (purely human vs. AI-involved) or multi-classification (treating human-AI collaboration as a new class). We observe that representations of texts generated through different processes exhibit inherent clustering relationships. Therefore, we propose DETree, a novel approach that models the relationships among different processes as a Hierarchical Affinity Tree structure, and introduces a specialized loss function that aligns text representations with this tree. To facilitate this learning, we developed RealBench, a comprehensive benchmark dataset that automatically incorporates a wide spectrum of hybrid texts produced through various human-AI collaboration processes. Our method improves performance in hybrid text detection tasks and significantly enhances robustness and generalization in out-of-distribution scenarios, particularly in few-shot learning conditions, further demonstrating the promise of training-based approaches in OOD settings. Our code and dataset are available at https://github.com/heyongxin233/DETree.

AAAI Conference 2025 Conference Paper

End-to-End Autonomous Driving Through V2X Cooperation

  • Haibao Yu
  • Wenxian Yang
  • Jiaru Zhong
  • Zhenwei Yang
  • Siqi Fan
  • Ping Luo
  • Zaiqing Nie

Cooperatively utilizing both ego-vehicle and infrastructure sensor data via V2X communication has emerged as a promising approach for advanced autonomous driving. However, current research mainly focuses on improving individual modules rather than on end-to-end learning that optimizes final planning performance, leaving the data's potential underutilized. In this paper, we introduce UniV2X, a pioneering cooperative autonomous driving framework that seamlessly integrates all key driving modules across diverse views into a unified network. We propose a sparse-dense hybrid data transmission and fusion mechanism for effective vehicle-infrastructure cooperation, offering three advantages: 1) effective for simultaneously enhancing agent perception, online mapping, and occupancy prediction, ultimately improving planning performance; 2) transmission-friendly under practical and limited communication conditions; 3) reliable data fusion with interpretability of the hybrid data. We implement UniV2X, as well as several reproduced benchmark methods, on the challenging DAIR-V2X dataset, a real-world cooperative driving dataset. Experimental results demonstrate the effectiveness of UniV2X in significantly enhancing planning performance, as well as the performance of all intermediate outputs.

NeurIPS Conference 2025 Conference Paper

FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

  • Jin Wang
  • Yao Lai
  • Aoxue Li
  • Shifeng Zhang
  • Jiacheng Sun
  • Ning Kang
  • Chengyue Wu
  • Zhenguo Li

The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted reasoning abilities in causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model purely based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction capability and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models. Furthermore, we show that applying test-time scaling techniques to FUDOKI yields significant performance gains, further underscoring its promise for future enhancement through reinforcement learning.

NeurIPS Conference 2025 Conference Paper

OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation

  • Mengkang Hu
  • Yuhang Zhou
  • Wendong Fan
  • Yuzhou Nie
  • Ziyu Ye
  • Bowei Xia
  • Tao Sun
  • Zhaoxuan Jin

Large Language Model (LLM)-based multi-agent systems show promise for automating real-world tasks but struggle to transfer across domains due to their domain-specific nature. Current approaches face two critical shortcomings: they require complete architectural redesign and full retraining of all components when applied to new domains. We introduce Workforce, a hierarchical multi-agent framework that decouples strategic planning from specialized execution through a modular architecture comprising: (i) a domain-agnostic Planner for task decomposition, (ii) a Coordinator for subtask management, and (iii) specialized Workers with domain-specific tool-calling capabilities. This decoupling enables cross-domain transferability during both inference and training phases: During inference, Workforce seamlessly adapts to new domains by adding or modifying worker agents; For training, we introduce Optimized Workforce Learning (OWL), which improves generalization across domains by optimizing a domain-agnostic planner with reinforcement learning from real-world feedback. To validate our approach, we evaluate Workforce on the GAIA benchmark, covering various realistic, multi-domain agentic tasks. Experimental results demonstrate Workforce achieves open-source state-of-the-art performance (69.70%), outperforming commercial systems like OpenAI's Deep Research by 2.34%. More notably, our OWL-trained 32B model achieves 52.73% accuracy (+16.37%) and demonstrates performance comparable to GPT-4o on challenging tasks. To summarize, by enabling scalable generalization and modular domain transfer, our work establishes a foundation for the next generation of general-purpose AI assistants. Our code is available at Anonymous URL, and our data is available at Anonymous URL.
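
A minimal sketch of the planner/coordinator/worker decoupling described above (interfaces are assumed for illustration; this is not the released OWL code):

```python
# Sketch of a Workforce-style hierarchy; plan() and route() are hypothetical.
from typing import Callable

def plan(task: str) -> list[str]:
    """Stand-in for the domain-agnostic Planner decomposing a task."""
    raise NotImplementedError

def route(subtask: str) -> str:
    """Stand-in for the Coordinator assigning a subtask to a worker."""
    raise NotImplementedError

def run_workforce(task: str, workers: dict[str, Callable[[str], str]]) -> list[str]:
    results = []
    for subtask in plan(task):
        # Fall back to an assumed "general" worker when no specialist matches.
        worker = workers.get(route(subtask), workers["general"])
        results.append(worker(subtask))  # domain-specific tool-calling happens here
    return results

# Cross-domain transfer at inference time: add a worker, keep the planner unchanged.
# workers["web_search"] = my_search_agent
```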

NeurIPS Conference 2025 Conference Paper

OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis

  • Junting Chen
  • Haotian Liang
  • Lingxiao Du
  • Weiyun Wang
  • Mengkang Hu
  • Yao Mu
  • Wenhai Wang
  • Jifeng Dai

The rapid progress of navigation, manipulation, and vision models has made mobile manipulators capable of many specialized tasks. However, the open-world mobile manipulation (OWMM) task remains a challenge due to the need to generalize to open-ended instructions and environments, as well as the systemic complexity of integrating high-level decision making with low-level robot control based on both global scene understanding and the current agent state. To address this complexity, we propose a novel multi-modal agent architecture that maintains multi-view scene frames and agent states for decision-making and controls the robot by function calling. A second challenge is hallucination from domain shift. To enhance agent performance, we further introduce an agentic data synthesis pipeline for the OWMM task to adapt the VLM model to our task domain with instruction fine-tuning. We highlight our fine-tuned OWMM-VLM as the first dedicated foundation model for mobile manipulators with global scene understanding, robot state tracking, and multi-modal action generation in a unified model. Through experiments, we demonstrate that our model achieves SOTA performance compared to other foundation models, including GPT-4o, and strong zero-shot generalization in the real world. The project page is at https://hhyhrhy.github.io/owmm-agent-project.

NeurIPS Conference 2025 Conference Paper

SpecEM: Training-Free LLM Ensembling via Iterative Drafting, Verification, and Online Feedback

  • Bo Lv
  • Nayu Liu
  • Chen Tang
  • Xin Liu
  • Yue Yu
  • Ping Luo

Ensembles of generative large language models (LLMs) are a promising way to compensate for individual model limitations, integrating the strengths of different LLMs. Existing LLM ensemble methods, however, face limitations such as first-token delay and challenges in long-range semantic collaboration between models. Moreover, they typically assume equal voting weights for all models during ensembling, ignoring performance differences between models on a given task. In this work, we propose SpecEM, a training-free, plug-and-play LLM ensemble framework that dynamically adjusts each model's contribution in real time based on task performance. Inspired by speculative decoding, SpecEM iteratively performs drafting and verification, allowing models to collaborate semantically at the segment level for integrated output. Furthermore, we introduce an online feedback mechanism with multiplicative weight updates, where each model's voting weight is adjusted on the fly according to how often it "outperforms" others during the verification stage, ensuring that stronger models exert greater influence on the ensemble during generation. Experimental results on five popular LLMs (ranging from 7B to 72B parameters) and six benchmark tasks, spanning instruction following, reasoning, commonsense, and general instruction response, demonstrate consistent performance improvements over state-of-the-art LLM ensemble methods.
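
A minimal sketch of the multiplicative-weight update mentioned above, under the assumption that each verification round yields a per-model win indicator; the variable names and the learning rate are illustrative, not the paper's:

```python
# Illustrative multiplicative-weight update for ensemble voting weights.

def update_weights(weights: dict[str, float], wins: dict[str, bool],
                   eta: float = 0.1) -> dict[str, float]:
    """Boost models whose draft segment won verification; renormalize."""
    new = {m: w * ((1.0 + eta) if wins[m] else 1.0) for m, w in weights.items()}
    total = sum(new.values())
    return {m: w / total for m, w in new.items()}

weights = {"model_a": 1 / 3, "model_b": 1 / 3, "model_c": 1 / 3}
weights = update_weights(weights, {"model_a": True, "model_b": False, "model_c": False})
print(weights)  # model_a's voting weight increases for the next round
```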

NeurIPS Conference 2025 Conference Paper

TREND: Unsupervised 3D Representation Learning via Temporal Forecasting for LiDAR Perception

  • Runjian Chen
  • Hyoungseob Park
  • Bo Zhang
  • Wenqi Shao
  • Ping Luo
  • Alex Wong

Labeling LiDAR point clouds is notoriously time- and energy-consuming, which spurs recent unsupervised 3D representation learning methods to alleviate the labeling burden in LiDAR perception via pretrained weights. Existing work focuses on either masked autoencoding or contrastive learning on LiDAR point clouds, neglecting the temporal LiDAR sequence that naturally accounts for object motion (and its semantics). Instead, we propose TREND, short for Temporal REndering with Neural fielD, to learn 3D representations via forecasting future observations in an unsupervised manner. TREND integrates forecasting into 3D pre-training through a Recurrent Embedding scheme that generates 3D embeddings across time and a Temporal LiDAR Neural Field designed specifically for the LiDAR modality to represent the 3D scene, with which we compute the loss using differentiable rendering. We evaluate TREND on 3D object detection and LiDAR semantic segmentation tasks on popular datasets, including Once, Waymo, NuScenes, and SemanticKITTI. TREND generally improves from-scratch models across datasets and tasks, bringing gains of 1.77% mAP on Once and 2.11% mAP on NuScenes, up to 400% more improvement than previous SOTA unsupervised 3D pre-training methods. Codes and models will be available.

NeurIPS Conference 2025 Conference Paper

WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception

  • Zhiheng Liu
  • Xueqing Deng
  • Shoufa Chen
  • Angtian Wang
  • Qiushan Guo
  • Mingfei Han
  • Zeyue Xue
  • Mengzhao Chen

Generative video modeling has made significant strides, yet ensuring structural and temporal consistency over long sequences remains a challenge. Current methods predominantly rely on RGB signals, leading to accumulated errors in object structure and motion over extended durations. To address these issues, we introduce WorldWeaver, a robust framework for long video generation that jointly models RGB frames and perceptual conditions within a unified long-horizon modeling scheme. Our training framework offers three key advantages. First, by jointly predicting perceptual conditions and color information from a unified representation, it significantly enhances temporal consistency and motion dynamics. Second, by leveraging depth cues, which we observe to be more resistant to drift than RGB, we construct a memory bank that preserves clearer contextual information, improving quality in long-horizon video generation. Third, we employ segmented noise scheduling for training prediction groups, which further mitigates drift and reduces computational cost. Extensive experiments on both diffusion and rectified flow-based models demonstrate the effectiveness of WorldWeaver in reducing temporal drift and improving the fidelity of generated videos.

AAAI Conference 2024 Conference Paper

Cached Transformers: Improving Transformers with Differentiable Memory Cache

  • Zhaoyang Zhang
  • Wenqi Shao
  • Yixiao Ge
  • Xiaogang Wang
  • Jinwei Gu
  • Ping Luo

This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention enables attending to both past and current tokens, increasing the receptive field of attention and allowing for exploring long-range dependencies. By utilizing a recurrent gating unit to continuously update the cache, our model achieves significant advancements in six language and vision tasks, including language modeling, machine translation, ListOPs, image classification, object detection, and instance segmentation. Furthermore, our approach surpasses previous memory-based techniques in tasks such as language modeling and displays the ability to be applied to a broader range of situations.
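
A bare-bones reading of the gated cache update (a sketch under assumed shapes, not the paper's exact GRC formulation): a learned sigmoid gate interpolates between the old cache and a summary of the current tokens, and attention then runs over the cache plus the current tokens.

```python
import torch
from torch import nn

def grc_cache_update(cache: torch.Tensor, tokens: torch.Tensor,
                     gate_proj: nn.Linear) -> torch.Tensor:
    """Sketch of a gated recurrent cache update.
    cache:  (batch, cache_len, dim) differentiable token memory
    tokens: (batch, seq_len, dim) current token embeddings
    """
    summary = tokens.mean(dim=1, keepdim=True).expand_as(cache)  # crude token summary
    g = torch.sigmoid(gate_proj(cache))                          # per-slot gate in (0, 1)
    return g * cache + (1.0 - g) * summary                       # recurrent interpolation

# Attention can then attend over torch.cat([cache, tokens], dim=1)
# to reach past context beyond the current sequence.
```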

NeurIPS Conference 2024 Conference Paper

Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation

  • Qingwen Bu
  • Jia Zeng
  • Li Chen
  • Yanchao Yang
  • Guyue Zhou
  • Junchi Yan
  • Ping Luo
  • Heming Cui

Despite significant progress in robotics and embodied AI in recent years, deploying robots for long-horizon tasks remains a great challenge. The majority of prior work adheres to an open-loop philosophy and lacks real-time feedback, leading to error accumulation and poor robustness. A handful of approaches have endeavored to establish feedback mechanisms leveraging pixel-level differences or pre-trained visual representations, yet their efficacy and adaptability have proven limited. Inspired by classic closed-loop control systems, we propose CLOVER, a closed-loop visuomotor control framework that incorporates feedback mechanisms to improve adaptive robotic control. CLOVER consists of a text-conditioned video diffusion model for generating visual plans as reference inputs, a measurable embedding space for accurate error quantification, and a feedback-driven controller that refines actions from feedback and initiates replanning as needed. Our framework exhibits notable advancement in real-world robotic tasks and achieves state-of-the-art results on the CALVIN benchmark, improving by 8% over previous open-loop counterparts. Code and checkpoints are maintained at https://github.com/OpenDriveLab/CLOVER.
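
A skeleton of the plan/track/replan cycle the abstract describes, purely as a sketch: every helper here is a hypothetical stand-in, not the released CLOVER interface.

```python
# Closed-loop visuomotor control skeleton with hypothetical helpers.

def generate_visual_plan(instruction: str) -> list:
    """Stand-in for the text-conditioned video diffusion model (subgoal frames)."""
    raise NotImplementedError

def embed(frame):
    """Stand-in for the measurable embedding space used for error quantification."""
    raise NotImplementedError

def distance(a, b) -> float:
    raise NotImplementedError

def closed_loop_control(instruction, env, policy,
                        reach_tol=0.1, replan_tol=0.5, horizon=100):
    plan, goal = generate_visual_plan(instruction), 0
    obs = env.reset()
    for _ in range(horizon):
        error = distance(embed(obs), embed(plan[goal]))
        if error > replan_tol:                 # tracking failed: replan
            plan, goal = generate_visual_plan(instruction), 0
        elif error < reach_tol:                # subgoal reached: advance
            goal = min(goal + 1, len(plan) - 1)
        obs, done = env.step(policy(obs, plan[goal]))
        if done:
            break
```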

NeurIPS Conference 2024 Conference Paper

ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Ablation Capability for Large Vision-Language Models

  • Shuo Liu
  • Kaining Ying
  • Hao Zhang
  • Yue Yang
  • Yuqi Lin
  • Tianle Zhang
  • Chuanhao Li
  • Yu Qiao

Multi-turn visual conversation is an important ability of real-world AI assistants, yet a corresponding evaluation benchmark has been missing. This paper presents ConvBench, a multi-turn conversation benchmark with hierarchical capability-ablation evaluation for Large Vision-Language Models (LVLMs). ConvBench comprises 577 curated multi-turn conversations encompassing 215 tasks. These tasks are broad and open-ended, resembling real-world user behaviors. ConvBench progressively examines the LVLMs' perception, reasoning, and creativity capabilities in each conversation, and it can decouple these capabilities in evaluation and thus perform reliable error attribution. Considering the diversity of open-ended questions, we also introduce an efficient and reliable automatic evaluation framework. Experimental results reveal that ConvBench poses a significant challenge for current LVLMs, even for GPT4V, which achieves only a 39.51% score. We also report several insightful findings, such as that the weak perception of LVLMs inhibits their authentic strengths in reasoning and creation. We believe our design of hierarchical capabilities, decoupled capability evaluation, and multi-turn conversation can blaze a new trail in LVLM evaluation. Code and benchmark are released at https://github.com/shirlyliu64/ConvBench.

AAAI Conference 2024 Conference Paper

DeepAccident: A Motion and Accident Prediction Benchmark for V2X Autonomous Driving

  • Tianqi Wang
  • Sukmin Kim
  • Ji Wenxuan
  • Enze Xie
  • Chongjian Ge
  • Junsong Chen
  • Zhenguo Li
  • Ping Luo

Safety is the primary priority of autonomous driving. Nevertheless, no published dataset currently supports direct and explainable safety evaluation for autonomous driving. In this work, we propose DeepAccident, a large-scale dataset generated via a realistic simulator containing diverse accident scenarios that frequently occur in real-world driving. The proposed DeepAccident dataset includes 57K annotated frames and 285K annotated samples, approximately 7 times more than the large-scale nuScenes dataset with its 40K annotated samples. In addition, we propose a new task, end-to-end motion and accident prediction, which can be used to directly evaluate the accident prediction ability of different autonomous driving algorithms. Furthermore, for each scenario, we deploy four vehicles along with one infrastructure unit to record data, thus providing diverse viewpoints for accident scenarios and enabling V2X (vehicle-to-everything) research on perception and prediction tasks. Finally, we present a baseline V2X model named V2XFormer that demonstrates superior performance for motion and accident prediction and 3D object detection compared to the single-vehicle model.

ICML Conference 2024 Conference Paper

Diagnosing the Compositional Knowledge of Vision Language Models from a Game-Theoretic View

  • Jin Wang 0039
  • Shichao Dong 0001
  • Yapeng Zhu
  • Kelu Yao
  • Weidong Zhao
  • Chao Li 0028
  • Ping Luo

Compositional reasoning capabilities are usually considered as fundamental skills to characterize human perception. Recent studies show that current Vision Language Models (VLMs) surprisingly lack sufficient knowledge with respect to such capabilities. To this end, we propose to thoroughly diagnose the composition representations encoded by VLMs, systematically revealing the potential cause for this weakness. Specifically, we propose evaluation methods from a novel game-theoretic view to assess the vulnerability of VLMs on different aspects of compositional understanding, e.g., relations and attributes. Extensive experimental results demonstrate and validate several insights to understand the incapabilities of VLMs on compositional reasoning, which provide useful and reliable guidance for future studies. The deliverables will be updated here.

NeurIPS Conference 2024 Conference Paper

MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts

  • Jie Zhu
  • Yixiong Chen
  • Mingyu Ding
  • Ping Luo
  • Leye Wang
  • Jingdong Wang

Text-to-image diffusion has attracted vast attention due to its impressive image-generation capabilities. However, when it comes to human-centric text-to-image generation, particularly in the context of faces and hands, the results often fall short of naturalness due to insufficient training priors. We alleviate the issue in this work from two perspectives. 1) From the data aspect, we carefully collect a human-centric dataset comprising over one million high-quality human-in-the-scene images and two specific sets of close-up images of faces and hands. These datasets collectively provide a rich prior knowledge base to enhance the human-centric image generation capabilities of the diffusion model. 2) On the methodological front, we propose a simple yet effective method called Mixture of Low-rank Experts (MoLE), which treats low-rank modules trained on close-up hand and face images respectively as experts. This concept draws inspiration from our observation of low-rank refinement, where a low-rank module trained on a customized close-up dataset has the potential to enhance the corresponding image part when applied at an appropriate scale. To validate the superiority of MoLE for human-centric image generation against the state of the art, we construct two benchmarks and perform evaluations with diverse metrics and human studies. Datasets, model, and code are released at https://sites.google.com/view/mole4diffuser/.
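
A minimal sketch of what a gated mixture of low-rank experts could look like, assuming LoRA-style experts combined by a soft gate; the gating design and placement are assumptions, not the paper's exact architecture:

```python
import torch
from torch import nn

class LowRankExpert(nn.Module):
    """One LoRA-style expert: a low-rank residual branch."""
    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)
        self.up = nn.Linear(rank, d_out, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

class MoLELayer(nn.Module):
    """Sketch: base layer output plus a gated sum of low-rank experts
    (e.g., one trained on face close-ups, one on hand close-ups)."""
    def __init__(self, base: nn.Linear, num_experts: int = 2, rank: int = 8):
        super().__init__()
        self.base = base
        self.experts = nn.ModuleList(
            LowRankExpert(base.in_features, base.out_features, rank)
            for _ in range(num_experts))
        self.gate = nn.Linear(base.in_features, num_experts)  # assumed soft gating

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.softmax(self.gate(x), dim=-1)               # (..., num_experts)
        mix = sum(g[..., i:i + 1] * e(x) for i, e in enumerate(self.experts))
        return self.base(x) + mix
```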

NeurIPS Conference 2024 Conference Paper

Needle In A Multimodal Haystack

  • Weiyun Wang
  • Shuibo Zhang
  • Yiming Ren
  • Yuchen Duan
  • Tiantong Li
  • Shuo Liu
  • Mengkang Hu
  • Zhe Chen

With the rapid advancement of multimodal large language models (MLLMs), their evaluation has become increasingly comprehensive. However, understanding long multimodal content, as a foundational ability for real-world applications, remains underexplored. In this work, we present Needle In A Multimodal Haystack (MM-NIAH), the first benchmark specifically designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents. Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning. In each task, the model is required to answer the questions according to different key information scattered throughout the given multimodal document. Evaluating the leading MLLMs on MM-NIAH, we observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation. We hope this work can provide a platform for further research on long multimodal document comprehension and contribute to the advancement of MLLMs. Code and benchmark are released at https://github.com/OpenGVLab/MM-NIAH.

NeurIPS Conference 2024 Conference Paper

Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality

  • Tianle Zhang
  • Langtian Ma
  • Yuchen Yan
  • Yuchen Zhang
  • Kai Wang
  • Yue Yang
  • Ziyao Guo
  • Wenqi Shao

Recent text-to-video (T2V) technology advancements, as demonstrated by models such as Gen2, Pika, and Sora, have significantly broadened its applicability and popularity. Despite these strides, evaluating these models poses substantial challenges. Primarily, due to the limitations inherent in automatic metrics, manual evaluation is often considered a superior method for assessing T2V generation. However, existing manual evaluation protocols face reproducibility, reliability, and practicality issues. To address these challenges, this paper introduces the Text-to-Video Human Evaluation (T2VHE) protocol, a comprehensive and standardized protocol for T2V models. The T2VHE protocol includes well-defined metrics, thorough annotator training, and an effective dynamic evaluation module. Experimental results demonstrate that this protocol not only ensures high-quality annotations but can also reduce evaluation costs by nearly 50%. We will open-source the entire setup of the T2VHE protocol, including the complete protocol workflow, the dynamic evaluation component details, and the annotation interface code. This will help communities establish more sophisticated human assessment protocols.

NeurIPS Conference 2024 Conference Paper

Scalable and Effective Arithmetic Tree Generation for Adder and Multiplier Designs

  • Yao Lai
  • Jinxin Liu
  • David Z. Pan
  • Ping Luo

Across a wide range of hardware scenarios, the computational efficiency and physical size of the arithmetic units significantly influence the speed and footprint of the overall hardware system. Nevertheless, the effectiveness of prior arithmetic design techniques proves inadequate, as they do not sufficiently optimize speed and area, resulting in increased latency and larger module size. To boost computing performance, this work focuses on the two most common and fundamental arithmetic modules, adders and multipliers. We cast the design tasks as single-player tree generation games, leveraging reinforcement learning techniques to optimize their arithmetic tree structures. This tree generation formulation allows us to efficiently navigate the vast search space and discover superior arithmetic designs that improve computational efficiency and hardware size within just a few hours. Our proposed method, ArithTreeRL, achieves significant improvements for both adders and multipliers. For adders, our approach discovers designs of 128-bit adders that achieve Pareto optimality in theoretical metrics. Compared with PrefixRL, it reduces delay and size by up to 26% and 30%, respectively. For multipliers, compared to RL-MUL, our method enhances speed and reduces size by as much as 49% and 45%. Additionally, ArithTreeRL's flexibility and scalability enable seamless integration into 7nm technology. We believe our work will offer valuable insights into hardware design, further accelerating speed and reducing size through the refined search space and our tree generation methodologies.

NeurIPS Conference 2024 Conference Paper

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

  • Chaofan Tao
  • Qian Liu
  • Longxu Dou
  • Niklas Muennighoff
  • Zhongwei Wan
  • Ping Luo
  • Min Lin
  • Ngai Wong

Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. Our approaches converge on the conclusion that the optimal vocabulary size depends on the compute budget, with larger models requiring larger vocabularies. Most LLMs, however, use insufficient vocabulary sizes. For example, we predict that the optimal vocabulary size of Llama2-70B should have been at least 216K, 7 times larger than its vocabulary of 32K. We validate our predictions empirically by training models with 3B parameters across different FLOPs budgets. Adopting our predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. By increasing the vocabulary size from the conventional 32K to 43K, we improve performance on ARC-Challenge from 29.1 to 32.0 with the same 2.3e21 FLOPs. Our work highlights the importance of jointly considering tokenization and model scaling for efficient pre-training. The code and demo are available at https://github.com/sail-sg/scaling-with-vocab and https://hf.co/spaces/sail/scaling-with-vocab-demo.
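
A toy illustration of the IsoFLOPs idea named above (the numbers are invented for illustration; the paper's actual fitting procedure differs in detail): train models at one fixed compute budget with varying vocabulary sizes, fit the final loss as a quadratic in log vocabulary size, and take the minimizer.

```python
import numpy as np

# Toy IsoFLOPs fit: (vocabulary size, final loss) pairs at one fixed
# compute budget. These numbers are made up for illustration only.
vocab = np.array([16_000, 32_000, 64_000, 128_000, 256_000])
loss = np.array([2.31, 2.25, 2.21, 2.20, 2.23])

coeffs = np.polyfit(np.log(vocab), loss, deg=2)   # quadratic in log V
v_opt = np.exp(-coeffs[1] / (2 * coeffs[0]))      # vertex of the parabola
print(f"estimated compute-optimal vocabulary: {v_opt:,.0f} tokens")
```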

NeurIPS Conference 2024 Conference Paper

SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge

  • Chuanhao Li
  • Zhen Li
  • Chenchen Jing
  • Shuo Liu
  • Wenqi Shao
  • Yuwei Wu
  • Ping Luo
  • Yu Qiao

Large vision-language models (LVLMs), such as the LLaVA series, are ignorant of up-to-date knowledge because the large amount of resources required prevents them from being updated frequently, and they therefore fail in many cases. For example, an LVLM released in January 2024 would not know the singer of the theme song for the new Detective Conan movie, which was not released until April 2024. To solve the problem, a promising solution motivated by retrieval-augmented generation (RAG) is to provide LVLMs with up-to-date knowledge via internet search during inference, i.e., internet-augmented generation (IAG), which is already integrated in some closed-source commercial LVLMs such as GPT-4V. However, the specific mechanics underpinning them remain a mystery. In this paper, we propose SearchLVLMs, a plug-and-play framework for augmenting existing LVLMs to handle visual question answering (VQA) about up-to-date knowledge. A hierarchical filtering model is trained to effectively and efficiently find the most helpful content from the websites returned by a search engine to prompt LVLMs with up-to-date knowledge. To train the model and evaluate our framework's performance, we propose a pipeline to automatically generate news-related VQA samples to construct a dataset, dubbed UDK-VQA. A multi-model voting mechanism is introduced to label the usefulness of website content for VQA samples to construct the training set. Experimental results demonstrate the effectiveness of our framework, outperforming GPT-4o by about 30% in accuracy.
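
A rough sketch of the filter-then-prompt idea, assuming a coarse page-level pass followed by a finer segment-level pass; `search`, `score`, and the `lvlm.generate` call are hypothetical stand-ins, not the paper's API:

```python
# Illustrative IAG pipeline with hierarchical filtering (hypothetical helpers).

def search(query: str) -> list[str]:
    """Stand-in for a search-engine call returning candidate page texts."""
    raise NotImplementedError

def score(text: str, question: str) -> float:
    """Stand-in for the trained filtering model's helpfulness score."""
    raise NotImplementedError

def answer_with_iag(lvlm, image, question: str, top_k: int = 3) -> str:
    pages = search(question)
    # Coarse website-level filtering, then fine segment-level filtering.
    pages = sorted(pages, key=lambda p: score(p, question), reverse=True)[:top_k]
    segments = [seg for p in pages for seg in p.split("\n\n")]
    segments = sorted(segments, key=lambda s: score(s, question), reverse=True)[:top_k]
    context = "\n".join(segments)
    # Prompt the frozen LVLM with the retrieved up-to-date context.
    return lvlm.generate(image=image, prompt=f"{context}\n\nQuestion: {question}")
```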

NeurIPS Conference 2024 Conference Paper

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

  • Jiannan Wu
  • Muyan Zhong
  • Sen Xing
  • Zeqiang Lai
  • Zhaoyang Liu
  • Zhe Chen
  • Wenhai Wang
  • Xizhou Zhu

We present VisionLLM v2, an end-to-end generalist multimodal large language model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2 significantly broadens its application scope. It excels not only in conventional visual question answering (VQA) but also in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. To this end, we propose a new information transmission mechanism termed "super link", as a medium to connect the MLLM with task-specific decoders. It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios. In addition, to support the diverse range of tasks, we carefully collected and curated training data from hundreds of public vision and vision-language tasks. In this way, our model can be jointly trained end-to-end on hundreds of vision-language tasks and generalize to these tasks using a set of shared parameters through different user prompts, achieving performance comparable to task-specific models. We believe VisionLLM v2 will offer a new perspective on the generalization of MLLMs.

AAAI Conference 2023 Conference Paper

DrugOOD: Out-of-Distribution Dataset Curator and Benchmark for AI-Aided Drug Discovery – a Focus on Affinity Prediction Problems with Noise Annotations

  • Yuanfeng Ji
  • Lu Zhang
  • Jiaxiang Wu
  • Bingzhe Wu
  • Lanqing Li
  • Long-Kai Huang
  • Tingyang Xu
  • Yu Rong

AI-aided drug discovery (AIDD) is gaining popularity due to its potential to make the search for new pharmaceuticals faster, less expensive, and more effective. Despite its extensive use in numerous fields (e.g., ADMET prediction, virtual screening), little research has been conducted on the out-of-distribution (OOD) learning problem with noise. We present DrugOOD, a systematic OOD dataset curator and benchmark for AIDD. Particularly, we focus on the drug-target binding affinity prediction problem, which involves both macromolecule (protein target) and small-molecule (drug compound). DrugOOD offers an automated dataset curator with user-friendly customization scripts, rich domain annotations aligned with biochemistry knowledge, realistic noise level annotations, and rigorous benchmarking of SOTA OOD algorithms, as opposed to only providing fixed datasets. Since the molecular data is often modeled as irregular graphs using graph neural network (GNN) backbones, DrugOOD also serves as a valuable testbed for graph OOD learning problems. Extensive empirical studies have revealed a significant performance gap between in-distribution and out-of-distribution experiments, emphasizing the need for the development of more effective schemes that permit OOD generalization under noise for AIDD.

NeurIPS Conference 2023 Conference Paper

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

  • Yao Mu
  • Qinglong Zhang
  • Mengkang Hu
  • Wenhai Wang
  • Mingyu Ding
  • Jun Jin
  • Bin Wang
  • Jifeng Dai

Embodied AI is a crucial frontier in robotics, capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments. In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities. To achieve this, we have made the following efforts: (i) We craft a large-scale embodied planning dataset, termed EgoCOT. The dataset consists of carefully selected videos from the Ego4D dataset, along with corresponding high-quality language instructions. Specifically, we generate a sequence of sub-goals with the "Chain of Thoughts" mode for effective embodied planning. (ii) We introduce an efficient training approach to EmbodiedGPT for high-quality plan generation, by adapting a 7B large language model (LLM) to the EgoCOT dataset via prefix tuning. (iii) We introduce a paradigm for extracting task-related features from LLM-generated planning queries to form a closed loop between high-level planning and low-level control. Extensive experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering. Notably, EmbodiedGPT significantly enhances the success rate of the embodied control task by extracting more effective features. It has achieved a remarkable 1.6 times increase in success rate on the Franka Kitchen benchmark and a 1.3 times increase on the Meta-World benchmark, compared to the BLIP-2 baseline fine-tuned with the Ego4D dataset.

NeurIPS Conference 2023 Conference Paper

Flow-Based Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection

  • Haibao Yu
  • Yingjuan Tang
  • Enze Xie
  • Jilei Mao
  • Ping Luo
  • Zaiqing Nie

Cooperatively utilizing both ego-vehicle and infrastructure sensor data can significantly enhance autonomous driving perception abilities. However, the uncertain temporal asynchrony and limited communication conditions that are present in traffic environments can lead to fusion misalignment and constrain the exploitation of infrastructure data. To address these issues in vehicle-infrastructure cooperative 3D (VIC3D) object detection, we propose the Feature Flow Net (FFNet), a novel cooperative detection framework. FFNet is a flow-based feature fusion framework that uses a feature flow prediction module to predict future features and compensate for asynchrony. Instead of transmitting feature maps extracted from still images, FFNet transmits feature flow, leveraging the temporal coherence of sequential infrastructure frames. Furthermore, we introduce a self-supervised training approach that enables FFNet to generate feature flow with feature prediction ability from raw infrastructure sequences. Experimental results on the DAIR-V2X dataset demonstrate that our proposed method outperforms existing cooperative detection methods while requiring only about 1/100 of the transmission cost of raw data and covering all latencies with a single model. The code is available at https://github.com/haibao-yu/FFNet-VIC3D.

NeurIPS Conference 2023 Conference Paper

Foundation Model is Efficient Multimodal Multitask Model Selector

  • Fanqing Meng
  • Wenqi Shao
  • Zhanglin Peng
  • Chonghe Jiang
  • Kaipeng Zhang
  • Yu Qiao
  • Ping Luo

This paper investigates an under-explored but important problem: given a collection of pre-trained neural networks, predicting their performance on each multi-modal task without fine-tuning them, for tasks such as image recognition, referring, captioning, visual question answering, and text question answering. A brute-force approach is to fine-tune all models on all target datasets, which brings high computational costs. Although recent advanced approaches employ lightweight metrics to measure models' transferability, they often depend heavily on prior knowledge of a single task, making them inapplicable in a multi-modal multi-task scenario. To tackle this issue, we propose an efficient multi-task model selector (EMMS), which employs large-scale foundation models to transform diverse label formats, such as categories, texts, and bounding boxes of different downstream tasks, into a unified noisy label embedding. EMMS can estimate a model's transferability through a simple weighted linear regression, which can be efficiently solved by an alternating minimization algorithm with a convergence guarantee. Extensive experiments on 5 downstream tasks with 24 datasets show that EMMS is fast, effective, and generic enough to assess the transferability of pre-trained models, making it the first model selection method in the multi-task scenario. For instance, compared with the state-of-the-art method LogME enhanced by our label embeddings, EMMS achieves 9.0%, 26.3%, 20.1%, 54.8%, and 12.2% performance gains on image recognition, referring, captioning, visual question answering, and text question answering, while bringing 5.13×, 6.29×, 3.59×, 6.19×, and 5.66× speedups in wall-clock time, respectively. The code is available at https://github.com/OpenGVLab/Multitask-Model-Selector.

NeurIPS Conference 2023 Conference Paper

OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping

  • Huijie Wang
  • Tianyu Li
  • Yang Li
  • Li Chen
  • Chonghao Sima
  • Zhenbo Liu
  • Bangjun Wang
  • Peijin Jia

Accurately depicting the complex traffic scene is a vital component for autonomous vehicles to execute correct judgments. However, existing benchmarks tend to oversimplify the scene by solely focusing on lane perception tasks. Observing that human drivers rely on both lanes and traffic signals to operate their vehicles safely, we present OpenLane-V2, the first dataset on topology reasoning for traffic scene structure. The objective of the presented dataset is to advance research in understanding the structure of road scenes by examining the relationship between perceived entities, such as traffic elements and lanes. Leveraging existing datasets, OpenLane-V2 consists of 2,000 annotated road scenes that describe traffic elements and their correlation to the lanes. It comprises three primary sub-tasks, including the 3D lane detection inherited from OpenLane, accompanied by corresponding metrics to evaluate the model's performance. We evaluate various state-of-the-art methods, and present their quantitative and qualitative results on OpenLane-V2 to indicate future avenues for investigating topology reasoning in traffic scenes.

NeurIPS Conference 2023 Conference Paper

RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths

  • Zeyue Xue
  • Guanglu Song
  • Qiushan Guo
  • Boxiao Liu
  • Zhuofan Zong
  • Yu Liu
  • Ping Luo

Text-to-image generation has recently witnessed remarkable achievements. We introduce a text-conditional image diffusion model, termed RAPHAEL, to generate highly artistic images, which accurately portray the text prompts, encompassing multiple nouns, adjectives, and verbs. This is achieved by stacking tens of mixture-of-experts (MoEs) layers, i.e., space-MoE and time-MoE layers, enabling billions of diffusion paths (routes) from the network input to the output. Each path intuitively functions as a "painter" for depicting a particular textual concept onto a specified image region at a diffusion timestep. Comprehensive experiments reveal that RAPHAEL outperforms recent cutting-edge models, such as Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2, in terms of both image quality and aesthetic appeal. Firstly, RAPHAEL exhibits superior performance in switching images across diverse styles, such as Japanese comics, realism, cyberpunk, and ink illustration. Secondly, a single model with three billion parameters, trained on 1,000 A100 GPUs for two months, achieves a state-of-the-art zero-shot FID score of 6.61 on the COCO dataset. Furthermore, RAPHAEL significantly surpasses its counterparts in human evaluation on the ViLG-300 benchmark. We believe that RAPHAEL holds the potential to propel the frontiers of image generation research in both academia and industry, paving the way for future breakthroughs in this rapidly evolving field. More details can be found on a webpage: https://raphael-painter.github.io/.

NeurIPS Conference 2023 Conference Paper

Top-Ambiguity Samples Matter: Understanding Why Deep Ensemble Works in Selective Classification

  • Qiang Ding
  • Yixuan Cao
  • Ping Luo

Selective classification allows a machine learning model to reject some hard inputs and thus improve the reliability of its predictions. In this area, the ensemble method is powerful in practice, but there has been no solid analysis on why the ensemble method works. Inspired by an interesting empirical result that the improvement of the ensemble largely comes from top-ambiguity samples where its member models diverge, we prove that, based on some assumptions, the ensemble has a lower selective risk than the member model for any coverage within a range. The proof is nontrivial since the selective risk is a non-convex function of the model prediction. The assumptions and the theoretical results are supported by systematic experiments on both computer vision and natural language processing tasks.
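
For context, these are the standard selective-classification quantities behind the claim, with classifier f, selection function g(x) ∈ {0,1}, and loss ℓ (a standard textbook definition, not notation taken from this paper):

```latex
\text{coverage}(g) = \mathbb{E}[g(x)], \qquad
R_{\mathrm{sel}}(f, g) = \frac{\mathbb{E}\left[\ell(f(x), y)\, g(x)\right]}{\mathbb{E}[g(x)]}
```

The result stated above is that, under the paper's assumptions, the ensemble attains a lower selective risk than its member models at any coverage within a range.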

TMLR Journal 2023 Journal Article

Understanding Self-Supervised Pretraining with Part-Aware Representation Learning

  • Jie Zhu
  • Jiyang Qi
  • Mingyu Ding
  • Xiaokang Chen
  • Ping Luo
  • Xinggang Wang
  • Wenyu Liu
  • Leye Wang

In this paper, we are interested in understanding self-supervised pretraining by studying the capability of self-supervised methods to learn part-aware representations. The study is mainly motivated by the observation that random views used in contrastive learning, and randomly masked (visible) patches used in masked image modeling, are often about object parts. We explain that contrastive learning is a part-to-whole task: the projection layer hallucinates the whole object representation from the object part representation learned by the encoder, and that masked image modeling is a part-to-part task: the masked patches of the object are hallucinated from the visible patches. The explanation suggests that the self-supervised pretrained encoder leans toward understanding the object part. We empirically compare the off-the-shelf encoders pretrained with several representative methods on object-level recognition and part-level recognition. The results show that the fully-supervised model outperforms self-supervised models for object-level recognition, and most self-supervised contrastive learning and masked image modeling methods outperform the fully-supervised method for part-level recognition. It is observed that the combination of contrastive learning and masked image modeling further improves the performance.

NeurIPS Conference 2023 Conference Paper

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

  • Wenhai Wang
  • Zhe Chen
  • Xiaokang Chen
  • Jiannan Wu
  • Xizhou Zhu
  • Gang Zeng
  • Ping Luo
  • Tong Lu

Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these instructions for open-ended tasks. Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. It's noteworthy that, with a generalist LLM-based framework, our model can achieve over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The code shall be released.

NeurIPS Conference 2022 Conference Paper

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

  • Shoufa Chen
  • Chongjian Ge
  • Zhan Tong
  • Jiangliu Wang
  • Yibing Song
  • Jue Wang
  • Ping Luo

Pretraining Vision Transformers (ViTs) has achieved great success in visual recognition. A common follow-up scenario is adapting a ViT to various image and video recognition tasks. The adaptation is challenging because of heavy computation and memory storage: each model needs an independent and complete fine-tuning process to adapt to different tasks, which limits its transferability to different visual domains. To address this challenge, we propose an effective adaptation approach for Transformers, namely AdaptFormer, which can efficiently adapt pre-trained ViTs to many different image and video tasks. It possesses several benefits more appealing than prior arts. Firstly, AdaptFormer introduces lightweight modules that add less than 2% extra parameters to a ViT, yet it increases the ViT's transferability without updating the original pre-trained parameters, significantly outperforming existing 100% fully fine-tuned models on action recognition benchmarks. Secondly, it can be plugged into different Transformers and scales to many visual tasks. Thirdly, extensive experiments on five image and video datasets show that AdaptFormer largely improves ViTs in the target domains. For example, when updating just 1.5% extra parameters, it achieves about 10% and 19% relative improvement over the fully fine-tuned models on Something-Something V2 and HMDB51, respectively. Code is available at https://github.com/ShoufaChen/AdaptFormer.
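
A minimal parallel bottleneck-adapter sketch in the spirit of the abstract, assuming a down-projection/ReLU/up-projection branch added beside a frozen block; the exact placement and scaling in AdaptFormer may differ:

```python
import torch
from torch import nn

class AdaptMLP(nn.Module):
    """Sketch of a parallel bottleneck adapter beside a frozen pre-trained MLP."""
    def __init__(self, mlp: nn.Module, dim: int, bottleneck: int = 64,
                 scale: float = 0.1):
        super().__init__()
        self.mlp = mlp                            # pre-trained MLP, kept frozen
        for p in self.mlp.parameters():
            p.requires_grad = False
        self.down = nn.Linear(dim, bottleneck)    # lightweight trainable branch
        self.up = nn.Linear(bottleneck, dim)
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-dimensional residual; only the
        # down/up projections are trained during adaptation.
        return self.mlp(x) + self.scale * self.up(torch.relu(self.down(x)))
```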

NeurIPS Conference 2022 Conference Paper

AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation

  • Yuanfeng Ji
  • Haotian Bai
  • Chongjian Ge
  • Jie Yang
  • Ye Zhu
  • Ruimao Zhang
  • Zhen Li
  • Lingyan Zhang

Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most of the deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate these limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of the existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. Information can be found at https://amos22.grand-challenge.org.

NeurIPS Conference 2022 Conference Paper

DOMINO: Decomposed Mutual Information Optimization for Generalized Context in Meta-Reinforcement Learning

  • Yao Mu
  • Yuzheng Zhuang
  • Fei Ni
  • Bin Wang
  • Jianyu Chen
  • Jianye Hao
  • Ping Luo

Adapting to changes in transition dynamics is essential in robotic applications. By learning a conditional policy with a compact context, context-aware meta-reinforcement learning provides a flexible way to adjust behavior according to dynamics changes. However, in real-world applications, the agent may encounter complex dynamics changes. Multiple confounders can influence the transition dynamics, making it challenging to infer an accurate context for decision-making. This paper addresses this challenge with decomposed mutual information optimization (DOMINO) for context learning, which explicitly learns a disentangled context to maximize the mutual information between the context and historical trajectories while minimizing the state transition prediction error. Our theoretical analysis shows that DOMINO can overcome the underestimation of mutual information caused by multiple confounders via learning disentangled context, and reduces the number of samples that must be collected across environments. Extensive experiments show that the context learned by DOMINO benefits both model-based and model-free reinforcement learning algorithms for dynamics generalization in terms of sample efficiency and performance in unseen environments.

IJCAI Conference 2022 Conference Paper

Don’t Touch What Matters: Task-Aware Lipschitz Data Augmentation for Visual Reinforcement Learning

  • Zhecheng Yuan
  • Guozheng Ma
  • Yao Mu
  • Bo Xia
  • Bo Yuan
  • Xueqian Wang
  • Ping Luo
  • Huazhe Xu

One of the key challenges in visual Reinforcement Learning (RL) is to learn policies that can generalize to unseen environments. Recently, data augmentation techniques aiming at enhancing data diversity have demonstrated strong performance in improving the generalization ability of learned policies. However, due to the sensitivity of RL training, naively applying data augmentation, which transforms each pixel in a task-agnostic manner, may suffer from instability and harm sample efficiency, further degrading generalization performance. At the heart of this phenomenon are the diverged action distribution and high-variance value estimation in the face of augmented images. To alleviate this issue, we propose Task-aware Lipschitz Data Augmentation (TLDA) for visual RL, which explicitly identifies the task-correlated pixels with large Lipschitz constants, and only augments the task-irrelevant pixels for stability. We verify the effectiveness of our approach on the DeepMind Control Suite, CARLA and DeepMind Manipulation tasks. The extensive empirical results show that TLDA improves both sample efficiency and generalization; it outperforms previous state-of-the-art methods across 3 different visual control benchmarks.
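
A brute-force, illustrative sketch of the sensitivity map behind this idea: perturb each pixel, measure how much the policy output moves (a finite-difference Lipschitz estimate), and only augment where the estimate is small. The toy policy and sizes are assumptions; TLDA's actual estimation is more refined:

```python
import torch

def lipschitz_mask(policy, obs, delta=0.05):
    """Mark pixels whose perturbation barely moves the policy output
    (low Lipschitz estimate) as safe to augment."""
    with torch.no_grad():
        base = policy(obs.unsqueeze(0))
        k = torch.zeros(obs.shape[-2:])
        for i in range(obs.shape[-2]):
            for j in range(obs.shape[-1]):
                pert = obs.clone()
                pert[..., i, j] += delta
                k[i, j] = (policy(pert.unsqueeze(0)) - base).norm() / delta
    return k <= k.median()  # True = task-irrelevant, OK to augment

policy = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 16 * 16, 4))
obs = torch.rand(3, 16, 16)
mask = lipschitz_mask(policy, obs)                        # (16, 16) boolean map
augmented = torch.where(mask, torch.rand_like(obs), obs)  # keep task pixels intact
```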

NeurIPS Conference 2022 Conference Paper

Large-batch Optimization for Dense Visual Predictions: Training Faster R-CNN in 4.2 Minutes

  • Zeyue Xue
  • Jianming Liang
  • Guanglu Song
  • Zhuofan Zong
  • Liang Chen
  • Yu Liu
  • Ping Luo

Training a large-scale deep neural network on a large-scale dataset is challenging and time-consuming. The recent breakthrough of large-batch optimization is a promising way to tackle this challenge. However, although current advanced algorithms such as LARS and LAMB succeed in classification models, the complicated pipelines of dense visual predictions such as object detection and segmentation still suffer from a heavy performance drop in the large-batch training regime. To address this challenge, we propose a simple yet effective algorithm, named Adaptive Gradient Variance Modulator (AGVM), which can train dense visual predictors with very large batch sizes, offering several benefits over prior approaches. Firstly, AGVM can align the gradient variances between different modules in dense visual predictors, such as the backbone, feature pyramid network (FPN), detection, and segmentation heads. We show that training with a large batch size can fail when the gradient variances are misaligned among them, a phenomenon largely overlooked in previous work. Secondly, AGVM is a plug-and-play module that generalizes well to many different architectures (e.g., CNNs and Transformers) and different tasks (e.g., object detection, instance segmentation, semantic segmentation, and panoptic segmentation). It is also compatible with different optimizers (e.g., SGD and AdamW). Thirdly, a theoretical analysis of AGVM is provided. Extensive experiments on the COCO and ADE20K datasets demonstrate the superiority of AGVM. For example, AGVM demonstrates more stable generalization performance than prior approaches under an extremely large batch size (i.e., 10k). AGVM can train Faster R-CNN+ResNet50 in 4.2 minutes without losing performance. It enables training an object detector with one billion parameters in just 3.5 hours, reducing the training time by 20.9×, whilst achieving 62.2 mAP on COCO. The deliverables will be released at https://github.com/Sense-X/AGVM.
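
A much-simplified sketch of the variance-alignment idea: measure each module's gradient variance after the backward pass and rescale so they match a reference module's. AGVM's actual modulation rule is more involved, and the module names in the usage comment are hypothetical:

```python
import torch

def grad_variance(module):
    """Variance over all gradient entries of a module (a crude per-module statistic)."""
    grads = [p.grad.flatten() for p in module.parameters() if p.grad is not None]
    return torch.cat(grads).var() if grads else None

def align_gradient_variances(modules, eps=1e-12):
    """Rescale every module's gradients so their variance matches the first
    module's (e.g. the backbone's)."""
    ref = grad_variance(modules[0])
    for m in modules[1:]:
        v = grad_variance(m)
        if ref is not None and v is not None and v > 0:
            scale = torch.sqrt(ref / (v + eps))
            for p in m.parameters():
                if p.grad is not None:
                    p.grad.mul_(scale)

# Hypothetical usage after loss.backward():
#   align_gradient_variances([model.backbone, model.fpn, model.det_head])
```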

NeurIPS Conference 2022 Conference Paper

MaskPlace: Fast Chip Placement via Reinforced Visual Representation Learning

  • Yao Lai
  • Yao Mu
  • Ping Luo

Placement is an essential task in modern chip design, aiming at placing millions of circuit modules on a 2D chip canvas. Unlike the human-centric solution, which requires months of intense effort by hardware engineers to produce a layout that minimizes delay and energy consumption, deep reinforcement learning has become an emerging autonomous tool. However, the learning-centric method is still in its early stage, impeded by a massive design space whose size is on the order of ten to the power of a few thousand. This work presents MaskPlace, which automatically generates a valid chip layout design within a few hours, with performance superior or comparable to recent advanced approaches. It has several appealing benefits that prior approaches lack. Firstly, MaskPlace recasts placement as a problem of learning pixel-level visual representations to comprehensively describe millions of modules on a chip, enabling placement on a high-resolution canvas and in a large action space. It outperforms recent methods that represent a chip as a hypergraph. Secondly, it enables training the policy network with an intuitive, dense reward function, rather than the complicated, sparse reward functions of previous methods. Thirdly, extensive experiments on many public benchmarks show that MaskPlace outperforms existing RL approaches in all key performance metrics, including wirelength, congestion, and density. For example, it achieves 60%-90% wirelength reduction and guarantees zero overlaps. We believe MaskPlace can improve AI-assisted chip layout design. The deliverables are released at https://laiyao1.github.io/maskplace.

NeurIPS Conference 2022 Conference Paper

Rethinking Resolution in the Context of Efficient Video Recognition

  • Chuofan Ma
  • Qiushan Guo
  • Yi Jiang
  • Ping Luo
  • Zehuan Yuan
  • Xiaojuan Qi

In this paper, we empirically study how to make the most of low-resolution frames for efficient video recognition. Existing methods mainly focus on developing compact networks or alleviating the temporal redundancy of video inputs to increase efficiency, whereas compressing frame resolution has rarely been considered a promising solution. A major concern is the poor recognition accuracy on low-resolution frames. We thus start by analyzing the underlying causes of performance degradation on low-resolution frames. Our key finding is that the major cause of degradation is not information loss in the down-sampling process, but rather the mismatch between network architecture and input scale. Motivated by the success of knowledge distillation (KD), we propose to bridge the gap between network and input size via cross-resolution KD (ResKD). Our work shows that ResKD is a simple but effective method to boost recognition accuracy on low-resolution frames. Without bells and whistles, ResKD considerably surpasses all competitive methods in terms of efficiency and accuracy on four large-scale benchmark datasets, i.e., ActivityNet, FCVID, Mini-Kinetics, and Something-Something V2. In addition, we extensively demonstrate its effectiveness over state-of-the-art architectures, i.e., 3D-CNNs and Video Transformers, and its scalability to very low-resolution frames. The results suggest ResKD can serve as a general inference acceleration method for state-of-the-art video recognition. Our code will be available at https://github.com/CVMI-Lab/ResKD.
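
A minimal sketch of the cross-resolution distillation objective: a high-resolution teacher supervises a low-resolution student through the standard Hinton-style KD loss. The temperature and weighting are illustrative, not necessarily the paper's settings:

```python
import torch
import torch.nn.functional as F

def reskd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soften both sets of logits with temperature T; the student matches the
    teacher's distribution and, with weight (1 - alpha), the hard labels."""
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Hypothetical pairing: the teacher sees full-resolution frames, the student a
# bilinearly downsampled copy, e.g.
#   low = F.interpolate(frames, scale_factor=0.5, mode="bilinear", align_corners=False)
```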

AAAI Conference 2022 Conference Paper

Towards Ultra-Resolution Neural Style Transfer via Thumbnail Instance Normalization

  • Zhe Chen
  • Wenhai Wang
  • Enze Xie
  • Tong Lu
  • Ping Luo

We present an extremely simple Ultra-Resolution Style Transfer framework, termed URST, to flexibly perform style transfer on arbitrary high-resolution images (e.g., 10000×10000 pixels) for the first time. Most existing state-of-the-art methods fall short due to massive memory cost and small stroke size when processing ultra-high-resolution images. URST completely avoids the memory problem caused by ultra-high-resolution images by (1) dividing the image into small patches and (2) performing patch-wise style transfer with a novel Thumbnail Instance Normalization (TIN). Specifically, TIN extracts the thumbnail features' normalization statistics and applies them to small patches, ensuring style consistency among different patches. Overall, the URST framework has three merits compared to prior approaches. (1) We divide the input image into small patches and adopt TIN, successfully transferring image style at arbitrarily high resolutions. (2) Experiments show that our URST surpasses existing SOTA methods on ultra-high-resolution images, benefiting from the effectiveness of the proposed stroke perceptual loss in enlarging the stroke size. (3) Our URST can be easily plugged into most existing style transfer methods and directly improve their performance even without training. Code is available at https://git.io/URST.
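
One reading of Thumbnail Instance Normalization as stated above, in a minimal sketch: patch features are normalized with the thumbnail's per-channel statistics, so all patches share one consistent set of style statistics. Tensor shapes are illustrative:

```python
import torch

def thumbnail_instance_norm(patch_feat, thumb_feat, eps=1e-5):
    """Instance-normalize patch features with the thumbnail's per-channel
    mean/std instead of the patch's own, keeping style consistent across patches."""
    mean = thumb_feat.mean(dim=(2, 3), keepdim=True)
    std = thumb_feat.std(dim=(2, 3), keepdim=True)
    return (patch_feat - mean) / (std + eps)

patch = torch.randn(1, 256, 64, 64)  # features of one high-resolution patch
thumb = torch.randn(1, 256, 32, 32)  # features of the whole-image thumbnail
out = thumbnail_instance_norm(patch, thumb)  # same shape as `patch`
```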

AAAI Conference 2021 Conference Paper

A Bottom-Up DAG Structure Extraction Model for Math Word Problems

  • Yixuan Cao
  • Feng Hong
  • Hongwei Li
  • Ping Luo

Research on automatically solving mathematical word problems (MWP) has a long history. Most recent works adopt the Seq2Seq approach to predict the result equations as a sequence of quantities and operators. Although result equations can be written as a sequence, they are essentially structures. More precisely, a result equation is a Directed Acyclic Graph (DAG) whose leaf nodes are the quantities, and whose internal and root nodes are arithmetic or comparison operators. In this paper, we propose a novel Seq2DAG approach to extract the equation set directly as a DAG structure. It extracts the structure in a bottom-up fashion by iteratively aggregating quantities and sub-expressions layer by layer. The advantages of our approach are threefold: it is intrinsically suitable for solving multivariate problems, it always outputs valid structures, and its computation satisfies the commutative law for +, × and =. Experimental results on the DRAW1K and Math23K datasets demonstrate that our model outperforms state-of-the-art deep learning methods. We also conduct a detailed analysis of the results to show the strengths and limitations of our approach.

AAAI Conference 2021 Conference Paper

A Unified Multi-Scenario Attacking Network for Visual Object Tracking

  • Xuesong Chen
  • Canmiao Fu
  • Feng Zheng
  • Yong Zhao
  • Hongsheng Li
  • Ping Luo
  • Guo-Jun Qi

Existing adversarial attack methods successfully generate adversarial examples that confuse Deep Neural Networks (DNNs) for image classification and object detection, resulting in wrong predictions. However, these methods struggle to attack models for video object tracking, because tracking algorithms can handle sequential information across video frames and the categories of the tracked targets are normally unknown in advance. In this paper, we propose a Unified and Effective Network, named UEN, to attack visual object tracking models. UEN has several appealing characteristics: (1) UEN can produce various invisible adversarial perturbations according to different attack settings by using only one simple end-to-end network with three carefully designed loss functions; (2) UEN can generate general visible adversarial patch patterns to attack advanced trackers in the real world; (3) Extensive experiments show that UEN is able to attack many state-of-the-art trackers effectively (e.g., SiamRPN-based networks and DiMP) on popular tracking datasets including OTB100, UAV123, and GOT10K, making online real-time attacks possible. The attack results outperform the introduced baseline in terms of attacking ability and attacking efficiency.

NeurIPS Conference 2021 Conference Paper

An Empirical Investigation of Representation Learning for Imitation

  • Cynthia Chen
  • Xin Chen
  • Sam Toyer
  • Cody Wild
  • Scott Emmons
  • Ian Fischer
  • Kuang-Huei Lee
  • Neel Alex

Imitation learning often needs a large demonstration set in order to handle the full range of situations that an agent might find itself in during deployment. However, collecting expert demonstrations can be expensive. Recent work in vision, reinforcement learning, and NLP has shown that auxiliary representation learning objectives can reduce the need for large amounts of expensive, task-specific data. Our Empirical Investigation of Representation Learning for Imitation (EIRLI) investigates whether similar benefits apply to imitation learning. We propose a modular framework for constructing representation learning algorithms, then use our framework to evaluate the utility of representation learning for imitation across several environment suites. In the settings we evaluate, we find that existing algorithms for image-based representation learning provide limited value relative to a well-tuned baseline with image augmentations. To explain this result, we investigate differences between imitation learning and other settings where representation learning has provided significant benefit, such as image classification. Finally, we release a well-documented codebase which both replicates our findings and provides a modular framework for creating new representation learning algorithms out of reusable components.

NeurIPS Conference 2021 Conference Paper

Compressed Video Contrastive Learning

  • Yuqi Huo
  • Mingyu Ding
  • Haoyu Lu
  • Nanyi Fei
  • Zhiwu Lu
  • Ji-Rong Wen
  • Ping Luo

This work concerns self-supervised video representation learning (SSVRL), a topic that has received much attention recently. Since videos are storage-intensive and contain a rich source of visual content, models designed for SSVRL are expected to be storage- and computation-efficient, as well as effective. However, most existing methods only focus on one of the two objectives, failing to consider both at the same time. In this work, for the first time, these seemingly contradictory goals are simultaneously achieved by exploiting compressed videos and capturing the mutual information between two input streams. Specifically, a novel Motion Vector based Cross Guidance Contrastive learning approach (MVCGC) is proposed. For storage and computation efficiency, we choose to directly decode RGB frames and motion vectors (which resemble low-resolution optical flows) from compressed videos on-the-fly. To enhance the representation ability of the motion vectors, and hence the effectiveness of our method, we design a cross guidance contrastive learning algorithm based on a multi-instance InfoNCE loss, where motion vectors can take supervision signals from RGB frames and vice versa. Comprehensive experiments on two downstream tasks show that our MVCGC yields a new state of the art while being significantly more efficient than its competitors.

NeurIPS Conference 2021 Conference Paper

Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language

  • Mingyu Ding
  • Zhenfang Chen
  • Tao Du
  • Ping Luo
  • Josh Tenenbaum
  • Chuang Gan

In this work, we propose a unified framework, called Visual Reasoning with Differentiable Physics (VRDP), that can jointly learn visual concepts and infer physics models of objects and their interactions from videos and language. This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine. The visual perception module parses each video frame into object-centric trajectories and represents them as latent scene representations. The concept learner grounds visual concepts (e.g., color, shape, and material) from these object-centric representations based on the language, thus providing prior knowledge for the physics engine. The differentiable physics model, implemented as an impulse-based differentiable rigid-body simulator, performs differentiable physical simulation based on the grounded concepts to infer physical properties, such as mass, restitution, and velocity, by fitting the simulated trajectories to the video observations. Consequently, these learned concepts and physical models can explain what we have seen and imagine what is about to happen in future and counterfactual scenarios. Integrating differentiable physics into the dynamic reasoning framework offers several appealing benefits. More accurate dynamics prediction in learned physics models enables state-of-the-art performance on both synthetic and real-world benchmarks while still maintaining high transparency and interpretability; most notably, VRDP improves the accuracy of predictive and counterfactual questions by 4.5% and 11.5% compared to its best counterpart. VRDP is also highly data-efficient: physical parameters can be optimized from very few videos, and even a single video can be sufficient. Finally, with all physical parameters inferred, VRDP can quickly learn new concepts from a few examples.

AAAI Conference 2021 Conference Paper

Extracting Zero-shot Structured Information from Form-like Documents: Pretraining with Keys and Triggers

  • Rongyu Cao
  • Ping Luo

In this paper, we revisit the problem of extracting the values of a given set of key fields from form-like documents. It is a vital step in supporting many downstream applications, such as knowledge base construction, question answering, document comprehension and so on. Previous studies ignore the semantics of the given keys by considering them only as class labels, and thus may be incapable of handling zero-shot keys. Meanwhile, although these models often leverage the attention mechanism, the learned features might not reflect the true proxy of explanations for why humans would recognize the value for the key, and thus may not generalize well to new documents. To address these issues, we propose a Key-Aware and Trigger-Aware (KATA) extraction model. Given an input key, it explicitly learns two mappings, namely from key representations to trigger representations and then from trigger representations to values. These two mappings may be intrinsic and invariant across different keys and documents. With a large training set automatically constructed from Wikipedia data, we pre-train these two mappings. Experiments with fine-tuning on two applications show that the proposed model achieves more than 70% accuracy for the extraction of zero-shot keys while previous methods all fail.

NeurIPS Conference 2021 Conference Paper

Model-Based Reinforcement Learning via Imagination with Derived Memory

  • Yao Mu
  • Yuzheng Zhuang
  • Bin Wang
  • Guangxiang Zhu
  • Wulong Liu
  • Jianyu Chen
  • Ping Luo
  • Shengbo Li

Model-based reinforcement learning aims to improve the sample efficiency of policy learning by modeling the dynamics of the environment. Recently, the latent dynamics model has been further developed to enable fast planning in a compact space. It summarizes the high-dimensional experiences of an agent, mimicking the memory function of humans. Learning policies via imagination with the latent model shows great potential for solving complex tasks. However, considering only memories from true experiences in the imagination process limits its advantages. Inspired by the memory prosthesis proposed by neuroscientists, we present a novel model-based reinforcement learning framework called Imagining with Derived Memory (IDM). It enables the agent to learn policies from enriched, diverse imagination with prediction-reliability weighting, thus improving sample efficiency and policy robustness. Experiments on various high-dimensional visual control tasks in the DMControl benchmark demonstrate that IDM outperforms previous state-of-the-art methods in terms of policy robustness and further improves the sample efficiency of the model-based method.

NeurIPS Conference 2021 Conference Paper

Rethinking the Pruning Criteria for Convolutional Neural Network

  • Zhongzhan Huang
  • Wenqi Shao
  • Xinjiang Wang
  • Liang Lin
  • Ping Luo

Channel pruning is a popular technique for compressing convolutional neural networks (CNNs), and various pruning criteria have been proposed to remove redundant filters. From our comprehensive experiments, we found two blind spots in pruning criteria: (1) Similarity: there are strong similarities among several primary pruning criteria that are widely cited and compared. According to these criteria, the ranks of the filters' importance scores are almost identical, resulting in similar pruned structures. (2) Applicability: the filters' importance scores measured by some pruning criteria are too close to distinguish network redundancy well. In this paper, we analyze these blind spots for different types of pruning criteria under layer-wise and global pruning. We also break some stereotypes, such as that the results of $\ell_1$ and $\ell_2$ pruning are not always similar. These analyses are based on empirical experiments and our assumption (the Convolutional Weight Distribution Assumption) that the well-trained convolutional filters in each layer approximately follow a Gaussian-like distribution. This assumption has been verified through systematic and extensive statistical tests.
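
A quick, illustrative probe of the "similarity" blind spot on a single layer: rank filters by their $\ell_1$ and $\ell_2$ norms and compare the rankings (random weights stand in for a trained layer):

```python
import torch

def filter_importance(conv_weight, p):
    """Per-filter importance score: the Lp norm of each output filter's weights."""
    return conv_weight.flatten(1).norm(p=p, dim=1)

w = torch.randn(64, 32, 3, 3)  # stand-in for a trained conv layer (64 filters)
ranks = [filter_importance(w, p).argsort().argsort().float() for p in (1, 2)]
# Spearman rank correlation between the two criteria's filter rankings
spearman = torch.corrcoef(torch.stack(ranks))[0, 1]
print(f"rank correlation between L1 and L2 importance: {spearman:.3f}")
```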

NeurIPS Conference 2021 Conference Paper

Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning

  • Chongjian Ge
  • Youwei Liang
  • Yibing Song
  • Jianbo Jiao
  • Jue Wang
  • Ping Luo

Studies on self-supervised visual representation learning (SSL) improve encoder backbones to discriminate training samples without labels. While CNN encoders via SSL achieve comparable recognition performance to those via supervised learning, their network attention is under-explored for further improvement. Motivated by the transformers that explore visual attention effectively in recognition scenarios, we propose a CNN Attention REvitalization (CARE) framework to train attentive CNN encoders guided by transformers in SSL. The proposed CARE framework consists of a CNN stream (C-stream) and a transformer stream (T-stream), where each stream contains two branches. The C-stream follows an existing SSL framework with two CNN encoders, two projectors, and a predictor. The T-stream contains two transformers, two projectors, and a predictor. The T-stream connects to the CNN encoders and runs in parallel with the C-stream. During training, we perform SSL in both streams simultaneously and use the T-stream output to supervise the C-stream. The features from the CNN encoders are modulated in the T-stream for visual attention enhancement and become suitable for the SSL scenario. We use these modulated features to supervise the C-stream for learning attentive CNN encoders. In this way, we revitalize CNN attention by using transformers as guidance. Experiments on several standard visual recognition benchmarks, including image classification, object detection, and semantic segmentation, show that the proposed CARE framework improves CNN encoder backbones to state-of-the-art performance.

NeurIPS Conference 2021 Conference Paper

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

  • Enze Xie
  • Wenhai Wang
  • Zhiding Yu
  • Anima Anandkumar
  • Jose M. Alvarez
  • Ping Luo

We present SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes, which leads to decreased performance when the testing resolution differs from the training resolution. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, thus combining local and global attention to render powerful representations. We show that this simple and lightweight design is the key to efficient segmentation with Transformers. We scale our approach up to obtain a series of models from SegFormer-B0 to SegFormer-B5, which reach much better performance and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50.3% mIoU on ADE20K with 64M parameters, being 5× smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on the Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C.
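
A minimal sketch of the all-MLP decoder idea: project each encoder stage to a common width, upsample to the finest scale, concatenate, fuse, and predict. Channel widths and class count are illustrative, not the released configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPDecoder(nn.Module):
    """Aggregate multiscale features with 1x1 projections and a fusion layer."""
    def __init__(self, in_dims=(32, 64, 160, 256), embed=256, classes=19):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(d, embed, 1) for d in in_dims])
        self.fuse = nn.Conv2d(embed * len(in_dims), embed, 1)
        self.head = nn.Conv2d(embed, classes, 1)

    def forward(self, feats):            # feats: finest stage first
        size = feats[0].shape[-2:]
        ups = [F.interpolate(p(f), size=size, mode="bilinear", align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.head(self.fuse(torch.cat(ups, dim=1)))

feats = [torch.randn(1, d, 64 // 2 ** i, 64 // 2 ** i)
         for i, d in enumerate((32, 64, 160, 256))]
logits = MLPDecoder()(feats)             # (1, 19, 64, 64)
```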

IJCAI Conference 2021 Conference Paper

Segmenting Transparent Objects in the Wild with Transformer

  • Enze Xie
  • Wenjia Wang
  • Wenhai Wang
  • Peize Sun
  • Hang Xu
  • Ding Liang
  • Ping Luo

This work presents a new fine-grained transparent object segmentation dataset, termed Trans10K-v2, extending Trans10K-v1, the first large-scale transparent object segmentation dataset. Unlike Trans10K-v1, which has only two limited categories, our new dataset has several appealing benefits. (1) It has 11 fine-grained categories of transparent objects commonly occurring in human domestic environments, making it more practical for real-world applications. (2) Trans10K-v2 brings more challenges for current advanced segmentation methods than its former version. Furthermore, a novel Transformer-based segmentation pipeline termed Trans2Seg is proposed. Firstly, the Transformer encoder of Trans2Seg provides a global receptive field in contrast to CNN's local receptive field, which shows excellent advantages over pure CNN architectures. Secondly, by formulating semantic segmentation as a problem of dictionary look-up, we design a set of learnable prototypes as the queries of Trans2Seg's Transformer decoder, where each prototype learns the statistics of one category over the whole dataset. We benchmark more than 20 recent semantic segmentation methods, demonstrating that Trans2Seg significantly outperforms all the CNN-based methods, showing the proposed algorithm's potential for solving transparent object segmentation. Code is available at https://github.com/xieenze/Trans2Seg.

AAAI Conference 2020 Conference Paper

Every Frame Counts: Joint Learning of Video Segmentation and Optical Flow

  • Mingyu Ding
  • Zhe Wang
  • Bolei Zhou
  • Jianping Shi
  • Zhiwu Lu
  • Ping Luo

A major challenge for video semantic segmentation is the lack of labeled data. In most benchmark datasets, only one frame of a video clip is annotated, which makes most supervised methods fail to utilize information from the rest of the frames. To exploit the spatio-temporal information in videos, many previous works use pre-computed optical flows, which encode the temporal consistency to improve video segmentation. However, video segmentation and optical flow estimation are still considered as two separate tasks. In this paper, we propose a novel framework for joint video semantic segmentation and optical flow estimation. Semantic segmentation brings semantic information to handle occlusion for more robust optical flow estimation, while the non-occluded optical flow provides accurate pixel-level temporal correspondences to guarantee the temporal consistency of the segmentation. Moreover, our framework is able to utilize both labeled and unlabeled frames in a video through joint training, while no additional computation is required at inference. Extensive experiments show that the proposed model makes video semantic segmentation and optical flow estimation benefit each other and outperforms existing methods under the same settings on both tasks.

TIST Journal 2020 Journal Article

Exploring Correlation Network for Cheating Detection

  • Ping Luo
  • Kai Shu
  • Junjie Wu
  • Li Wan
  • Yong Tan

The correlation network, typically formed by computing pairwise correlations between variables, has recently become a competitive paradigm to discover insights in various application domains, such as climate prediction, financial marketing, and bioinformatics. In this study, we adopt this paradigm to detect cheating behavior hidden in business distribution channels, where falsified big deals are often made by collusive partners to obtain lower product prices—a behavior deemed to be extremely harmful to the sale ecosystem. To this end, we assume that abnormal deals are likely to occur between two partners if their purchase-volume sequences have a strong negative correlation. This seemingly intuitive rule, however, imposes several research challenges. First, existing correlation measures are usually symmetric and thus cannot distinguish the different roles of partners in cheating. Second, the tick-to-tick correspondence between two sequences might be violated due to the possible delay of purchase behavior, which should also be captured by correlation measures. Finally, the fact that any pair of sequences could be correlated may result in a number of false-positive cheating pairs, which need to be corrected in a systematic manner. To address these issues, we propose a correlation network analysis framework for cheating detection. In the framework, we adopt an asymmetric correlation measure to distinguish the two roles, namely, cheating seller and cheating buyer, in a cheating alliance. Dynamic Time Warping is employed to address the time offset between two sequences in computing the correlation. We further propose two graph-cut methods to convert the correlation network into a bipartite graph to rank cheating partners, which simultaneously helps to remove false-positive correlation pairs. Based on a 4-year real-world channel dataset from a worldwide IT company, we demonstrate the effectiveness of the proposed method in comparison to competitive baseline methods.
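
A minimal sketch of the DTW ingredient only: z-score two purchase-volume sequences and measure how closely one tracks the negation of the other under warping. The paper's asymmetric correlation measure and graph-cut ranking are not reproduced here, and the sequences are made up:

```python
import numpy as np

def dtw_distance(a, b):
    """Plain dynamic-time-warping distance between two 1-D sequences,
    tolerating the purchase-delay misalignment described above."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

z = lambda s: (s - s.mean()) / s.std()
seller = np.array([5.0, 1.0, 6.0, 0.0, 7.0])   # made-up purchase volumes
buyer = np.array([0.0, 6.0, 1.0, 7.0, 0.0])
score = dtw_distance(z(seller), -z(buyer))     # small => strong negative correlation
print(f"negative-correlation score (lower is stronger): {score:.2f}")
```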

AAAI Conference 2019 Conference Paper

Talking Face Generation by Adversarially Disentangled Audio-Visual Representation

  • Hang Zhou
  • Yu Liu
  • Ziwei Liu
  • Ping Luo
  • Xiaogang Wang

Talking face generation aims to synthesize a sequence of face images that correspond to a clip of speech. This is a challenging task because face appearance variation and the semantics of speech are coupled together in the subtle movements of the talking face regions. Existing works either construct specific face appearance models for specific subjects or model the transformation between lip motion and speech. In this work, we integrate both aspects and enable arbitrary-subject talking face generation by learning disentangled audio-visual representations. We find that the talking face sequence is actually a composition of both subject-related information and speech-related information. These two spaces are then explicitly disentangled through a novel associative-and-adversarial training process. This disentangled representation has the advantage that both audio and video can serve as inputs for generation. Extensive experiments show that the proposed approach generates realistic talking face sequences on arbitrary subjects with much clearer lip motion patterns than previous work. We also demonstrate that the learned audio-visual representation is extremely useful for the tasks of automatic lip reading and audio-video retrieval.

IJCAI Conference 2018 Conference Paper

Adaboost with Auto-Evaluation for Conversational Models

  • Juncen Li
  • Ping Luo
  • Ganbin Zhou
  • Fen Lin
  • Cheng Niu

We propose a boosting method for conversational models to encourage them to generate more human-like dialogs. In our method, we consider existing conversational models as weak generators and apply Adaboost to update those models. However, conventional Adaboost cannot be directly applied to conversational models: because a simple comparison between the true output y (for an input x) and its predicted output y' cannot directly evaluate the learning performance on x, conventional Adaboost cannot adaptively adjust instance weights for subsequent learning. To address this issue, we develop Adaboost with Auto-Evaluation (AwE). In AwE, an auto-evaluator is proposed to evaluate the predicted results, which makes the method applicable to conversational models. Furthermore, we present a theoretical analysis showing that the training error drops exponentially fast only if a certain assumption on the proposed auto-evaluator holds. Finally, we empirically show that AwE visibly boosts the performance of existing single conversational models and also outperforms other ensemble methods for conversational models.

AAAI Conference 2018 Conference Paper

Conversational Model Adaptation via KL Divergence Regularization

  • Juncen Li
  • Ping Luo
  • Fen Lin
  • Bo Chen

In this study we formulate the problem of conversational model adaptation, where we aim to build a generative conversational model for a target domain based on a limited amount of dialogue data from this target domain and some existing dialogue models from related source domains. This model facilitates the fast building of a chatbot platform, where a new vertical chatbot with only a small amount of conversation data can be supported by other related mature chatbots. Previous studies on model adaptation and transfer learning mostly focus on classification and recommendation problems; how these models work for conversation generation remains unexplored. To this end, we leverage KL divergence (KLD) regularization to adapt existing conversational models. Specifically, KLD is employed to measure the distance between the source and target domains. Adding KLD as a regularizer to the objective function allows the proposed method to utilize information from source domains effectively. We also evaluate the performance of this adaptation model for online chatbots on the WeChat public accounts platform using both the BLEU metric and human judgement. The experiments empirically show that the proposed method visibly improves these evaluation metrics.
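
A minimal sketch of a KLD-regularized adaptation objective under stated assumptions: per-token logits from a frozen source-domain model regularize a trainable target-domain model; the weight `lam` is an illustrative hyperparameter, not the paper's value:

```python
import torch
import torch.nn.functional as F

def adaptation_loss(target_logits, source_logits, labels, lam=0.5):
    """Cross-entropy on the small target-domain data plus a KL term that keeps
    the target model's token distribution close to the frozen source model's.
    Shapes: logits (batch, seq, vocab), labels (batch, seq)."""
    ce = F.cross_entropy(target_logits.flatten(0, 1), labels.flatten())
    kl = F.kl_div(F.log_softmax(target_logits, dim=-1),
                  F.softmax(source_logits.detach(), dim=-1),
                  reduction="batchmean")
    return ce + lam * kl
```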

AAAI Conference 2018 Conference Paper

Elastic Responding Machine for Dialog Generation with Dynamically Mechanism Selecting

  • Ganbin Zhou
  • Ping Luo
  • Yijun Xiao
  • Fen Lin
  • Bo Chen
  • Qing He

Neural models that aim to generate meaningful and diverse responses have attracted increasing attention in recent years. For a given post, conventional encoder-decoder models tend to learn high-frequency but trivial responses, or struggle to determine which speaking styles are suitable for generating responses. To address this issue, we propose the elastic responding machine (ERM), which is based on a proposed encoder-diverter-filter-decoder framework. ERM models multiple responding mechanisms to not only generate acceptable responses for a given post but also improve the diversity of responses. Here, the mechanisms can be regarded as latent variables, and for a given post different responses may be generated by different mechanisms. The experiments demonstrate the quality and diversity of the generated responses, intuitively show how the learned model controls the response mechanism when responding, and reveal some underlying relationships between mechanism and language style.

NeurIPS Conference 2018 Conference Paper

Kalman Normalization: Normalizing Internal Representations Across Network Layers

  • Guangrun Wang
  • Jiefeng Peng
  • Ping Luo
  • Xinjiang Wang
  • Liang Lin

As an indispensable component, Batch Normalization (BN) has successfully improved the training of deep neural networks (DNNs) with mini-batches by normalizing the distribution of the internal representation for each hidden layer. However, the effectiveness of BN diminishes in the micro-batch scenario (e.g., fewer than 4 samples in a mini-batch), since the statistics estimated from a mini-batch are not reliable with insufficient samples. This limits BN's applicability to training larger models for segmentation, detection, and video-related problems, which require small batches constrained by memory consumption. In this paper, we present a novel normalization method, called Kalman Normalization (KN), for improving and accelerating the training of DNNs, particularly under the context of micro-batches. Specifically, unlike existing solutions that treat each hidden layer as an isolated system, KN treats all the layers in a network as a whole system, and estimates the statistics of a certain layer by considering the distributions of all its preceding layers, mimicking the merits of Kalman filtering. On ResNet50 trained on ImageNet, KN has 3.4% lower error than its BN counterpart when using a batch size of 4; even when using typical batch sizes, KN still maintains an advantage over BN while other BN variants suffer a performance degradation. Moreover, KN can be naturally generalized to many existing normalization variants to obtain gains, e.g., equipping Group Normalization with Group Kalman Normalization (GKN). KN can outperform BN and its variants for large-scale object detection and segmentation tasks on COCO 2017.

AAAI Conference 2018 Conference Paper

Mix-and-Match Tuning for Self-Supervised Semantic Segmentation

  • Xiaohang Zhan
  • Ziwei Liu
  • Ping Luo
  • Xiaoou Tang
  • Chen Loy

Deep convolutional networks for semantic image segmentation typically require large-scale labeled data, e.g., ImageNet and MS COCO, for network pre-training. To reduce annotation efforts, self-supervised semantic segmentation has recently been proposed to pre-train a network without any human-provided labels. The key to this new form of learning is to design a proxy task (e.g., image colorization), from which a discriminative loss can be formulated on unlabeled data. Many proxy tasks, however, lack the critical supervision signals that could induce discriminative representations for the target image segmentation task. Thus self-supervision's performance is still far from that of supervised pre-training. In this study, we overcome this limitation by incorporating a ‘mix-and-match’ (M&M) tuning stage into the self-supervision pipeline. The proposed approach is readily pluggable into many self-supervision methods and does not use more annotated samples than the original process. Yet, it is capable of boosting the performance of the target image segmentation task to surpass the fully-supervised pre-trained counterpart. The improvement is made possible by better harnessing the limited pixel-wise annotations in the target dataset. Specifically, we first introduce the ‘mix’ stage, which sparsely samples and mixes patches from the target set to reflect rich and diverse local patch statistics of target images. A ‘match’ stage then forms a class-wise connected graph, which can be used to derive a strong triplet-based discriminative loss for fine-tuning the network. Our paradigm follows the standard practice in existing self-supervised studies and no extra data or labels are required. With the proposed M&M approach, for the first time, a self-supervision method can achieve comparable or even better performance than its ImageNet pre-trained counterpart on both the PASCAL VOC2012 and CityScapes datasets.

AAAI Conference 2018 Conference Paper

Spatial as Deep: Spatial CNN for Traffic Scene Understanding

  • Xingang Pan
  • Jianping Shi
  • Ping Luo
  • Xiaogang Wang
  • Xiaoou Tang

Convolutional neural networks (CNNs) are usually built by stacking convolutional operations layer-by-layer. Although CNNs have shown a strong capability to extract semantics from raw pixels, their capacity to capture spatial relationships of pixels across rows and columns of an image is not fully explored. These relationships are important for learning semantic objects with strong shape priors but weak appearance coherence, such as traffic lanes, which are often occluded or not even painted on the road surface as shown in Fig. 1 (a). In this paper, we propose Spatial CNN (SCNN), which generalizes traditional deep layer-by-layer convolutions to slice-by-slice convolutions within feature maps, thus enabling message passing between pixels across rows and columns in a layer. Such an SCNN is particularly suitable for long continuous shape structures or large objects with strong spatial relationships but fewer appearance clues, such as traffic lanes, poles, and walls. We apply SCNN to a newly released, very challenging traffic lane detection dataset and the Cityscapes dataset. The results show that SCNN can learn the spatial relationship for structured output and significantly improves performance. We show that SCNN outperforms the recurrent neural network (RNN) based ReNet and MRF+CNN (MRFNet) on the lane detection dataset by 8.7% and 4.6% respectively. Moreover, our SCNN won 1st place in the TuSimple Benchmark Lane Detection Challenge, with an accuracy of 96.53%.
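
A minimal sketch of one directional slice-by-slice pass (top-to-bottom): each row of the feature map receives a convolved, ReLU-gated message from the row above. The full SCNN runs four such passes (down, up, right, left); the channel count and kernel width here are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCNNDown(nn.Module):
    """Top-to-bottom spatial message passing with a shared 1-by-k convolution."""
    def __init__(self, channels=128, k=9):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))

    def forward(self, x):                 # x: (B, C, H, W)
        rows = list(x.split(1, dim=2))    # H slices of shape (B, C, 1, W)
        for i in range(1, len(rows)):
            rows[i] = rows[i] + F.relu(self.conv(rows[i - 1]))
        return torch.cat(rows, dim=2)

x = torch.randn(1, 128, 36, 100)
y = SCNNDown()(x)                         # same shape, rows now exchange messages
```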

AAAI Conference 2018 Conference Paper

Tree-Structured Neural Machine for Linguistics-Aware Sentence Generation

  • Ganbin Zhou
  • Ping Luo
  • Rongyu Cao
  • Yijun Xiao
  • Fen Lin
  • Bo Chen
  • Qing He

Different from other sequential data, sentences in natural language are structured by linguistic grammars. Previous generative conversational models with chain-structured decoders ignore this structure in human language and might generate plausible responses with less satisfactory relevance and fluency. In this study, we aim to incorporate the results of linguistic analysis into the process of sentence generation for high-quality conversation generation. Specifically, we use a dependency parser to transform each response sentence into a dependency tree and construct a training corpus of sentence-tree pairs. A tree-structured decoder is developed to learn the mapping from a sentence to its tree, where different types of hidden states are used to depict the local dependencies from an internal tree node to its children. For training acceleration, we propose a tree canonicalization method, which transforms trees into equivalent ternary trees. Then, with a proposed tree-structured search method, the model is able to generate the most probable responses in the form of dependency trees, which are finally flattened into sequences as the system output. Experimental results demonstrate that the proposed X2TREE framework outperforms baseline methods with an 11.15% increase in acceptance ratio.

IJCAI Conference 2017 Conference Paper

EigenNet: Towards Fast and Structural Learning of Deep Neural Networks

  • Ping Luo

Deep Neural Networks (DNNs) are difficult to train and easy to overfit. We address these two issues by introducing EigenNet, an architecture that not only accelerates training but also adjusts the number of hidden neurons to reduce over-fitting. These goals are achieved by whitening the information flows of DNNs and removing the eigenvectors that may capture noise. The former improves the conditioning of the Fisher information matrix, whilst the latter increases generalization capability. These appealing properties of EigenNet can benefit many recent DNN structures, such as network-in-network and Inception, by wrapping their hidden layers into the layers of EigenNet. The modeling capacities of the original networks are preserved. Both the training wall-clock time and the number of updates are reduced by using EigenNet, compared to stochastic gradient descent, on various datasets, including MNIST, CIFAR-10, and CIFAR-100.

AAAI Conference 2017 Conference Paper

Mechanism-Aware Neural Machine for Dialogue Response Generation

  • Ganbin Zhou
  • Ping Luo
  • Rongyu Cao
  • Fen Lin
  • Bo Chen
  • Qing He

Responses to the same utterance in everyday dialogue can vary widely in terms of content semantics, speaking styles, communication intentions and so on. Previous generative conversational models ignore these 1-to-n relationships between a post and its diverse responses, and tend to return high-frequency but meaningless responses. In this study we propose a mechanism-aware neural machine for dialogue response generation. It assumes that there exist some latent responding mechanisms, each of which can generate different responses for a single input post. With this assumption we model different responding mechanisms as latent embeddings, and develop an encoder-diverter-decoder framework to train its modules in an end-to-end fashion. With the learned latent mechanisms, for the first time these decomposed modules can be used to encode the input into mechanism-aware contexts, and decode the responses with controlled generation styles and topics. Finally, experiments with human judgements, intuitive examples, and detailed discussions demonstrate the quality and diversity of the generated responses, with a 9.80% increase in acceptance ratio over the best of six baseline methods.

TIST Journal 2017 Journal Article

Supervised Representation Learning with Double Encoding-Layer Autoencoder for Transfer Learning

  • Fuzhen Zhuang
  • Xiaohu Cheng
  • Ping Luo
  • Sinno Jialin Pan
  • Qing He

Transfer learning has gained a lot of attention and interest in the past decade. One crucial research issue in transfer learning is how to find a good representation for instances of different domains such that the divergence between domains can be reduced with the new representation. Recently, deep learning has been proposed to learn more robust or higher-level features for transfer learning. In this article, we adapt the autoencoder technique to transfer learning and propose a supervised representation learning method based on a double encoding-layer autoencoder. The proposed framework consists of two encoding layers: one for embedding and the other one for label encoding. In the embedding layer, the distribution distance of the embedded instances between the source and target domains is minimized in terms of KL-Divergence. In the label encoding layer, label information of the source domain is encoded using a softmax regression model. Moreover, to empirically explore why the proposed framework can work well for transfer learning, we propose a new effective measure based on autoencoder to compute the distribution distance between different domains. Experimental results show that the proposed new measure can better reflect the degree of transfer difficulty and has stronger correlation with the performance from supervised learning algorithms (e.g., Logistic Regression), compared with previous ones, such as KL-Divergence and Maximum Mean Discrepancy. Therefore, in our model, we have incorporated two distribution distance measures to minimize the difference between source and target domains in the embedding representations. Extensive experiments conducted on three real-world image datasets and one text dataset demonstrate the effectiveness of our proposed method compared with several state-of-the-art baseline methods.

IJCAI Conference 2016 Conference Paper

Browsing Regularities in Hedonic Content Systems

  • Ping Luo
  • Ganbin Zhou
  • Jiaxi Tang
  • Rui Chen
  • Zhongjie Yu
  • Qing He

Various hedonic content systems (e.g., mobile apps for video, music, news, jokes, pictures, social networks, etc.) increasingly dominate people's daily leisure time. This paper studies common regularities of browsing behaviors in these systems, based on a large dataset of user logs. We found that despite differences in visit time and user types, the distribution over browsing length for a visit can be described by the inverse Gaussian form with very high precision. It indicates that the choice threshold model of decision making, on whether to continue browsing or leave, does exist. We also found that the stimulus intensity, in terms of the amount of recently enjoyed items, affects the probability of continuing browsing in an inverted-U curve. We discuss the possible origin of this curve based on a proposed Award-Aversion Contest model. This hypothesis is supported by the empirical study, which shows that the proposed model can successfully recover the original inverse Gaussian distribution for the browsing length. These browsing regularities can be used to develop better organization of hedonic content, which helps attract more user dwell time in these systems.
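
An illustrative check of the inverse-Gaussian claim on synthetic browsing lengths (scipy's `invgauss` stands in for the paper's fitting procedure; the parameters are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for per-visit browsing lengths
lengths = stats.invgauss.rvs(mu=0.5, scale=20, size=10_000, random_state=rng)

params = stats.invgauss.fit(lengths, floc=0)         # fix the location at zero
ks = stats.kstest(lengths, "invgauss", args=params)  # goodness-of-fit check
print(f"fitted (mu, loc, scale) = {params}, KS p-value = {ks.pvalue:.3f}")
```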

AAAI Conference 2016 Conference Paper

Face Model Compression by Distilling Knowledge from Neurons

  • Ping Luo
  • Zhenyao Zhu
  • Ziwei Liu
  • Xiaogang Wang
  • Xiaoou Tang

Recent advanced face recognition systems are built on large Deep Neural Networks (DNNs) or their ensembles, which have millions of parameters. However, the expensive computation of DNNs makes their deployment difficult on mobile and embedded devices. This work addresses model compression for face recognition, where the learned knowledge of a large teacher network or its ensemble is utilized as supervision to train a compact student network. Unlike previous works that represent the knowledge by softened label probabilities, which are difficult to fit, we represent the knowledge by using the neurons at the higher hidden layer, which preserve as much information as the label probabilities but are more compact. By leveraging the essential characteristics (domain knowledge) of the learned face representation, a neuron selection method is proposed to choose the neurons that are most relevant to face recognition. Using the selected neurons as supervision to mimic the single networks of DeepID2+ and DeepID3, which are state-of-the-art face recognition systems, a compact student with a simple network structure achieves better verification accuracy on LFW than each of its teachers. When using an ensemble of DeepID2+ as the teacher, a mimicked student is able to outperform it and achieves a 51.6× compression ratio and a 90× speed-up in inference, making this cumbersome model applicable on portable devices.
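
A minimal sketch of the neuron-mimicking objective: the student regresses the teacher's selected top-hidden-layer activations instead of soft labels. The `selected` index is a hypothetical stand-in for the paper's neuron selection method:

```python
import torch
import torch.nn.functional as F

def neuron_mimic_loss(student_feat, teacher_feat, selected):
    """MSE between student and (frozen) teacher activations at chosen neurons."""
    return F.mse_loss(student_feat[:, selected], teacher_feat[:, selected].detach())

teacher_feat = torch.randn(8, 512)                 # teacher's top hidden layer
student_feat = torch.randn(8, 512, requires_grad=True)
selected = torch.arange(0, 512, 2)                 # pretend half were selected
loss = neuron_mimic_loss(student_feat, teacher_feat, selected)
loss.backward()                                    # gradients flow to the student
```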

AAAI Conference 2015 Conference Paper

Deep Representation Learning with Target Coding

  • Shuo Yang
  • Ping Luo
  • Chen Change Loy
  • Kenneth W. Shum
  • Xiaoou Tang

We consider the problem of learning deep representations when target labels are available. In this paper, we show that there exists an intrinsic relationship between target coding and feature representation learning in deep networks. Specifically, we found that a distributed binary code with error-correcting capability is more capable of encouraging discriminative features than the 1-of-K coding typically used in supervised deep learning. This new finding reveals an additional benefit of using error-correcting codes for deep model learning, apart from their well-known error-correcting property. Extensive experiments are conducted on popular visual benchmark datasets.

IJCAI Conference 2015 Conference Paper

Matrix Factorization with Scale-Invariant Parameters

  • Guangxiang Zeng
  • Hengshu Zhu
  • Qi Liu
  • Ping Luo
  • Enhong Chen
  • Tong Zhang

Tuning hyper-parameters for large-scale matrix factorization (MF) is very time-consuming and sometimes unacceptable. Intuitively, we want to tune hyper-parameters on a small sub-matrix sample and then apply them to the original large-scale matrix. However, most existing MF methods are scale-variant, which means the optimal hyper-parameters usually change with the scale of the matrix. To this end, in this paper we propose a scale-invariant parametric MF method, where a set of scale-invariant parameters is defined for model complexity regularization. The proposed method can therefore free us from tuning hyper-parameters on the large-scale matrix, and achieve good performance in a more efficient way. Extensive experiments on a real-world dataset clearly validate both the effectiveness and efficiency of our method.

IJCAI Conference 2015 Conference Paper

Supervised Representation Learning: Transfer Learning with Deep Autoencoders

  • Fuzhen Zhuang
  • Xiaohu Cheng
  • Ping Luo
  • Sinno Jialin Pan
  • Qing He

Transfer learning has attracted a lot of attention in the past decade. One crucial research issue in transfer learning is how to find a good representation for instances of different domains such that the divergence between domains can be reduced with the new representation. Recently, deep learning has been proposed to learn more robust or higher-level features for transfer learning. However, to the best of our knowledge, most of the previous approaches neither minimize the difference between domains explicitly nor encode label information in learning the representation. In this paper, we propose a supervised representation learning method based on deep autoencoders for transfer learning. The proposed deep autoencoder consists of two encoding layers: an embedding layer and a label encoding layer. In the embedding layer, the distance in distributions of the embedded instances between the source and target domains is minimized in terms of KL-Divergence. In the label encoding layer, label information of the source domain is encoded using a softmax regression model. Extensive experiments conducted on three real-world image datasets demonstrate the effectiveness of our proposed method compared with several state-of-the-art baseline methods.

NeurIPS Conference 2014 Conference Paper

Multi-View Perceptron: a Deep Model for Learning Face Identity and View Representations

  • Zhenyao Zhu
  • Ping Luo
  • Xiaogang Wang
  • Xiaoou Tang

Various factors, such as identities, views (poses), and illuminations, are coupled in face images. Disentangling the identity and view representations is a major challenge in face recognition. Existing face recognition systems either use handcrafted features or learn features discriminatively to improve recognition accuracy. This is different from the behavior of the human brain. Intriguingly, even without accessing 3D data, humans can not only recognize face identity but also imagine face images of a person under different viewpoints given a single 2D image, making face perception in the brain robust to view changes. In this sense, the human brain has learned and encoded 3D face models from 2D images. To take this instinct into account, this paper proposes a novel deep neural net, named multi-view perceptron (MVP), which can untangle the identity and view features and, in the meanwhile, infer a full spectrum of multi-view images given a single 2D face image. The identity features of MVP achieve superior performance on the MultiPIE dataset. MVP is also capable of interpolating and predicting images under viewpoints that are unobserved in the training data.

ICRA Conference 2014 Conference Paper

Timed automata based motion planning for a self-assembly robot system

  • Rui Wang 0024
  • Ping Luo
  • Yong Guan
  • Hongxing Wei
  • Xiaojuan Li
  • Jie Zhang 0074
  • Xiaoyu Song

Sambot is a modular robot system with the advantage of self-assembly. A target robotic configuration can be organized by a group of Sambots. A novel motion planning method for the Sambot configuration using model checking is presented in this paper. This hierarchical method contains two layers. The abstract logic layer is responsible for the discrete planning of the Sambot configuration. The robot and the environment are both modeled as timed automata. System requirements are formalized as Computational Tree Logic (CTL) formulas. Model checking is applied to the system model, and the verification result gives the optimal discrete plans for the Sambot configuration. In the physical layer, a sample-based planner generates the trajectory trace considering the dynamics of Sambot and the suggested high-level plans. The experimental results illustrate the effectiveness of our approach.

IJCAI Conference 2013 Conference Paper

Concept Learning for Cross-Domain Text Classification: A General Probabilistic Framework

  • Fuzhen Zhuang
  • Ping Luo
  • Peifeng Yin
  • Qing He
  • Zhongzhi Shi

Cross-domain learning targets leveraging the knowledge from source domains to train accurate models for test data from target domains with different but related data distributions. To tackle the challenge of data distribution differences in terms of raw features, previous works proposed to mine high-level concepts (e.g., word clusters) across data domains, which are shown to be more appropriate for classification. However, all these works assume that the same set of concepts is shared in the source and target domains, even though some distinct concepts may exist only in one of the data domains. Thus, we need a general framework which can incorporate both shared and distinct concepts for cross-domain classification. To this end, we develop a probabilistic model in which both the shared and distinct concepts can be learned by an EM process which optimizes the data likelihood. To validate the effectiveness of this model, we intentionally construct classification tasks where distinct concepts exist in the data domains. The systematic experiments demonstrate the superiority of our model over all compared baselines, especially on the much more challenging tasks.

IJCAI Conference 2011 Conference Paper

Combining Supervised and Unsupervised Models via Unconstrained Probabilistic Embedding

  • Xudong Ma
  • Ping Luo
  • Fuzhen Zhuang
  • Qing He
  • Zhongzhi Shi
  • Zhiyong Shen

Ensemble learning with output from multiple supervised and unsupervised models aims to improve the classification accuracy of a supervised model ensemble by jointly considering the grouping results from unsupervised models. In this paper we cast this ensemble task as an unconstrained probabilistic embedding problem. Specifically, we assume both objects and classes/clusters have latent coordinates without constraints in a D-dimensional Euclidean space, and consider the mapping from the embedded space into the space of results from supervised and unsupervised models as a probabilistic generative process. The prediction of an object is then determined by the distances between the object and the classes in the embedded space. A solution of this embedding can be obtained using the quasi-Newton method, resulting in the objects and classes/clusters with high co-occurrence weights being embedded close together. We demonstrate the benefits of this unconstrained embedding method in three real applications.