Author name cluster

Deli Zhao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

35 papers
2 author rows

Possible papers

AAAI Conference 2026 Conference Paper

Towards Affordance-Aware Robotic Dexterous Grasping with Human-like Priors

  • Haoyu Zhao
  • Linghao Zhuang
  • Xingyue Zhao
  • Cheng Zeng
  • Haoran Xu
  • Yuming Jiang
  • Jun CEN
  • Kexiang Wang

A dexterous hand capable of generalizable object grasping is fundamental for the development of general-purpose embodied AI. However, previous methods focus narrowly on low-level grasp stability metrics, neglecting affordance-aware positioning and human-like poses, which are crucial for downstream manipulation. To address these limitations, we propose AffordDex, a novel framework with two-stage training that learns a universal grasping policy with an inherent understanding of both motion priors and object affordances. In the first stage, a trajectory imitator is pre-trained on a large corpus of human hand motions to instill a strong prior for natural movement. In the second stage, a residual module is trained to adapt these general human-like motions to specific object instances. This refinement is critically guided by two components: our Negative Affordance-aware Segmentation (NAA) module, which identifies functionally inappropriate contact regions, and a privileged teacher-student distillation process that ensures the final vision-based policy is highly successful. Extensive experiments demonstrate that AffordDex not only achieves universal dexterous grasping but also remains remarkably human-like in posture and functionally appropriate in contact location. As a result, AffordDex significantly outperforms state-of-the-art baselines across seen objects, unseen instances, and even entirely novel categories.

ICLR Conference 2025 Conference Paper

CirT: Global Subseasonal-to-Seasonal Forecasting with Geometry-inspired Transformer

  • Yang Liu 0165
  • Zinan Zheng
  • Jiashun Cheng
  • Fugee Tsung
  • Deli Zhao
  • Yu Rong 0001
  • Jia Li 0009

Accurate Subseasonal-to-Seasonal (S2S) climate forecasting is pivotal for decision-making including agriculture planning and disaster preparedness but is known to be challenging due to its chaotic nature. Although recent data-driven models have shown promising results, their performance is limited by inadequate consideration of geometric inductive biases. Usually, they treat the spherical weather data as planar images, resulting in an inaccurate representation of locations and spatial relations. In this work, we propose the geometric-inspired Circular Transformer (CirT) to model the cyclic characteristic of the graticule, consisting of two key designs: (1) Decomposing the weather data by latitude into circular patches that serve as input tokens to the Transformer; (2) Leveraging Fourier transform in self-attention to capture the global information and model the spatial periodicity. Extensive experiments on the Earth Reanalysis 5 (ERA5) reanalysis dataset demonstrate our model yields a significant improvement over the advanced data-driven models, including PanguWeather and GraphCast, as well as skillful ECMWF systems. Additionally, we empirically show the effectiveness of our model designs and high-quality prediction over spatial and temporal dimensions.
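The two designs named in the abstract above can be illustrated with a toy sketch. It is not the CirT implementation: the helper names (`ring_tokens`, `dft`, `rotate`) and the grid values are hypothetical, and a naive DFT stands in for whatever Fourier operation the model uses inside self-attention. It only shows that when each latitude row is treated as a circular token, the magnitude spectrum along the ring is unchanged by a rotation in longitude, which is the periodicity a planar-image treatment ignores.

```python
import cmath
import math


def dft(ring):
    """Naive discrete Fourier transform of one latitude ring (O(n^2))."""
    n = len(ring)
    return [
        sum(ring[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
        for k in range(n)
    ]


def ring_tokens(grid):
    """Treat each latitude row of a lat x lon grid as one circular token."""
    return [list(row) for row in grid]


def magnitude_spectrum(ring):
    """Magnitude of each Fourier coefficient of a ring."""
    return [abs(c) for c in dft(ring)]


def rotate(ring, shift):
    """Circularly shift a ring in longitude by `shift` cells."""
    shift %= len(ring)
    return ring[-shift:] + ring[:-shift]


# Toy 2 x 8 field (hypothetical values): 2 latitudes, 8 longitudes.
grid = [
    [1.0, 2.0, 4.0, 2.0, 1.0, 0.0, -1.0, 0.0],
    [0.5, 0.5, 1.5, 3.0, 1.5, 0.5, 0.5, 0.0],
]
tokens = ring_tokens(grid)

original = magnitude_spectrum(tokens[0])
shifted = magnitude_spectrum(rotate(tokens[0], 3))

# Largest change in any frequency magnitude after rotating the ring.
max_diff = max(abs(a - b) for a, b in zip(original, shifted))
```

Here `max_diff` is zero up to floating-point rounding, since a circular shift only changes the phase of each Fourier coefficient.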

NeurIPS Conference 2025 Conference Paper

EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?

  • Yuqian Yuan
  • Ronghao Dang
  • Long Li
  • Wentong Li
  • Dian Jiao
  • Xin Li
  • Deli Zhao
  • Fan Wang

The emergence of multimodal large language models (MLLMs) has driven breakthroughs in egocentric vision applications. These applications necessitate persistent, context-aware understanding of objects, as users interact with tools in dynamic and cluttered environments. However, existing embodied benchmarks primarily focus on static scene exploration, emphasizing objects' appearance and spatial attributes while neglecting the assessment of dynamic changes arising from users' interactions and of the object-level spatiotemporal reasoning capabilities required for real-world interactions. To address this gap, we introduce EOC-Bench, an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. Specifically, EOC-Bench features 3,277 meticulously annotated QA pairs categorized into three temporal categories: Past, Present, and Future, covering 11 fine-grained evaluation dimensions and 3 visual object referencing types. To ensure thorough assessment, we develop a mixed-format human-in-the-loop annotation framework. Based on EOC-Bench, we conduct comprehensive evaluations of various proprietary, open-source, and object-level MLLMs. EOC-Bench serves as a crucial tool for advancing the embodied object cognitive capabilities of MLLMs, establishing a robust foundation for developing reliable core models for embodied systems.

ICML Conference 2025 Conference Paper

Large Language-Geometry Model: When LLM meets Equivariance

  • Zongzhao Li
  • Jiacheng Cen
  • Bing Su 0001
  • Tingyang Xu
  • Yu Rong 0001
  • Deli Zhao
  • Wenbing Huang 0001

Accurately predicting 3D structures and dynamics of physical systems is crucial in scientific applications. Existing approaches that rely on geometric Graph Neural Networks (GNNs) effectively enforce $\mathrm{E}(3)$-equivariance, but they often fail to leverage extensive broader information. While direct application of Large Language Models (LLMs) can incorporate external knowledge, LLMs lack the capability for spatial reasoning with guaranteed equivariance. In this paper, we propose EquiLLM, a novel framework for representing 3D physical systems that seamlessly integrates $\mathrm{E}(3)$-equivariance with LLM capabilities. Specifically, EquiLLM comprises four key components: geometry-aware prompting, an equivariant encoder, an LLM, and an equivariant adapter. Essentially, the LLM guided by the instructive prompt serves as a sophisticated invariant feature processor, while 3D directional information is exclusively handled by the equivariant encoder and adapter modules. Experimental results demonstrate that EquiLLM delivers significant improvements over previous methods across molecular dynamics simulation, human motion simulation, and antibody design, highlighting its promising generalizability.
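The split between invariant and directional information described above can be made concrete with a small sketch. This is not EquiLLM code; the function names and the point set are hypothetical, and only rotations about one axis plus translations are tested (reflections, which $\mathrm{E}(3)$ also includes, are omitted). It shows that pairwise distances stay fixed under such rigid motions, so they are the kind of quantity an invariant processor could safely consume, while the raw coordinates move and need equivariant handling.

```python
import math


def rotate_z(points, angle):
    """Rotate 3D points about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def translate(points, offset):
    """Shift every point by the same offset vector."""
    ox, oy, oz = offset
    return [(x + ox, y + oy, z + oz) for x, y, z in points]


def pairwise_distances(points):
    """Invariant features: distances between all unordered point pairs."""
    return [
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]


# Hypothetical 4-particle system.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
moved = translate(rotate_z(points, 0.7), (1.0, -2.0, 0.5))

before = pairwise_distances(points)
after = pairwise_distances(moved)

# Coordinates change, but the invariant features do not (up to rounding).
max_change = max(abs(a - b) for a, b in zip(before, after))
```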

ICLR Conference 2025 Conference Paper

MolSpectra: Pre-training 3D Molecular Representation with Multi-modal Energy Spectra

  • Liang Wang 0056
  • Shaozhen Liu
  • Yu Rong 0001
  • Deli Zhao
  • Qiang Liu 0006
  • Shu Wu
  • Liang Wang 0001

Establishing the relationship between 3D structures and the energy states of molecular systems has proven to be a promising approach for learning 3D molecular representations. However, existing methods are limited to modeling the molecular energy states from classical mechanics. This limitation results in a significant oversight of quantum mechanical effects, such as quantized (discrete) energy level structures, which offer a more accurate estimation of molecular energy and can be experimentally measured through energy spectra. In this paper, we propose to utilize the energy spectra to enhance the pre-training of 3D molecular representations (MolSpectra), thereby infusing the knowledge of quantum mechanics into the molecular representations. Specifically, we propose SpecFormer, a multi-spectrum encoder for encoding molecular spectra via masked patch reconstruction. By further aligning outputs from the 3D encoder and spectrum encoder using a contrastive objective, we enhance the 3D encoder's understanding of molecules. Evaluations on public benchmarks reveal that our pre-trained representations surpass existing methods in predicting molecular properties and modeling dynamics.
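The abstract above mentions aligning the 3D encoder and the spectrum encoder with a contrastive objective but does not specify its form. The sketch below assumes an InfoNCE-style loss, which is one common choice and may differ from what MolSpectra uses; the function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(struct_embs, spec_embs, temperature=0.1):
    """Mean InfoNCE loss; row i of each list is assumed to be the same molecule."""
    s = [normalize(v) for v in struct_embs]
    p = [normalize(v) for v in spec_embs]
    losses = []
    for i, si in enumerate(s):
        logits = [dot(si, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings for three molecules (hypothetical values).
embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

aligned = info_nce(embs, embs)             # each pair matches its partner
mismatched = info_nce(embs, embs[::-1])    # partners are shuffled
```

The loss is lower when each 3D embedding is closest to the spectrum embedding of the same molecule, which is the behavior a contrastive alignment objective is meant to reward.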

NeurIPS Conference 2025 Conference Paper

Non-stationary Equivariant Graph Neural Networks for Physical Dynamics Simulation

  • Chaohao Yuan
  • Maoji Wen
  • Ercan KURUOGLU
  • Yang Liu
  • Jia Li
  • Tingyang Xu
  • Deli Zhao
  • Hong Cheng

To enhance the generalization ability of graph neural networks (GNNs) in learning and simulating physical dynamics, a series of equivariant GNNs have been developed to incorporate the symmetric inductive bias. However, the existing methods do not take into account the non-stationary nature of physical dynamics, where the joint distribution changes over time. Moreover, previous approaches for modeling non-stationary time series typically involve normalizing the data, which disrupts the symmetric assumption inherent in physical dynamics. To model non-stationary physical dynamics while preserving the symmetric inductive bias, we introduce a Non-Stationary Equivariant Graph Neural Network (NS-EGNN). Specifically, NS-EGNN employs the Fourier Transform on segments of physical dynamics to extract time-varying frequency information from the trajectories. It then uses the first- and second-order differences to mitigate non-stationarity, followed by pooling for future predictions. By capturing varying frequency characteristics and alleviating the linear and quadratic trends in the raw physical dynamics, NS-EGNN better models the temporal dependencies in the physical dynamics. NS-EGNN has been applied to various types of physical dynamics, including molecular, motion, and protein dynamics. In various scenarios, NS-EGNN consistently surpasses the performance of existing state-of-the-art algorithms, underscoring its effectiveness. The implementation of NS-EGNN is available at https://github.com/MaojiWEN/NS-EGNN.
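The role of first- and second-order differences mentioned above can be checked on toy series. This is a generic illustration of differencing, not code from the NS-EGNN repository; the function name `difference` and the series coefficients are hypothetical.

```python
def difference(series, order=1):
    """Apply the discrete difference operator `order` times."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


# Linear trend: x_t = 3 + 2t
linear = [3.0 + 2.0 * t for t in range(6)]
# Quadratic trend: x_t = 1 + 0.5t + 1.5t^2
quadratic = [1.0 + 0.5 * t + 1.5 * t * t for t in range(6)]

# One difference flattens a linear trend to its constant slope.
first = difference(linear, order=1)
# Two differences flatten a quadratic trend to twice its leading coefficient.
second = difference(quadratic, order=2)
```

Because differencing is a linear operation on consecutive positions, it commutes with rotations of a trajectory and cancels any constant translation. That is consistent with the abstract's concern that data normalization can disrupt symmetry, although the sketch makes no claim about how NS-EGNN combines these differences with its other components.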

NeurIPS Conference 2025 Conference Paper

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

  • Sicong Leng
  • Yun Xing
  • Zesen Cheng
  • Yang Zhou
  • Hang Zhang
  • Xin Li
  • Deli Zhao
  • Shijian Lu

Recent advancements in large multimodal models (LMMs) have significantly enhanced performance across diverse tasks, with ongoing efforts to further integrate additional modalities such as video and audio. However, most existing LMMs remain vulnerable to hallucinations, the discrepancy between the factual multimodal input and the generated textual output, which has limited their applicability in various real-world scenarios. This paper presents the first systematic investigation of hallucinations in LMMs involving the three most common modalities: language, visual, and audio. Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations. To address these challenges, we introduce the benchmark The Curse of Multi-Modalities (CMM), which comprehensively evaluates hallucinations in LMMs, providing a detailed analysis of their underlying issues. Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, underscoring the need for balanced cross-modal learning and enhanced hallucination mitigation strategies. Based on our observations and findings, we suggest potential research directions that could enhance the reliability of LMMs.

NeurIPS Conference 2025 Conference Paper

Universally Invariant Learning in Equivariant GNNs

  • Jiacheng Cen
  • Anyi Li
  • Ning Lin
  • Tingyang Xu
  • Yu Rong
  • Deli Zhao
  • Zihe Wang
  • Wenbing Huang

Equivariant Graph Neural Networks (GNNs) have demonstrated significant success across various applications. To achieve completeness (that is, the universal approximation property over the space of equivariant functions), the network must effectively capture the intricate multi-body interactions among different nodes. Prior methods attain this via deeper architectures, augmented body orders, or increased degrees of steerable features, often at high computational cost and without polynomial-time solutions. In this work, we present a theoretically grounded framework for constructing complete equivariant GNNs that is both efficient and practical. We prove that a complete equivariant GNN can be achieved through two key components: 1) a complete scalar function, referred to as the canonical form of the geometric graph; and 2) a full-rank steerable basis set. Leveraging this finding, we propose an efficient algorithm for constructing complete equivariant GNNs based on two common models: EGNN and TFN. Empirical results show that our model achieves superior completeness and excellent performance with only a few layers, thereby significantly reducing computational overhead while maintaining strong practical efficacy.

AAAI Conference 2024 Conference Paper

Latent Space Editing in Transformer-Based Flow Matching

  • Vincent Tao Hu
  • Wei Zhang
  • Meng Tang
  • Pascal Mettes
  • Deli Zhao
  • Cees Snoek

This paper strives for image editing via generative models. Flow Matching is an emerging generative modeling technique that offers the advantage of simple and efficient training. Simultaneously, a new transformer-based U-ViT has recently been proposed to replace the commonly used UNet for better scalability and performance in generative modeling. Hence, Flow Matching with a transformer backbone offers the potential for scalable and high-quality generative modeling, but its latent structure and editing ability are as yet unknown. We therefore adopt this setting and explore how to edit images through latent space manipulation. We introduce an editing space, which we call u-space, that can be manipulated in a controllable, accumulative, and composable manner. Additionally, we propose a tailored sampling solution to enable sampling with the more efficient adaptive step-size ODE solvers. Lastly, we put forth a straightforward yet powerful method for achieving fine-grained and nuanced editing using text prompts. Our framework is simple, efficient, and highly effective at editing images while preserving the essence of the original content. Our code will be publicly available at https://taohu.me/lfm/

ICLR Conference 2024 Conference Paper

Lipschitz Singularities in Diffusion Models

  • Zhantao Yang
  • Ruili Feng
  • Han Zhang 0010
  • Yujun Shen
  • Kai Zhu 0004
  • Lianghua Huang
  • Yifei Zhang
  • Yu Liu 0063

Diffusion models, which employ stochastic differential equations to sample images through integrals, have emerged as a dominant class of generative models. However, the rationality of the diffusion process itself receives limited attention, leaving open the question of whether the problem is well-posed and well-conditioned. In this paper, we uncover a vexing propensity of diffusion models: they frequently exhibit infinite Lipschitz constants near the zero point of timesteps. We provide theoretical proofs to illustrate the presence of infinite Lipschitz constants and empirical results to confirm it. The Lipschitz singularities pose a threat to the stability and accuracy during both the training and inference processes of diffusion models. Therefore, the mitigation of Lipschitz singularities holds great potential for enhancing the performance of diffusion models. To address this challenge, we propose a novel approach, dubbed E-TSDM, which alleviates the Lipschitz singularities of the diffusion model near the zero point. Remarkably, our technique yields a substantial improvement in performance. Moreover, as a byproduct of our method, we achieve a dramatic reduction in the Fréchet Inception Distance of acceleration methods relying on network Lipschitz, including DDIM and DPM-Solver, by over 33%. Extensive experiments on diverse datasets validate our theory and method. Our work may advance the understanding of the general diffusion process, and also provide insights for the design of diffusion models.

ICLR Conference 2024 Conference Paper

Space Group Constrained Crystal Generation

  • Rui Jiao
  • Wenbing Huang 0001
  • Yu Liu
  • Deli Zhao
  • Yang Liu 0005

Crystals are the foundation of numerous scientific and industrial applications. While various learning-based approaches have been proposed for crystal generation, existing methods neglect the spacegroup constraint which is crucial in describing the geometry of crystals and closely relevant to many desirable properties. However, considering spacegroup constraint is challenging owing to its diverse and nontrivial forms. In this paper, we reduce the spacegroup constraint into an equivalent formulation that is more tractable to be handcrafted into the generation process. In particular, we translate the spacegroup constraint into two cases: the basis constraint of the invariant exponential space of the lattice matrix and the Wyckoff position constraint of the fractional coordinates. Upon the derived constraints, we then propose DiffCSP++, a novel diffusion model that has enhanced a previous work DiffCSP by further taking spacegroup constraint into account. Experiments on several popular datasets verify the benefit of the involvement of the spacegroup constraint, and show that our DiffCSP++ achieves the best or comparable performance on crystal structure prediction and ab initio crystal generation.

NeurIPS Conference 2024 Conference Paper

UKnow: A Unified Knowledge Protocol with Multimodal Knowledge Graph Datasets for Reasoning and Vision-Language Pre-Training

  • Biao Gong
  • Shuai Tan
  • Yutong Feng
  • Xiaoying Xie
  • Yuyuan Li
  • Chaochao Chen
  • Kecheng Zheng
  • Yujun Shen

This work presents a unified knowledge protocol, called UKnow, which facilitates knowledge-based studies from the perspective of data. Particularly focusing on visual and linguistic modalities, we categorize data knowledge into five unit types, namely, in-image, in-text, cross-image, cross-text, and image-text, and set up an efficient pipeline to help construct the multimodal knowledge graph from any data collection. Thanks to the logical information naturally contained in knowledge graphs, organizing datasets under the UKnow format opens up more possibilities of data usage compared to the commonly used image-text pairs. Following the UKnow protocol, we collect, from public international news, a large-scale multimodal knowledge graph dataset that consists of 1,388,568 nodes (with 571,791 vision-related ones) and 3,673,817 triplets. The dataset is also annotated with rich event tags, including 11 coarse labels and 9,185 fine labels. Experiments on four benchmarks demonstrate the potential of UKnow in supporting common-sense reasoning and boosting vision-language pre-training with a single dataset, benefiting from its unified form of knowledge organization. Code, dataset, and models will be made publicly available. See Appendix to download the dataset.

ICML Conference 2023 Conference Paper

Composer: Creative and Controllable Image Synthesis with Composable Conditions

  • Lianghua Huang
  • Di Chen
  • Yu Liu 0063
  • Yujun Shen
  • Deli Zhao
  • Jingren Zhou 0001

Recent large-scale generative models learned on big data are capable of synthesizing incredible images yet suffer from limited controllability. This work offers a new generation paradigm that allows flexible control of the output image, such as spatial layout and palette, while maintaining the synthesis quality and model creativity. With compositionality as the core idea, we first decompose an image into representative factors, and then train a diffusion model with all these factors as the conditions to recompose the input. At the inference stage, the rich intermediate representations work as composable elements, leading to a huge design space (i.e., exponentially proportional to the number of decomposed factors) for customizable content creation. It is noteworthy that our approach, which we call Composer, supports various levels of conditions, such as text description as the global information, depth map and sketch as the local guidance, color histogram for low-level details, etc. Besides improving controllability, we confirm that Composer serves as a general framework and facilitates a wide range of classical generative tasks without retraining. Code and models will be made available.

ICML Conference 2023 Conference Paper

Cones: Concept Neurons in Diffusion Models for Customized Generation

  • Zhiheng Liu
  • Ruili Feng
  • Kai Zhu 0004
  • Yifei Zhang
  • Kecheng Zheng
  • Yu Liu 0063
  • Deli Zhao
  • Jingren Zhou 0001

Human brains respond to semantic features of presented stimuli with different neurons. This raises the question of whether deep neural networks admit a similar behavior pattern. To investigate this phenomenon, this paper identifies a small cluster of neurons associated with a specific subject in a diffusion model. We call those neurons the concept neurons. They can be identified by statistics of network gradients to a stimulation connected with the given subject. The concept neurons demonstrate magnetic properties in interpreting and manipulating generation results. Shutting them can directly yield the related subject contextualized in different scenes. Concatenating multiple clusters of concept neurons can vividly generate all related concepts in a single image. Our method attains impressive performance for multi-subject customization, even four or more subjects. For large-scale applications, the concept neurons are environmentally friendly as we only need to store a sparse cluster of int index instead of dense float32 parameter values, reducing storage consumption by 90% compared with previous customized generation methods. Extensive qualitative and quantitative studies on diverse scenarios show the superiority of our method in interpreting and manipulating diffusion models.

NeurIPS Conference 2023 Conference Paper

Customizable Image Synthesis with Multiple Subjects

  • Zhiheng Liu
  • Yifei Zhang
  • Yujun Shen
  • Kecheng Zheng
  • Kai Zhu
  • Ruili Feng
  • Yu Liu
  • Deli Zhao

Synthesizing images with user-specified subjects has received growing attention due to its practical applications. Despite the recent success in single subject customization, existing algorithms suffer from high training cost and low success rate along with increased number of subjects. Towards controllable image synthesis with multiple subjects as the constraints, this work studies how to efficiently represent a particular subject as well as how to appropriately compose different subjects. We find that the text embedding regarding the subject token already serves as a simple yet effective representation that supports arbitrary combinations without any model tuning. Through learning a residual on top of the base embedding, we manage to robustly shift the raw subject to the customized subject given various text conditions. We then propose to employ layout, a very abstract and easy-to-obtain prior, as the spatial guidance for subject arrangement. By rectifying the activations in the cross-attention map, the layout appoints and separates the location of different subjects in the image, significantly alleviating the interference across them. Using cross-attention map as the intermediary, we could strengthen the signal of target subjects and weaken the signal of irrelevant subjects within a certain region, significantly alleviating the interference across subjects. Both qualitative and quantitative experimental results demonstrate our superiority over state-of-the-art alternatives under a variety of settings for multi-subject customization.

NeurIPS Conference 2023 Conference Paper

FaceComposer: A Unified Model for Versatile Facial Content Creation

  • Jiayu Wang
  • Kang Zhao
  • Yifeng Ma
  • Shiwei Zhang
  • Yingya Zhang
  • Yujun Shen
  • Deli Zhao
  • Jingren Zhou

This work presents FaceComposer, a unified generative model that accomplishes a variety of facial content creation tasks, including text-conditioned face synthesis, text-guided face editing, face animation, etc. Based on the latent diffusion framework, FaceComposer follows the paradigm of compositional generation and employs diverse face-specific conditions, e.g., Identity Feature and Projected Normalized Coordinate Code, to release the model's creativity as much as possible. To support text control and animation, we clean up some existing face image datasets and collect around 500 hours of talking-face videos, forming a high-quality large-scale multi-modal face database. A temporal self-attention module is incorporated into the U-Net structure, which allows learning the denoising process on the mixture of images and videos. Extensive experiments suggest that our approach not only achieves comparable or even better performance than state-of-the-art methods on each single task, but also facilitates some combined tasks with a one-time forward pass, demonstrating its potential in serving as a foundation generative model in the face domain. We further develop an interface such that users can enjoy our one-step service to create, edit, and animate their own characters. Code, dataset, model, and interface will be made publicly available.

NeurIPS Conference 2023 Conference Paper

MomentDiff: Generative Video Moment Retrieval from Random to Real

  • Pandeng Li
  • Chen-Wei Xie
  • Hongtao Xie
  • Liming Zhao
  • Lei Zhang
  • Yun Zheng
  • Deli Zhao
  • Yongdong Zhang

Video moment retrieval pursues an efficient and generalized solution to identify the specific temporal segments within an untrimmed video that correspond to a given language description. To achieve this goal, we provide a generative diffusion-based framework called MomentDiff, which simulates a typical human retrieval process from random browsing to gradual localization. Specifically, we first diffuse the real span to random noise, and learn to denoise the random noise to the original span with the guidance of similarity between text and video. This allows the model to learn a mapping from arbitrary random locations to real moments, enabling the ability to locate segments from random initialization. Once trained, MomentDiff could sample random temporal segments as initial guesses and iteratively refine them to generate an accurate temporal boundary. Different from discriminative works (e.g., based on learnable proposals or queries), MomentDiff with random initialized spans could resist the temporal location biases from datasets. To evaluate the influence of the temporal location biases, we propose two "anti-bias" datasets with location distribution shifts, named Charades-STA-Len and Charades-STA-Mom. The experimental results demonstrate that our efficient framework consistently outperforms state-of-the-art methods on three public benchmarks, and exhibits better generalization and robustness on the proposed anti-bias datasets. The code, model, and anti-bias evaluation datasets will be released publicly.

NeurIPS Conference 2023 Conference Paper

Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone

  • Zeyinzi Jiang
  • Chaojie Mao
  • Ziyuan Huang
  • Ao Ma
  • Yiliang Lv
  • Yujun Shen
  • Deli Zhao
  • Jingren Zhou

Parameter-efficient tuning has become a trend in transferring large-scale foundation models to downstream applications. Existing methods typically embed some light-weight tuners into the backbone, where both the design and the learning of the tuners are highly dependent on the base model. This work offers a new tuning paradigm, dubbed Res-Tuning, which intentionally unbinds tuners from the backbone. With both theoretical and empirical evidence, we show that popular tuning approaches have their equivalent counterparts under our unbinding formulation, and hence can be integrated into our framework effortlessly. Thanks to the structural disentanglement, we manage to free the design of tuners from the network architecture, facilitating flexible combination of various tuning strategies. We further propose a memory-efficient variant of Res-Tuning, where the bypass (i.e., formed by a sequence of tuners) is effectively detached from the main branch, such that the gradients are back-propagated only to the tuners but not to the backbone. Such a detachment also allows one-time backbone forward for multi-task inference. Extensive experiments on both discriminative and generative tasks demonstrate the superiority of our method over existing alternatives from the perspectives of efficacy and efficiency. Project page: https://res-tuning.github.io/.

ICML Conference 2023 Conference Paper

RLEG: Vision-Language Representation Learning with Diffusion-based Embedding Generation

  • Liming Zhao
  • Kecheng Zheng
  • Yun Zheng
  • Deli Zhao
  • Jingren Zhou 0001

Vision-language representation learning models (e.g., CLIP) have achieved state-of-the-art performance on various downstream tasks, which usually need large-scale training data to learn discriminative representation. Recent progress on generative diffusion models (e.g., DALL-E 2) has demonstrated that diverse high-quality samples can be synthesized by randomly sampling from generative distribution. By virtue of this generative capability, in this paper we propose a novel vision-language Representation Learning method with diffusion-based Embedding Generation (RLEG), which exploits diffusion models to generate feature embeddings online for learning effective vision-language representation. Specifically, we first adopt image and text encoders to extract the corresponding embeddings. Secondly, pretrained diffusion-based embedding generators are harnessed to transfer the embedding modality online between vision and language domains. The embeddings generated from the generators then serve as augmented embedding-level samples, which are applied to contrastive learning with a variant of the CLIP framework. Experimental results show that the proposed method could learn effective representation and achieve state-of-the-art performance on various tasks including image classification, image-text retrieval, object detection, semantic segmentation, and text-conditional image generation.

ICLR Conference 2023 Conference Paper

The Devil is in the Wrongly-classified Samples: Towards Unified Open-set Recognition

  • Jun Cen
  • Di Luan
  • Shiwei Zhang 0001
  • Yixuan Pei
  • Yingya Zhang
  • Deli Zhao
  • Shaojie Shen
  • Qifeng Chen 0001

Open-set Recognition (OSR) aims to identify test samples whose classes are not seen during the training process. Recently, Unified Open-set Recognition (UOSR) has been proposed to reject not only unknown samples but also known but wrongly classified samples, which tends to be more practical in real-world applications. In this paper, we deeply analyze the UOSR task under different training and evaluation settings to shed light on this promising research direction. For this purpose, we first evaluate the UOSR performance of several OSR methods and show a significant finding that the uncertainty distribution of almost all these methods is actually closer to the expectation of UOSR than OSR. We show that the reason lies in the known but wrongly classified samples, as their uncertainty distribution is extremely close to unknown samples rather than known and correctly classified samples. Second, we analyze how the two training settings of OSR (i.e., pre-training and outlier exposure) influence the UOSR. We find although they are both beneficial for distinguishing known and correctly classified samples from unknown samples, pre-training is also helpful for identifying known but wrongly classified samples while outlier exposure is not. In addition to different training settings, we also formulate a new evaluation setting for UOSR which is called few-shot UOSR, where only one or five samples per unknown class are available during evaluation to help identify unknown samples. We propose FS-KNNS for the few-shot UOSR to achieve state-of-the-art performance under all settings.

NeurIPS Conference 2023 Conference Paper

VideoComposer: Compositional Video Synthesis with Motion Controllability

  • Xiang Wang
  • Hangjie Yuan
  • Shiwei Zhang
  • Dayou Chen
  • Jiuniu Wang
  • Yingya Zhang
  • Yujun Shen
  • Deli Zhao

The pursuit of controllability as a higher standard of visual content creation has yielded remarkable progress in customizable image synthesis. However, achieving controllable video synthesis remains challenging due to the large variation of temporal dynamics and the requirement of cross-frame temporal consistency. Based on the paradigm of compositional generation, this work presents VideoComposer that allows users to flexibly compose a video with textual conditions, spatial conditions, and more importantly temporal conditions. Specifically, considering the characteristic of video data, we introduce the motion vector from compressed videos as an explicit control signal to provide guidance regarding temporal dynamics. In addition, we develop a Spatio-Temporal Condition encoder (STC-encoder) that serves as a unified interface to effectively incorporate the spatial and temporal relations of sequential inputs, with which the model could make better use of temporal conditions and hence achieve higher inter-frame consistency. Extensive experimental results suggest that VideoComposer is able to control the spatial and temporal patterns simultaneously within a synthesized video in various forms, such as text description, sketch sequence, reference video, or even simply hand-crafted motions. The code and models are publicly available at https://videocomposer.github.io.

NeurIPS Conference 2022 Conference Paper

Improving 3D-aware Image Synthesis with A Geometry-aware Discriminator

  • Zifan Shi
  • Yinghao Xu
  • Yujun Shen
  • Deli Zhao
  • Qifeng Chen
  • Dit-Yan Yeung

3D-aware image synthesis aims at learning a generative model that can render photo-realistic 2D images while capturing decent underlying 3D shapes. A popular solution is to adopt the generative adversarial network (GAN) and replace the generator with a 3D renderer, where volume rendering with neural radiance field (NeRF) is commonly used. Despite the advancement of synthesis quality, existing methods fail to obtain moderate 3D shapes. We argue that, considering the two-player game in the formulation of GANs, only making the generator 3D-aware is not enough. In other words, displacing the generative mechanism only offers the capability, but not the guarantee, of producing 3D-aware images, because the supervision of the generator primarily comes from the discriminator. To address this issue, we propose GeoD through learning a geometry-aware discriminator to improve 3D-aware GANs. Concretely, besides differentiating real and fake samples from the 2D image space, the discriminator is additionally asked to derive the geometry information from the inputs, which is then applied as the guidance of the generator. Such a simple yet effective design facilitates learning substantially more accurate 3D shapes. Extensive experiments on various generator architectures and training datasets verify the superiority of GeoD over state-of-the-art alternatives. Moreover, our approach is registered as a general framework such that a more capable discriminator (i.e., with a third task of novel view synthesis beyond domain classification and geometry extraction) can further assist the generator with a better multi-view consistency. Project page can be found at https://vivianszf.github.io/geod.

NeurIPS Conference 2022 Conference Paper

Improving GANs with A Dynamic Discriminator

  • Ceyuan Yang
  • Yujun Shen
  • Yinghao Xu
  • Deli Zhao
  • Bo Dai
  • Bolei Zhou

Discriminator plays a vital role in training generative adversarial networks (GANs) via distinguishing real and synthesized samples. While the real data distribution remains the same, the synthesis distribution keeps varying because of the evolving generator, and thus effects a corresponding change of the bi-classification task assigned to the discriminator. We argue that a discriminator with an on-the-fly adjustment on its capacity can better accommodate such a time-varying task. A comprehensive empirical study confirms that the proposed training strategy, termed as DynamicD, improves the synthesis performance without incurring any additional computation cost or training objectives. Two capacity adjusting schemes are developed for training GANs under different data regimes: i) given a sufficient amount of training data, the discriminator benefits from a progressively increased learning capacity, and ii) when the training data is limited, gradually decreasing the layer width mitigates the over-fitting issue of the discriminator. Experiments on both 2D and 3D-aware image synthesis tasks conducted on a range of datasets substantiate the generalizability of our DynamicD as well as its substantial improvement over the baselines. Furthermore, DynamicD is synergistic to other discriminator-improving approaches (including data augmentation, regularizers, and pre-training), and brings continuous performance gain when combined with them for learning GANs. Code will be made publicly available.

ICML Conference 2022 Conference Paper

Principled Knowledge Extrapolation with GANs

  • Ruili Feng
  • Jie Xiao 0002
  • Kecheng Zheng
  • Deli Zhao
  • Jingren Zhou 0001
  • Qibin Sun
  • Zheng-Jun Zha

Humans can extrapolate well, generalize daily knowledge into unseen scenarios, and raise and answer counterfactual questions. To imitate this ability via generative models, previous works have extensively studied explicitly encoding Structural Causal Models (SCMs) into architectures of generator networks. This methodology, however, limits the flexibility of the generator as they must be carefully crafted to follow the causal graph, and demands a ground truth SCM with strong ignorability assumption as prior, which is a nontrivial assumption in many real scenarios. Thus, many current causal GAN methods fail to generate high fidelity counterfactual results as they cannot easily leverage state-of-the-art generative models. In this paper, we propose to study counterfactual synthesis from a new perspective of knowledge extrapolation, where a given knowledge dimension of the data distribution is extrapolated, but the remaining knowledge is kept indistinguishable from the original distribution. We show that an adversarial game with a closed-form discriminator can be used to address the knowledge extrapolation problem, and a novel principal knowledge descent method can efficiently estimate the extrapolated distribution through the adversarial game. Our method enjoys both elegant theoretical guarantees and superior performance in many scenarios.

NeurIPS Conference 2022 Conference Paper

Rank Diminishing in Deep Neural Networks

  • Ruili Feng
  • Kecheng Zheng
  • Yukun Huang
  • Deli Zhao
  • Michael Jordan
  • Zheng-Jun Zha

The rank of neural networks measures information flowing across layers. It is an instance of a key structural condition that applies across broad domains of machine learning. In particular, the assumption of low-rank feature representations led to algorithmic developments in many architectures. For neural networks, however, the intrinsic mechanism that yields low-rank structures remains vague and unclear. To fill this gap, we perform a rigorous study on the behavior of network rank, focusing particularly on the notion of rank deficiency. We theoretically establish a universal monotone decreasing property of network ranks from the basic rules of differential and algebraic composition, and uncover rank deficiency of network blocks and deep function coupling. By virtue of our numerical tools, we provide the first empirical analysis of the per-layer behavior of network ranks in realistic settings, i.e., ResNets, deep MLPs, and Transformers on ImageNet. These empirical results are in direct accord with our theory. Furthermore, we reveal a novel phenomenon of independence deficit caused by the rank deficiency of deep networks, where classification confidence of a given category can be linearly decided by the confidence of a handful of other categories. The theoretical results of this work, together with the empirical findings, may advance understanding of the inherent principles of deep neural networks. Code to detect the rank behavior of networks can be found in https://github.com/RuiLiFeng/Rank-Diminishing-in-Deep-Neural-Networks.

ICML Conference 2022 Conference Paper

Region-Based Semantic Factorization in GANs

  • Jiapeng Zhu 0001
  • Yujun Shen
  • Yinghao Xu 0001
  • Deli Zhao
  • Qifeng Chen 0001

Despite the rapid advancement of semantic discovery in the latent space of Generative Adversarial Networks (GANs), existing approaches either are limited to finding global attributes or rely on a number of segmentation masks to identify local attributes. In this work, we present a highly efficient algorithm to factorize the latent semantics learned by GANs concerning an arbitrary image region. Concretely, we revisit the task of local manipulation with pre-trained GANs and formulate region-based semantic discovery as a dual optimization problem. Through an appropriately defined generalized Rayleigh quotient, we manage to solve such a problem without any annotations or training. Experimental results on various state-of-the-art GAN models demonstrate the effectiveness of our approach, as well as its superiority over prior arts regarding precise control, region robustness, speed of implementation, and simplicity of use.

NeurIPS Conference 2021 Conference Paper

Low-Rank Subspaces in GANs

  • Jiapeng Zhu
  • Ruili Feng
  • Yujun Shen
  • Deli Zhao
  • Zheng-Jun Zha
  • Jingren Zhou
  • Qifeng Chen

The latent space of a Generative Adversarial Network (GAN) has been shown to encode rich semantics within some subspaces. To identify these subspaces, researchers typically analyze the statistical information from a collection of synthesized data, and the identified subspaces tend to control image attributes globally (i.e., manipulating an attribute causes the change of an entire image). By contrast, this work introduces low-rank subspaces that enable more precise control of GAN generation. Concretely, given an arbitrary image and a region of interest (e.g., eyes of face images), we manage to relate the latent space to the image region with the Jacobian matrix and then use low-rank factorization to discover steerable latent subspaces. There are three distinguishable strengths of our approach that can be aptly called LowRankGAN. First, compared to analytic algorithms in prior work, our low-rank factorization of Jacobians is able to find the low-dimensional representation of attribute manifold, making image editing more precise and controllable. Second, low-rank factorization naturally yields a null space of attributes such that moving the latent code within it only affects the outer region of interest. Therefore, local image editing can be simply achieved by projecting an attribute vector into the null space without relying on a spatial mask as existing methods do. Third, our method can robustly work with a local region from one image for analysis yet well generalize to other images, making it much easier to use in practice. Extensive experiments on state-of-the-art GAN models (including StyleGAN2 and BigGAN) trained on various datasets demonstrate the effectiveness of our LowRankGAN.

ICML Conference 2021 Conference Paper

Uncertainty Principles of Encoding GANs

  • Ruili Feng
  • Zhouchen Lin
  • Jiapeng Zhu 0001
  • Deli Zhao
  • Jingren Zhou 0001
  • Zheng-Jun Zha

The compelling synthesis results of Generative Adversarial Networks (GANs) demonstrate rich semantic knowledge in their latent codes. To obtain this knowledge for downstream applications, encoding GANs has been proposed to learn encoders, such that real world data can be encoded to latent codes, which can be fed to generators to reconstruct those data. However, despite the theoretical guarantees of precise reconstruction in previous works, current algorithms generally reconstruct inputs with non-negligible deviations from inputs. In this paper we study this predicament of encoding GANs, which is indispensable research for the GAN community. We prove three uncertainty principles of encoding GANs in practice: a) the ‘perfect’ encoder and generator cannot be continuous at the same time, which implies that current framework of encoding GANs is ill-posed and needs rethinking; b) neural networks cannot approximate the underlying encoder and generator precisely at the same time, which explains why we cannot get ‘perfect’ encoders and generators as promised in previous theories; c) neural networks cannot be stable and accurate at the same time, which demonstrates the difficulty of training and trade-off between fidelity and disentanglement encountered in previous works. Our work may eliminate gaps between previous theories and empirical results, promote the understanding of GANs, and guide network designs for follow-up works.

ICML Conference 2021 Conference Paper

Understanding Noise Injection in GANs

  • Ruili Feng
  • Deli Zhao
  • Zheng-Jun Zha

Noise injection is an effective way of circumventing overfitting and enhancing generalization in machine learning, the rationale of which has been validated in deep learning as well. Recently, noise injection exhibits surprising effectiveness when generating high-fidelity images in Generative Adversarial Networks (GANs) (e.g., StyleGAN). Despite its successful applications in GANs, the mechanism of its validity is still unclear. In this paper, we propose a geometric framework to theoretically analyze the role of noise injection in GANs. First, we point out the existence of the adversarial dimension trap inherent in GANs, which leads to the difficulty of learning a proper generator. Second, we successfully model the noise injection framework with exponential maps based on Riemannian geometry. Guided by our theories, we propose a general geometric realization for noise injection. Under our novel framework, the simple noise injection used in StyleGAN reduces to the Euclidean case. The goal of our work is to make theoretical steps towards understanding the underlying mechanism of state-of-the-art GAN algorithms. Experiments on image generation and GAN inversion validate our theory in practice.

NeurIPS Conference 2018 Conference Paper

DeepExposure: Learning to Expose Photos with Asynchronously Reinforced Adversarial Learning

  • Runsheng Yu
  • Wenyu Liu
  • Yasen Zhang
  • Zhi Qu
  • Deli Zhao
  • Bo Zhang

Accurate exposure is the key to capturing high-quality photos in computational photography, especially for mobile phones that are limited by sizes of camera modules. Inspired by luminosity masks usually applied by professional photographers, in this paper, we develop a novel algorithm for learning local exposures with deep reinforcement adversarial learning. To be specific, we segment an image into sub-images that can reflect variations of dynamic range exposures according to raw low-level features. Based on these sub-images, a local exposure for each sub-image is automatically learned by virtue of policy network sequentially while the reward of learning is globally designed for striking a balance of overall exposures. The aesthetic evaluation function is approximated by discriminator in generative adversarial networks. The reinforcement learning and the adversarial learning are trained collaboratively by asynchronous deterministic policy gradient and generative loss approximation. To further simplify the algorithmic architecture, we also prove the feasibility of leveraging the discriminator as the value function. Furthermore, we employ each local exposure to retouch the raw input image respectively, thus delivering multiple retouched images under different exposures which are fused with exposure blending. The extensive experiments verify that our algorithms are superior to state-of-the-art methods in terms of quantitative accuracy and visual illustration.

IJCAI Conference 2016 Conference Paper

Learning Stable Linear Dynamical Systems with the Weighted Least Square Method

  • Wenbing Huang
  • Lele Cao
  • Fuchun Sun
  • Deli Zhao
  • Huaping Liu
  • Shanshan Yu

Standard subspace algorithms learn Linear Dynamical Systems (LDSs) from time series with the least-square method, where the stability of the system is not naturally guaranteed. In this paper, we propose a novel approach for learning stable systems by enforcing stability directly on the least-square solutions. To this end, we first explore the spectral-radius property of the least-square transition matrix and then determine the key component that incurs the instability of the transition matrix. By multiplying the unstable component with a weight matrix on the right side, we obtain a weighted-least-square transition matrix that is further optimized to minimize the reconstruction error of the state sequence while still maintaining the stable constraint. Comparative experimental evaluations demonstrate that our proposed methods outperform the state-of-the-art methods regarding the reconstruction accuracy and the learning efficiency.

IJCAI Conference 2015 Conference Paper

Network Representation Learning with Rich Text Information

  • Cheng Yang
  • Zhiyuan Liu
  • Deli Zhao
  • Maosong Sun
  • Edward Chang

Representation learning has shown its effectiveness in many tasks such as image classification and text mining. Network representation learning aims at learning distributed vector representation for each vertex in a network, which is also increasingly recognized as an important aspect for network analysis. Most network representation learning methods investigate network structures for learning. In reality, network vertices contain rich information (such as text), which cannot be well applied with algorithmic frameworks of typical representation learning methods. By proving that DeepWalk, a state-of-the-art network representation method, is actually equivalent to matrix factorization (MF), we propose text-associated DeepWalk (TADW). TADW incorporates text features of vertices into network representation learning under the framework of matrix factorization. We evaluate our method and various baseline methods by applying them to the task of multi-class classification of vertices. The experimental results show that our method outperforms other baselines on all three datasets, especially when networks are noisy and training ratio is small. The source code of this paper can be obtained from https://github.com/albertyang33/TADW.

IJCAI Conference 2015 Conference Paper

Scalable Gaussian Process Regression Using Deep Neural Networks

  • Wenbing Huang
  • Deli Zhao
  • Fuchun Sun
  • Huaping Liu
  • Edward Chang

We propose a scalable Gaussian process model for regression by applying a deep neural network as the feature-mapping function. We first pre-train the deep neural network with a stacked denoising auto-encoder in an unsupervised way. Then, we perform a Bayesian linear regression on the top layer of the pre-trained deep network. The resulting model, Deep-Neural-Network-based Gaussian Process (DNN-GP), can learn much more meaningful representation of the data by the finite-dimensional but deep-layered feature-mapping function. Unlike standard Gaussian processes, our model scales well with the size of the training set due to the avoidance of kernel matrix inversion. Moreover, we present a mixture of DNN-GPs to further improve the regression performance. For the experiments on three representative large datasets, our proposed models significantly outperform the state-of-the-art algorithms of Gaussian process regression.

NeurIPS Conference 2014 Conference Paper

Zeta Hull Pursuits: Learning Nonconvex Data Hulls

  • Yuanjun Xiong
  • Wei Liu
  • Deli Zhao
  • Xiaoou Tang

Selecting a small informative subset from a given dataset, also called column sampling, has drawn much attention in machine learning. For incorporating structured data information into column sampling, research efforts were devoted to the cases where data points are fitted with clusters, simplices, or general convex hulls. This paper aims to study nonconvex hull learning which has rarely been investigated in the literature. In order to learn data-adaptive nonconvex hulls, a novel approach is proposed based on a graph-theoretic measure that leverages graph cycles to characterize the structural complexities of input data points. Employing this measure, we present a greedy algorithmic framework, dubbed Zeta Hulls, to perform structured column sampling. The process of pursuing a Zeta hull involves the computation of matrix inverse. To accelerate the matrix inversion computation and reduce its space complexity as well, we exploit a low-rank approximation to the graph adjacency matrix by using an efficient anchor graph technique. Extensive experimental results show that data representation learned by Zeta Hulls can achieve state-of-the-art accuracy in text and image classification tasks.

NeurIPS Conference 2008 Conference Paper

Cyclizing Clusters via Zeta Function of a Graph

  • Deli Zhao
  • Xiaoou Tang

Detecting underlying clusters from large-scale data plays a central role in machine learning research. In this paper, we attempt to tackle clustering problems for complex data of multiple distributions and large multi-scales. To this end, we develop an algorithm named Zeta $l$-links, or Zell which consists of two parts: Zeta merging with a similarity graph and an initial set of small clusters derived from local $l$-links of the graph. More specifically, we propose to structurize a cluster using cycles in the associated subgraph. A mathematical tool, Zeta function of a graph, is introduced for the integration of all cycles, leading to a structural descriptor of the cluster in determinantal form. The popularity character of the cluster is conceptualized as the global fusion of variations of the structural descriptor by means of the leave-one-out strategy in the cluster. Zeta merging proceeds, in the agglomerative fashion, according to the maximum incremental popularity among all pairwise clusters. Experiments on toy data, real imagery data, and real sensory data show the promising performance of Zell. The $98.1\%$ accuracy, in the sense of the normalized mutual information, is obtained on the FRGC face data of 16028 samples and 466 facial clusters. The MATLAB codes of Zell will be made publicly available for peer evaluation.