Arrow Research

Author name cluster

Nanning Zheng

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

45 papers
1 author row

Possible papers

45

AAAI Conference 2026 Conference Paper

EVOKE: Efficient and High-Fidelity EEG-to-Video Reconstruction via Decoupling Implicit Neural Representation

  • Haodong Jing
  • Panqi Yang
  • Dongyao Jiang
  • Zhipeng Liu
  • Nanning Zheng
  • Yongqiang Ma

Visual neural decoding is an important research topic at the intersection of cognitive neuroscience and machine learning. While recent progress has been made in EEG-based neural decoding, reconstructing dynamic visual content remains challenging. In the field of EEG decoding, current models either utilize pre-trained encoders for feature extraction or employ graph neural networks to represent the spatio-temporal information embedding, resulting in poor model representation and high complexity. We propose EVOKE -- an innovative framework for zero-shot decoding of high-fidelity videos from EEG signals. EVOKE employs Implicit Neural Representations to perform complete spatial modeling of EEG and continuously decouples information in the EEG-INR perceptual space. Additionally, we construct a Hierarchical-aware Attention Module (HAM) that decodes EEG from three feature anchors (visual, semantic, and motion) and progressively controls task inference. The Motion Attention Flow (MAF) we developed overcomes the limitations of capturing motion features in dynamic stimuli, creating a more robust representation that enhances reconstruction consistency. Comprehensive experiments show that EVOKE achieves SOTA performance (0.353 SSIM, 0.715 CLIP-pcc). We provide an effective method for converting brain activity into rich visual experiences and set a new benchmark for brain multimodal generation.

AAAI Conference 2026 Conference Paper

UniHOI: Unified Human-Object Interaction Understanding via Unified Token Space

  • Panqi Yang
  • Haodong Jing
  • Nanning Zheng
  • Yongqiang Ma

In the field of human-object interaction (HOI), detection and generation are two dual tasks that have traditionally been addressed separately, hindering the development of comprehensive interaction understanding. To address this, we propose UniHOI, which jointly models HOI detection and generation via a unified token space, thereby effectively promoting knowledge sharing and enhancing generalization. Specifically, we introduce a symmetric interaction-aware attention module and a unified semi-supervised learning paradigm, enabling effective bidirectional mapping between images and interaction semantics even under limited annotations. Extensive experiments demonstrate that UniHOI achieves state-of-the-art performance in both HOI detection and generation. In particular, UniHOI improves accuracy by 4.9% on long-tailed HOI detection and boosts interaction metrics by 42.0% on open-vocabulary generation tasks.

NeurIPS Conference 2025 Conference Paper

Riemannian Consistency Model

  • Chaoran Cheng
  • Yusong Wang
  • Yuxin Chen
  • Xiangxin Zhou
  • Nanning Zheng
  • Ge Liu

Consistency models are a class of generative models that enable few-step generation for diffusion and flow matching models. While consistency models have achieved promising results on Euclidean domains like images, their applications to Riemannian manifolds remain challenging due to the curved geometry. In this work, we propose the Riemannian Consistency Model (RCM), which, for the first time, enables few-step consistency modeling while respecting the intrinsic manifold constraint imposed by the Riemannian geometry. Leveraging the covariant derivative and exponential-map-based parameterization, we derive the closed-form solutions for both discrete- and continuous-time training objectives for RCM. We then demonstrate theoretical equivalence between the two variants of RCM: Riemannian consistency distillation (RCD), which relies on a teacher model to approximate the marginal vector field, and Riemannian consistency training (RCT), which utilizes the conditional vector field for training. We further propose a simplified training objective that eliminates the need for complicated differential calculations. Finally, we provide a unique kinematics perspective for interpreting the RCM objective, offering new theoretical angles. Through extensive experiments, we demonstrate the superior generative quality of RCM in few-step generation on various non-Euclidean manifolds, including flat tori, spheres, and the 3D rotation group SO(3), spanning a variety of crucial real-world applications such as RNA and protein generation.
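
The exponential-map parameterization mentioned in the abstract can be made concrete on a simple manifold. Below is a minimal sketch (not the authors' code; `tangent_net` is a hypothetical network) showing how predicting a tangent vector and mapping it through the sphere's exponential map keeps consistency outputs on the manifold by construction:

```python
# Hedged sketch of exponential-map parameterization on the unit sphere.
import torch

def sphere_exp(x, v, eps=1e-8):
    """Exponential map on the unit sphere: x (..., d) on S^{d-1},
    v (..., d) a tangent vector at x (assumed to satisfy <x, v> = 0)."""
    norm_v = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.cos(norm_v) * x + torch.sin(norm_v) * (v / norm_v)

def consistency_output(x_t, tangent_net, t):
    """tangent_net is a hypothetical network predicting a tangent vector."""
    v = tangent_net(x_t, t)
    # Project onto the tangent space at x_t so the exp map is well defined.
    v = v - (v * x_t).sum(-1, keepdim=True) * x_t
    return sphere_exp(x_t, v)  # output lies on the sphere by construction
```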

AAAI Conference 2025 Conference Paper

See Through Their Minds: Learning Transferable Brain Decoding Models from Cross-Subject fMRI

  • Yulong Liu
  • Yongqiang Ma
  • Guibo Zhu
  • Haodong Jing
  • Nanning Zheng

Deciphering visual content from fMRI sheds light on the human vision system, but data scarcity and noise limit brain decoding model performance. Traditional approaches rely on subject-specific models, which are sensitive to training sample size. In this paper, we address data scarcity by proposing shallow subject-specific adapters to map cross-subject fMRI data into unified representations. A shared deep decoding model then decodes these features into the target feature space. We use both visual and textual supervision for multi-modal brain decoding and integrate high-level perception decoding with pixel-wise reconstruction guided by high-level perceptions. Our extensive experiments reveal several interesting insights: 1) Training with cross-subject fMRI benefits both high-level and low-level decoding models; 2) Merging high-level and low-level information improves reconstruction performance at both levels; 3) Transfer learning is effective for new subjects with limited training data by training new adapters; 4) Decoders trained on visually-elicited brain activity can generalize to decode imagery-induced activity, though with reduced performance.

AAAI Conference 2025 Conference Paper

Unveiling Multi-View Anomaly Detection: Intra-view Decoupling and Inter-view Fusion

  • Kai Mao
  • Yiyang Lian
  • Yangyang Wang
  • Meiqin Liu
  • Nanning Zheng
  • Ping Wei

Anomaly detection has garnered significant attention for its extensive industrial application value. Most existing methods focus on single-view scenarios and fail to detect anomalies hidden in blind spots, leaving a gap in addressing the demands of multi-view detection in practical applications. Ensembling multiple single-view models is a typical way to tackle the multi-view situation, but it overlooks the correlations between different views. In this paper, we propose a novel multi-view anomaly detection framework, Intra-view Decoupling and Inter-view Fusion (IDIF), to explore correlations among views. Our method contains three key components: 1) a proposed Consistency Bottleneck module extracting the common features of different views through information compression and mutual information maximization; 2) an Implicit Voxel Construction module fusing features of different views with prior knowledge represented in the form of voxels; and 3) a View-wise Dropout training strategy enabling the model to learn how to cope with missing views during testing. The proposed IDIF achieves state-of-the-art performance on three datasets. Extensive ablation studies also demonstrate the superiority of our method.

NeurIPS Conference 2025 Conference Paper

VideoVLA: Video Generators Can Be Generalizable Robot Manipulators

  • Yichao Shen
  • Fangyun Wei
  • Zhiying Du
  • Yaobo Liang
  • Yan Lu
  • Jiaolong Yang
  • Nanning Zheng
  • Baining Guo

Generalization in robot manipulation is essential for deploying robots in open-world environments and advancing toward artificial general intelligence. While recent Vision-Language-Action (VLA) models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators. Given a language instruction and an image, VideoVLA predicts an action sequence as well as the future visual outcomes. Built on a multi-modal Diffusion Transformer, VideoVLA jointly models video, language, and action modalities, using pre-trained video generative models for joint visual and action forecasting. Our experiments show that high-quality imagined futures correlate with reliable action predictions and task success, highlighting the importance of visual imagination in manipulation. VideoVLA demonstrates strong generalization, including imitating other embodiments' skills and handling novel objects. This dual-prediction strategy—forecasting both actions and their visual consequences—explores a paradigm shift in robot learning and unlocks generalization capabilities in manipulation systems.

NeurIPS Conference 2024 Conference Paper

Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement

  • Tao Yang
  • Cuiling Lan
  • Yan Lu
  • Nanning Zheng

Disentangled representation learning strives to extract the intrinsic factors within the observed data. Factoring these representations in an unsupervised manner is notably challenging and usually requires tailored loss functions or specific structural designs. In this paper, we introduce a new perspective and framework, demonstrating that a diffusion model with cross-attention can itself serve as a powerful inductive bias to facilitate the learning of disentangled representations. We propose to encode an image into a set of concept tokens and treat them as the condition of the latent diffusion model for image reconstruction, where cross-attention over the concept tokens is used to bridge the encoder and the U-Net of the diffusion model. We show that the diffusion process inherently possesses time-varying information bottlenecks. These information bottlenecks and cross-attention act as strong inductive biases for promoting disentanglement. Without any regularization term in the loss function, this framework achieves superior disentanglement performance on the benchmark datasets, surpassing all previous methods with intricate designs. We have conducted comprehensive ablation studies and visualization analyses, shedding light on the functioning of this model. We anticipate that our findings will inspire further investigation into diffusion models for disentangled representation learning towards more sophisticated data analysis and understanding.

AAAI Conference 2024 Conference Paper

GSO-Net: Grid Surface Optimization via Learning Geometric Constraints

  • Chaoyun Wang
  • Jingmin Xin
  • Nanning Zheng
  • Caigui Jiang

In the context of surface representations, we find a natural structural similarity between grid surfaces and image data. Motivated by this observation, we propose a novel approach: encoding grid surfaces as geometric images and using image processing methods to address surface optimization-related problems. As a result, we have created the first dataset for grid surface optimization and devised a learning-based grid surface optimization network specifically tailored to geometric images, addressing the surface optimization problem through a paradigm of data-driven learning of geometric constraints. We conduct extensive experiments on developable surface optimization, surface flattening, and surface denoising tasks using the designed network and datasets. The results demonstrate that our proposed method not only addresses the surface optimization problem better than traditional numerical optimization methods, especially for complex surfaces, but also boosts the optimization speed by multiple orders of magnitude. This pioneering study successfully applies deep learning methods to the field of surface optimization and provides a new solution paradigm for similar tasks, which will provide inspiration and guidance for future developments in the field of discrete surface optimization. The code and dataset are available at https://github.com/chaoyunwang/GSO-Net.

AAAI Conference 2024 Conference Paper

IS-DARTS: Stabilizing DARTS through Precise Measurement on Candidate Importance

  • Hongyi He
  • Longjun Liu
  • Haonan Zhang
  • Nanning Zheng

Among existing Neural Architecture Search methods, DARTS is known for its efficiency and simplicity. This approach applies continuous relaxation of the network representation to construct a weight-sharing supernet and enables the identification of excellent subnets in just a few GPU days. However, performance collapse in DARTS yields deteriorating architectures filled with parameter-free operations and remains a great challenge to its robustness. To resolve this problem, through theoretical and experimental analysis we reveal that the fundamental cause is biased estimation of candidate importance in the search space, and we select operations more precisely via information-based measurements. Furthermore, we demonstrate that excessive concern over the supernet and inefficient utilization of data in bi-level optimization also account for suboptimal results. We adopt a more realistic objective focusing on the performance of subnets and simplify it with the help of the information-based measurements. Finally, we explain theoretically why progressively shrinking the width of the supernet is necessary, and we reduce the approximation error of optimal weights in DARTS. Our proposed method, named IS-DARTS, comprehensively improves DARTS and resolves the aforementioned problems. Extensive experiments on NAS-Bench-201 and the DARTS-based search space demonstrate the effectiveness of IS-DARTS.

NeurIPS Conference 2024 Conference Paper

Make Your LLM Fully Utilize the Context

  • Shengnan An
  • Zexiong Ma
  • Zeqi Lin
  • Nanning Zheng
  • Jian-Guang Lou
  • Weizhu Chen

While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, known as the lost-in-the-middle challenge. We hypothesize that it stems from insufficient explicit supervision during long-context training, which fails to emphasize that any position in a long context can hold crucial information. Based on this intuition, our study presents information-intensive (IN2) training, a purely data-driven solution to overcome lost-in-the-middle. Specifically, IN2 training leverages a synthesized long-context question-answer dataset, where the answer requires (1) fine-grained information awareness on a short segment (~128 tokens) within a synthesized long context (4K-32K tokens), and (2) the integration and reasoning of information from two or more short segments. By applying this information-intensive training to Mistral-7B, we present FILM-7B (FIll-in-the-Middle). To thoroughly assess the ability of FILM-7B to utilize long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves the performance on real-world long-context tasks (e.g., 23.5 -> 26.9 F1 score on NarrativeQA), while maintaining comparable performance on short-context tasks (e.g., 59.3 -> 59.2 accuracy on MMLU).
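
A rough sketch of how such a training example might be synthesized, based only on the abstract's description: a short, question-relevant segment is embedded at a random position inside a long synthetic context. The function, its arguments, and the filler-corpus setup are assumptions, not the paper's released pipeline:

```python
import random

def make_in2_example(key_segment, filler_segments, tokenizer,
                     min_len=4096, max_len=32768):
    """Embed a ~128-token key segment at a random position inside a
    long synthetic context (lengths measured in tokens)."""
    target_len = random.randint(min_len, max_len)
    context, length = [], 0
    # Pad the context with unrelated filler until the target length is reached.
    while length < target_len:
        seg = random.choice(filler_segments)
        context.append(seg)
        length += len(tokenizer.encode(seg))
    # Insert the key segment at a uniformly random position, so that
    # supervision can come from anywhere in the long window.
    insert_at = random.randint(0, len(context))
    context.insert(insert_at, key_segment)
    return " ".join(context)
```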

NeurIPS Conference 2024 Conference Paper

Molecule Design by Latent Prompt Transformer

  • Deqian Kong
  • Yuhao Huang
  • Jianwen Xie
  • Edouardo Honig
  • Ming Xu
  • Shuanghong Xue
  • Pei Lin
  • Sanping Zhou

This work explores the challenging problem of molecule design by framing it as a conditional generative modeling task, where target biological properties or desired chemical constraints serve as conditioning variables. We propose the Latent Prompt Transformer (LPT), a novel generative model comprising three components: (1) a latent vector with a learnable prior distribution modeled by a neural transformation of Gaussian white noise; (2) a molecule generation model based on a causal Transformer, which uses the latent vector as a prompt; and (3) a property prediction model that predicts a molecule's target properties and/or constraint values using the latent prompt. LPT can be learned by maximum likelihood estimation on molecule-property pairs. During property optimization, the latent prompt is inferred from target properties and constraints through posterior sampling and then used to guide the autoregressive molecule generation. After initial training on existing molecules and their properties, we adopt an online learning algorithm to progressively shift the model distribution towards regions that support desired target properties. Experiments demonstrate that LPT not only effectively discovers useful molecules across single-objective, multi-objective, and structure-constrained optimization tasks, but also exhibits strong sample efficiency.

NeurIPS Conference 2024 Conference Paper

Neural P$^3$M: A Long-Range Interaction Modeling Enhancer for Geometric GNNs

  • Yusong Wang
  • Chaoran Cheng
  • Shaoning Li
  • Yuxuan Ren
  • Bin Shao
  • Ge Liu
  • Pheng-Ann Heng
  • Nanning Zheng

Geometric graph neural networks (GNNs) have emerged as powerful tools for modeling molecular geometry. However, they encounter limitations in effectively capturing long-range interactions in large molecular systems. To address this challenge, we introduce **Neural P$^3$M**, a versatile enhancer of geometric GNNs that expands the scope of their capabilities by incorporating mesh points alongside atoms and reimagining traditional mathematical operations in a trainable manner. Neural P$^3$M exhibits flexibility across a wide range of molecular systems and demonstrates remarkable accuracy in predicting energies and forces, outperforming prior methods on benchmarks such as the MD22 dataset. It also achieves an average improvement of 22% on the OE62 dataset while integrating with various architectures. Code is available at https://github.com/OnlyLoveKFC/Neural_P3M.

NeurIPS Conference 2024 Conference Paper

TPR: Topology-Preserving Reservoirs for Generalized Zero-Shot Learning

  • Hui Chen
  • Yanbin Liu
  • Yongqiang Ma
  • Nanning Zheng
  • Xin Yu

Pre-trained vision-language models (VLMs) such as CLIP have shown excellent performance for zero-shot classification. Based on CLIP, recent methods design various learnable prompts to evaluate the zero-shot generalization capability on a base-to-novel setting. This setting assumes test samples are already divided into either base or novel classes, limiting its application to realistic scenarios. In this paper, we focus on a more challenging and practical setting: generalized zero-shot learning (GZSL), i.e., testing with no information about the base/novel division. To address this challenging zero-shot problem, we introduce two unique designs that enable us to classify an image without the need of knowing whether it comes from seen or unseen classes. Firstly, most existing methods only adopt a single latent space to align visual and linguistic features, which has a limited ability to represent complex visual-linguistic patterns, especially for fine-grained tasks. Instead, we propose a dual-space feature alignment module that effectively augments the latent space with a novel attribute space induced by a well-devised attribute reservoir. In particular, the attribute reservoir consists of a static vocabulary and learnable tokens complementing each other for flexible control over feature granularity. Secondly, finetuning CLIP models (e.g., prompt learning) on seen base classes usually sacrifices the model's original generalization capability on unseen novel classes. To mitigate this issue, we present a new topology-preserving objective that can enforce feature topology structures of the combined base and novel classes to resemble the topology of CLIP. In this manner, our model will inherit the generalization ability of CLIP through maintaining the pairwise class angles in the attribute space. Extensive experiments on twelve object recognition datasets demonstrate that our model, termed Topology-Preserving Reservoir (TPR), outperforms strong baselines including both prompt learning and conventional generative-based zero-shot methods.
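
The topology-preserving idea of matching pairwise class angles against frozen CLIP admits a compact sketch. The formulation below is illustrative and may differ from the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def topology_preserving_loss(adapted, clip_frozen):
    """adapted, clip_frozen: (C, d) class embeddings over the combined
    base + novel classes; the adapted features are trained, CLIP's are frozen."""
    a = F.normalize(adapted, dim=-1)
    c = F.normalize(clip_frozen, dim=-1)
    # Pairwise cosine similarities encode the class-angle "topology";
    # matching the two similarity matrices preserves CLIP's structure.
    return F.mse_loss(a @ a.t(), c @ c.t())
```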

AAAI Conference 2024 Conference Paper

Voxel or Pillar: Exploring Efficient Point Cloud Representation for 3D Object Detection

  • Yuhao Huang
  • Sanping Zhou
  • Junjie Zhang
  • Jinpeng Dong
  • Nanning Zheng

Efficient representation of point clouds is fundamental for LiDAR-based 3D object detection. While recent grid-based detectors often encode point clouds into either voxels or pillars, the distinctions between these approaches remain underexplored. In this paper, we quantify the differences between the current encoding paradigms and highlight the limited vertical learning within. To tackle these limitations, we propose a hybrid detection framework named Voxel-Pillar Fusion (VPF), which synergistically combines the unique strengths of both voxels and pillars. To be concrete, we first develop a sparse voxel-pillar encoder that encodes point clouds into voxel and pillar features through 3D and 2D sparse convolutions respectively, and then introduce the Sparse Fusion Layer (SFL), facilitating bidirectional interaction between sparse voxel and pillar features. Our computationally efficient, fully sparse method can be seamlessly integrated into both dense and sparse detectors. Leveraging this powerful yet straightforward representation, VPF delivers competitive performance, achieving real-time inference speeds on the nuScenes and Waymo Open Dataset.

NeurIPS Conference 2023 Conference Paper

Closing the gap between the upper bound and lower bound of Adam's iteration complexity

  • Bohan Wang
  • Jingwen Fu
  • Huishuai Zhang
  • Nanning Zheng
  • Wei Chen

Recently, Arjevani et al. [1] established a lower bound on iteration complexity for first-order optimization under an $L$-smooth condition and a bounded noise variance assumption. However, a thorough review of the existing literature on Adam's convergence reveals a noticeable gap: none of the known guarantees meet this lower bound. In this paper, we close the gap by deriving a new convergence guarantee for Adam, with only an $L$-smooth condition and a bounded noise variance assumption. Our results remain valid across a broad spectrum of hyperparameters. Especially with properly chosen hyperparameters, we derive an upper bound on the iteration complexity of Adam and show that it meets the lower bound for first-order optimizers. To the best of our knowledge, this is the first work to establish such a tight upper bound for Adam's convergence. Our proof utilizes novel techniques to handle the entanglement between momentum and the adaptive learning rate, and to convert the first-order term in the Descent Lemma to the gradient norm, which may be of independent interest.
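
For context, the first-order lower bound referenced here is commonly stated as follows; this is a hedged paraphrase of the setting, and [1] should be consulted for the precise constants:

```latex
% Under L-smoothness and gradient-noise variance bounded by \sigma^2,
% any stochastic first-order method needs on the order of
T \;=\; \Omega\!\left(\frac{\Delta\, L\, \sigma^{2}}{\epsilon^{4}}\right)
% iterations to find an \epsilon-stationary point, where
% \Delta = f(x_0) - \inf_x f(x). "Closing the gap" means proving an
% upper bound for Adam that matches this \epsilon^{-4} rate.
```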

NeurIPS Conference 2023 Conference Paper

DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models

  • Tao Yang
  • Yuwang Wang
  • Yan Lu
  • Nanning Zheng

Aiming to understand the underlying explainable factors behind observations and model the conditional generation process on these factors, we connect disentangled representation learning to diffusion probabilistic models (DPMs) to take advantage of the remarkable modeling ability of DPMs. We propose a new task, disentanglement of DPMs: given a pre-trained DPM, without any annotations of the factors, the task is to automatically discover the inherent factors behind the observations and disentangle the gradient fields of the DPM into sub-gradient fields, each conditioned on the representation of one discovered factor. With disentangled DPMs, those inherent factors can be automatically discovered, explicitly represented, and clearly injected into the diffusion process via the sub-gradient fields. To tackle this task, we devise an unsupervised approach, named DisDiff, achieving, for the first time, disentangled representation learning in the framework of DPMs. Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness of DisDiff.

NeurIPS Conference 2023 Conference Paper

Geometric Transformer with Interatomic Positional Encoding

  • Yusong Wang
  • Shaoning Li
  • Tong Wang
  • Bin Shao
  • Nanning Zheng
  • Tie-Yan Liu

The widespread adoption of Transformer architectures across data modalities has opened new avenues for applications in molecular modeling. Nevertheless, it remains unclear whether Transformer-based architectures can perform molecular modeling as well as equivariant GNNs. In this paper, by designing Interatomic Positional Encoding (IPE), which parameterizes atomic environments as the Transformer's positional encodings, we propose Geoformer, a novel geometric Transformer that effectively models molecular structures for various molecular property prediction tasks. We evaluate Geoformer on several benchmarks, including the QM9 dataset and the recently proposed Molecule3D dataset. Compared with both Transformers and equivariant GNN models, Geoformer outperforms the state-of-the-art (SoTA) algorithms on QM9 and achieves the best performance on Molecule3D for both random and scaffold splits. By introducing IPE, Geoformer paves the way for molecular geometric modeling based on the Transformer architecture. Code is available at https://github.com/microsoft/AI2BMD/tree/Geoformer.
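
A hedged sketch of what an interatomic positional encoding can look like: pairwise distances expanded in radial basis functions and projected into per-head attention biases. The paper's actual IPE is more elaborate; all layer sizes and names below are assumptions:

```python
import torch
import torch.nn as nn

class InteratomicBias(nn.Module):
    def __init__(self, n_rbf=32, n_heads=8, cutoff=10.0):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(0.0, cutoff, n_rbf))
        self.width = cutoff / n_rbf
        self.proj = nn.Linear(n_rbf, n_heads)

    def forward(self, pos):
        """pos: (N, 3) atom coordinates -> (n_heads, N, N) attention bias."""
        dist = torch.cdist(pos, pos)                         # (N, N)
        # Gaussian radial basis expansion of each pairwise distance.
        rbf = torch.exp(-((dist[..., None] - self.centers) / self.width) ** 2)
        return self.proj(rbf).permute(2, 0, 1)               # per-head bias
```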

NeurIPS Conference 2023 Conference Paper

Learning Trajectories are Generalization Indicators

  • Jingwen Fu
  • Zhizheng Zhang
  • Dacheng Yin
  • Yan Lu
  • Nanning Zheng

This paper explores the connection between the learning trajectories of Deep Neural Networks (DNNs) and their generalization capabilities when optimized using (stochastic) gradient descent algorithms. Instead of concentrating solely on the generalization error of the DNN post-training, we present a novel perspective for analyzing generalization error by investigating the contribution of each update step to the change in generalization error. This perspective enables a more direct comprehension of how the learning trajectory influences generalization error. Building upon this analysis, we propose a new generalization bound that incorporates more extensive trajectory information. Our proposed generalization bound depends on the complexity of the learning trajectory and the ratio between the bias and diversity of the training set. Experimental observations reveal that our method effectively captures the generalization error throughout the training process. Furthermore, our approach can also track changes in generalization error when adjustments are made to learning rates and label noise levels. These results demonstrate that learning trajectory information is a valuable indicator of a model's generalization capabilities.

AAAI Conference 2022 Conference Paper

Construct Effective Geometry Aware Feature Pyramid Network for Multi-Scale Object Detection

  • Jinpeng Dong
  • Yuhao Huang
  • Songyi Zhang
  • Shitao Chen
  • Nanning Zheng

Feature Pyramid Network (FPN) has been widely adopted to exploit multi-scale features for scale variation in object detection. However, intrinsic defects in most current FPN-based methods make it difficult to adapt to the features of different geometric objects. To address this issue, we introduce geometric priors into FPN to obtain more discriminative features. In this paper, we propose the Geometry-aware Feature Pyramid Network (GaFPN), which mainly consists of a novel Geometry-aware Mapping Module and a Geometry-aware Predictor Head. The Geometry-aware Mapping Module makes full use of all pyramid features to obtain better proposal features via a weight-generation subnetwork, which generates a fusion weight for each layer's proposal features using the geometric information of the proposal. The Geometry-aware Predictor Head introduces geometric priors into the predictor head via an embedding generation network to strengthen feature representation for classification and regression. Our GaFPN can be easily extended to other two-stage object detectors with feature pyramids and applied to instance segmentation tasks. The proposed GaFPN significantly improves detection performance compared to baseline detectors with ResNet-50-FPN: +1.9, +2.0, +1.7, +1.3, and +0.8 points Average Precision (AP) on Faster-RCNN, Cascade R-CNN, Dynamic R-CNN, SABL, and AugFPN respectively on the MS COCO dataset.
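
A small illustrative sketch (names and shapes assumed, not the paper's implementation) of the weight-generation idea: derive per-level fusion weights from a proposal's normalized geometry, then fuse RoI features across the pyramid with those weights:

```python
import torch
import torch.nn as nn

class GeometryFusionWeights(nn.Module):
    def __init__(self, n_levels=4, hidden=64):
        super().__init__()
        # Input: normalized (x, y, w, h) of each proposal.
        self.mlp = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_levels))

    def forward(self, boxes, roi_feats):
        """boxes: (R, 4); roi_feats: (L, R, C, H, W) RoI features per level."""
        w = torch.softmax(self.mlp(boxes), dim=-1)   # (R, L) fusion weights
        w = w.t()[..., None, None, None]             # (L, R, 1, 1, 1)
        # Geometry-conditioned weighted sum over pyramid levels.
        return (w * roi_feats).sum(dim=0)            # fused (R, C, H, W)
```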

NeurIPS Conference 2022 Conference Paper

Could Giant Pre-trained Image Models Extract Universal Representations?

  • Yutong Lin
  • Ze Liu
  • Zheng Zhang
  • Han Hu
  • Nanning Zheng
  • Stephen Lin
  • Yue Cao

Frozen pretrained models have become a viable alternative to the pretraining-then-finetuning paradigm for transfer learning. However, with frozen models there are relatively few parameters available for adapting to downstream tasks, which is problematic in computer vision where tasks vary significantly in input/output format and the type of information that is of value. In this paper, we present a study of frozen pretrained models when applied to diverse and representative computer vision tasks, including object detection, semantic segmentation and video action recognition. From this empirical analysis, our work answers the questions of what pretraining task fits best with this frozen setting, how to make the frozen setting more flexible to various downstream tasks, and the effect of larger model sizes. We additionally examine the upper bound of performance using a giant frozen pretrained model with 3 billion parameters (SwinV2-G) and find that it reaches competitive performance on a varied set of major benchmarks with only one shared frozen base network: 60.0 box mAP and 52.2 mask mAP on COCO object detection test-dev, 57.6 val mIoU on ADE20K semantic segmentation, and 81.7 top-1 accuracy on Kinetics-400 action recognition. With this work, we hope to bring greater attention to this promising path of freezing pretrained image models.

AAAI Conference 2022 Conference Paper

Learning Disentangled Classification and Localization Representations for Temporal Action Localization

  • Zixin Zhu
  • Le Wang
  • Wei Tang
  • Ziyi Liu
  • Nanning Zheng
  • Gang Hua

A common approach to Temporal Action Localization (TAL) is to generate action proposals and then perform action classification and localization on them. For each proposal, existing methods universally use a shared proposal-level representation for both tasks. However, our analysis indicates that this shared representation focuses on the most discriminative frames for classification, e.g., “take-offs” rather than “run-ups” in distinguishing “high jump” from “long jump”, while frames most relevant to localization, such as the start and end frames of an action, are largely ignored. In other words, such a shared representation cannot simultaneously handle both classification and localization tasks well, and it makes precise TAL difficult. To address this challenge, this paper disentangles the shared representation into classification and localization representations. The disentangled classification representation focuses on the most discriminative frames, and the disentangled localization representation focuses on the action phase as well as the action start and end. Our model can be divided into two sub-networks, i.e., the disentanglement network and the context-based aggregation network. The disentanglement network is an autoencoder that learns orthogonal hidden variables for classification and localization. The context-based aggregation network aggregates the classification and localization representations by modeling local and global contexts. We evaluate our proposed method on two popular benchmarks for TAL, where it outperforms all state-of-the-art methods.

AAAI Conference 2022 Conference Paper

LGD: Label-Guided Self-Distillation for Object Detection

  • Peizhen Zhang
  • Zijian Kang
  • Tong Yang
  • Xiangyu Zhang
  • Nanning Zheng
  • Jian Sun

In this paper, we propose the first self-distillation framework for general object detection, termed LGD (Label-Guided self-Distillation). Previous studies rely on a strong pretrained teacher to provide instructive knowledge that could be unavailable in real-world scenarios. Instead, we generate instructive knowledge based only on student representations and regular labels. Our framework includes a sparse label-appearance encoder, an inter-object relation adapter, and an intra-object knowledge mapper, which jointly form an implicit teacher during the training phase, dynamically dependent on labels and evolving student representations. They are trained end-to-end with the detector and discarded at inference. Experimentally, LGD obtains decent results on various detectors, datasets, and extensive tasks like instance segmentation. For example, on the MS-COCO dataset, LGD improves RetinaNet with ResNet-50 under 2× single-scale training from 36.2% to 39.0% mAP (+2.8%). It boosts much stronger detectors like FCOS with ResNeXt-101 DCN v2 under 2× multi-scale training from 46.1% to 47.9% (+1.8%). Compared with a classical teacher-based method, FGFI, LGD not only performs better without requiring a pretrained teacher but also reduces training cost by 51% beyond inherent student learning. Code is available at https://github.com/megvii-research/LGD.

AAAI Conference 2022 Conference Paper

Social Interpretable Tree for Pedestrian Trajectory Prediction

  • Liushuai Shi
  • Le Wang
  • Chengjiang Long
  • Sanping Zhou
  • Fang Zheng
  • Nanning Zheng
  • Gang Hua

Understanding the multiple socially-acceptable future behaviors is an essential task for many vision applications. In this paper, we propose a tree-based method, termed Social Interpretable Tree (SIT), to address this multi-modal prediction task, where a hand-crafted tree is built depending on the prior information of the observed trajectory to model multiple future trajectories. Specifically, a path in the tree from the root to a leaf represents an individual possible future trajectory. SIT employs a coarse-to-fine optimization strategy, in which the tree is first built using high-order velocity to balance the complexity and coverage of the tree, and then optimized greedily to encourage multimodality. Finally, a teacher-forcing refining operation is used to predict the final fine trajectory. Compared with prior methods that leverage implicit latent variables to represent possible future trajectories, a path in the tree can explicitly explain the rough moving behavior (e.g., go straight and then turn right), and thus provides better interpretability. Despite the hand-crafted tree, the experimental results on the ETH-UCY and Stanford Drone datasets demonstrate that our method is capable of matching or exceeding the performance of state-of-the-art methods. Interestingly, the experiments show that the raw built tree, without training, outperforms many prior deep neural network based approaches. Meanwhile, our method presents sufficient flexibility in long-term prediction and different best-of-K predictions.
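
A toy sketch of the coarse tree-building step as the abstract describes it: branching the last observed velocity over a few candidate turning angles, so each branch is an interpretable path such as "go straight" or "turn right". This is an illustration, not the authors' construction:

```python
import numpy as np

def build_coarse_tree(observed, horizon=12, angles_deg=(-60, -30, 0, 30, 60)):
    """observed: (T, 2) array of past positions. Returns one coarse path
    per candidate turning angle, each of shape (horizon, 2)."""
    velocity = observed[-1] - observed[-2]           # last-step velocity
    paths = []
    for angle in np.deg2rad(angles_deg):
        c, s = np.cos(angle), np.sin(angle)
        v = np.array([c * velocity[0] - s * velocity[1],
                      s * velocity[0] + c * velocity[1]])
        # Extrapolate with the rotated velocity to form one tree branch.
        steps = observed[-1] + np.cumsum(np.tile(v, (horizon, 1)), axis=0)
        paths.append(steps)
    return np.stack(paths)                           # (K, horizon, 2)
```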

NeurIPS Conference 2022 Conference Paper

Visual Concepts Tokenization

  • Tao Yang
  • Yuwang Wang
  • Yan Lu
  • Nanning Zheng

Obtaining the human-like perception ability of abstracting visual concepts from concrete pixels has always been a fundamental and important target in machine learning research fields such as disentangled representation learning and scene decomposition. Towards this goal, we propose an unsupervised transformer-based Visual Concepts Tokenization framework, dubbed VCT, to perceive an image as a set of disentangled visual concept tokens, with each concept token responding to one type of independent visual concept. Particularly, to obtain these concept tokens, we only use cross-attention to extract visual information from the image tokens layer by layer, without self-attention between concept tokens, preventing information leakage across concept tokens. We further propose a Concept Disentangling Loss to encourage different concept tokens to represent independent visual concepts. The cross-attention and disentangling loss play the roles of induction and mutual exclusion for the concept tokens, respectively. Extensive experiments on several popular datasets verify the effectiveness of VCT on the tasks of disentangled representation learning and scene decomposition. VCT achieves state-of-the-art results by a large margin.
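
The key architectural constraint, cross-attention from concept tokens to image tokens with no self-attention among concepts, is easy to sketch. The module below is a minimal illustration with assumed sizes, not the released implementation:

```python
import torch
import torch.nn as nn

class ConceptExtractor(nn.Module):
    def __init__(self, n_concepts=20, dim=256, n_layers=4):
        super().__init__()
        self.concepts = nn.Parameter(torch.randn(n_concepts, dim))
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            for _ in range(n_layers)])

    def forward(self, image_tokens):                  # (B, N, dim)
        b = image_tokens.size(0)
        tokens = self.concepts.unsqueeze(0).expand(b, -1, -1)
        for attn in self.layers:
            # Queries are concept tokens, keys/values are image tokens;
            # crucially, concept tokens never attend to one another.
            update, _ = attn(tokens, image_tokens, image_tokens)
            tokens = tokens + update
        return tokens                                  # (B, n_concepts, dim)
```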

AAAI Conference 2021 Conference Paper

ACSNet: Action-Context Separation Network for Weakly Supervised Temporal Action Localization

  • Ziyi Liu
  • Le Wang
  • Qilin Zhang
  • Wei Tang
  • Junsong Yuan
  • Nanning Zheng
  • Gang Hua

The objective of Weakly-supervised Temporal Action Localization (WS-TAL) is to localize all action instances in an untrimmed video with only video-level supervision. Due to the lack of frame-level annotations during training, current WS-TAL methods rely on attention mechanisms to localize the foreground snippets or frames that contribute to the video-level classification task. This strategy frequently confuses context with the actual action in the localization result. Separating action and context is a core problem for precise WS-TAL, but it is very challenging and has been largely ignored in the literature. In this paper, we introduce an Action-Context Separation Network (ACSNet) that explicitly takes context into account for accurate action localization. It consists of two branches (i.e., the Foreground-Background branch and the Action-Context branch). The Foreground-Background branch first distinguishes foreground from background within the entire video, while the Action-Context branch further separates the foreground into action and context. We associate video snippets with two latent components (i.e., a positive component and a negative component), and their different combinations can effectively characterize foreground, action, and context. Furthermore, we introduce extended labels with auxiliary context categories to facilitate the learning of action-context separation. Experiments on the THUMOS14 and ActivityNet v1.2/v1.3 datasets demonstrate that ACSNet outperforms existing state-of-the-art WS-TAL methods by a large margin.

NeurIPS Conference 2021 Conference Paper

Co-evolution Transformer for Protein Contact Prediction

  • He Zhang
  • Fusong Ju
  • Jianwei Zhu
  • Liang He
  • Bin Shao
  • Nanning Zheng
  • Tie-Yan Liu

Proteins are the main machinery of life and protein functions are largely determined by their 3D structures. The measurement of the pairwise proximity between amino acids of a protein, known as inter-residue contact map, well characterizes the structural information of a protein. Protein contact prediction (PCP) is an essential building block of many protein structure related applications. The prevalent approach to contact prediction is based on estimating the inter-residue contacts using hand-crafted coevolutionary features derived from multiple sequence alignments (MSAs). To mitigate the information loss caused by hand-crafted features, some recently proposed methods try to learn residue co-evolutions directly from MSAs. These methods generally derive coevolutionary features by aggregating the learned residue representations from individual sequences with equal weights, which is inconsistent with the premise that residue co-evolutions are a reflection of collective covariation patterns of numerous homologous proteins. Moreover, non-homologous residues and gaps commonly exist in MSAs. By aggregating features from all homologs equally, the non-homologous information may cause misestimation of the residue co-evolutions. To overcome these issues, we propose an attention-based architecture, Co-evolution Transformer (CoT), for PCP. CoT jointly considers the information from all homologous sequences in the MSA to better capture global coevolutionary patterns. To mitigate the influence of the non-homologous information, CoT selectively aggregates the features from different homologs by assigning smaller weights to non-homologous sequences or residue pairs. Extensive experiments on two rigorous benchmark datasets demonstrate the effectiveness of CoT. In particular, CoT achieves a $51.6\%$ top-L long-range precision score for the Free Modeling (FM) domains on the CASP14 benchmark, which outperforms the winner group of the CASP14 contact prediction challenge by $9.8\%$.
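
A minimal sketch of the selective-aggregation idea from the abstract: learned per-homolog weights instead of an equal-weight average, so likely non-homologous sequences contribute little. The scoring scheme and shapes are assumptions:

```python
import torch
import torch.nn as nn

class HomologAggregator(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # scores each homolog's features

    def forward(self, msa_feats):
        """msa_feats: (S, L, d) features for S aligned sequences of length L."""
        w = torch.softmax(self.score(msa_feats), dim=0)   # (S, L, 1)
        # Weighted sum over homologs; low-scoring (likely non-homologous)
        # sequences are down-weighted in the co-evolutionary signal.
        return (w * msa_feats).sum(dim=0)                 # (L, d)
```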

NeurIPS Conference 2021 Conference Paper

Dynamic Grained Encoder for Vision Transformers

  • Lin Song
  • Songyang Zhang
  • Songtao Liu
  • Zeming Li
  • Xuming He
  • Hongbin Sun
  • Jian Sun
  • Nanning Zheng

Transformers, the de-facto standard for language modeling, have recently been applied to vision tasks. This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images and save computational costs. Specifically, we propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region. Thus it achieves a fine-grained representation in discriminative regions while keeping high efficiency. Besides, the dynamic grained encoder is compatible with most vision transformer frameworks. Without bells and whistles, our encoder allows the state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification. Extensive experiments on object detection and segmentation further demonstrate the generalizability of our approach. Code is available at https://github.com/StevenGrove/vtpack.

IJCAI Conference 2021 Conference Paper

Hindsight Trust Region Policy Optimization

  • Hanbo Zhang
  • Site Bai
  • Xuguang Lan
  • David Hsu
  • Nanning Zheng

Reinforcement Learning (RL) with sparse rewards is a major challenge. We propose Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle the challenge of sparse rewards. Hindsight refers to the algorithm’s ability to learn from information across goals, including past goals not intended for the current task. We derive the hindsight form of TRPO, together with QKL, a quadratic approximation to the KL divergence constraint on the trust region. QKL reduces variance in KL divergence estimation and improves stability in policy updates. We show that HTRPO has convergence properties similar to those of TRPO. We also present Hindsight Goal Filtering (HGF), which further improves the learning performance for suitable tasks. HTRPO has been evaluated on various sparse-reward tasks, including Atari games and simulated robot control. Experimental results show that HTRPO consistently outperforms TRPO, as well as HPG, a state-of-the-art policy gradient algorithm for RL with sparse rewards.
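
For reference, a quadratic approximation of the KL trust-region constraint typically takes the following second-order form; QKL in the abstract is a variance-reduced quadratic approximation of this kind, and the paper's exact estimator may differ:

```latex
% Second-order expansion of the KL constraint around the old policy:
D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}\,\|\,\pi_{\theta}\big)
\;\approx\; \tfrac{1}{2}\,(\theta-\theta_{\mathrm{old}})^{\top}
F(\theta_{\mathrm{old}})\,(\theta-\theta_{\mathrm{old}}),
% where F is the Fisher information matrix of the old policy.
```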

NeurIPS Conference 2021 Conference Paper

Instance-Conditional Knowledge Distillation for Object Detection

  • Zijian Kang
  • Peizhen Zhang
  • Xiangyu Zhang
  • Jian Sun
  • Nanning Zheng

Knowledge distillation has shown great success in classification, however, it is still challenging for detection. In a typical image for detection, representations from different locations may have different contributions to detection targets, making the distillation hard to balance. In this paper, we propose a conditional distillation framework to distill the desired knowledge, namely knowledge that is beneficial in terms of both classification and localization for every instance. The framework introduces a learnable conditional decoding module, which retrieves information given each target instance as query. Specifically, we encode the condition information as query and use the teacher's representations as key. The attention between query and key is used to measure the contribution of different features, guided by a localization-recognition-sensitive auxiliary task. Extensive experiments demonstrate the efficacy of our method: we observe impressive improvements under various settings. Notably, we boost RetinaNet with ResNet-50 backbone from $37.4$ to $40.7$ mAP ($+3.3$) under $1\times$ schedule, which even surpasses the teacher ($40.4$ mAP) with ResNet-101 backbone under $3\times$ schedule. Code has been released at https://github.com/megvii-research/ICD.
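
An illustrative sketch of the conditional decoding idea: instance conditions as queries, teacher features as keys, and the resulting attention map weighting a feature-imitation loss. All module names and shapes are assumptions, not the released code:

```python
import torch
import torch.nn as nn

class InstanceConditionalDistill(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, inst_queries, teacher_feats, student_feats):
        """inst_queries: (B, M, d) encoded instances; *_feats: (B, N, d)."""
        _, attn_w = self.attn(inst_queries, teacher_feats, teacher_feats)
        # Average over instances: per-location importance mask (B, N).
        mask = attn_w.mean(dim=1)
        per_loc = (student_feats - teacher_feats).pow(2).mean(dim=-1)
        # Imitation loss concentrated on instance-relevant locations.
        return (mask * per_loc).sum(dim=1).mean()
```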

AAAI Conference 2021 Conference Paper

Semantic Consistency Networks for 3D Object Detection

  • Wenwen Wei
  • Ping Wei
  • Nanning Zheng

Detecting 3D objects from point clouds is a significant yet challenging issue in many applications. While most existing approaches seek to leverage geometric information of point clouds, few studies accommodate the inherent semantic characteristics of each point and the consistency between the geometric and semantic cues. In this work, we propose a novel semantic consistency network (SCNet) driven by a natural principle: the class of a predicted 3D bounding box should be consistent with the classes of all the points inside this box. Specifically, our SCNet consists of a feature extraction structure, a detection decision structure, and a semantic segmentation structure. In inference, the feature extraction and the detection decision structures are used to detect 3D objects. In training, the semantic segmentation structure is jointly trained with the other two structures to produce more robust and applicable model parameters. A novel semantic consistency loss is proposed to regulate the output 3D object boxes and the segmented points to boost the performance. Our model is evaluated on two challenging datasets and achieves comparable results to the state-of-the-art methods.
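
The stated principle, that a box's class should agree with the classes of the points inside it, suggests a loss of roughly the following shape. This is an assumption-laden sketch, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(box_logits, point_logits, inside_mask):
    """box_logits: (C,) for one predicted box; point_logits: (N, C) for all
    points; inside_mask: (N,) bool marking points falling inside the box."""
    box_prob = F.softmax(box_logits, dim=-1)
    # Aggregate the class distribution of the box's interior points.
    point_prob = F.softmax(point_logits[inside_mask], dim=-1).mean(dim=0)
    # Penalize disagreement between the box class and its interior points.
    return F.kl_div(point_prob.clamp_min(1e-9).log(), box_prob,
                    reduction="sum")
```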

AAAI Conference 2021 Conference Paper

Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context

  • Ziyi Liu
  • Le Wang
  • Wei Tang
  • Junsong Yuan
  • Nanning Zheng
  • Gang Hua

Weakly-supervised Temporal Action Localization (WS-TAL) methods learn to localize temporal starts and ends of action instances in a video under only video-level supervision. Existing WS-TAL methods rely on deep features learned for action recognition. However, due to the mismatch between classification and localization, these features cannot distinguish the frequently co-occurring contextual background, i.e., the context, and the actual action instances. We term this challenge action-context confusion, and it will adversely affect the action localization accuracy. To address this challenge, we introduce a framework that learns two feature subspaces respectively for actions and their context. By explicitly accounting for action visual elements, the action instances can be localized more precisely without the distraction from the context. To facilitate the learning of these two feature subspaces with only video-level categorical labels, we leverage the predictions from both spatial and temporal streams for snippets grouping. In addition, an unsupervised learning task is introduced to make the proposed module focus on mining temporal information. The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks, i.e., the THUMOS14, ActivityNet v1.2, and v1.3 datasets.

NeurIPS Conference 2020 Conference Paper

Compositional Generalization by Learning Analytical Expressions

  • Qian Liu
  • Shengnan An
  • Jian-Guang Lou
  • Bei Chen
  • Zeqi Lin
  • Yan Gao
  • Bin Zhou
  • Nanning Zheng

Compositional generalization is a basic and essential intellective capability of human beings, which allows us to recombine known parts readily. However, existing neural network based models have been proven to be extremely deficient in such a capability. Inspired by work in cognition which argues compositionality can be captured by variable slots with symbolic functions, we present a refreshing view that connects a memory-augmented neural model with analytical expressions, to achieve compositional generalization. Our model consists of two cooperative neural modules, Composer and Solver, fitting well with the cognitive argument while being able to be trained in an end-to-end manner via a hierarchical reinforcement learning algorithm. Experiments on the well-known benchmark SCAN demonstrate that our model achieves a great ability of compositional generalization, solving all challenges addressed by previous works with 100% accuracy.

NeurIPS Conference 2020 Conference Paper

Fine-Grained Dynamic Head for Object Detection

  • Lin Song
  • Yanwei Li
  • Zhengkai Jiang
  • Zeming Li
  • Hongbin Sun
  • Jian Sun
  • Nanning Zheng

The Feature Pyramid Network (FPN) presents a remarkable approach to alleviate the scale variance in object representation by performing instance-level assignments. Nevertheless, this strategy ignores the distinct characteristics of different sub-regions in an instance. To this end, we propose a fine-grained dynamic head to conditionally select a pixel-level combination of FPN features from different scales for each instance, which further unleashes the ability of multi-scale feature representation. Moreover, we design a spatial gate with a new activation function to reduce computational complexity dramatically through spatially sparse convolutions. Extensive experiments demonstrate the effectiveness and efficiency of the proposed method on several state-of-the-art detection benchmarks. Code is available at https://github.com/StevenGrove/DynamicHead.

IS Journal 2020 Journal Article

Joint Intelligence Ranking by Federated Multiplicative Update

  • Chi Zhang
  • Yu Liu
  • Le Wang
  • Yuehu Liu
  • Li Li
  • Nanning Zheng

The joint intelligence ranking of intelligent systems such as autonomous driving is of great importance for building a more general, extensive, and universally accepted intelligence evaluation scheme. However, due to issues such as privacy security and industry or area competition, integrating isolated test results faces substantial difficulties in information security and encrypted model training. To address this, we derive the federated multiplicative update (FMU) algorithm with boundary constraints to solve nonnegative matrix factorization based joint intelligence ranking. An encrypted learning process is developed to replace the original computation steps in multiplicative update algorithms. With fast convergence and secure exchange of variables, the proposed framework outperforms previous work on both real and simulated data. Further experimental analysis reveals that the introduced federated mechanism does not harm overall time efficiency.
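
For background, the base iteration that a federated multiplicative update would adapt is the classic Lee-Seung NMF update. The sketch below shows only this non-federated, unencrypted baseline, not the FMU algorithm itself:

```python
import numpy as np

def nmf_multiplicative(V, rank, iters=200, eps=1e-9):
    """Factor nonnegative V (m x n) as W @ H with W (m x r), H (r x n)."""
    m, n = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(iters):
        # Multiplicative updates keep W and H nonnegative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```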

NeurIPS Conference 2020 Conference Paper

Rethinking Learnable Tree Filter for Generic Feature Transform

  • Lin Song
  • Yanwei Li
  • Zhengkai Jiang
  • Zeming Li
  • Xiangyu Zhang
  • Hongbin Sun
  • Jian Sun
  • Nanning Zheng

The Learnable Tree Filter presents a remarkable approach to model structure-preserving relations for semantic segmentation. Nevertheless, the intrinsic geometric constraint forces it to focus on regions with close spatial distance, hindering effective long-range interactions. To relax the geometric constraint, we give an analysis by reformulating it as a Markov Random Field and introduce a learnable unary term. Besides, we propose a learnable spanning tree algorithm to replace the original non-differentiable one, which further improves flexibility and robustness. With the above improvements, our method can better capture long-range dependencies and preserve structural details with linear complexity, and it is extended to several vision tasks for more generic feature transform. Extensive experiments on object detection/instance segmentation demonstrate consistent improvements over the original version. For semantic segmentation, we achieve leading performance (82.1% mIoU) on the Cityscapes benchmark without bells and whistles. Code is available at https://github.com/StevenGrove/LearnableTreeFilterV2.

NeurIPS Conference 2019 Conference Paper

Learnable Tree Filter for Structure-preserving Feature Transform

  • Lin Song
  • Yanwei Li
  • Zeming Li
  • Gang Yu
  • Hongbin Sun
  • Jian Sun
  • Nanning Zheng

Learning discriminative global features plays a vital role in semantic segmentation. Most existing methods adopt stacks of local convolutions or non-local blocks to capture long-range context. However, due to the absence of spatial structure preservation, these operators ignore object details when enlarging receptive fields. In this paper, we propose the learnable tree filter to form a generic tree filtering module that leverages the structural property of the minimal spanning tree to model long-range dependencies while preserving details. Furthermore, we propose a highly efficient linear-time algorithm to reduce resource consumption. Thus, the designed modules can be conveniently plugged into existing deep neural networks. To this end, tree filtering modules are embedded to formulate a unified framework for semantic segmentation. We conduct extensive ablation studies to elaborate on the effectiveness and efficiency of the proposed method. Specifically, it attains better performance with much less overhead compared with the classic PSP block and non-local operation under the same backbone. Our approach is shown to achieve consistent improvements on several benchmarks without bells and whistles. Code and models are available at https://github.com/StevenGrove/TreeFilter-Torch.

AAAI Conference 2019 Conference Paper

Recognizing Unseen Attribute-Object Pair with Generative Model

  • Zhixiong Nan
  • Yang Liu
  • Nanning Zheng
  • Song-Chun Zhu

In this paper, we study the problem of recognizing attribute-object pairs that do not appear in the training dataset, which is called unseen attribute-object pair recognition. Existing methods mainly learn a discriminative classifier or compose multiple classifiers to tackle this problem, and they exhibit poor performance on unseen pairs. The key reasons for this failure are that 1) they have not learned an intrinsic attribute-object representation, and 2) the attribute and object are processed either separately or equally, so that the inner relation between the attribute and object has not been explored. To explore the inner relation of attribute and object as well as the intrinsic attribute-object representation, we propose a generative model with an encoder-decoder mechanism that bridges visual and linguistic information in a unified end-to-end network. The encoder-decoder mechanism presents impressive potential for finding an intrinsic attribute-object feature representation. In addition, combining visual and linguistic features in a unified model allows us to mine the relation of attribute and object. We conducted extensive experiments to compare our method with several state-of-the-art methods on two challenging datasets. The results show that our method outperforms all other methods.

AAAI Conference 2019 Conference Paper

Video Imprint Segmentation for Temporal Action Detection in Untrimmed Videos

  • Zhanning Gao
  • Le Wang
  • Qilin Zhang
  • Zhenxing Niu
  • Nanning Zheng
  • Gang Hua

We propose a temporal action detection by spatial segmentation framework, which simultaneously categorizes actions and temporally localizes action instances in untrimmed videos. The core idea is the conversion of the temporal detection task into a spatial semantic segmentation task. Firstly, the video imprint representation is employed to capture the spatial/temporal interdependences within/among frames and represent them as spatial proximity in a feature space. Subsequently, the obtained imprint representation is spatially segmented by a fully convolutional network. With such segmentation labels projected back to the video space, both temporal action boundary localization and per-frame spatial annotation can be obtained simultaneously. The proposed framework is robust to the variable lengths of untrimmed videos, owing to the underlying fixed-size imprint representations. The efficacy of the framework is validated on two public action detection datasets.

AAAI Conference 2018 Conference Paper

Cross-View Person Identification by Matching Human Poses Estimated With Confidence on Each Body Joint

  • Guoqiang Liang
  • Xuguang Lan
  • Kang Zheng
  • Song Wang
  • Nanning Zheng

Cross-view person identification (CVPI) from multiple temporally synchronized videos taken by multiple wearable cameras from different, varying views is a very challenging but important problem, which has attracted increasing interest recently. The current state-of-the-art performance of CVPI is achieved by matching appearance and motion features across videos, while the matching of pose features does not work effectively given the high inaccuracy of 3D human pose estimation on videos/images collected in the wild. In this paper, we introduce a new confidence metric for 3D human pose estimation and show that the combination of the inaccurately estimated human pose and the inferred confidence metric can be used to boost CVPI performance: the estimated pose information can be integrated into the appearance and motion features to achieve new state-of-the-art CVPI performance. More specifically, the estimated confidence metric is measured at each human-body joint, and joints with higher confidence are weighted more in the pose matching for CVPI. In the experiments, we validate the proposed method on three wearable-camera video datasets and compare its performance against several other existing CVPI methods.
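
A minimal sketch of confidence-weighted pose matching in the spirit of the abstract; the specific weighting scheme (here, the product of per-joint confidences from the two views) is an assumption:

```python
import numpy as np

def weighted_pose_distance(pose_a, pose_b, conf_a, conf_b):
    """pose_*: (J, 3) estimated 3D joints; conf_*: (J,) per-joint confidence."""
    w = conf_a * conf_b                  # trust a joint only if both views do
    w = w / (w.sum() + 1e-9)             # normalize weights over joints
    per_joint = np.linalg.norm(pose_a - pose_b, axis=1)
    return float((w * per_joint).sum())  # lower = more likely the same person
```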

IS Journal 2018 Journal Article

Multivariate Correlation Entropy and Law Discovery in Large Data Sets

  • Jianji Wang
  • Nanning Zheng
  • Badong Chen
  • Pei Chen
  • Shitao Chen
  • Ziyi Liu
  • Fei-Yue Wang
  • Bao Xi

Over the past several centuries, many important natural laws have been discovered by scientists, which have not only changed our viewpoints about nature but also affected our lives significantly. Today, the automatic discovery of meaningful laws from data beyond two variables has become an important task of our time. Here, we propose two multivariate correlation measures, namely, the multivariate correlation entropy (MCE) and the multivariate incorrelation entropy (MIE), which can be used to measure the strength of the correlation among multiple variables. Using MIE makes it possible to directly detect linear relations existing in large data sets. In addition, more complicated nonlinear multivariate laws can be discovered using a function dictionary.

IJCAI Conference 2017 Conference Paper

Discriminative Dictionary Learning With Ranking Metric Embedded for Person Re-Identification

  • De Cheng
  • Xiaojun Chang
  • Li Liu
  • Alexander G. Hauptmann
  • Yihong Gong
  • Nanning Zheng

The goal of person re-identification (Re-Id) is to match pedestrians captured from multiple non-overlapping cameras. In this paper, we propose a novel dictionary learning based method with a ranking metric embedded for person Re-Id. A new and essential ranking graph Laplacian term is introduced, which enforces intra-personal compactness and inter-personal dispersion in the objective. Different from traditional dictionary learning based approaches and their extensions, which use only binary same/not-same information, our proposed method can exploit the ranking relationship among person images, which is essential for such retrieval-related tasks. Simultaneously, a distance measurement is explicitly learned in the model to further improve performance. Since we have reformulated these ranking constraints into the graph Laplacian form, the proposed method is easy to implement yet effective. We conduct extensive experiments on three widely used person Re-Id benchmark datasets and achieve state-of-the-art performance.
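
For reference, ranking constraints written in graph-Laplacian form typically rely on the following standard identity. This is illustrative; the paper's ranking-specific weights $W_{ij}$ would encode the intra-personal compactness versus inter-personal dispersion structure:

```latex
% Generic identity behind graph-Laplacian regularizers:
\tfrac{1}{2}\sum_{i,j} W_{ij}\,\lVert x_i - x_j \rVert_2^2
\;=\; \operatorname{tr}\!\left(X L X^{\top}\right),
\qquad L = D - W,\quad D_{ii} = \sum_{j} W_{ij}.
```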

IJCAI Conference 2017 Conference Paper

Inferring Human Attention by Learning Latent Intentions

  • Ping Wei
  • Dan Xie
  • Nanning Zheng
  • Song-Chun Zhu

This paper addresses the problem of inferring 3D human attention in RGB-D videos at scene scale. 3D human attention describes where a human is looking in 3D scenes. We propose a probabilistic method to jointly model attention, intentions, and their interactions. Latent intentions guide human attention which conversely reveals the intention features. This mutual interaction makes attention inference a joint optimization with latent intentions. An EM-based approach is adopted to learn the latent intentions and model parameters. Given an RGB-D video with 3D human skeletons, a joint-state dynamic programming algorithm is utilized to jointly infer the latent intentions, the 3D attention directions, and the attention voxels in scene point clouds. Experiments on a new 3D human attention dataset prove the strength of our method.

IJCAI Conference 2009 Conference Paper

Boosting Constrained Mutual Subspace Method for Robust Image-Set Based Object Recognition

  • Xi Li
  • Kazuhiro Fukui
  • Nanning Zheng

Object recognition using an image set or video sequence as input tends to be more robust, since an image set or video sequence provides much more information than a single snapshot about the variability in the appearance of the target subject. The Constrained Mutual Subspace Method (CMSM) is one of the state-of-the-art algorithms for image-set based object recognition; it first projects the image-set patterns onto the so-called generalized difference subspace and then classifies based on the principal-angle-based mutual subspace distance. By treating the subspace bases for each image-set pattern as basic elements on the Grassmann manifold, this paper presents a framework for robust image-set based recognition through CMSM-based ensemble learning in a boosting manner. The proposed Boosting Constrained Mutual Subspace Method (BCMSM) improves the original CMSM in the following ways: a) the proposed BCMSM algorithm is insensitive to the dimension of the generalized difference subspace, whereas the performance of the original CMSM algorithm depends heavily on this dimension, and selecting the optimal value is quite empirical and case-dependent; b) by taking advantage of both boosting and CMSM techniques, the generalization ability is improved and much higher classification performance can be achieved. Extensive experiments on real-life data sets (two face recognition tasks and one 3D object category classification task) show that the proposed method greatly outperforms previous state-of-the-art algorithms in terms of classification accuracy.

IS Journal 2008 Journal Article

50 Years of Image Processing and Pattern Recognition in China

  • Nanning Zheng
  • Qubo You
  • Gaofeng Meng
  • Jihua Zhu
  • Shaoyi Du
  • Jianyi Liu

This article briefly reviews the development of image recognition in and outside China. It presents theoretical research achievements and applied research as well as several typical applications of image recognition in China. Finally, it discusses future trends in image recognition integrated with cognitive science. This article is part of a special issue on AI in China.