Arrow Research search

Author name cluster

Wen Gao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

26 papers
1 author row

Possible papers

26

AAAI Conference 2025 Conference Paper

CALLIC: Content Adaptive Learning for Lossless Image Compression

  • Daxin Li
  • Yuanchao Bai
  • Kai Wang
  • Junjun Jiang
  • Xianming Liu
  • Wen Gao

Learned lossless image compression has achieved significant advancements in recent years. However, existing methods often rely on training amortized generative models on massive datasets, resulting in sub-optimal probability distribution estimation for specific testing images during the encoding process. To address this challenge, we explore the connection between the Minimum Description Length (MDL) principle and Parameter-Efficient Transfer Learning (PETL), leading to the development of a novel content-adaptive approach for learned lossless image compression, dubbed CALLIC. Specifically, we first propose a content-aware autoregressive self-attention mechanism that leverages convolutional gating operations, termed Masked Gated ConvFormer (MGCF), and pretrain MGCF on the training dataset. Cache then Crop Inference (CCI) is proposed to accelerate the coding process. During encoding, we decompose the pretrained layers, including depth-wise convolutions, using low-rank matrices and then adapt the incremental weights to the testing image by Rate-guided Progressive Fine-Tuning (RPFT). RPFT fine-tunes with gradually increasing patches that are sorted in descending order by estimated entropy, optimizing the learning process and reducing adaptation time. Extensive experiments across diverse datasets demonstrate that CALLIC sets a new state of the art (SOTA) for learned lossless image compression.
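For readers unfamiliar with the low-rank adaptation idea this abstract builds on, a minimal sketch (not the paper's code; all names and shapes are illustrative) of adapting a frozen pretrained weight matrix with a trainable low-rank increment, so only the small factors would be fine-tuned per test image:

```python
# Illustrative LoRA-style low-rank adaptation: the frozen weight W is
# augmented with A @ B, where A (d x r) and B (r x d) have small rank r.
# Only A and B would be updated during per-image fine-tuning.

def matmul(X, Y):
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def adapted_weight(W, A, B):
    """Return W + A @ B, the low-rank-adapted weight."""
    AB = matmul(A, B)
    return [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, AB)]

# 2x2 frozen weight, rank-1 increment: A is 2x1, B is 1x2.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [2.0]]
B = [[0.5, 0.5]]
print(adapted_weight(W, A, B))  # [[1.5, 0.5], [1.0, 2.0]]
```

The appeal is the parameter count: a rank-r increment stores 2dr numbers instead of d², which is what makes per-image adaptation affordable at encode time.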

IJCAI Conference 2025 Conference Paper

Emerging Advances in Learned Video Compression: Models, Systems and Beyond

  • Chuanmin Jia
  • Feng Ye
  • Siwei Ma
  • Wen Gao
  • Huifang Sun
  • Leonardo Chiariglione

Video compression is a fundamental topic in visual intelligence, bridging visual signal sensing/capturing and high-level visual analytics. The broad success of artificial intelligence (AI) technology has enriched the horizon of video compression with novel paradigms that leverage end-to-end optimized neural models. In this survey, we first provide a comprehensive and systematic overview of recent literature on end-to-end optimized learned video coding, covering the spectrum of pioneering efforts in both uni-directional and bi-directional prediction based compression model design. We further delve into the optimization techniques employed in learned video compression (LVC), emphasizing their technical innovations and advantages. Standardization progress is also reported. Furthermore, we investigate the system design and hardware implementation challenges of LVC. Finally, we present extensive simulation results to demonstrate the superior compression performance of LVC models, addressing the question of why learned codecs and AI-based video technology will have a broad impact on future visual intelligence research.

AAAI Conference 2022 Conference Paper

Towards End-to-End Image Compression and Analysis with Transformers

  • Yuanchao Bai
  • Xu Yang
  • Xianming Liu
  • Junjun Jiang
  • Yaowei Wang
  • Xiangyang Ji
  • Wen Gao

We propose an end-to-end image compression and analysis model with Transformers, targeting the cloud-based image classification application. Instead of placing an existing Transformer-based image classification model directly after an image codec, we aim to redesign the Vision Transformer (ViT) model to perform image classification from the compressed features and facilitate image compression with the long-term information from the Transformer. Specifically, we first replace the patchify stem (i.e., image splitting and embedding) of the ViT model with a lightweight image encoder modelled by a convolutional neural network. The compressed features generated by the image encoder are injected with convolutional inductive bias and are fed to the Transformer for image classification, bypassing image reconstruction. Meanwhile, we propose a feature aggregation module to fuse the compressed features with selected intermediate features of the Transformer, and feed the aggregated features to a deconvolutional neural network for image reconstruction. The aggregated features obtain long-term information from the self-attention mechanism of the Transformer and improve the compression performance. The rate-distortion-accuracy optimization problem is finally solved by a two-step training strategy. Experimental results demonstrate the effectiveness of the proposed model in both the image compression and the classification tasks.

NeurIPS Conference 2021 Conference Paper

MAU: A Motion-Aware Unit for Video Prediction and Beyond

  • Zheng Chang
  • Xinfeng Zhang
  • Shanshe Wang
  • Siwei Ma
  • Yan Ye
  • Xiang Xinguang
  • Wen Gao

Accurately predicting inter-frame motion information plays a key role in video prediction tasks. In this paper, we propose a Motion-Aware Unit (MAU) to capture reliable inter-frame motion information by broadening the temporal receptive field of the predictive units. The MAU consists of two modules, the attention module and the fusion module. The attention module learns an attention map based on the correlations between the current spatial state and the historical spatial states. Based on the learned attention map, the historical temporal states are aggregated into augmented motion information (AMI). In this way, the predictive unit can perceive more temporal dynamics from a wider receptive field. The fusion module is then utilized to aggregate the augmented motion information (AMI) and the current appearance information (the current spatial state) into the final predicted frame. The computational load of MAU is relatively low and the proposed unit can be easily applied to other predictive models. Moreover, an information recalling scheme is employed in the encoders and decoders to help preserve the visual details of the predictions. We evaluate the MAU on both video prediction and early action recognition tasks. Experimental results show that the MAU outperforms the state-of-the-art methods on both tasks.
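The attention-module idea sketched in the abstract can be illustrated in a few lines (a hypothetical simplification, not the authors' implementation; states here are flat vectors rather than spatial feature maps):

```python
# Hedged sketch of correlation-based attention over history: score each
# historical spatial state by its dot product with the current spatial
# state, softmax the scores, and use the weights to aggregate the
# historical temporal states into "augmented motion information".
import math

def attention_aggregate(current, history_spatial, history_temporal):
    scores = [sum(c * h for c, h in zip(current, hs)) for hs in history_spatial]
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    weights = [e / z for e in exp]          # softmax over the history
    dim = len(history_temporal[0])
    return [sum(w * ht[i] for w, ht in zip(weights, history_temporal))
            for i in range(dim)]            # weighted aggregate (AMI)

# Two identical historical spatial states -> equal weights -> the AMI
# is simply the average of the historical temporal states.
print(attention_aggregate([1.0, 0.0],
                          [[1.0, 0.0], [1.0, 0.0]],
                          [[2.0, 0.0], [4.0, 0.0]]))  # [3.0, 0.0]
```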

NeurIPS Conference 2021 Conference Paper

Post-Training Quantization for Vision Transformer

  • Zhenhua Liu
  • Yunhe Wang
  • Kai Han
  • Wei Zhang
  • Siwei Ma
  • Wen Gao

Recently, transformers have achieved remarkable performance on a variety of computer vision applications. Compared with mainstream convolutional neural networks, vision transformers often have sophisticated architectures for extracting powerful feature representations, which makes them more difficult to deploy on mobile devices. In this paper, we present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers. Basically, the quantization task can be regarded as finding the optimal low-bit quantization intervals for weights and inputs, respectively. To preserve the functionality of the attention mechanism, we introduce a ranking loss into the conventional quantization objective that aims to keep the relative order of the self-attention results after quantization. Moreover, we thoroughly analyze the relationship between the quantization loss of different layers and the feature diversity, and explore a mixed-precision quantization scheme by exploiting the nuclear norm of each attention map and output feature. The effectiveness of the proposed method is verified on several benchmark models and datasets, where it outperforms the state-of-the-art post-training quantization algorithms. For instance, we obtain 81.29% top-1 accuracy using the DeiT-B model on the ImageNet dataset with about 8-bit quantization. Code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/VT-PTQ.
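As background for the "quantization interval" the abstract refers to, here is a minimal uniform quantize/dequantize round trip (illustrative only; the paper's contribution is the ranking-loss-guided search for the interval, which is not shown here):

```python
# Minimal uniform post-training quantization: map floats to signed
# b-bit integers with a scale (the quantization interval), clipping to
# the representable range; dequantize multiplies the scale back.

def quantize(xs, scale, bits=8):
    """Round each value to the nearest multiple of `scale`, clipped."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(x / scale))) for x in xs]

def dequantize(q, scale):
    """Recover approximate float values from quantized integers."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.31]
q = quantize(weights, scale=0.01)
print(q)  # [12, -50, 31]
```

Choosing the scale trades clipping error against rounding error; the paper's search for an interval that preserves the ranking of attention scores is one principled way to make that trade-off.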

AAAI Conference 2021 Conference Paper

Segatron: Segment-Aware Transformer for Language Modeling and Understanding

  • He Bai
  • Peng Shi
  • Jimmy Lin
  • Yuqing Xie
  • Luchen Tan
  • Kun Xiong
  • Wen Gao
  • Ming Li

Transformers are powerful for sequence modeling. Nearly all state-of-the-art language models and pre-trained language models are based on the Transformer architecture. However, it distinguishes sequential tokens only by the token position index. We hypothesize that better contextual representations can be generated from the Transformer with richer positional information. To verify this, we propose a segment-aware Transformer (Segatron), which replaces the original token position encoding with a combined position encoding of paragraph, sentence, and token. We first introduce the segment-aware mechanism to Transformer-XL, a popular Transformer-based language model with memory extension and relative position encoding. We find that our method can further improve the Transformer-XL base and large models, achieving 17.1 perplexity on the WikiText-103 dataset. We further investigate the pre-training masked language modeling task with Segatron. Experimental results show that BERT pre-trained with Segatron (SegaBERT) outperforms BERT with the vanilla Transformer on various NLP tasks, and outperforms RoBERTa on zero-shot sentence representation learning. Our code is available on GitHub.
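The combined position encoding described above amounts to assigning each token three indices instead of one; a toy sketch (hypothetical, not the authors' code) of computing those indices:

```python
# Sketch of segment-aware positions: every token gets a (paragraph,
# sentence, token) index triple. In the model, the final position
# embedding would be the sum of three lookups, roughly
#   e = E_para[p] + E_sent[s] + E_tok[t]
# (a stated assumption about the combination, not shown here).

def segment_positions(paragraphs):
    """paragraphs: list of paragraphs, each a list of sentences,
    each a list of tokens. Returns one (p, s, t) triple per token."""
    out = []
    for p, para in enumerate(paragraphs):
        for s, sent in enumerate(para):
            for t, _tok in enumerate(sent):
                out.append((p, s, t))
    return out

doc = [[["the", "cat"], ["sat"]], [["end"]]]
print(segment_positions(doc))
# [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)]
```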

AAAI Conference 2020 Conference Paper

CGD: Multi-View Clustering via Cross-View Graph Diffusion

  • Chang Tang
  • Xinwang Liu
  • Xinzhong Zhu
  • En Zhu
  • Zhigang Luo
  • Lizhe Wang
  • Wen Gao

Graph-based multi-view clustering has received great attention for its ability to explore the neighborhood relationships among data points from multiple views. Though achieving great success in various applications, we observe that most previous methods learn a consensus graph by building certain data representation models, which bears at least the following drawbacks. First, their clustering performance highly depends on the data representation capability of the model. Second, solving the resultant optimization models usually incurs high computational complexity. Third, these models often contain hyperparameters that need to be tuned to obtain optimal results. In this work, we propose a general, effective and parameter-free method with a convergence guarantee to learn a unified graph for multi-view data clustering via cross-view graph diffusion (CGD), which is the first attempt to employ a diffusion process for multi-view clustering. The proposed CGD takes the traditional predefined graph matrices of different views as input, and learns an improved graph for each single view via an iterative cross-diffusion process by 1) capturing the underlying manifold geometry structure of the original data points, and 2) leveraging the complementary information among multiple graphs. The final unified graph used for clustering is obtained by averaging the improved view-associated graphs. Extensive experiments on several benchmark datasets demonstrate the effectiveness of the proposed method in terms of seven clustering evaluation metrics.
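A two-view caricature of the cross-diffusion and averaging steps described above (a hypothetical simplification for intuition; the actual CGD update and its convergence analysis are in the paper):

```python
# Each view's graph is repeatedly diffused through the OTHER view's
# graph (A1 <- S1 A2 S1^T, A2 <- S2 A1 S2^T), then the unified graph
# is the average of the two improved graphs.

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def transpose(X):
    return [list(r) for r in zip(*X)]

def cross_diffuse(S1, S2, steps=3):
    A1, A2 = S1, S2
    for _ in range(steps):
        # each update uses the other view's graph from the previous step
        A1, A2 = (matmul(matmul(S1, A2), transpose(S1)),
                  matmul(matmul(S2, A1), transpose(S2)))
    n = len(S1)
    return [[(A1[i][j] + A2[i][j]) / 2 for j in range(n)] for i in range(n)]
```

With two identical (e.g. identity) similarity graphs the diffusion is a fixed point, which is a handy sanity check when experimenting with variants.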

IS Journal 2020 Journal Article

FedHealth: A Federated Transfer Learning Framework for Wearable Healthcare

  • Yiqiang Chen
  • Xin Qin
  • Jindong Wang
  • Chaohui Yu
  • Wen Gao

With the rapid development of computing technology, wearable devices make it easy to access people's health information. Smart healthcare achieves great success by training machine learning models on a large quantity of user personal data. However, there are two critical challenges. First, user data often exist in the form of isolated islands, making it difficult to perform aggregation without compromising privacy and security. Second, models trained in the cloud fail at personalization. In this article, we propose FedHealth, the first federated transfer learning framework for wearable healthcare, to tackle these challenges. FedHealth performs data aggregation through federated learning, and then builds relatively personalized models through transfer learning. Wearable activity recognition experiments and a real Parkinson's disease auxiliary diagnosis application show that FedHealth achieves accurate and personalized healthcare without compromising privacy and security. FedHealth is general and extensible to many healthcare applications.
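The "data aggregation through federated learning" step typically reduces to server-side weighted averaging of client model parameters; a minimal sketch of that step (illustrative of generic FedAvg-style aggregation, not FedHealth's full pipeline, which also applies transfer learning):

```python
# Federated-averaging sketch: the server combines client parameter
# vectors weighted by each client's sample count, so raw data never
# leaves the devices -- only model parameters are shared.

def fed_avg(client_params, client_sizes):
    """client_params: list of parameter vectors (one per client);
    client_sizes: number of training samples per client."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [sum(p[i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(dim)]

# Two clients holding 10 and 30 samples respectively:
print(fed_avg([[1.0, 0.0], [5.0, 4.0]], [10, 30]))  # [4.0, 3.0]
```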

AAAI Conference 2020 Conference Paper

Multi-View Spectral Clustering with Optimal Neighborhood Laplacian Matrix

  • Sihang Zhou
  • Xinwang Liu
  • Jiyuan Liu
  • Xifeng Guo
  • Yawei Zhao
  • En Zhu
  • Yongping Zhai
  • Jianping Yin

Multi-view spectral clustering aims to group data into different categories by optimally exploring complementary information from multiple Laplacian matrices. However, existing methods usually linearly combine a group of pre-specified first-order Laplacian matrices to construct an optimal Laplacian matrix, which may result in limited representation capability and insufficient information exploitation. In this paper, we propose a novel optimal neighborhood multi-view spectral clustering (ONMSC) algorithm to address these issues. Specifically, the proposed algorithm generates an optimal Laplacian matrix by searching the neighborhood of both the linear combination of the first-order and high-order base Laplacian matrices simultaneously. This design enhances the representative capacity of the optimal Laplacian and better utilizes the hidden high-order connection information, leading to improved clustering performance. An efficient algorithm with proved convergence is designed to solve the resultant optimization problem. Extensive experimental results on 9 datasets demonstrate the superiority of our algorithm against state-of-the-art methods, which verifies the effectiveness and advantages of the proposed ONMSC.

AAAI Conference 2019 Conference Paper

Efficient and Effective Incomplete Multi-View Clustering

  • Xinwang Liu
  • Xinzhong Zhu
  • Miaomiao Li
  • Chang Tang
  • En Zhu
  • Jianping Yin
  • Wen Gao

Incomplete multi-view clustering (IMVC) optimally fuses multiple pre-specified incomplete views to improve clustering performance. Among various excellent solutions, the recently proposed multiple kernel k-means with incomplete kernels (MKKM-IK) forms a benchmark, which redefines IMVC as a joint optimization problem where the clustering and kernel matrix imputation tasks are alternately performed until convergence. Though it demonstrates promising performance in various applications, we observe that the manner of kernel matrix imputation in MKKM-IK incurs intensive computational and storage complexity, over-complicated optimization, and only limited improvement in clustering performance. In this paper, we propose an Efficient and Effective Incomplete Multi-view Clustering (EE-IMVC) algorithm to address these issues. Instead of completing the incomplete kernel matrices, EE-IMVC imputes each incomplete base matrix generated by the incomplete views with a learned consensus clustering matrix. We carefully develop a three-step iterative algorithm to solve the resultant optimization problem with linear computational complexity and theoretically prove its convergence. Further, we conduct comprehensive experiments to study the proposed EE-IMVC in terms of clustering accuracy, running time, evolution of the learned consensus clustering matrix, and convergence. As indicated, our algorithm significantly and consistently outperforms several state-of-the-art algorithms with much less running time and memory.

IJCAI Conference 2018 Conference Paper

Localized Incomplete Multiple Kernel k-means

  • Xinzhong Zhu
  • Xinwang Liu
  • Miaomiao Li
  • En Zhu
  • Li Liu
  • Zhiping Cai
  • Jianping Yin
  • Wen Gao

The recently proposed multiple kernel k-means with incomplete kernels (MKKM-IK) optimally integrates a group of pre-specified incomplete kernel matrices to improve clustering performance. Though it demonstrates promising performance in various applications, we observe that it does not sufficiently consider the local structure among data and indiscriminately forces all pairwise sample similarities to equally align with their ideal values. This could make the incomplete kernels less effectively imputed, and in turn adversely affect the clustering performance. In this paper, we propose a novel localized incomplete multiple kernel k-means (LI-MKKM) algorithm to address this issue. Different from existing MKKM-IK, LI-MKKM only requires the similarity of a sample to its k-nearest neighbors to align with their ideal similarity values. This helps the clustering algorithm focus on closer sample pairs that should stay together and avoids involving unreliable similarity evaluations for farther sample pairs. We carefully design a three-step iterative algorithm to solve the resultant optimization problem and theoretically prove its convergence. Comprehensive experiments on eight benchmark datasets demonstrate that our algorithm significantly outperforms comparable state-of-the-art algorithms from the recent literature, verifying the advantage of considering local structure.

AAAI Conference 2018 Conference Paper

SAP: Self-Adaptive Proposal Model for Temporal Action Detection Based on Reinforcement Learning

  • Jingjia Huang
  • Nannan Li
  • Tao Zhang
  • Ge Li
  • Tiejun Huang
  • Wen Gao

Existing action detection algorithms usually generate action proposals through an extensive search over the video at multiple temporal scales, which incurs huge computational overhead and deviates from the human perception procedure. We argue that the process of detecting actions should naturally be one of observation and refinement: observe the current window and refine the span of the attended window to cover true action regions. In this paper, we propose a Self-Adaptive Proposal (SAP) model that learns to find actions by continuously adjusting the temporal bounds in a self-adaptive way. The whole process can be viewed as an agent, which is first placed at the beginning of the video and traverses the whole video by applying a sequence of transformations to the currently attended region to discover actions according to a learned policy. We utilize reinforcement learning, in particular the Deep Q-learning algorithm, to learn the agent's decision policy. In addition, we use a temporal pooling operation to extract a more effective feature representation for the long temporal window, and design a regression network to adjust the position offsets between predicted results and the ground truth. Experimental results on THUMOS'14 validate the effectiveness of SAP, which achieves performance competitive with current action detection algorithms using far fewer proposals.

AAAI Conference 2017 Conference Paper

Beyond Monte Carlo Tree Search: Playing Go with Deep Alternative Neural Network and Long-Term Evaluation

  • Jinzhuo Wang
  • Wenmin Wang
  • Ronggang Wang
  • Wen Gao

Monte Carlo tree search (MCTS) is extremely popular in computer Go, determining each action by enormous simulations in a broad and deep search tree. However, human experts select most actions by pattern analysis and careful evaluation rather than a brute-force search over millions of future interactions. In this paper, we propose a computer Go system that follows the way experts think and play. Our system consists of two parts. The first part is a novel deep alternative neural network (DANN) used to generate candidates for the next move. Compared with an existing deep convolutional neural network (DCNN), DANN inserts a recurrent layer after each convolutional layer and stacks them in an alternative manner. We show that such a setting can preserve more contexts of local features and their evolutions, which are beneficial for move prediction. The second part is a long-term evaluation (LTE) module used to provide a reliable evaluation of candidates rather than a single probability from the move predictor. This is consistent with human experts' style of play, since they can foresee tens of steps ahead to give an accurate estimation of candidates. In our system, for each candidate, LTE calculates a cumulative reward after several future interactions when local variations are settled. Combining criteria from the two parts, our system determines the optimal choice of the next move. For more comprehensive experiments, we introduce a new professional Go dataset (PGD), consisting of 253,233 professional records. Experiments on the GoGoD and PGD datasets show that DANN can substantially improve move prediction performance over a pure DCNN. When combined with LTE, our system outperforms most relevant approaches and open engines based on MCTS.

NeurIPS Conference 2016 Conference Paper

Deep Alternative Neural Network: Exploring Contexts as Early as Possible for Action Recognition

  • Jinzhuo Wang
  • Wenmin Wang
  • Xiongtao Chen
  • Ronggang Wang
  • Wen Gao

Contexts are crucial for action recognition in video. Current methods often mine contexts after extracting hierarchical local features and focus on their high-order encodings. This paper instead explores contexts as early as possible and leverages their evolutions for action recognition. In particular, we introduce a novel architecture called deep alternative neural network (DANN) stacking alternative layers. Each alternative layer consists of a volumetric convolutional layer followed by a recurrent layer. The former acts as local feature learner while the latter is used to collect contexts. Compared with feed-forward neural networks, DANN learns contexts of local features from the very beginning. This setting helps to preserve hierarchical context evolutions which we show are essential to recognize similar actions. Besides, we present an adaptive method to determine the temporal size for network input based on optical flow energy, and develop a volumetric pyramid pooling layer to deal with input clips of arbitrary sizes. We demonstrate the advantages of DANN on two benchmarks HMDB51 and UCF101 and report competitive or superior results to the state-of-the-art.

TIST Journal 2016 Journal Article

Efficient Generalized Fused Lasso and Its Applications

  • Bo Xin
  • Yoshinobu Kawahara
  • Yizhou Wang
  • Lingjing Hu
  • Wen Gao

Generalized fused lasso (GFL) penalizes variables with ℓ1 norms based both on the variables and their pairwise differences. GFL is useful when applied to data where prior information is expressed using a graph over the variables. However, the existing GFL algorithms incur high computational costs and do not scale to high-dimensional problems. In this study, we propose a fast and scalable algorithm for GFL. Based on the fact that the fusion penalty is the Lovász extension of a cut function, we show that the key building block of the optimization is equivalent to recursively solving graph-cut problems. Thus, we use a parametric flow algorithm to solve GFL in an efficient manner. Runtime comparisons demonstrate a significant speedup compared to existing GFL algorithms. Moreover, the proposed optimization framework is very general; by designing different cut functions, we also discuss the extension of GFL to directed graphs. Exploiting the scalability of the proposed algorithm, we demonstrate its applications to the diagnosis of Alzheimer's disease (AD) and video background subtraction (BS). In the AD problem, we formulated the diagnosis of AD as GFL-regularized classification. Our experimental evaluations demonstrated that the diagnosis performance was promising. We observed that the selected critical voxels were well structured, i.e., connected, consistent under cross-validation, and in agreement with prior pathological knowledge. In the BS problem, GFL naturally models arbitrary foregrounds without predefined grouping of the pixels. Even with simple background models, e.g., a sparse linear combination of former frames, we achieved state-of-the-art performance on several public datasets.
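For concreteness, the GFL objective described in the abstract is conventionally written as (standard formulation; the regularization weights are illustrative):

```latex
\min_{\mathbf{x}} \; \tfrac{1}{2}\,\lVert \mathbf{y} - \mathbf{x} \rVert_2^2
  \;+\; \lambda_1 \sum_{i} \lvert x_i \rvert
  \;+\; \lambda_2 \sum_{(i,j) \in E} \lvert x_i - x_j \rvert
```

where E is the edge set of the prior graph over the variables; the second term is the ordinary lasso penalty and the third is the fusion penalty whose Lovász-extension structure is what makes the graph-cut/parametric-flow solver possible.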

NeurIPS Conference 2016 Conference Paper

Maximal Sparsity with Deep Networks?

  • Bo Xin
  • Yizhou Wang
  • Wen Gao
  • David Wipf
  • Baoyuan Wang

The iterations of many sparse estimation algorithms are comprised of a fixed linear filter cascaded with a thresholding nonlinearity, which collectively resemble a typical neural network layer. Consequently, a lengthy sequence of algorithm iterations can be viewed as a deep network with shared, hand-crafted layer weights. It is therefore quite natural to examine the degree to which a learned network model might act as a viable surrogate for traditional sparse estimation in domains where ample training data is available. While the possibility of a reduced computational budget is readily apparent when a ceiling is imposed on the number of layers, our work primarily focuses on estimation accuracy. In particular, it is well-known that when a signal dictionary has coherent columns, as quantified by a large RIP constant, then most tractable iterative algorithms are unable to find maximally sparse representations. In contrast, we demonstrate both theoretically and empirically the potential for a trained deep network to recover minimal $\ell_0$-norm representations in regimes where existing methods fail. The resulting system, which can effectively learn novel iterative sparse estimation algorithms, is deployed on a practical photometric stereo estimation problem, where the goal is to remove sparse outliers that can disrupt the estimation of surface normals from a 3D scene.
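The "fixed linear filter cascaded with a thresholding nonlinearity" the abstract describes is exactly one ISTA iteration for sparse coding; a minimal sketch (illustrative dictionary and step size, not the paper's learned network):

```python
# One ISTA step: a gradient step on the least-squares term followed by
# soft-thresholding. Unrolling many such steps, and learning the linear
# maps instead of fixing them, yields the deep networks the paper studies.

def soft_threshold(x, t):
    """Shrink each entry toward zero by t (the thresholding nonlinearity)."""
    return [max(abs(v) - t, 0.0) * (1 if v > 0 else -1) for v in x]

def ista_step(x, y, D, step, lam):
    """x: current code; y: signal; D: dictionary (rows index y, cols x)."""
    Dx = [sum(D[i][j] * x[j] for j in range(len(x))) for i in range(len(y))]
    r = [dx - yi for dx, yi in zip(Dx, y)]                  # residual D x - y
    grad = [sum(D[i][j] * r[i] for i in range(len(y))) for j in range(len(x))]
    z = [xj - step * g for xj, g in zip(x, grad)]           # gradient step
    return soft_threshold(z, step * lam)                    # shrinkage
```

With an identity dictionary and no regularization, a single step recovers the signal, which makes a convenient correctness check before swapping in a learned filter.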

IJCAI Conference 2016 Conference Paper

To Project More or to Quantize More: Minimize Reconstruction Bias for Learning Compact Binary Codes

  • Zhe Wang
  • Ling-Yu Duan
  • Junsong Yuan
  • Tiejun Huang
  • Wen Gao

We present a novel approach called Minimal Reconstruction Bias Hashing (MRH) to learn similarity-preserving binary codes that jointly optimize both the projection and quantization stages. Our work tackles the important problem of how to elegantly connect optimizing projection with optimizing quantization, and how to maximize the complementary effects of the two stages. Distinct from previous works, MRH can adaptively adjust the projection dimensionality to balance the information loss between projection and quantization. It is formulated as a problem of minimizing the reconstruction bias of compressed signals. Extensive experimental results show that the proposed MRH significantly outperforms a variety of state-of-the-art methods over several widely used benchmarks.

IJCAI Conference 2015 Conference Paper

Hamming Compatible Quantization for Hashing

  • Zhe Wang
  • Ling-Yu Duan
  • Jie Lin
  • Xiaofang Wang
  • Tiejun Huang
  • Wen Gao

Hashing is one of the effective techniques for fast Approximate Nearest Neighbour (ANN) search. Traditional single-bit quantization (SBQ) in most hashing methods incurs substantial quantization error, which seriously degrades the search performance. To address the limitation of SBQ, researchers have proposed promising multi-bit quantization (MBQ) methods that quantize each projection dimension with multiple bits. However, some MBQ methods need to adopt a specific distance for binary code matching instead of the original Hamming distance, which significantly decreases retrieval speed. Two typical MBQ methods, Hierarchical Quantization and Double Bit Quantization, retain the Hamming distance, but both of them only consider the projection dimensions during quantization, ignoring the neighborhood structure of the raw data inherent in Euclidean space. In this paper, we propose a multi-bit quantization method named Hamming Compatible Quantization (HCQ) to preserve the capability of similarity metrics between Euclidean space and Hamming space by utilizing the neighborhood structure of the raw data. Extensive experimental results show that our approach significantly improves the performance of various state-of-the-art hashing methods while maintaining fast retrieval speed.
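For reference, the Hamming distance that HCQ preserves is just the popcount of the XOR of two binary codes, which is why Hamming-compatible methods keep retrieval fast:

```python
# Hamming distance between two integer-packed binary codes: XOR the
# codes, then count the set bits. A single XOR plus popcount per pair
# is what makes Hamming-space retrieval so cheap.

def hamming(a, b):
    """a, b: equal-length binary codes packed into Python ints."""
    return bin(a ^ b).count("1")

print(hamming(0b1011, 0b0010))  # 2
```

Methods that require a non-Hamming code distance lose exactly this one-instruction-per-word matching, which is the speed penalty the abstract alludes to.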

AAAI Conference 2015 Conference Paper

Stable Feature Selection from Brain sMRI

  • Bo Xin
  • Lingjing Hu
  • Yizhou Wang
  • Wen Gao

Neuroimage analysis usually involves learning thousands or even millions of variables using only a limited number of samples. In this regard, sparse models, e.g., the lasso, are applied to select the optimal features and achieve high diagnosis accuracy. The lasso, however, usually results in independent, unstable features. Stability, a manifestation of the reproducibility of statistical results subject to reasonable perturbations of the data and the model (Yu 2013), is an important focus in statistics, especially in the analysis of high-dimensional data. In this paper, we explore a nonnegative generalized fused lasso model for stable feature selection in the diagnosis of Alzheimer's disease. In addition to sparsity, our model incorporates two important pathological priors: the spatial cohesion of lesion voxels and the positive correlation between the features and the disease labels. To optimize the model, we propose an efficient algorithm by proving a novel link between total variation and fast network flow algorithms via conic duality. Experiments show that the proposed nonnegative model performs much better at exploring the intrinsic structure of data via selecting stable features, compared with other state-of-the-art methods.

AAAI Conference 2014 Conference Paper

Efficient Generalized Fused Lasso and its Application to the Diagnosis of Alzheimer’s Disease

  • Bo Xin
  • Yoshinobu Kawahara
  • Yizhou Wang
  • Wen Gao

Generalized fused lasso (GFL) penalizes variables with L1 norms based both on the variables and their pairwise differences. GFL is useful when applied to data where prior information is expressed using a graph over the variables. However, the existing GFL algorithms incur high computational costs and do not scale to high-dimensional problems. In this study, we propose a fast and scalable algorithm for GFL. Based on the fact that the fusion penalty is the Lovász extension of a cut function, we show that the key building block of the optimization is equivalent to recursively solving parametric graph-cut problems. Thus, we use a parametric flow algorithm to solve GFL in an efficient manner. Runtime comparisons demonstrated a significant speed-up compared with the existing GFL algorithms. By exploiting the scalability of the proposed algorithm, we formulated the diagnosis of Alzheimer's disease as GFL. Our experimental evaluations demonstrated that the diagnosis performance was promising and that the selected critical voxels were well structured, i.e., connected, consistent under cross-validation, and in agreement with prior clinical knowledge.

IS Journal 2014 Journal Article

The IEEE 1857 Standard: Empowering Smart Video Surveillance Systems

  • Wen Gao
  • Yonghong Tian
  • Tiejun Huang
  • Siwei Ma
  • Xianguo Zhang

The IEEE 1857 Standard for Advanced Audio and Video Coding was released as IEEE 1857-2013 in June 2013. Although the standard comprises several different groups, the most significant feature of IEEE 1857-2013 is its Surveillance Groups, which not only achieve at least twice the coding efficiency of H.264/AVC High Profile on surveillance videos, but also make it the most analysis-friendly video coding standard. This article presents an overview of the IEEE 1857 Surveillance Groups, highlighting background model-based coding technology and analysis-friendly functionalities. IEEE 1857-2013 will present new opportunities and drive research in smart video surveillance communities and industries.

IS Journal 2013 Journal Article

IEEE 1857: Boosting Video Applications in CPSS

  • Tiejun Huang
  • Yonghong Tian
  • Wen Gao

In CPSS, video is the information flow that accounts for the majority of traffic. Nevertheless, an enormous gap exists between the amount of video data collected and its searchability. The newly released IEEE 1857 video coding standard is an effective attempt to address this challenge. In particular, the IEEE 1857 standard's surveillance groups can effectively support highly efficient surveillance video coding and objects-of-interest representations in the coding bitstream. These features make it a robust video coding standard for various video applications in CPSS.

IJCAI Conference 2013 Conference Paper

Parametric Local Multimodal Hashing for Cross-View Similarity Search

  • Deming Zhai
  • Hong Chang
  • Yi Zhen
  • Xianming Liu
  • Xilin Chen
  • Wen Gao

Recent years have witnessed the growing popularity of hashing for efficient large-scale similarity search. It has been shown that hashing quality can be boosted by hash function learning (HFL). In this paper, we study HFL in the context of multimodal data for cross-view similarity search. We present a novel multimodal HFL method, called Parametric Local Multimodal Hashing (PLMH), which learns a set of hash functions that locally adapt to the data structure of each modality. To balance locality and computational efficiency, the hashing projection matrix of each instance is parameterized, with a guaranteed approximation error bound, as a linear combination of basis hashing projections attached to a small set of anchor points. A locally optimal conjugate gradient algorithm is designed to learn the hash functions for each bit, and the overall hash codes are learned in a sequential manner to progressively minimize the bias. Experimental evaluations on cross-media retrieval tasks demonstrate that PLMH performs competitively against the state-of-the-art methods.
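The anchor-based parameterization described above can be sketched in a few lines. This is an illustrative rendering of the core idea only; the function names, the Gaussian proximity weighting, and all dimensions are assumptions, and the paper's actual per-bit conjugate gradient learning is not shown.

```python
import numpy as np

def local_projection(x, anchors, anchor_projs, sigma=1.0):
    """Instance-specific hashing projection (sketch of the anchor idea).

    The projection matrix for an instance x is a convex combination of
    basis projections attached to a small set of anchor points, with
    combination weights given by x's proximity to each anchor.
    """
    d2 = np.sum((anchors - x) ** 2, axis=1)       # squared distances to anchors
    w = np.exp(-d2 / (2 * sigma ** 2))
    w /= w.sum()                                   # normalized combination weights
    W = np.tensordot(w, anchor_projs, axes=1)      # combined (bits, dim) projection
    return np.sign(W @ x)                          # binary hash code for x

rng = np.random.default_rng(0)
anchors = rng.normal(size=(5, 8))                  # 5 anchor points in an 8-D modality
anchor_projs = rng.normal(size=(5, 16, 8))         # one 16-bit basis projection per anchor
code = local_projection(rng.normal(size=8), anchors, anchor_projs)
```

Because only a few basis projections are stored, the per-instance projection adapts locally without learning a full matrix per data point.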

TIST Journal 2012 Journal Article

A Generic Approach for Systematic Analysis of Sports Videos

  • Ning Zhang
  • Ling-Yu Duan
  • Lingfang Li
  • Qingming Huang
  • Jun Du
  • Wen Gao
  • Ling Guan

Various innovative and original works have been applied and proposed in the field of sports video analysis. However, individual works have focused on sophisticated methodologies for particular sport types, and there has been a lack of scalable and holistic frameworks in this field. This article proposes a solution and presents a systematic and generic approach, evaluated on a relatively large-scale collection of sports videos. The system targets the event detection scenario for an input video with an orderly sequential process. Initially, domain-knowledge-independent local descriptors are extracted homogeneously from the input video sequence. Then the video representation is created by adopting a bag-of-visual-words (BoW) model. The video's genre is first identified by applying k-nearest neighbor (k-NN) classifiers to the resulting video representation, with various dissimilarity measures assessed and evaluated analytically. Subsequently, an unsupervised probabilistic latent semantic analysis (PLSA)-based approach is applied to the same histogram-based video representation, characterizing each frame of the video sequence into one of four view groups, namely close-up-view, mid-view, long-view, and outer-field-view. Finally, a hidden conditional random field (HCRF) structured prediction model is utilized for interesting-event detection. In the experimental results, the k-NN classifier using the KL-divergence measure demonstrates the best accuracy at 82.16% for genre categorization. Supervised SVM and unsupervised PLSA achieve average classification accuracies of 82.86% and 68.13%, respectively. The HCRF model achieves 92.31% accuracy using the unsupervised PLSA-based label input, which is comparable with the 93.08% accuracy obtained with the supervised SVM-based input. In general, such a systematic approach can be widely applied to processing massive videos generically.
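The genre-identification step above (k-NN over BoW histograms under KL divergence) can be sketched as follows. This is a minimal illustration under assumed names and toy histograms, not the paper's pipeline; the smoothing constant `eps` is an assumption added to keep the divergence finite on sparse histograms.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL divergence between two BoW histograms, normalized and
    smoothed so that empty bins do not produce infinities."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def knn_genre(query, train_hists, train_labels, k=3):
    """Classify a video's genre by majority vote among the k training
    histograms nearest to the query under KL divergence."""
    dists = [kl_divergence(query, h) for h in train_hists]
    nearest = np.argsort(dists)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

# Toy 3-bin visual-word histograms for two genres.
train_hists = np.array([[8.0, 1.0, 1.0], [7.0, 2.0, 1.0],
                        [1.0, 8.0, 1.0], [1.0, 7.0, 2.0]])
train_labels = ["soccer", "soccer", "basketball", "basketball"]
genre = knn_genre(np.array([6.0, 2.0, 2.0]), train_hists, train_labels)
```

Note that KL divergence is asymmetric, so the ordering of `query` and the training histogram in `kl_divergence` matters.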

TIST Journal 2012 Journal Article

Multiview Metric Learning with Global Consistency and Local Smoothness

  • Deming Zhai
  • Hong Chang
  • Shiguang Shan
  • Xilin Chen
  • Wen Gao

In many real-world applications, the same object may have different observations (or descriptions) from multiview observation spaces, which are highly related but sometimes look different from each other. Conventional metric-learning methods achieve satisfactory performance on distance metric computation for data in a single-view observation space, but fail to handle data sampled from multiview observation spaces well, especially data with highly nonlinear structure. To tackle this problem, we propose a new method called Multiview Metric Learning with Global consistency and Local smoothness (MVML-GL) under a semisupervised learning setting, which jointly considers global consistency and local smoothness. The basic idea is to reveal the shared latent feature space of the multiview observations by imposing global consistency constraints and preserving local geometric structures. Specifically, the framework is composed of two main steps. In the first step, we seek a globally consistent shared latent feature space, which not only preserves the local geometric structure in each space but also brings labeled corresponding instances as close as possible. In the second step, the explicit mapping functions between the input spaces and the shared latent space are learned via regularized locally linear regression. Furthermore, both steps can be solved by convex optimization in closed form. Experimental results with application to manifold alignment on real-world datasets of pose and facial expression demonstrate the effectiveness of the proposed method.
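The closed-form flavor of the second step can be illustrated with a ridge-regularized linear map from an input space to the shared latent space. This is a deliberately simplified stand-in: a single global ridge regression replaces the regularized *locally* linear regression used in the paper, and all names and dimensions are assumptions.

```python
import numpy as np

def learn_mapping(X, Z, lam=0.1):
    """Closed-form ridge-regularized linear map from one view's input
    space to the shared latent space (sketch of step two).

    Solves min_W ||X W - Z||_F^2 + lam ||W||_F^2 via the normal
    equations (X^T X + lam I) W = X^T Z.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Z)

# Recover a known linear map from noiseless synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))          # observations in one view
true_W = rng.normal(size=(5, 3))      # ground-truth map to the latent space
Z = X @ true_W                        # latent coordinates of the observations
W = learn_mapping(X, Z, lam=1e-6)
```

With a tiny regularizer and noiseless data, the recovered map essentially matches the ground truth; the locally linear version would fit one such map per neighborhood instead.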

IJCAI Conference 2011 Conference Paper

Learning Compact Visual Descriptor for Low Bit Rate Mobile Landmark Search

  • Rongrong Ji
  • Ling-Yu Duan
  • Jie Chen
  • Hongxun Yao
  • Tiejun Huang
  • Wen Gao

In this paper, we propose to extract a compact yet discriminative visual descriptor directly on the mobile device, which tackles the wireless query transmission latency in mobile landmark search. This descriptor is learnt offline from the location contexts of geo-tagged Web photos from both Flickr and Panoramio in two phases: First, we segment the landmark photo collections into discrete geographical regions using a Gaussian Mixture Model [Stauffer et al., 2000]. Second, a ranking-sensitive vocabulary boosting is introduced to learn a compact codebook within each region. To tackle the locally optimal descriptor learning caused by imprecise geographical segmentation, we further iterate the above phases by feeding back an "entropy"-based measure of descriptor compactness as a prior distribution to constrain the Gaussian mixture modeling. Consequently, when entering a specific geographical region, the codebook in the mobile device is adapted downstream, which ensures efficient extraction of a compact descriptor, its low-bit-rate transmission, as well as promising discrimination ability. We deploy our descriptor on both HTC and iPhone mobile phones, testing landmark search in typical areas including Beijing, New York, and Barcelona, covering one million images. Our learned descriptor outperforms alternative compact descriptors [Chen et al., 2009; Chen et al., 2010; Chandrasekhar et al., 2009a; Chandrasekhar et al., 2009b] by a large margin.
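The first phase, segmenting geo-tagged photos into discrete geographical regions, can be sketched as follows. The paper fits a Gaussian mixture model over photo locations; this sketch substitutes a hard-assignment k-means with deterministic initialization for brevity, so the function name, initialization scheme, and coordinates are all illustrative assumptions.

```python
import numpy as np

def segment_regions(coords, k=2, iters=20):
    """Partition geo-tagged photo locations into k discrete regions.

    A k-means stand-in for the paper's Gaussian mixture: alternate
    between assigning each photo to its nearest region center and
    recomputing centers as the mean of their assigned photos.
    """
    # Deterministic initialization from evenly spaced samples.
    centers = coords[:: max(1, len(coords) // k)][:k].copy()
    for _ in range(iters):
        # Distance of every photo to every region center.
        dists = np.linalg.norm(coords[:, None] - centers[None], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = coords[assign == j].mean(axis=0)
    return assign

# Toy (lat, lon) coordinates: two photos near Beijing, two near Barcelona.
coords = np.array([[39.90, 116.40], [39.91, 116.41],
                   [41.40, 2.17], [41.39, 2.16]])
regions = segment_regions(coords, k=2)
```

Each resulting region would then get its own compact codebook in the second phase, so a device entering that region downloads only the codebook it needs.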