Arrow Research search

Author name cluster

Jianfei Cai

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

21 papers
1 author row

Possible papers (21)

AAAI Conference 2026 Conference Paper

Marginalized Generalized IoU (MGIoU): A Unified Objective Function for Optimizing Convex Parametric Shapes

  • Duy-Tho Le
  • Trung Pham
  • Jianfei Cai
  • Hamid Rezatofighi

Optimizing the similarity between parametric shapes is crucial for numerous computer vision tasks, where Intersection over Union (IoU) stands as the canonical measure. However, existing optimization methods exhibit significant shortcomings: regression-based losses like L1/L2 lack correlation with IoU, IoU-based losses are unstable and limited to simple shapes, and task-specific methods are computationally intensive and not generalizable across domains. As a result, the current landscape of parametric shape objective functions has become scattered, with each domain proposing distinct IoU approximations. To address this, we unify the parametric shape optimization objective functions by introducing Marginalized Generalized IoU (MGIoU), a novel loss function that overcomes these challenges by projecting structured convex shapes onto their unique shape normals to compute one-dimensional normalized GIoU. MGIoU offers a simple, efficient, fully differentiable approximation strongly correlated with IoU. We extend MGIoU to MGIoU+, which supports optimizing unstructured convex shapes. Together, MGIoU and MGIoU+ unify parametric shape optimization across diverse applications. Experiments on standard benchmarks show that MGIoU and MGIoU+ achieve higher performance while reducing loss computation latency by 10-40x. MGIoU and MGIoU+ also satisfy metric properties and scale invariance, ensuring robustness as objective functions. We further propose MGIoU- for minimizing overlaps in tasks like collision-free trajectory prediction.
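
A minimal numpy sketch of the projection idea described in the abstract: project both convex polygons onto the edge normals of each shape, compute a one-dimensional GIoU per projection, and average. This is an illustrative, non-differentiable reading of the mechanism (the function names, 1-D GIoU formula and averaging are assumptions), not the paper's implementation.

    import numpy as np

    def edge_normals(poly):
        # Unit normals of a convex polygon's edges; orientation is irrelevant
        # because only the projected intervals matter.
        e = np.roll(poly, -1, axis=0) - poly
        n = np.stack([-e[:, 1], e[:, 0]], axis=1)
        return n / np.linalg.norm(n, axis=1, keepdims=True)

    def giou_1d(a, b):
        # GIoU of two 1-D intervals given as (lo, hi).
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        hull = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union - (hull - union) / hull

    def mgiou_like_loss(pred, target):
        # 1 minus the mean 1-D GIoU over projections onto both shapes' normals.
        gious = []
        for n in np.concatenate([edge_normals(pred), edge_normals(target)]):
            p, t = pred @ n, target @ n
            gious.append(giou_1d((p.min(), p.max()), (t.min(), t.max())))
        return 1.0 - float(np.mean(gious))

    square = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
    print(mgiou_like_loss(square, square + [0.5, 0.0]))  # ~0.33 for this half-overlapping pair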

TCS Journal 2026 Journal Article

On competitiveness of dynamic replication for distributed data access

  • Tianyu Zuo
  • Xueyan Tang
  • Bu Sung Lee
  • Jianfei Cai

This paper studies an online cost optimization problem for distributed storage and access. The goal is to dynamically create and delete copies of data objects over time at geo-distributed servers to serve access requests and minimize the total storage and network cost. We revisit a recent algorithm in the literature and, by constructing a counterexample, show that it does not have a competitive ratio of 2 as claimed. We further prove that no deterministic online algorithm can achieve a competitive ratio bounded by 2 for the general cost optimization problem. We develop an online algorithm and prove that it achieves a competitive ratio of max {2, min {γ, 3}}, where γ is the max/min storage cost ratio among all servers. Examples are given to confirm the tightness of the competitive analysis. We also empirically evaluate the algorithms using real object access traces.
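
For concreteness, the competitive-ratio bound stated above, max{2, min{γ, 3}}, as a one-line helper; the storage costs below are made up for illustration.

    def replication_bound(storage_costs):
        # gamma is the max/min storage cost ratio among all servers.
        gamma = max(storage_costs) / min(storage_costs)
        return max(2.0, min(gamma, 3.0))

    print(replication_bound([1.0, 1.5, 2.5]))  # gamma = 2.5 -> bound 2.5
    print(replication_bound([1.0, 8.0]))       # gamma = 8.0 -> bound capped at 3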

AAAI Conference 2026 Conference Paper

PanFlow: Decoupled Motion Control for Panoramic Video Generation

  • Cheng Zhang
  • Hanwen Liang
  • Donny Y. Chen
  • Qianyi Wu
  • Konstantinos N. Plataniotis
  • Camilo Cruz Gambardella
  • Jianfei Cai

Panoramic video generation has attracted growing attention due to its applications in virtual reality and immersive media. However, existing methods lack explicit motion control and struggle to generate scenes with large and complex motions. We propose PanFlow, a novel approach that exploits the spherical nature of panoramas to decouple the highly dynamic camera rotation from the input optical flow condition, enabling more precise control over large and dynamic motions. We further introduce a spherical noise warping strategy to promote loop consistency in motion across panorama boundaries. To support effective training, we curate a large-scale, motion-rich panoramic video dataset with frame-level pose and flow annotations. We also showcase the effectiveness of our method in various applications, including motion transfer and video editing. Extensive experiments demonstrate that PanFlow significantly outperforms prior methods in motion fidelity, visual quality, and temporal coherence.

AAAI Conference 2026 Conference Paper

PCGS: Progressive Compression of 3D Gaussian Splatting

  • Yihang Chen
  • Mengyao Li
  • Qianyi Wu
  • Weiyao Lin
  • Mehrtash Harandi
  • Jianfei Cai

3D Gaussian Splatting (3DGS) achieves impressive rendering fidelity and speed for novel view synthesis. However, its substantial data size poses a significant challenge for practical applications. While many compression techniques have been proposed, they fail to efficiently utilize existing bitstreams in on-demand applications due to their lack of progressivity, leading to a waste of resources. To address this issue, we propose PCGS (Progressive Compression of 3D Gaussian Splatting), which adaptively controls both the quantity and quality of Gaussians (or anchors) to enable effective progressivity for on-demand applications. For quantity, we introduce a progressive masking strategy that incrementally incorporates new anchors while refining existing ones to enhance fidelity. For quality, we propose a progressive quantization approach that gradually reduces quantization step sizes to achieve finer modeling of Gaussian attributes. Furthermore, to compact the incremental bitstreams, we leverage existing quantization results to refine probability prediction, improving entropy coding efficiency across progressive levels. PCGS achieves progressivity while maintaining compression performance comparable to SoTA non-progressive methods.
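
A toy numpy sketch of progressive quantization in the spirit described above: each level re-quantizes the remaining residual with a smaller step size, so decoding more levels yields a finer reconstruction. The step schedule and rounding scheme here are assumptions, not PCGS's actual codec.

    import numpy as np

    def progressive_quantize(x, steps):
        # Returns per-level refinement indices and the final reconstruction.
        recon = np.zeros_like(x)
        levels = []
        for s in steps:                       # e.g. steps = [1.0, 0.5, 0.25]
            idx = np.round((x - recon) / s)   # indices transmitted at this level
            levels.append(idx.astype(int))
            recon = recon + idx * s           # reconstruction after this level
        return levels, recon

    x = np.array([0.37, -1.62, 2.05])
    _, recon = progressive_quantize(x, steps=[1.0, 0.5, 0.25])
    print(recon)  # error shrinks to at most half of the finest step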

AAAI Conference 2026 Conference Paper

Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning

  • Ziyu Ma
  • Chenhui Gou
  • Yiming Hu
  • Yong Wang
  • Bohan Zhuang
  • Jianfei Cai

Large Multimodal Models (LMMs) have shown promising in-context learning (ICL) capabilities, but scaling to many-shot settings remains difficult due to limited context length and high inference cost. To address these challenges, task-vector-based methods have been explored by inserting compact representations of many-shot in-context demonstrations into model activations. However, existing task-vector-based methods either overlook the importance of where to insert task vectors or struggle to determine suitable values for each location. To this end, we propose a novel Sensitivity-aware Task Vector insertion framework (STV) to determine where and what to insert. Our key insight is that activation deltas across query-context pairs exhibit consistent structural patterns, providing a reliable cue for insertion. Based on the identified sensitivity-aware locations, we construct a pre-clustered activation bank for each location by clustering the activation values, and then apply reinforcement learning to choose the most suitable one to insert. We evaluate STV across a range of multimodal models (e.g., Qwen-VL, Idefics-2) and tasks (e.g., VizWiz, OK-VQA), demonstrating its effectiveness and showing consistent improvements over previous task-vector-based methods with strong generalization.
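
A small sketch of the two ingredients named above, under simplifying assumptions (scalar activation deltas, synthetic data, plain k-means instead of the paper's clustering and reinforcement-learning selection): score locations by how consistently large their activation deltas are, then build a small clustered bank of candidate values per selected location.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Hypothetical deltas: (num query-context pairs, num candidate locations)
    deltas = rng.normal(size=(200, 32)) * np.linspace(0.1, 2.0, 32)

    # "Where": locations whose deltas are consistently large across pairs
    sensitivity = np.abs(deltas).mean(axis=0)
    locations = np.argsort(sensitivity)[-4:]

    # "What": a pre-clustered bank of candidate values for each location
    banks = {int(loc): KMeans(n_clusters=3, n_init=10, random_state=0)
                           .fit(deltas[:, [loc]]).cluster_centers_.ravel()
             for loc in locations}
    print(locations, banks[int(locations[-1])])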

NeurIPS Conference 2024 Conference Paper

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

  • Pengcheng Chen
  • Jin Ye
  • Guoan Wang
  • Yanjun Li
  • Zhongying Deng
  • Wei Li
  • Tianbin Li
  • Haodong Duan

Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals, and can be applied in various fields. In the medical field, LVLMs have a high potential to offer substantial assistance for diagnosis and treatment. Before that, it is crucial to develop benchmarks to evaluate LVLMs' effectiveness in various medical applications. Current benchmarks are often built upon specific academic literature, mainly focusing on a single domain, and lacking varying perceptual granularities. Thus, they face specific challenges, including limited clinical relevance, incomplete evaluations, and insufficient guidance for interactive LVLMs. To address these limitations, we developed the GMAI-MMBench, the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date. It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format. Additionally, we implemented a lexical tree structure that allows users to customize evaluation tasks, accommodating various assessment needs and substantially supporting medical AI research and applications. We evaluated 50 LVLMs, and the results show that even the advanced GPT-4o only achieves an accuracy of 53.96%, indicating significant room for improvement. Moreover, we identified five key insufficiencies in current cutting-edge LVLMs that need to be addressed to advance the development of better medical applications. We believe that GMAI-MMBench will stimulate the community to build the next generation of LVLMs toward GMAI.

NeurIPS Conference 2024 Conference Paper

MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views

  • Yuedong Chen
  • Chuanxia Zheng
  • Haofei Xu
  • Bohan Zhuang
  • Andrea Vedaldi
  • Tat-Jen Cham
  • Jianfei Cai

We introduce MVSplat360, a feed-forward approach for 360° novel view synthesis (NVS) of diverse real-world scenes, using only sparse observations. This setting is inherently ill-posed due to minimal overlap among input views and insufficient visual information provided, making it challenging for conventional methods to achieve high-quality results. Our MVSplat360 addresses this by effectively combining geometry-aware 3D reconstruction with temporally consistent video generation. Specifically, it refactors a feed-forward 3D Gaussian Splatting (3DGS) model to render features directly into the latent space of a pre-trained Stable Video Diffusion (SVD) model, where these features then act as pose and visual cues to guide the denoising process and produce photorealistic 3D-consistent views. Our model is end-to-end trainable and supports rendering arbitrary views with as few as 5 sparse input views. To evaluate MVSplat360's performance, we introduce a new benchmark using the challenging DL3DV-10K dataset, where MVSplat360 achieves superior visual quality compared to state-of-the-art methods on wide-sweeping or even 360° NVS tasks. Experiments on the existing benchmark RealEstate10K also confirm the effectiveness of our model. Readers are highly recommended to view the video results at donydchen.github.io/mvsplat360.

NeurIPS Conference 2024 Conference Paper

Normal-GS: 3D Gaussian Splatting with Normal-Involved Rendering

  • Meng Wei
  • Qianyi Wu
  • Jianmin Zheng
  • Hamid Rezatofighi
  • Jianfei Cai

Rendering and reconstruction are long-standing topics in computer vision and graphics. Achieving both high rendering quality and accurate geometry is a challenge. Recent advancements in 3D Gaussian Splatting (3DGS) have enabled high-fidelity novel view synthesis at real-time speeds. However, the noisy and discrete nature of 3D Gaussian primitives hinders accurate surface estimation. Previous attempts to regularize 3D Gaussian normals often degrade rendering quality due to the fundamental disconnect between normal vectors and the rendering pipeline in 3DGS-based methods. Therefore, we introduce Normal-GS, a novel approach that integrates normal vectors into the 3DGS rendering pipeline. The core idea is to model the interaction between normals and incident lighting using the physically-based rendering equation. Our approach re-parameterizes surface colors as the product of normals and a designed Integrated Directional Illumination Vector (IDIV). To optimize memory usage and simplify optimization, we employ an anchor-based 3DGS to implicitly encode locally-shared IDIVs. Additionally, Normal-GS leverages optimized normals and Integrated Directional Encoding (IDE) to accurately model specular effects, enhancing both rendering quality and surface normal precision. Extensive experiments demonstrate that Normal-GS achieves near state-of-the-art visual quality while obtaining accurate surface normals and preserving real-time rendering performance.
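
A toy reading of the color re-parameterization mentioned above: each color channel is the (clamped) dot product of the surface normal with a per-channel Integrated Directional Illumination Vector. The shapes and clamping are assumptions for illustration, not the paper's shading model.

    import numpy as np

    def shaded_color(normal, idiv):
        # idiv: one 3-vector per RGB channel, i.e. shape (3, 3).
        n = normal / np.linalg.norm(normal)
        return np.clip(idiv @ n, 0.0, None)

    n = np.array([0.0, 0.0, 1.0])
    idiv = np.array([[0.1, 0.0, 0.8],    # hypothetical illumination vectors
                     [0.0, 0.1, 0.6],
                     [0.0, 0.0, 0.4]])
    print(shaded_color(n, idiv))         # -> [0.8 0.6 0.4]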

NeurIPS Conference 2024 Conference Paper

Point-PRC: A Prompt Learning Based Regulation Framework for Generalizable Point Cloud Analysis

  • Hongyu Sun
  • Qiuhong Ke
  • Yongcai Wang
  • Wang Chen
  • Kang Yang
  • Deying Li
  • Jianfei Cai

This paper investigates the 3D domain generalization (3DDG) ability of large 3D models based on prevalent prompt learning. Recent works demonstrate that the performance of 3D point cloud recognition can be boosted remarkably by parameter-efficient prompt tuning. However, we observe that the improvement on downstream tasks comes at the expense of a severe drop in 3D domain generalization. To resolve this challenge, we present a comprehensive regulation framework that allows the learnable prompts to actively interact with the well-learned general knowledge in large 3D models to maintain good generalization. Specifically, the proposed framework imposes multiple explicit constraints on the prompt learning trajectory by maximizing the mutual agreement between task-specific predictions and task-agnostic knowledge. We design the regulation framework as a plug-and-play module to embed into existing representative large 3D models. Surprisingly, our method not only realizes consistently increasing generalization ability but also enhances task-specific 3D recognition performances across various 3DDG benchmarks by a clear margin. Considering the lack of study and evaluation on 3DDG, we also create three new benchmarks, namely base-to-new, cross-dataset and few-shot generalization benchmarks, to enrich the field and inspire future research. Code and benchmarks are available at https://github.com/auniquesun/Point-PRC.
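
One plausible instantiation of the "mutual agreement" constraint described above is a KL term that keeps the prompt-tuned predictions close to the frozen model's zero-shot predictions; the sketch below assumes that form, which is not necessarily the paper's exact regularizer.

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def agreement_penalty(prompted_logits, frozen_logits):
        # KL(prompted || frozen), averaged over the batch.
        p, q = softmax(prompted_logits), softmax(frozen_logits)
        kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
        return float(kl.mean())

    # A total objective could look like: task loss + lambda * agreement_penalty(...)
    tuned = np.array([[2.0, 0.1, -1.0, 0.3]])
    frozen = np.array([[1.5, 0.2, -0.8, 0.4]])
    print(agreement_penalty(tuned, frozen))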

NeurIPS Conference 2022 Conference Paper

EcoFormer: Energy-Saving Attention with Linear Complexity

  • Jing Liu
  • Zizheng Pan
  • Haoyu He
  • Jianfei Cai
  • Bohan Zhuang

Transformer is a transformative framework for deep learning which models sequential data and has achieved remarkable performance on a wide range of tasks, but with high computational and energy cost. To improve its efficiency, a popular choice is to compress the models via binarization, which constrains the floating-point values into binary ones to significantly reduce resource consumption owing to cheap bitwise operations. However, existing binarization methods only aim at minimizing the information loss for the input distribution statistically, while ignoring the pairwise similarity modeling at the core of the attention mechanism. To this end, we propose a new binarization paradigm customized to high-dimensional softmax attention via kernelized hashing, called EcoFormer, to map the original queries and keys into low-dimensional binary codes in Hamming space. The kernelized hash functions are learned to match the ground-truth similarity relations extracted from the attention map in a self-supervised way. Based on the equivalence between the inner product of binary codes and the Hamming distance as well as the associative property of matrix multiplication, we can approximate the attention in linear complexity by expressing it as a dot-product of binary codes. Moreover, the compact binary representations of queries and keys in EcoFormer enable us to replace most of the expensive multiply-accumulate operations in attention with simple accumulations to save considerable on-chip energy footprint on edge devices. Extensive experiments on both vision and language tasks show that EcoFormer consistently achieves comparable performance with standard attentions while consuming much fewer resources. For example, based on PVTv2-B0 and ImageNet-1K, EcoFormer achieves a 73% reduction in on-chip energy footprint with only a slight performance drop of 0.33% compared to the standard attention. Code is available at https://github.com/ziplab/EcoFormer.
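
A numpy sketch of the linear-complexity trick described above. The learned kernelized hash is replaced here by a random sign hash purely for illustration; what the snippet does show faithfully is the associativity step, which avoids ever forming the N x N attention matrix.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d, b, dv = 512, 64, 16, 64
    Q, K = rng.normal(size=(N, d)), rng.normal(size=(N, d))
    V = rng.normal(size=(N, dv))

    # Stand-in hash: sign of random projections (EcoFormer learns these functions).
    P = rng.normal(size=(d, b))
    Qb = (Q @ P > 0).astype(np.float64)   # {0,1} binary codes
    Kb = (K @ P > 0).astype(np.float64)

    # (Qb Kb^T) V == Qb (Kb^T V): the right-hand side is linear in sequence length.
    num = Qb @ (Kb.T @ V)                        # b x dv intermediate only
    den = Qb @ Kb.sum(axis=0, keepdims=True).T   # per-query normalizer
    out = num / (den + 1e-6)
    print(out.shape)                             # (512, 64)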

NeurIPS Conference 2022 Conference Paper

Fast Vision Transformers with HiLo Attention

  • Zizheng Pan
  • Jianfei Cai
  • Bohan Zhuang

Vision Transformers (ViTs) have triggered the most recent and significant breakthroughs in computer vision. Their efficient designs are mostly guided by the indirect metric of computational complexity, i.e., FLOPs, which, however, has a clear gap from direct metrics such as throughput. Thus, we propose to use the direct speed evaluation on the target platform as the design principle for efficient ViTs. Particularly, we introduce LITv2, a simple and effective ViT which performs favourably against the existing state-of-the-art methods across a spectrum of different model sizes with faster speed. At the core of LITv2 is a novel self-attention mechanism, which we dub HiLo. HiLo is inspired by the insight that high frequencies in an image capture local fine details and low frequencies focus on global structures, whereas a multi-head self-attention layer neglects the characteristic of different frequencies. Therefore, we propose to disentangle the high/low frequency patterns in an attention layer by separating the heads into two groups, where one group encodes high frequencies via self-attention within each local window, and another group encodes low frequencies by performing global attention between the average-pooled low-frequency keys and values from each window and each query position in the input feature map. Benefiting from the efficient design for both groups, we show that HiLo is superior to the existing attention mechanisms by comprehensively benchmarking FLOPs, speed and memory consumption on GPUs and CPUs. For example, HiLo is 1.4× faster than spatial reduction attention and 1.6× faster than local window attention on CPUs. Powered by HiLo, LITv2 serves as a strong backbone for mainstream vision tasks including image classification, dense detection and segmentation. Code is available at https://github.com/ziplab/LITv2.
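
A single-head-per-group numpy sketch of the HiLo split described above, with Q = K = V and no learned projections for brevity: the "Hi" path attends within non-overlapping windows, the "Lo" path attends from every position to window-averaged keys/values. Window size and shapes are illustrative.

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def hilo_attention(x, win=2):
        H, W, C = x.shape
        s = C ** -0.5

        # Hi-Fi: local attention inside each win x win window
        xw = x.reshape(H // win, win, W // win, win, C).transpose(0, 2, 1, 3, 4)
        xw = xw.reshape(-1, win * win, C)
        hi = softmax(xw @ xw.transpose(0, 2, 1) * s) @ xw
        hi = hi.reshape(H // win, W // win, win, win, C).transpose(0, 2, 1, 3, 4)
        hi = hi.reshape(H, W, C)

        # Lo-Fi: every position attends to average-pooled keys/values
        pooled = x.reshape(H // win, win, W // win, win, C).mean(axis=(1, 3))
        kv = pooled.reshape(-1, C)
        lo = (softmax(x.reshape(-1, C) @ kv.T * s) @ kv).reshape(H, W, C)

        return np.concatenate([hi, lo], axis=-1)   # the two head groups

    print(hilo_attention(np.random.randn(8, 8, 16)).shape)  # (8, 8, 32)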

AAAI Conference 2022 Conference Paper

Less Is More: Pay Less Attention in Vision Transformers

  • Zizheng Pan
  • Bohan Zhuang
  • Haoyu He
  • Jing Liu
  • Jianfei Cai

Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works can be prohibitively expensive due to the quadratic complexity of self-attention over a long sequence of representations, especially for high-resolution dense prediction tasks. To this end, we present a novel Less attention vIsion Transformer (LIT), building upon the fact that the early self-attention layers in Transformers still focus on local patterns and bring minor benefits in recent hierarchical vision Transformers. Specifically, we propose a hierarchical Transformer where we use pure multi-layer perceptrons (MLPs) to encode rich local patterns in the early stages while applying self-attention modules to capture longer dependencies in deeper layers. Moreover, we further propose a learned deformable token merging module to adaptively fuse informative patches in a nonuniform manner. The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation, serving as a strong backbone for many vision tasks. Code is available at https://github.com/zip-group/LIT.

NeurIPS Conference 2022 Conference Paper

MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation

  • Chuanxia Zheng
  • Tung-Long Vuong
  • Jianfei Cai
  • Dinh Phung

Although two-stage Vector Quantized (VQ) generative models allow for synthesizing high-fidelity and high-resolution images, their quantization operator encodes similar patches within an image into the same index, resulting in repeated artifacts across similar adjacent regions when using existing decoder architectures. To address this issue, we propose to incorporate spatially conditional normalization to modulate the quantized vectors so as to insert spatially variant information into the embedded index maps, encouraging the decoder to generate more photorealistic images. Moreover, we use multichannel quantization to increase the recombination capability of the discrete codes without increasing the cost of the model and codebook. Additionally, to generate discrete tokens at the second stage, we adopt a Masked Generative Image Transformer (MaskGIT) to learn an underlying prior distribution in the compressed latent space, which is much faster than the conventional autoregressive model. Experiments on two benchmark datasets demonstrate that our proposed modulated VQGAN is able to greatly improve the reconstructed image quality as well as provide high-fidelity image generation.
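
A SPADE/AdaIN-style sketch of what "spatially conditional normalization" can look like: normalize the decoder features, then apply a per-pixel scale and shift predicted from the quantized map. The linear modulation and shapes below are assumptions for illustration, not MoVQ's exact layer.

    import numpy as np

    def spatially_modulated_norm(h, z_q, Wg, Wb):
        # h: (H, W, C) decoder features; z_q: (H, W, Cq) quantized vectors.
        mu = h.mean(axis=(0, 1), keepdims=True)
        sigma = h.std(axis=(0, 1), keepdims=True) + 1e-5
        h_norm = (h - mu) / sigma
        gamma, beta = z_q @ Wg, z_q @ Wb       # per-pixel scale and shift
        return h_norm * (1.0 + gamma) + beta

    rng = np.random.default_rng(0)
    h, z_q = rng.normal(size=(16, 16, 8)), rng.normal(size=(16, 16, 4))
    out = spatially_modulated_norm(h, z_q, 0.1 * rng.normal(size=(4, 8)),
                                   0.1 * rng.normal(size=(4, 8)))
    print(out.shape)  # (16, 16, 8)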

IJCAI Conference 2020 Conference Paper

Learning from the Scene and Borrowing from the Rich: Tackling the Long Tail in Scene Graph Generation

  • Tao He
  • Lianli Gao
  • Jingkuan Song
  • Jianfei Cai
  • Yuan-Fang Li

Despite the huge progress in scene graph generation in recent years, the long-tail distribution of object relationships remains a challenging and persistent issue. Existing methods largely rely on either external knowledge or statistical bias information to alleviate this problem. In this paper, we tackle this issue from another two aspects: (1) scene-object interaction, which aims at learning specific knowledge from a scene via an additive attention mechanism; and (2) long-tail knowledge transfer, which tries to transfer the rich knowledge learned from the head into the tail. Extensive experiments on three tasks on the benchmark Visual Genome dataset demonstrate that our method outperforms current state-of-the-art competitors. Our source code is available at https://github.com/htlsn/issg.

NeurIPS Conference 2020 Conference Paper

Self-Supervised Relationship Probing

  • Jiuxiang Gu
  • Jason Kuen
  • Shafiq Joty
  • Jianfei Cai
  • Vlad Morariu
  • Handong Zhao
  • Tong Sun

Structured representations of images that model visual relationships are beneficial for many vision and vision-language applications. However, current human-annotated visual relationship datasets suffer from the long-tailed predicate distribution problem which limits the potential of visual relationship models. In this work, we introduce a self-supervised method that implicitly learns the visual relationships without relying on any ground-truth visual relationship annotations. Our method relies on 1) intra- and inter-modality encodings to respectively model relationships within each modality separately and jointly, and 2) relationship probing, which seeks to discover the graph structure within each modality. By leveraging masked language modeling, contrastive learning, and dependency tree distances for self-supervision, our method learns better object features as well as implicit visual relationships. We verify the effectiveness of our proposed method on various vision-language tasks that benefit from improved visual relationship understanding.

IJCAI Conference 2019 Conference Paper

Region Deformer Networks for Unsupervised Depth Estimation from Unconstrained Monocular Videos

  • Haofei Xu
  • Jianmin Zheng
  • Jianfei Cai
  • Juyong Zhang

While learning-based depth estimation from images/videos has achieved substantial progress, there still exist intrinsic limitations. Supervised methods are limited by the small amount of ground-truth or labeled data, and unsupervised methods for monocular videos are mostly based on the static scene assumption, so they do not perform well on real-world scenarios in the presence of dynamic objects. In this paper, we propose a new learning-based method consisting of DepthNet, PoseNet and Region Deformer Networks (RDN) to estimate depth from unconstrained monocular videos without ground-truth supervision. The core contribution lies in RDN for proper handling of rigid and non-rigid motions of various objects such as rigidly moving cars and deformable humans. In particular, a deformation-based motion representation is proposed to model individual object motion on 2D images. This representation enables our method to be applicable to diverse unconstrained monocular videos. Our method not only achieves state-of-the-art results on the standard KITTI and Cityscapes benchmarks, but also shows promising results on a crowded pedestrian tracking dataset, which demonstrates the effectiveness of the deformation-based motion representation. Code and trained models are available at https://github.com/haofeixu/rdn4depth.

EAAI Journal 2018 Journal Article

Cope with diverse data structures in multi-fidelity modeling: A Gaussian process method

  • Haitao Liu
  • Yew-Soon Ong
  • Jianfei Cai
  • Yi Wang

Multi-fidelity modeling (MFM) frameworks, especially Bayesian MFM, have gained popularity in simulation-based modeling, uncertainty quantification and optimization, due to their potential for reducing computational budget. From a multi-output modeling view, the MFM approximates the high-/low-fidelity outputs simultaneously by considering the output correlations, and particularly, it transfers knowledge from the inexpensive low-fidelity outputs that have many training points to enhance the modeling of the expensive high-fidelity output that has only a few training points. This article presents a novel multi-fidelity Gaussian process for modeling with diverse data structures. The diverse data structures mainly refer to the diversity of high-fidelity sample distributions, i.e., the high-fidelity points may randomly fill the domain, or more challengingly, they may cluster in some subregions. The proposed multi-fidelity model is composed of a global trend term and a local residual term. Particularly, the flexible residual term extracts both the shared and output-specific residual information via a data-driven weight parameter. Numerical experiments on two synthetic examples, an aircraft example and a stochastic incompressible flow example reveal that this promising Bayesian MFM approach is capable of effectively extracting the low-fidelity information to facilitate the modeling of the high-fidelity output under diverse data structures.
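
A deliberately simplified "global trend + residual" two-fidelity sketch with scikit-learn, in the spirit of the decomposition described above (a Kennedy-O'Hagan-style recursion on a toy Forrester-like function); the kernels, scale estimate and test functions are all assumptions, not the paper's model.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    f_hi = lambda x: (6 * x - 2) ** 2 * np.sin(12 * x - 4)   # expensive output
    f_lo = lambda x: 0.5 * f_hi(x) + 10 * (x - 0.5)          # cheap output

    X_lo = np.linspace(0, 1, 25)[:, None]                    # many cheap points
    X_hi = np.array([[0.0], [0.4], [0.6], [1.0]])            # few expensive ones

    gp_lo = GaussianProcessRegressor(kernel=RBF(0.2)).fit(X_lo, f_lo(X_lo).ravel())

    # Global trend: scaled low-fidelity prediction; local residual: its own GP.
    m_lo = gp_lo.predict(X_hi)
    rho = np.polyfit(m_lo, f_hi(X_hi).ravel(), 1)[0]         # data-driven scale
    gp_res = GaussianProcessRegressor(kernel=RBF(0.2)).fit(
        X_hi, f_hi(X_hi).ravel() - rho * m_lo)

    X = np.linspace(0, 1, 5)[:, None]
    print(np.round(rho * gp_lo.predict(X) + gp_res.predict(X), 2))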

AAAI Conference 2018 Conference Paper

Stack-Captioning: Coarse-to-Fine Learning for Image Captioning

  • Jiuxiang Gu
  • Jianfei Cai
  • Gang Wang
  • Tsuhan Chen

Existing image captioning approaches typically train a one-stage sentence decoder, which makes it difficult to generate rich fine-grained descriptions. On the other hand, multi-stage image captioning models are hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multi-stage prediction framework for image captioning, composed of multiple decoders, each of which operates on the output of the previous stage, producing increasingly refined image descriptions. Our proposed learning approach addresses the difficulty of vanishing gradients during training by providing a learning objective function that enforces intermediate supervision. In particular, we optimize our model with a reinforcement learning approach which utilizes the output of each intermediate decoder's test-time inference algorithm as well as the output of its preceding decoder to normalize the rewards, which simultaneously solves the well-known exposure bias problem and the loss-evaluation mismatch problem. We extensively evaluate the proposed approach on MSCOCO and show that it achieves state-of-the-art performance.
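
A very small sketch of the reward normalization idea mentioned above, under the assumption that each stage's policy-gradient signal is its sampled caption's reward minus the reward of the preceding decoder's output; the numbers are made up.

    def stage_advantages(rewards_sampled, rewards_prev_stage):
        # One advantage per refinement stage.
        return [rs - rb for rs, rb in zip(rewards_sampled, rewards_prev_stage)]

    # e.g. CIDEr-like rewards for stages 1..3 on one image
    print(stage_advantages([0.62, 0.71, 0.78], [0.55, 0.62, 0.71]))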

IJCAI Conference 2017 Conference Paper

Robust Survey Aggregation with Student-t Distribution and Sparse Representation

  • Qingtao Tang
  • Tao Dai
  • Li Niu
  • Yisen Wang
  • Shu-Tao Xia
  • Jianfei Cai

Most existing survey aggregation methods assume that the sample data follow a Gaussian distribution. However, these methods are sensitive to outliers, due to the thin-tailed property of the Gaussian distribution. To address this issue, we propose a robust survey aggregation method based on the Student-t distribution and sparse representation. Specifically, we assume that the samples follow a Student-t distribution, instead of the common Gaussian distribution. Due to the Student-t distribution, our method is robust to outliers, which can be explained from both Bayesian and non-Bayesian points of view. In addition, inspired by the James-Stein estimator (JS) and Compressive Averaging (CAvg), we propose to sparsely represent the global mean vector by an adaptive basis comprising both a data-specific basis and combined generic bases. Theoretically, we prove that JS and CAvg are special cases of our method. Extensive experiments demonstrate that our proposed method achieves significant improvement over the state-of-the-art methods on both synthetic and real datasets.
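
To see why the heavy-tailed assumption helps, here is a standard EM-style robust mean under an i.i.d. Student-t model (a textbook estimator used purely as an illustration, not the paper's full aggregation method): outliers get down-weighted instead of dragging the estimate.

    import numpy as np

    def student_t_mean(x, nu=3.0, iters=50):
        mu, var = np.median(x), np.var(x)
        for _ in range(iters):
            w = (nu + 1.0) / (nu + (x - mu) ** 2 / var)   # small weights for outliers
            mu = np.sum(w * x) / np.sum(w)
            var = np.sum(w * (x - mu) ** 2) / len(x)
        return mu

    x = np.array([1.0, 1.2, 0.9, 1.1, 1.05, 25.0])   # one gross outlier
    print(np.mean(x), student_t_mean(x))   # sample mean pulled to ~5; robust estimate stays near 1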

IJCAI Conference 2017 Conference Paper

Student-t Process Regression with Student-t Likelihood

  • Qingtao Tang
  • Li Niu
  • Yisen Wang
  • Tao Dai
  • Wangpeng An
  • Jianfei Cai
  • Shu-Tao Xia

Gaussian Process Regression (GPR) is a powerful Bayesian method. However, the performance of GPR can be significantly degraded when the training data are contaminated by outliers, including target outliers and input outliers. Although there are some variants of GPR (e.g., GPR with Student-t likelihood (GPRT)) aiming to handle outliers, most of the variants focus on handling the target outliers while little effort has been done to deal with the input outliers. In contrast, in this work, we aim to handle both the target outliers and the input outliers at the same time. Specifically, we replace the Gaussian noise in GPR with independent Student-t noise to cope with the target outliers. Moreover, to enhance the robustness w.r.t. the input outliers, we use a Student-t Process prior instead of the common Gaussian Process prior, leading to Student-t Process Regression with Student-t Likelihood (TPRT). We theoretically show that TPRT is more robust to both input and target outliers than GPR and GPRT, and prove that both GPR and GPRT are special cases of TPRT. Various experiments demonstrate that TPRT outperforms GPR and its variants on both synthetic and real datasets.
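
The robustness argument can be made tangible by comparing how harshly each noise model penalizes a large residual; this scipy snippet only illustrates the heavy-tail rationale, not TPRT's inference.

    from scipy.stats import norm, t

    resid = 6.0                     # a 6-sigma residual, i.e. a target outlier
    print(norm.logpdf(resid))       # about -18.9: Gaussian noise punishes it hard
    print(t.logpdf(resid, df=3))    # about -6.1: Student-t noise barely reacts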

TIST Journal 2015 Journal Article

Kinect Depth Recovery Using a Color-Guided, Region-Adaptive, and Depth-Selective Framework

  • Chongyu Chen
  • Jianfei Cai
  • Jianmin Zheng
  • Tat Jen Cham
  • Guangming Shi

Considering that the existing depth recovery approaches have different limitations when applied to Kinect depth data, in this article, we propose to integrate their effective features including adaptive support region selection, reliable depth selection, and color guidance together under an optimization framework for Kinect depth recovery. In particular, we formulate our depth recovery as an energy minimization problem, which solves the depth hole filling and denoising simultaneously. The energy function consists of a fidelity term and a regularization term, which are designed according to the Kinect characteristics. Our framework inherits and improves the idea of guided filtering by incorporating structure information and prior knowledge of the Kinect noise model. Through analyzing the solution to the optimization framework, we also derive a local filtering version that provides an efficient and effective way of improving the existing filtering techniques. Quantitative evaluations on our developed synthesized dataset and experiments on real Kinect data show that the proposed method achieves superior performance in terms of recovery accuracy and visual quality.
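
A compact local-filtering sketch in the spirit of the derived filtering version mentioned above: missing depth pixels are filled with a color-guided weighted average of valid neighbours (a joint-bilateral-style filter). The parameters and single-channel guidance are illustrative assumptions, not the paper's optimization framework.

    import numpy as np

    def guided_depth_fill(depth, gray, radius=3, sigma_s=2.0, sigma_c=10.0):
        # depth: (H, W), zeros mark holes; gray: (H, W) guidance image.
        H, W = depth.shape
        out = depth.astype(np.float64).copy()
        for y in range(H):
            for x in range(W):
                if depth[y, x] > 0:
                    continue                      # keep observed depth
                y0, y1 = max(0, y - radius), min(H, y + radius + 1)
                x0, x1 = max(0, x - radius), min(W, x + radius + 1)
                acc, wsum = 0.0, 0.0
                for v in range(y0, y1):
                    for u in range(x0, x1):
                        d = depth[v, u]
                        if d <= 0:
                            continue              # skip other holes
                        w_s = np.exp(-((v - y) ** 2 + (u - x) ** 2) / (2 * sigma_s ** 2))
                        dc = float(gray[v, u]) - float(gray[y, x])
                        w_c = np.exp(-(dc ** 2) / (2 * sigma_c ** 2))
                        acc += w_s * w_c * d
                        wsum += w_s * w_c
                if wsum > 0:
                    out[y, x] = acc / wsum
        return out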