Author name cluster

Hehe Fan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

20 papers

2 author rows

AAAI Conference 2026 Conference Paper

DLVINet: Advancing Dual-Lens Video Inpainting Beyond Parallax Constraints

Zhiliang Wu
Kun Li
Yunqiu Xu
Hehe Fan
Yi Yang

Dual-lens video inpainting aims to simultaneously restore missing or corrupted contents in videos captured by each lens of binocular systems. Although preliminary explorations have been conducted, existing methods still face two key challenges: limited exploitation of long-range reference information and inadequate modeling of inter-lens consistency in non-standard binocular systems. In this paper, we propose a novel dual-lens video inpainting framework named DLVINet, which addresses these challenges with two core components. Firstly, we develop a sparse spatial-temporal transformer (SSTT) that effectively utilizes the information from distant frames to complete the video contents of each lens individually. By employing sparse spatial-temporal attention with a channel selection mechanism, SSTT not only restores missing regions, but also avoids introducing redundant or irrelevant information. Furthermore, SSTT introduces a multi-scale feed-forward network to enrich the multi-scale representation of completed features. Secondly, we design a cross-lens texture transformer (CLTT) to model inter-lens consistency. By interacting with corresponding features between lenses under the guidance of cross-attention, CLTT captures global inter-lens correspondences. Such a design enables effective cross-view information modeling without being constrained by horizontal parallax, which is particularly critical for non-standard binocular systems. Extensive experiments demonstrate the effectiveness of our DLVINet.

IJCAI Conference 2025 Conference Paper

Drafting and Revision: Advancing High-Fidelity Video Inpainting

Zhiliang Wu
Kun Li
Hehe Fan
Yi Yang

Video inpainting aims to fill the missing regions in video with spatial-temporally coherent contents. Existing methods usually treat the missing contents as a whole and adopt a hybrid objective containing a reconstruction loss and an adversarial loss to train the model. However, these two kinds of loss focus on contents at different frequencies; simply combining them may cause inter-frequency conflicts, leading the trained model to generate compromised results. Inspired by the common corrupted painting restoration process of “drawing a draft first and then revising the details later”, this paper proposes a Drafting-and-Revision Completion Network (DRCN) for video inpainting. Specifically, we first design a Drafting Network that utilizes the temporal information to complete the low-frequency semantic structure at low resolution. Then, a Revision Network is developed to hallucinate high-frequency details at high resolution by using the output of Drafting Network. In this way, adversarial loss and reconstruction loss can be applied to high-frequency and low-frequency respectively, effectively mitigating inter-frequency conflicts. Furthermore, Revision Network can be stacked in a pyramid manner to generate higher resolution details, which provides a feasible solution for high-resolution video inpainting. Experiments show that DRCN achieves improvements of 7.43% and 12.64% in E_warp and LPIPS, and can handle higher resolution videos on limited GPU memory.

ICML Conference 2025 Conference Paper

DreamDPO: Aligning Text-to-3D Generation with Human Preferences via Direct Preference Optimization

Zhenglin Zhou
Xiaobo Xia
Fan Ma
Hehe Fan
Yi Yang 0001
Tat-Seng Chua

Text-to-3D generation automates 3D content creation from textual descriptions, which offers transformative potential across various fields. However, existing methods often struggle to align generated content with human preferences, limiting their applicability and flexibility. To address these limitations, in this paper, we propose DreamDPO, an optimization-based framework that integrates human preferences into the 3D generation process, through direct preference optimization. Practically, DreamDPO first constructs pairwise examples, then validates their alignment with human preferences using reward or large multimodal models, and lastly optimizes the 3D representation with a preference-driven loss function. By leveraging relative preferences, DreamDPO reduces reliance on precise quality evaluations while enabling fine-grained controllability through preference-guided optimization. Experiments demonstrate that DreamDPO achieves state-of-the-art results, and provides higher-quality and more controllable 3D content compared to existing methods. The code and models will be open-sourced.

ICLR Conference 2025 Conference Paper

OSDA Agent: Leveraging Large Language Models for De Novo Design of Organic Structure Directing Agents

Zhaolin Hu
Yixiao Zhou
Zhongan Wang
Xin Li 0034
Weimin Yang
Hehe Fan
Yi Yang 0001

Zeolites are crystalline porous materials that have been widely utilized in petrochemical industries as well as sustainable chemistry areas. Synthesis of zeolites often requires small molecules termed Organic Structure Directing Agents (OSDAs), which are critical in forming the porous structure. Molecule generation models can aid the design of OSDAs, but they are limited by single functionality and lack of interactivity. Meanwhile, large language models (LLMs) such as GPT-4, as general-purpose artificial intelligence systems, excel in instruction comprehension, logical reasoning, and interactive communication. However, LLMs lack in-depth chemistry knowledge and first-principle computation capabilities, resulting in uncontrollable outcomes even after fine-tuning. In this paper, we propose OSDA Agent, an interactive OSDA design framework that leverages LLMs as the brain, coupled with computational chemistry tools. The OSDA Agent consists of three main components: the Actor, responsible for generating potential OSDA structures; the Evaluator, which assesses and scores the generated OSDAs using computational chemistry tools; and the Self-reflector, which produces reflective summaries based on the Evaluator's feedback to refine the Actor's subsequent outputs. Experiments on representative zeolite frameworks show the generation-evaluation-reflection-refinement workflow can perform de novo design of OSDAs with superior generation quality than the pure LLM model, generating candidates consistent with experimentally validated OSDAs and optimizing known OSDAs.
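
The generation-evaluation-reflection-refinement workflow described in this abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: `refine_loop`, the three callables, and the toy stand-ins are hypothetical names, and the toy scoring replaces what would in practice be an LLM and computational chemistry tools.

```python
from typing import Callable, List, Tuple


def refine_loop(
    actor: Callable[[str, List[str]], str],
    evaluator: Callable[[str], float],
    reflector: Callable[[str, float], str],
    task: str,
    rounds: int = 3,
) -> Tuple[str, float]:
    """Run generate -> evaluate -> reflect for a fixed number of rounds.

    Returns the best-scoring candidate seen and its score.
    """
    reflections: List[str] = []
    best_candidate, best_score = "", float("-inf")
    for _ in range(rounds):
        # The actor sees the task plus all reflections accumulated so far.
        candidate = actor(task, reflections)
        score = evaluator(candidate)
        if score > best_score:
            best_candidate, best_score = candidate, score
        # The reflector turns the evaluator's feedback into a summary
        # that conditions the next generation round.
        reflections.append(reflector(candidate, score))
    return best_candidate, best_score


# Toy stand-ins (hypothetical): a "molecule" is a string of carbon atoms and
# its score is how close its length is to 3.
def toy_actor(task: str, reflections: List[str]) -> str:
    return "C" * (1 + len(reflections))


def toy_evaluator(candidate: str) -> float:
    return -abs(len(candidate) - 3)


def toy_reflector(candidate: str, score: float) -> str:
    return f"{candidate} scored {score}"
```

Keeping the best candidate across rounds, rather than returning the last one, is a design choice of this sketch; the abstract does not say how the final candidate is selected.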

IJCAI Conference 2025 Conference Paper

Prompt-Aware Controllable Shadow Removal

Kerui Chen
Zhiliang Wu
Wenjin Hou
Kun Li
Hehe Fan
Yi Yang

Shadow removal aims to restore the image content in shadowed regions. While deep learning-based methods have shown promising results, they still face key challenges: 1) uncontrolled removal of all shadows, or 2) controllable removal but heavily relies on precise shadow region masks. To address these issues, we introduce a novel paradigm: prompt-aware controllable shadow removal. Unlike existing approaches, our paradigm allows for targeted shadow removal from specific subjects based on user prompts (e.g., dots, lines, or subject masks). This approach eliminates the need for shadow annotations and offers flexible, user-controlled shadow removal. Specifically, we propose an end-to-end learnable model, the Prompt-Aware Controllable Shadow Removal Network (PACSRNet). PACSRNet consists of two key modules: a prompt-aware module that generates shadow masks for the specified subject based on the user prompt, and a shadow removal module that uses the shadow prior from the first module to restore the content in the shadowed areas. Additionally, we enhance the shadow removal module by incorporating feature information from the prompt-aware module through a linear operation, providing prompt-guided support for shadow removal. Recognizing that existing shadow removal datasets lack diverse user prompts, we contribute a new dataset specifically designed for prompt-based controllable shadow removal. Extensive experimental results demonstrate the effectiveness and superiority of PACSRNet.

AAAI Conference 2025 Conference Paper

Prototypical Calibrating Ambiguous Samples for Micro-Action Recognition

Kun Li
Dan Guo
Guoliang Chen
Chunxiao Fan
Jingyuan Xu
Zhiliang Wu
Hehe Fan
Meng Wang

Micro-Action Recognition (MAR) has gained increasing attention due to its crucial role as a form of non-verbal communication in social interactions, with promising potential for applications in human communication and emotion analysis. However, current approaches often overlook the inherent ambiguity in micro-actions, which arises from the wide category range and subtle visual differences between categories. This oversight hampers the accuracy of micro-action recognition. In this paper, we propose a novel Prototypical Calibrating Ambiguous Network (PCAN) to unleash and mitigate the ambiguity of MAR. Firstly, we employ a hierarchical action-tree to identify the ambiguous sample, categorizing them into distinct sets of ambiguous samples of false negatives and false positives, considering both body- and action-level categories. Secondly, we implement an ambiguous contrastive refinement module to calibrate these ambiguous samples by regulating the distance between ambiguous samples and their corresponding prototypes. This calibration process aims to pull false negative (FN) samples closer to their respective prototypes and push false positive (FP) samples apart from their affiliated prototypes. In addition, we propose a new prototypical diversity amplification loss to strengthen the model's capacity by amplifying the differences between different prototypes. Finally, we propose a prototype-guided rectification to rectify prediction by incorporating the representability of prototypes. Extensive experiments conducted on the benchmark dataset demonstrate the superior performance of our method compared to existing approaches.
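
The calibration idea in this abstract (pull false negatives toward a prototype, push false positives away from it) can be sketched as a generic margin-based loss in Python. This is a simplified stand-in, not PCAN's actual loss: the Euclidean distance, the hinge form, the `margin` value, and the function names are all assumptions of the sketch.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def calibration_loss(
    false_negatives: Sequence[Sequence[float]],
    false_positives: Sequence[Sequence[float]],
    prototype: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Generic pull/push loss around a single class prototype.

    False negatives belong to the class but were missed, so their distance
    to the prototype is penalized directly (pull). False positives were
    wrongly assigned to the class, so they are penalized only while they
    sit closer to the prototype than `margin` (push).
    """
    pull = sum(euclidean(x, prototype) for x in false_negatives)
    push = sum(
        max(0.0, margin - euclidean(x, prototype)) for x in false_positives
    )
    return pull + push
```

A false positive that already lies beyond the margin contributes nothing, so the push term only acts on samples that are still confusable with the class.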

ICML Conference 2025 Conference Paper

Reaction Graph: Towards Reaction-Level Modeling for Chemical Reactions with 3D Structures

Yingzhao Jian
Yue Zhang 0004
Ying Wei 0001
Hehe Fan
Yi Yang 0001

Accurately modeling chemical reactions using Artificial Intelligence (AI) can accelerate discovery and development, especially in fields like drug design and material science. Although AI has made remarkable advancements in single molecule recognition, such as predicting molecular properties, the study of interactions between molecules, particularly chemical reactions, has been relatively overlooked. In this paper, we introduce Reaction Graph (RG), a unified graph representation that encapsulates the 3D molecular structures within chemical reactions. RG integrates the molecular graphs of reactants and products into a cohesive framework, effectively capturing the interatomic relationships pertinent to the reaction process. Additionally, it incorporates the 3D structure information of molecules in a simple yet effective manner. We conduct experiments on a range of tasks, including chemical reaction classification, condition prediction, and yield prediction. RG achieves the highest accuracy across six datasets, demonstrating its effectiveness. The code is available at https://github.com/Shadow-Dream/Reaction-Graph.

ICLR Conference 2025 Conference Paper

VideoGrain: Modulating Space-Time Attention for Multi-Grained Video Editing

Xiangpeng Yang
Linchao Zhu
Hehe Fan
Yi Yang 0001

Recent advancements in diffusion models have significantly improved video generation and editing capabilities. However, multi-grained video editing, which encompasses class-level, instance-level, and part-level modifications, remains a formidable challenge. The major difficulties in multi-grained editing include semantic misalignment of text-to-region control and feature coupling within the diffusion model. To address these difficulties, we present VideoGrain, a zero-shot approach that modulates space-time (cross- and self-) attention mechanisms to achieve fine-grained control over video content. We enhance text-to-region control by amplifying each local prompt's attention to its corresponding spatial-disentangled region while minimizing interactions with irrelevant areas in cross-attention. Additionally, we improve feature separation by increasing intra-region awareness and reducing inter-region interference in self-attention. Extensive experiments demonstrate our method achieves state-of-the-art performance in real-world scenarios. Our code, data, and demos are available on the [project page](https://knightyxp.github.io/VideoGrain_project_page/).

AAAI Conference 2025 Conference Paper

ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning

Wenjin Hou
Dingjie Fu
Kun Li
Shiming Chen
Hehe Fan
Yi Yang

Zero-shot learning (ZSL) aims to recognize unseen classes by transferring semantic knowledge from seen classes to unseen ones, guided by semantic information. To this end, existing works have demonstrated remarkable performance by utilizing global visual features from Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) for visual-semantic interactions. Due to the limited receptive fields of CNNs and the quadratic complexity of ViTs, however, these visual backbones achieve suboptimal visual-semantic interactions. In this paper, motivated by the visual state space model (i.e., Vision Mamba), which is capable of capturing long-range dependencies and modeling complex visual dynamics, we propose a parameter-efficient ZSL framework called ZeroMamba to advance ZSL. Our ZeroMamba comprises three key components: Semantic-aware Local Projection (SLP), Global Representation Learning (GRL), and Semantic Fusion (SeF). Specifically, SLP integrates semantic embeddings to map visual features to local semantic-related representations, while GRL encourages the model to learn global semantic representations. SeF combines these two semantic representations to enhance the discriminability of semantic features. We incorporate these designs into Vision Mamba, forming an end-to-end ZSL framework. As a result, the learned semantic representations are better suited for classification. Through extensive experiments on four prominent ZSL benchmarks, ZeroMamba demonstrates superior performance, significantly outperforming the state-of-the-art (i.e., CNN-based and ViT-based) methods under both conventional ZSL (CZSL) and generalized ZSL (GZSL) settings.

AAAI Conference 2024 Conference Paper

DocMSU: A Comprehensive Benchmark for Document-Level Multimodal Sarcasm Understanding

Hang Du
Guoshun Nan
Sicheng Zhang
Binzhu Xie
Junrui Xu
Hehe Fan
Qimei Cui
Xiaofeng Tao

Multimodal Sarcasm Understanding (MSU) has a wide range of applications in the news field such as public opinion analysis and forgery detection. However, existing MSU benchmarks and approaches usually focus on sentence-level MSU. In document-level news, sarcasm clues are sparse or small and are often concealed in long text. Moreover, compared to sentence-level comments like tweets, which mainly focus on only a few trends or hot topics (e.g., sports events), content in the news is considerably diverse. Models created for sentence-level MSU may fail to capture sarcasm clues in document-level news. To fill this gap, we present a comprehensive benchmark for Document-level Multimodal Sarcasm Understanding (DocMSU). Our dataset contains 102,588 pieces of news with text-image pairs, covering 9 diverse topics such as health, business, etc. The proposed large-scale and diverse DocMSU significantly facilitates the research of document-level MSU in real-world scenarios. To take on the new challenges posed by DocMSU, we introduce a fine-grained sarcasm comprehension method to properly align the pixel-level image features with word-level textual features in documents. Experiments demonstrate the effectiveness of our method, showing that it can serve as a baseline approach to the challenging DocMSU.

AAAI Conference 2024 Conference Paper

Hand-Centric Motion Refinement for 3D Hand-Object Interaction via Hierarchical Spatial-Temporal Modeling

Yuze Hao
Jianrong Zhang
Tao Zhuo
Fuan Wen
Hehe Fan

Hands are the main medium when people interact with the world. Generating proper 3D motion for hand-object interaction is vital for applications such as virtual reality and robotics. Although grasp tracking or object manipulation synthesis can produce coarse hand motion, this kind of motion is inevitably noisy and full of jitter. To address this problem, we propose a data-driven method for coarse motion refinement. First, we design a hand-centric representation to describe the dynamic spatial-temporal relation between hands and objects. Compared to the object-centric representation, our hand-centric representation is straightforward and does not require an ambiguous projection process that converts object-based prediction into hand motion. Second, to capture the dynamic clues of hand-object interaction, we propose a new architecture that models the spatial and temporal structure in a hierarchical manner. Extensive experiments demonstrate that our method outperforms previous methods by a noticeable margin.

ICML Conference 2024 Conference Paper

Improving Context Understanding in Multimodal Large Language Models via Multimodal Composition Learning

Wei Li
Hehe Fan
Yongkang Wong
Yi Yang 0001
Mohan S. Kankanhalli

Previous efforts using frozen Large Language Models (LLMs) for visual understanding, via image captioning or image-text retrieval tasks, face challenges when dealing with complex multimodal scenarios. In order to enhance the capabilities of Multimodal Large Language Models (MLLM) in comprehending the context of vision and language, we introduce Multimodal Composition Learning (MCL) for the purpose of mapping or aligning the vision and language input. In particular, we introduce two tasks: Multimodal-Context Captioning (MC-Cap) and Multimodal-Context Retrieval (MC-Ret) to guide a frozen LLM in comprehending the vision and language context. These specialized tasks are crafted to improve the LLM’s capacity for efficient processing and utilization of multimodal inputs, thereby enhancing its proficiency in generating more accurate text or visual representations. Extensive experiments on both retrieval tasks (i.e., zero-shot composed image retrieval, visual storytelling image retrieval and visual dialog image retrieval) and text generation tasks (i.e., visual question answering) demonstrate the effectiveness of the proposed method. The code is available at: https://github.com/dhg-wei/MCL.

NeurIPS Conference 2024 Conference Paper

TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment

Wei Li
Hehe Fan
Yongkang Wong
Mohan Kankanhalli
Yi Yang

Recent advancements in image understanding have benefited from the extensive use of web image-text pairs. However, video understanding remains a challenge despite the availability of substantial web video-text data. This difficulty primarily arises from the inherent complexity of videos and the inefficient language supervision in recent web-collected video-text datasets. In this paper, we introduce Text-Only Pre-Alignment (TOPA), a novel approach to extend large language models (LLMs) for video understanding, without the need for pre-training on real video data. Specifically, we first employ an advanced LLM to automatically generate Textual Videos comprising continuous textual frames, along with corresponding annotations to simulate real video-text data. Then, these annotated textual videos are used to pre-align a language-only LLM with the video modality. To bridge the gap between textual and real videos, we employ the CLIP model as the feature extractor to align image and text modalities. During text-only pre-alignment, the continuous textual frames, encoded as a sequence of CLIP text features, are analogous to continuous CLIP image features, thus aligning the LLM with real video representation. Extensive experiments, including zero-shot evaluation and finetuning on various video understanding tasks, demonstrate that TOPA is an effective and efficient framework for aligning video content with LLMs. In particular, without training on any video data, the TOPA-Llama2-13B model achieves a Top-1 accuracy of 51.0% on the challenging long-form video understanding benchmark, Egoschema. This performance surpasses previous video-text pre-training approaches and proves competitive with recent GPT-3.5 based video agents.

ICLR Conference 2023 Conference Paper

Continuous-Discrete Convolution for Geometry-Sequence Modeling in Proteins

Hehe Fan
Zhangyang Wang
Yi Yang 0001
Mohan S. Kankanhalli

The structure of proteins involves 3D geometry of amino acid coordinates and 1D sequence of peptide chains. The 3D structure exhibits irregularity because amino acids are distributed unevenly in Euclidean space and their coordinates are continuous variables. In contrast, the 1D structure is regular because amino acids are arranged uniformly in the chains and their sequential positions (orders) are discrete variables. Moreover, geometric coordinates and sequential orders are in two types of spaces and their units of length are incompatible. These inconsistencies make it challenging to capture the 3D and 1D structures while avoiding the impact of sequence and geometry modeling on each other. This paper proposes a Continuous-Discrete Convolution (CDConv) that uses irregular and regular approaches to model the geometry and sequence structures, respectively. Specifically, CDConv employs independent learnable weights for different regular sequential displacements but directly encodes geometric displacements due to their irregularity. In this way, CDConv significantly improves protein modeling by reducing the impact of geometric irregularity on sequence modeling. Extensive experiments on a range of tasks, including protein fold classification, enzyme reaction classification, gene ontology term prediction and enzyme commission number prediction, demonstrate the effectiveness of the proposed CDConv.

AAAI Conference 2023 Conference Paper

SEFormer: Structure Embedding Transformer for 3D Object Detection

Xiaoyu Feng
Heming Du
Hehe Fan
Yueqi Duan
Yongpan Liu

Effectively preserving and encoding structure features from objects in irregular and sparse LiDAR points is a crucial challenge to 3D object detection on the point cloud. Recently, Transformer has demonstrated promising performance on many 2D and even 3D vision tasks. Compared with the fixed and rigid convolution kernels, the self-attention mechanism in Transformer can adaptively exclude the unrelated or noisy points and is thus suitable for preserving the local spatial structure in the irregular LiDAR point cloud. However, Transformer only performs a simple sum on the point features, based on the self-attention mechanism, and all the points share the same transformation for value. Such an isotropic operation cannot capture the direction-distance-oriented local structure, which is essential for 3D object detection. In this work, we propose a Structure-Embedding transFormer (SEFormer), which can not only preserve the local structure as a traditional Transformer but also have the ability to encode the local structure. Compared to the self-attention mechanism in traditional Transformer, SEFormer learns different feature transformations for value points based on the relative directions and distances to the query point. Then we propose a SEFormer-based network for high-performance 3D object detection. Extensive experiments show that the proposed architecture can achieve SOTA results on the Waymo Open Dataset, one of the most significant 3D detection benchmarks for autonomous driving. Specifically, SEFormer achieves 79.02% mAP, which is 1.2% higher than existing works. The code is available at https://github.com/tdzdog/SEFormer.

AAAI Conference 2023 Conference Paper

Text to Point Cloud Localization with Relation-Enhanced Transformer

Guangzhi Wang
Hehe Fan
Mohan Kankanhalli

Automatically localizing a position based on a few natural language instructions is essential for future robots to communicate and collaborate with humans. To approach this goal, we focus on a text-to-point-cloud cross-modal localization problem. Given a textual query, it aims to identify the described location from city-scale point clouds. The task involves two challenges. 1) In city-scale point clouds, similar ambient instances may exist in several locations. Searching each location in a huge point cloud with only instances as guidance may lead to less discriminative signals and incorrect results. 2) In textual descriptions, the hints are provided separately. In this case, the relations among those hints are not explicitly described, leaving the difficulties of learning relations to the agent itself. To alleviate the two challenges, we propose a unified Relation-Enhanced Transformer (RET) to improve representation discriminability for both point cloud and natural language queries. The core of the proposed RET is a novel Relation-enhanced Self-Attention (RSA) mechanism, which explicitly encodes instance (hint)-wise relations for the two modalities. Moreover, we propose a fine-grained cross-modal matching method to further refine the location predictions in a subsequent instance-hint matching stage. Experimental results on the KITTI360Pose dataset demonstrate that our approach surpasses the previous state-of-the-art method by large margins.

ICLR Conference 2021 Conference Paper

PSTNet: Point Spatio-Temporal Convolution on Point Cloud Sequences

Hehe Fan
Xin Yu 0002
Yuhang Ding
Yi Yang 0001
Mohan S. Kankanhalli

Point cloud sequences are irregular and unordered in the spatial dimension while exhibiting regularities and order in the temporal dimension. Therefore, existing grid based convolutions for conventional video processing cannot be directly applied to spatio-temporal modeling of raw point cloud sequences. In this paper, we propose a point spatio-temporal (PST) convolution to achieve informative representations of point cloud sequences. The proposed PST convolution first disentangles space and time in point cloud sequences. Then, a spatial convolution is employed to capture the local structure of points in the 3D space, and a temporal convolution is used to model the dynamics of the spatial regions along the time dimension. Furthermore, we incorporate the proposed PST convolution into a deep network, namely PSTNet, to extract features of point cloud sequences in a hierarchical manner. Extensive experiments on widely-used 3D action recognition and 4D semantic segmentation datasets demonstrate the effectiveness of PSTNet to model point cloud sequences.

AAAI Conference 2020 Conference Paper

Person Tube Retrieval via Language Description

Hehe Fan
Yi Yang

This paper focuses on the problem of person tube (a sequence of bounding boxes which encloses a person in a video) retrieval using a natural language query. Different from images in person re-identification (re-ID) or person search, besides appearance, person tube contains abundant action and information. We exploit a 2D and a 3D residual networks (ResNets) to extract the appearance and action representation, respectively. To transform tubes and descriptions into a shared latent space where data from the two different modalities can be compared directly, we propose a Multi-Scale Structure Preservation (MSSP) approach. MSSP splits a person tube into several element-tubes on average, whose features are extracted by the two ResNets. Any number of consecutive element-tubes forms a sub-tube. MSSP considers the following constraints for sub-tubes and descriptions in the shared space. 1) Bidirectional ranking. Matching sub-tubes (resp. descriptions) should get ranked higher than incorrect ones for each description (resp. sub-tube). 2) External structure preservation. Sub-tubes (resp. descriptions) from different persons should stay away from each other. 3) Internal structure preservation. Sub-tubes (resp. descriptions) from the same person should be close to each other. Experimental results on person tube retrieval via language description and other two related tasks demonstrate the efficacy of MSSP.
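
The bidirectional ranking constraint named in this abstract is commonly written as a two-way hinge loss over a similarity matrix. The Python sketch below shows that generic form only. It assumes cosine similarity, a fixed `margin`, non-zero embedding vectors, and one matching description per sub-tube; none of these details come from the abstract, and the function names are hypothetical.

```python
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def bidirectional_ranking_loss(
    tubes: List[Sequence[float]],
    texts: List[Sequence[float]],
    margin: float = 0.2,
) -> float:
    """Two-way hinge ranking loss, where tubes[i] matches texts[i]."""
    n = len(tubes)
    # sim[i][j] is the similarity between sub-tube i and description j.
    sim = [[cosine(t, d) for d in texts] for t in tubes]
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Description i: its own sub-tube i should outrank sub-tube j.
            loss += max(0.0, margin - sim[i][i] + sim[j][i])
            # Sub-tube i: its own description i should outrank description j.
            loss += max(0.0, margin - sim[i][i] + sim[i][j])
    return loss
```

The loss is zero once every matching pair beats every mismatched pair by at least the margin in both retrieval directions.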

AAAI Conference 2019 Conference Paper

Cubic LSTMs for Video Prediction

Hehe Fan
Linchao Zhu
Yi Yang

Predicting future frames in videos has become a promising direction of research for both computer vision and robot learning communities. The core of this problem involves moving object capture and future motion prediction. While object capture specifies which objects are moving in videos, motion prediction describes their future dynamics. Motivated by this analysis, we propose a Cubic Long Short-Term Memory (CubicLSTM) unit for video prediction. CubicLSTM consists of three branches, i.e., a spatial branch for capturing moving objects, a temporal branch for processing motions, and an output branch for combining the first two branches to generate predicted frames. Stacking multiple CubicLSTM units along the spatial branch and output branch, and then evolving along the temporal branch can form a cubic recurrent neural network (CubicRNN). Experiment shows that CubicRNN produces more accurate video predictions than prior methods on both synthetic and real-world datasets.

IJCAI Conference 2018 Conference Paper

Watching a Small Portion could be as Good as Watching All: Towards Efficient Video Classification

Hehe Fan
Zhongwen Xu
Linchao Zhu
Chenggang Yan
Jianjun Ge
Yi Yang

We aim to significantly reduce the computational cost for classification of temporally untrimmed videos while retaining similar accuracy. Existing video classification methods sample frames with a predefined frequency over entire video. Differently, we propose an end-to-end deep reinforcement approach which enables an agent to classify videos by watching a very small portion of frames like what we do. We make two main contributions. First, information is not equally distributed in video frames along time. An agent needs to watch more carefully when a clip is informative and skip the frames if they are redundant or irrelevant. The proposed approach enables the agent to adapt sampling rate to video content and skip most of the frames without the loss of information. Second, in order to have a confident decision, the number of frames that should be watched by an agent varies greatly from one video to another. We incorporate an adaptive stop network to measure confidence score and generate timely trigger to stop the agent watching videos, which improves efficiency without loss of accuracy. Our approach reduces the computational cost significantly for the large-scale YouTube-8M dataset, while the accuracy remains the same.
