Arrow Research

Author name cluster

Kaiqi Huang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

40 papers
2 author rows

Possible papers (40)

AAAI Conference 2026 Conference Paper

CausalStep: A Benchmark for Explicit Stepwise Causal Reasoning in Videos

  • Xuchen Li
  • Xuzhao Li
  • Shiyu Hu
  • Kaiqi Huang
  • Wentao Zhang

Recent advances in large language models (LLMs) have improved reasoning in text and image domains, yet achieving robust video reasoning remains a significant challenge. Existing video benchmarks mainly assess shallow understanding and reasoning and allow models to exploit global context, failing to rigorously evaluate true causal and stepwise reasoning. We present CausalStep, a benchmark designed for explicit stepwise causal reasoning in videos. CausalStep segments videos into causally linked units and enforces a strict stepwise question-answer (QA) protocol, requiring sequential answers and preventing shortcut solutions. Each question includes carefully constructed distractors based on error type taxonomy to ensure diagnostic value. The benchmark features 100 videos across six categories and 1,852 multiple-choice QA pairs. We introduce seven diagnostic metrics for comprehensive evaluation, enabling precise diagnosis of causal reasoning capabilities. Experiments with leading proprietary and open-source models, as well as human baselines, reveal a significant gap between current models and human-level stepwise reasoning. CausalStep provides a rigorous benchmark to drive progress in robust and interpretable video reasoning.
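The strict stepwise protocol is the benchmark's key mechanism: the model must answer each question using only the causally preceding segments, so it cannot lean on global context. Below is a minimal sketch of such an evaluation loop; the data layout and the `model_answer` callable are hypothetical, not CausalStep's actual harness.

```python
# Hypothetical sketch of a strict stepwise QA protocol: the model may only
# see video segments up to the current step and must answer each question
# before the next segment is revealed (no global context, no skipping).

def evaluate_stepwise(video_segments, qa_pairs, model_answer):
    """qa_pairs[i] = (question, options, correct_idx) for segment i.
    model_answer(context_segments, question, options) -> chosen option index.
    """
    transcript, num_correct = [], 0
    for step, (question, options, correct_idx) in enumerate(qa_pairs):
        context = video_segments[: step + 1]        # causal prefix only
        choice = model_answer(context, question, options)
        transcript.append((step, choice, choice == correct_idx))
        num_correct += int(choice == correct_idx)
    return num_correct / len(qa_pairs), transcript
```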

AAAI Conference 2026 Conference Paper

ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints

  • Meiqi Wu
  • Jiashu Zhu
  • Xiaokun Feng
  • Chubin Chen
  • Chen Zhu
  • Bingze Song
  • Fangyuan Mao
  • Jiahong Wu

Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling for improving video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a dynamic test-time scaling law strategy inspired by imagery that adaptively adjusts the inference search space and reward guided by prompts, effectively enhancing generation quality in imaginative scenarios. Furthermore, we introduce LDT-Bench, the first benchmark targeting long-distance semantic prompts, designed to evaluate the creativity of video generation models. It comprises 2,839 challenging concept pairs from diverse recognition datasets and incorporates an automatic evaluation protocol to assess creative capacity. Extensive experiments on LDT-Bench demonstrate that our approach consistently outperforms general generation models and test-time scaling approaches. Additionally, ImagerySearch achieves strong performance on VBench, confirming its effectiveness in improving video generation quality under diverse conditions.

AAAI Conference 2026 Conference Paper

No-Regret Strategy Solving in Imperfect-Information Games via Pre-Trained Embedding

  • Yanchang Fu
  • Shengda Liu
  • Pei Xu
  • Kaiqi Huang

High-quality information set abstraction remains a core challenge in solving large-scale imperfect-information extensive-form games (IIEFGs) such as no-limit Texas Hold'em, where finite spatial (memory) resources preclude solving the full game directly. State-of-the-art AI methods rely on pre-trained discrete clustering for abstraction, yet this hard classification irreversibly discards critical information: the quantifiable, subtle differences between information sets that are vital for strategy solving, thereby compromising solution quality. Inspired by the word embedding paradigm in natural language processing, this paper proposes the Embedding CFR algorithm, a novel approach for solving strategies in IIEFGs within an embedding space. The algorithm pre-trains and embeds the features of individual information sets into an interconnected low-dimensional continuous space, where the resulting vectors capture both the distinctions and connections between information sets more precisely. Embedding CFR introduces a strategy-solving process driven by regret accumulation and strategy updates in this embedding space, with supporting theoretical analysis verifying its ability to reduce cumulative regret. Experiments on poker show that, with the same spatial overhead, Embedding CFR achieves significantly faster exploitability convergence than cluster-based abstraction algorithms, confirming its effectiveness. Furthermore, to our knowledge, it is the first algorithm in poker AI that pre-trains information set abstractions via low-dimensional embedding for strategy solving.
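The core idea, accumulating regret over an embedding of information sets rather than over hard clusters, can be illustrated with a toy variant. The sketch below replaces the paper's pre-trained embedder with a fixed random projection and stores regrets per quantized embedding cell; all names and constants are illustrative, not the paper's algorithm.

```python
import numpy as np
from collections import defaultdict

class EmbeddingRegretTable:
    """Toy sketch: accumulate counterfactual regrets keyed by a quantized
    low-dimensional embedding of the information set, then derive a strategy
    by regret matching. The embedder here is a fixed random projection; the
    paper pre-trains it, which this sketch does not attempt."""

    def __init__(self, feat_dim, embed_dim, n_actions, grid=0.25, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((feat_dim, embed_dim)) / np.sqrt(feat_dim)
        self.grid = grid
        self.n_actions = n_actions
        self.regret = defaultdict(lambda: np.zeros(n_actions))

    def _key(self, infoset_features):
        z = infoset_features @ self.proj              # embed into low-dim space
        return tuple(np.round(z / self.grid).astype(int))  # nearby infosets share a cell

    def strategy(self, infoset_features):
        r = np.maximum(self.regret[self._key(infoset_features)], 0.0)
        return r / r.sum() if r.sum() > 0 else np.full(self.n_actions, 1.0 / self.n_actions)

    def update(self, infoset_features, counterfactual_regrets):
        self.regret[self._key(infoset_features)] += counterfactual_regrets

# Usage sketch: query the current strategy, then add observed regrets.
table = EmbeddingRegretTable(feat_dim=20, embed_dim=4, n_actions=3)
feats = np.random.default_rng(1).standard_normal(20)
sigma = table.strategy(feats)
table.update(feats, np.array([0.5, -0.2, 0.1]))
```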

AAAI Conference 2026 Conference Paper

RefRea: Reference-Guided Reasoning with Meta-Cognition for Accurate Language Model Agents

  • Yuxiang Mai
  • Qiyue Yin
  • Wancheng Ni
  • Jianwei Guo
  • Xiaogang Ouyang
  • Pei Xu
  • Kaiqi Huang

In recent years, with the rapid development of large language models (LLMs), LLM-based agents have achieved remarkable progress across a wide range of tasks. However, reasoning inconsistencies in LLMs still significantly limit the performance of agents in complex decision-making scenarios. Cognitive science research suggests that individuals can benefit from observing others' explicit thinking processes to improve their strategy-making. Inspired by this mechanism, we propose Reference-guided Reasoning with meta-cognition (RefRea), a novel approach that enhances decision-making by introducing a reference language model to guide and calibrate the reasoning model's actions. RefRea enhances reasoning accuracy and stability by integrating a reference model and a meta-cognition module. The reference model relies solely on validated meta-cognition for consistent guidance, while the reasoning model interacts with the environment using both validated and exploratory meta-cognition. Guidance is provided by comparing the action similarity between the reference and reasoning models. This process is supported by the meta-cognition module, which generates summary knowledge by reflecting on action history and environmental feedback, leading to more adaptive and reliable behavior. We evaluate our algorithm in the text-based reasoning environment ScienceWorld. Experimental results demonstrate that RefRea outperforms state-of-the-art methods. Comprehensive ablation studies further highlight the effectiveness of both the reference model and the meta-cognition module.

IJCAI Conference 2025 Conference Paper

Constructive Conflict-Driven Multi-Agent Reinforcement Learning for Strategic Diversity

  • Yuxiang Mai
  • Qiyue Yin
  • Wancheng Ni
  • Pei Xu
  • Kaiqi Huang

In recent years, diversity has emerged as a useful mechanism to enhance the efficiency of multi-agent reinforcement learning (MARL). However, existing methods predominantly focus on designing policies based on individual agent characteristics, often neglecting the interplay and mutual influence among agents during policy formation. To address this gap, we propose Competitive Diversity through Constructive Conflict (CoDiCon), a novel approach that incorporates competitive incentives into cooperative scenarios to encourage policy exchange and foster strategic diversity among agents. Drawing inspiration from sociological research, which highlights the benefits of moderate competition and constructive conflict in group decision-making, we design an intrinsic reward mechanism using ranking features to introduce competitive motivations. A centralized intrinsic reward module generates and distributes varying reward values to agents, ensuring an effective balance between competition and cooperation. By optimizing the parameterized centralized reward module to maximize environmental rewards, we reformulate the constrained bilevel optimization problem to align with the original task objectives. We evaluate our algorithm against state-of-the-art methods in the SMAC and GRF environments. Experimental results demonstrate that CoDiCon achieves superior performance, with competitive intrinsic rewards effectively promoting diverse and adaptive strategies among cooperative agents.
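As a rough illustration of the ranking-based intrinsic reward idea, the sketch below converts per-agent scores into normalized rank features and maps them through a small centralized network; the architecture, the score definition, and the mixing coefficient are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class RankingIntrinsicReward(nn.Module):
    """Sketch of a centralized intrinsic-reward module: each agent's scalar
    performance score is turned into a normalized rank feature, and a shared
    MLP maps (rank, score) to a per-agent intrinsic reward, injecting mild
    competition into a cooperative task."""

    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, agent_scores):                  # shape: (n_agents,)
        n = agent_scores.shape[0]
        ranks = agent_scores.argsort().argsort().float() / max(n - 1, 1)  # in [0, 1]
        feats = torch.stack([ranks, agent_scores], dim=-1)
        return self.net(feats).squeeze(-1)            # per-agent intrinsic reward

# Usage: mix a shared environment reward with the intrinsic term.
module = RankingIntrinsicReward()
r_int = module(torch.tensor([0.3, 1.2, -0.5]))
r_total = 1.0 + 0.1 * r_int                           # beta = 0.1 (illustrative)
```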

ICML Conference 2025 Conference Paper

CSTrack: Enhancing RGB-X Tracking via Compact Spatiotemporal Features

  • Xiaokun Feng
  • Dailing Zhang
  • Shiyu Hu
  • Xuchen Li 0001
  • Meiqi Wu
  • Jing Zhang 0110
  • Xiaotang Chen
  • Kaiqi Huang

Effectively modeling and utilizing spatiotemporal features from RGB and other modalities (e.g., depth, thermal, and event data, denoted as X) is the core of RGB-X tracker design. Existing methods often employ two parallel branches to separately process the RGB and X input streams, requiring the model to simultaneously handle two dispersed feature spaces, which complicates both the model structure and computation process. More critically, intra-modality spatial modeling within each dispersed space incurs substantial computational overhead, limiting resources for inter-modality spatial modeling and temporal modeling. To address this, we propose a novel tracker, CSTrack, which focuses on modeling Compact Spatiotemporal features to achieve simple yet effective tracking. Specifically, we first introduce an innovative Spatial Compact Module that integrates the RGB-X dual input streams into a compact spatial feature, enabling thorough intra- and inter-modality spatial modeling. Additionally, we design an efficient Temporal Compact Module that compactly represents temporal features by constructing the refined target distribution heatmap. Extensive experiments validate the effectiveness of our compact spatiotemporal modeling method, with CSTrack achieving new SOTA results on mainstream RGB-X benchmarks. The code and models will be released at: https://github.com/XiaokunFeng/CSTrack.

ICML Conference 2025 Conference Paper

LLM Data Selection and Utilization via Dynamic Bi-level Optimization

  • Yang Yu 0056
  • Kai Han 0002
  • Hang Zhou
  • Yehui Tang 0001
  • Kaiqi Huang
  • Yunhe Wang 0001
  • Dacheng Tao

While large-scale training data is fundamental for developing capable large language models (LLMs), strategically selecting high-quality data has emerged as a critical approach to enhance training efficiency and reduce computational costs. Current data selection methodologies predominantly rely on static, training-agnostic criteria, failing to account for the dynamic interactions between model training and data. In this paper, we propose a new Data Weighting Model (DWM) that adjusts the weight of selected data within each batch to achieve dynamic data utilization during LLM training. Specifically, to better capture the dynamic data preferences of the trained model, a bi-level optimization framework is implemented to update the weighting model. Our experiments demonstrate that DWM enhances the performance of models trained with randomly-selected data, and the learned weighting model can be transferred to enhance other data selection methods and models of different sizes. Moreover, we further analyze how a model's data preferences evolve throughout training, providing new insights into the data preference of the model during training.
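The bi-level update can be sketched as one differentiable "virtual" step: a weighting network scores each training example, the model takes a virtual SGD step on the weighted loss, and the weighting network is then updated to reduce the virtually-updated model's loss on held-out data. The toy linear model, sizes, and learning rates below are illustrative, not DWM's actual setup.

```python
import torch
import torch.nn as nn

d = 8
W = torch.zeros(d, 1, requires_grad=True)                  # toy model parameters
weight_net = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, 1))
opt_w = torch.optim.Adam(weight_net.parameters(), lr=1e-3)
lr_inner = 0.1

def per_example_loss(params, x, y):
    return ((x @ params - y) ** 2).squeeze(-1)             # shape: (batch,)

x_tr, y_tr = torch.randn(32, d), torch.randn(32, 1)
x_val, y_val = torch.randn(32, d), torch.randn(32, 1)

weights = torch.softmax(weight_net(x_tr).squeeze(-1), dim=0)        # per-example weights
inner = (weights * per_example_loss(W, x_tr, y_tr)).sum()           # weighted train loss
(grad_W,) = torch.autograd.grad(inner, W, create_graph=True)        # keep graph for outer step
W_virtual = W - lr_inner * grad_W                                   # one unrolled SGD step
outer = per_example_loss(W_virtual, x_val, y_val).mean()            # held-out loss
opt_w.zero_grad(); outer.backward(); opt_w.step()                   # meta-update of the weighter
with torch.no_grad():
    W -= lr_inner * grad_W                                          # real model step
```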

AAAI Conference 2025 Conference Paper

Sequential Preference Optimization: Multi-Dimensional Preference Alignment with Implicit Reward Modeling

  • Xingzhou Lou
  • Junge Zhang
  • Jian Xie
  • Lifeng Liu
  • Dong Yan
  • Kaiqi Huang

Human preference alignment is critical in building powerful and reliable large language models (LLMs). However, current methods either ignore the multi-dimensionality of human preferences (e.g., helpfulness and harmlessness) or struggle with the complexity of managing multiple reward models. To address these issues, we propose Sequential Preference Optimization (SPO), a method that sequentially fine-tunes LLMs to align with multiple dimensions of human preferences. SPO avoids explicit reward modeling, directly optimizing the models to align with nuanced human preferences. We theoretically derive the closed-form optimal SPO policy and loss function. A gradient analysis is conducted to show how SPO manages to fine-tune the LLMs while maintaining alignment on previously optimized dimensions. Empirical results on LLMs of different sizes and multiple evaluation datasets demonstrate that SPO successfully aligns LLMs across multiple dimensions of human preferences and significantly outperforms the baselines.
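Since SPO avoids explicit reward modeling, its per-stage objective is naturally close to a DPO-style implicit-reward loss. A hedged sketch of such a stage loss follows, with the sequential twist that each stage's reference log-probabilities come from the previous stage's model; the exact loss derived in the paper may differ.

```python
import torch
import torch.nn.functional as F

def spo_stage_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style implicit-reward preference loss for one alignment dimension.
    In a sequential scheme, ref_* are log-probs under the model produced by
    the previous stage, so earlier dimensions anchor later ones. The form
    and the constant beta are illustrative, not the paper's derivation."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

# Stage 1: align helpfulness against the SFT model as reference.
# Stage 2: align harmlessness, with the stage-1 model as the new reference.
```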

AAMAS Conference 2025 Conference Paper

Uncertainty-Aware Opponent Modeling for Deep Reinforcement Learning

  • Likun Yang
  • Pei Xu
  • Shiyue Cao
  • Yongjian Ren
  • Xiaotang Chen
  • Kaiqi Huang

The ability to model opponent behavior is essential for autonomous decision-making in multi-agent games. Although stochastic behavior is universal in real-world situations, previous works have struggled to model opponents with high stochasticity, such as humans. The issue arises because stochasticity in opponent behavior introduces significant uncertainty into the opponent modeling process, which existing methods have not adequately addressed. We introduce a novel Uncertainty-Aware Opponent Modeling (UAOM) method that addresses two key sources of uncertainty stemming from the inherent randomness of the opponent's actions. The first pertains to the uncertainty in constructing the opponent model, while the second concerns the uncertainty in applying the model during decision-making. For the first, UAOM uses a hybrid behavior modeling module to learn a more powerful opponent-aware representation by ensembling deterministic and probabilistic models, addressing both aleatoric and epistemic uncertainties in opponent modeling. For the second, UAOM uses an opponent-aware dynamic modeling module to learn a dynamic-aware representation. We further provide a theoretical analysis showing that jointly optimizing the two modules can enhance downstream reinforcement learning performance while ensuring system convergence. We evaluate UAOM in both simulated settings and human-agent interaction scenarios. Our experimental results show that the proposed method significantly enhances performance when facing opponents with varying degrees of stochastic behavior, while efficiently managing the uncertainties introduced by such opponents.
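The ensemble described above suggests the standard aleatoric/epistemic decomposition: predicted variances capture the opponent's inherent randomness, while disagreement between ensemble members captures our ignorance of the opponent. The sketch below shows that decomposition with probabilistic Gaussian heads only; how UAOM actually consumes the two signals inside its modules is not reproduced here, and all shapes are illustrative.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """One probabilistic opponent model: predicts a Gaussian over the
    opponent's next action features."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_var = nn.Linear(hidden, act_dim)
    def forward(self, obs):
        h = self.body(obs)
        return self.mu(h), self.log_var(h)

def ensemble_uncertainty(heads, obs):
    """Mean predicted variance ~ aleatoric uncertainty (opponent randomness);
    variance of the predicted means ~ epistemic uncertainty (model ignorance)."""
    mus, variances = zip(*[(mu, lv.exp()) for mu, lv in (h(obs) for h in heads)])
    mus, variances = torch.stack(mus), torch.stack(variances)
    aleatoric = variances.mean(dim=0)
    epistemic = mus.var(dim=0, unbiased=False)
    return mus.mean(dim=0), aleatoric, epistemic

heads = [GaussianHead(10, 4) for _ in range(5)]
mean_act, alea, epi = ensemble_uncertainty(heads, torch.randn(1, 10))
```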

NeurIPS Conference 2025 Conference Paper

Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards

  • Honghao Chen
  • Xingzhou Lou
  • Xiaokun Feng
  • Kaiqi Huang
  • Xinlong Wang

Chain of thought reasoning has demonstrated remarkable success in large language models, yet its adaptation to vision-language reasoning remains an open challenge with unclear best practices. Existing attempts typically employ reasoning chains at a coarse-grained level, which struggle to perform fine-grained structured reasoning and, more importantly, make it difficult to evaluate the reward and quality of intermediate reasoning. In this work, we delve into chain of step reasoning for vision-language models, enabling accurate assessment of reasoning step quality and leading to effective reinforcement learning and inference-time scaling with fine-grained rewards. We present a simple, effective, and fully transparent framework, including the step-level reasoning data, process reward model (PRM), and reinforcement learning training. With the proposed approaches, our models set strong baselines with consistent improvements on challenging vision-language benchmarks. More importantly, we conduct a thorough empirical analysis and ablation study, unveiling the impact of each component and several intriguing properties of inference-time scaling. We believe this paper serves as a baseline for vision-language models and offers insights into more complex multimodal reasoning. Our dataset, PRM, and code are available at https://github.com/baaivision/CoS.

IJCAI Conference 2024 Conference Paper

ADMN: Agent-Driven Modular Network for Dynamic Parameter Sharing in Cooperative Multi-Agent Reinforcement Learning

  • Yang Yu
  • Qiyue Yin
  • Junge Zhang
  • Pei Xu
  • Kaiqi Huang

Parameter sharing is a common strategy in multi-agent reinforcement learning (MARL) to make training more efficient and scalable. However, applying parameter sharing among agents indiscriminately hinders the emergence of agent diversity and degrades the final cooperative performance. To better balance parameter sharing and agent diversity, we propose a novel Agent-Driven Modular Network (ADMN), where agents share a base network consisting of multiple specialized modules, and each agent has its own routing to connect these modules. In ADMN, modules are shared among agents to improve training efficiency, while the combination of different modules brings rich diversity. The agent routing at different time steps is learned end-to-end to achieve a dynamic and adaptive balance. We also propose an information-theoretical regularization between the routing of agents and their behavior to further guarantee the identifiability of different routings. We evaluated ADMN in challenging StarCraft micromanagement games and Google Research Football games, and the results demonstrate the superior performance of ADMN, particularly in larger or heterogeneous cooperative tasks.
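A compressed sketch of the shared-modules-plus-per-agent-routing structure: agents share a bank of modules, and each agent mixes module outputs with its own softmax routing weights. For brevity the routing here is static per agent, whereas ADMN learns it per time step end-to-end; all shapes and the module architecture are illustrative.

```python
import torch
import torch.nn as nn

class ModularRoutingNet(nn.Module):
    """Shared module bank with per-agent soft routing: every agent uses the
    same specialized modules, but its own routing weights decide how their
    outputs are combined, trading parameter sharing against diversity."""

    def __init__(self, n_agents, n_modules, in_dim, out_dim):
        super().__init__()
        self.module_bank = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))
            for _ in range(n_modules))
        self.route_logits = nn.Parameter(torch.zeros(n_agents, n_modules))

    def forward(self, agent_id, obs):
        outs = torch.stack([m(obs) for m in self.module_bank])      # (M, B, out)
        route = torch.softmax(self.route_logits[agent_id], dim=0)   # (M,)
        return torch.einsum("m,m...->...", route, outs)             # routed mixture

net = ModularRoutingNet(n_agents=4, n_modules=3, in_dim=10, out_dim=5)
out = net(agent_id=2, obs=torch.randn(8, 10))
```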

NeurIPS Conference 2024 Conference Paper

Beyond Accuracy: Tracking more like Human via Visual Search

  • Dailing Zhang
  • Shiyu Hu
  • Xiaokun Feng
  • Xuchen Li
  • Meiqi Wu
  • Jing Zhang
  • Kaiqi Huang

Human visual search ability enables efficient and accurate tracking of an arbitrary moving target, which is a significant research interest in cognitive neuroscience. The recently proposed Central-Peripheral Dichotomy (CPD) theory sheds light on how humans effectively process visual information and track moving targets in complex environments. However, existing visual object tracking algorithms still fall short of matching human performance in maintaining tracking over time, particularly in complex scenarios requiring robust visual search skills. These scenarios often involve Spatio-Temporal Discontinuities (i.e., STDChallenge), prevalent in long-term tracking and global instance tracking. To address this issue, we conduct research from a human-like modeling perspective: (1) Inspired by the CPD, we propose a new tracker named CPDTrack to achieve human-like visual search ability. The central vision of CPDTrack leverages the spatio-temporal continuity of videos to introduce priors and enhance localization precision, while the peripheral vision improves global awareness and detects object movements. (2) To further evaluate and analyze STDChallenge, we create the STDChallenge Benchmark. Besides, by incorporating human subjects, we establish a human baseline, creating a high-quality environment specifically designed to assess trackers' visual search abilities in videos across STDChallenge. (3) Our extensive experiments demonstrate that the proposed CPDTrack not only achieves state-of-the-art (SOTA) performance in this challenge but also narrows the behavioral differences with humans. Additionally, CPDTrack exhibits strong generalizability across various challenging benchmarks. In summary, our research underscores the importance of human-like modeling and offers strategic insights for advancing intelligent visual target tracking. Code and models are available at https://github.com/ZhangDailing8/CPDTrack.

AAAI Conference 2024 Conference Paper

DDAE: Towards Deep Dynamic Vision BERT Pretraining

  • Honghao Chen
  • Xiangwen Kong
  • Xiangyu Zhang
  • Xin Zhao
  • Kaiqi Huang

Recently, masked image modeling (MIM) has demonstrated promising prospects in self-supervised representation learning. However, existing MIM frameworks recover all masked patches equivalently, ignoring that the reconstruction difficulty of different patches can vary sharply due to their diverse distances from visible patches. In this paper, we propose a novel deep dynamic supervision to enable MIM methods to dynamically reconstruct patches with different degrees of difficulty at different pretraining phases and depths of the model. Our deep dynamic supervision helps to provide more locality inductive bias for ViTs, especially in deep layers, which inherently makes up for the absence of a local prior in the self-attention mechanism. Built upon the deep dynamic supervision, we propose the Deep Dynamic AutoEncoder (DDAE), a simple yet effective MIM framework that utilizes dynamic mechanisms for pixel regression and feature self-distillation simultaneously. Extensive experiments across a variety of vision tasks, including ImageNet classification, semantic segmentation on ADE20K and object detection on COCO, demonstrate the effectiveness of our approach.
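The notion that reconstruction difficulty grows with distance from visible patches can be made concrete with a distance-based loss weight. The sketch below weights each masked patch by its Chebyshev distance to the nearest visible patch and exposes a temperature `tau` to shift emphasis over pretraining; the exact weighting and schedule used in the paper are not reproduced here.

```python
import torch

def distance_weights(mask, tau):
    """mask: (H, W) bool grid over patches, True = masked; assumes at least
    one visible patch. Each masked patch is weighted by exp(-d / tau), where
    d is its Chebyshev distance to the nearest visible patch, so a small tau
    emphasizes easy (near) patches and a large tau flattens the weights."""
    H, W = mask.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    vis_y, vis_x = (~mask).nonzero(as_tuple=True)            # visible coordinates
    d = torch.maximum((ys[..., None] - vis_y).abs(),
                      (xs[..., None] - vis_x).abs()).min(dim=-1).values
    w = torch.exp(-d.float() / tau)
    return torch.where(mask, w, torch.zeros_like(w))

# Usage: multiply the per-patch reconstruction loss by these weights, and
# grow tau over pretraining so harder (distant) patches count more later.
mask = torch.rand(14, 14) < 0.75
w = distance_weights(mask, tau=2.0)
```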

NeurIPS Conference 2024 Conference Paper

MemVLT: Vision-Language Tracking with Adaptive Memory-based Prompts

  • Xiaokun Feng
  • Xuchen Li
  • Shiyu Hu
  • Dailing Zhang
  • Meiqi Wu
  • Jing Zhang
  • Xiaotang Chen
  • Kaiqi Huang

Vision-language tracking (VLT) enhances traditional visual object tracking by integrating language descriptions, requiring the tracker to flexibly understand complex and diverse text in addition to visual information. However, most existing vision-language trackers still overly rely on initial fixed multimodal prompts, which struggle to provide effective guidance for dynamically changing targets. Fortunately, the Complementary Learning Systems (CLS) theory suggests that the human memory system can dynamically store and utilize multimodal perceptual information, thereby adapting to new scenarios. Inspired by this, (i) we propose a Memory-based Vision-Language Tracker (MemVLT). By incorporating memory modeling to adjust static prompts, our approach can provide adaptive prompts for tracking guidance. (ii) Specifically, the memory storage and memory interaction modules are designed in accordance with CLS theory. These modules facilitate the storage and flexible interaction between short-term and long-term memories, generating prompts that adapt to target variations. (iii) Finally, we conduct extensive experiments on mainstream VLT datasets (e.g., MGIT, TNL2K, LaSOT and LaSOT_ext). Experimental results show that MemVLT achieves new state-of-the-art performance. Impressively, it achieves 69.4% AUC on MGIT and 63.3% AUC on TNL2K, improving the existing best results by 8.4% and 4.7%, respectively.

IJCAI Conference 2024 Conference Paper

Population-Based Diverse Exploration for Sparse-Reward Multi-Agent Tasks

  • Pei Xu
  • Junge Zhang
  • Kaiqi Huang

Exploration under sparse rewards is a key challenge for multi-agent reinforcement learning problems. Although population-based learning shows its potential in producing diverse behaviors, most previous works still focus on improving the exploration of a single joint policy. In this paper, we show that with a suitable exploration method, maintaining a population of joint policies rather than one joint policy can significantly improve exploration. Our key idea is to guide each member of the population to explore different regions of the environment. To this end, we propose a member-aware exploration objective which explicitly guides each member to maximize deviation from the explored regions of other members, thus forcing the members to explore different regions. In addition, we propose an exploration-enhanced policy constraint to guide each member to learn a joint policy that is both different from other members and promotes exploration, thus increasing the probability of exploring different regions. Under the reward-free setting, our method achieves a 72% average improvement in the number of explored states compared to classical exploration methods in the multiple-particle environment. Moreover, under the sparse-reward setting, we show that the proposed method significantly outperforms the state-of-the-art methods in the multiple-particle environment, Google Research Football, and StarCraft II micromanagement tasks.
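A toy count-based rendering of the member-aware idea follows: a member's exploration bonus decays with its own visit count and additionally rewards states the other members have not covered, pushing members toward disjoint regions. The paper's actual objective and policy constraint are more involved; this only illustrates the "deviate from others' explored regions" mechanism.

```python
from collections import Counter

class MemberAwareBonus:
    """Per-member count-based bonus with a deviation term: the first term is
    classical novelty w.r.t. the member's own visits, the second grows when
    no *other* member has visited the state, encouraging disjoint coverage."""

    def __init__(self, n_members):
        self.counts = [Counter() for _ in range(n_members)]

    def bonus(self, member, state):
        own = self.counts[member][state]
        others = sum(c[state] for j, c in enumerate(self.counts) if j != member)
        self.counts[member][state] += 1
        return 1.0 / ((1 + own) ** 0.5) + 1.0 / (1 + others)   # novelty + deviation
```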

ICML Conference 2024 Conference Paper

Revealing the Dark Secrets of Extremely Large Kernel ConvNets on Robustness

  • Honghao Chen
  • Yurong Zhang
  • Xiaokun Feng
  • Xiangxiang Chu
  • Kaiqi Huang

Robustness is a vital aspect to consider when deploying deep learning models into the wild. Numerous studies have been dedicated to the robustness of vision transformers (ViTs), which have dominated as the mainstream backbone choice for vision tasks since the dawn of the 2020s. Recently, some large kernel convnets have made a comeback with impressive performance and efficiency. However, it remains unclear whether large kernel networks are robust and what their robustness can be attributed to. In this paper, we first conduct a comprehensive evaluation of large kernel convnets' robustness and their differences from typical small kernel counterparts and ViTs on six diverse robustness benchmark datasets. Then, to analyze the underlying factors behind their strong robustness, we design experiments from both quantitative and qualitative perspectives to reveal intriguing properties of large kernel convnets that are completely different from typical convnets. Our experiments demonstrate for the first time that pure CNNs can achieve exceptional robustness comparable or even superior to that of ViTs. Our analysis of occlusion invariance, kernel attention patterns and frequency characteristics provides novel insights into the source of robustness. Code is available at: https://github.com/Lauch1ng/LKRobust.

AAMAS Conference 2024 Conference Paper

Safe Reinforcement Learning with Free-form Natural Language Constraints and Pre-Trained Language Models

  • Xingzhou Lou
  • Junge Zhang
  • Ziyan Wang
  • Kaiqi Huang
  • Yali Du

Safe reinforcement learning (RL) agents accomplish given tasks while adhering to specific constraints. Employing constraints expressed via easily-understandable human language offers considerable potential for real-world applications due to its accessibility and non-reliance on domain expertise. Previous safe RL methods with natural language constraints typically adopt a recurrent neural network, which leads to limited capabilities when dealing with various forms of human language input. Furthermore, these methods often require a ground-truth cost function, necessitating domain expertise for the conversion of language constraints into a well-defined cost function that determines constraint violation. To address these issues, we propose to use pre-trained language models (LMs) to facilitate RL agents' comprehension of natural language constraints and allow them to infer costs for safe policy learning. Through the use of pre-trained LMs and the elimination of the need for a ground-truth cost, our method enhances safe policy learning under a diverse set of human-derived free-form natural language constraints. Experiments on grid-world navigation and robot control show that the proposed method can achieve strong performance while adhering to given constraints. The usage of pre-trained LMs allows our method to comprehend complicated constraints and learn safe policies without the need for a ground-truth cost at any stage of training or evaluation. Extensive ablation studies are conducted to demonstrate the efficacy of each part of our method.

AAAI Conference 2024 Conference Paper

TAPE: Leveraging Agent Topology for Cooperative Multi-Agent Policy Gradient

  • Xingzhou Lou
  • Junge Zhang
  • Timothy J. Norman
  • Kaiqi Huang
  • Yali Du

Multi-Agent Policy Gradient (MAPG) has made significant progress in recent years. However, centralized critics in state-of-the-art MAPG methods still face the centralized-decentralized mismatch (CDM) issue, which means sub-optimal actions by some agents will affect other agents' policy learning. While using individual critics for policy updates can avoid this issue, it severely limits cooperation among agents. To address this issue, we propose an agent topology framework, which decides whether other agents should be considered in the policy gradient and achieves a compromise between facilitating cooperation and alleviating the CDM issue. The agent topology allows agents to use coalition utility as the learning objective, instead of the global utility given by centralized critics or the local utility given by individual critics. To constitute the agent topology, various models are studied. We propose Topology-based multi-Agent Policy gradiEnt (TAPE) for both stochastic and deterministic MAPG methods. We prove the policy improvement theorem for stochastic TAPE and give a theoretical explanation for the improved cooperation among agents. Experiment results on several benchmarks show that the agent topology is able to facilitate agent cooperation and alleviate the CDM issue, improving the performance of TAPE. Finally, multiple ablation studies and a heuristic graph search algorithm are devised to show the efficacy of the agent topology.
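The agent-topology interpolation between centralized and individual critics can be written down directly: an adjacency matrix selects which agents' advantages enter each agent's policy-gradient objective. A minimal per-transition sketch follows; shapes and the adjacency choice are illustrative, not the paper's learned topology models.

```python
import torch

def topology_pg_loss(log_probs, advantages, adjacency):
    """log_probs and advantages are (n_agents,) for one transition;
    adjacency[i, j] = 1 means agent j's utility enters agent i's coalition
    objective (diagonal = 1, so each agent always counts itself). A full
    matrix recovers a centralized-critic-style objective; the identity
    recovers fully individual critics."""
    coalition_adv = adjacency.float() @ advantages           # (n_agents,)
    return -(log_probs * coalition_adv.detach()).sum()

adj = torch.tensor([[1, 1, 0], [1, 1, 0], [0, 0, 1]])        # agents 0,1 coupled; 2 alone
loss = topology_pg_loss(torch.randn(3), torch.randn(3), adj)
```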

NeurIPS Conference 2023 Conference Paper

A Multi-modal Global Instance Tracking Benchmark (MGIT): Better Locating Target in Complex Spatio-temporal and Causal Relationship

  • Shiyu Hu
  • Dailing Zhang
  • Meiqi Wu
  • Xiaokun Feng
  • Xuchen Li
  • Xin Zhao
  • Kaiqi Huang

Tracking an arbitrary moving target in a video sequence is the foundation for high-level tasks like video understanding. Although existing visual-based trackers have demonstrated good tracking capabilities in short video sequences, they always perform poorly in complex environments, as represented by the recently proposed global instance tracking task, which consists of longer videos with more complicated narrative content. Recently, several works have introduced natural language into object tracking, desiring to address the limitations of relying only on a single visual modality. However, these selected videos are still short sequences with uncomplicated spatio-temporal and causal relationships, and the provided semantic descriptions are too simple to characterize video content. To address these issues, we (1) first propose a new multi-modal global instance tracking benchmark named MGIT. It consists of 150 long video sequences with a total of 2.03 million frames, aiming to fully represent the complex spatio-temporal and causal relationships coupled in longer narrative content. (2) Each video sequence is annotated with three semantic grains (i.e., action, activity, and story) to model the progressive process of human cognition. We expect this multi-granular annotation strategy can provide a favorable environment for multi-modal object tracking research and long video understanding. (3) Besides, we execute comparative experiments on existing multi-modal object tracking benchmarks, which not only explore the impact of different annotation methods, but also validate that our annotation method is a feasible solution for coupling human understanding into semantic labels. (4) Additionally, we conduct detailed experimental analyses on MGIT, and hope the explored performance bottlenecks of existing algorithms can support further research in multi-modal object tracking. The proposed benchmark, experimental results, and toolkit will be released gradually on http://videocube.aitestunion.com/.

IJCAI Conference 2023 Conference Paper

Exploration via Joint Policy Diversity for Sparse-Reward Multi-Agent Tasks

  • Pei Xu
  • Junge Zhang
  • Kaiqi Huang

Exploration under sparse rewards is a key challenge for multi-agent reinforcement learning problems. Previous works argue that complex dynamics between agents and the huge exploration space in MARL scenarios amplify the vulnerability of classical count-based exploration methods when combined with agents parameterized by neural networks, resulting in inefficient exploration. In this paper, we show that introducing constrained joint policy diversity into a classical count-based method can significantly improve exploration when agents are parameterized by neural networks. Specifically, we propose a joint policy diversity measure that captures the difference between the current joint policy and previous joint policies, and then use a filtering-based exploration constraint to further refine the joint policy diversity. Under the sparse-reward setting, we show that the proposed method significantly outperforms the state-of-the-art methods in the multiple-particle environment, Google Research Football, and StarCraft II micromanagement tasks. To the best of our knowledge, on the hard 3s_vs_5z task, which needs non-trivial strategies to defeat enemies, our method is the first to learn winning strategies without domain knowledge under the sparse-reward setting.

AAMAS Conference 2023 Conference Paper

PECAN: Leveraging Policy Ensemble for Context-Aware Zero-Shot Human-AI Coordination

  • Xingzhou Lou
  • Jiaxian Guo
  • Junge Zhang
  • Jun Wang
  • Kaiqi Huang
  • Yali Du

Zero-shot human-AI coordination holds the promise of collaborating with humans without human data. Prevailing methods try to train the ego agent with a population of partners via self-play. However, these methods suffer from two problems: 1) the diversity of a population with finite partners is limited, thereby limiting the capacity of the trained ego agent to collaborate with a novel human; 2) current methods only provide a common best response for every partner in the population, which may result in poor zero-shot coordination performance with a novel partner or humans. To address these issues, we first propose the policy ensemble method to increase the diversity of partners in the population, and then develop a context-aware method enabling the ego agent to analyze and identify the partner's potential policy primitives so that it can take different actions accordingly. In this way, the ego agent is able to learn more universal cooperative behaviors for collaborating with diverse partners. We conduct experiments on the Overcooked environment, and evaluate the zero-shot human-AI coordination performance of our method with both behavior-cloned human proxies and real humans. The results demonstrate that our method significantly increases the diversity of partners and enables ego agents to learn more diverse behaviors than baselines, thus achieving state-of-the-art performance in all scenarios. We also open-source a human-AI coordination study framework on the Overcooked for the convenience of future studies. Codes and demo videos are available at https://sites.google.com/view/pecan-overcooked.

AAMAS Conference 2023 Conference Paper

Prioritized Tasks Mining for Multi-Task Cooperative Multi-Agent Reinforcement Learning

  • Yang Yu
  • Qiyue Yin
  • Junge Zhang
  • Kaiqi Huang

Multi-task learning improves data efficiency in cooperative multi-agent reinforcement learning, since agents can learn multiple related tasks simultaneously and the cooperation knowledge in one task can be utilized by others. However, existing methods mainly learn multiple cooperation tasks uniformly, regardless of their complexity and significance. In this paper, we propose a new framework called Prioritized Tasks Mining (PTM) for multi-task cooperation problems, which helps agents identify and mine higher-priority cooperation tasks, so as to learn more effective coordinated strategies across multiple cooperation tasks. Specifically, agents use hindsight during training to identify the priority of different tasks, and explore and exploit higher-priority cooperative tasks to mine more sophisticated coordinated strategies. We evaluate PTM in challenging multi-task StarCraft micromanagement games of different scales, and the results demonstrate that our method consistently outperforms all strong baselines.

ICLR Conference 2023 Conference Paper

Re-parameterizing Your Optimizers rather than Architectures

  • Xiaohan Ding
  • Honghao Chen
  • Xiangyu Zhang 0005
  • Kaiqi Huang
  • Jungong Han
  • Guiguang Ding

The well-designed structures in neural networks reflect the prior knowledge incorporated into the models. However, though different models have various priors, we are used to training them with model-agnostic optimizers such as SGD. In this paper, we propose to incorporate model-specific prior knowledge into optimizers by modifying the gradients according to a set of model-specific hyper-parameters. Such a methodology is referred to as Gradient Re-parameterization, and the optimizers are named RepOptimizers. For the extreme simplicity of model structure, we focus on a VGG-style plain model and showcase that such a simple model trained with a RepOptimizer, which is referred to as RepOpt-VGG, performs on par with or better than the recent well-designed models. From a practical perspective, RepOpt-VGG is a favorable base model because of its simple structure, high inference speed and training efficiency. Compared to Structural Re-parameterization, which adds priors into models via constructing extra training-time structures, RepOptimizers require no extra forward/backward computations and solve the problem of quantization. We hope to spark further research beyond the realms of model structure design. Code and models are available at https://github.com/DingXiaoH/RepOptimizers.
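The Gradient Re-parameterization idea, moving the structural prior into per-parameter gradient multipliers, can be sketched as a small custom optimizer. The multipliers below are supplied by hand and a single parameter group is assumed; deriving the multipliers from a model's priors is the part of the paper this sketch omits.

```python
import torch

class ScaledSGD(torch.optim.Optimizer):
    """Sketch of a RepOptimizer-style update: each parameter carries a fixed
    multiplier, and the optimizer rescales the gradient by it before a plain
    SGD step, so the prior lives in the optimizer rather than in extra
    training-time branches."""

    def __init__(self, params_with_scales, lr=0.01):
        params, scales = zip(*params_with_scales)     # list of (param, scale)
        super().__init__(list(params), dict(lr=lr))
        self.scales = list(scales)                    # order matches the single group

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p, s in zip(group["params"], self.scales):
                if p.grad is not None:
                    p.add_(p.grad * s, alpha=-group["lr"])   # rescaled gradient step

# Usage: hand-picked multipliers per parameter (illustrative values).
model = torch.nn.Linear(4, 2)
opt = ScaledSGD([(p, 2.0 if p.ndim > 1 else 1.0) for p in model.parameters()], lr=0.05)
loss = model(torch.randn(3, 4)).sum()
loss.backward(); opt.step()
```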

AAAI Conference 2023 Conference Paper

Subspace-Aware Exploration for Sparse-Reward Multi-Agent Tasks

  • Pei Xu
  • Junge Zhang
  • Qiyue Yin
  • Chao Yu
  • Yaodong Yang
  • Kaiqi Huang

Exploration under sparse rewards is a key challenge for multi-agent reinforcement learning problems. One possible solution to this issue is to exploit inherent task structures to accelerate exploration. In this paper, we present a novel exploration approach, which encodes a special structural prior on the reward function into exploration, for sparse-reward multi-agent tasks. Specifically, a novel entropic exploration objective which encodes the structural prior is proposed to accelerate the discovery of rewards. By maximizing the lower bound of this objective, we then propose an algorithm with moderate computational cost, which can be applied to practical tasks. Under the sparse-reward setting, we show that the proposed algorithm significantly outperforms the state-of-the-art algorithms in the multiple-particle environment, Google Research Football and StarCraft II micromanagement tasks. To the best of our knowledge, on some hard tasks (such as 27m_vs_30m) which have a relatively larger number of agents and need non-trivial strategies to defeat enemies, our method is the first to learn winning strategies under the sparse-reward setting.

NeurIPS Conference 2022 Conference Paper

InsPro: Propagating Instance Query and Proposal for Online Video Instance Segmentation

  • Fei He
  • Haoyang Zhang
  • Naiyu Gao
  • Jian Jia
  • Yanhu Shan
  • Xin Zhao
  • Kaiqi Huang

Video instance segmentation (VIS) aims at segmenting and tracking objects in videos. Prior methods typically generate frame-level or clip-level object instances first and then associate them by either additional tracking heads or complex instance matching algorithms. This explicit instance association approach increases system complexity and fails to fully exploit temporal cues in videos. In this paper, we design a simple, fast and yet effective query-based framework for online VIS. Relying on an instance query and proposal propagation mechanism with several specially developed components, this framework can perform accurate instance association implicitly. Specifically, we generate frame-level object instances based on a set of instance query-proposal pairs propagated from previous frames. This instance query-proposal pair is learned to bind with one specific object across frames through conscientiously developed strategies. When using such a pair to predict an object instance on the current frame, not only is the generated instance automatically associated with its precursors on previous frames, but the model gets a good prior for predicting the same object. In this way, we naturally achieve implicit instance association in parallel with segmentation and elegantly take advantage of temporal clues in videos. To show the effectiveness of our method InsPro, we evaluate it on two popular VIS benchmarks, i.e., YouTube-VIS 2019 and YouTube-VIS 2021. Without bells-and-whistles, our InsPro with a ResNet-50 backbone achieves 43.2 AP and 37.6 AP on these two benchmarks respectively, outperforming all other online VIS methods.

AAAI Conference 2022 Conference Paper

Learning Disentangled Attribute Representations for Robust Pedestrian Attribute Recognition

  • Jian Jia
  • Naiyu Gao
  • Fei He
  • Xiaotang Chen
  • Kaiqi Huang

Although various methods have been proposed for pedestrian attribute recognition, most studies follow the same feature learning mechanism, i.e., learning a shared pedestrian image feature to classify multiple attributes. However, this mechanism leads to low-confidence predictions and non-robustness of the model in the inference stage. In this paper, we investigate why this is the case. We mathematically discover that the central cause is that the optimal shared feature cannot maintain high similarities with multiple classifiers simultaneously in the context of minimizing classification loss. In addition, this feature learning mechanism ignores the spatial and semantic distinctions between different attributes. To address these limitations, we propose a novel disentangled attribute feature learning (DAFL) framework to learn a disentangled feature for each attribute, which exploits the semantic and spatial characteristics of attributes. The framework mainly consists of learnable semantic queries, a cascaded semantic-spatial cross-attention (SSCA) module, and a group attention merging (GAM) module. Specifically, based on learnable semantic queries, the cascaded SSCA module iteratively enhances the spatial localization of attribute-related regions and aggregates region features into multiple disentangled attribute features, used for classification and updating learnable semantic queries. The GAM module splits attributes into groups based on spatial distribution and utilizes reliable group attention to supervise query attention maps. Experiments on PETA, RAPv1, PA100k, and RAPv2 show that the proposed method performs favorably against state-of-the-art methods.

AAAI Conference 2022 Conference Paper

QueryProp: Object Query Propagation for High-Performance Video Object Detection

  • Fei He
  • Naiyu Gao
  • Jian Jia
  • Xin Zhao
  • Kaiqi Huang

Video object detection has been an important yet challenging topic in computer vision. Traditional methods mainly focus on designing the image-level or box-level feature propagation strategies to exploit temporal information. This paper argues that with a more effective and efficient feature propagation framework, video object detectors can gain improvement in terms of both accuracy and speed. For this purpose, this paper studies object-level feature propagation, and proposes an object query propagation (QueryProp) framework for high-performance video object detection. The proposed QueryProp contains two propagation strategies: 1) query propagation is performed from sparse key frames to dense non-key frames to reduce the redundant computation on non-key frames; 2) query propagation is performed from previous key frames to the current key frame to improve feature representation by temporal context modeling. To further facilitate query propagation, an adaptive propagation gate is designed to achieve flexible key frame selection. We conduct extensive experiments on the ImageNet VID dataset. QueryProp achieves comparable accuracy with state-of-the-art methods and strikes a decent accuracy/speed trade-off.

AAAI Conference 2021 Conference Paper

Learning to Reweight Imaginary Transitions for Model-Based Reinforcement Learning

  • Wenzhen Huang
  • Qiyue Yin
  • Junge Zhang
  • Kaiqi Huang

Model-based reinforcement learning (RL) is more sample-efficient than model-free RL because it uses imaginary trajectories generated by the learned dynamics model. When the model is inaccurate or biased, however, imaginary trajectories may be deleterious for training the action-value and policy functions. To alleviate this problem, this paper proposes to adaptively reweight the imaginary transitions, so as to reduce the negative effects of poorly generated trajectories. More specifically, we evaluate the effect of an imaginary transition by calculating the change in the loss computed on real samples when the transition is used to train the action-value and policy functions. Based on this evaluation criterion, we reweight each imaginary transition with a well-designed meta-gradient algorithm. Extensive experimental results demonstrate that our method outperforms state-of-the-art model-based and model-free RL algorithms on multiple tasks. Visualization of the changing weights further validates the necessity of the reweighting scheme.
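The evaluation criterion in the abstract, the change in real-sample loss caused by training on an imaginary transition, can be probed directly with a finite-difference version, sketched below. The paper instead learns the weights with a meta-gradient; the `loss_fn` callable, the sigmoid mapping, and all constants here are illustrative assumptions.

```python
import copy
import torch

def imaginary_transition_weight(model, loss_fn, imag_batch, real_batch,
                                lr=1e-2, k=10.0):
    """Take a virtual gradient step on one imaginary batch, measure how the
    loss on real samples changes, and map that change to a weight in (0, 1):
    transitions that lower the real loss get weights near 1, harmful ones
    near 0. loss_fn(model, batch) -> scalar loss (hypothetical interface)."""
    with torch.no_grad():
        before = loss_fn(model, real_batch).item()
    probe = copy.deepcopy(model)                       # leave the real model untouched
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    opt.zero_grad()
    loss_fn(probe, imag_batch).backward()
    opt.step()                                         # virtual update on imaginary data
    with torch.no_grad():
        after = loss_fn(probe, real_batch).item()
    return torch.sigmoid(torch.tensor(k * (before - after))).item()
```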

AAAI Conference 2020 Conference Paper

GlobalTrack: A Simple and Strong Baseline for Long-Term Tracking

  • Lianghua Huang
  • Xin Zhao
  • Kaiqi Huang

A key capability of a long-term tracker is to search for targets in very large areas (typically the entire image) to handle possible target absences or tracking failures. However, currently there is a lack of such a strong baseline for global instance search. In this work, we aim to bridge this gap. Specifically, we propose GlobalTrack, a pure global instance search based tracker that makes no assumption on the temporal consistency of the target's positions and scales. GlobalTrack is developed based on two-stage object detectors, and it is able to perform full-image and multi-scale search of arbitrary instances with only a single query as the guide. We further propose a cross-query loss to improve the robustness of our approach against distractors. With no online learning, no punishment on position or scale changes, no scale smoothing and no trajectory refinement, our pure global instance search based tracker achieves comparable, sometimes much better performance on four large-scale tracking benchmarks (i.e., 52.1% AUC on LaSOT, 63.8% success rate on TLP, 60.3% MaxGM on OxUvA and 75.4% normalized precision on TrackingNet), compared to state-of-the-art approaches that typically require complex post-processing. More importantly, our tracker runs without cumulative errors, i.e., any type of temporary tracking failures will not affect its performance on future frames, making it ideal for long-term tracking. We hope this work will be a strong baseline for long-term tracking and will stimulate future works in this area.

AAAI Conference 2020 Conference Paper

Temporal Context Enhanced Feature Aggregation for Video Object Detection

  • Fei He
  • Naiyu Gao
  • Qiaozhe Li
  • Senyao Du
  • Xin Zhao
  • Kaiqi Huang

Video object detection is a challenging task because of the presence of appearance deterioration in certain video frames. One typical solution is to aggregate neighboring features to enhance per-frame appearance features. However, such a method ignores the temporal relations between the aggregated frames, which is critical for improving video recognition accuracy. To handle the appearance deterioration problem, this paper proposes a temporal context enhanced network (TCENet) to exploit temporal context information by temporal aggregation for video object detection. To handle the displacement of objects in videos, a novel DeformAlign module is proposed to align spatial features from frame to frame. Instead of adopting a fixed-length window fusion strategy, a temporal stride predictor is proposed to adaptively select video frames for aggregation, which facilitates exploiting variable temporal information and requires fewer video frames for aggregation to achieve better results. Our TCENet achieves state-of-the-art performance on the ImageNet VID dataset and has a faster runtime. Without bells and whistles, our TCENet achieves 80.3% mAP by aggregating only 3 frames.

AAAI Conference 2019 Conference Paper

3D Object Detection Using Scale Invariant and Feature Reweighting Networks

  • Xin Zhao
  • Zhe Liu
  • Ruolan Hu
  • Kaiqi Huang

3D object detection plays an important role in a large number of real-world applications. It requires us to estimate the localizations and orientations of 3D objects in real scenes. In this paper, we present a new network architecture which focuses on utilizing front-view images and frustum point clouds to generate 3D detection results. On the one hand, a PointSIFT module is utilized to improve the performance of 3D segmentation. It can capture information from different orientations in space and is robust to shapes of different scales. On the other hand, our network obtains the useful features and suppresses the features with less information via a SENet module. This module reweights channel features and estimates the 3D bounding boxes more effectively. Our method is evaluated on both the KITTI dataset for outdoor scenes and the SUN-RGBD dataset for indoor scenes. The experimental results illustrate that our method achieves better performance than the state-of-the-art methods, especially when point clouds are highly sparse.

AAAI Conference 2019 Conference Paper

Bootstrap Estimated Uncertainty of the Environment Model for Model-Based Reinforcement Learning

  • Wenzhen Huang
  • Junge Zhang
  • Kaiqi Huang

Model-based reinforcement learning (RL) methods attempt to learn a dynamics model to simulate the real environment and utilize the model to make better decisions. However, the learned environment simulator often has some degree of model error, which can disturb decision-making and reduce performance. We propose a bootstrapped model-based RL method which bootstraps the modules at each depth of the planning tree. This method can quantify the uncertainty of the environment model on different state-action pairs and lead the agent to explore the pairs with higher uncertainty, reducing potential model errors. Moreover, we sample target values from their bootstrap distribution to connect the uncertainties at current and subsequent time steps, and introduce a prior mechanism to improve exploration efficiency. Experiment results demonstrate that our method efficiently decreases model error and outperforms TreeQN and other state-of-the-art methods on multiple Atari games.

IJCAI Conference 2019 Conference Paper

Pedestrian Attribute Recognition by Joint Visual-semantic Reasoning and Knowledge Distillation

  • Qiaozhe Li
  • Xin Zhao
  • Ran He
  • Kaiqi Huang

Pedestrian attribute recognition in surveillance is a challenging task in computer vision due to significant pose variation, viewpoint change and poor image quality. To achieve effective recognition, this paper presents a graph-based global reasoning framework to jointly model potential visual-semantic relations of attributes and distill auxiliary human parsing knowledge to guide the relational learning. The reasoning framework models attribute groups on a graph and learns a projection function to adaptively assign local visual features to the nodes of the graph. After feature projection, graph convolution is utilized to perform global reasoning between the attribute groups to model their mutual dependencies. Then, the learned node features are projected back to visual space to facilitate knowledge transfer. An additional regularization term is proposed by distilling human parsing knowledge from a pre-trained teacher model to enhance feature representations. The proposed framework is verified on three large-scale pedestrian attribute datasets including PETA, RAP, and PA-100k. Experiments show that our method achieves state-of-the-art results.

AAAI Conference 2019 Conference Paper

Visual-Semantic Graph Reasoning for Pedestrian Attribute Recognition

  • Qiaozhe Li
  • Xin Zhao
  • Ran He
  • Kaiqi Huang

Pedestrian attribute recognition in surveillance is a challenging task due to poor image quality, significant appearance variations and diverse spatial distribution of different attributes. This paper treats pedestrian attribute recognition as a sequential attribute prediction problem and proposes a novel visual-semantic graph reasoning framework to address this problem. Our framework contains a spatial graph and a directed semantic graph. By performing reasoning using the Graph Convolutional Network (GCN), one graph captures spatial relations between regions and the other learns potential semantic relations between attributes. An end-to-end architecture is presented to perform mutual embedding between these two graphs to guide the relational learning for each other. We verify the proposed framework on three large-scale pedestrian attribute datasets including PETA, RAP, and PA-100k. Experiments show the superiority of the proposed method over state-of-the-art methods and the effectiveness of our joint GCN structures for sequential attribute prediction.

AAAI Conference 2018 Conference Paper

Deep Semantic Structural Constraints for Zero-Shot Learning

  • Yan Li
  • Zhen Jia
  • Junge Zhang
  • Kaiqi Huang
  • Tieniu Tan

Zero-shot learning aims to classify unseen image categories by learning a visual-semantic embedding space. In most cases, the traditional methods adopt a separated two-step pipeline that extracts image features from pre-trained CNN models. Then the fixed image features are utilized to learn the embedding space. It leads to the lack of specific structural semantic information of image features for zero-shot learning task. In this paper, we propose an end-to-end trainable Deep Semantic Structural Constraints model to address this issue. The proposed model contains the Image Feature Structure constraint and the Semantic Embedding Structure constraint, which aim to learn structure-preserving image features and endue the learned embedding space with stronger generalization ability respectively. With the assistance of semantic structural information, the model gains more auxiliary clues for zero-shot learning. The state-of-the-art performance certifies the effectiveness of our proposed method.

IJCAI Conference 2018 Conference Paper

Densely Cascaded Shadow Detection Network via Deeply Supervised Parallel Fusion

  • Yupei Wang
  • Xin Zhao
  • Yin Li
  • Xuecai Hu
  • Kaiqi Huang

Shadow detection is an important and challenging problem in computer vision. Recently, single image shadow detection has achieved major progress with the development of deep convolutional networks. However, existing methods are still vulnerable to background clutter, and often fail to capture the global context of an input image. These global contextual and semantic cues are essential for accurately localizing the shadow regions. Moreover, rich spatial details are required to segment shadow regions with precise shapes. To this end, this paper presents a novel model characterized by a deeply supervised parallel fusion (DSPF) network and a densely cascaded learning scheme. The DSPF network achieves a comprehensive fusion of global semantic cues and local spatial details by multiple stacked parallel fusion branches, which are learned in a deeply supervised manner. Moreover, the densely cascaded learning scheme is employed to refine the spatial details. Our method is evaluated on two widely used shadow detection benchmarks. Experimental results show that our method outperforms the state of the art by a large margin.

AAAI Conference 2018 Conference Paper

DF2Net: Discriminative Feature Learning and Fusion Network for RGB-D Indoor Scene Classification

  • Yabei Li
  • Junge Zhang
  • Yanhua Cheng
  • Kaiqi Huang
  • Tieniu Tan

This paper focuses on the task of RGB-D indoor scene classification. It is a very challenging task for two reasons: 1) learning robust representations for indoor scenes is difficult because of the variety of objects and layouts; 2) fusing the complementary cues in RGB and Depth is nontrivial, since there are large semantic gaps between the two modalities. Most existing works learn representations for classification by training a deep network with a softmax loss and fuse the two modalities by simply concatenating their features. However, these pipelines do not explicitly consider intra-class and inter-class similarity, or inter-modal intrinsic relationships. To address these problems, this paper proposes a Discriminative Feature Learning and Fusion Network (DF2Net) with two-stage training. In the first stage, to better represent scenes in each modality, a deep multi-task network is constructed to simultaneously minimize the structured loss and the softmax loss. In the second stage, we design a novel discriminative fusion network which is able to learn correlative features of multiple modalities and distinctive features of each modality. Extensive analysis and experiments on the SUN RGB-D Dataset and NYU Depth Dataset V2 show the superiority of DF2Net over other state-of-the-art methods in the RGB-D indoor scene classification task.

AAAI Conference 2017 Conference Paper

A Multi-Task Deep Network for Person Re-Identification

  • Weihua Chen
  • Xiaotang Chen
  • Jianguo Zhang
  • Kaiqi Huang

Person re-identification (ReID) focuses on identifying people across different scenes in video surveillance, which is usually formulated as a binary classification task or a ranking task in current person ReID approaches. In this paper, we take both tasks into account and propose a multi-task deep network (MTDnet) that makes use of their own advantages and jointly optimize the two tasks simultaneously for person ReID. To the best of our knowledge, we are the first to integrate both tasks in one network to solve the person ReID. We show that our proposed architecture significantly boosts the performance. Furthermore, deep architecture in general requires a sufficient dataset for training, which is usually not met in person ReID. To cope with this situation, we further extend the MTDnet and propose a cross-domain architecture that is capable of using an auxiliary set to assist training on small target sets. In the experiments, our approach outperforms most of existing person ReID algorithms on representative datasets including CUHK03, CUHK01, VIPeR, iLIDS and PRID2011, which clearly demonstrates the effectiveness of the proposed approach.

IJCAI Conference 2016 Conference Paper

FastLCD: Fast Label Coordinate Descent for the Efficient Optimization of 2D Label MRFs

  • Kangwei Liu
  • Junge Zhang
  • Peipei Yang
  • Kaiqi Huang

Recently, MRFs with two-dimensional (2D) labels have proved useful to many applications, such as image matching and optical flow estimation. Due to the huge 2D label set in these problems, existing optimization algorithms tend to be slow for the inference of 2D label MRFs, and this greatly limits the practical use of 2D label MRFs. To solve the problem, this paper presents an efficient algorithm, named FastLCD. Unlike previous popular move-making algorithms (e.g., α-expansion) that visit all the labels exhaustively in each step, FastLCD optimizes the 2D label MRFs by performing label coordinate descents alternately in horizontal, vertical and diagonal directions, and in this way, it does not need to visit all the labels exhaustively. FastLCD greatly reduces the search space of the label set and benefits from a lower time complexity. Experimental results show that FastLCD is much faster, while it still yields high-quality results.

IJCAI Conference 2016 Conference Paper

Semi-Supervised Multimodal Deep Learning for RGB-D Object Recognition

  • Yanhua Cheng
  • Xin Zhao
  • Rui Cai
  • Zhiwei Li
  • Kaiqi Huang
  • Yong Rui

This paper studies the problem of RGB-D object recognition. Inspired by the great success of deep convolutional neural networks (DCNN) in AI, researchers have tried to apply it to improve the performance of RGB-D object recognition. However, DCNN always requires a large-scale annotated dataset to supervise its training. Manually labeling such a large RGB-D dataset is expensive and time consuming, which prevents DCNN from quickly promoting this research area. To address this problem, we propose a semi-supervised multimodal deep learning framework to train DCNN effectively based on very limited labeled data and massive unlabeled data. The core of our framework is a novel diversity preserving co-training algorithm, which can successfully guide DCNN to learn from the unlabeled RGB-D data by making full use of the complementary cues of the RGB and depth data in object representation. Experiments on the benchmark RGB-D dataset demonstrate that, with only 5% labeled training data, our approach achieves competitive performance for object recognition compared with those state-of-the-art results reported by fully-supervised methods.