Arrow Research

Author name cluster

Chen Gao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

21 papers
2 author rows

Possible papers (21)

AAAI Conference 2026 Conference Paper

AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning

  • Jirong Zha
  • Yuxuan Fan
  • Tianyu Zhang
  • Geng Chen
  • Yingfeng Chen
  • Chen Gao
  • Xinlei Chen

Multimodal Large Language Models (MLLMs) have shown promise in single-agent vision tasks, yet benchmarks for evaluating multi-agent collaborative perception remain scarce. This gap is critical, as multi-drone systems provide enhanced coverage, robustness, and collaboration compared to single-sensor setups. Existing multi-image benchmarks mainly target basic perception tasks using high-quality single-agent images, thus failing to evaluate MLLMs in more complex, egocentric collaborative scenarios, especially under real-world degraded perception conditions. To address these challenges, we introduce AirCopBench, the first comprehensive benchmark designed to evaluate MLLMs in embodied aerial collaborative perception under challenging perceptual conditions. AirCopBench includes 14.6k+ questions derived from both simulator and real-world data, spanning four key task dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision, across 14 task types. We construct the benchmark using data from challenging degraded-perception scenarios with annotated collaborative events, generating large-scale questions through model-, rule-, and human-based methods under rigorous quality control. Evaluations on 40 MLLMs show significant performance gaps in collaborative perception tasks, with the best model trailing humans by 24.38% on average and exhibiting inconsistent results across tasks. Fine-tuning experiments further confirm the feasibility of sim-to-real transfer in aerial collaborative perception.

AAAI Conference 2026 Conference Paper

DIMM: Decoupled Multi-hierarchy Kalman Filter via Reinforcement Learning

  • Jirong Zha
  • Yuxuan Fan
  • Kai Li
  • Han Li
  • Chen Gao
  • Xinlei Chen

State estimation is challenging for target tracking with high maneuverability, as the target's state transition function changes rapidly, irregularly, and is unknown to the estimator. Existing work based on interacting multiple model (IMM) achieves more accurate estimation than single-filter approaches through model combination, aligning appropriate models for different motion modes of the target over time. However, two limitations of conventional IMM remain unsolved. First, the solution space of the model combination is constrained as the target's diverse kinematic properties in different directions are ignored. Second, the model combination weights calculated by the observation likelihood are not accurate enough due to the measurement uncertainty. In this paper, we propose a novel framework, DIMM, to effectively combine estimates from different motion models in each direction, thus increasing the target tracking accuracy. First, DIMM extends the model combination solution space of conventional IMM from a hyperplane to a hypercube by designing a 3D-decoupled multi-hierarchy filter bank, which describes the target's motion with various-order linear models. Second, DIMM generates more reliable combination weight matrices through a differentiable adaptive fusion network for importance allocation rather than solely relying on the observation likelihood; it contains an attention-based twin delayed deep deterministic policy gradient (TD3) method with a hierarchical reward. Experiments demonstrate that DIMM significantly improves the tracking accuracy of existing state estimation methods by 31.61% to 99.23%.
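
For readers new to IMM, the sketch below shows the basic likelihood-weighted model combination and the per-axis decoupled variant that motivates DIMM's hypercube solution space. All names and weights are illustrative; DIMM learns its weight matrices with a TD3-based fusion network rather than computing them this way.

```python
import numpy as np

# Minimal sketch of an IMM-style combination step. DIMM replaces the
# likelihood-based weights below with a weight matrix produced by a learned
# (TD3-based) fusion network; names here are illustrative, not the paper's code.

def imm_combine(estimates, likelihoods, priors):
    """Combine per-model state estimates with likelihood-based weights.

    estimates:   (M, D) array, one state estimate per motion model
    likelihoods: (M,) observation likelihood of each model
    priors:      (M,) prior model probabilities
    """
    w = likelihoods * priors
    w = w / w.sum()                      # posterior model probabilities
    return w @ estimates                 # weighted fusion, shape (D,)

# A decoupled variant assigns an independent weight vector to each axis,
# enlarging the solution space from a hyperplane to a hypercube as in DIMM.
def decoupled_combine(estimates, weights):
    """estimates: (M, D); weights: (M, D), normalized per axis."""
    weights = weights / weights.sum(axis=0, keepdims=True)
    return (weights * estimates).sum(axis=0)

est = np.array([[1.0, 2.0, 0.5], [1.2, 1.8, 0.7]])   # two models, 3 axes
print(imm_combine(est, np.array([0.6, 0.4]), np.array([0.5, 0.5])))
print(decoupled_combine(est, np.array([[0.9, 0.2, 0.5], [0.1, 0.8, 0.5]])))
```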

AAAI Conference 2026 Conference Paper

SmartAgent: Chain-of-User-Thought for Embodied Personalized Agent in Cyber World

  • Jiaqi Zhang
  • Chen Gao
  • Liyuan Zhang
  • Quoc Viet Hung Nguyen
  • Hongzhi Yin

Recent embodied agents with multimodal perception and reasoning capabilities based on large vision-language models (LVLMs) excel at autonomously interacting with either real or cyber worlds, helping people make intelligent decisions in complex environments. However, current works are normally optimized by golden action trajectories or ideal task-oriented solutions toward a definitive goal. This paradigm considers limited user-oriented factors, which could be the reason for their performance reduction in a wide range of personal assistant applications. To address this, we propose Chain-of-User-Thought (COUT), a novel embodied reasoning paradigm that takes a chain of thought from basic action thinking to explicit and implicit personalized preference thought, incorporating personalized factors into autonomous agent learning. The main challenges of achieving COUT include: 1) defining embodied personalized tasks, 2) building an embodied environment that epitomizes personalized preferences, and 3) modeling embodied personalized actions. To target COUT, we introduce SmartAgent, an agent framework that perceives cyber environments and reasons about personalized requirements by: 1) interacting with GUIs to access an item pool, 2) generating users' explicit requirements implied by previous actions, and 3) recommending items to fulfill users' implicit requirements. To demonstrate SmartAgent's capabilities, we also create a brand-new dataset, SmartSpot, that offers a full-stage personalized action-involved environment. To the best of our knowledge, our work is the first to formulate the COUT process, serving as a preliminary attempt towards embodied personalized agent learning. Our extensive experiments on SmartSpot illuminate SmartAgent's functionality across a series of embodied and personalized sub-tasks.

AAAI Conference 2026 Conference Paper

Towards Autonomous UAV Visual Object Search in City Space: Benchmark and Agentic Methodology

  • Yatai Ji
  • Zhengqiu Zhu
  • Yong Zhao
  • Beidan Liu
  • Chen Gao
  • Yihao Zhao
  • Sihang Qiu
  • Yue Hu

Aerial Visual Object Search (AVOS) tasks in urban environments require Unmanned Aerial Vehicles (UAVs) to autonomously search for and identify target objects based on visual inputs without external guidance. Existing approaches struggle in complex urban environments due to redundant semantic processing, similar object ambiguity, and the exploration-exploitation dilemma. To advance research and support the AVOS task, we introduce CityAVOS, the first benchmark dataset for autonomous search of static urban objects. It features 2,420 tasks of varying difficulty across six object categories, designed to rigorously evaluate UAV search strategies. To solve the AVOS task, we also propose PRPSearcher (Perception-Reasoning-Planning Searcher), a novel agentic method powered by multi-modal large language models (MLLMs) that enables a UAV agent to think and reason like humans on visual cues when searching for objects. Specifically, PRPSearcher constructs three specialized maps: an object-centric dynamic semantic map enhancing spatial perception, a 3D cognitive map based on semantic "attraction" values for target reasoning, and a 3D uncertainty map for balanced exploration-exploitation search. Moreover, we propose a denoising mechanism to mitigate interference from similar objects and design an Inspiration Promote Thought prompting mechanism for adaptive action planning. Experimental results on CityAVOS demonstrate that PRPSearcher surpasses existing baselines in both success rate and search efficiency (on average: +37.69% SR, +28.96% SPL, -30.69% MSS, and -46.40% NE). Our work paves the way for future advances in embodied visual target search.
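
The exploration-exploitation fusion of the attraction and uncertainty maps resembles an upper-confidence-bound rule; a toy voxel-grid version is sketched below with made-up weights and map contents, not the paper's actual planner.

```python
import numpy as np

# Illustrative fusion of PRPSearcher's semantic attraction and uncertainty
# maps into a single search score. An upper-confidence-bound-style sum is a
# standard way to trade off exploitation and exploration; beta and the map
# values here are placeholders.

def next_waypoint(attraction, uncertainty, beta=0.3):
    """attraction, uncertainty: (X, Y, Z) voxel grids in [0, 1]."""
    score = attraction + beta * uncertainty     # exploit + explore
    return np.unravel_index(np.argmax(score), score.shape)

att = np.random.rand(10, 10, 5)
unc = np.random.rand(10, 10, 5)
print(next_waypoint(att, unc))   # voxel index to fly toward next
```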

NeurIPS Conference 2025 Conference Paper

Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization

  • Kaiyuan Li
  • Xiaoyue Chen
  • Chen Gao
  • Yong Li
  • Xinlei Chen

Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens results in significant computational overhead, and the use of dynamic high-resolution inputs further increases this burden. Previous approaches have attempted to reduce the number of image tokens through token pruning, typically by selecting tokens based on attention scores or image token diversity. Through empirical studies, we observe that existing methods often overlook the joint impact of pruning on both the current layer's output (local) and the outputs of subsequent layers (global), leading to suboptimal pruning decisions. To address this challenge, we propose Balanced Token Pruning (BTP), a plug-and-play method for pruning vision tokens. Specifically, our method utilizes a small calibration set to divide the pruning process into multiple stages. In the early stages, our method emphasizes the impact of pruning on subsequent layers, whereas in the deeper stages, the focus shifts toward preserving the consistency of local outputs. Extensive experiments across various LVLMs demonstrate the broad effectiveness of our approach on multiple benchmarks. Our method achieves a 78% compression rate while preserving 96.7% of the original models' performance on average. Our code is available at https://github.com/EmbodiedCity/NeurIPS2025-Balanced-Token-Pruning.
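
As context for the abstract, here is a minimal sketch of the single-layer attention-score pruning baseline that BTP improves on; BTP's staged local/global criterion is not reproduced, and the function names are ours, not from the released code.

```python
import torch

# Illustrative top-k vision-token pruning: keep the tokens that receive the
# most attention, drop the rest. This is the "local" criterion common in
# prior work; BTP additionally accounts for effects on subsequent layers.

def prune_vision_tokens(hidden, attn, keep_ratio=0.25):
    """hidden: (B, N, D) vision tokens; attn: (B, N) attention each token
    receives from the query/text tokens. Keeps the top keep_ratio fraction."""
    B, N, D = hidden.shape
    k = max(1, int(N * keep_ratio))
    idx = attn.topk(k, dim=1).indices              # (B, k) best tokens
    idx = idx.sort(dim=1).values                   # preserve spatial order
    return hidden.gather(1, idx.unsqueeze(-1).expand(B, k, D))

h = torch.randn(2, 576, 1024)            # e.g. 24x24 image patches
a = torch.rand(2, 576)
print(prune_vision_tokens(h, a).shape)   # torch.Size([2, 144, 1024])
```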

IJCAI Conference 2025 Conference Paper

How to Enable LLM with 3D Capacity? A Survey of Spatial Reasoning in LLM

  • Jirong Zha
  • Yuxuan Fan
  • Xiao Yang
  • Chen Gao
  • Xinlei Chen

3D spatial understanding is essential in real-world applications such as robotics, autonomous vehicles, virtual reality, and medical imaging. Recently, Large Language Models (LLMs), having demonstrated remarkable success across various domains, have been leveraged to enhance 3D understanding tasks, showing potential to surpass traditional computer vision methods. In this survey, we present a comprehensive review of methods integrating LLMs with 3D spatial understanding. We propose a taxonomy that categorizes existing methods into three branches: image-based methods deriving 3D understanding from 2D visual data, point cloud-based methods working directly with 3D representations, and hybrid modality-based methods combining multiple data streams. We systematically review representative methods along these categories, covering data representations, architectural modifications, and training strategies that bridge textual and 3D modalities. Finally, we discuss current limitations, including dataset scarcity and computational challenges, while highlighting promising research directions in spatial perception, multi-modal fusion, and real-world applications.

AAAI Conference 2025 Conference Paper

Iterative Sparse Attention for Long-sequence Recommendation

  • Guanyu Lin
  • Jinwei Luo
  • Yinfeng Li
  • Chen Gao
  • Qun Luo
  • Depeng Jin

Longer historical behaviors often improve recommendation accuracy but bring efficiency problems. As sequences get longer, two main challenges remain unaddressed: (1) efficient modeling under increasing sequence length and (2) interest drifting within historical items. In this paper, we propose Iterative Sparse Attention for Long-sequence Recommendation (ISA), with a Sparse Attention Layer and an Iterative Attention Layer, to efficiently capture sequential patterns and expand the receptive field of each historical item. We take the pioneering step of addressing the efficiency and interest-drifting challenges of long-sequence recommendation simultaneously. Theoretical analysis shows that our proposed iterative method can approximate full attention efficiently. Experiments on two real-world datasets show the superiority of our proposed method against state-of-the-art baselines.
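
To make the sparse-attention idea concrete, a minimal top-k sparse attention pass is sketched below; ISA's Iterative Attention Layer, which restores an effectively full receptive field, is omitted, and the names are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Minimal top-k sparse attention: each query attends only to its topk
# highest-scoring keys, cutting cost on long behavior sequences.

def topk_sparse_attention(q, k, v, topk=32):
    """q: (B, Lq, D); k, v: (B, Lk, D). Each query attends to topk keys."""
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # (B, Lq, Lk)
    kth = scores.topk(topk, dim=-1).values[..., -1:]        # k-th largest score
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(4, 1, 64)          # one target query per user
kv = torch.randn(4, 2000, 64)      # 2000 historical items
print(topk_sparse_attention(q, kv, kv).shape)   # torch.Size([4, 1, 64])
```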

AAAI Conference 2025 Conference Paper

MIA-Tuner: Adapting Large Language Models as Pre-training Text Detector

  • Wenjie Fu
  • Huandong Wang
  • Chen Gao
  • Guanghua Liu
  • Yong Li
  • Tao Jiang

The increasing parameters and expansive datasets of large language models (LLMs) highlight the urgent demand for a technical solution to audit the underlying privacy risks and copyright issues associated with LLMs. Existing studies have partially addressed this need by exploring the pre-training data detection problem, an instance of a membership inference attack (MIA). This problem involves determining whether a given piece of text has been used during the pre-training phase of the target LLM. Although existing methods have designed various sophisticated MIA score functions that achieve considerable detection performance on pre-trained LLMs, achieving high-confidence detection and performing MIA on aligned LLMs remain challenging. In this paper, we propose MIA-Tuner, a novel instruction-based MIA method that instructs LLMs themselves to serve as a more precise pre-training data detector internally, rather than designing an external MIA score function. Furthermore, we design two instruction-based safeguards to respectively mitigate the privacy risks brought by existing methods and by MIA-Tuner. To comprehensively evaluate the most recent state-of-the-art LLMs, we collect a more up-to-date MIA benchmark dataset, named WIKIMIA-24, to replace the widely adopted benchmark WIKIMIA. We conduct extensive experiments across various aligned and unaligned LLMs over the two benchmark datasets. The results demonstrate that MIA-Tuner increases the AUC of MIAs from 0.7 to a significantly higher level of 0.9.

TIST Journal 2025 Journal Article

Mobility Data-Driven Privacy-Preserving Model for Detecting High-Risk Infection Cases

  • Wenjie Fu
  • Huandong Wang
  • Chen Gao
  • Guanghua Liu
  • Yong Li
  • Tao Jiang

In the past few years, infectious diseases like COVID-19 have caused serious distress to global society and the economy. To prevent their spread, the early detection and assessment of infectious diseases through molecular tests or antigen testing of bodily fluids has incurred countless labor and material costs. Fortunately, with the rapid development of mobile localization and web techniques, massive collected mobile trajectory data provide a promising solution for detecting positive cases. However, existing mobility data-driven infection case detection methods are limited in modeling the complicated epidemic spreading process and in preserving user privacy of the mobility data. In this article, we propose a novel graph convolutional network (GCN) model for detecting high-risk infection cases, in which we incorporate a spatio-temporal hypergraph to model the complex interactions of individuals. We then design a privacy-preserving framework tightly coupled with the structure of the spatio-temporal hypergraph, which includes a mobility data obfuscation module to protect privacy and an accompanying confidence-aware mechanism to mitigate the consequent performance decline. Moreover, we introduce a causal propagation mechanism to further guarantee the temporal dependency and causal effect of feature propagation in our spatio-temporal hypergraph, covering both the causal transform of node features and the causal gathering of edge features. Finally, extensive experiments on a large mobility dataset collected from location-based services (LBS) show that the proposed model improves infection case detection performance by at least 12.47% compared with several widely adopted baselines. Our code and datasets are available at https://github.com/wjfu99/EPI-HGNN.
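
For orientation, the node-to-hyperedge-to-node propagation that such a spatio-temporal hypergraph builds on can be sketched generically (here a hyperedge would connect users co-located in one place and time slot); this is a textbook hypergraph convolution, not the paper's full model.

```python
import torch

# Generic hypergraph message passing: aggregate user features into each
# hyperedge, then scatter back to users. H is the incidence matrix.

def hypergraph_conv(X, H, W):
    """X: (N, D) user features; H: (N, E) incidence; W: (D, Dout) weights."""
    edge = H.T @ X                                        # users -> hyperedges
    edge = edge / H.sum(0, keepdim=True).T.clamp(min=1)   # mean over members
    out = H @ edge                                        # hyperedges -> users
    out = out / H.sum(1, keepdim=True).clamp(min=1)
    return torch.relu(out @ W)

H = (torch.rand(100, 20) > 0.8).float()   # 100 users, 20 place-time hyperedges
X = torch.randn(100, 16)
print(hypergraph_conv(X, H, torch.randn(16, 16)).shape)   # torch.Size([100, 16])
```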

NeurIPS Conference 2025 Conference Paper

PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer

  • Zhiwei Yang
  • Chen Gao
  • Mike Zheng Shou

Video anomaly detection (VAD) is a critical yet challenging task due to the complex and diverse nature of real-world scenarios. Previous methods typically rely on domain-specific training data and manual adjustments when applied to new scenarios and unseen anomaly types, suffering from high labor costs and limited generalization. Therefore, we aim to achieve generalist VAD, i.e., automatically handling any scene and any anomaly type without training data or human involvement. In this work, we propose PANDA, an agentic AI engineer based on MLLMs. Specifically, we achieve PANDA by comprehensively devising four key capabilities: (1) self-adaptive scene-aware strategy planning, (2) goal-driven heuristic reasoning, (3) tool-augmented self-reflection, and (4) self-improving chain-of-memory. Concretely, we develop a self-adaptive scene-aware RAG mechanism, enabling PANDA to retrieve anomaly-specific knowledge for anomaly detection strategy planning. Next, we introduce a latent anomaly-guided heuristic prompt strategy to enhance reasoning precision. Furthermore, PANDA employs a progressive reflection mechanism alongside a suite of context-aware tools to iteratively refine decision-making in complex scenarios. Finally, a chain-of-memory mechanism enables PANDA to leverage historical experiences for continual performance improvement. Extensive experiments demonstrate that PANDA achieves state-of-the-art performance in multi-scenario, open-set, and complex-scenario settings without training or manual involvement, validating its generalizable and robust anomaly detection capability. Code is released at https://github.com/showlab/PANDA.

NeurIPS Conference 2025 Conference Paper

RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation

  • Songhao Han
  • Boxiang Qiu
  • Yue Liao
  • Siyuan Huang
  • Chen Gao
  • Shuicheng Yan
  • Si Liu

Recent advances in vision-language models (VLMs) have enabled instruction-conditioned robotic systems with improved generalization. However, most existing work focuses on reactive System 1 policies, underutilizing VLMs’ strengths in semantic reasoning and long-horizon planning. These System 2 capabilities—characterized by deliberative, goal-directed thinking—remain underexplored due to the limited temporal scale and structural complexity of current benchmarks. To address this gap, we introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. RoboCerebra includes: (1) a large-scale simulation dataset with extended task horizons and diverse subtask sequences in household environments; (2) a hierarchical framework combining a high-level VLM planner with a low-level vision-language-action (VLA) controller; and (3) an evaluation protocol targeting planning, reflection, and memory through structured System 1–System 2 interaction. The dataset is constructed via a top-down pipeline, where GPT generates task instructions and decomposes them into subtask sequences. Human operators execute the subtasks in simulation, yielding high-quality trajectories with dynamic object variations. Compared to prior benchmarks, RoboCerebra features significantly longer action sequences and denser annotations. We further benchmark state-of-the-art VLMs as System 2 modules and analyze their performance across key cognitive dimensions, advancing the development of more capable and generalizable robotic planners.

NeurIPS Conference 2025 Conference Paper

RoboScape: Physics-informed Embodied World Model

  • Yu Shang
  • Xin Zhang
  • Yinzhou Tang
  • Lei Jin
  • Chen Gao
  • Wei Wu
  • Yong Li

World models have become indispensable tools for embodied intelligence, serving as powerful simulators capable of generating realistic robotic videos while addressing critical data scarcity challenges. However, current embodied world models exhibit limited physical awareness, particularly in modeling 3D geometry and motion dynamics, resulting in unrealistic video generation for contact-rich robotic scenarios. In this paper, we present RoboScape, a unified physics-informed world model that jointly learns RGB video generation and physics knowledge within an integrated framework. We introduce two key physics-informed joint training tasks: temporal depth prediction that enhances 3D geometric consistency in video rendering, and keypoint dynamics learning that implicitly encodes physical properties (e.g., object shape and material characteristics) while improving complex motion modeling. Extensive experiments demonstrate that RoboScape generates videos with superior visual fidelity and physical plausibility across diverse robotic scenarios. We further validate its practical utility through downstream applications including robotic policy training with generated data and policy evaluation. Our work provides new insights for building efficient physics-informed world models to advance embodied intelligence research. Our code and demos are available at: https://github.com/tsinghua-fib-lab/RoboScape.
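
A hedged sketch of what physics-informed joint training can look like in code: a weighted sum of the video objective and the two auxiliary tasks named in the abstract. The loss choices and weights below are placeholders, not the authors' configuration.

```python
import torch
import torch.nn.functional as F

# Joint objective combining video generation with the two auxiliary
# physics-informed tasks from the abstract: temporal depth prediction
# (3D geometry) and keypoint dynamics (motion). Weights are illustrative.

def joint_loss(pred_rgb, gt_rgb, pred_depth, gt_depth,
               pred_kpts, gt_kpts, w_depth=0.5, w_kpt=0.5):
    l_video = F.mse_loss(pred_rgb, gt_rgb)     # main video objective
    l_depth = F.l1_loss(pred_depth, gt_depth)  # 3D geometric consistency
    l_kpt = F.mse_loss(pred_kpts, gt_kpts)     # motion dynamics
    return l_video + w_depth * l_depth + w_kpt * l_kpt

loss = joint_loss(torch.randn(2, 8, 3, 64, 64), torch.randn(2, 8, 3, 64, 64),
                  torch.randn(2, 8, 1, 64, 64), torch.randn(2, 8, 1, 64, 64),
                  torch.randn(2, 8, 17, 2), torch.randn(2, 8, 17, 2))
```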

NeurIPS Conference 2025 Conference Paper

Towards Realistic Earth-Observation Constellation Scheduling: Benchmark and Methodology

  • Luting Wang
  • Yinghao Xiang
  • Hongliang Huang
  • Dongjun Li
  • Chen Gao
  • Si Liu

Agile Earth Observation Satellite (AEOS) constellations offer unprecedented flexibility for monitoring the Earth's surface, but their scheduling remains challenging under large-scale scenarios, dynamic environments, and stringent constraints. Existing methods often simplify these complexities, limiting their real-world performance. We address this gap with a unified framework integrating a standardized benchmark suite and a novel scheduling model. Our benchmark suite, AEOS-Bench, contains 3,907 finely tuned satellite assets and 16,410 scenarios. Each scenario features 1 to 50 satellites and 50 to 300 imaging tasks. These scenarios are generated via a high-fidelity simulation platform, ensuring realistic satellite behavior such as orbital dynamics and resource constraints. Ground truth scheduling annotations are provided for each scenario. To our knowledge, AEOS-Bench is the first large-scale benchmark suite tailored for realistic constellation scheduling. Building upon this benchmark, we introduce AEOS-Former, a Transformer-based scheduling model that incorporates a constraint-aware attention mechanism. A dedicated internal constraint module explicitly models the physical and operational limits of each satellite. Through simulation-based iterative learning, AEOS-Former adapts to diverse scenarios, offering a robust solution for AEOS constellation scheduling. Experimental results demonstrate that AEOS-Former outperforms baseline models in task completion and energy efficiency, with ablation studies highlighting the contribution of each component. Code and data are provided at https://github.com/buaa-colalab/AEOSBench.
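
One common way to implement constraint-aware attention is to mask infeasible satellite-task pairs before the softmax, as sketched below; the boolean feasibility matrix here stands in for AEOS-Former's internal constraint module, and all names are ours.

```python
import torch

# Constraint-aware attention sketch: pairs violating a constraint (no
# visibility window, insufficient energy, ...) get -inf scores, so no
# attention mass reaches infeasible assignments.

def constrained_attention(q, k, feasible):
    """q: (S, D) satellite queries; k: (T, D) task keys;
    feasible: (S, T) bool, True where all constraints are satisfied."""
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~feasible, float("-inf"))
    return torch.softmax(scores, dim=-1)   # rows sum to 1 over feasible tasks

q, k = torch.randn(3, 32), torch.randn(7, 32)
mask = torch.rand(3, 7) > 0.3
mask[:, 0] = True                          # ensure each satellite has an option
print(constrained_attention(q, k, mask).shape)   # torch.Size([3, 7])
```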

ICRA Conference 2024 Conference Paper

Aerial Image-based Inter-day Registration for Precision Agriculture

  • Chen Gao
  • Franz Daxinger
  • Lukas Roth
  • Fabiola Maffra
  • Paul Beardsley
  • Margarita Chli
  • Lucas Teixeira

Satellite imagery has traditionally been used to collect crop statistics, but its low resolution and registration accuracy limit agricultural analytics to plant stand levels and large areas. Precision agriculture seeks analytic tools at near single plant level, and this work explores how to improve aerial photogrammetry to enable inter-day precision agriculture analytics for intervals of up to a month. Our work starts by presenting an accurately registered image time series, captured up to twice a week, by an unmanned aerial vehicle over a wheat crop field. The dataset is registered using photogrammetry aided by fiducial ground control points (GCPs). Unfortunately, GCPs severely disrupt crop management activities. To address this, we propose a novel inter-day registration approach that only relies once on GCPs, at the beginning of the season. The method utilises LoFTR [1], a state-of-the-art image-matching transformer. The original LoFTR network was trained using imagery of outdoor urban areas. One of our contributions is to extend LoFTR’s training method, which uses matching images of a static scene, to a dynamic scene of plants undergoing growth. Another contribution is a thorough evaluation of our registration method that integrates intraday crop reconstruction with earlier-day scans in a seven degree-of-freedom alignment. Experimental results show the advantage of our approach over other matching algorithms and demonstrate the importance of retraining using crop scenes, and a training method customised for growing crops, with an average registration error of 27 cm across a season.
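
For orientation, this is how an off-the-shelf LoFTR matcher is typically invoked (here via kornia's implementation with the stock "outdoor" weights, i.e., the urban-trained baseline the paper retrains on crop scenes); the confidence threshold and shapes are illustrative choices, not the paper's pipeline.

```python
import torch
import kornia.feature as KF

# Stock LoFTR matching between scans of the same field on different days.
# The paper retrains LoFTR for growing crops; this uses the public weights.

matcher = KF.LoFTR(pretrained="outdoor").eval()

def match_scans(img_day_a, img_day_b, min_conf=0.8):
    """Inputs: (1, 1, H, W) grayscale tensors with values in [0, 1]."""
    with torch.no_grad():
        out = matcher({"image0": img_day_a, "image1": img_day_b})
    keep = out["confidence"] > min_conf          # keep confident matches only
    return out["keypoints0"][keep], out["keypoints1"][keep]

# The matched pixel pairs then anchor the paper's seven degree-of-freedom
# alignment of the two days' reconstructions.
a, b = torch.rand(1, 1, 480, 640), torch.rand(1, 1, 480, 640)
kp0, kp1 = match_scans(a, b)
```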

NeurIPS Conference 2024 Conference Paper

Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration

  • Wenjie Fu
  • Huandong Wang
  • Chen Gao
  • Guanghua Liu
  • Yong Li
  • Tao Jiang

Membership inference attacks (MIAs) aim to infer whether a target data record has been used for model training. Existing MIAs designed for large language models (LLMs) can be bifurcated into two types: reference-free and reference-based attacks. Although reference-based attacks show promising performance by calibrating the probability measured on the target model with reference models, this illusion of privacy risk heavily depends on a reference dataset that closely resembles the training set. Both types of attacks are predicated on the hypothesis that training records consistently maintain a higher probability of being sampled. However, this hypothesis heavily relies on the overfitting of target models, which is mitigated by multiple regularization methods and the generalization of LLMs. These reasons thus lead to high false-positive rates of MIAs in practical scenarios. We propose a Membership Inference Attack based on Self-calibrated Probabilistic Variation (SPV-MIA). Specifically, we introduce a self-prompt approach, which constructs the dataset for fine-tuning the reference model by prompting the target LLM itself. In this manner, the adversary can collect a dataset with a similar distribution from public APIs. Furthermore, we introduce probabilistic variation, a more reliable membership signal based on LLM memorization rather than overfitting, from which we rediscover the neighbour attack with theoretical grounding. Comprehensive evaluation conducted on three datasets and four exemplary LLMs shows that SPV-MIA raises the AUC of MIAs from 0.7 to a significantly higher level of 0.9. Our code and dataset are available at https://github.com/tsinghua-fib-lab/NeurIPS2024_SPV-MIA.
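
The calibration backbone of such an attack can be sketched in a few lines: score a text under the target model and under a reference model, and take the difference. In SPV-MIA the reference model is fine-tuned on text sampled from the target LLM itself (the self-prompt step, omitted here), and raw log-likelihood is replaced by the probabilistic-variation signal; both simplifications, plus the stand-in gpt2 models, make this a sketch rather than the released attack.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def avg_log_likelihood(model, tok, text):
    ids = tok(text, return_tensors="pt").input_ids
    return -model(ids, labels=ids).loss.item()   # mean token log-likelihood

def calibrated_score(target, reference, tok, text):
    # Training members should score higher under the target than under the
    # reference; the difference calibrates away intrinsic text difficulty.
    return (avg_log_likelihood(target, tok, text)
            - avg_log_likelihood(reference, tok, text))

tok = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2").eval()       # stand-ins
reference = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()
print(calibrated_score(target, reference, tok, "The quick brown fox."))
```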

NeurIPS Conference 2021 Conference Paper

Mining the Benefits of Two-stage and One-stage HOI Detection

  • Aixi Zhang
  • Yue Liao
  • Si Liu
  • Miao Lu
  • Yongliang Wang
  • Chen Gao
  • Xiaobo Li

Two-stage methods have dominated Human-Object Interaction (HOI) detection for several years. Recently, one-stage HOI detection methods have become popular. In this paper, we aim to explore the essential pros and cons of two-stage and one-stage methods. With this as the goal, we find that conventional two-stage methods mainly struggle to locate positive interactive human-object pairs, while one-stage methods struggle to make an appropriate trade-off in multi-task learning, i.e., object detection and interaction classification. Therefore, a core problem is how to take the essence and discard the dregs of the two conventional types of methods. To this end, we propose a novel one-stage framework that disentangles human-object detection and interaction classification in a cascade manner. In detail, we first design a human-object pair generator based on a state-of-the-art one-stage HOI detector by removing the interaction classification module or head, and then design a relatively isolated interaction classifier to classify each human-object pair. The two cascade decoders in our proposed framework can each focus on one specific task, detection or interaction classification. In terms of the specific implementation, we adopt a transformer-based HOI detector as our base model. The newly introduced disentangling paradigm outperforms existing methods by a large margin, with a significant relative mAP gain of 9.32% on HICO-Det. The source codes are available at https://github.com/YueLiao/CDN.
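
A toy transformer version of the cascade-decoder idea: a pair decoder localizes human-object pairs, then a relatively isolated interaction decoder classifies each pair. Sizes, heads, and the 8-dim box output (two boxes of four coordinates) are illustrative, not CDN's actual configuration.

```python
import torch
import torch.nn as nn

class CascadeHOI(nn.Module):
    def __init__(self, d=256, n_queries=64, n_obj=80, n_verb=117):
        super().__init__()
        make = lambda: nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.pair_decoder = nn.TransformerDecoder(make(), num_layers=3)
        self.inter_decoder = nn.TransformerDecoder(make(), num_layers=3)
        self.queries = nn.Parameter(torch.randn(n_queries, d))
        self.box_head = nn.Linear(d, 8)        # human box + object box
        self.obj_head = nn.Linear(d, n_obj)    # object class per pair
        self.verb_head = nn.Linear(d, n_verb)  # interaction class per pair

    def forward(self, memory):                 # memory: (B, HW, d) image feats
        q = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        pair = self.pair_decoder(q, memory)        # task 1: detect H-O pairs
        inter = self.inter_decoder(pair, memory)   # task 2: classify interaction
        return self.box_head(pair), self.obj_head(pair), self.verb_head(inter)

boxes, objs, verbs = CascadeHOI()(torch.randn(2, 49, 256))
```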

NeurIPS Conference 2021 Conference Paper

Progressive Feature Interaction Search for Deep Sparse Network

  • Chen Gao
  • Yinfeng Li
  • Quanming Yao
  • Depeng Jin
  • Yong Li

Deep sparse networks (DSNs), whose crux is exploring high-order feature interactions, have become the state of the art for prediction tasks with high-sparsity features. However, these models suffer from low computational efficiency, including large model size and slow inference, which largely limits their application value. In this work, we approach this problem with neural architecture search, automatically searching the critical component in DSNs: the feature-interaction layer. We propose a distilled search space to cover the desired architectures with fewer parameters. We then develop a progressive search algorithm to search the space efficiently and to capture the order-priority property of sparse prediction tasks. Experiments on three real-world benchmark datasets show promising results for PROFIT in both accuracy and efficiency. Further studies validate the feasibility of our designed search space and search algorithm.

IJCAI Conference 2019 Conference Paper

DeepAPF: Deep Attentive Probabilistic Factorization for Multi-site Video Recommendation

  • Huan Yan
  • Xiangning Chen
  • Chen Gao
  • Yong Li
  • Depeng Jin

Existing web video systems recommend videos according to users' viewing history on their own websites. However, since many users watch videos on multiple websites, this approach fails to capture these users' interests across sites. In this paper, we investigate user viewing behavior across multiple sites based on a large-scale real dataset. We find that user interests comprise a cross-site consistent part and a site-specific part with different degrees of importance. Existing linear matrix-factorization recommendation models are limited in modeling such complicated interactions. Thus, we propose Deep Attentive Probabilistic Factorization (DeepAPF), which exploits deep learning to approximate such complex user-video interactions. DeepAPF captures both cross-site common interests and site-specific interests, with non-uniform importance weights learned by an attentional network. Extensive experiments show that our proposed model outperforms three state-of-the-art baselines by 17.62%, 7.9%, and 8.1%, respectively. Our study provides insight into integrating user viewing records from multiple sites via a trusted third party, which gains mutual benefits in video recommendation.
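
The attention-weighted fusion of cross-site and site-specific interests at DeepAPF's core can be sketched as follows; dimensions and the scoring layer are illustrative, and the paper's full probabilistic factorization sits on top of this fused representation.

```python
import torch
import torch.nn as nn

# Attentive fusion of two user-interest vectors with learned, non-uniform
# importance weights, in the spirit of DeepAPF's attentional network.

class AttentiveFusion(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # attention over the two interest parts

    def forward(self, common, site_specific):
        parts = torch.stack([common, site_specific], dim=1)   # (B, 2, D)
        w = torch.softmax(self.score(parts), dim=1)           # (B, 2, 1)
        return (w * parts).sum(dim=1)                         # fused (B, D)

fuse = AttentiveFusion()
user = fuse(torch.randn(4, 32), torch.randn(4, 32))
video = torch.randn(4, 32)
pred = (user * video).sum(-1)    # dot-product preference score per pair
```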

NeurIPS Conference 2019 Conference Paper

Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition

  • Jinwoo Choi
  • Chen Gao
  • Joseph C. E. Messou
  • Jia-Bin Huang

Human activities often occur in specific scene contexts, e.g., playing basketball on a basketball court. Training a model using existing video datasets thus inevitably captures and leverages such bias (instead of using the actual discriminative cues). The learned representation may not generalize well to new action classes or different tasks. In this paper, we propose to mitigate scene bias for video representation learning. Specifically, we augment the standard cross-entropy loss for action classification with 1) an adversarial loss for scene types and 2) a human mask confusion loss for videos where the human actors are masked out. These two losses encourage learning representations that are unable to predict the scene types and the correct actions when there is no evidence. We validate the effectiveness of our method by transferring our pre-trained model to three different tasks, including action classification, temporal localization, and spatio-temporal action detection. Our results show consistent improvement over the baseline model without debiasing.
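
The two auxiliary losses can be sketched directly from the abstract: a gradient-reversal adversarial scene loss and a uniform-target confusion loss on human-masked clips. Module names and weights below are ours, not the authors' code.

```python
import torch
import torch.nn as nn

# Gradient reversal: the scene head trains normally, but the backbone
# receives negated gradients, making features uninformative about scenes.
class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, g):
        return -g

def debias_loss(feat, feat_masked, action_head, scene_head, y_action, y_scene,
                lambda_adv=0.5, lambda_conf=0.5):
    ce = nn.functional.cross_entropy
    l_action = ce(action_head(feat), y_action)             # standard loss
    l_scene = ce(scene_head(GradReverse.apply(feat)), y_scene)  # adversarial
    # Confusion: with the human masked out, action predictions should be
    # uniform, i.e., cross-entropy against a uniform target.
    logp = nn.functional.log_softmax(action_head(feat_masked), dim=1)
    l_conf = -logp.mean()
    return l_action + lambda_adv * l_scene + lambda_conf * l_conf

feat, feat_m = torch.randn(8, 512), torch.randn(8, 512)
a_head, s_head = nn.Linear(512, 101), nn.Linear(512, 10)
loss = debias_loss(feat, feat_m, a_head, s_head,
                   torch.randint(0, 101, (8,)), torch.randint(0, 10, (8,)))
loss.backward()
```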