Arrow Research search

Author name cluster

Xiaojie Jin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

19 papers
2 author rows

Possible papers

19

NeurIPS Conference 2025 Conference Paper

A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis

  • Dongheng Lin
  • Mengxue Qu
  • Kunyang Han
  • Jianbo Jiao
  • Xiaojie Jin
  • Yunchao Wei

Most video-anomaly research stops at frame-wise detection, offering little insight into why an event is abnormal, typically outputting only frame-wise anomaly scores without spatial or semantic context. Recent video anomaly localization and video anomaly understanding methods improve explainability but remain data-dependent and task-specific. We propose a unified reasoning framework that bridges the gap between temporal detection, spatial localization, and textual explanation. Our approach is built upon a chained test-time reasoning process that sequentially connects these tasks, enabling holistic zero-shot anomaly analysis without any additional training. Specifically, our approach leverages intra-task reasoning to refine temporal detections and inter-task chaining for spatial and semantic understanding, yielding improved interpretability and generalization in a fully zero-shot manner. Without any additional data or gradients, our method achieves state-of-the-art zero-shot performance across multiple video anomaly detection, localization, and explanation benchmarks. The results demonstrate that careful prompt design with task-wise chaining can unlock the reasoning power of foundation models, enabling practical, interpretable video anomaly analysis in a fully zero-shot manner. Project Page: https: //rathgrith. github. io/Unified Frame VAA/.

NeurIPS Conference 2025 Conference Paper

COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation

  • Xueqing Deng
  • Linjie Yang
  • Qihang Yu
  • Ali Athar
  • Chenglin Yang
  • Xiaojie Jin
  • Xiaohui Shen
  • Liang-Chieh Chen

This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.

ICLR Conference 2024 Conference Paper

COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

  • Sihan Chen
  • Xingjian He
  • Handong Li
  • Xiaojie Jin
  • Jiashi Feng
  • Jing Liu 0001

Due to the limited scale and quality of video-text training corpus, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visually semantic representations while disregarding temporal semantic representations and correlations. To address this issue, we propose COSA, a COncatenated SAmple pretrained vision-language foundation model. COSA can jointly model visual contents and event-level temporal cues using only image-text corpora. We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining. This transformation effectively converts existing image-text corpora into a pseudo video-paragraph corpus, enabling richer scene transformations and explicit event-description correspondence. Extensive experiments demonstrate that COSA consistently improves performance across a broad range of semantic vision-language downstream tasks, including paragraph-to-video retrieval, text-to-video/image retrieval, video/image captioning and video QA. Notably, COSA achieves state-of-the-art results on various competitive benchmarks. Code and model are released at https://github.com/TXH-mercury/COSA.

AAAI Conference 2024 Conference Paper

Exploring Domain Incremental Video Highlights Detection with the LiveFood Benchmark

  • Sen Pei
  • Shixiong Xu
  • Xiaojie Jin

Video highlights detection (VHD) is an active research field in computer vision, aiming to locate the most user-appealing clips given raw video inputs. However, most VHD methods are based on the closed world assumption, i.e., a fixed number of highlight categories is defined in advance and all training data are available beforehand. Consequently, existing methods have poor scalability with respect to increasing highlight domains and training data. To address above issues, we propose a novel video highlights detection method named Global Prototype Encoding (GPE) to learn incrementally for adapting to new domains via parameterized prototypes. To facilitate this new research direction, we collect a finely annotated dataset termed LiveFood, including over 5,100 live gourmet videos that consist of four domains: ingredients, cooking, presentation, and eating. To the best of our knowledge, this is the first work to explore video highlights detection in the incremental learning setting, opening up new land to apply VHD for practical scenarios where both the concerned highlight domains and training data increase over time. We demonstrate the effectiveness of GPE through extensive experiments. Notably, GPE surpasses popular domain incremental learning methods on LiveFood, achieving significant mAP improvements on all domains. Concerning the classic datasets, GPE also yields comparable performance as previous arts. The code is available at: https://github.com/ForeverPs/IncrementalVHD_GPE.

IJCAI Conference 2024 Conference Paper

OSIC: A New One-Stage Image Captioner Coined

  • Bo Wang
  • Zhao Zhang
  • Mingbo Zhao
  • Xiaojie Jin
  • Mingliang Xu
  • Meng Wang

Mainstream image captioning models are usually two-stage captioners, i. e. , encoding the region features by a pre-trained detector and then feeding them into a language model to generate the captions. However, such a two-stage procedure will lead to a task-based information gap that decreases the performance, because the region features in the detection task are suboptimal representations and cannot provide all the necessary information for subsequent captions generation. Besides, the region features are usually represented from the last layer of the detectors that lose the local details of images. In this paper, we propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning, which directly transforms the images into descriptive sentences in one stage for eliminating the information gap. Specifically, to obtain rich features, multi-level features are captured by Swin Transformer, and then fed into a novel dynamic multi-sight embedding module to exploit both the global structure and local texture of input images. To enhance the global modeling capacity of the visual encoder, we propose a new dual-dimensional refining to non-locally model the features interaction. As a result, OSIC can directly obtain rich semantic information to improve the captioner. Extensive comparisons on the benchmark MS-COCO, Flickr8K and Flickr30K datasets verified the superior performance of our method.

AAAI Conference 2024 Conference Paper

Stitching Segments and Sentences towards Generalization in Video-Text Pre-training

  • Fan Ma
  • Xiaojie Jin
  • Heng Wang
  • Jingjia Huang
  • Linchao Zhu
  • Yi Yang

Video-language pre-training models have recently achieved remarkable results on various multi-modal downstream tasks. However, most of these models rely on contrastive learning or masking modeling to align global features across modalities, neglecting the local associations between video frames and text tokens. This limits the model’s ability to perform fine-grained matching and generalization, especially for tasks that selecting segments in long videos based on query texts. To address this issue, we propose a novel stitching and matching pre-text task for video-language pre-training that encourages fine-grained interactions between modalities. Our task involves stitching video frames or sentences into longer sequences and predicting the positions of cross-model queries in the stitched sequences. The individual frame and sentence representations are thus aligned via the stitching and matching strategy, encouraging the fine-grained interactions between videos and texts. in the stitched sequences for the cross-modal query. We conduct extensive experiments on various benchmarks covering text-to-video retrieval, video question answering, video captioning, and moment retrieval. Our results demonstrate that the proposed method significantly improves the generalization capacity of the video-text pre-training models.

NeurIPS Conference 2021 Conference Paper

All Tokens Matter: Token Labeling for Training Better Vision Transformers

  • Zi-Hang Jiang
  • Qibin Hou
  • Li Yuan
  • Daquan Zhou
  • Yujun Shi
  • Xiaojie Jin
  • Anran Wang
  • Jiashi Feng

In this paper, we present token labeling---a new training objective for training high-performance vision transformers (ViTs). Different from the standard training objective of ViTs that computes the classification loss on an additional trainable class token, our proposed one takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token with an individual location-specific supervision generated by a machine annotator. Experiments show that token labeling can clearly and consistently improve the performance of various ViT models across a wide spectrum. For a vision transformer with 26M learnable parameters serving as an example, with token labeling, the model can achieve 84. 4% Top-1 accuracy on ImageNet. The result can be further increased to 86. 4% by slightly scaling the model size up to 150M, delivering the minimal-sized model among previous models (250M+) reaching 86%. We also show that token labeling can clearly improve the generalization capability of the pretrained models on downstream tasks with dense prediction, such as semantic segmentation. Our code and model are publiclyavailable at https: //github. com/zihangJiang/TokenLabeling.

NeurIPS Conference 2021 Conference Paper

Conflict-Averse Gradient Descent for Multi-task learning

  • Bo Liu
  • Xingchao Liu
  • Xiaojie Jin
  • Peter Stone
  • Qiang Liu

The goal of multi-task learning is to enable more efficient learning than single task learning by sharing model structures for a diverse set of tasks. A standard multi-task learning objective is to minimize the average loss across all tasks. While straightforward, using this objective often results in much worse final performance for each task than learning them independently. A major challenge in optimizing a multi-task model is the conflicting gradients, where gradients of different task objectives are not well aligned so that following the average gradient direction can be detrimental to specific tasks' performance. Previous work has proposed several heuristics to manipulate the task gradients for mitigating this problem. But most of them lack convergence guarantee and/or could converge to any Pareto-stationary point. In this paper, we introduce Conflict-Averse Gradient descent (CAGrad) which minimizes the average loss function, while leveraging the worst local improvement of individual tasks to regularize the algorithm trajectory. CAGrad balances the objectives automatically and still provably converges to a minimum over the average loss. It includes the regular gradient descent (GD) and the multiple gradient descent algorithm (MGDA) in the multi-objective optimization (MOO) literature as special cases. On a series of challenging multi-task supervised learning and reinforcement learning tasks, CAGrad achieves improved performance over prior state-of-the-art multi-objective gradient manipulation methods.

ICLR Conference 2020 Conference Paper

AtomNAS: Fine-Grained End-to-End Neural Architecture Search

  • Jieru Mei
  • Yingwei Li 0002
  • Xiaochen Lian
  • Xiaojie Jin
  • Linjie Yang
  • Alan L. Yuille
  • Jianchao Yang

Search space design is very critical to neural architecture search (NAS) algorithms. We propose a fine-grained search space comprised of atomic blocks, a minimal search unit that is much smaller than the ones used in recent NAS algorithms. This search space allows a mix of operations by composing different types of atomic blocks, while the search space in previous methods only allows homogeneous operations. Based on this search space, we propose a resource-aware architecture search framework which automatically assigns the computational resources (e.g., output channel numbers) for each operation by jointly considering the performance and the computational cost. In addition, to accelerate the search process, we propose a dynamic network shrinkage technique which prunes the atomic blocks with negligible influence on outputs on the fly. Instead of a search-and-retrain two-stage paradigm, our method simultaneously searches and trains the target architecture. Our method achieves state-of-the-art performance under several FLOPs configurations on ImageNet with a small searching cost. We open our entire codebase at: https://github.com/meijieru/AtomNAS.

ICLR Conference 2020 Conference Paper

Neural Epitome Search for Architecture-Agnostic Network Compression

  • Daquan Zhou
  • Xiaojie Jin
  • Qibin Hou
  • Kaixin Wang
  • Jianchao Yang
  • Jiashi Feng

Traditional compression methods including network pruning, quantization, low rank factorization and knowledge distillation all assume that network architectures and parameters should be hardwired. In this work, we propose a new perspective on network compression, i.e., network parameters can be disentangled from the architectures. From this viewpoint, we present the Neural Epitome Search (NES), a new neural network compression approach that learns to find compact yet expressive epitomes for weight parameters of a specified network architecture end-to-end. The complete network to compress can be generated from the learned epitome via a novel transformation method that adaptively transforms the epitomes to match shapes of the given architecture. Compared with existing compression methods, NES allows the weight tensors to be independent of the architecture design and hence can achieve a good trade-off between model compression rate and performance given a specific model size constraint. Experiments demonstrate that, on ImageNet, when taking MobileNetV2 as backbone, our approach improves the full-model baseline by 1.47% in top-1 accuracy with 25% MAdd reduction and AutoML for Model Compression (AMC) by 2.5% with nearly the same compression ratio. Moreover, taking EfficientNet-B0 as baseline, our NES yields an improvement of 1.2% but are with 10% less MAdd. In particular, our method achieves a new state-of-the-art results of 77.5% under mobile settings (<350M MAdd). Code will be made publicly available.

IJCAI Conference 2018 Conference Paper

Sharing Residual Units Through Collective Tensor Factorization To Improve Deep Neural Networks

  • Yunpeng Chen
  • Xiaojie Jin
  • Bingyi Kang
  • Jiashi Feng
  • Shuicheng Yan

The residual unit and its variations are wildly used in building very deep neural networks for alleviating optimization difficulty. In this work, we revisit the standard residual function as well as its several successful variants and propose a unified framework based on tensor Block Term Decomposition (BTD) to explain these apparently different residual functions from the tensor decomposition view. With the BTD framework, we further propose a novel basic network architecture, named the Collective Residual Unit (CRU). CRU further enhances parameter efficiency of deep residual neural networks by sharing core factors derived from collective tensor factorization over the involved residual units. It enables efficient knowledge sharing across multiple residual units, reduces the number of model parameters, lowers the risk of over-fitting, and provides better generalization ability. Extensive experimental results show that our proposed CRU network brings outstanding parameter efficiency -- it achieves comparable classification performance with ResNet-200 while using a model size as small as ResNet-50 on the ImageNet-1k and Places365-Standard benchmark datasets.

ICML Conference 2018 Conference Paper

WSNet: Compact and Efficient Networks Through Weight Sampling

  • Xiaojie Jin
  • Yingzhen Yang
  • Ning Xu 0001
  • Jianchao Yang
  • Nebojsa Jojic
  • Jiashi Feng
  • Shuicheng Yan

We present a new approach and a novel architecture, termed WSNet, for learning compact and efficient deep neural networks. Existing approaches conventionally learn full model parameters independently and then compress them via ad hoc processing such as model pruning or filter factorization. Alternatively, WSNet proposes learning model parameters by sampling from a compact set of learnable parameters, which naturally enforces parameter sharing throughout the learning process. We demonstrate that such a novel weight sampling approach (and induced WSNet) promotes both weights and computation sharing favorably. By employing this method, we can more efficiently learn much smaller networks with competitive performance compared to baseline networks with equal numbers of convolution filters. Specifically, we consider learning compact and efficient 1D convolutional neural networks for audio classification. Extensive experiments on multiple audio classification datasets verify the effectiveness of WSNet. Combined with weight quantization, the resulted models are up to 180x smaller and theoretically up to 16x faster than the well-established baselines, without noticeable performance drop.

NeurIPS Conference 2017 Conference Paper

Dual Path Networks

  • Yunpeng Chen
  • Jianan Li
  • Huaxin Xiao
  • Xiaojie Jin
  • Shuicheng Yan
  • Jiashi Feng

In this work, we present a simple, highly efficient and modularized Dual Path Network (DPN) for image classification which presents a new topology of connection paths internally. By revealing the equivalence of the state-of-the-art Residual Network (ResNet) and Densely Convolutional Network (DenseNet) within the HORNN framework, we find that ResNet enables feature re-usage while DenseNet enables new features exploration which are both important for learning good representations. To enjoy the benefits from both path topologies, our proposed Dual Path Network shares common features while maintaining the flexibility to explore new features through dual path architectures. Extensive experiments on three benchmark datasets, ImagNet-1k, Places365 and PASCAL VOC, clearly demonstrate superior performance of the proposed DPN over state-of-the-arts. In particular, on the ImagNet-1k dataset, a shallow DPN surpasses the best ResNeXt-101(64x4d) with 26% smaller model size, 25% less computational cost and 8% lower memory consumption, and a deeper DPN (DPN-131) further pushes the state-of-the-art single model performance with about 2 times faster training speed. Experiments on the Places365 large-scale scene dataset, PASCAL VOC detection dataset, and PASCAL VOC segmentation dataset also demonstrate its consistently better performance than DenseNet, ResNet and the latest ResNeXt model over various applications.

AAAI Conference 2017 Conference Paper

Multi-Path Feedback Recurrent Neural Networks for Scene Parsing

  • Xiaojie Jin
  • Yunpeng Chen
  • Zequn Jie
  • Jiashi Feng
  • Shuicheng Yan

In this paper, we consider the scene parsing problem and propose a novel Multi-Path Feedback recurrent neural network (MPF-RNN) for parsing scene images. MPF-RNN can enhance the capability of RNNs in modeling long-range context information at multiple levels and better distinguish pixels that are easy to confuse. Different from feedforward CNNs and RNNs with only single feedback, MPF-RNN propagates the contextual features learned at top layer through multiple weighted recurrent connections to learn bottom features. For better training MPF-RNN, we propose a new strategy that considers accumulative loss at multiple recurrent steps to improve performance of the MPF-RNN on parsing small objects. With these two novel components, MPF-RNN has achieved significant improvement over strong baselines (VGG16 and Res101) on five challenging scene parsing benchmarks, including traditional SiftFlow, Barcelona, CamVid, Stanford Background as well as the recently released large-scale ADE20K.

IJCAI Conference 2017 Conference Paper

Online Robust Low-Rank Tensor Learning

  • Ping Li
  • Jiashi Feng
  • Xiaojie Jin
  • Luming Zhang
  • Xianghua Xu
  • Shuicheng Yan

The rapid increase of multidimensional data (a. k. a. tensor) like videos brings new challenges for low-rank data modeling approaches such as dynamic data size, complex high-order relations, and multiplicity of low-rank structures. Resolving these challenges require a new tensor analysis method that can perform tensor data analysis online, which however is still absent. In this paper, we propose an Online Robust Low-rank Tensor Modeling (ORLTM) approach to address these challenges. ORLTM dynamically explores the high-order correlations across all tensor modes for low-rank structure modeling. To analyze mixture data from multiple subspaces, ORLTM introduces a new dictionary learning component. ORLTM processes data streamingly and thus requires quite low memory cost that is independent of data size. This makes ORLTM quite suitable for processing large-scale tensor data. Empirical studies have validated the effectiveness of the proposed method on both synthetic data and one practical task, i. e. , video background subtraction. In addition, we provide theoretical analysis regarding computational complexity and memory cost, demonstrating the efficiency of ORLTM rigorously.

NeurIPS Conference 2017 Conference Paper

Predicting Scene Parsing and Motion Dynamics in the Future

  • Xiaojie Jin
  • Huaxin Xiao
  • Xiaohui Shen
  • Jimei Yang
  • Zhe Lin
  • Yunpeng Chen
  • Zequn Jie
  • Jiashi Feng

It is important for intelligent systems, e. g. autonomous vehicles and robotics to anticipate the future in order to plan early and make decisions accordingly. Predicting the future scene parsing and motion dynamics helps the agents better understand the visual environment better as the former provides dense semantic segmentations, i. e. what objects will be present and where they will appear, while the latter provides dense motion information, i. e. how the objects move in the future. In this paper, we propose a novel model to predict the scene parsing and motion dynamics in unobserved future video frames simultaneously. Using history information (preceding frames and corresponding scene parsing results) as input, our model is able to predict the scene parsing and motion for arbitrary time steps ahead. More importantly, our model is superior compared to other methods that predict parsing and motion separately, as the complementary relationship between the two tasks are fully utilized in our model through joint learning. To our best knowledge, this is the first attempt in jointly predicting scene parsing and motion dynamics in the future frames. On the large-scale Cityscapes dataset, it is demonstrated that our model produces significantly better parsing and motion prediction results compared to well established baselines. In addition, we also show our model can be used to predict the steering angle of the vehicles, which further verifies the ability of our model to learn underlying latent parameters.

IJCAI Conference 2017 Conference Paper

Training Group Orthogonal Neural Networks with Privileged Information

  • Yunpeng Chen
  • Xiaojie Jin
  • Jiashi Feng
  • Shuicheng Yan

Learning rich and diverse representations is critical for the performance of deep convolutional neural networks (CNNs). In this paper, we consider how to use privileged information to promote inherent diversity of a single CNN model such that the model can learn better representations and offer stronger generalization ability. To this end, we propose a novel group orthogonal convolutional neural network (GoCNN) that learns untangled representations within each layer by exploiting provided privileged information and enhances representation diversity effectively. We take image classification as an example where image segmentation annotations are used as privileged information during the training process. Experiments on two benchmark datasets – ImageNet and PASCAL VOC – clearly demonstrate the strong generalization ability of our proposed GoCNN model. On the ImageNet dataset, GoCNN improves the performance of state-of-the-art ResNet-152 model by absolute value of 1. 2% while only uses privileged information of 10% of the training images, confirming effectiveness of GoCNN on utilizing available privileged knowledge to train better CNNs.

AAAI Conference 2016 Conference Paper

Deep Learning with S-Shaped Rectified Linear Activation Units

  • Xiaojie Jin
  • Chunyan Xu
  • Jiashi Feng
  • Yunchao Wei
  • Junjun Xiong
  • Shuicheng Yan

Rectified linear activation units are important components for state-of-the-art deep convolutional networks. In this paper, we propose a novel S-shaped rectified linear activation unit (SReLU) to learn both convex and non-convex functions, imitating the multiple function forms given by the two fundamental laws, namely the Webner-Fechner law and the Stevens law, in psychophysics and neural sciences. Specifically, SReLU consists of three piecewise linear functions, which are formulated by four learnable parameters. The SReLU is learned jointly with the training of the whole deep network through back propagation. During the training phase, to initialize SReLU in different layers, we propose a “freezing” method to degenerate SReLU into a predefined leaky rectified linear unit in the initial several training epochs and then adaptively learn the good initial values. SReLU can be universally used in the existing deep networks with negligible additional parameters and computation cost. Experiments with two popular CNN architectures, Network in Network and GoogLeNet on scale-various benchmarks including CI- FAR10, CIFAR100, MNIST and ImageNet demonstrate that SReLU achieves remarkable improvement compared to other activation functions.

NeurIPS Conference 2016 Conference Paper

Tree-Structured Reinforcement Learning for Sequential Object Localization

  • Zequn Jie
  • Xiaodan Liang
  • Jiashi Feng
  • Xiaojie Jin
  • Wen Lu
  • Shuicheng Yan

Existing object proposal algorithms usually search for possible object regions over multiple locations and scales \emph{ separately}, which ignore the interdependency among different objects and deviate from the human perception procedure. To incorporate global interdependency between objects into object localization, we propose an effective Tree-structured Reinforcement Learning (Tree-RL) approach to sequentially search for objects by fully exploiting both the current observation and historical search paths. The Tree-RL approach learns multiple searching policies through maximizing the long-term reward that reflects localization accuracies over all the objects. Starting with taking the entire image as a proposal, the Tree-RL approach allows the agent to sequentially discover multiple objects via a tree-structured traversing scheme. Allowing multiple near-optimal policies, Tree-RL offers more diversity in search paths and is able to find multiple objects with a single feed-forward pass. Therefore, Tree-RL can better cover different objects with various scales which is quite appealing in the context of object proposal. Experiments on PASCAL VOC 2007 and 2012 validate the effectiveness of the Tree-RL, which can achieve comparable recalls with current object proposal algorithms via much fewer candidate windows.