Arrow Research

Author name cluster

Jing Bi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
1 author row

Possible papers (7)

AAAI Conference 2026 System Paper

Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting

  • Yunlong Tang
  • Jing Bi
  • Chao Huang
  • Susan Liang
  • Daiki Shimada
  • Hang Hua
  • Yunzhong Xiao
  • Yizhi Song

In this work, we introduce CAT-V (Caption Anything in Video), a training-free framework for fine-grained object-centric video captioning of user-selected instances. CAT-V combines (i) a SAMURAI-based Segmenter for precise object masks across frames, (ii) a TRACE-Uni Temporal Analyzer for event boundary detection and coarse event descriptions, and (iii) an InternVL-2.5 Captioner that, conditioned on spatiotemporal visual prompts and chain-of-thought (CoT) guidance, produces detailed, temporally coherent captions about object attributes, actions, states, interactions, and context. The system supports point, box, and region prompts and maintains temporal sensitivity by tracking object states across segments. In contrast to vanilla video captioning that is overly abstract and dense video captioning that is often terse, CAT-V enables object-level specificity with spatial accuracy and temporal coherence, without additional training data.
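
As a rough illustration of the three-stage flow, the sketch below wires the pipeline together in Python. The segment_object, detect_events, and caption_event helpers are hypothetical stand-ins for the SAMURAI, TRACE-Uni, and InternVL-2.5 components, whose real interfaces are not given here; toy stubs keep the sketch runnable.

    def segment_object(video, prompt):
        # Stand-in for the SAMURAI-based Segmenter: one mask per frame for
        # the user-selected object (prompt = point, box, or region).
        return [f"mask@frame{i}" for i in range(len(video))]

    def detect_events(video):
        # Stand-in for the TRACE-Uni Temporal Analyzer: event boundaries
        # plus coarse per-event descriptions.
        mid = len(video) // 2
        return [(0, mid, "event A"), (mid, len(video), "event B")]

    def caption_event(video, masks, event):
        # Stand-in for the InternVL-2.5 Captioner, conditioned on the
        # object's masks within this event plus CoT guidance.
        start, end, summary = event
        return f"frames {start}-{end}: detailed object caption for '{summary}'"

    def caption_selected_object(video, prompt):
        masks = segment_object(video, prompt)    # (i) per-frame masks
        events = detect_events(video)            # (ii) event boundaries
        # (iii) one caption per event, so object state stays tracked
        # across segments.
        return [caption_event(video, masks, ev) for ev in events]

    print(caption_selected_object(list(range(8)), prompt=(120, 80)))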

AAAI Conference 2025 Conference Paper

Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding

  • Yunlong Tang
  • Daiki Shimada
  • Jing Bi
  • Mingqian Feng
  • Hang Hua
  • Chenliang Xu

Large language models (LLMs) have demonstrated remarkable capabilities in natural language and multimodal domains. By fine-tuning multimodal LLMs on temporal annotations from well-annotated datasets, e.g., dense video captioning datasets, they can acquire temporal understanding for video-language tasks. However, there is a notable lack of untrimmed audio-visual video datasets with precise temporal annotations for events. This deficiency hinders LLMs from learning the alignment between time, audio-visual events, and text tokens, thus impairing their ability to temporally localize audio-visual events in videos. To address this gap, we introduce PU-VALOR, a comprehensive audio-visual dataset comprising 114,081 pseudo-untrimmed videos with detailed temporal annotations. PU-VALOR is derived from the large-scale but coarsely annotated audio-visual dataset VALOR via event-based video clustering, random temporal scaling, and permutation. By fine-tuning a multimodal LLM on PU-VALOR, we developed AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens. AVicuna excels at temporal localization and time-aware dialogue. Our experiments demonstrate that AVicuna effectively handles temporal understanding in audio-visual videos and achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks.
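
The pseudo-untrimmed construction lends itself to a short sketch. The Python below mimics the cluster-scale-permute recipe on toy clips assumed to come from one event-based cluster; the 0.5-2.0 scaling range and the data layout are illustrative assumptions, not values from the paper.

    import random

    def make_pseudo_untrimmed(clips, rng=None):
        """clips: list of (duration_sec, caption) pairs for trimmed event
        clips. Returns an untrimmed timeline with per-event annotations."""
        rng = rng or random.Random(0)
        order = rng.sample(clips, len(clips))        # permutation
        timeline, t = [], 0.0
        for dur, cap in order:
            d = dur * rng.uniform(0.5, 2.0)          # random temporal scaling
            timeline.append({"start": round(t, 2), "end": round(t + d, 2),
                             "caption": cap})
            t += d
        return timeline

    print(make_pseudo_untrimmed([(4.0, "dog barks"), (6.0, "man speaks"),
                                 (3.0, "door slams")]))

The resulting (start, end, caption) triples are exactly the kind of time-to-text alignment signal the abstract says untrimmed data should provide.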

NeurIPS Conference 2025 Conference Paper

MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

  • Yunlong Tang
  • Pinxin Liu
  • Mingqian Feng
  • Zhangyun Tan
  • Rui Mao
  • Chao Huang
  • Jing Bi
  • Yunzhong Xiao

Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities such as vanishing-point perception and counting, perspective-type reasoning, line-relationship understanding in 3D space, and invariance to perspective-preserving transformations. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources are available at https://yunlong10.github.io/MMPerspective/
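
The Robustness dimension reduces to checking answer consistency under perspective-preserving transformations, which a few lines can illustrate. The model and transforms below are toy stand-ins, not the benchmark's actual evaluation harness.

    def robustness_score(model, image, question, transforms):
        """Fraction of transforms under which the model's answer matches
        its answer on the original image."""
        base = model(image, question)
        return sum(model(t(image), question) == base
                   for t in transforms) / len(transforms)

    # Toy stand-ins: a real run would query an MLLM and apply image-space
    # transforms that leave the perspective structure intact.
    model = lambda image, question: f"{sum(image) % 3} vanishing points"
    transforms = [lambda im: im[::-1], lambda im: [2 * v for v in im]]
    print(robustness_score(model, [1, 2, 3],
                           "How many vanishing points?", transforms))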

JBHI Journal 2025 Journal Article

Multi-Omics Graph Knowledge Representation for Pneumonia Prognostic Prediction

  • Wenyu Xing
  • Miao Li
  • Yiwen Liu
  • Xin Liu
  • Yifang Li
  • Yanping Yang
  • Jing Bi
  • Jiangang Chen

Early prognostic prediction is crucial for determining appropriate clinical interventions. Previous single-omics models had limitations, such as high contingency and overlooking complex physical conditions. In this paper, we introduce multi-omics graph knowledge representation to predict in-hospital outcomes for pneumonia patients. The method utilizes CT imaging and three types of non-imaging omics information, and explores a knowledge graph for modeling multi-omics relations to enhance the overall information representation. For imaging omics, a multichannel pyramidal recursive MLP and Longformer-based 3D deep learning module was developed to extract depth features in the lung window, while radiomics features were simultaneously extracted in both the lung and mediastinal windows. Non-imaging omics involved the adoption of laboratory, microbial, and clinical indices to complement the patient's physical condition. Following feature screening, a similarity fusion network and graph convolutional network (GCN) were employed to determine omics similarity and provide prognostic prediction. The results of comparative experiments and generalization validation demonstrate that the proposed multi-omics GCN-based prediction model is robust and outperforms previous single-omics, classical machine learning, and deep learning methods. Thus, the proposed multi-omics graph knowledge representation model enhances early prognostic prediction performance in pneumonia, facilitating a comprehensive assessment of disease severity and timely intervention for high-risk patients.
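
A minimal sketch of the graph step, assuming toy data: build a patient-similarity graph from fused per-patient features, then apply one symmetrically normalized GCN layer. The cosine-threshold graph below is a simplification standing in for the similarity fusion network, and the sizes and threshold are illustrative.

    import numpy as np

    def similarity_graph(X, threshold=0.2):
        """Adjacency from thresholded cosine similarity of patient features."""
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        A = (Xn @ Xn.T > threshold).astype(float)
        np.fill_diagonal(A, 0.0)
        return A

    def gcn_layer(A, X, W):
        """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
        A_hat = A + np.eye(A.shape[0])
        d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
        A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
        return np.maximum(A_norm @ X @ W, 0.0)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 16))                  # 30 patients, 16 fused features
    H = gcn_layer(similarity_graph(X), X, rng.normal(size=(16, 2)))
    print(H.shape)                                 # (30, 2): per-patient logits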

NeurIPS Conference 2025 Conference Paper

ZeroSep: Separate Anything in Audio with Zero Training

  • Chao Huang
  • Yuesheng Ma
  • Junxuan Huang
  • Susan Liang
  • Yunlong Tang
  • Jing Bi
  • Wenqiang Liu
  • Nima Mesgarani

Audio source separation is fundamental for machines to understand complex acoustic environments and underpins numerous audio applications. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model under the right configuration. Our method, named ZeroSep, works by inverting the mixed audio into the diffusion model's latent space and then using text conditioning to guide the denoising process to recover individual sources. Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors. ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong separation performance on multiple separation benchmarks, surpassing even supervised methods.
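
The core recipe, inversion followed by text-conditioned denoising once per prompt, fits in a few lines. The invert and denoise callables below are hypothetical stand-ins for a pre-trained text-guided audio diffusion model's inversion (e.g., DDIM) and conditional sampling loops; the toy lambdas only keep the sketch runnable.

    def separate(mixture, prompts, invert, denoise):
        latents = invert(mixture)          # mixture -> latent trajectory
        # Text conditioning steers denoising toward one source per prompt.
        return {p: denoise(latents, p) for p in prompts}

    # Toy stand-ins so the sketch runs end to end.
    invert = lambda audio: [v / 2 for v in audio]
    denoise = lambda latents, prompt: [v * 2 for v in latents]
    print(separate([0.1, -0.2, 0.3], ["a dog barking", "rainfall"],
                   invert, denoise))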

EAAI Journal 2023 Journal Article

Data-driven methods for stress field predictions in random heterogeneous materials

  • Enjamamul Hoq
  • Osama Aljarrah
  • Jun Li
  • Jing Bi
  • Alfa Heryudono
  • Wenzhen Huang

Predicting full-field stress responses is of fundamental importance to assessing material failure and has various engineering applications in design optimization, manufacturing process control, and structural health monitoring. This article develops and evaluates different data-driven methods for efficient and accurate prediction of full stress fields in random heterogeneous materials. The first approach integrates model order reduction via proper orthogonal decomposition (POD) with classical machine learning techniques (K-nearest neighbors, random forest, and artificial neural networks) to predict full-field responses from POD-reduced coefficients. However, this strategy shows limitations in predicting full stress fields, especially for heterogeneous material inclusions that are small or close to the domain boundary. Two computer vision-based deep learning approaches were then developed for full-field predictions: one uses a ResNet-based convolutional neural network (CNN), and the other is based on a modified conditional generative adversarial network (cGAN). Two representative example problems were studied: a random heterogeneous material inclusion or a void varying in size and location. In contrast to POD-based classical machine learning, almost invisible differences were found between the full stress fields from finite element simulations and the computer vision-based deep learning (CNN/cGAN) predictions, with significantly reduced mean squared error (MSE) and correlation values (R²) mostly above 0.99. Moreover, the proposed cGAN provides more accurate predictions than the CNN with fewer training epochs.
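
The POD half of the pipeline is standard enough to sketch, assuming random toy data in place of finite element snapshots: the spatial basis comes from a thin SVD of centered snapshots, and a regressor (KNN, random forest, or ANN) would map microstructure descriptors to the reduced coefficients.

    import numpy as np

    rng = np.random.default_rng(0)
    S = rng.normal(size=(2000, 100))       # columns: flattened stress fields

    mean = S.mean(axis=1, keepdims=True)
    U, s, _ = np.linalg.svd(S - mean, full_matrices=False)
    Phi = U[:, :10]                        # keep r = 10 POD modes

    # Reduced coefficients per snapshot; a regressor predicts these from
    # inclusion size/location descriptors.
    A = Phi.T @ (S - mean)                 # shape (10, 100)

    # Full-field reconstruction from (predicted) coefficients:
    a_hat = A[:, 0]                        # stand-in for a prediction
    field = mean[:, 0] + Phi @ a_hat
    print(field.shape)                     # (2000,)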

AAAI Conference 2020 Conference Paper

Learning from Interventions Using Hierarchical Policies for Safe Learning

  • Jing Bi
  • Vikas Dhiman
  • Tianyou Xiao
  • Chenliang Xu

Learning from Demonstrations (LfD) via Behavior Cloning (BC) works well on multiple complex tasks. However, a limitation of the typical LfD approach is that it requires expert demonstrations for all scenarios, including those in which the algorithm is already well-trained. The recently proposed Learning from Interventions (LfI) overcomes this limitation by using an expert overseer who intervenes only when it suspects that an unsafe action is about to be taken. Although LfI significantly improves over LfD, the state-of-the-art LfI fails to account for the delay caused by the expert’s reaction time and only learns short-term behavior. We address these limitations by 1) interpolating the expert’s interventions back in time, and 2) splitting the policy into two hierarchical levels, one that generates sub-goals for the future and another that generates actions to reach those sub-goals. This sub-goal prediction forces the algorithm to learn long-term behavior while remaining robust to the expert’s reaction time. Our experiments show that LfI using sub-goals in a hierarchical policy framework trains faster and achieves better asymptotic performance than typical LfD.
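
The back-in-time interpolation of interventions admits a compact sketch. Treating it as relabeling a fixed number of pre-intervention steps is an assumption for illustration; the paper’s exact interpolation scheme may differ.

    def backdate_interventions(intervened, delay):
        """Mark the `delay` steps before each expert intervention as
        corrective, approximating the expert's reaction time."""
        labels = list(intervened)
        for t, flag in enumerate(intervened):
            if flag:
                for k in range(max(0, t - delay), t):
                    labels[k] = True
        return labels

    print(backdate_interventions([False, False, False, True, False], delay=2))
    # -> [False, True, True, True, False]

The hierarchical split then trains two heads on these relabeled trajectories: one predicting a sub-goal several steps ahead, the other predicting the action that moves the current state toward that sub-goal.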