Arrow Research search

Author name cluster

Junbin Xiao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

3 papers
1 author row

Possible papers (3)

NeurIPS 2025 (Conference Paper)

EgoBlind: Towards Egocentric Visual Assistance for the Blind

  • Junbin Xiao
  • Nanxin Huang
  • Hao Qiu
  • Zhulin Tao
  • Xun Yang
  • Richang Hong
  • Meng Wang
  • Angela Yao

We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,392 first-person videos from the daily lives of blind and visually impaired individuals. It also features 5,311 questions directly posed or verified by the blind to reflect their in-situation needs for visual assistance. Each question has an average of 3 manually annotated reference answers to reduce subjectiveness. Using EgoBlind, we comprehensively evaluate 16 advanced MLLMs and find that all models struggle. The best performers achieve an accuracy near 60%, which is far behind human performance of 87.4%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and explore heuristic solutions for improvement. With these efforts, we hope that EgoBlind will serve as a foundation for developing effective AI assistants to enhance the independence of the blind and visually impaired. Data and code are available at https://github.com/doc-doc/EgoBlind.

AAAI 2023 (Conference Paper)

FakeSV: A Multimodal Benchmark with Rich Social Context for Fake News Detection on Short Video Platforms

  • Peng Qi
  • Yuyan Bu
  • Juan Cao
  • Wei Ji
  • Ruihao Shui
  • Junbin Xiao
  • Danding Wang
  • Tat-Seng Chua

Short video platforms have become an important channel for news sharing, but also a new breeding ground for fake news. To mitigate this problem, research on fake news video detection has recently received considerable attention. Existing works face two roadblocks: the scarcity of comprehensive, large-scale datasets and insufficient utilization of multimodal information. Therefore, in this paper, we construct FakeSV, the largest Chinese short video dataset on fake news, which simultaneously includes news content, user comments, and publisher profiles. To understand the characteristics of fake news videos, we conduct exploratory analyses of FakeSV from different perspectives. Moreover, we provide a new multimodal detection model named SV-FEND, which exploits cross-modal correlations to select the most informative features and utilizes social context information for detection. Extensive experiments demonstrate the superiority of the proposed method and provide detailed comparisons of different methods and modalities for future work. Our dataset and code are available at https://github.com/ICTMCG/FakeSV.

AAAI 2022 (Conference Paper)

Video as Conditional Graph Hierarchy for Multi-Granular Question Answering

  • Junbin Xiao
  • Angela Yao
  • Zhiyuan Liu
  • Yicong Li
  • Wei Ji
  • Tat-Seng Chua

Video question answering requires models to understand and reason about both complex video and language data to correctly derive answers. Existing efforts have focused on designing sophisticated cross-modal interactions to fuse the information from the two modalities, while encoding the video and question holistically as frame and word sequences. Despite their success, these methods essentially revolve around the sequential nature of video and question content, providing little insight into the problem of question answering and lacking interpretability as well. In this work, we argue that while video is presented as a frame sequence, the visual elements (e.g., objects, actions, activities, and events) are not sequential but rather hierarchical in semantic space. To align with the multi-granular essence of linguistic concepts in language queries, we propose to model video as a conditional graph hierarchy that weaves together visual facts of different granularity in a level-wise manner, guided by the corresponding textual cues. Despite its simplicity, our extensive experiments demonstrate the superiority of such a conditional hierarchical graph architecture, with clear performance improvements over prior methods and better generalization across different types of questions. Further analyses also demonstrate the model's reliability, as it provides meaningful visual-textual evidence for the predicted answers.