Arrow Research search

Author name cluster

Ali Vosoughi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

3 papers
1 author row

Possible papers (3)

AAAI Conference 2026 System Paper

Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting

  • Yunlong Tang
  • Jing Bi
  • Chao Huang
  • Susan Liang
  • Daiki Shimada
  • Hang Hua
  • Yunzhong Xiao
  • Yizhi Song

In this work, we introduce CAT-V (Caption Anything in Video), a training-free framework for fine-grained object-centric video captioning of user-selected instances. CAT-V combines (i) a SAMURAI-based Segmenter for precise object masks across frames, (ii) a TRACE-Uni Temporal Analyzer for event boundary detection and coarse event descriptions, and (iii) an InternVL-2.5 Captioner that, conditioned on spatiotemporal visual prompts and chain-of-thought (CoT) guidance, produces detailed, temporally coherent captions about object attributes, actions, states, interactions, and context. The system supports point, box, and region prompts and maintains temporal sensitivity by tracking object states across segments. In contrast to vanilla video captioning, which tends to be overly abstract, and dense video captioning, which is often terse, CAT-V delivers object-level specificity with spatial accuracy and temporal coherence, without additional training data.
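
A minimal, self-contained sketch of the three-stage flow the abstract describes, with stand-in stubs where the paper plugs in SAMURAI, TRACE-Uni, and InternVL-2.5. The function names, signatures, and prompt format here are illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch of the CAT-V flow. All three components are stand-ins:
# real CAT-V plugs in SAMURAI (segmentation), TRACE-Uni (temporal
# analysis), and InternVL-2.5 (captioning) where these stubs sit.

def segment(frames, user_prompt):
    """Stand-in segmenter: propagate the user's point/box/region prompt
    into one object mask per frame."""
    return [f"mask(frame={i}, prompt={user_prompt})" for i in range(len(frames))]

def split_events(frames):
    """Stand-in temporal analyzer: (start, end, coarse summary) per event."""
    mid = len(frames) // 2
    return [(0, mid, "object appears"), (mid + 1, len(frames) - 1, "object moves away")]

def caption(masked_clip, cot_hint, history):
    """Stand-in captioner: a multimodal LLM conditioned on the masked clip,
    a chain-of-thought hint, and earlier captions for state consistency."""
    return f"[{len(masked_clip)}-frame caption guided by: {cot_hint}]"

def cat_v(frames, user_prompt):
    masks = segment(frames, user_prompt)                 # (i) spatial grounding
    captions = []
    for start, end, summary in split_events(frames):     # (ii) event boundaries
        clip = list(zip(frames[start:end + 1], masks[start:end + 1]))
        captions.append(caption(clip, cot_hint=summary,  # (iii) per-event caption
                                history=captions))
    return captions

print(cat_v(frames=list(range(8)), user_prompt="box(10, 20, 50, 60)"))
```

The loop captures the design point the abstract emphasizes: captions are produced per detected event, conditioned on the mask-grounded clip and on earlier captions, which is what keeps descriptions object-specific and temporally coherent.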

YNIMG Journal 2025 Journal Article

Inferring causal relations from multivariate data using Large-Scale Augmented Granger Causality (lsAGC)

  • Axel Wismüller
  • Ali Vosoughi
  • Akhil Kasturi

Causal inference from high-dimensional and short time-series data is crucial to scientific discovery across diverse fields, yet standard approaches frequently fail under these constraints. We propose Large-scale Augmented Granger Causality (lsAGC), which integrates dimension reduction, a Granger-based predictive framework, and data augmentation to handle large-scale networks even when there are fewer time points than nodes (T < N). Extensive simulations on synthetic and semi-realistic fMRI data (3-34 nodes, both linear and nonlinear) confirm lsAGC's efficiency on high-dimensional data. Validation on real clinical fMRI data from 40 subjects (118 brain regions) demonstrates superior performance, with lsAGC achieving AUC 0.83 versus 0.50-0.62 for modern baselines including PCMCI, sparse VAR, and deconvolution-based GC. Empirically, lsAGC outperforms baseline methods across multiple benchmarks: on a 34-node network with only 50 samples, for instance, it maintains an AUROC above 0.70 whereas the others fall below 0.60. Moreover, lsAGC is computationally efficient (8.3 s vs. hours for 118-region networks) and robust to noise, nonlinear effects, and short time spans. This combination of speed and accuracy makes lsAGC practical for real-world settings in neuroscience, climate science, and economics, where short, large-scale time series predominate.
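
For intuition, here is a minimal sketch of the core idea as the abstract states it: compress all N series into a few components so the regression stays well-posed when T < N, then score each candidate source by how much its lagged values improve prediction of a target beyond that low-dimensional summary. The PCA reduction, single-lag model, and log-ratio score below are simplifying assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def pca_reduce(X, k):
    """Project N-dimensional observations onto the top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T

def residual_var(y, Z):
    """Variance of least-squares residuals of y regressed on Z."""
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return np.var(y - Z @ beta)

def lsagc_scores(X, k=3):
    """X: (T, N) multivariate time series. Returns an (N, N) score matrix
    where scores[i, j] measures how much lagged x_i helps predict x_j
    beyond a k-dimensional summary of all series (Granger-style test
    that stays well-posed even when T < N)."""
    T, N = X.shape
    comps = pca_reduce(X, k)                   # low-dim summary of all series
    past, future = slice(0, T - 1), slice(1, T)
    scores = np.zeros((N, N))
    for j in range(N):                         # target series
        base = np.column_stack([comps[past], np.ones(T - 1)])
        e_base = residual_var(X[future, j], base)
        for i in range(N):                     # candidate source series
            if i == j:
                continue
            aug = np.column_stack([base, X[past, i]])   # augment with lagged x_i
            e_aug = residual_var(X[future, j], aug)
            scores[i, j] = np.log(e_base / max(e_aug, 1e-12))
    return scores

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 34))              # short series, 34 nodes
X[1:, 1] += 0.8 * X[:-1, 0]                    # inject a 0 -> 1 influence
print(lsagc_scores(X)[0, 1])                   # should stand out from noise
```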

NeurIPS Conference 2025 Conference Paper

MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

  • Yunlong Tang
  • Pinxin Liu
  • Mingqian Feng
  • Zhangyun Tan
  • Rui Mao
  • Chao Huang
  • Jing Bi
  • Yunzhong Xiao

Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, and invariance to perspective-preserving transformations. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources are available at https://yunlong10.github.io/MMPerspective/
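
For context, a benchmark of this shape implies an evaluation loop like the sketch below, which aggregates exact-match accuracy per task and per dimension. The record schema, task names, and scoring rule are assumptions for illustration; they are not the released MMPerspective harness or data format.

```python
from collections import defaultdict

def evaluate(model, dataset):
    """dataset: iterable of dicts with image, question, answer, task, dimension.
    Returns exact-match accuracy keyed by dimension and by task."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in dataset:
        pred = model(ex["image"], ex["question"])        # MLLM under test
        hit = pred.strip().lower() == ex["answer"].strip().lower()
        for key in (ex["dimension"], ex["task"]):        # e.g. "Perception"
            total[key] += 1
            correct[key] += hit
    return {key: correct[key] / total[key] for key in total}

# Toy run with a dummy model that always answers "2".
dummy = lambda image, question: "2"
data = [
    {"image": None, "question": "How many vanishing points?", "answer": "2",
     "task": "vanishing-point counting", "dimension": "Perception"},
    {"image": None, "question": "Do these 3D lines converge?", "answer": "yes",
     "task": "line relationships", "dimension": "Reasoning"},
]
print(evaluate(dummy, data))  # per-dimension and per-task accuracy
```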