Author name cluster

Jun Song

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers

2 author rows

AAAI Conference 2026 Conference Paper

Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning

Yinchao Ma
Qiang Zhou
Zhibin Wang
Xianing Chen
Hanqing Yang
Jun Song
Bo Zheng

Video large language models have demonstrated remarkable capabilities in video understanding tasks. However, the redundancy of video tokens introduces significant computational overhead during inference, limiting their practical deployment. Many compression algorithms have been proposed that prioritize retaining the features with the highest attention scores, so as to minimize perturbations in attention computations. However, the correlation between attention scores and their actual contribution to correct answers remains ambiguous. To address this limitation, we propose a novel contribution-aware token compression algorithm for video understanding (CaCoVID) that explicitly optimizes the token selection policy based on the contribution of tokens to correct predictions. First, we introduce a reinforcement learning-based framework that optimizes a policy network to select video token combinations with the greatest contribution to correct predictions. This paradigm shifts the focus from passive token preservation to active discovery of optimal compressed token combinations. Second, we propose a combinatorial policy optimization algorithm with online combination space sampling, which dramatically reduces the exploration space for video token combinations and accelerates the convergence of policy optimization. Extensive experiments on diverse video understanding benchmarks demonstrate the effectiveness of CaCoVID. Code will be released.
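For orientation, the attention-score baseline that the abstract contrasts itself with can be sketched as a plain top-k selection. This is a minimal illustration only, not CaCoVID's learned policy; the function name `keep_top_k_tokens` and the example scores are hypothetical.

```python
from typing import List


def keep_top_k_tokens(attention_scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring tokens, in original order.

    Hypothetical helper illustrating attention-score-based token retention;
    it is not taken from the paper.
    """
    if k <= 0:
        return []
    # Rank token indices by score, highest first; ties keep the earlier token.
    ranked = sorted(range(len(attention_scores)),
                    key=lambda i: (-attention_scores[i], i))
    # Restore the original order so the kept tokens stay in sequence.
    return sorted(ranked[:k])


scores = [0.05, 0.40, 0.10, 0.30, 0.15]  # made-up attention scores
print(keep_top_k_tokens(scores, k=2))
```

A selection rule like this looks only at attention magnitude, which is the property the abstract argues may not track a token's contribution to a correct answer.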

AAAI Conference 2026 Conference Paper

DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning

Xinrun Xu
Pi Bu
Ye Wang
Börje F. Karlsson
Ziming Wang
Tengtao Song
Qi Zhu
Jun Song

Although Vision Language Models (VLMs) exhibit strong perceptual abilities and impressive visual reasoning, they struggle with attention to detail and precise action planning in complex, dynamic environments, leading to subpar performance. Real-world tasks typically require complex interactions, advanced spatial reasoning, long-term planning, and continuous strategy refinement, usually necessitating understanding the physics rules of the target scenario. However, evaluating these capabilities in real-world scenarios is often prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel benchmark framework designed to systematically evaluate VLMs' understanding and reasoning about fundamental physical principles through a series of challenging simulated environments. DeepPHY integrates multiple physical reasoning environments of varying difficulty levels and incorporates fine-grained evaluation metrics. Our evaluation finds that even state-of-the-art VLMs struggle to translate descriptive physical knowledge into precise, predictive control.

AAAI Conference 2026 Conference Paper

Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models

Xuyang Liu
Ziming Wang
Junjie Chen
Yuhang Han
Yingyao Wang
Jiale Yuan
Jun Song
Siteng Huang

Large vision-language models (LVLMs) excel at visual understanding but face efficiency challenges due to quadratic complexity when processing long multimodal contexts. While token compression can reduce computational costs, existing approaches are designed for single-view LVLMs and fail to account for the unique multi-view characteristics of high-resolution LVLMs that use dynamic cropping. Current methods treat all tokens uniformly, yet our analysis shows that global thumbnails can naturally guide the compression of local crops by providing holistic context for evaluating informativeness. In this paper, we first analyze the dynamic cropping strategy, revealing both the complementary relationship between thumbnails and crops and the distinct characteristics across different crops. Based on these insights, we propose "Global Compression Commander" (GlobalCom2), a novel plug-and-play token compression framework for high-resolution LVLMs. GlobalCom2 uses the thumbnail as a "commander" to adaptively guide the compression of local crops, preserving informative details while removing redundancy. Extensive experiments demonstrate that GlobalCom2 maintains over 90% of model performance while compressing 90% of visual tokens, reducing FLOPs to 9.1% and peak memory usage to 60% of the original.

AAAI Conference 2026 Conference Paper

How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective

Bo Peng
Pi Bu
Keyu Pan
Xinrun Xu
Yingxiu Zhao
Miao Chen
Yang Du
Lin Li

Recent advances in vision–language models (VLMs) have shed light on human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents still rely on high-level commands or discretised action spaces, "non-native" settings that diverge markedly from the real world. Moreover, current benchmarks focus exclusively on high-level tasks and lack joint evaluation and analysis at both the low and high levels. To bridge these gaps, we present NativeEmbodied, a challenging benchmark for VLM-driven embodied agents that adopts a unified, native low-level action space. Built upon diverse simulated scenes, NativeEmbodied first designs three representative high-level tasks in complex scenarios to evaluate overall performance. For more detailed and comprehensive performance analysis, we further decouple the entangled skills behind complex tasks and construct four types of low-level tasks, each corresponding to a key fundamental embodied skill. This joint evaluation across task and skill granularities enables a fine-grained assessment of embodied agents. Comprehensive experiments on the best VLMs reveal pronounced deficiencies in certain fundamental embodied skills. Further analysis shows that these bottlenecks severely constrain performance on high-level tasks. NativeEmbodied not only pinpoints the key challenges faced by current VLM-driven embodied agents, but also provides valuable insight for future development of this field.

AAAI Conference 2026 Conference Paper

LLaVA-UHD v2: Exploiting Hierarchical Vision Granularity in MLLMs via Inverse Semantic Pyramid

Yipeng Zhang
Yifan Liu
Zonghao Guo
Yidan Zhang
Xuesong Yang
Xiaoying Zhang
Chi Chen
Jun Song

Vision transformers (ViTs) are widely employed in multimodal large language models (MLLMs) for visual encoding. However, they exhibit inferior performance on tasks requiring fine-grained visual perception. We attribute this to the inherent limitations of ViTs in capturing diverse visual semantic levels. To address this, we present the Hierarchical window (Hiwin) transformer as a plug-and-play solution for MLLMs, centered around our inverse semantic pyramid (ISP). The Hiwin transformer comprises two key modules: (i) a visual detail injection module, which progressively injects low-level visual details into high-level language-aligned semantic features, thereby constructing an ISP, and (ii) a hierarchical window attention module, which leverages cross-scale windows to condense multi-level semantics from the ISP. Notably, our design achieves an average boost of 3.7% across 14 benchmarks compared with the baseline method, including 9.3% on DocVQA.

AAAI Conference 2026 Conference Paper

MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs

Junpeng Ma
Qizhe Zhang
Ming Lu
Zhibin Wang
Qiang Zhou
Jun Song
Shanghang Zhang

Video Large Language Models (VLLMs) excel in video understanding, but their excessive visual tokens pose a significant computational challenge for real-world applications. Current methods aim to enhance inference efficiency through visual token pruning. However, they do not consider the dynamic characteristics and temporal dependencies of video frames, as they treat video understanding as a multi-frame task. To address these challenges, we propose MMG-Vid, a novel training-free visual token pruning framework that removes redundancy by Maximizing Marginal Gains at both the segment level and the token level. Specifically, we first divide the video into segments based on frame similarity, and then dynamically allocate the token budget for each segment to maximize the marginal gain of each segment. Subsequently, we propose a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity, thereby maximizing the marginal gain of each token. By combining both stages, MMG-Vid can maximize the utilization of the limited token budget, significantly improving efficiency while maintaining strong performance. Extensive experiments demonstrate that MMG-Vid can maintain over 99.5% of the original performance while pruning 75% of visual tokens and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B.
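The segment-level stage described above (split by frame similarity, then give each segment a share of the token budget) might look roughly like the sketch below. It rests on stated assumptions: frames are represented by small feature vectors, a fixed cosine-similarity threshold marks segment boundaries, and the budget is split in proportion to segment length as a stand-in for the paper's marginal-gain allocation, which the abstract does not specify. All function names and values are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_into_segments(frames: Sequence[Sequence[float]],
                        threshold: float = 0.8) -> List[List[int]]:
    """Start a new segment when consecutive frames drop below `threshold`."""
    if not frames:
        return []
    segments = [[0]]
    for i in range(1, len(frames)):
        if cosine(frames[i - 1], frames[i]) < threshold:
            segments.append([i])      # scene change: open a new segment
        else:
            segments[-1].append(i)    # similar frame: extend current segment
    return segments


def allocate_budget(segments: List[List[int]], total_budget: int) -> List[int]:
    """Split the budget by segment length (assumed rule, not the paper's)."""
    if not segments:
        return []
    n_frames = sum(len(s) for s in segments)
    budgets = [total_budget * len(s) // n_frames for s in segments]
    # Hand any tokens lost to rounding to the longest segments first.
    leftover = total_budget - sum(budgets)
    by_length = sorted(range(len(segments)), key=lambda i: -len(segments[i]))
    for idx in by_length[:leftover]:
        budgets[idx] += 1
    return budgets


frames = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [0, 1]]  # toy features
segments = split_into_segments(frames)
print(segments, allocate_budget(segments, total_budget=10))
```

In this toy input the feature direction changes sharply between the second and third frames, so the frames fall into two segments and the budget is divided between them.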

EAAI Journal 2025 Journal Article

Automotive fuse & relay box plug-in modules assembly correctness detection system based on machine vision

ZhengWei Gong
Jun Song
Ping Zhang

The automotive fuse and relay box is vital for electrical safety and reliability, demanding stringent quality control before leaving the factory. However, existing methods face limitations such as light interference, inability to detect non-fuse plug-in modules, lack of worker-friendly interfaces, insufficient data recording features, and a lack of comparative diagnostic capabilities for detection results. To address these issues, an artificial intelligence (AI)-powered automotive fuse and relay box assembly correctness detection system based on machine vision is proposed. This system incorporates a closed image acquisition setup, advanced machine vision techniques, and My Structured Query Language (MySQL) database operations for efficient data management. A comprehensive detection rule-setting subsystem, developed with Python Qt 5 (PyQt5) graphical user interface (GUI), integrates classification detection, similarity detection, color detection, and text recognition, allowing users to easily create detection rules. Additionally, a PyQt5-based template selection subsystem further streamlines template identification for various scenarios. The detection system combines these four methods with an object detection method for real-time, accurate assembly verification. The core You Only Look Once version 11 extra-large (YOLOx) model provides fast and precise localization, while supplementary modules (Residual Neural Network with 18 layers for classification detection, Siamese network-based similarity detection, binary character recognition, and color detection) work synergistically to enhance detection robustness and accuracy. The system achieves an average detection time of 0.141 s per module for correct assemblies and 1.398 s for faulty assemblies. Demonstrating 99.9% accuracy, high adaptability, and efficient detection, the system is highly suitable for large-scale, real-world production environments.

AAAI Conference 2025 Conference Paper

POI Recommendation via Multi-Objective Adversarial Imitation Learning

Zhenglin Wan
Anjun Gao
Xingrui Yu
Pingfu Chao
Jun Song
Maohao Ran

Point-of-Interest (POI) recommendation aims to predict users' future locations based on their historical check-ins. Despite the success of recent deep learning approaches in capturing POI semantics and user behavior, they continue to face the persistent problem of data sparsity and incompleteness. In this paper, we introduce Multi-Objective Adversarial Imitation Recommender (MOAIR), a novel framework that integrates Generative Adversarial Imitation Learning with multi-objective optimization to address this issue. MOAIR effectively captures user behavior patterns and spatial-temporal contextual information via a graph-enhanced self-supervised state encoder and overcomes data sparsity by robustly learning from limited data and generating diverse samples. By accommodating diverse user patterns in the training data, the framework also mitigates the typical mode-collapse issue in generative adversarial learning and thus enhances the overall performance. MOAIR employs a multi-objective imitation learning architecture where the imitation learning agent (IL agent) explores the POI space and receives multifaceted reward signals. Utilizing the Paralleled Proximal Policy Optimization (3PO) framework to optimize multiple objectives, the IL agent ensures efficient and stable policy updates. Additionally, to address the issue of high noise in POI recommendation scenarios, we define our policy network in a novel generative way and incorporate a variational bottleneck for regularization to enhance the stability of adversarial learning. Comprehensive experiments reveal the superior performance of MOAIR compared to other baseline approaches, especially with sparse training data.

NeurIPS Conference 2024 Conference Paper

Demystify Mamba in Vision: A Linear Attention Perspective

Dongchen Han
Ziyi Wang
Zhuofan Xia
Yizeng Han
Yifan Pu
Chunjiang Ge
Jun Song
Shiji Song

Mamba is an effective state space model with linear computation complexity. It has recently shown impressive efficiency in dealing with high-resolution inputs across various vision tasks. In this paper, we reveal that the powerful Mamba model shares surprising similarities with the linear attention Transformer, which typically underperforms the conventional Transformer in practice. By exploring the similarities and disparities between the effective Mamba and the subpar linear attention Transformer, we provide comprehensive analyses to demystify the key factors behind Mamba’s success. Specifically, we reformulate the selective state space model and linear attention within a unified formulation, rephrasing Mamba as a variant of the linear attention Transformer with six major distinctions: input gate, forget gate, shortcut, no attention normalization, single-head, and modified block design. For each design, we meticulously analyze its pros and cons, and empirically evaluate its impact on model performance in vision tasks. Interestingly, the results highlight the forget gate and block design as the core contributors to Mamba’s success, while the other four designs are less crucial. Based on these findings, we propose a Mamba-Inspired Linear Attention (MILA) model by incorporating the merits of these two key designs into linear attention. The resulting model outperforms various vision Mamba models in both image classification and high-resolution dense prediction tasks, while enjoying parallelizable computation and fast inference speed. Code is available at https://github.com/LeapLabTHU/MLLA.
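To make the "forget gate" distinction concrete, here is a minimal sketch of causal linear attention written as a recurrence over a running key-value state, with an optional decay applied to that state at each step. It assumes a single scalar gate per step and omits attention normalization, purely for illustration; it is not the paper's formulation or the MILA model, and the function name is hypothetical.

```python
from typing import List, Optional, Sequence


def gated_linear_attention(queries: Sequence[Sequence[float]],
                           keys: Sequence[Sequence[float]],
                           values: Sequence[Sequence[float]],
                           forget_gates: Optional[Sequence[float]] = None
                           ) -> List[List[float]]:
    """Causal, unnormalized linear attention in recurrent form.

    state_t = gate_t * state_{t-1} + outer(k_t, v_t);  y_t = q_t @ state_t.
    With forget_gates=None every gate is 1.0, i.e. plain linear attention.
    The scalar gate is an illustrative simplification (an assumption).
    """
    if not keys:
        return []
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of k^T v
    outputs = []
    for t, (q, k, v) in enumerate(zip(queries, keys, values)):
        gate = 1.0 if forget_gates is None else forget_gates[t]
        for a in range(d_k):
            for b in range(d_v):
                # Decay the old state, then add the new key-value outer product.
                state[a][b] = gate * state[a][b] + k[a] * v[b]
        outputs.append([sum(q[a] * state[a][b] for a in range(d_k))
                        for b in range(d_v)])
    return outputs


q = [[1, 0], [1, 1]]
k = [[1, 0], [1, 1]]
v = [[2.0], [3.0]]
print(gated_linear_attention(q, k, v))                # no forgetting
print(gated_linear_attention(q, k, v, [1.0, 0.5]))    # decay older context
```

With all gates equal to 1.0 the recurrence reproduces the usual causal sum over past key-value pairs; a gate below 1.0 down-weights older positions, which is the kind of behavior the abstract credits as one of the two key factors.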

ICML Conference 2024 Conference Paper

Enhancing Sufficient Dimension Reduction via Hellinger Correlation

Seungbeom Hong
Ilmun Kim
Jun Song

In this work, we develop a new theory and method for sufficient dimension reduction (SDR) in single-index models, where SDR is a sub-field of supervised dimension reduction based on conditional independence. Our work is primarily motivated by the recent introduction of the Hellinger correlation as a dependency measure. Utilizing this measure, we have developed a method capable of effectively detecting the dimension reduction subspace, complete with theoretical justification. Through extensive numerical experiments, we demonstrate that our proposed method significantly enhances and outperforms existing SDR methods. This improvement is largely attributed to our proposed method’s deeper understanding of data dependencies and the refinement of existing SDR techniques.

TMLR Journal 2023 Journal Article

Provably Convergent Policy Optimization via Metric-aware Trust Region Methods

Jun Song
Niao He
Lijun Ding
Chaoyue Zhao

Trust-region methods based on Kullback-Leibler divergence are pervasively used to stabilize policy optimization in reinforcement learning. In this paper, we exploit more flexible metrics and examine two natural extensions of policy optimization with Wasserstein and Sinkhorn trust regions, namely Wasserstein policy optimization (WPO) and Sinkhorn policy optimization (SPO). Instead of restricting the policy to a parametric distribution class, we directly optimize the policy distribution and derive their closed-form policy updates based on Lagrangian duality. Theoretically, we show that WPO guarantees a monotonic performance improvement, and SPO provably converges to WPO as the entropic regularizer diminishes. Moreover, we prove that with a decaying Lagrangian multiplier on the trust region constraint, both methods converge to global optimality. Experiments across tabular domains, robotic locomotion, and continuous control tasks further demonstrate the performance improvement of both approaches, the greater robustness of WPO to sample insufficiency, and the faster convergence of SPO, over state-of-the-art policy gradient methods.

AAAI Conference 2015 Conference Paper

Structured Embedding via Pairwise Relations and Long-Range Interactions in Knowledge Base

Fei Wu
Jun Song
Yi Yang
Xi Li
Zhongfei Zhang
Yueting Zhuang

We consider the problem of embedding entities and relations of knowledge bases into low-dimensional continuous vector spaces (distributed representations). Unlike most existing approaches, which are primarily efficient for modelling pairwise relations between entities, we attempt to explicitly model both pairwise relations and long-range interactions between entities, by interpreting them as linear operators on the low-dimensional embeddings of the entities. Therefore, in this paper we introduce path ranking to capture the long-range interactions of the knowledge graph while preserving its pairwise relations; we call the approach structured embedding via pairwise relations and long-range interactions (referred to as SePLi). Compared with state-of-the-art models, SePLi achieves better embedding performance.