Arrow Research

Author name cluster

Changsong Wen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

2 papers
2 author rows

Possible papers (2)

AAAI Conference 2026 · Conference Paper

Efficient Segmentation with Multimodal Large Language Model via Token Routing

  • Changsong Wen
  • Zelin Peng
  • Yu Huang
  • Wei Shen

Recent advances in multimodal large language models (MLLMs) have demonstrated strong capabilities on open-world segmentation tasks. However, the substantial computational cost of the LLM component presents a significant challenge, especially in segmentation, where efficiency has long been a central concern. Existing efficient MLLM approaches typically reduce computation by pruning visual tokens in the early layers, since these tokens account for the majority of the input sequence. Although efficient, this strategy is incompatible with dense prediction tasks such as segmentation, because removing visual tokens discards essential object parts and spatial details. To better understand the roles of visual tokens in segmentation, we analyze the attention weights of both image and mask tokens within the LLM. We find that image tokens are important throughout all layers, whereas mask tokens attend to image tokens only at deeper layers. Based on this observation, we build an efficient MLLM-based segmentation framework around a token routing strategy that dynamically determines when and how different tokens participate in computation. Mask tokens are inserted only at deeper layers of the LLM, since they rarely attend to image tokens in early layers, which avoids redundant computation. Among image tokens, only a small subset, named proxies, is updated via the full feedforward network (FFN); the updates of the remaining tokens are guided by these proxies, i.e., computed efficiently by a lightweight projector applied to the change in the proxies during their update. Our method achieves a 1.5× acceleration over the original LLM forward pass by reducing its FLOPs to 56%, while maintaining the same segmentation performance.
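To make the routing idea concrete, below is a minimal, hypothetical PyTorch sketch of the two rules the abstract describes: mask tokens joining the sequence only at a deeper layer, and a small set of proxy image tokens receiving full FFN updates that a lightweight projector then propagates to the remaining tokens. The layer structure, proxy selection, projector design, and all hyperparameters here are illustrative assumptions, not the paper's actual implementation.

# Hypothetical sketch of the token-routing idea from the abstract above.
# All module names, shapes, and hyperparameters are assumptions for
# illustration; the paper's architecture may differ substantially.
import torch
import torch.nn as nn


class RoutedLLMLayer(nn.Module):
    """One transformer-style layer with proxy-guided FFN updates (illustrative)."""

    def __init__(self, dim: int, proxy_ratio: float = 0.25):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Lightweight projector: maps the proxies' FFN update (delta) to an
        # update for the remaining tokens instead of running the full FFN.
        self.projector = nn.Linear(dim, dim)
        self.proxy_ratio = proxy_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(x, x, x)[0]
        n_proxy = max(1, int(x.size(1) * self.proxy_ratio))
        proxies = x[:, :n_proxy]         # assumption: proxies are the first tokens
        delta = self.ffn(proxies)        # full FFN update only for the proxies
        # Guide the rest with a cheap projection of the mean proxy update.
        guided = self.projector(delta.mean(dim=1, keepdim=True))
        guided = guided.expand_as(x[:, n_proxy:])
        return torch.cat([proxies + delta, x[:, n_proxy:] + guided], dim=1)


def forward_with_routing(layers, image_tokens, mask_tokens, insert_at: int):
    """Run image tokens through all layers; insert mask tokens only at `insert_at`."""
    x = image_tokens
    for i, layer in enumerate(layers):
        if i == insert_at:               # mask tokens join only in deeper layers
            x = torch.cat([x, mask_tokens], dim=1)
        x = layer(x)
    return x


if __name__ == "__main__":
    dim = 64
    layers = nn.ModuleList(RoutedLLMLayer(dim) for _ in range(8))
    img = torch.randn(1, 32, dim)
    msk = torch.randn(1, 4, dim)
    out = forward_with_routing(layers, img, msk, insert_at=6)
    print(out.shape)  # torch.Size([1, 36, 64])

The sketch keeps the full attention path intact and only cheapens the FFN for non-proxy tokens, which is where the FLOPs reduction claimed in the abstract would come from under these assumptions.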

ICML Conference 2025 · Conference Paper

Tackling View-Dependent Semantics in 3D Language Gaussian Splatting

  • Jiazhong Cen
  • Xudong Zhou
  • Jiemin Fang
  • Changsong Wen
  • Lingxi Xie
  • Xiaopeng Zhang 0008
  • Wei Shen 0002
  • Qi Tian 0001

Recent advancements in 3D Gaussian Splatting (3D-GS) enable high-quality 3D scene reconstruction from RGB images. Many studies extend this paradigm to language-driven open-vocabulary scene understanding. However, most of them simply project 2D semantic features onto 3D Gaussians and overlook a fundamental gap between 2D and 3D understanding: a 3D object may exhibit different semantics from different viewpoints, a phenomenon we term view-dependent semantics. To address this challenge, we propose LaGa (Language Gaussians), which establishes cross-view semantic connections by decomposing the 3D scene into objects. It then constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Extensive experiments demonstrate that LaGa effectively captures key information from view-dependent semantics, enabling a more comprehensive understanding of 3D scenes. Notably, under the same settings, LaGa achieves a significant improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset. Our code is available at https://github.com/SJTU-DeepVisionLab/LaGa.
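As a rough illustration of the view-aggregation step, the following hypothetical Python sketch clusters per-view semantic descriptors of one object with k-means and reweights each cluster by how many views support it, then scores a text query against the weighted centers. The clustering choice, the view-count weighting rule, and all function names are assumptions made for illustration, not LaGa's actual method; see the linked repository for the real implementation.

# Hypothetical sketch of view-aggregated semantics in the spirit of LaGa.
# The weighting rule and names below are assumptions, not the paper's method.
import numpy as np
from sklearn.cluster import KMeans


def view_aggregated_descriptors(view_feats: np.ndarray, n_clusters: int = 3):
    """Cluster per-view semantic descriptors of one 3D object and
    reweight each cluster by how many views support it.

    view_feats: (n_views, dim) L2-normalized semantic features, e.g. CLIP.
    Returns cluster centers (n_clusters, dim) and weights (n_clusters,).
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(view_feats)
    centers = km.cluster_centers_
    centers /= np.linalg.norm(centers, axis=1, keepdims=True)
    # Assumption: weight = fraction of views falling into each cluster, so
    # semantics observed from many viewpoints dominate the aggregate.
    weights = np.bincount(km.labels_, minlength=n_clusters) / len(view_feats)
    return centers, weights


def relevance(text_feat: np.ndarray, centers: np.ndarray, weights: np.ndarray) -> float:
    """Weighted cosine relevance of a text query to the aggregated semantics."""
    sims = centers @ text_feat  # both sides assumed L2-normalized
    return float((weights * sims).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(12, 8))          # stand-in for 12 views of one object
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    centers, w = view_aggregated_descriptors(feats)
    query = feats[0]                          # stand-in for a text embedding
    print(round(relevance(query, centers, w), 3))

Under these assumptions, a semantic reading that appears in only one or two views ends up in a low-weight cluster, which is one plausible way the reweighting could keep view-dependent outliers from dominating the object's aggregated representation.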