Efficient Segmentation with Multimodal Large Language Model via Token Routing

Changsong Wen; Zelin Peng; Yu Huang; Wei Shen

doi:10.1609/aaai.v40i13.38032

AAAI 2026

Conference Paper AAAI Technical Track on Computer Vision X Artificial Intelligence

Abstract

Recent advances in multimodal large language models (MLLMs) have demonstrated strong capabilities in open-world segmentation tasks. However, the substantial computational cost of the LLM components presents a significant challenge, especially in segmentation, where efficiency has long been a central concern. Existing efficient MLLM approaches typically reduce computation by pruning visual tokens in the early layers, since visual tokens account for the majority of the input sequence. Although efficient, such pruning is incompatible with dense prediction tasks such as segmentation, because removing visual tokens discards essential object parts and spatial details. To better understand the roles of visual tokens in segmentation, we analyze the attention weights of both image and mask tokens within the LLM. We find that image tokens are important throughout all layers, whereas mask tokens attend to image tokens only at deeper layers. Based on this observation, we build an efficient MLLM-based segmentation framework by introducing a token routing strategy that dynamically determines when and how different tokens participate in computation. Mask tokens are inserted only at deeper layers of the LLM to reduce redundant computation, since they rarely attend to image tokens in early layers. For image tokens, only a small subset, named proxies, is updated via full feedforward network (FFN) computation, while the update of the remaining tokens is guided by these proxies, i.e., computed efficiently by a lightweight projector applied to the difference of the proxies during their update. Our method achieves a 1.5× acceleration over the original LLM process by reducing its FLOPs to 56%, while maintaining the same segmentation performance.
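
The two routing rules described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how proxies are chosen or how each remaining token is paired with a proxy, so the nearest-proxy assignment by cosine similarity, the single linear projector `w_proj`, the ReLU FFN, and all function names below are assumptions made purely for illustration.

```python
import numpy as np


def ffn(x, w1, w2):
    """Two-layer feedforward network with ReLU (stand-in for the LLM FFN)."""
    return np.maximum(x @ w1, 0.0) @ w2


def _unit(x):
    """Normalize rows to unit length (small epsilon avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)


def routed_image_update(tokens, proxy_idx, w1, w2, w_proj):
    """Update image tokens: full FFN for proxies, cheap projector for the rest.

    Assumptions (not taken from the paper): each non-proxy token follows its
    most similar proxy (cosine similarity), and the projector is one linear map.
    """
    proxies = tokens[proxy_idx]                  # (P, D)
    new_proxies = proxies + ffn(proxies, w1, w2) # residual FFN update
    delta = new_proxies - proxies                # difference of proxies during update

    is_proxy = np.zeros(len(tokens), dtype=bool)
    is_proxy[proxy_idx] = True
    rest = ~is_proxy

    # Pair every remaining token with its nearest proxy (hypothetical rule).
    nearest = np.argmax(_unit(tokens[rest]) @ _unit(proxies).T, axis=-1)

    out = tokens.copy()
    out[proxy_idx] = new_proxies
    # Remaining tokens skip the FFN and reuse the projected proxy difference.
    out[rest] = tokens[rest] + delta[nearest] @ w_proj
    return out


def maybe_insert_mask_tokens(hidden, mask_tokens, layer, insert_layer):
    """Append mask tokens to the sequence only when `layer` equals `insert_layer`."""
    if layer == insert_layer:
        return np.concatenate([hidden, mask_tokens], axis=0)
    return hidden


# Toy usage: 16 image tokens of width 8, 4 of them acting as proxies.
rng = np.random.default_rng(0)
D, H, N = 8, 32, 16
image_tokens = rng.normal(size=(N, D))
w1, w2 = rng.normal(size=(D, H)), rng.normal(size=(H, D))
w_proj = rng.normal(size=(D, D)) * 0.1
proxy_idx = np.array([0, 4, 8, 12])

updated = routed_image_update(image_tokens, proxy_idx, w1, w2, w_proj)
# Mask tokens join the sequence late (layer 20 is an arbitrary choice here).
hidden = maybe_insert_mask_tokens(updated, np.zeros((2, D)), layer=20, insert_layer=20)
```

In this sketch the FFN runs on 4 of 16 tokens, and the other 12 cost one `D × D` matrix product each, which is where the saving would come from under these assumptions. The proxy count and the insertion layer are placeholders, not values reported by the paper.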
