NeurIPS 2025

Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

Conference Paper Main Conference Track Artificial Intelligence · Machine Learning

Abstract

Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision–language models (VLMs). Group Relative Policy Optimization (GRPO) is a prominent recent method that encourages models to generate complete reasoning traces before answering, which increases token usage and computational cost. Inspired by human thinking, where people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective “thought dropout” operation, in which reasoning traces are randomly replaced with empty thoughts, introducing a think-or-not format that serves as a cold start for selective reasoning; and (ii) a GRPO stage that lets the model freely explore when to think or not while maximizing task-aware outcome rewards. Experimental results show that TON can reduce completion length by up to 90% compared to vanilla GRPO without sacrificing performance, and in some cases improves it. Further evaluations across LLM (GSM8K), VLM (CLEVR, Super-CLEVR, GeoQA), and agentic (AITZ) tasks, covering a range of reasoning difficulties with both 3B and 7B models, consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in RL approaches. Our code is available at https://github.com/kokolerk/TON.
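The thought dropout step in stage (i) can be illustrated with a minimal sketch. This is not the authors' implementation: the `<think>`/`<answer>` tag format, the field names `thought` and `answer`, the `drop_prob` parameter, and both helper functions are assumptions made for illustration only.

```python
import random


def thought_dropout(examples, drop_prob=0.5, seed=0):
    """Return a copy of `examples` in which each reasoning trace is replaced
    by an empty thought with probability `drop_prob`.

    Field names and the default probability are hypothetical, not taken
    from the paper.
    """
    rng = random.Random(seed)  # local RNG so the result is reproducible
    dropped = []
    for ex in examples:
        thought = ex["thought"]
        if rng.random() < drop_prob:
            thought = ""  # empty thought: the model sees "no reasoning" targets
        dropped.append({**ex, "thought": thought})  # do not mutate the input
    return dropped


def format_target(ex):
    """Render an SFT target in an assumed think-or-not tag format."""
    return f"<think>\n{ex['thought']}\n</think>\n<answer>{ex['answer']}</answer>"


if __name__ == "__main__":
    data = [
        {"question": "How many cubes are red?", "thought": "Count the red cubes.", "answer": "2"},
        {"question": "What is 3 + 4?", "thought": "3 plus 4 equals 7.", "answer": "7"},
    ]
    for ex in thought_dropout(data, drop_prob=0.5, seed=0):
        print(format_target(ex))
```

Because the answer is kept whether or not the trace is dropped, the SFT data contains both "think" and "skip" targets in the same format, which is what gives the later GRPO stage two behaviors to choose between.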

Context

Venue
Annual Conference on Neural Information Processing Systems
Paper id
697568778956273812