Arrow Research

Author name cluster

Zhenbo Luo

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
1 author row

Possible papers (5)

AAAI Conference 2026 Conference Paper

AutoLink: Autonomous Schema Exploration and Expansion for Scalable Schema Linking in Text-to-SQL at Scale

  • Ziyang Wang
  • Yuanlei Zheng
  • Zhenbiao Cao
  • Xiaojin Zhang
  • Zhongyu Wei
  • Pei Fu
  • Zhenbo Luo
  • Wei Chen

For industrial-scale text-to-SQL, supplying the entire database schema to Large Language Models (LLMs) is impractical due to context window limits and irrelevant noise. Schema linking, which filters the schema to a relevant subset, is therefore critical. However, existing methods incur prohibitive costs, struggle to trade off recall and noise, and scale poorly to large databases. We present AutoLink, an autonomous agent framework that reformulates schema linking as an iterative, agent-driven process. Guided by an LLM, AutoLink dynamically explores and expands the linked schema subset, progressively identifying necessary schema components without inputting the full database schema. Our experiments demonstrate AutoLink's superior performance, achieving state-of-the-art strict schema linking recall of 97.4% on Bird-Dev and 91.2% on Spider 2.0-Lite, with competitive execution accuracy, i.e., 68.7% EX on Bird-Dev (better than CHESS) and 34.9% EX on Spider 2.0-Lite (ranking 2nd on the official leaderboard). Crucially, AutoLink exhibits exceptional scalability, maintaining high recall, efficient token consumption, and robust execution accuracy on large schemas (e.g., over 3,000 columns) where existing methods severely degrade—making it a highly scalable, high-recall schema-linking solution for industrial text-to-SQL systems.
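The abstract frames schema linking as an iterative, LLM-driven explore-and-expand loop over the database schema. The sketch below illustrates what such a loop could look like; all names (the catalog object, seed_columns, propose_expansion) are hypothetical placeholders for illustration, not AutoLink's actual interface.

```python
from typing import Set

def link_schema(question: str, catalog, llm, max_rounds: int = 8) -> Set[str]:
    """Grow a small linked-schema subset instead of prompting with the full schema."""
    # Seed with columns from a cheap lexical/embedding retriever (assumption).
    linked: Set[str] = set(catalog.seed_columns(question, top_k=20))
    for _ in range(max_rounds):
        # Ask the LLM which tables/columns it still needs, given only the question
        # and the currently linked subset -- the full schema is never supplied.
        proposal = llm.propose_expansion(question=question,
                                         linked_schema=catalog.describe(linked))
        new_cols = {c for c in proposal.columns if catalog.exists(c)} - linked
        if not new_cols and proposal.sufficient:
            break  # the agent judges the subset sufficient to write the SQL
        linked |= new_cols
    return linked
```

Because the loop only ever serializes the currently linked subset, token cost stays roughly proportional to the relevant schema rather than to the full database, which is the scalability argument the abstract makes.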

AAAI Conference 2026 Conference Paper

Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution

  • Dingkang Liang
  • Cheng Zhang
  • Xiaopeng Xu
  • Jianzhong Ju
  • Zhenbo Luo
  • Xiang Bai

Task scheduling has become increasingly critical for embodied AI, where agents need to follow natural language instructions and execute actions efficiently in 3D physical worlds. Existing datasets for task planning in 3D environments often simplify the problem, lacking both the operations research knowledge needed for task scheduling and the 3D grounding needed for real-world applications. In this work, we propose Operations Research Knowledge-based 3D Grounded Task Scheduling (OKS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization for embodied agents. OKS3D reflects real-world demands by requiring agents to generate efficient, step-by-step schedules that are grounded in 3D space. To facilitate research on OKS3D, we construct a large-scale dataset called OKS3D-60K, comprising 60K tasks across 4K real-world scenes. Furthermore, we propose GRANT, an embodied multi-modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on the OKS3D-60K dataset validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency.
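The scheduling intuition behind OKS3D is easiest to see with a toy kitchen example: steps with passive waiting can overlap with hands-on steps, so an efficient schedule beats naive sequential execution. The snippet below only illustrates that intuition; it is not the paper's GRANT model or its scheduling-token mechanism, and the durations are invented.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    duration: int   # minutes
    passive: bool   # True if the agent is free while the step runs

steps = [Step("cook rice", 20, True),
         Step("chop vegetables", 8, False),
         Step("wipe counter", 5, False)]

sequential = sum(s.duration for s in steps)                   # 33 min, one step at a time
hands_on = sum(s.duration for s in steps if not s.passive)    # 13 min of active work
longest_wait = max((s.duration for s in steps if s.passive), default=0)
interleaved = max(hands_on, longest_wait)                     # 20 min when waits overlap work

print(f"sequential: {sequential} min, interleaved: {interleaved} min")
```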

NeurIPS Conference 2025 Conference Paper

BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent

  • Shaojie Zhang
  • Ruoceng Zhang
  • Pei Fu
  • Shaokang Wang
  • Jiahui Yang
  • Xin Du
  • Bin Qin
  • Ying Huang

In the field of AI-driven human-GUI interaction automation, rapid advances in multimodal large language models and reinforcement fine-tuning techniques have yielded remarkable progress, yet a fundamental challenge persists: their interaction logic deviates significantly from natural human-GUI communication patterns. To address this gap, we propose Blink–Think–Link (BTL), a brain-inspired framework for human-GUI interaction that mimics the human cognitive process between users and graphical interfaces. The system decomposes interactions into three biologically plausible phases: (1) **Blink**, rapid detection of and attention to relevant screen areas, analogous to saccadic eye movements; (2) **Think**, higher-level reasoning and decision-making, mirroring cognitive planning; and (3) **Link**, generation of executable commands for precise motor control, emulating human action-selection mechanisms. Additionally, we introduce two key technical innovations for the BTL framework: (1) Blink Data Generation, an automated annotation pipeline specifically optimized for blink data, and (2) BTL Reward, the first rule-based reward mechanism that enables reinforcement learning driven by both process and outcome. Building upon this framework, we develop a GUI agent model named BTL-UI, which demonstrates competitive performance across both static GUI understanding and dynamic interaction tasks in comprehensive benchmarks. These results provide conclusive empirical validation of the framework's efficacy in developing advanced GUI agents.
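The abstract describes BTL Reward only as a rule-based mechanism that scores both process and outcome. The sketch below shows one plausible shape for such a reward; the phase tags, weights, and click-format parsing are assumptions for illustration, not the paper's specification.

```python
import re

def btl_style_reward(response: str, gold_action: str, gold_region: tuple) -> float:
    reward = 0.0
    # Process term: the three phases appear in order (assumed response format).
    if re.search(r"<blink>.*</blink>\s*<think>.*</think>\s*<link>.*</link>", response, re.S):
        reward += 0.3
    # Outcome term: the predicted click lands inside the target box (x1, y1, x2, y2).
    m = re.search(r"click\((\d+),\s*(\d+)\)", response)
    if m:
        x, y = int(m.group(1)), int(m.group(2))
        x1, y1, x2, y2 = gold_region
        if x1 <= x <= x2 and y1 <= y <= y2:
            reward += 0.5
    # Outcome term: the emitted action type matches the reference action string.
    if gold_action in response:
        reward += 0.2
    return reward
```

Because every term is checked by a deterministic rule rather than a learned critic, the reward is verifiable and cheap to compute during reinforcement fine-tuning.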

NeurIPS Conference 2025 Conference Paper

Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains

  • Wenhui Tan
  • Jiaze Li
  • Jianzhong Ju
  • Zhenbo Luo
  • Ruihua Song
  • Jian Luan

Large Language Models (LLMs) achieve superior performance through Chain-of-Thought (CoT) reasoning, but these token-level reasoning chains are computationally expensive and inefficient. In this paper, we introduce Compressed Latent Reasoning (CoLaR), a novel framework that dynamically compresses reasoning processes in latent space through a two-stage training approach. First, during supervised fine-tuning, CoLaR extends beyond next-token prediction by incorporating an auxiliary next compressed embedding prediction objective. This process merges embeddings of consecutive tokens using a compression factor $c$ randomly sampled from a predefined range, and trains a specialized latent head to predict distributions of subsequent compressed embeddings. Second, we enhance CoLaR through reinforcement learning (RL) that leverages the latent head's non-deterministic nature to explore diverse reasoning paths and exploit more compact ones. This approach enables CoLaR to: i) **perform reasoning at a dense latent level** (i.e., silently), substantially reducing reasoning chain length, and ii) **dynamically adjust reasoning speed** at inference time by simply prompting the desired compression factor. Extensive experiments across four mathematical reasoning datasets demonstrate that CoLaR achieves 14.1% higher accuracy than latent-based baseline methods at comparable compression ratios, and reduces reasoning chain length by 53.3% with only 4.8% performance degradation compared to the explicit CoT method. Moreover, when applied to more challenging mathematical reasoning tasks, our RL-enhanced CoLaR demonstrates performance gains of up to 5.4% while dramatically reducing latent reasoning chain length by 82.8%. The code and models will be released upon acceptance.
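A minimal sketch of the compression step the abstract describes: consecutive token embeddings are merged in groups of a factor c sampled from a predefined range. Mean pooling is assumed here as the merge operator; the abstract only states that consecutive embeddings are merged.

```python
import random
import torch

def compress_embeddings(emb: torch.Tensor, c_range=(2, 5)):
    """emb: (seq_len, hidden). Merge consecutive embeddings in groups of a sampled c."""
    c = random.randint(*c_range)                      # compression factor, sampled per sequence
    seq_len, hidden = emb.shape
    pad = (-seq_len) % c                              # zero-pad so the length divides evenly
    if pad:
        emb = torch.cat([emb, emb.new_zeros(pad, hidden)], dim=0)
    merged = emb.view(-1, c, hidden).mean(dim=1)      # assumed merge operator: mean pooling
    return merged, c
```

The latent head would then be trained to predict the next compressed embedding, which is what shortens the reasoning chain by roughly a factor of c at inference time.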

NeurIPS Conference 2025 Conference Paper

Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

  • Ye Wang
  • Ziheng Wang
  • Boshen Xu
  • Yang Du
  • Kejun Lin
  • Zihan Xiao
  • Zihao Yue
  • Jianzhong Ju

Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their ability to generalize remains limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework via RL with verifiable reward to enhance the capabilities of LVLMs on the TVG task. (2) TimeRFT: we explore post-training strategies on our curated RL-friendly dataset, which trains the model to progressively comprehend more difficult samples, leading to better generalization. (3) TVGBench: we carefully construct a small but comprehensive and balanced benchmark suitable for LVLM evaluation, which is sourced from available public benchmarks. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using significantly less training data than prior LVLM approaches, while improving its general video understanding capabilities. Project Page: https://xuboshen.github.io/Time-R1/.
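For temporal grounding, a natural verifiable reward is the temporal IoU between the predicted and annotated segments, since it can be computed exactly from the model's output. The sketch below uses that standard metric; whether Time-R1's reward takes exactly this form is an assumption.

```python
def temporal_iou(pred, gold):
    """Temporal IoU between two (start_sec, end_sec) segments."""
    (ps, pe), (gs, ge) = pred, gold
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)   # hull equals the union whenever the segments overlap
    return inter / union if union > 0 else 0.0

def grounding_reward(pred_segment, gold_segment, iou_threshold=0.5):
    iou = temporal_iou(pred_segment, gold_segment)
    # Dense IoU signal plus a bonus once the prediction clears the threshold (assumed shaping).
    return iou + (0.5 if iou >= iou_threshold else 0.0)

print(grounding_reward((12.0, 30.0), (15.0, 28.0)))  # prints ~1.22 for this example pair
```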