Arrow Research search

Author name cluster

Run Luo

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers

7

ICLR Conference 2025 Conference Paper

DEEM: Diffusion models serve as the eyes of large language models for image perception

  • Run Luo
  • Yunshui Li
  • Longze Chen
  • Wanwei He
  • Ting-En Lin
  • Ziqiang Liu
  • Lei Zhang 0201
  • Zikai Song

The development of large language models (LLMs) has significantly advanced the emergence of large multimodal models (LMMs). While LMMs have achieved tremendous success by promoting the synergy between multimodal comprehension and creation, they often face challenges when confronted with out-of-distribution data, for which they can hardly distinguish orientation, quantity, color, structure, etc. This is primarily due to their reliance on image encoders trained to encode images into task-relevant features, which may lead them to disregard irrelevant details. Delving into the modeling capabilities of diffusion models for images naturally prompts the question: Can diffusion models serve as the eyes of large language models for image perception? In this paper, we propose DEEM, a simple but effective approach that uses the generative feedback of diffusion models to align the semantic distributions of the image encoder. This addresses the drawbacks of previous methods that relied solely on image encoders such as CLIP-ViT, enhancing the model's resilience to out-of-distribution samples and reducing visual hallucinations. Importantly, this is achieved without requiring additional training modules and with fewer training parameters. We extensively evaluated DEEM on both our newly constructed RobustVQA benchmark and other well-known benchmarks, POPE and MMVP, for visual hallucination and perception. In particular, DEEM improves LMMs' visual perception performance to a large extent (e.g., 4% ↑ on RobustVQA, 6.5% ↑ on MMVP, and 12.8% ↑ on POPE). Compared to state-of-the-art interleaved content generation models, DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while utilizing fewer trainable parameters, less pre-training data (10%), and a smaller base model size. Extensive experiments further demonstrate that DEEM enhances the performance of LMMs on various downstream tasks, including visual question answering, image captioning, and text-conditioned image synthesis, without compromising performance on other tasks.
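The core idea in the abstract, using the diffusion model's generative feedback as a regularizer on the image encoder, can be sketched as a toy training objective. Everything below (the function names, the reduction of images to feature vectors, the scalar weighting) is an illustrative assumption, not the paper's actual implementation:

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def deem_alignment_loss(encoder_feats, diffusion_recon_feats, task_loss, weight=0.5):
    """Toy DEEM-style objective: the diffusion model's reconstruction of the
    image (here reduced to a feature vector) provides generative feedback
    that regularizes the image encoder, discouraging it from discarding
    details the task loss alone would treat as irrelevant."""
    feedback = mse(encoder_feats, diffusion_recon_feats)
    return task_loss + weight * feedback
```

When the encoder's features already match the diffusion reconstruction, the feedback term vanishes and only the task loss remains.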

AAAI Conference 2025 Conference Paper

Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs

  • Lei Zhang
  • Yunshui Li
  • Jiaming Li
  • Xiaobo Xia
  • Jiaxi Yang
  • Run Luo
  • Minzheng Wang
  • Longze Chen

Some of the latest released Code Large Language Models (Code LLMs) have been trained on repository-level code data, enabling them to perceive repository structures and utilize cross-file code information. This capability allows us to directly concatenate the content of repository code files in prompts to achieve repository-level code completion. However, in real development scenarios, directly concatenating all code repository files in a prompt can easily exceed the context window of Code LLMs, leading to a significant decline in completion performance. Additionally, overly long prompts can increase completion latency, negatively impacting the user experience. In this study, we conducted extensive experiments, including completion error analysis, topology dependency analysis, and cross-file content analysis, to investigate the factors affecting repository-level code completion. Based on the conclusions drawn from these preliminary experiments, we proposed a strategy called Hierarchical Context Pruning (HCP) to construct high-quality completion prompts. We applied HCP to six Code LLMs and evaluated them on the CrossCodeEval dataset. The experimental results showed that, compared to previous methods, the prompts constructed using our HCP strategy achieved higher completion accuracy on five out of six Code LLMs. Additionally, HCP managed to keep the prompt length around 8k tokens (whereas the full repository code is approximately 50k tokens), significantly improving completion throughput. Our code and data will be publicly available.
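The pruning idea described above, keeping only the most relevant repository context within a fixed token budget, can be sketched as a simple greedy selection. The flat relevance scores and tuple representation are stand-ins for the paper's hierarchical scoring, assumed here purely for illustration:

```python
def hcp_prune(files, budget):
    """files: list of (relevance, token_count, text) tuples for candidate
    cross-file context. Greedily keep the highest-relevance files whose
    combined token count fits the budget (~8k tokens in the abstract);
    everything else is pruned from the completion prompt."""
    kept, used = [], 0
    for relevance, tokens, text in sorted(files, key=lambda f: -f[0]):
        if used + tokens <= budget:
            kept.append(text)
            used += tokens
    return kept
```

A real implementation would score files by dependency topology rather than a single scalar, but the budget-constrained selection is the same shape.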

NeurIPS Conference 2025 Conference Paper

OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-time Emotional Speech Synthesis

  • Run Luo
  • Ting-En Lin
  • Haonan Zhang
  • Yuchuan Wu
  • Xiong Liu
  • Yongbin Li
  • Longze Chen
  • Jiaming Li

Recent advancements in omnimodal learning have significantly improved understanding and generation across images, text, and speech, yet these developments remain predominantly confined to proprietary models. The lack of high-quality omnimodal datasets and the challenges of real-time emotional speech synthesis have notably hindered progress in open-source research. To address these limitations, we introduce OpenOmni, a two-stage training framework that integrates omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pretrained speech model undergoes further training on image-text tasks, enabling (near) zero-shot generalization from vision to speech and outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder is trained on speech tasks with direct preference optimization, which enables real-time emotional speech synthesis with high fidelity. Extensive experiments demonstrate that OpenOmni surpasses state-of-the-art models across omnimodal, vision-language, and speech-language benchmarks. It achieves a 4-point absolute improvement on OmniBench over the leading open-source model VITA, despite using 5× fewer training examples and a smaller model size (7B vs. 7×8B). Moreover, OpenOmni achieves real-time speech generation with less than 1 second of latency in non-autoregressive mode, reducing inference time by 5× compared to autoregressive methods, and improves emotion classification accuracy by 7.7%. The codebase is available at https://github.com/RainBowLuoCS/OpenOmni.

AAAI Conference 2025 Conference Paper

Temporal Coherent Object Flow for Multi-Object Tracking

  • Zikai Song
  • Run Luo
  • Lintao Ma
  • Ying Tang
  • Yi-Ping Phoebe Chen
  • Junqing Yu
  • Wei Yang

Multi-object tracking is a challenging vision task that requires simultaneous reasoning about object detection and object association. Conventional solutions use the frame as the basic unit and typically rely on a motion predictor that exploits appearance features to associate detected candidates, leading to insufficient adaptability to long-term associations. In this study, we propose a section-based multi-object tracking approach that integrates a temporally coherent Object Flow Tracker (OFTrack), capable of achieving simultaneous multi-frame tracking by treating multiple consecutive frames as the basic processing unit, denoted as a “section”. Our OFTrack lifts optical flow to object flow by employing object-perception and section-based motion-estimation strategies. Object perception adopts object-aware sampling and scale-aware correlation to enable precise target discrimination. Motion estimation models the correlation of different objects across multiple frames via specialized temporal-spatial attention to achieve robust association in very long videos. Additionally, to address the oscillation of unpredictable trajectories in multi-frame estimation, we design temporal-coherence enhancements, including trajectory-masking pre-training and a smoothing constraint on trajectory curves. Comprehensive experiments on several widely used benchmarks demonstrate the superior performance of our approach.
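Two mechanical pieces of the abstract are easy to make concrete: grouping frames into "sections" as the basic processing unit, and penalizing trajectory oscillation. Both functions below are simplified illustrations under assumed representations (a flat frame list, 1-D box centers), not the authors' code:

```python
def split_into_sections(frames, section_len):
    """Group consecutive frames into fixed-length "sections", the basic
    processing unit the tracker operates on; the final section may be
    shorter than section_len."""
    return [frames[i:i + section_len] for i in range(0, len(frames), section_len)]

def trajectory_smoothness(centers):
    """Oscillation penalty: sum of squared second differences along a 1-D
    trajectory of box centers. A linear (constant-velocity) trajectory
    scores zero; a zig-zag trajectory is penalized heavily. This is a
    simplified form of a smoothing constraint on trajectory curves."""
    return sum((centers[i + 1] - 2 * centers[i] + centers[i - 1]) ** 2
               for i in range(1, len(centers) - 1))
```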

NeurIPS Conference 2025 Conference Paper

VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning

  • Run Luo
  • Renke Shan
  • Longze Chen
  • Ziqiang Liu
  • Lu Wang
  • Min Yang
  • Xiaobo Xia

Large vision-language models (LVLMs) have emerged as foundational tools for real-world AI applications. Despite their remarkable capabilities, current LVLMs process entire images at the token level, leading to significant inefficiencies compared to human cognition, which selectively focuses on high-level vision concepts. This token-level redundancy becomes increasingly problematic for high-resolution images and long video sequences, resulting in large computational costs and limited scalability in practical applications. To address this limitation, we introduce the concept of a vision concept model, a novel paradigm that enables LVLMs to dynamically extract the most relevant vision concepts from complex inputs, based on task-specific instructions. To optimize this vision concept modeling process, we propose VCM, a self-supervised framework that leverages vision-language correlations across diverse instances. VCM is designed to learn meaningful vision concepts without the need for expensive concept-level annotations. At its core, it employs a forward-backward optimization algorithm that enables LVLMs to adjust concept granularity and spatial alignment dynamically. Experiments demonstrate that VCM remarkably reduces computational costs (e.g., achieving up to 85% fewer FLOPs for LLaVA-1.5-7B) while maintaining strong performance across a series of vision-language tasks. The codebase is available at https://github.com/RainBowLuoCS/VCM.
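The token-compression idea at the heart of the abstract, keeping only the vision tokens most relevant to the instruction, reduces to a top-k selection in its simplest form. The precomputed scalar relevance scores below stand in for VCM's learned, self-supervised selection and are an assumption for illustration only:

```python
def compress_vision_tokens(tokens, scores, keep_ratio):
    """Keep the fraction of vision tokens with the highest
    instruction-relevance scores, preserving their original spatial
    order. A much-simplified stand-in for VCM's learned selection."""
    k = max(1, int(len(tokens) * keep_ratio))
    top_k = sorted(range(len(tokens)), key=lambda i: -scores[i])[:k]
    return [tokens[i] for i in sorted(top_k)]
```

Note the final `sorted(top_k)`: the surviving tokens stay in image order, which matters if the downstream LLM relies on positional structure.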

AAAI Conference 2024 Conference Paper

DiffusionTrack: Diffusion Model for Multi-Object Tracking

  • Run Luo
  • Zikai Song
  • Lintao Ma
  • Jinlin Wei
  • Wei Yang
  • Min Yang

Multi-object tracking (MOT) is a challenging vision task that aims to detect individual objects within a single frame and associate them across multiple frames. Recent MOT approaches can be categorized into two-stage tracking-by-detection (TBD) methods and one-stage joint detection and tracking (JDT) methods. Despite the success of these approaches, they also suffer from common problems, such as harmful global or local inconsistency, a poor trade-off between robustness and model complexity, and a lack of flexibility in different scenes within the same video. In this paper, we propose a simple but robust framework that formulates object detection and association jointly as a consistent denoising diffusion process from paired noise boxes to paired ground-truth boxes. This novel progressive denoising diffusion strategy substantially augments the tracker's effectiveness, enabling it to discriminate between various objects. During the training stage, paired object boxes diffuse from paired ground-truth boxes to a random distribution, and the model learns detection and tracking simultaneously by reversing this noising process. In inference, the model refines a set of paired randomly generated boxes to the detection and tracking results in a flexible one-step or multi-step denoising diffusion process. Extensive experiments on three widely used MOT benchmarks, including MOT17, MOT20, and DanceTrack, demonstrate that our approach achieves competitive performance compared to the current state-of-the-art methods. Code is available at https://github.com/RainBowLuoCS/DiffusionTrack.
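The training-time forward process the abstract describes (paired boxes diffusing from ground truth toward a random distribution) can be sketched as follows. The linear signal-decay schedule, the (cx, cy, w, h) box parameterization, and the function names are illustrative assumptions, not the paper's implementation:

```python
import math
import random

def diffuse_boxes(gt_boxes, t, num_steps=1000, seed=0):
    """Forward diffusion for training: at timestep t, each ground-truth
    box (cx, cy, w, h) is mixed with Gaussian noise. At t = 0 the boxes
    are unchanged; as t approaches num_steps they approach pure noise.
    A linear signal-decay schedule is assumed for illustration."""
    rng = random.Random(seed)
    alpha = 1.0 - t / num_steps            # fraction of signal retained
    noisy = []
    for box in gt_boxes:
        noise = [rng.gauss(0.0, 1.0) for _ in box]
        noisy.append(tuple(math.sqrt(alpha) * b + math.sqrt(1.0 - alpha) * n
                           for b, n in zip(box, noise)))
    return noisy
```

The tracker is then trained to reverse this process, and at inference it denoises randomly initialized paired boxes in one or several steps.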

AAAI Conference 2023 Conference Paper

Compact Transformer Tracker with Correlative Masked Modeling

  • Zikai Song
  • Run Luo
  • Junqing Yu
  • Yi-Ping Phoebe Chen
  • Wei Yang

The Transformer framework has shown superior performance in visual object tracking owing to its great strength in aggregating information across the template and search image with the well-known attention mechanism. Most recent advances focus on exploring attention-mechanism variants for better information aggregation. We find these schemes are equivalent to, or even just a subset of, the basic self-attention mechanism. In this paper, we prove that the vanilla self-attention structure is sufficient for information aggregation, and structural adaptation is unnecessary. The key is not the attention structure, but how to extract the discriminative feature for tracking and enhance the communication between the target and search image. Based on this finding, we adopt the basic vision transformer (ViT) architecture as our main tracker and concatenate the template and search image for feature embedding. To guide the encoder to capture the invariant feature for tracking, we attach a lightweight correlative masked decoder which reconstructs the original template and search image from the corresponding masked tokens. The correlative masked decoder serves as a plugin for the compact transformer tracker and is skipped in inference. Our compact tracker uses the simplest structure, consisting only of a ViT backbone and a box head, and can run at 40 fps. Extensive experiments show the proposed compact transformer tracker outperforms existing approaches, including advanced attention variants, and demonstrates the sufficiency of self-attention in tracking tasks. Our method achieves state-of-the-art performance on five challenging datasets: VOT2020, UAV123, LaSOT, TrackingNet, and GOT-10k. Our project is available at https://github.com/HUSTDML/CTTrack.
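The masking step that feeds the correlative masked decoder, randomly hiding a fraction of the concatenated template and search tokens so the decoder must reconstruct them, can be sketched as below. The string tokens and `[MASK]` placeholder are assumptions for illustration; in the real model these would be patch embeddings:

```python
import random

def mask_for_reconstruction(tokens, mask_ratio, seed=0):
    """Randomly replace a fraction of concatenated template+search tokens
    with a [MASK] placeholder. A training-only decoder reconstructing the
    originals forces the encoder to learn correlative features; at
    inference the decoder (and this masking) is skipped entirely."""
    rng = random.Random(seed)
    n_mask = int(len(tokens) * mask_ratio)
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    visible = ["[MASK]" if i in masked_idx else tok
               for i, tok in enumerate(tokens)]
    return visible, masked_idx
```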