Author name cluster

Lin Song

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers

1 author row

NeurIPS Conference 2025 Conference Paper

MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

Yicheng Xiao
Lin Song
Yukang Chen
Yingmin Luo
Yuxin Chen
Yukang Gan
Wei Huang
Xiu Li

Recent text-to-image systems face limitations in handling multimodal inputs and complex reasoning tasks. We introduce MindOmni, a unified multimodal large language model that addresses these challenges by incorporating reasoning generation through reinforcement learning. MindOmni leverages a three-phase training strategy: i) design of a unified vision language model with a decoder-only diffusion module, ii) supervised fine-tuning with Chain-of-Thought (CoT) instruction data, and iii) our proposed Reasoning Generation Policy Optimization (RGPO) algorithm, which uses multimodal feedback to guide policy updates. Experimental results demonstrate that MindOmni outperforms existing models on both understanding and generation benchmarks while showcasing advanced fine-grained reasoning generation capabilities, especially with mathematical reasoning instructions. All code will be made public.


NeurIPS Conference 2024 Conference Paper

MambaTree: Tree Topology is All You Need in State Space Model

Yicheng Xiao
Lin Song
Shaoli Huang
Jiangshan Wang
Siyu Song
Yixiao Ge
Xiu Li
Ying Shan

State space models, which employ recursively propagated features, demonstrate strong representation capabilities comparable to Transformer models and superior efficiency. However, constrained by the inherent geometric constraints of sequences, they still fall short in modeling long-range dependencies. To address this issue, we propose the MambaTree network, which first dynamically generates a tree topology based on spatial relationships and input features. Then, feature propagation is performed based on this graph, thereby breaking the original sequence constraints to achieve stronger representation capabilities. Additionally, we introduce a linear complexity dynamic programming algorithm to enhance long-range interactions without increasing computational cost. MambaTree is a versatile multimodal framework that can be applied to both visual and textual tasks. Extensive experiments demonstrate that our method significantly outperforms existing structured state space models on image classification, object detection and segmentation. Besides, by fine-tuning large language models, our approach achieves consistent improvements in multiple textual tasks at minor training cost.
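
The abstract above mentions feature propagation over a tree with a linear complexity dynamic programming algorithm. As a rough illustration only, and not the MambaTree implementation, the sketch below shows the classic two-pass tree aggregation. It assumes each output is a sum of all node values weighted by the product of edge weights along the connecting tree path. The names `tree_aggregate`, `children`, and `weights` are hypothetical.

```python
def tree_aggregate(values, children, weights, root=0):
    """Compute out[i] = sum_j S(i, j) * values[j], where S(i, j) is the product
    of edge weights on the tree path between i and j, and S(i, i) = 1.

    weights[c] is the weight of the edge between node c and its parent
    (the entry for the root is unused). This is an illustrative assumption,
    not the propagation rule from the paper.
    """
    # Iterative depth-first order: every parent appears before its children.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children[node])

    # Pass 1 (leaves to root): aggregate the contribution of each subtree.
    up = list(values)
    for node in reversed(order):
        for child in children[node]:
            up[node] += weights[child] * up[child]

    # Pass 2 (root to leaves): add what lies outside each subtree.
    out = list(up)
    for node in order:
        for child in children[node]:
            w = weights[child]
            # out[node] already contains w * up[child]; remove it before
            # passing the parent's total down through the edge.
            out[child] = up[child] + w * (out[node] - w * up[child])
    return out


# A three-node chain 0 - 1 - 2 with edge weight 0.5 on both edges.
result = tree_aggregate(
    values=[1, 1, 1],
    children=[[1], [2], []],
    weights=[0.0, 0.5, 0.5],
)
```

Each node is visited a constant number of times per pass, so the cost grows linearly with the number of nodes, whereas summing over all pairs directly would be quadratic.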


YNICL Journal 2024 Journal Article

Mapping grey matter and cortical thickness alterations associated with subjective cognitive decline and mild cognitive impairment among rural-dwelling older adults in China: A population-based study

Ziwei Chen
Qianqian Xie
Jiafeng Wang
Yan Wang
Huisi Zhang
Chunyan Li
Yongxiang Wang
Lin Cong

BACKGROUND: The structural brain alterations for subjective cognitive decline (SCD) and mild cognitive impairment (MCI) are poorly defined. We sought to characterize grey matter volume (GMV) and cortical thickness associated with SCD and MCI among rural-dwelling older adults in China. METHODS: This population-based cross-sectional study included 1072 dementia-free participants from the brain MRI sub-study of MIND-China (2018-2020). We defined MCI following the Petersen's criteria, and SCD as the self-rated Ascertain Dementia 8-item Questionnaire score ≥ 2. Data were analyzed using voxel-based morphometry (VBM), surface-based morphometry analysis (SBM), and logistic regression models. RESULTS: SCD was defined in 243 persons and MCI in 246 individuals. The VBM analysis showed that MCI (vs. normal cognition) was significantly associated with reduced GMV in brain regions such as the bilateral parahippocampus, bilateral hippocampus, and bilateral fusiform (P < 0.05). The ROI-wise SBM analysis revealed that SCD was significantly associated with cortical thinning in the right paracentral sulcus, left caudal middle frontal gyrus, and left entorhinal cortex (P < 0.05) and that MCI was significantly associated with cortical thinning in the left temporal lobe, left frontal lobe, bilateral parietal lobe and bilateral fusiform (P < 0.05). CONCLUSIONS: The brain regions with reduced GMV or cortical thickness in older adults gradually expand from normal cognition through SCD to MCI, suggesting that characterizing structural brain alterations may help define the cognitive spectrum at the pre-dementia phase. These findings have potential implications for understanding the neuropathological process of cognitive deterioration in aging.


NeurIPS Conference 2023 Conference Paper

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

Rui Yang
Lin Song
Yanwei Li
Sijie Zhao
Yixiao Ge
Xiu Li
Ying Shan

This paper aims to efficiently enable Large Language Models (LLMs) to use multi-modal tools. The advanced proprietary LLMs, such as ChatGPT and GPT-4, have shown great potential for tool usage through sophisticated prompt engineering. Nevertheless, these models typically rely on prohibitive computational costs and publicly inaccessible data. To address these challenges, we propose the GPT4Tools based on self-instruct to enable open-source LLMs, such as LLaMA and OPT, to use tools. It generates an instruction-following dataset by prompting an advanced teacher with various multi-modal contexts. By using the Low-Rank Adaptation (LoRA) optimization, our approach facilitates the open-source LLMs to solve a range of visual problems, including visual comprehension and image generation. Moreover, we provide a benchmark to evaluate the ability of LLMs to use tools, which is performed in both zero-shot and fine-tuning ways. Extensive experiments demonstrate the effectiveness of our method on various language models, which not only significantly improves the accuracy of invoking seen tools, but also enables the zero-shot capacity for unseen tools.
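
The abstract above relies on Low-Rank Adaptation (LoRA). As a minimal sketch of the general LoRA idea, not the GPT4Tools training code, the snippet below forms an adapted weight W + (alpha / r) * B A from a frozen weight W and two small trainable factors. The helper names `matmul` and `lora_weight` and the toy matrices are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A).

    w: frozen weight, shape (d_out, d_in)
    b: trainable factor, shape (d_out, rank)
    a: trainable factor, shape (rank, d_in)
    """
    scale = alpha / rank
    delta = matmul(b, a)  # low-rank update, shape (d_out, d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example with d_out = 2, d_in = 3, rank = 1 (values chosen arbitrarily).
frozen = [[1, 0, 0], [0, 1, 0]]
factor_b = [[1], [2]]
factor_a = [[1, 0, 1]]
merged = lora_weight(frozen, factor_a, factor_b, alpha=2, rank=1)
```

Only the two factors are trained, so the number of trainable parameters is rank * (d_in + d_out) rather than d_in * d_out.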


NeurIPS Conference 2023 Conference Paper

Meta-Adapter: An Online Few-shot Learner for Vision-Language Model

Cheng Cheng
Lin Song
Ruoyi Xue
Hang Wang
Hongbin Sun
Yixiao Ge
Ying Shan

The contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts, enabling effective zero-shot image recognition. Nevertheless, few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples, resulting in longer inference time and the risk of overfitting in certain domains. To tackle these challenges, we propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner. With a few training samples, our method can enable effective few-shot learning capabilities and generalize to unseen data or tasks without additional fine-tuning, achieving competitive performance and high efficiency. Without bells and whistles, our approach outperforms the state-of-the-art online few-shot learning method by an average of 3.6% on eight image classification datasets with higher inference speed. Furthermore, our model is simple and flexible, serving as a plug-and-play module directly applicable to downstream tasks. Without further fine-tuning, Meta-Adapter obtains notable performance improvements in open-vocabulary object detection and segmentation tasks.


NeurIPS Conference 2021 Conference Paper

Dynamic Grained Encoder for Vision Transformers

Lin Song
Songyang Zhang
Songtao Liu
Zeming Li
Xuming He
Hongbin Sun
Jian Sun
Nanning Zheng

Transformers, the de-facto standard for language modeling, have been recently applied for vision tasks. This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images and save computational costs. Specifically, we propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region. Thus it achieves a fine-grained representation in discriminative regions while keeping high efficiency. Besides, the dynamic grained encoder is compatible with most vision transformer frameworks. Without bells and whistles, our encoder allows the state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification. Extensive experiments on object detection and segmentation further demonstrate the generalizability of our approach. Code is available at https://github.com/StevenGrove/vtpack.


NeurIPS Conference 2020 Conference Paper

Fine-Grained Dynamic Head for Object Detection

Lin Song
Yanwei Li
Zhengkai Jiang
Zeming Li
Hongbin Sun
Jian Sun
Nanning Zheng

The Feature Pyramid Network (FPN) presents a remarkable approach to alleviate the scale variance in object representation by performing instance-level assignments. Nevertheless, this strategy ignores the distinct characteristics of different sub-regions in an instance. To this end, we propose a fine-grained dynamic head to conditionally select a pixel-level combination of FPN features from different scales for each instance, which further releases the ability of multi-scale feature representation. Moreover, we design a spatial gate with the new activation function to reduce computational complexity dramatically through spatially sparse convolutions. Extensive experiments demonstrate the effectiveness and efficiency of the proposed method on several state-of-the-art detection benchmarks. Code is available at https://github.com/StevenGrove/DynamicHead.


NeurIPS Conference 2020 Conference Paper

Rethinking Learnable Tree Filter for Generic Feature Transform

Lin Song
Yanwei Li
Zhengkai Jiang
Zeming Li
Xiangyu Zhang
Hongbin Sun
Jian Sun
Nanning Zheng

The Learnable Tree Filter presents a remarkable approach to model structure-preserving relations for semantic segmentation. Nevertheless, the intrinsic geometric constraint forces it to focus on the regions with close spatial distance, hindering the effective long-range interactions. To relax the geometric constraint, we give the analysis by reformulating it as a Markov Random Field and introduce a learnable unary term. Besides, we propose a learnable spanning tree algorithm to replace the original non-differentiable one, which further improves the flexibility and robustness. With the above improvements, our method can better capture long-range dependencies and preserve structural details with linear complexity, which is extended to several vision tasks for more generic feature transform. Extensive experiments on object detection/instance segmentation demonstrate the consistent improvements over the original version. For semantic segmentation, we achieve leading performance (82.1% mIoU) on the Cityscapes benchmark without bells and whistles. Code is available at https://github.com/StevenGrove/LearnableTreeFilterV2.


NeurIPS Conference 2019 Conference Paper

Learnable Tree Filter for Structure-preserving Feature Transform

Lin Song
Yanwei Li
Zeming Li
Gang Yu
Hongbin Sun
Jian Sun
Nanning Zheng

Learning discriminative global features plays a vital role in semantic segmentation. And most of the existing methods adopt stacks of local convolutions or non-local blocks to capture long-range context. However, due to the absence of spatial structure preservation, these operators ignore the object details when enlarging receptive fields. In this paper, we propose the learnable tree filter to form a generic tree filtering module that leverages the structural property of minimal spanning tree to model long-range dependencies while preserving the details. Furthermore, we propose a highly efficient linear-time algorithm to reduce resource consumption. Thus, the designed modules can be plugged into existing deep neural networks conveniently. To this end, tree filtering modules are embedded to formulate a unified framework for semantic segmentation. We conduct extensive ablation studies to elaborate on the effectiveness and efficiency of the proposed method. Specifically, it attains better performance with much less overhead compared with the classic PSP block and Non-local operation under the same backbone. Our approach is proved to achieve consistent improvements on several benchmarks without bells-and-whistles. Code and models are available at https://github.com/StevenGrove/TreeFilter-Torch.
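
The abstract above builds on the minimal spanning tree of an image graph. The sketch below is only an illustration of that idea, not the code from the linked repository. It runs Kruskal's algorithm on a 4-connected pixel grid and assumes the edge weight is the absolute difference between neighbouring scalar features. The function name `grid_mst` and the toy image are hypothetical.

```python
def grid_mst(features):
    """Build a minimum spanning tree over a 4-connected grid with Kruskal's algorithm.

    features is a 2D list of numbers. The edge weight is the absolute feature
    difference between neighbours, which is an assumption made for illustration.
    Returns a list of (node_a, node_b, weight) with nodes numbered row by row.
    """
    rows, cols = len(features), len(features[0])
    edges = []
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            if c + 1 < cols:  # edge to the right neighbour
                edges.append((abs(features[r][c] - features[r][c + 1]), node, node + 1))
            if r + 1 < rows:  # edge to the neighbour below
                edges.append((abs(features[r][c] - features[r + 1][c]), node, node + cols))

    parent = list(range(rows * cols))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, a, b in sorted(edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # keep the edge only if it joins two components
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree


# Two flat regions (value 1 and value 9) separated by a vertical boundary.
image = [
    [1, 1, 9],
    [1, 1, 9],
    [1, 1, 9],
]
mst = grid_mst(image)
```

In this toy image the tree connects pixels inside each flat region through zero-weight edges and crosses the boundary exactly once, which is the sense in which a spanning tree follows object structure.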
