Arrow Research search

Author name cluster

Lin Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
2 author rows

Possible papers (13)

AAAI Conference 2026 · Conference Paper

Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

  • Jianfeng Si
  • Lin Sun
  • Zhewen Tan
  • Xiangzheng Zhang

Current methods for content safety in Large Language Models (LLMs), such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), often rely on multi-stage training pipelines and lack fine-grained, post-deployment controllability. To address these limitations, we propose a unified co-training framework that efficiently integrates multiple safety behaviors, positive (lawful/prosocial), negative (unfiltered/risk-prone), and rejective (refusal-oriented/conservative), within a single SFT stage. Notably, each behavior is dynamically activated via a simple system-level instruction, or magic token, enabling stealthy and efficient behavioral switching at inference time. This flexibility supports diverse deployment scenarios: positive for safe user interaction, negative for internal red-teaming, and rejective for context-aware refusals triggered by upstream moderation signals. This co-training strategy induces a distinct Safety Alignment Margin in the output space, characterized by well-separated response distributions corresponding to each safety mode. The existence of this margin provides empirical evidence for the model's safety robustness and enables unprecedented fine-grained control. Experiments show that our method matches the safety alignment quality of SFT+DPO, with our 8B model notably surpassing DeepSeek-R1 (671B) in safety performance, while significantly reducing both training complexity and deployment costs. This work presents a scalable, efficient, and highly controllable solution for LLM content safety.
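
As a rough illustration of the abstract's magic-token idea, the Python sketch below builds co-training SFT samples whose system message is prefixed with a mode-selecting token. The token strings, data layout, and example prompts are invented for illustration; the paper does not specify them here.

    # Minimal sketch of magic-token-guided co-training data construction.
    # Token names and the chat-message schema are assumptions, not the
    # paper's actual tokens or training stack.

    MAGIC_TOKENS = {
        "positive": "<|safe|>",        # lawful/prosocial responses
        "negative": "<|unfiltered|>",  # risk-prone responses for red-teaming
        "rejective": "<|refuse|>",     # refusal-oriented responses
    }

    def build_cotraining_sample(prompt, responses_by_mode, mode):
        """One SFT sample: the mode's magic token in the system message
        selects which safety behavior the target response exhibits."""
        system = f"{MAGIC_TOKENS[mode]} You are a helpful assistant."
        return {
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": responses_by_mode[mode]},
            ]
        }

    if __name__ == "__main__":
        responses = {
            "positive": "Here is a safe, lawful answer...",
            "negative": "[internal red-team completion]",
            "rejective": "I can't help with that request.",
        }
        # One sample per safety mode, all trained in a single SFT stage.
        dataset = [
            build_cotraining_sample("How do I pick a lock?", responses, mode)
            for mode in MAGIC_TOKENS
        ]
        print(dataset[0]["messages"][0]["content"])

At inference time, switching behavior then amounts to swapping the system-level token rather than swapping models.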

ECAI Conference 2025 · Conference Paper

D-CHO: Task-Oriented Satellite Conditional Handover Decision in NTN Based on Multi-Agent Game

  • Lin Sun
  • Jinming Liu
  • Fanmeng Hong
  • Haopeng Chen
  • Xiupu Lang
  • Lin Gui 0001

In non-terrestrial networks, Low-Earth-Orbit (LEO) satellite constellations have become crucial for ensuring seamless global connectivity, and task-oriented user connection requests demand flexible continuity guarantees. In this paper, we propose D-CHO, a task-oriented satellite conditional handover scheme based on multi-agent game theory. To formalize the problem, we model a multi-objective optimization over delay overhead, satellite utilization deviation, and load performance. By game theory, a Nash equilibrium exists among the strategies of the multiple tasks that require satellite links, and the optimal satellite handover sequence can be identified through exploration and exploitation. We search for this solution with the MAPPO algorithm, which enables multiple tasks to make independent decisions from their partial observations without requiring global information; this supports adaptive expansion in response to dynamic changes across satellite networks. Simulation results show that D-CHO improves performance by 22%, 26%, and 11% over the SCDP, G-CHO, and MADDPG-CHO algorithms, respectively, and exhibits better scalability while maintaining satisfactory performance.
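
As a rough sketch of how the abstract's three objectives could enter a MAPPO-style per-task reward, the Python below combines delay overhead, utilization deviation, and load into one score. The weights, normalizations, and toy observation are assumptions, not the paper's reward design.

    # Illustrative per-task reward for a conditional-handover agent.
    import statistics

    def handover_reward(delay_ms, observed_loads, chosen_load, w=(0.5, 0.3, 0.2)):
        # Delay overhead: lower handover delay -> higher reward.
        delay_term = -delay_ms / 100.0
        # Utilization deviation: penalize imbalance across observed satellites.
        deviation_term = -statistics.pstdev(observed_loads)
        # Load performance: penalize picking an already-loaded satellite.
        load_term = -chosen_load
        return w[0] * delay_term + w[1] * deviation_term + w[2] * load_term

    # Each task scores its candidate satellites from its own (partial)
    # observation; the observation here is a toy stand-in.
    candidates = {"sat_a": (40, 0.7), "sat_b": (65, 0.3)}  # (delay_ms, load)
    loads = [0.7, 0.3, 0.5]
    best = max(candidates,
               key=lambda s: handover_reward(candidates[s][0], loads, candidates[s][1]))
    print(best)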

AAAI Conference 2025 · Conference Paper

Expand VSR Benchmark for VLLM to Expertize in Spatial Rules

  • Peijin Xie
  • Lin Sun
  • Bingquan Liu
  • Dexin Wang
  • Xiangzheng Zhang
  • Chengjie Sun
  • Jiajia Zhang

Distinguishing spatial relations is a basic part of human cognition that requires fine-grained cross-instance perception. Although benchmarks such as MME, MMBench, and SEED comprehensively evaluate many capabilities, including visual spatial reasoning (VSR), there is still a lack of datasets of sufficient quantity and quality for evaluating and optimizing Vision Large Language Models (VLLMs) specifically on visual positional reasoning. To address this, we first diagnosed current VLLMs with the VSR dataset and proposed a unified test set. We found that current VLLMs exhibit a contradiction: over-sensitivity to language instructions and under-sensitivity to visual positional information. We mitigated this phenomenon by expanding the original benchmark along two axes, tuning data and model structure. To our knowledge, we are the first to controllably expand spatially positioned image data using diffusion models, and we integrated the original visual encoder (CLIP) with three other powerful visual encoders (SigLIP, SAM, and DINO). After combination experiments on scaling data and models, we obtained a VLLM VSR Expert (VSRE) that not only generalizes better to different instructions but also accurately distinguishes differences in visual positional information. VSRE achieved over a 27% increase in accuracy on the VSR test set and is a performant VLLM on the positional reasoning of both the VSR dataset and relevant subsets of other evaluation benchmarks. We hope it will accelerate advancements in VSR learning for VLLMs.
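
A common way to combine several visual encoders, in the spirit of the abstract's CLIP + SigLIP + SAM + DINO integration, is to project each encoder's features to a shared width and concatenate the tokens. The PyTorch sketch below shows that pattern; the encoder stubs and all dimensions are placeholders, not the paper's implementation.

    # Minimal multi-encoder fusion sketch (assumed design, not VSRE's code).
    import torch
    import torch.nn as nn

    class MultiEncoderFusion(nn.Module):
        def __init__(self, encoder_dims, hidden=1024):
            super().__init__()
            # One linear projection per encoder output width.
            self.projs = nn.ModuleList([nn.Linear(d, hidden) for d in encoder_dims])

        def forward(self, features):
            # features: list of (batch, tokens_i, dim_i) tensors, one per encoder.
            projected = [p(f) for p, f in zip(self.projs, features)]
            return torch.cat(projected, dim=1)  # (batch, sum of tokens, hidden)

    # Stand-ins for CLIP / SigLIP / SAM / DINO patch features.
    dims = (768, 1152, 256, 1024)
    feats = [torch.randn(2, 16, d) for d in dims]
    fused = MultiEncoderFusion(list(dims))(feats)
    print(fused.shape)  # torch.Size([2, 64, 1024])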

AAAI Conference 2024 · Conference Paper

PathAsst: A Generative Foundation AI Assistant towards Artificial General Intelligence of Pathology

  • Yuxuan Sun
  • Chenglu Zhu
  • Sunyi Zheng
  • Kai Zhang
  • Lin Sun
  • Zhongyi Shui
  • Yunlong Zhang
  • Honglin Li

As advances in large language models (LLMs) and multimodal techniques continue to mature, the development of general-purpose multimodal large language models (MLLMs) has surged, offering significant applications in interpreting natural images. However, the field of pathology has largely remained untapped, particularly in gathering high-quality data and designing comprehensive model frameworks. To bridge this gap in pathology MLLMs, we present PathAsst, a multimodal generative foundation AI assistant designed to revolutionize diagnostic and predictive analytics in pathology. The development of PathAsst involves three pivotal steps: data acquisition, CLIP model adaptation, and the training of PathAsst's multimodal generative capabilities. First, we collect over 207K high-quality pathology image-text pairs from authoritative sources. Leveraging the advanced power of ChatGPT, we generate over 180K instruction-following samples. Furthermore, we devise additional instruction-following data specifically tailored for invoking the eight pathology-specific sub-models we prepared, allowing PathAsst to effectively collaborate with these models and enhancing its diagnostic ability. Second, leveraging the collected data, we construct PathCLIP, a pathology-dedicated CLIP, to enhance PathAsst's capabilities in interpreting pathology images. Finally, we integrate PathCLIP with Vicuna-13B and utilize pathology-specific instruction-tuning data to enhance the multimodal generation capacity of PathAsst and bolster its synergistic interactions with the sub-models. The experimental results show the potential of harnessing AI-powered generative foundation models to improve pathology diagnosis and treatment processes. We open-source our dataset, as well as a comprehensive toolkit for extensive pathology data collection and preprocessing, at https://github.com/superjamessyx/Generative-Foundation-AI-Assistant-for-Pathology.
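
One plausible shape for the abstract's sub-model-invocation training data is a conversation whose assistant turn emits a structured tool call. The Python sketch below is purely illustrative: the tool name, JSON schema, and file names are invented, not taken from the PathAsst release.

    # Hedged sketch of instruction data for invoking a pathology sub-model.
    import json

    def make_tool_sample(image_id, question, tool_name, tool_args):
        """One training sample: the assistant answers by emitting a tool call."""
        return {
            "image": image_id,
            "conversations": [
                {"from": "human", "value": question},
                {"from": "assistant",
                 "value": json.dumps({"tool": tool_name, "args": tool_args})},
            ],
        }

    sample = make_tool_sample(
        image_id="slide_0001.png",
        question="How many mitotic figures are in this patch?",
        tool_name="mitosis_detector",            # hypothetical sub-model name
        tool_args={"patch": "slide_0001.png"},
    )
    print(sample["conversations"][1]["value"])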

AAAI Conference 2024 · Conference Paper

UMIE: Unified Multimodal Information Extraction with Instruction Tuning

  • Lin Sun
  • Kai Zhang
  • Qingyuan Li
  • Renze Lou

Multimodal information extraction (MIE) has gained significant attention as the popularity of multimedia content increases. However, current MIE methods often resort to task-specific model structures, which limits generalizability across tasks and underutilizes the knowledge shared across MIE tasks. To address these issues, we propose UMIE, a unified multimodal information extractor that uses instruction tuning to cast three MIE tasks as a single generation problem and can effectively extract both textual and visual mentions. Extensive experiments show that our single UMIE outperforms various state-of-the-art (SoTA) methods across six MIE datasets on three tasks. Furthermore, in-depth analysis demonstrates UMIE's strong generalization in the zero-shot setting, robustness to instruction variants, and interpretability. Our research serves as an initial step towards a unified MIE model and initiates the exploration of both instruction tuning and large language models within the MIE domain. Our code, data, and model are available at https://github.com/ZUCC-AI/UMIE.
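
Casting several extraction tasks as one generation problem typically means mapping every task to an (instruction, inputs) -> target-string sample. The Python sketch below shows that reformulation in the spirit of UMIE; the template wording, task keys, and output format are assumptions for illustration.

    # Illustrative instruction templates unifying three MIE tasks as generation.
    TEMPLATES = {
        "mner": "Extract all named entities with their types from the text: {text}",
        "mre":  "Extract the relation between the marked entities: {text}",
        "mee":  "Extract events and their arguments from the text: {text}",
    }

    def to_generation_sample(task, text, image_path, target):
        """Every task becomes (instruction + text + image) -> target string."""
        return {
            "instruction": TEMPLATES[task].format(text=text),
            "image": image_path,
            "target": target,
        }

    s = to_generation_sample(
        "mner",
        "Messi joined Inter Miami.",
        "tweet_42.jpg",
        "Messi: PER | Inter Miami: ORG",  # assumed linearized output format
    )
    print(s["instruction"])

A single instruction-tuned model can then be trained on the union of such samples, which is where the cross-task knowledge sharing comes from.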

AAAI Conference 2021 · Conference Paper

RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER

  • Lin Sun
  • Jiquan Wang
  • Kai Zhang
  • Yindu Su
  • Fangsheng Weng

Recently, multimodal named entity recognition (MNER) has utilized images to improve the accuracy of NER in tweets. However, most multimodal methods use attention mechanisms to extract visual clues regardless of whether the text and image are relevant. In practice, irrelevant text-image pairs account for a large proportion of tweets, and visual clues unrelated to the text exert uncertain or even negative effects on multimodal model learning. In this paper, we introduce a method of text-image relation propagation into the multimodal BERT model. We integrate soft or hard gates to select visual clues and propose a multitask algorithm to train on the MNER datasets. In the experiments, we deeply analyze the changes in visual attention before and after the use of text-image relation propagation. Our model achieves state-of-the-art performance on the MNER datasets.
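
The soft/hard gating idea can be sketched as a relevance score that scales (soft) or masks (hard) the visual features before fusion with the text. The PyTorch snippet below is a minimal sketch under assumed layer sizes; it is not RpBERT's actual architecture.

    # Minimal text-image relation gate sketch (assumed dimensions).
    import torch
    import torch.nn as nn

    class RelationGate(nn.Module):
        def __init__(self, text_dim=768, vis_dim=768, hard=False):
            super().__init__()
            # Relation head: predicts P(image is relevant to text).
            self.scorer = nn.Linear(text_dim + vis_dim, 1)
            self.hard = hard

        def forward(self, text_cls, vis_feats):
            # text_cls: (batch, text_dim); vis_feats: (batch, regions, vis_dim)
            pooled = torch.cat([text_cls, vis_feats.mean(dim=1)], dim=-1)
            rel = torch.sigmoid(self.scorer(pooled))           # (batch, 1)
            gate = (rel > 0.5).float() if self.hard else rel   # hard or soft gate
            return vis_feats * gate.unsqueeze(1)               # gated visual clues

    gate = RelationGate(hard=False)
    out = gate(torch.randn(2, 768), torch.randn(2, 49, 768))
    print(out.shape)  # torch.Size([2, 49, 768])

Training the relation head jointly with NER is what makes this a multitask setup: irrelevant images receive gates near zero, so their visual clues stop propagating.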

NeurIPS Conference 2020 · Conference Paper

Every View Counts: Cross-View Consistency in 3D Object Detection with Hybrid-Cylindrical-Spherical Voxelization

  • Qi Chen
  • Lin Sun
  • Ernest Cheung
  • Alan L. Yuille

Recent voxel-based 3D object detectors for autonomous vehicles learn point cloud representations either from the bird's eye view (BEV) or the range view (RV, a.k.a. the perspective view). However, each view has its own strengths and weaknesses. In this paper, we present a novel framework to unify and leverage the benefits of both BEV and RV. The widely used cuboid-shaped voxels in the Cartesian coordinate system only benefit learning a BEV feature map; therefore, to enable learning both BEV and RV feature maps, we introduce Hybrid-Cylindrical-Spherical voxelization. Our findings show that simply adding detection on another view as auxiliary supervision leads to poor performance. We propose a pair of cross-view transformers to transform the feature maps into the other view and introduce a cross-view consistency loss on them. Comprehensive experiments on the challenging nuScenes dataset validate the effectiveness of our method by virtue of joint optimization and complementary information on both views. Remarkably, our approach achieves an mAP of 55.8%, outperforming all published approaches by at least 3% in overall performance and by up to 16.5% in safety-crucial categories such as cyclist.
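
Non-Cartesian voxelization of the kind the abstract builds on amounts to binning points by coordinates other than xyz. The Python sketch below bins LiDAR points into cylindrical voxels (the hybrid scheme additionally uses spherical coordinates); all bin sizes are placeholder values, not the paper's configuration.

    # Sketch: assign Cartesian points to cylindrical voxel indices.
    import numpy as np

    def cylindrical_voxel_ids(points, d_rho=0.5, d_phi=np.deg2rad(1.0), d_z=0.2):
        """points: (N, 3) Cartesian xyz -> (N, 3) integer (rho, phi, z) bins."""
        x, y, z = points[:, 0], points[:, 1], points[:, 2]
        rho = np.sqrt(x**2 + y**2)   # radial distance from the sensor axis
        phi = np.arctan2(y, x)       # azimuth in (-pi, pi]
        return np.stack([rho // d_rho,
                         (phi + np.pi) // d_phi,   # shift azimuth to [0, 2*pi)
                         z // d_z], axis=1).astype(np.int64)

    pts = np.random.uniform(-50, 50, size=(4, 3))
    print(cylindrical_voxel_ids(pts))

Because azimuth is one of the bin axes, features pooled over such voxels align naturally with the range view, while the radial axis still supports a BEV-style layout.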

NeurIPS Conference 2020 · Conference Paper

Grasp Proposal Networks: An End-to-End Solution for Visual Learning of Robotic Grasps

  • Chaozheng Wu
  • Jian Chen
  • Qiaoyu Cao
  • Jianchi Zhang
  • Yunxin Tai
  • Lin Sun
  • Kui Jia

Learning robotic grasps from visual observations is a promising yet challenging task. Recent research shows its great potential by preparing and learning from large-scale synthetic datasets. For the popular 6 degree-of-freedom (6-DOF) grasp setting of a parallel-jaw gripper, most existing methods heuristically sample grasp candidates and then evaluate them with learned scoring functions, a strategy limited by the conflict between sampling efficiency and coverage of optimal grasps. To this end, we propose a novel, end-to-end Grasp Proposal Network (GPNet) to predict a diverse set of 6-DOF grasps for an unseen object observed from a single, unknown camera view. GPNet builds on a key design: a grasp proposal module that defines anchors of grasp centers at discrete but regular 3D grid corners, which flexibly supports either more precise or more diverse grasp predictions. To test GPNet, we contribute a synthetic dataset of 6-DOF object grasps; evaluation is conducted using rule-based criteria, simulation test, and real test. Comparative results show the advantage of our method over existing ones. Notably, GPNet attains better simulation results via its specified coverage, which translates readily to the real test. Our code and dataset are available at https://github.com/CZ-Wu/GPNet.
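
The anchor design the abstract highlights, grasp centers placed at regular 3D grid corners, can be sketched in a few lines. The Python below constructs such a grid; the extent and resolution are placeholder values, not GPNet's settings.

    # Illustrative grasp-center anchors at regular 3D grid corners.
    import numpy as np

    def grid_corner_anchors(lo=-0.15, hi=0.15, step=0.03):
        """Return (M, 3) anchor centers on a regular grid in the object frame."""
        ticks = np.arange(lo, hi + 1e-9, step)
        xs, ys, zs = np.meshgrid(ticks, ticks, ticks, indexing="ij")
        return np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)

    anchors = grid_corner_anchors()
    print(anchors.shape)  # (1331, 3) for an 11 x 11 x 11 grid
    # A proposal head would then predict, per anchor, a graspability score
    # plus offsets refining the center and the 6-DOF gripper pose.

Coarsening or refining the grid trades proposal diversity against precision, which is the flexibility the abstract attributes to this design.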