Author name cluster

Sibei Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers

2 author rows

NeurIPS Conference 2025 Conference Paper

Auto-Search and Refinement: An Automated Framework for Gender Bias Mitigation in Large Language Models

Yue Xu
Chengyan Fu
Li Xiong
Sibei Yang
Wenjie Wang

Pre-training large language models (LLMs) on vast text corpora enhances natural language processing capabilities but risks encoding social biases, particularly gender bias. While parameter-modification methods like fine-tuning mitigate bias, they are resource-intensive, unsuitable for closed-source models, and lack adaptability to evolving societal norms. Instruction-based approaches offer flexibility but often compromise general performance on normal tasks. To address these limitations, we propose $\textit{FaIRMaker}$, an automated and model-independent framework that employs an $\textbf{auto-search and refinement}$ paradigm to adaptively generate Fairwords, which act as instructions to reduce gender bias and enhance response quality. $\textit{FaIRMaker}$ enhances the debiasing capacity by enlarging the Fairwords search space while preserving the utility and making it applicable to closed-source models by training a sequence-to-sequence model that adaptively refines Fairwords into effective debiasing instructions when facing gender-related queries and performance-boosting prompts for neutral inputs. Extensive experiments demonstrate that $\textit{FaIRMaker}$ effectively mitigates gender bias while preserving task integrity and ensuring compatibility with both open- and closed-source LLMs.

PDF Details

ICLR Conference 2025 Conference Paper

CityAnchor: City-scale 3D Visual Grounding with Multi-modality LLMs

Jinpeng Li
Haiping Wang 0004
Jiabin Chen
Yuan Liu 0025
Zhiyang Dou
Yuexin Ma
Sibei Yang
Yuan Li

In this paper, we present a 3D visual grounding method called CityAnchor for localizing an urban object in a city-scale point cloud. Recent developments in multiview reconstruction enable us to reconstruct city-scale point clouds but how to conduct visual grounding on such a large-scale urban point cloud remains an open problem. Previous 3D visual grounding system mainly concentrates on localizing an object in an image or a small-scale point cloud, which is not accurate and efficient enough to scale up to a city-scale point cloud. We address this problem with a multi-modality LLM which consists of two stages, a coarse localization and a fine-grained matching. Given the text descriptions, the coarse localization stage locates possible regions on a projected 2D map of the point cloud while the fine-grained matching stage accurately determines the most matched object in these possible regions. We conduct experiments on the CityRefer dataset and a new synthetic dataset annotated by us, both of which demonstrate our method can produce accurate 3D visual grounding on a city-scale 3D point cloud.

Details

NeurIPS Conference 2025 Conference Paper

Discovering Compositional Hallucinations in LVLMs

Sibei Yang
Ge Zheng
Jiajin Tang
Jiaye Qian
Hanzhuo Huang
Cheng Shi

Large language models (LLMs) and vision-language models (LVLMs) have driven the paradigm shift towards general-purpose foundation models. However, both of them are prone to hallucinations, which compromise their factual accuracy and reliability. While existing research primarily focuses on isolated textual- or visual-centric errors, a critical yet underexplored phenomenon persists in LVLMs: Even neither of textual- or visual centric errors occur, LVLMs often struggle with a new and subtle hallucination mode that arising from composition of them. In this paper, we define this issue as Simple Compositional Hallucination (SCHall). Through an preliminary analysis, we present two key findings: (1) visual abstraction fails under compositional questioning, and (2) visual inputs induce degradation in language processing, leading to hallucinations. To facilitate future research on this phenomenon, we introduce a custom benchmark, SCBench, and propose a novel VLR-distillation method, which serves as the first baseline to effectively mitigate SCHall. Furthermore, experiment results on publicly available benchmarks, including both hallucination-specific and general-purpose ones, demonstrate the effectiveness of our VLR-distillation method.

PDF Details

ICLR Conference 2025 Conference Paper

Discovering Influential Neuron Path in Vision Transformers

Yifan Wang
Yifei Liu
Yingdong Shi
Changming Li
Anqi Pang
Sibei Yang
Jingyi Yu 0001
Kan Ren

Vision Transformer models exhibit immense power yet remain opaque to human understanding, posing challenges and risks for practical applications. While prior research has attempted to demystify these models through input attribution and neuron role analysis, there's been a notable gap in considering layer-level information and the holistic path of information flow across layers. In this paper, we investigate the significance of influential neuron paths within vision Transformers, which is a path of neurons from the model input to output that impacts the model inference most significantly. We first propose a joint influence measure to assess the contribution of a set of neurons to the model outcome. And we further provide a layer-progressive neuron locating approach that efficiently selects the most influential neuron at each layer trying to discover the crucial neuron path from input to output within the target model. Our experiments demonstrate the superiority of our method finding the most influential neuron path along which the information flows, over the existing baseline solutions. Additionally, the neuron paths have illustrated that vision Transformers exhibit some specific inner working mechanism for processing the visual information within the same image category. We further analyze the key effects of these neurons on the image classification task, showcasing that the found neuron paths have already preserved the model capability on downstream tasks, which may also shed some lights on real-world applications like model pruning. The project website including implementation code is available at https://foundation-model-research.github.io/NeuronPath/.

Details

NeurIPS Conference 2025 Conference Paper

Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video

Xueyang Yu
Cheng Shi
Yang Wang
Sibei Yang

Envision an AI capable of functioning in human-like settings, moving beyond mere observation to actively understand, anticipate, and proactively respond to unfolding events. Towards this vision, we focus on the innovative task where, given ego-streaming video input, an assistant proactively answers diverse, evolving questions at the opportune moment, while maintaining synchronized perception and reasoning. This task embodies three key properties: (1) Proactive Coherence, (2) Just-in-Time Responsiveness, and (3) Synchronized Efficiency. To evaluate and address these properties, we first introduce ESTP-Bench (Ego Streaming Proactive Benchmark) alongside the ESTP-F1 metric—a novel framework designed for their rigorous assessment. Secondly, we propose a comprehensive technical pipeline to enable models to tackle this challenging task. This pipeline comprises: (1) a data engine, (2) a multi-stage training strategy, and (3) a proactive dynamic compression technique. Our proposed model effectively addresses these critical properties while achieving state-of-the-art (SOTA) performance on the standard COIN benchmark.

PDF Details

NeurIPS Conference 2025 Conference Paper

Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats

Jiaye Qian
Ge Zheng
Yuchen Zhu
Sibei Yang

Despite their impressive performance across a wide range of tasks, Large Vision-Language Models (LVLMs) remain prone to hallucination. In this study, we propose a comprehensive intervention framework aligned with the transformer’s causal architecture in LVLMs, integrating the effects of different intervention paths on hallucination. We find that hallucinations in LVLMs do not arise from a single causal path, but rather from the interplay among image-to-input-text, image-to-output-text, and text-to-text pathways. For the first time, we also find that LVLMs rely on different pathways depending on the question–answer alignment format. Building on these insights, we propose simple yet effective methods to identify and intervene on critical hallucination heads within each pathway, tailored to discriminative and generative formats. Experiments across multiple benchmarks demonstrate that our approach consistently reduces hallucinations across diverse alignment types.

PDF Details

ICLR Conference 2025 Conference Paper

MVTokenFlow: High-quality 4D Content Generation using Multiview Token Flow

Hanzhuo Huang
Yuan Liu 0025
Ge Zheng
Jiepeng Wang 0001
Zhiyang Dou
Sibei Yang

In this paper, we present MVTokenFlow for high-quality 4D content creation from monocular videos. Recent advancements in generative models such as video diffusion models and multiview diffusion models enable us to create videos or 3D models. However, extending these generative models for dynamic 4D content creation is still a challenging task that requires the generated content to be consistent spatially and temporally. To address this challenge, MVTokenFlow utilizes the multiview diffusion model to generate multiview images on different timesteps, which attains spatial consistency across different viewpoints and allows us to reconstruct a reasonable coarse 4D field. Then, MVTokenFlow further regenerates all the multiview images using the rendered 2D flows as guidance. The 2D flows effectively associate pixels from different timesteps and improve the temporal consistency by reusing tokens in the regeneration process. Finally, the regenerated images are spatiotemporally consistent and utilized to refine the coarse 4D field to get a high-quality 4D field. Experiments demonstrate the effectiveness of our design and show significantly improved quality than baseline methods. Project page: https://soolab.github.io/MVTokenFlow.

Details

NeurIPS Conference 2025 Conference Paper

Vision Function Layer in Multimodal LLMs

Cheng Shi
Yizhou Yu
Sibei Yang

This study identifies that visual-related functional decoding is distributed across different decoder layers in Multimodal Large Language Models (MLLMs). Typically, each function, such as counting, grounding, or OCR recognition, narrows down to two or three layers, which we define as Vision Function Layers (VFL). Additionally, the depth and its order of different VFLs exhibits a consistent pattern across different MLLMs, which is well-aligned with human behaviors (e. g. , recognition occurs first, followed by counting, and then grounding). These findings are derived from Visual Token Swapping, our novel analytical framework that modifies targeted KV cache entries to precisely elucidate layer-specific functions during decoding. Furthermore, these insights offer substantial utility in tailoring MLLMs for real-world downstream applications. For instance, when LoRA training is selectively applied to VFLs whose functions align with the training data, VFL-LoRA not only outperform full-LoRA but also prevent out-of-domain function forgetting. Moreover, by analyzing the performance differential on training data when particular VFLs are ablated, VFL-select automatically classifies data by function, enabling highly efficient data selection to directly bolster corresponding capabilities. Consequently, VFL-select surpasses human experts in data selection, and achieves 98% of full-data performance with only 20% of the original dataset. This study delivers deeper comprehension of MLLM visual processing, fostering the creation of more efficient, interpretable, and robust models.

PDF Details

IJCAI Conference 2024 Conference Paper

RealDex: Towards Human-like Grasping for Robotic Dexterous Hand

Yumeng Liu
Yaxun Yang
Youzhuo Wang
Xiaofei Wu
Jiamin Wang
Yichen Yao
Sören Schwertfeger
Sibei Yang

In this paper, we introduce RealDex, a pioneering dataset capturing authentic dexterous hand grasping motions infused with human behavioral patterns, enriched by multi-view and multimodal visual data. Utilizing a teleoperation system, we seamlessly synchronize human-robot hand poses in real time. This collection of human-like motions is crucial for training dexterous hands to mimic human movements more naturally and precisely. RealDex holds immense promise in advancing humanoid robot for automated perception, cognition, and manipulation in real-world scenarios. Moreover, we introduce a cutting-edge dexterous grasping motion generation framework, which aligns with human experience and enhances real-world applicability through effectively utilizing Multimodal Large Language Models. Extensive experiments have demonstrated the superior performance of our method on RealDex and other open datasets. The dataset and associated code are available at https: //4dvlab. github. io/RealDex_page/.

PDF Details DOI

ICLR Conference 2024 Conference Paper

The Devil is in the Object Boundary: Towards Annotation-free Instance Segmentation using Foundation Models

Cheng Shi 0001
Sibei Yang

Foundation models, pre-trained on a large amount of data have demonstrated impressive zero-shot capabilities in various downstream tasks. However, in object detection and instance segmentation, two fundamental computer vision tasks heavily reliant on extensive human annotations, foundation models such as SAM and DINO struggle to achieve satisfactory performance. In this study, we reveal that the devil is in the object boundary, $\textit{i.e.}$, these foundation models fail to discern boundaries between individual objects. For the first time, we probe that CLIP, which has never accessed any instance-level annotations, can provide a highly beneficial and strong instance-level boundary prior in the clustering results of its particular intermediate layer. Following this surprising observation, we propose $\textbf{\textit{Zip}}$ which $\textbf{Z}$ips up CL$\textbf{ip}$ and SAM in a novel classification-first-then-discovery pipeline, enabling annotation-free, complex-scene-capable, open-vocabulary object detection and instance segmentation. Our Zip significantly boosts SAM's mask AP on COCO dataset by 12.5\% and establishes state-of-the-art performance in various settings, including training-free, self-training, and label-efficient finetuning. Furthermore, annotation-free Zip even achieves comparable performance to the best-performing open-vocabulary object detecters using base annotations. Code is released at https://github.com/ChengShiest/Zip-Your-CLIP

Details

AAAI Conference 2023 Conference Paper

CCQ: Cross-Class Query Network for Partially Labeled Organ Segmentation

Xuyang Liu
Bingbing Wen
Sibei Yang

Learning multi-organ segmentation from multiple partially-labeled datasets attracts increasing attention. It can be a promising solution for the scarcity of large-scale, fully labeled 3D medical image segmentation datasets. However, existing algorithms of multi-organ segmentation on partially-labeled datasets neglect the semantic relations and anatomical priors between different categories of organs, which is crucial for partially-labeled multi-organ segmentation. In this paper, we tackle the limitations above by proposing the Cross-Class Query Network (CCQ). CCQ consists of an image encoder, a cross-class query learning module, and an attentive refinement segmentation module. More specifically, the image encoder captures the long-range dependency of a single image via the transformer encoder. Cross-class query learning module first generates query vectors that represent semantic concepts of different categories and then utilizes these query vectors to find the class-relevant features of image representation for segmentation. The attentive refinement segmentation module with an attentive skip connection incorporates the high-resolution image details and eliminates the class-irrelevant noise. Extensive experiment results demonstrate that CCQ outperforms all the state-of-the-art models on the MOTS dataset, which consists of seven organ and tumor segmentation tasks. Code is available at https://github.com/Yang-007/CCQ.git.

PDF Details DOI

NeurIPS Conference 2023 Conference Paper

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

Ge Zheng
Bin Yang
Jiajin Tang
Hong-Yu Zhou
Sibei Yang

A long-standing goal of AI systems is to perform complex multimodal reasoning like humans. Recently, large language models (LLMs) have made remarkable strides in such multi-step reasoning on the language modality solely by leveraging the chain of thought (CoT) to mimic human thinking. However, the transfer of these advancements to multimodal contexts introduces heightened challenges, including but not limited to the impractical need for labor-intensive annotation and the limitations in terms of flexibility, generalizability, and explainability. To evoke CoT reasoning in multimodality, this work first conducts an in-depth analysis of these challenges posed by multimodality and presents two key insights: “keeping critical thinking” and “letting everyone do their jobs” in multimodal CoT reasoning. Furthermore, this study proposes a novel DDCoT prompting that maintains a critical attitude through negative-space prompting and incorporates multimodality into reasoning by first dividing the reasoning responsibility of LLMs into reasoning and recognition and then integrating the visual recognition capability of visual models into the joint reasoning process. The rationales generated by DDCoT not only improve the reasoning abilities of both large and small language models in zero-shot prompting and fine-tuning learning, significantly outperforming state-of-the-art methods but also exhibit impressive generalizability and explainability.

PDF Details

NeurIPS Conference 2023 Conference Paper

Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator

Hanzhuo Huang
Yufan Feng
Cheng Shi
Lan Xu
Jingyi Yu
Sibei Yang

Text-to-video is a rapidly growing research area that aims to generate a semantic, identical, and temporal coherence sequence of frames that accurately align with the input text prompt. This study focuses on zero-shot text-to-video generation considering the data- and cost-efficient. To generate a semantic-coherent video, exhibiting a rich portrayal of temporal semantics such as the whole process of flower blooming rather than a set of ``moving images'', we propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantic-coherence prompt sequence, while pre-trained latent diffusion models (LDMs) as the animator to generate the high fidelity frames. Furthermore, to ensure temporal and identical coherence while maintaining semantic coherence, we propose a series of annotative modifications to adapting LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation. Without any video data and training requirements, Free-Bloom generates vivid and high-quality videos, awe-inspiring in generating complex scenes with semantic meaningful frame sequences. In addition, Free-Bloom is naturally compatible with LDMs-based extensions.

PDF Details

AAAI Conference 2019 Conference Paper

Non-Local Context Encoder: Robust Biomedical Image Segmentation against Adversarial Attacks

Xiang He
Sibei Yang
Guanbin Li
Haofeng Li
Huiyou Chang
Yizhou Yu

Recent progress in biomedical image segmentation based on deep convolutional neural networks (CNNs) has drawn much attention. However, its vulnerability towards adversarial samples cannot be overlooked. This paper is the first one that discovers that all the CNN-based state-of-the-art biomedical image segmentation models are sensitive to adversarial perturbations. This limits the deployment of these methods in safety-critical biomedical fields. In this paper, we discover that global spatial dependencies and global contextual information in a biomedical image can be exploited to defend against adversarial attacks. To this end, non-local context encoder (NLCE) is proposed to model short- and longrange spatial dependencies and encode global contexts for strengthening feature activations by channel-wise attention. The NLCE modules enhance the robustness and accuracy of the non-local context encoding network (NLCEN), which learns robust enhanced pyramid feature representations with NLCE modules, and then integrates the information across different levels. Experiments on both lung and skin lesion segmentation datasets have demonstrated that NLCEN outperforms any other state-of-the-art biomedical image segmentation methods against adversarial attacks. In addition, NLCE modules can be applied to improve the robustness of other CNN-based biomedical image segmentation methods.

PDF Details