Author name cluster

Bin Lin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
2 author rows

Possible papers (8)

AAAI Conference 2026 Conference Paper

360Explorer: Exploring 4D Controllable World in Panoramic Videos

  • Xinhua Cheng
  • Haiyang Zhou
  • Wangbo Yu
  • Tanghui Jia
  • Bin Lin
  • Yunyang Ge
  • Weiqi Li
  • Li Yuan

We present 360Explorer, a novel approach for generating 4D controllable panoramic videos conditioned on user-provided 3D instructions for exploring and manipulating dynamic worlds. Whereas existing perspective-based methods struggle to maintain spatial consistency during in-place camera rotation, we introduce the panoramic view into controllable video generation models to inherently preserve view-recall consistency. By introducing dynamic point clouds as the 4D scene representation, 360Explorer unifies the modeling of camera transformations and object movements as incomplete renders that describe precise control instructions in 3D worlds. To tackle the difficulty of acquiring multi-viewpoint panoramic videos, we further propose a reverse warping strategy that constructs the training dataset from easily accessible monocular panoramic videos. Extensive experiments demonstrate that 360Explorer achieves superior performance in creating 4D controllable panoramic videos whose camera transformations and object movements align with diverse provided instructions.
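
The abstract does not give implementation details, but the core rendering step — projecting a 3D point cloud into a panoramic view — can be illustrated with a standard equirectangular projection. The sketch below is only a rough illustration under that assumption; the function name and all parameters are hypothetical and do not come from the paper (which additionally handles dynamics, colors, and occlusion).

```python
import numpy as np

def project_to_equirect(points, height=512, width=1024):
    """Project 3D points (N, 3) in camera coordinates onto an
    equirectangular (panoramic) image, returning integer pixel coords.

    Illustrative sketch only; not the paper's rendering pipeline.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8
    lon = np.arctan2(x, z)                        # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(y / r, -1.0, 1.0))    # latitude in [-pi/2, pi/2]
    u = ((lon / (2 * np.pi)) + 0.5) * (width - 1)   # horizontal pixel
    v = ((lat / np.pi) + 0.5) * (height - 1)        # vertical pixel
    return np.stack([u, v], axis=1).astype(int)
```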

AAAI Conference 2026 Conference Paper

Look-Back: Implicit Visual Re-focusing in MLLM Reasoning

  • Shuo Yang
  • Yuwei Niu
  • Yuyang Liu
  • Yang Ye
  • Bin Lin
  • Li Yuan

Multimodal Large Language Models (MLLMs) have achieved remarkable progress in multimodal reasoning. However, they often excessively rely on textual information during the later stages of inference, neglecting the crucial integration of visual input. Current methods typically address this by explicitly injecting visual information to guide the reasoning process. In this work, through an analysis of MLLM attention patterns, we made an intriguing observation: with appropriate guidance, MLLMs can spontaneously re-focus their attention on visual inputs during the later stages of reasoning, even without explicit visual injection. This spontaneous shift in focus suggests that MLLMs are intrinsically capable of performing visual fusion reasoning. Building on this insight, we introduce Look-Back, an implicit approach designed to guide MLLMs to look back at visual information in a self-directed manner during reasoning. Look-Back empowers the model to autonomously determine when, where, and how to re-focus on visual inputs, eliminating the need for explicit model-structure constraints or additional input. We demonstrate that Look-Back significantly enhances the model's reasoning and perception capabilities, as evidenced by extensive empirical evaluations on multiple multimodal benchmarks.
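
The analysis described above hinges on how much attention the model places on image tokens at later reasoning steps. A minimal, hypothetical diagnostic for that quantity might look like the sketch below; it assumes access to per-layer attention maps and a boolean mask over visual token positions, and is not the authors' code.

```python
import torch

def visual_attention_ratio(attentions, visual_token_mask):
    """Fraction of attention mass the newest generated token places on
    visual tokens, averaged over layers and heads.

    attentions: list of per-layer tensors, each (batch, heads, q_len, k_len)
        for the current decoding step.
    visual_token_mask: bool tensor (k_len,), True at image-token positions.
    Hedged diagnostic sketch, not the paper's implementation.
    """
    ratios = []
    for layer_attn in attentions:
        last = layer_attn[:, :, -1, :]                  # attention of last query over all keys
        visual_mass = last[..., visual_token_mask].sum(dim=-1)
        ratios.append(visual_mass.mean())
    return torch.stack(ratios).mean().item()
```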

AAAI Conference 2026 Conference Paper

Next Patch Prediction for AutoRegressive Visual Generation

  • Yatian Pang
  • Peng Jin
  • Shuo Yang
  • Bin Zhu
  • Bin Lin
  • Chaoran Feng
  • Zhenyu Tang
  • Liuhan Chen

Autoregressive models built on the Next Token Prediction (NTP) paradigm show great potential for developing a unified framework that integrates both language and vision tasks. Pioneering works have introduced NTP to autoregressive visual generation tasks. In this work, we rethink NTP for autoregressive image generation and extend it to a novel Next Patch Prediction (NPP) paradigm. Our key idea is to group and aggregate image tokens into patch tokens with higher information density. By using patch tokens as a more compact input sequence, the autoregressive model is trained to predict the next patch, significantly reducing computational cost. To further exploit the natural hierarchical structure of image data, we propose a multi-scale coarse-to-fine patch grouping strategy. With this strategy, training begins with a large patch size and ends with vanilla NTP where the patch size is 1x1, so the original inference process is kept without modification. Extensive experiments across a diverse range of model sizes demonstrate that NPP can reduce the training cost to around 0.6 times the original while improving image generation quality by up to 1.0 FID on the ImageNet 256x256 generation benchmark. Notably, our method retains the original autoregressive model architecture without introducing additional trainable parameters or designing a custom image tokenizer, offering a flexible and plug-and-play solution for enhancing autoregressive visual generation.
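
The central operation — grouping a grid of image tokens into denser patch tokens — can be illustrated with average pooling over k×k token neighborhoods, with patch size 1 recovering the vanilla token sequence. This is a hedged sketch of the idea under that assumption, not the paper's actual aggregation or training code.

```python
import torch
import torch.nn.functional as F

def group_into_patches(token_embeddings, grid_size, patch_size):
    """Aggregate a (B, H*W, D) token sequence into patch tokens by
    averaging each patch_size x patch_size neighborhood of the token grid.

    patch_size = 1 leaves the sequence unchanged, matching the final
    coarse-to-fine stage described in the abstract. Illustrative only.
    """
    b, n, d = token_embeddings.shape
    h = w = grid_size
    assert n == h * w and h % patch_size == 0
    x = token_embeddings.transpose(1, 2).reshape(b, d, h, w)  # (B, D, H, W)
    x = F.avg_pool2d(x, kernel_size=patch_size)               # (B, D, H/p, W/p)
    return x.flatten(2).transpose(1, 2)                       # (B, (H/p)*(W/p), D)

# Hypothetical coarse-to-fine schedule: large patches first, then vanilla NTP.
patch_schedule = [4, 2, 1]
```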

AAAI Conference 2025 Conference Paper

Cycle3D: High-quality and Consistent Image-to-3D Generation via Generation-Reconstruction Cycle

  • Zhenyu Tang
  • Junwu Zhang
  • Xinhua Cheng
  • Wangbo Yu
  • Chaoran Feng
  • Yatian Pang
  • Bin Lin
  • Li Yuan

Recent 3D large reconstruction models typically employ a two-stage process: first generating multi-view images with a multi-view diffusion model, and then using a feed-forward model to reconstruct those images into 3D content. However, multi-view diffusion models often produce low-quality and inconsistent images, adversely affecting the quality of the final 3D reconstruction. To address this issue, we propose a unified 3D generation framework called Cycle3D, which cyclically applies a 2D diffusion-based generation module and a feed-forward 3D reconstruction module during the multi-step diffusion process. Concretely, the 2D diffusion model generates high-quality texture, while the reconstruction model guarantees multi-view consistency. Moreover, the 2D diffusion model can further control the generated content and inject reference-view information for unseen views, thereby enhancing the diversity and texture consistency of 3D generation during the denoising process. Extensive experiments demonstrate the superior ability of our method to create 3D content with high quality and consistency compared with state-of-the-art baselines.
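
At a high level, the method alternates a 2D denoising step with a feed-forward 3D reconstruction and re-rendering inside the sampling loop. The sketch below only makes that cycle explicit; denoiser, reconstructor, and render_views are hypothetical placeholders for the modules named in the abstract, not the authors' code.

```python
def cycle3d_sampling(noisy_views, timesteps, denoiser, reconstructor, render_views):
    """Sketch of a generation-reconstruction cycle during multi-step diffusion.

    denoiser:      2D diffusion module producing denoised multi-view images.
    reconstructor: feed-forward model mapping multi-view images to a 3D asset.
    render_views:  renders that 3D asset back into the same camera views.
    All three callables are assumptions for illustration.
    """
    views = noisy_views
    asset_3d = None
    for t in timesteps:                      # from high noise to low noise
        denoised = denoiser(views, t)        # 2D generation: high-quality texture
        asset_3d = reconstructor(denoised)   # 3D reconstruction: multi-view consistency
        views = render_views(asset_3d)       # consistent renders feed the next step
    return asset_3d
```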

NeurIPS Conference 2024 Conference Paper

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

  • Lin Chen
  • Xilin Wei
  • Jinsong Li
  • Xiaoyi Dong
  • Pan Zhang
  • Yuhang Zang
  • Zehui Chen
  • Haodong Duan

We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V-annotated dense captions of videos with various lengths and sources, developed through a carefully designed data filtering and annotation strategy. 2) ShareCaptioner-Video, an efficient and capable captioning model for arbitrary videos, with 4.8M high-quality aesthetic videos annotated by it. 3) ShareGPT4Video-8B, a simple yet superb LVLM that reaches SOTA performance on three advanced video benchmarks. To achieve this, setting aside costly and non-scalable human annotation, we find that using GPT4V to caption video with a naive multi-frame or frame-concatenation input strategy leads to less detailed and sometimes temporally confused results. We argue that the challenge of designing a high-quality video captioning strategy lies in three aspects: 1) precise inter-frame temporal change understanding; 2) detailed intra-frame content description; and 3) frame-number scalability for arbitrary-length videos. To this end, we meticulously designed a differential video captioning strategy, which is stable, scalable, and efficient at generating captions for videos of arbitrary resolution, aspect ratio, and length. Based on it, we construct ShareGPT4Video, which contains 40K high-quality videos spanning a wide range of categories; the resulting captions encompass rich world knowledge, object attributes, camera movements, and, crucially, detailed and precise temporal descriptions of events. Based on ShareGPT4Video, we further develop ShareCaptioner-Video, a superior captioner capable of efficiently generating high-quality captions for arbitrary videos. We annotated 4.8M aesthetically appealing videos with it and verified their effectiveness on a 10-second text-to-video generation task. For video understanding, we verified the effectiveness of ShareGPT4Video on several current LVLM architectures and present our superb new LVLM, ShareGPT4Video-8B. All models, strategies, and annotations will be open-sourced, and we hope this project can serve as a pivotal resource for advancing both the LVLM and T2VM communities.
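
A differential captioning strategy of the kind described — caption the first frame in detail, then describe only what changes between consecutive frames, then summarize — could be organized roughly as below. The callables caption_frame, caption_difference, and summarize are hypothetical wrappers around a captioning model (e.g., GPT4V); the paper's actual prompts, filtering, and released code are not reproduced here.

```python
def differential_video_caption(frames, caption_frame, caption_difference, summarize):
    """Sketch of a differential captioning pipeline (assumptions noted above).

    frames: list of sampled video frames of any length or resolution.
    """
    captions = [caption_frame(frames[0])]          # detailed description of the first frame
    for prev, curr in zip(frames, frames[1:]):
        # describe only the change between consecutive frames, keeping
        # the per-frame cost roughly constant for arbitrary-length videos
        captions.append(caption_difference(prev, curr))
    return summarize(captions)                     # merge into one dense caption
```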

JBHI Journal 2024 Journal Article

Vital Sign Monitoring for Cancer Patients Based on Dual-Path Sensor and Divided-Frequency-CNN Model

  • Bin Lin
  • Chuanzheng Jia
  • Huicheng Yang
  • Yi Zhang
  • Xianhe Xie
  • Zhihao Chen
  • Xianzeng Zhang

Monitoring vital signs is a key part of standard medical care for cancer patients. However, traditional methods are unstable, especially when large signal fluctuations occur, while deep-learning-based methods are not tailored to the sensors. In this paper, a dual-path micro-bend optical fiber sensor and a targeted model based on a Divided-Frequency-CNN (DFC) are developed to measure heart rate (HR) and respiratory rate (RR). For each path, frequency-division features based on the mechanism of signal periodicity are combined with stable phase extraction to reduce the interference of body movements during monitoring. The DFC model is then designed to robustly learn the inner information from these features. Finally, a weighted strategy estimates HR and RR from the two paths to increase robustness against errors from a single source. The experiments were carried out on actual clinical data of cancer patients collected by a hospital. The results show that the proposed method performs well in error (3.51 bpm (4.51%) and 2.53 bpm (3.28%) for cancer patients with and without pain, respectively), relevance, and consistency with values from hospital equipment. In addition, compared with our previous work, the proposed method significantly shortens the report time interval (from 30 to 9 min) and improves the mean / confidence interval (from 3.60 / [−22.61, 29.81] to −0.64 / [−9.21, 7.92] for patients with pain, and from 1.87 / [−5.49, 9.23] to −0.16 / [−6.21, 5.89] for patients without pain).
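
The final estimate combines the two sensor paths with a weighted strategy so that an error in one path does not dominate. The abstract does not specify the weighting, so the quality-weighted average below is purely an illustration of the idea; the function and its parameters are assumptions, not the paper's method.

```python
import numpy as np

def fuse_dual_path(rate_a, rate_b, quality_a, quality_b):
    """Combine HR/RR estimates from two sensor paths with weights
    proportional to a per-path quality score. Illustrative sketch only.
    """
    weights = np.array([quality_a, quality_b], dtype=float)
    weights = weights / (weights.sum() + 1e-8)
    return weights[0] * rate_a + weights[1] * rate_b

# e.g. fuse_dual_path(72.0, 75.0, quality_a=0.9, quality_b=0.4)
# leans toward the cleaner first path.
```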

IROS Conference 2022 Conference Paper

Flexible Collision-free Platooning Method for Unmanned Surface Vehicle with Experimental Validations

  • Bin Du
  • Bin Lin
  • Wei Xie 0009
  • Weidong Zhang 0004
  • Rudy R. Negenborn
  • Yusong Pang

This paper addresses the flexible formation problem for unmanned surface vehicles in the presence of obstacles. Building upon the leader-follower formation scheme, a hybrid line-of-sight based flexible platooning method is proposed for the follower vehicle to keep tracking the leader ship. A fusion artificial potential field collision-avoidance approach is tailored to generate optimal collision-free trajectories for the vehicle to track. To steer the vehicle towards, and keep it within, the neighborhood of the generated collision-free trajectory, a nonlinear model predictive controller is designed. Experimental results validate the efficiency of the proposed method, showing that the unmanned surface vehicle is able to track the leader ship without colliding with the surrounding static obstacles in the considered experiments.
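
The collision-avoidance layer builds on artificial potential fields, which combine an attractive pull toward a reference point with repulsive pushes away from nearby obstacles. The sketch below shows the textbook formulation of that steering direction only; it is not the paper's fusion variant, which further combines this idea with line-of-sight guidance and an NMPC tracker.

```python
import numpy as np

def apf_direction(position, goal, obstacles, k_att=1.0, k_rep=1.0, influence=5.0):
    """Classic artificial-potential-field steering direction in 2D.

    position, goal: (2,) arrays; obstacles: list of (2,) arrays.
    Standard textbook formulation, used here purely as an illustration.
    """
    force = k_att * (goal - position)                 # attractive term toward the goal
    for obs in obstacles:
        diff = position - obs
        dist = np.linalg.norm(diff) + 1e-8
        if dist < influence:                          # only nearby obstacles repel
            force += k_rep * (1.0 / dist - 1.0 / influence) / dist**2 * (diff / dist)
    norm = np.linalg.norm(force)
    return force / norm if norm > 1e-8 else force     # unit steering direction
```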