Author name cluster

Bin Lin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
2 author rows

Possible papers (8)

AAAI Conference 2026 Conference Paper

360Explorer: Exploring 4D Controllable World in Panoramic Videos

  • Xinhua Cheng
  • Haiyang Zhou
  • Wangbo Yu
  • Tanghui Jia
  • Bin Lin
  • Yunyang Ge
  • Weiqi Li
  • Li Yuan

We present 360Explorer, a novel approach for generating 4D controllable panoramic videos conditioned on user-provided 3D instructions for exploring and manipulating dynamic worlds. Whereas existing perspective-based methods struggle to maintain spatial consistency during in-place camera rotation, we introduce the panoramic view into controllable video generation models to inherently preserve view-recall consistency. By introducing dynamic point clouds as the 4D scene representation, 360Explorer unifies the modeling of camera transformations and object movements as incomplete renders that describe precise control instructions in 3D worlds. To tackle the difficulty of acquiring multi-viewpoint panoramic videos, we further propose a reverse warping strategy that constructs the training dataset from easily accessible monocular panoramic videos. Extensive experiments demonstrate that 360Explorer achieves superior performance in creating 4D controllable panoramic videos whose camera transformations and object movements align with diverse provided instructions.
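
The abstract does not give implementation details, but the core rendering step — projecting a 3D point cloud into a panoramic view — can be illustrated with a standard equirectangular projection. The sketch below is only a rough illustration under that assumption; the function name and all parameters are hypothetical and do not come from the paper (which additionally handles dynamics, colors, and occlusion).

```python
import numpy as np

def project_to_equirect(points, height=512, width=1024):
    """Project 3D points (N, 3) in camera coordinates onto an
    equirectangular (panoramic) image, returning integer pixel coords.

    Illustrative sketch only; not the paper's rendering pipeline.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8
    lon = np.arctan2(x, z)                        # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(y / r, -1.0, 1.0))    # latitude in [-pi/2, pi/2]
    u = ((lon / (2 * np.pi)) + 0.5) * (width - 1)   # horizontal pixel
    v = ((lat / np.pi) + 0.5) * (height - 1)        # vertical pixel
    return np.stack([u, v], axis=1).astype(int)
```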

AAAI Conference 2026 Conference Paper

Look-Back: Implicit Visual Re-focusing in MLLM Reasoning

  • Shuo Yang
  • Yuwei Niu
  • Yuyang Liu
  • Yang Ye
  • Bin Lin
  • Li Yuan

Multimodal Large Language Models (MLLMs) have achieved remarkable progress in multimodal reasoning. However, they often excessively rely on textual information during the later stages of inference, neglecting the crucial integration of visual input. Current methods typically address this by explicitly injecting visual information to guide the reasoning process. In this work, through an analysis of MLLM attention patterns, we made an intriguing observation: with appropriate guidance, MLLMs can spontaneously re-focus their attention on visual inputs during the later stages of reasoning, even without explicit visual injection. This spontaneous shift in focus suggests that MLLMs are intrinsically capable of performing visual fusion reasoning. Building on this insight, we introduce Look-Back, an implicit approach designed to guide MLLMs to look back at visual information in a self-directed manner during reasoning. Look-Back empowers the model to autonomously determine when, where, and how to re-focus on visual inputs, eliminating the need for explicit model-structure constraints or additional input. We demonstrate that Look-Back significantly enhances the model's reasoning and perception capabilities, as evidenced by extensive empirical evaluations on multiple multimodal benchmarks.
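
The analysis described above hinges on how much attention the model places on image tokens at later reasoning steps. A minimal, hypothetical diagnostic for that quantity might look like the sketch below; it assumes access to per-layer attention maps and a boolean mask over visual token positions, and is not the authors' code.

```python
import torch

def visual_attention_ratio(attentions, visual_token_mask):
    """Fraction of attention mass the newest generated token places on
    visual tokens, averaged over layers and heads.

    attentions: list of per-layer tensors, each (batch, heads, q_len, k_len)
        for the current decoding step.
    visual_token_mask: bool tensor (k_len,), True at image-token positions.
    Hedged diagnostic sketch, not the paper's implementation.
    """
    ratios = []
    for layer_attn in attentions:
        last = layer_attn[:, :, -1, :]                  # attention of last query over all keys
        visual_mass = last[..., visual_token_mask].sum(dim=-1)
        ratios.append(visual_mass.mean())
    return torch.stack(ratios).mean().item()
```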

AAAI Conference 2026 Conference Paper

Next Patch Prediction for AutoRegressive Visual Generation

  • Yatian Pang
  • Peng Jin
  • Shuo Yang
  • Bin Zhu
  • Bin Lin
  • Chaoran Feng
  • Zhenyu Tang
  • Liuhan Chen

Autoregressive models built on the Next Token Prediction (NTP) paradigm show great potential for developing a unified framework that integrates both language and vision tasks. Pioneering works have introduced NTP to autoregressive visual generation tasks. In this work, we rethink NTP for autoregressive image generation and extend it to a novel Next Patch Prediction (NPP) paradigm. Our key idea is to group and aggregate image tokens into patch tokens with higher information density. By using patch tokens as a more compact input sequence, the autoregressive model is trained to predict the next patch, significantly reducing computational cost. To further exploit the natural hierarchical structure of image data, we propose a multi-scale coarse-to-fine patch grouping strategy. With this strategy, training begins with a large patch size and ends with vanilla NTP where the patch size is 1x1, so the original inference process is kept without modification. Extensive experiments across a diverse range of model sizes demonstrate that NPP can reduce the training cost to around 0.6 times the original while improving image generation quality by up to 1.0 FID on the ImageNet 256x256 generation benchmark. Notably, our method retains the original autoregressive model architecture without introducing additional trainable parameters or designing a custom image tokenizer, offering a flexible and plug-and-play solution for enhancing autoregressive visual generation.
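
The central operation — grouping a grid of image tokens into denser patch tokens — can be illustrated with average pooling over k×k token neighborhoods, with patch size 1 recovering the vanilla token sequence. This is a hedged sketch of the idea under that assumption, not the paper's actual aggregation or training code.

```python
import torch
import torch.nn.functional as F

def group_into_patches(token_embeddings, grid_size, patch_size):
    """Aggregate a (B, H*W, D) token sequence into patch tokens by
    averaging each patch_size x patch_size neighborhood of the token grid.

    patch_size = 1 leaves the sequence unchanged, matching the final
    coarse-to-fine stage described in the abstract. Illustrative only.
    """
    b, n, d = token_embeddings.shape
    h = w = grid_size
    assert n == h * w and h % patch_size == 0
    x = token_embeddings.transpose(1, 2).reshape(b, d, h, w)  # (B, D, H, W)
    x = F.avg_pool2d(x, kernel_size=patch_size)               # (B, D, H/p, W/p)
    return x.flatten(2).transpose(1, 2)                       # (B, (H/p)*(W/p), D)

# Hypothetical coarse-to-fine schedule: large patches first, then vanilla NTP.
patch_schedule = [4, 2, 1]
```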

AAAI Conference 2025 Conference Paper

Cycle3D: High-quality and Consistent Image-to-3D Generation via Generation-Reconstruction Cycle

  • Zhenyu Tang
  • Junwu Zhang
  • Xinhua Cheng
  • Wangbo Yu
  • Chaoran Feng
  • Yatian Pang
  • Bin Lin
  • Li Yuan

Recent 3D large reconstruction models typically employ a two-stage process: first generating multi-view images with a multi-view diffusion model, and then using a feed-forward model to reconstruct those images into 3D content. However, multi-view diffusion models often produce low-quality and inconsistent images, adversely affecting the quality of the final 3D reconstruction. To address this issue, we propose a unified 3D generation framework called Cycle3D, which cyclically applies a 2D diffusion-based generation module and a feed-forward 3D reconstruction module during the multi-step diffusion process. Concretely, the 2D diffusion model generates high-quality texture, while the reconstruction model guarantees multi-view consistency. Moreover, the 2D diffusion model can further control the generated content and inject reference-view information for unseen views, thereby enhancing the diversity and texture consistency of 3D generation during the denoising process. Extensive experiments demonstrate the superior ability of our method to create 3D content with high quality and consistency compared with state-of-the-art baselines.
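
At a high level, the method alternates a 2D denoising step with a feed-forward 3D reconstruction and re-rendering inside the sampling loop. The sketch below only makes that cycle explicit; denoiser, reconstructor, and render_views are hypothetical placeholders for the modules named in the abstract, not the authors' code.

```python
def cycle3d_sampling(noisy_views, timesteps, denoiser, reconstructor, render_views):
    """Sketch of a generation-reconstruction cycle during multi-step diffusion.

    denoiser:      2D diffusion module producing denoised multi-view images.
    reconstructor: feed-forward model mapping multi-view images to a 3D asset.
    render_views:  renders that 3D asset back into the same camera views.
    All three callables are assumptions for illustration.
    """
    views = noisy_views
    asset_3d = None
    for t in timesteps:                      # from high noise to low noise
        denoised = denoiser(views, t)        # 2D generation: high-quality texture
        asset_3d = reconstructor(denoised)   # 3D reconstruction: multi-view consistency
        views = render_views(asset_3d)       # consistent renders feed the next step
    return asset_3d
```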

NeurIPS Conference 2024 Conference Paper

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

  • Lin Chen
  • Xilin Wei
  • Jinsong Li
  • Xiaoyi Dong
  • Pan Zhang
  • Yuhang Zang
  • Zehui Chen
  • Haodong Duan

We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V-annotated dense captions of videos with various lengths and sources, developed through a carefully designed data filtering and annotation strategy. 2) ShareCaptioner-Video, an efficient and capable captioning model for arbitrary videos, with 4.8M high-quality aesthetic videos annotated by it. 3) ShareGPT4Video-8B, a simple yet superb LVLM that reaches SOTA performance on three advanced video benchmarks. To achieve this, setting aside costly and non-scalable human annotation, we find that using GPT4V to caption video with a naive multi-frame or frame-concatenation input strategy leads to less detailed and sometimes temporally confused results. We argue that the challenge of designing a high-quality video captioning strategy lies in three aspects: 1) precise inter-frame temporal change understanding; 2) detailed intra-frame content description; and 3) frame-number scalability for arbitrary-length videos. To this end, we meticulously designed a differential video captioning strategy, which is stable, scalable, and efficient at generating captions for videos of arbitrary resolution, aspect ratio, and length. Based on it, we construct ShareGPT4Video, which contains 40K high-quality videos spanning a wide range of categories; the resulting captions encompass rich world knowledge, object attributes, camera movements, and, crucially, detailed and precise temporal descriptions of events. Based on ShareGPT4Video, we further develop ShareCaptioner-Video, a superior captioner capable of efficiently generating high-quality captions for arbitrary videos. We annotated 4.8M aesthetically appealing videos with it and verified their effectiveness on a 10-second text-to-video generation task. For video understanding, we verified the effectiveness of ShareGPT4Video on several current LVLM architectures and present our superb new LVLM, ShareGPT4Video-8B. All models, strategies, and annotations will be open-sourced, and we hope this project can serve as a pivotal resource for advancing both the LVLM and T2VM communities.
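
A differential captioning strategy of the kind described — caption the first frame in detail, then describe only what changes between consecutive frames, then summarize — could be organized roughly as below. The callables caption_frame, caption_difference, and summarize are hypothetical wrappers around a captioning model (e.g., GPT4V); the paper's actual prompts, filtering, and released code are not reproduced here.

```python
def differential_video_caption(frames, caption_frame, caption_difference, summarize):
    """Sketch of a differential captioning pipeline (assumptions noted above).

    frames: list of sampled video frames of any length or resolution.
    """
    captions = [caption_frame(frames[0])]          # detailed description of the first frame
    for prev, curr in zip(frames, frames[1:]):
        # describe only the change between consecutive frames, keeping
        # the per-frame cost roughly constant for arbitrary-length videos
        captions.append(caption_difference(prev, curr))
    return summarize(captions)                     # merge into one dense caption
```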

JBHI Journal 2024 Journal Article

Vital Sign Monitoring for Cancer Patients Based on Dual-Path Sensor and Divided-Frequency-CNN Model

  • Bin Lin
  • Chuanzheng Jia
  • Huicheng Yang
  • Yi Zhang
  • Xianhe Xie
  • Zhihao Chen
  • Xianzeng Zhang

Monitoring vital signs is a key part of standard medical care for cancer patients. However, traditional methods are unstable, especially when large signal fluctuations occur, while deep-learning-based methods are not tailored to the sensors. In this paper, a dual-path micro-bend optical fiber sensor and a targeted model based on a Divided-Frequency-CNN (DFC) are developed to measure heart rate (HR) and respiratory rate (RR). For each path, frequency-division features based on the mechanism of signal periodicity are combined with stable phase extraction to reduce the interference of body movements during monitoring. The DFC model is then designed to robustly learn the inner information from these features. Finally, a weighted strategy estimates HR and RR from the two paths to increase robustness against errors from a single source. The experiments were carried out on actual clinical data of cancer patients collected by a hospital. The results show that the proposed method performs well in error (3.51 bpm (4.51%) and 2.53 bpm (3.28%) for cancer patients with and without pain, respectively), relevance, and consistency with values from hospital equipment. In addition, compared with our previous work, the proposed method significantly shortens the report time interval (from 30 to 9 min) and improves the mean / confidence interval (from 3.60 / [−22.61, 29.81] to −0.64 / [−9.21, 7.92] for patients with pain, and from 1.87 / [−5.49, 9.23] to −0.16 / [−6.21, 5.89] for patients without pain).
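
The final estimate combines the two sensor paths with a weighted strategy so that an error in one path does not dominate. The abstract does not specify the weighting, so the quality-weighted average below is purely an illustration of the idea; the function and its parameters are assumptions, not the paper's method.

```python
import numpy as np

def fuse_dual_path(rate_a, rate_b, quality_a, quality_b):
    """Combine HR/RR estimates from two sensor paths with weights
    proportional to a per-path quality score. Illustrative sketch only.
    """
    weights = np.array([quality_a, quality_b], dtype=float)
    weights = weights / (weights.sum() + 1e-8)
    return weights[0] * rate_a + weights[1] * rate_b

# e.g. fuse_dual_path(72.0, 75.0, quality_a=0.9, quality_b=0.4)
# leans toward the cleaner first path.
```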

IROS Conference 2022 Conference Paper

Flexible Collision-free Platooning Method for Unmanned Surface Vehicle with Experimental Validations

  • Bin Du
  • Bin Lin
  • Wei Xie 0009
  • Weidong Zhang 0004
  • Rudy R. Negenborn
  • Yusong Pang

This paper addresses the flexible formation problem for unmanned surface vehicles in the presence of obstacles. Building upon the leader-follower formation scheme, a hybrid line-of-sight based flexible platooning method is proposed for the follower vehicle to keep tracking the leader ship. A fusion artificial potential field collision-avoidance approach is tailored to generate optimal collision-free trajectories for the vehicle to track. To steer the vehicle towards, and keep it within, the neighborhood of the generated collision-free trajectory, a nonlinear model predictive controller is designed. Experimental results validate the efficiency of the proposed method, showing that the unmanned surface vehicle is able to track the leader ship without colliding with the surrounding static obstacles in the considered experiments.
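
The collision-avoidance layer builds on artificial potential fields, which combine an attractive pull toward a reference point with repulsive pushes away from nearby obstacles. The sketch below shows the textbook formulation of that steering direction only; it is not the paper's fusion variant, which further combines this idea with line-of-sight guidance and an NMPC tracker.

```python
import numpy as np

def apf_direction(position, goal, obstacles, k_att=1.0, k_rep=1.0, influence=5.0):
    """Classic artificial-potential-field steering direction in 2D.

    position, goal: (2,) arrays; obstacles: list of (2,) arrays.
    Standard textbook formulation, used here purely as an illustration.
    """
    force = k_att * (goal - position)                 # attractive term toward the goal
    for obs in obstacles:
        diff = position - obs
        dist = np.linalg.norm(diff) + 1e-8
        if dist < influence:                          # only nearby obstacles repel
            force += k_rep * (1.0 / dist - 1.0 / influence) / dist**2 * (diff / dist)
    norm = np.linalg.norm(force)
    return force / norm if norm > 1e-8 else force     # unit steering direction
```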