Arrow Research search

Author name cluster

Xinlong Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

24 papers
2 author rows

Possible papers (24)

TMLR Journal 2026 Journal Article

MIRA: Multi-view Information Retrieval with Adaptive Routing for Test-time Long-video Comprehension

  • Zecheng Hao
  • Wayne Ma
  • Yufeng Cui
  • Shuang Li
  • Xinlong Wang
  • Tiejun Huang

Foundational Multi-modal Large Language Models (MLLMs) have achieved rapid progress in handling complex tasks across diverse modalities. However, they still struggle to deliver satisfactory performance on Long-video Comprehension (LVC) tasks involving thousands of frames. Existing optimization strategies can be broadly categorized into LVC-specific fine-tuning, built-in token compression and training-free keyframe extraction, with the latter being most suitable for flexible deployment across various MLLMs. Unfortunately, current training-free approaches predominantly focus on query-frame relevance retrieval, overlooking other levels of visual information and the inherent heterogeneity of LVC tasks. In this work, we propose the $\textbf{M}$ulti-view $\textbf{I}$nformation $\textbf{R}$etrieval with $\textbf{A}$daptive Routing ($\textbf{MIRA}$) framework, which evaluates video frames using distinct metrics for relevance and causality, combines these scores to select a balanced pool of keyframes, and employs an adaptive feedback loop to tailor the retrieval process to different user queries, enabling more precise and sample-grained video comprehension. Extensive experiments demonstrate the advanced performance of our scheme across multiple challenging LVC benchmarks. For instance, integrating $\textbf{MIRA}$ with Qwen-2.5-VL yields performance gains of 3.5% to 13.1% on LVB, VideoMME and MLVU.
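
MIRA's actual metrics, routing, and feedback loop are more involved; as a rough sketch of the keyframe-pooling idea, assuming per-frame relevance and causality scores are already available (all names hypothetical):

```python
import numpy as np

def select_keyframes(relevance, causality, k=32, alpha=0.5):
    """Combine two per-frame score views and keep a balanced top-k pool.

    relevance, causality: 1-D arrays of per-frame scores (hypothetical inputs).
    alpha: mixing weight between the two views.
    """
    rel = (relevance - relevance.min()) / (relevance.max() - relevance.min() + 1e-8)
    cau = (causality - causality.min()) / (causality.max() - causality.min() + 1e-8)
    combined = alpha * rel + (1.0 - alpha) * cau
    keep = np.argsort(combined)[-k:]      # top-k frames by the combined score
    return np.sort(keep)                  # restore temporal order

# toy usage on 1000 random frame scores
rng = np.random.default_rng(0)
print(select_keyframes(rng.random(1000), rng.random(1000), k=16))
```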

NeurIPS Conference 2025 Conference Paper

Audio-Sync Video Generation with Multi-Stream Temporal Control

  • Shuchen Weng
  • Haojie Zheng
  • Zheng Chang
  • Si Li
  • Boxin Shi
  • Xinlong Wang

Audio is inherently temporal and closely synchronized with the visual world, making it a naturally aligned and expressive control signal for controllable video generation (e.g., movies). Beyond control, directly translating audio into video is essential for understanding and visualizing rich audio narratives (e.g., podcasts or historical recordings). However, existing approaches fall short in generating high-quality videos with precise audio-visual synchronization, especially across diverse and complex audio types. In this work, we introduce MTV, a versatile framework for audio-sync video generation. MTV explicitly separates audio into speech, effects, and music tracks, enabling disentangled control over lip motion, event timing, and visual mood, respectively, resulting in fine-grained and semantically aligned video generation. To support the framework, we additionally present DEMIX, a dataset comprising high-quality cinematic videos and demixed audio tracks. DEMIX is structured into five overlapping subsets, enabling scalable multi-stage training for diverse generation scenarios. Extensive experiments demonstrate that MTV achieves state-of-the-art performance across six standard metrics spanning video quality, text-video consistency, and audio-video alignment.

ICLR Conference 2025 Conference Paper

Autoregressive Video Generation without Vector Quantization

  • Haoge Deng
  • Ting Pan
  • Haiwen Diao
  • Zhengxiong Luo
  • Yufeng Cui
  • Huchuan Lu
  • Shiguang Shan
  • Yonggang Qi

This paper presents a novel approach that enables autoregressive video generation with high efficiency. We propose to reformulate the video generation problem as a non-quantized autoregressive modeling of temporal frame-by-frame prediction and spatial set-by-set prediction. Unlike raster-scan prediction in prior autoregressive models or joint distribution modeling of fixed-length tokens in diffusion models, our approach maintains the causal property of GPT-style models for flexible in-context capabilities, while leveraging bidirectional modeling within individual frames for efficiency. With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity, i.e., 0.6B parameters. NOVA also outperforms state-of-the-art image diffusion models in text-to-image generation tasks, with a significantly lower training cost. Additionally, NOVA generalizes well across extended video durations and enables diverse zero-shot applications in one unified model. Code and models are publicly available at https://github.com/baaivision/NOVA.
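
The temporal-causal, spatially-bidirectional factorization can be pictured as a block attention mask; the sketch below only illustrates that mask structure, not NOVA's actual architecture:

```python
import torch

def frame_block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean attention mask: True = attention allowed.

    Tokens attend bidirectionally within their own frame and causally to all
    earlier frames, mirroring the frame-by-frame temporal / set-by-set spatial
    prediction described above (a structural sketch only).
    """
    n = num_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame
    # allowed when the query token's frame index >= the key token's frame index
    return frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)

print(frame_block_causal_mask(num_frames=3, tokens_per_frame=2).int())
```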

ICLR Conference 2025 Conference Paper

Diffusion Feedback Helps CLIP See Better

  • Wenxuan Wang 0002
  • Quan Sun
  • Fan Zhang
  • Yepeng Tang
  • Jing Liu 0001
  • Xinlong Wang

Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of vision and multimodal tasks. However, recent studies reveal that CLIP has severe visual shortcomings: it can hardly distinguish orientation, quantity, color, structure, etc. These visual shortcomings also limit the perception capabilities of multimodal large language models (MLLMs) built on CLIP. The main reason could be that the image-text pairs used to train CLIP are inherently biased, due to the lack of distinctiveness in the text and diversity in the images. In this work, we present a simple post-training approach for CLIP models, which largely overcomes their visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, with only images (without corresponding text). We demonstrate that DIVA improves CLIP's performance by a large margin (e.g., 3-7%) on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities, and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP's strong zero-shot capabilities. The code is publicly available at https://github.com/baaivision/DIVA.

NeurIPS Conference 2025 Conference Paper

End-to-End Vision Tokenizer Tuning

  • Wenxuan Wang
  • Fan Zhang
  • Yufeng Cui
  • Haiwen Diao
  • Zhuoyan Luo
  • Huchuan Lu
  • Jing Liu
  • Xinlong Wang

Existing vision tokenization isolates the optimization of vision tokenizers from downstream training, implicitly assuming the visual tokens can generalize well across various tasks, e.g., image generation and visual question answering. The vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks requiring varied representations and semantics. This decoupled paradigm introduces a critical misalignment: the information loss of vision tokenization can become the representation bottleneck for target tasks. For example, errors in tokenizing text in a given image lead to poor results when recognizing or generating that text. To address this, we propose ETT, an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks. Unlike prior autoregressive models that use only discrete indices from a frozen vision tokenizer, ETT leverages the visual embeddings of the tokenizer codebook and optimizes the vision tokenizer end-to-end with both reconstruction and caption objectives. ETT can be seamlessly integrated into existing training pipelines with minimal architectural modifications: it is simple to implement and integrate, without the need to adjust the original codebooks or architectures of the employed large language models. Extensive experiments demonstrate that our proposed end-to-end vision tokenizer tuning unlocks significant performance gains, i.e., 2-6% for multimodal understanding and visual generation tasks compared to frozen tokenizer baselines, while preserving the original reconstruction capability. We hope this simple and strong method can empower multimodal foundation models beyond image generation and understanding.

ICLR Conference 2025 Conference Paper

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

  • Lianghui Zhu
  • Xinggang Wang
  • Xinlong Wang

Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics cannot measure them comprehensively. To address this problem, we propose to fine-tune LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLM-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales from 7B and 13B to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases that arise when fine-tuning an LLM as a judge, namely position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM obtains state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. JudgeLM is efficient: the JudgeLM-7B needs only 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, achieving an agreement exceeding 90% that even surpasses human-to-human agreement. JudgeLM also demonstrates extended capabilities as a judge of single answers, multimodal models, multiple answers, multi-turn chat, etc.
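
Of the listed techniques, swap augmentation is the easiest to illustrate: the judge sees each answer pair in both orders so that position alone cannot drive the verdict. The prompt template below is a hypothetical stand-in, not JudgeLM's actual format:

```python
def build_judge_prompts(question: str, answer_a: str, answer_b: str):
    """Return the original and the position-swapped judge prompt.

    Training or evaluating on both orders discourages the judge from
    preferring an answer simply because of where it appears.
    """
    template = (
        "You are a judge. Question:\n{q}\n\n"
        "Answer 1:\n{a}\n\nAnswer 2:\n{b}\n\n"
        "Score each answer from 1 to 10 and explain briefly."
    )
    original = template.format(q=question, a=answer_a, b=answer_b)
    swapped = template.format(q=question, a=answer_b, b=answer_a)
    return original, swapped

p1, p2 = build_judge_prompts("What is 2+2?", "4", "5")
print(p1, "\n---\n", p2)
```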

NeurIPS Conference 2025 Conference Paper

Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards

  • Honghao Chen
  • Xingzhou Lou
  • Xiaokun Feng
  • Kaiqi Huang
  • Xinlong Wang

Chain of thought reasoning has demonstrated remarkable success in large language models, yet its adaptation to vision-language reasoning remains an open challenge with unclear best practices. Existing attempts typically employ reasoning chains at a coarse-grained level, which struggle to perform fine-grained structured reasoning and, more importantly, make it difficult to evaluate the reward and quality of intermediate reasoning steps. In this work, we delve into chain of step reasoning for vision-language models, enabling accurate assessment of reasoning-step quality and leading to effective reinforcement learning and inference-time scaling with fine-grained rewards. We present a simple, effective, and fully transparent framework, including the step-level reasoning data, process reward model (PRM), and reinforcement learning training. With the proposed approaches, our models set strong baselines with consistent improvements on challenging vision-language benchmarks. More importantly, we conduct a thorough empirical analysis and ablation study, unveiling the impact of each component and several intriguing properties of inference-time scaling. We believe this paper serves as a baseline for vision-language models and offers insights into more complex multimodal reasoning. Our dataset, PRM, and code are available at https://github.com/baaivision/CoS.
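
As a generic illustration of inference-time scaling with a process reward model (not the paper's exact scoring or aggregation), one can score each step of several candidate chains and keep the chain whose weakest step is strongest:

```python
def best_of_n_with_prm(candidate_chains, prm_score):
    """Pick one of N candidate reasoning chains using step-level rewards.

    candidate_chains: list of chains, each a list of step strings.
    prm_score: callable mapping a step to a scalar reward (here each step is
    scored in isolation for simplicity). Aggregating by the minimum step
    reward is one common choice; other aggregations are possible.
    """
    def chain_value(steps):
        return min(prm_score(s) for s in steps)
    return max(candidate_chains, key=chain_value)

# toy usage with a hypothetical length-based "reward"
chains = [["step a", "step bb"], ["step ccc", "step dddd"]]
print(best_of_n_with_prm(chains, prm_score=len))
```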

NeurIPS Conference 2024 Conference Paper

A Simple Image Segmentation Framework via In-Context Examples

  • Yang Liu
  • Chenchen Jing
  • Hengtao Li
  • Muzhi Zhu
  • Hao Chen
  • Xinlong Wang
  • Chunhua Shen

Recently, there have been explorations of generalist segmentation models that can effectively tackle a variety of image segmentation tasks within a unified in-context learning framework. However, these methods still struggle with task ambiguity in in-context segmentation, as not all in-context examples can accurately convey the task information. In order to address this issue, we present SINE, a simple image $\textbf{S}$egmentation framework utilizing $\textbf{in}$-context $\textbf{e}$xamples. Our approach leverages a Transformer encoder-decoder structure, where the encoder provides high-quality image representations, and the decoder is designed to yield multiple task-specific output masks to eliminate task ambiguity effectively. Specifically, we introduce an In-context Interaction module, which complements in-context information and produces correlations between the target image and the in-context example, and a Matching Transformer, which uses fixed matching and a Hungarian algorithm to eliminate differences between tasks. In addition, we further refine the current evaluation system for in-context image segmentation, aiming to facilitate a holistic appraisal of these models. Experiments on various segmentation tasks show the effectiveness of the proposed method.

NeurIPS Conference 2024 Conference Paper

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

  • Xiaotong Li
  • Fan Zhang
  • Haiwen Diao
  • Yueze Wang
  • Xinlong Wang
  • Ling-Yu Duan

Existing Multimodal Large Language Models (MLLMs) increasingly emphasize complex understanding of various visual elements, including multiple objects, text information, and spatial relations. Their development for comprehensive visual perception hinges on the availability of high-quality image-text datasets that offer diverse visual elements and thorough image descriptions. However, the scarcity of such hyper-detailed datasets currently hinders progress within the MLLM community. The bottleneck stems from the limited perceptual capabilities of current caption engines, which fall short in providing complete and accurate annotations. To facilitate cutting-edge research on comprehensive visual perception for MLLMs, we propose Perceptual Fusion, a low-budget but highly effective caption engine for complete and accurate image descriptions. Specifically, Perceptual Fusion integrates diverse perception experts as image priors to provide explicit information on visual elements and adopts an efficient MLLM as a centric pivot to mimic advanced MLLMs' perception abilities. We carefully select 1M highly representative images from the uncurated LAION dataset and generate dense descriptions using our engine, dubbed DenseFusion-1M. Extensive experiments validate that our engine outperforms its counterparts, and the resulting dataset significantly improves the perception and cognition abilities of existing MLLMs across diverse vision-language benchmarks, especially with high-resolution images as inputs. The code and dataset are available at https://huggingface.co/datasets/BAAI/DenseFusion-1M.

ICLR Conference 2024 Conference Paper

Emu: Generative Pretraining in Multimodality

  • Quan Sun
  • Qiying Yu
  • Yufeng Cui
  • Fan Zhang
  • Xiaosong Zhang
  • Yueze Wang
  • Hongcheng Gao
  • Jingjing Liu

We present Emu, a multimodal foundation model that seamlessly generates images and text in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the leverage of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, supporting in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.
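
The unified objective can be sketched as a masked combination of next-token cross-entropy on text positions and regression on visual-embedding positions; tensor names below are placeholders, not Emu's code:

```python
import torch
import torch.nn.functional as F

def unified_multimodal_loss(text_logits, text_targets, text_mask,
                            visual_preds, visual_targets, visual_mask):
    """Sketch of a unified autoregressive objective over interleaved sequences.

    text_logits:    (B, T, V) next-token logits at every position
    text_targets:   (B, T)    next-token ids
    text_mask:      (B, T)    1.0 where the next element is a text token
    visual_preds:   (B, T, D) regressed next visual embeddings
    visual_targets: (B, T, D) ground-truth next visual embeddings
    visual_mask:    (B, T)    1.0 where the next element is a visual embedding
    """
    ce = F.cross_entropy(text_logits.transpose(1, 2), text_targets, reduction="none")
    text_loss = (ce * text_mask).sum() / text_mask.sum().clamp(min=1)

    mse = F.mse_loss(visual_preds, visual_targets, reduction="none").mean(dim=-1)
    visual_loss = (mse * visual_mask).sum() / visual_mask.sum().clamp(min=1)
    return text_loss + visual_loss
```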

ICLR Conference 2024 Conference Paper

Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching

  • Yang Liu 0357
  • Muzhi Zhu
  • Hengtao Li
  • Hao Chen 0041
  • Xinlong Wang
  • Chunhua Shen

Powered by large-scale pre-training, vision foundation models exhibit significant potential in open-world image understanding. However, unlike large language models that excel at directly tackling various language tasks, vision foundation models require a task-specific model structure followed by fine-tuning on specific tasks. In this work, we present $\textbf{Matcher}$, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks. Matcher can segment anything by using an in-context example without training. Additionally, we design three effective components within the Matcher framework to collaborate with these foundation models and unleash their full potential in diverse perception tasks. Matcher demonstrates impressive generalization performance across various segmentation tasks, all without training. For example, it achieves 52.7% mIoU on COCO-20$^i$ with one example, surpassing the state-of-the-art specialist model by 1.6%. In addition, Matcher achieves 33.0% mIoU on the proposed LVIS-92$^i$ for one-shot semantic segmentation, outperforming the state-of-the-art generalist model by 14.4%. Our visualization results further showcase the open-world generality and flexibility of Matcher when applied to images in the wild.
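
A rough sketch of the matching stage under simplifying assumptions: patch features of the reference and target images (e.g., from a frozen encoder) are compared by cosine similarity, and the best-matching target patches would then seed point prompts for a promptable mask model. Matcher's later components are omitted:

```python
import torch
import torch.nn.functional as F

def match_prompts(ref_feats, ref_mask, tgt_feats, top_k=20):
    """Patch-level feature matching (structural sketch only).

    ref_feats, tgt_feats: (N, D) patch features from a frozen encoder
    ref_mask: (N,) boolean, which reference patches lie inside the example mask
    Returns indices of the target patches most similar to the masked region.
    """
    ref = F.normalize(ref_feats[ref_mask], dim=-1)
    tgt = F.normalize(tgt_feats, dim=-1)
    sim = tgt @ ref.T                       # (N_tgt, N_ref_masked) cosine similarities
    score = sim.max(dim=-1).values          # best match of each target patch
    return score.topk(min(top_k, score.numel())).indices

feats_ref, feats_tgt = torch.randn(256, 768), torch.randn(256, 768)
mask = torch.zeros(256, dtype=torch.bool)
mask[:40] = True
print(match_prompts(feats_ref, mask, feats_tgt).shape)  # torch.Size([20])
```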

ICLR Conference 2024 Conference Paper

Uni3D: Exploring Unified 3D Representation at Scale

  • Junsheng Zhou
  • Jinsheng Wang
  • Baorui Ma
  • Yu-Shen Liu
  • Tiejun Huang 0003
  • Xinlong Wang

Scaling up representations for images or text has been extensively investigated in the past few years and has led to revolutions in learning vision and language. However, scalable representation for 3D objects and scenes is relatively unexplored. In this work, we present Uni3D, a 3D foundation model to explore unified 3D representation at scale. Uni3D uses a 2D-initialized ViT, pretrained end-to-end, to align 3D point cloud features with image-text aligned features. Via the simple architecture and pretext task, Uni3D can leverage abundant 2D pretrained models as initialization and image-text aligned models as the target, unlocking the great potential of 2D model zoos and scaling-up strategies for the 3D world. We efficiently scale up Uni3D to one billion parameters, and set new records on a broad range of 3D tasks, such as zero-shot classification, few-shot classification, open-world understanding, and zero-shot part segmentation. We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild. We believe that Uni3D provides a new direction for exploring both scaling up and efficiency of representation in the 3D domain.

NeurIPS Conference 2024 Conference Paper

Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation

  • Muzhi Zhu
  • Yang Liu
  • Zekai Luo
  • Chenchen Jing
  • Hao Chen
  • Guangkai Xu
  • Xinlong Wang
  • Chunhua Shen

The Diffusion Model has not only garnered noteworthy achievements in the realm of image generation but has also demonstrated its potential as an effective pretraining method utilizing unlabeled data. Drawing from the extensive potential unveiled by the Diffusion Model in both semantic correspondence and open vocabulary segmentation, our work initiates an investigation into employing the Latent Diffusion Model for Few-shot Semantic Segmentation. Recently, inspired by the in-context learning ability of large language models, Few-shot Semantic Segmentation has evolved into In-context Segmentation tasks, morphing into a crucial element in assessing generalist segmentation models. In this context, we concentrate on Few-shot Semantic Segmentation, establishing a solid foundation for the future development of a Diffusion-based generalist model for segmentation. Our initial focus lies in understanding how to facilitate interaction between the query image and the support image, resulting in the proposal of a KV fusion method within the self-attention framework. Subsequently, we delve deeper into optimizing the infusion of information from the support mask and simultaneously re-evaluating how to provide reasonable supervision from the query mask. Based on our analysis, we establish a simple and effective framework named DiffewS, maximally retaining the original Latent Diffusion Model's generative framework and effectively utilizing the pre-training prior. Experimental results demonstrate that our method significantly outperforms the previous SOTA models in multiple settings.
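
The KV fusion idea can be sketched as plain self-attention in which the support-image tokens are concatenated into the keys and values only; this shows the structure, not the DiffewS implementation inside the diffusion U-Net:

```python
import torch

def kv_fused_attention(query_img_tokens, support_tokens, w_q, w_k, w_v):
    """Self-attention over query-image tokens with support keys/values fused in.

    The query image attends over the concatenation of its own tokens and the
    support-image tokens, so support information flows in through K/V only
    (a structural sketch with single-head attention and no output projection).
    """
    q = query_img_tokens @ w_q                                   # (Nq, d)
    kv_src = torch.cat([query_img_tokens, support_tokens], dim=0)  # (Nq + Ns, d_in)
    k = kv_src @ w_k
    v = kv_src @ w_v
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

d = 64
tokens_q, tokens_s = torch.randn(196, d), torch.randn(196, d)
w = [torch.randn(d, d) * d ** -0.5 for _ in range(3)]
print(kv_fused_attention(tokens_q, tokens_s, *w).shape)  # torch.Size([196, 64])
```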

NeurIPS Conference 2024 Conference Paper

Unveiling Encoder-Free Vision-Language Models

  • Haiwen Diao
  • Yufeng Cui
  • Xiaotong Li
  • Yueze Wang
  • Huchuan Lu
  • Xinlong Wang

Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. However, the vision encoders set a strong inductive bias in abstracting visual representation, e. g. , resolution, aspect ratio, and semantic priors, which could impede the flexibility and efficiency of the VLMs. Training pure VLMs that accept the seamless vision and language inputs, i. e. , without vision encoders, remains challenging and rarely explored. Empirical observations reveal that direct training without encoders results in slow convergence and large performance gaps. In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments: (1) Bridging vision-language representation inside one unified decoder; (2) Enhancing visual recognition capability via extra supervision. With these strategies, we launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently. Notably, solely utilizing 35M publicly accessible data, EVE can impressively rival the encoder-based VLMs of similar capacities across multiple vision-language benchmarks. It significantly outperforms the counterpart Fuyu-8B with mysterious training procedures and undisclosed training data. We believe that EVE provides a transparent and efficient route for developing pure decoder-only architecture across modalities.

ICML Conference 2024 Conference Paper

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

  • Lianghui Zhu
  • Bencheng Liao
  • Qian Zhang 0009
  • Xinlong Wang
  • Wenyu Liu 0001
  • Xinggang Wang

Recently, state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long sequence modeling. Meanwhile, building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance on self-attention for visual representation learning is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation and memory efficiency. For example, Vim is 2.8x faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248x1248. The results demonstrate that Vim is capable of overcoming the computation and memory constraints on performing Transformer-style understanding for high-resolution images, and it has great potential to be the next-generation backbone for vision foundation models.
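
A structural sketch of the bidirectional part of a Vim-style block, with the Mamba selective-scan mixers left as placeholder modules (the sketch does not reimplement the SSM itself):

```python
import torch
import torch.nn as nn

class BidirectionalBlock(nn.Module):
    """Run causal sequence mixers in both directions and merge (sketch only).

    `fwd_op` and `bwd_op` stand in for the Mamba selective-scan mixers; each
    maps (B, L, D) -> (B, L, D). Vim also adds position embeddings to the
    patch sequence before blocks like this.
    """
    def __init__(self, dim: int, fwd_op: nn.Module, bwd_op: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd_op, self.bwd_op = fwd_op, bwd_op

    def forward(self, x):                              # x: (B, L, D)
        h = self.norm(x)
        fwd = self.fwd_op(h)                           # left-to-right pass
        bwd = self.bwd_op(h.flip(1)).flip(1)           # right-to-left pass
        return x + fwd + bwd                           # residual merge

# shape check with Linear layers as trivially shape-compatible placeholders
block = BidirectionalBlock(192, nn.Linear(192, 192), nn.Linear(192, 192))
print(block(torch.randn(2, 197, 192)).shape)           # torch.Size([2, 197, 192])
```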

ICLR Conference 2023 Conference Paper

Conditional Positional Encodings for Vision Transformers

  • Xiangxiang Chu
  • Zhi Tian
  • Bo Zhang 0046
  • Xinlong Wang
  • Chunhua Shen

We propose a conditional positional encoding (CPE) scheme for vision Transformers. Unlike previous fixed or learnable positional encodings that are predefined and independent of input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE can easily generalize to input sequences longer than those the model has seen during training. Besides, CPE can keep the desired translation equivariance in vision tasks, resulting in improved performance. We implement CPE with a simple Position Encoding Generator (PEG) that is seamlessly incorporated into the current Transformer framework. Built on PEG, we present the Conditional Position encoding Vision Transformer (CPVT). We demonstrate that CPVT has visually similar attention maps compared to those with learned positional encodings and delivers superior results.
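
A minimal sketch of a PEG: a zero-padded depthwise convolution over the 2-D token map, whose output is added back to the tokens, so the positional signal is conditioned on local neighbors and extends to any input size (class-token handling and placement within the network are omitted):

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Position Encoding Generator: depthwise conv over the 2-D token map."""
    def __init__(self, dim: int, k: int = 3):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, k, stride=1, padding=k // 2, groups=dim)

    def forward(self, tokens, h, w):            # tokens: (B, H*W, C)
        b, n, c = tokens.shape
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        feat = self.proj(feat) + feat            # conditional positional encoding + residual
        return feat.flatten(2).transpose(1, 2)   # back to (B, H*W, C)

peg = PEG(dim=384)
x = torch.randn(2, 14 * 14, 384)
print(peg(x, 14, 14).shape)  # torch.Size([2, 196, 384])
```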

NeurIPS Conference 2023 Conference Paper

Fine-Grained Visual Prompting

  • Lingfeng Yang
  • Yueze Wang
  • Xiang Li
  • Xinlong Wang
  • Jian Yang

Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive zero-shot transfer capabilities in image-level visual perception. However, these models have shown limited performance in instance-level tasks that demand precise localization and recognition. Previous works have suggested that incorporating visual prompts, such as colorful boxes or circles, can improve the ability of models to recognize objects of interest. Nonetheless, compared to language prompting, visual prompting designs are rarely explored. Existing approaches, which employ coarse visual cues such as colorful boxes or circles, often result in sub-optimal performance due to the inclusion of irrelevant and noisy pixels. In this paper, we carefully study the visual prompting designs by exploring more fine-grained markings, such as segmentation masks and their variations. In addition, we introduce a new zero-shot framework that leverages pixel-level annotations acquired from a generalist segmentation model for fine-grained visual prompting. Consequently, our investigation reveals that a straightforward application of blur outside the target mask, referred to as the Blur Reverse Mask, exhibits exceptional effectiveness. This proposed prompting strategy leverages the precise mask annotations to reduce focus on weakly related regions while retaining spatial coherence between the target and the surrounding background. Our Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset. The part detection experiments conducted on the PACO dataset further validate the preponderance of FGVP over existing visual prompting techniques. Code is available at https://github.com/ylingfeng/FGVP.
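
The Blur Reverse Mask prompt is easy to sketch: blur the whole image, then paste the sharp pixels back inside the target mask. The helper below assumes an RGB PIL image and a binary mask, e.g. from a generalist segmentation model:

```python
import numpy as np
from PIL import Image, ImageFilter

def blur_reverse_mask(image: Image.Image, mask: np.ndarray, radius: float = 15.0) -> Image.Image:
    """Blur everything outside the target mask; keep the target region sharp.

    image: RGB PIL image; mask: binary (H, W) array aligned with the image.
    A sketch of the Blur Reverse Mask prompt, not the paper's exact settings.
    """
    blurred = image.filter(ImageFilter.GaussianBlur(radius))
    m = mask.astype(bool)[..., None]                      # (H, W, 1) for broadcasting
    out = np.where(m, np.asarray(image), np.asarray(blurred))
    return Image.fromarray(out.astype(np.uint8))

img = Image.fromarray(np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8))
m = np.zeros((64, 64), dtype=bool)
m[16:48, 16:48] = True
blur_reverse_mask(img, m).save("prompted.png")
```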

AAAI Conference 2023 Conference Paper

Point-Teaching: Weakly Semi-supervised Object Detection with Point Annotations

  • Yongtao Ge
  • Qiang Zhou
  • Xinlong Wang
  • Chunhua Shen
  • Zhibin Wang
  • Hao Li

Point annotations are considerably more time-efficient than bounding box annotations. However, how to use cheap point annotations to boost the performance of semi-supervised object detection is still an open question. In this work, we present Point-Teaching, a weakly- and semi-supervised object detection framework to fully utilize the point annotations. Specifically, we propose a Hungarian-based point-matching method to generate pseudo labels for point-annotated images. We further propose multiple instance learning (MIL) approaches at the level of images and points to supervise the object detector with point annotations. Finally, we propose a simple data augmentation, named Point-Guided Copy-Paste, to reduce the impact of those unmatched points. Experiments demonstrate the effectiveness of our method on a few datasets and various data regimes. In particular, Point-Teaching outperforms the previous best method Group R-CNN by 3.1 AP with 5% fully labeled data and 2.3 AP with 30% fully labeled data on the MS COCO dataset. We believe that our proposed framework can largely lower the bar of learning accurate object detectors and pave the way for its broader applications. The code is available at https://github.com/YongtaoGe/Point-Teaching.
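
As a hedged sketch of Hungarian-based point matching, one can assign each annotated point to a predicted box via scipy's linear_sum_assignment over a geometric cost; the paper's actual cost also involves classification terms:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_points_to_proposals(points, boxes, big=1e6):
    """Hungarian assignment of annotated points to predicted boxes (sketch).

    points: (P, 2) array of (x, y) point annotations
    boxes:  (B, 4) array of (x1, y1, x2, y2) proposals
    Cost is the distance from each point to each box center, with a large
    penalty when the point falls outside the box.
    """
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)                # (B, 2)
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)     # (P, B)
    inside = ((points[:, None, 0] >= boxes[None, :, 0]) & (points[:, None, 0] <= boxes[None, :, 2]) &
              (points[:, None, 1] >= boxes[None, :, 1]) & (points[:, None, 1] <= boxes[None, :, 3]))
    cost = np.where(inside, dist, dist + big)
    row, col = linear_sum_assignment(cost)     # one-to-one point-to-box matching
    return list(zip(row.tolist(), col.tolist()))

pts = np.array([[10.0, 10.0], [50.0, 50.0]])
bxs = np.array([[40.0, 40.0, 60.0, 60.0], [0.0, 0.0, 20.0, 20.0]])
print(match_points_to_proposals(pts, bxs))  # [(0, 1), (1, 0)]
```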

AAAI Conference 2021 Conference Paper

Diverse Knowledge Distillation for End-to-End Person Search

  • Xinyu Zhang
  • Xinlong Wang
  • Jia-Wang Bian
  • Chunhua Shen
  • Mingyu You

Person search aims to localize and identify a specific person from a gallery of images. Recent methods can be categorized into two groups, i.e., two-step and end-to-end approaches. The former views person search as two independent tasks and achieves dominant results using separately trained person detection and re-identification (Re-ID) models. The latter performs person search in an end-to-end fashion. Although the end-to-end approaches yield higher inference efficiency, they largely lag behind their two-step counterparts in terms of accuracy. In this paper, we argue that the gap between the two kinds of methods is mainly caused by the Re-ID subnetworks of end-to-end methods. To this end, we propose a simple yet strong end-to-end network with diverse knowledge distillation to break the bottleneck. We also design a spatial-invariant augmentation to help the model be invariant to inaccurate detection results. Experimental results on the CUHK-SYSU and PRW datasets demonstrate the superiority of our method against existing approaches: it achieves accuracy on par with state-of-the-art two-step methods while maintaining high efficiency due to the single joint model. Code is available at https://git.io/DKD-PersonSearch.

JBHI Journal 2020 Journal Article

Learning Hemodynamic Effect of Transcranial Infrared Laser Stimulation Using Longitudinal Data Analysis

  • Qian Wu
  • Xinlong Wang
  • Hanli Liu
  • Li Zeng

Transcranial infrared laser stimulation (TILS) is a promising noninvasive intervention for neurological diseases. Although some experimental work has been done to understand the mechanism of TILS, the reported statistical analysis of the data is quite simple and cannot provide a comprehensive picture of the effect of TILS. This study learns the effect of TILS on hemodynamics of the human brain from experimental data using longitudinal data analysis methods. Specifically, repeated measures analysis of variance (ANOVA) is first applied to confirm the significance of the TILS effect and its characteristics. Based on that, two parametric mixed-effects models and a non-parametric functional mixed-effects model are proposed to model the population-level performance and individual variation of this effect. Interpretations of the fitted models are provided, and the three proposed models are compared in terms of fitting and prediction performance to select the best model. According to the selected model, TILS increases the concentration of oxygenated hemoglobin in the brain, and this effect persists even after the treatment stops. There is also considerable variation among individual responses to TILS.
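
A minimal sketch of the parametric longitudinal modeling described above, using a linear mixed-effects model in statsmodels on synthetic stand-in data (column names, effect sizes, and the exact model form are assumptions, not the study's):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic long-format stand-in: one row per subject per time point.
rng = np.random.default_rng(0)
rows = []
for s in range(20):
    subj_intercept = rng.normal(scale=0.3)          # random effect per subject
    for t in range(10):
        rows.append({"subject": s, "time": t, "treated": int(s < 10),
                     "hbo": subj_intercept + 0.1 * t * (s < 10) + rng.normal(scale=0.2)})
df = pd.DataFrame(rows)

# Fixed effects for time x treatment (population level), random intercept and
# slope per subject (individual variation).
model = smf.mixedlm("hbo ~ time * treated", data=df,
                    groups=df["subject"], re_formula="~time")
print(model.fit().summary())
```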

NeurIPS Conference 2020 Conference Paper

SOLOv2: Dynamic and Fast Instance Segmentation

  • Xinlong Wang
  • Rufeng Zhang
  • Tao Kong
  • Lei Li
  • Chunhua Shen

In this work, we design a simple, direct, and fast framework for instance segmentation with strong performance. To this end, we propose a novel and effective approach, termed SOLOv2, following the principle of the SOLO method [32]. First, our new framework is empowered by an efficient and holistic instance mask representation scheme, which dynamically segments each instance in the image, without resorting to bounding box detection. Specifically, the object mask generation is decoupled into a mask kernel prediction and mask feature learning, which are responsible for generating convolution kernels and the feature maps to be convolved with, respectively. Second, SOLOv2 significantly reduces inference overhead with our novel matrix non-maximum suppression (NMS) technique. Our Matrix NMS performs NMS with parallel matrix operations in one shot, and yields better results. We demonstrate that the proposed SOLOv2 achieves state-of-the-art performance with high efficiency, making it suitable for both mobile and cloud applications. A light-weight version of SOLOv2 executes at 31.3 FPS and yields 37.1% AP on COCO test-dev. Moreover, our state-of-the-art results in object detection (from our mask byproduct) and panoptic segmentation show the potential of SOLOv2 to serve as a new strong baseline for many instance-level recognition tasks. Code is available at https://git.io/AdelaiDet.
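
Matrix NMS replaces the sequential suppression loop with one pass of matrix operations: each mask's score is decayed according to its overlaps with higher-scored masks. A single-class sketch with the Gaussian decay:

```python
import torch

def matrix_nms(masks, scores, sigma=2.0):
    """Matrix NMS with Gaussian decay (single-class sketch).

    masks:  (N, H, W) binary masks, assumed sorted by descending score
    scores: (N,) confidence scores
    Returns decayed scores; no iterative suppression loop is needed.
    """
    n = masks.shape[0]
    flat = masks.reshape(n, -1).float()
    inter = flat @ flat.T                                  # (N, N) pairwise intersections
    areas = flat.sum(dim=1)
    union = areas[:, None] + areas[None, :] - inter
    iou = (inter / union.clamp(min=1)).triu(diagonal=1)    # keep i < j pairs only
    max_iou = iou.max(dim=0).values                        # per mask: worst overlap with higher-scored masks
    # decay[i, j] = f(iou_ij) / f(max_iou_i) with f(x) = exp(-sigma * x^2)
    decay = torch.exp(-sigma * (iou ** 2 - max_iou[:, None] ** 2)).min(dim=0).values
    return scores * decay

m = torch.zeros(3, 8, 8)
m[0, :4] = 1
m[1, :3] = 1
m[2, 4:] = 1
print(matrix_nms(m, torch.tensor([0.9, 0.8, 0.7])))  # the overlapping second mask is decayed
```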

AAAI Conference 2020 Conference Paper

Task-Aware Monocular Depth Estimation for 3D Object Detection

  • Xinlong Wang
  • Wei Yin
  • Tao Kong
  • Yuning Jiang
  • Lei Li
  • Chunhua Shen

Monocular depth estimation enables 3D perception from a single 2D image, thus attracting much research attention for years. Almost all methods treat foreground and background regions (“things and stuff”) in an image equally. However, not all pixels are equal. Depth of foreground objects plays a crucial role in 3D object recognition and localization. To date, how to boost the depth prediction accuracy of foreground objects has rarely been discussed. In this paper, we first analyze the data distributions and interaction of foreground and background, then propose the foreground-background separated monocular depth estimation (ForeSeE) method, which estimates foreground and background depth using separate optimization objectives and decoders. Our method significantly improves the depth estimation performance on foreground objects. Applying ForeSeE to 3D object detection, we achieve 7.5 AP gains and set new state-of-the-art results among other monocular methods. Code will be available at https://github.com/WXinlong/ForeSeE.
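
The separated-objective idea can be sketched as two decoder outputs supervised on disjoint pixel sets, foreground and background; names are placeholders and the paper's actual losses and prediction-merging step differ:

```python
import torch
import torch.nn.functional as F

def fg_bg_depth_loss(pred_fg, pred_bg, gt_depth, fg_mask):
    """Separate depth objectives for foreground and background pixels (sketch).

    pred_fg, pred_bg: depth maps from two decoders; gt_depth: ground truth;
    fg_mask: binary map marking foreground ("things") pixels.
    """
    fg = fg_mask.float()
    bg = 1.0 - fg
    l_fg = (F.l1_loss(pred_fg, gt_depth, reduction="none") * fg).sum() / fg.sum().clamp(min=1)
    l_bg = (F.l1_loss(pred_bg, gt_depth, reduction="none") * bg).sum() / bg.sum().clamp(min=1)
    return l_fg + l_bg

gt = torch.rand(1, 1, 32, 32)
mask = (torch.rand(1, 1, 32, 32) > 0.7)
print(fg_bg_depth_loss(torch.rand_like(gt), torch.rand_like(gt), gt, mask))
```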

ICRA Conference 1998 Conference Paper

Research on Improving Rapid Prototyping

  • Zhiyan Wang
  • Xinlong Wang
  • Shuang Liu

The current rapid prototyping method is a cylindroid approach; its obvious disadvantage is that the rendering program slices with a fixed step length, which cannot flexibly balance manufacturing accuracy against speed. This paper proposes a cone approach and related work on variable step-length slicing, which could be useful for a new type of rapid prototyping.
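
As a generic illustration of variable step-length slicing (not this paper's cone approach), adaptive slicing commonly bounds the cusp height of each layer by the local surface slope, so near-flat regions get thin layers and steep walls get thick ones:

```python
import math

def adaptive_layer_thickness(surface_angle_deg, cusp_max=0.05, t_min=0.05, t_max=0.3):
    """Pick a layer thickness from the local surface slope (generic sketch).

    surface_angle_deg: angle between the surface normal and the build direction.
    Cusp height is roughly thickness * cos(angle), so steeper surfaces
    (angle near 90 degrees) tolerate thicker layers for the same cusp bound.
    """
    cos_a = abs(math.cos(math.radians(surface_angle_deg)))
    t = cusp_max / max(cos_a, 1e-6)
    return min(max(t, t_min), t_max)

for angle in (0, 45, 80):
    print(angle, round(adaptive_layer_thickness(angle), 3))
```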