Arrow Research Search

Author name cluster

Lei Cui

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers
2 author rows

Possible papers (17)

AAAI Conference 2026 Conference Paper

Towards Effective and Efficient Context-aware Nucleus Detection in Histopathology Whole Slide Images

  • Zhongyi Shui
  • Honglin Li
  • Yunlong Zhang
  • Yuxuan Sun
  • Yiwen Ye
  • Pingyi Chen
  • Ruizhe Guo
  • Lei Cui

Nucleus detection in histopathology whole slide images (WSIs) is crucial for a broad spectrum of clinical applications. The gigapixel size of WSIs necessitates the use of sliding window methodology for nucleus detection. However, mainstream methods process each sliding window independently, which overlooks broader contextual information and easily leads to inaccurate predictions. To address this limitation, recent studies additionally crop a large Field-of-View (LFoV) patch centered on each sliding window to extract contextual features. However, such methods substantially increase whole-slide inference latency. In this work, we propose an effective and efficient context-aware nucleus detection approach. Specifically, instead of using LFoV patches, we aggregate contextual clues from off-the-shelf features of historically visited sliding windows, which greatly enhances inference efficiency. Moreover, compared to the LFoV patches used in previous works, the sliding window patches have higher magnification and provide finer-grained tissue details, thereby enhancing classification accuracy. To develop the proposed context-aware model, we utilize annotated patches along with their surrounding unlabeled patches for training. Beyond exploiting high-level tissue context from these surrounding regions, we design a post-training strategy that leverages abundant unlabeled nucleus samples within them to enhance the model's context adaptability. Extensive experimental results on three challenging benchmarks demonstrate the superiority of our method.
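
A minimal sketch of the cached-context idea the abstract describes: features of already-visited sliding windows are kept in a grid-indexed cache and aggregated as context for the current window, instead of re-encoding a large field-of-view patch. All names here are hypothetical; `encode_window` stands in for any patch feature extractor.

```python
import numpy as np

def encode_window(patch: np.ndarray) -> np.ndarray:
    # Placeholder encoder: pool the patch into a fixed 64-d feature.
    return np.resize(patch.mean(axis=(0, 1)), 64)

def detect_with_context(windows, grid_positions, radius=1):
    """Process windows in scan order, reusing cached neighbour features."""
    cache = {}          # (row, col) -> feature of an already-visited window
    outputs = []
    for patch, (r, c) in zip(windows, grid_positions):
        feat = encode_window(patch)
        # Gather off-the-shelf features from previously visited neighbours.
        neighbours = [cache[(r + dr, c + dc)]
                      for dr in range(-radius, radius + 1)
                      for dc in range(-radius, radius + 1)
                      if (dr, dc) != (0, 0) and (r + dr, c + dc) in cache]
        context = np.mean(neighbours, axis=0) if neighbours else np.zeros_like(feat)
        outputs.append(np.concatenate([feat, context]))  # input to a detection head
        cache[(r, c)] = feat
    return outputs

wins = [np.random.rand(32, 32, 3) for _ in range(4)]
pos = [(0, 0), (0, 1), (1, 0), (1, 1)]
feats = detect_with_context(wins, pos)
```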

NeurIPS Conference 2025 Conference Paper

Think Only When You Need with Large Hybrid-Reasoning Models

  • Lingjie Jiang
  • Xun Wu
  • Shaohan Huang
  • Qingxiu Dong
  • Zewen Chi
  • Li Dong
  • Xingxing Zhang
  • Tengchao Lv

Recent Large Reasoning Models (LRMs) have shown substantially improved reasoning capabilities over traditional Large Language Models (LLMs) by incorporating extended thinking processes prior to producing final responses. However, excessively lengthy thinking introduces substantial overhead in terms of token consumption and latency, which is unnecessary for simple queries. In this work, we introduce Large Hybrid-Reasoning Models (LHRMs), the first kind of model capable of adaptively determining whether to perform reasoning based on the contextual information of user queries. To achieve this, we propose a two-stage training pipeline comprising Hybrid Fine-Tuning (HFT) as a cold start, followed by online reinforcement learning with the proposed Hybrid Group Policy Optimization (HGPO) to implicitly learn to select the appropriate reasoning mode. Furthermore, we introduce a metric called Hybrid Accuracy to quantitatively assess the model's capability for hybrid reasoning. Extensive experimental results show that LHRMs can adaptively perform hybrid reasoning on queries of varying difficulty and type. They outperform existing LRMs and LLMs in reasoning and general capabilities while significantly improving efficiency. Together, our work advocates for a reconsideration of the appropriate use of extended reasoning processes and provides a solid starting point for building hybrid reasoning systems.
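
A hedged sketch of the inference-time behaviour the abstract implies, not the paper's implementation: the model first commits to a reasoning mode, and the extended thinking trace is generated only when "think" mode is selected. The mode-selection heuristic below is a toy stand-in for what the paper learns via HFT + HGPO.

```python
class ToyLHRM:
    """Hypothetical stand-in for a trained hybrid-reasoning model."""

    def choose_mode(self, query: str) -> str:
        # The paper learns this choice via HFT + HGPO; a crude length
        # heuristic keeps the sketch runnable.
        return "think" if len(query.split()) > 12 else "no_think"

    def answer(self, query: str, thinking: str = "") -> str:
        return f"answer to {query!r}" + (" (with reasoning)" if thinking else "")

def hybrid_generate(model: ToyLHRM, query: str) -> str:
    if model.choose_mode(query) == "think":
        trace = f"extended step-by-step reasoning about {query!r}"
        return model.answer(query, thinking=trace)
    return model.answer(query)   # skip the thinking overhead for easy queries

print(hybrid_generate(ToyLHRM(), "What is 2 + 2?"))
```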

NeurIPS Conference 2024 Conference Paper

Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

  • Wenshan Wu
  • Shaoguang Mao
  • Yadong Zhang
  • Yan Xia
  • Li Dong
  • Lei Cui
  • Furu Wei

Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks. However, their abilities in spatial reasoning, a crucial aspect of human cognition, remain relatively unexplored. Humans possess a remarkable ability to create mental images of unseen objects and actions through a process known as the Mind's Eye, enabling the imagination of the unseen world. Inspired by this cognitive capacity, we propose Visualization-of-Thought (VoT) prompting. VoT aims to elicit the spatial reasoning of LLMs by visualizing their reasoning traces, thereby guiding subsequent reasoning steps. We employed VoT for multi-hop spatial reasoning tasks, including natural language navigation, visual navigation, and visual tiling in 2D grid worlds. Experimental results demonstrated that VoT significantly enhances the spatial reasoning abilities of LLMs. Notably, VoT outperformed existing multimodal large language models (MLLMs) in these tasks. While VoT works surprisingly well on LLMs, the ability to generate mental images to facilitate spatial reasoning resembles the mind's eye process, suggesting its potential viability in MLLMs. The dataset and code can be found on our project page.
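
A sketch of a VoT-style prompt based on the abstract's description (interleave a drawn state visualization with each reasoning step); the instruction wording is an assumption, not the paper's exact prompt text.

```python
def vot_prompt(task: str) -> str:
    """Build a Visualization-of-Thought style prompt for a grid-world task."""
    lines = [
        task,
        "",
        "Solve this step by step. After each reasoning step, draw the",
        "current state of the world as a small ASCII grid (one character",
        "per cell) before moving on to the next step.",
        "Finish with: Final answer: <answer>",
    ]
    return "\n".join(lines)

print(vot_prompt("Navigate from S to G on a 3x3 grid, avoiding cells marked #."))
```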

NeurIPS Conference 2023 Conference Paper

Language Is Not All You Need: Aligning Perception with Language Models

  • Shaohan Huang
  • Li Dong
  • Wenhui Wang
  • Yaru Hao
  • Saksham Singhal
  • Shuming Ma
  • Tengchao Lv
  • Lei Cui

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train KOSMOS-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that KOSMOS-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal and from multimodal to language. In addition, we introduce a Raven IQ test dataset, which diagnoses the nonverbal reasoning capability of MLLMs.

NeurIPS Conference 2023 Conference Paper

TextDiffuser: Diffusion Models as Text Painters

  • Jingye Chen
  • Yupan Huang
  • Tengchao Lv
  • Lei Cui
  • Qifeng Chen
  • Furu Wei

Diffusion models have gained increasing attention for their impressive generation abilities but currently struggle with rendering accurate and coherent text. To address this issue, we introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from text prompts, and then diffusion models generate images conditioned on the text prompt and the generated layout. Additionally, we contribute the first large-scale dataset of text images with OCR annotations, MARIO-10M, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the MARIO-Eval benchmark to serve as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we demonstrate that TextDiffuser is flexible and controllable, creating high-quality text images using text prompts alone or together with text template images, and performing text inpainting to reconstruct incomplete images with text. We will make the code, model, and dataset publicly available.
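
A skeleton of the two-stage generation described above. Every class and method name is a hypothetical stub, not TextDiffuser's API: stage 1 places keyword boxes, stage 2 samples an image conditioned on the prompt plus the layout.

```python
class ToyLayoutTransformer:
    def place(self, keywords):
        # Assign each keyword an (x, y, w, h) box; the real model predicts these.
        return [(kw, (16, 16 + 24 * i, 120, 20)) for i, kw in enumerate(keywords)]

class ToyDiffusionModel:
    def sample(self, prompt, layout):
        return {"prompt": prompt, "layout": layout}   # stands in for an image

def generate_text_image(prompt, layout_model, diffusion_model):
    # Toy keyword extraction: take the quoted words from the prompt.
    keywords = [w.strip('"') for w in prompt.split() if w.startswith('"')]
    layout = layout_model.place(keywords)                # stage 1: layout
    return diffusion_model.sample(prompt, layout)        # stage 2: diffusion

img = generate_text_image('a poster saying "Hello" "World"',
                          ToyLayoutTransformer(), ToyDiffusionModel())
print(img["layout"])
```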

AAAI Conference 2023 Conference Paper

TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models

  • Minghao Li
  • Tengchao Lv
  • Jingye Chen
  • Lei Cui
  • Yijuan Lu
  • Dinei Florencio
  • Cha Zhang
  • Zhoujun Li

Text recognition is a long-standing research problem for document digitalization. Existing approaches are usually built on CNNs for image understanding and RNNs for character-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on printed, handwritten, and scene text recognition tasks. The TrOCR models and code are publicly available at https://aka.ms/trocr.
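
TrOCR ships as pretrained checkpoints in Hugging Face Transformers; the snippet below follows the published model-card usage pattern. The checkpoint name and the local image path are assumptions for illustration.

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open("line.png").convert("RGB")   # a single text-line image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)    # wordpiece-level decoding
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```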

NeurIPS Conference 2022 Conference Paper

A Unified Model for Multi-class Anomaly Detection

  • Zhiyuan You
  • Lei Cui
  • Yujun Shen
  • Kai Yang
  • Xin Lu
  • Yu Zheng
  • Xinyi Le

Despite the rapid advance of unsupervised anomaly detection, existing methods require training separate models for different objects. In this work, we present UniAD, which accomplishes anomaly detection for multiple classes with a unified framework. Under such a challenging setting, popular reconstruction networks may fall into an "identical shortcut", where both normal and anomalous samples can be well recovered, and hence fail to spot outliers. To tackle this obstacle, we make three improvements. First, we revisit the formulations of the fully-connected, convolutional, and attention layers, and confirm the important role of query embedding (i.e., within the attention layer) in preventing the network from learning the shortcut. We therefore come up with a layer-wise query decoder to help model the multi-class distribution. Second, we employ a neighbor masked attention module to further avoid information leakage from the input feature to the reconstructed output feature. Third, we propose a feature jittering strategy that urges the model to recover the correct message even with noisy inputs. We evaluate our algorithm on the MVTec-AD and CIFAR-10 datasets, where we surpass the state-of-the-art alternatives by a large margin. For example, when learning a unified model for the 15 categories in MVTec-AD, we surpass the second-best competitor on both anomaly detection (from 88.1% to 96.5%) and anomaly localization (from 89.5% to 96.8%). Code is available at https://github.com/zhiyuanyou/UniAD.
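
A minimal sketch of the feature-jittering idea from the abstract: perturb the encoder's feature tokens with noise whose magnitude is relative to the feature norm, then train the reconstruction to recover the clean features. The scale, shapes, and the linear "decoder" are assumptions, not UniAD's exact recipe.

```python
import torch

def feature_jitter(feats: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """Perturb (B, N, C) feature tokens with norm-relative Gaussian noise."""
    noise = torch.randn_like(feats)
    noise = noise / noise.norm(dim=-1, keepdim=True)       # random unit direction
    magnitude = feats.norm(dim=-1, keepdim=True) / scale   # relative magnitude
    return feats + noise * magnitude

feats = torch.randn(2, 196, 256)                  # encoder tokens (assumed shape)
decoder = torch.nn.Linear(256, 256)               # stand-in for the query decoder
recon = decoder(feature_jitter(feats))
loss = torch.nn.functional.mse_loss(recon, feats) # recover the clean features
```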

JBHI Journal 2022 Journal Article

Automatic Lung Nodule Segmentation and Intra-Nodular Heterogeneity Image Generation

  • Jiangdian Song
  • Shih-Cheng Huang
  • Brendan Kelly
  • Guanqun Liao
  • Jingyun Shi
  • Ning Wu
  • Weimin Li
  • Zaiyi Liu

Automatic segmentation of lung nodules on computed tomography (CT) images is challenging owing to the variability of morphology, location, and intensity. In addition, few segmentation methods can capture intra-nodular heterogeneity to assist lung nodule diagnosis. In this study, we propose an end-to-end architecture to perform fully automated segmentation of multiple types of lung nodules and generate intra-nodular heterogeneity images for clinical use. To this end, a hybrid loss is considered by introducing a Faster R-CNN model based on a generalized intersection-over-union (GIoU) loss into a generative adversarial network. The Lung Image Database Consortium image collection dataset, comprising 2,635 lung nodules, was combined with 3,200 lung nodules from five hospitals for this study. Compared with manual segmentation by radiologists, the proposed model obtained an average Dice coefficient (DC) of 82.05% on the test dataset. Compared with U-Net, NoduleNet, nnU-Net, and three other models, the proposed method achieved comparable performance on lung nodule segmentation and generated more vivid and valid intra-nodular heterogeneity images, which are beneficial in radiological diagnosis. In an external test of 91 patients from another hospital, the proposed model achieved an average DC of 81.61%. The proposed method effectively addresses the challenges of unavoidable human interaction and additional pre-processing procedures in existing solutions for lung nodule segmentation. In addition, the results show that the intra-nodular heterogeneity images generated by the proposed model are suitable for facilitating lung nodule diagnosis in radiology.
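
For reference, the GIoU term mentioned above is a standard, well-defined quantity; the implementation below is a generic GIoU for axis-aligned boxes (x1, y1, x2, y2), not the paper's code.

```python
def giou(box_a, box_b):
    """Generalized IoU between two valid axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection area.
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union area.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box penalizes boxes that are far apart.
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    enclose = (cx2 - cx1) * (cy2 - cy1)
    return iou - (enclose - union) / enclose

loss = 1.0 - giou((0, 0, 10, 10), (5, 5, 15, 15))   # GIoU loss term
print(round(loss, 4))
```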

ICML Conference 2021 Conference Paper

AutoSampling: Search for Effective Data Sampling Schedules

  • Ming Sun 0008
  • Haoxuan Dou
  • Baopu Li
  • Junjie Yan
  • Wanli Ouyang
  • Lei Cui

Data sampling plays a pivotal role in training deep learning models. However, an effective sampling schedule is difficult to learn because, as a hyper-parameter, it is inherently high-dimensional. In this paper, we propose AutoSampling, a method that automatically learns sampling schedules for model training; it consists of a multi-exploitation step that searches for optimal local sampling schedules and an exploration step that searches for the ideal sampling distribution. More specifically, we achieve sampling schedule search with a shortened exploitation cycle to provide enough supervision. In addition, we periodically estimate the sampling distribution from the learned sampling schedules and perturb it to search in the distribution space. The combination of the two searches allows us to learn a robust sampling schedule. We apply AutoSampling to a variety of image classification tasks, illustrating the effectiveness of the proposed method.
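
A sketch of the exploration step described above: estimate a class-sampling distribution from the learned schedules, then perturb it to propose candidates in the distribution space. The Dirichlet perturbation is an assumption for illustration, not the paper's exact operator.

```python
import numpy as np

def estimate_distribution(schedules: np.ndarray) -> np.ndarray:
    """schedules: (num_schedules, num_classes) array of per-class sample counts."""
    counts = schedules.sum(axis=0).astype(float)
    return counts / counts.sum()

def perturb(dist: np.ndarray, rng, concentration: float = 200.0) -> np.ndarray:
    # Dirichlet noise centred on the current distribution; higher
    # concentration means smaller perturbations.
    return rng.dirichlet(concentration * dist)

rng = np.random.default_rng(0)
schedules = np.array([[120, 80, 100], [90, 110, 100]])
base = estimate_distribution(schedules)
candidates = [perturb(base, rng) for _ in range(4)]   # explored distributions
```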

AAAI Conference 2020 Conference Paper

Cross-Modality Attention with Semantic Graph Embedding for Multi-Label Classification

  • Renchun You
  • Zhiyao Guo
  • Lei Cui
  • Xiang Long
  • Yingze Bao
  • Shilei Wen

Multi-label image and video classification are fundamental yet challenging tasks in computer vision. The main challenges lie in capturing spatial or temporal dependencies between labels and discovering the locations of discriminative features for each class. To overcome these challenges, we propose to use cross-modality attention with semantic graph embedding for multi-label classification. Based on the constructed label graph, we propose an adjacency-based similarity graph embedding method to learn semantic label embeddings, which explicitly exploit label relationships. Our novel cross-modality attention maps are then generated with the guidance of the learned label embeddings. Experiments on two multi-label image classification datasets (MS-COCO and NUS-WIDE) show that our method outperforms existing state-of-the-art approaches. In addition, we validate our method on a large multi-label video classification dataset (YouTube-8M Segments), and the evaluation results demonstrate the generalization capability of our method.
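
A generic sketch of label-embedding-guided attention of the kind the abstract describes (shapes and pooling are assumptions, not the paper's code): per-class spatial attention comes from the dot product between semantic label embeddings and CNN feature-map positions.

```python
import torch

B, C, H, W, num_labels = 2, 256, 14, 14, 80
feats = torch.randn(B, C, H, W)            # CNN feature map
label_emb = torch.randn(num_labels, C)     # learned via graph embedding

flat = feats.flatten(2)                                   # (B, C, H*W)
attn = torch.einsum("lc,bcn->bln", label_emb, flat)       # (B, L, H*W)
attn = attn.softmax(dim=-1)                               # per-class attention map
class_feats = torch.einsum("bln,bcn->blc", attn, flat)    # per-class features
logits = (class_feats * label_emb).sum(-1)                # (B, L) multi-label scores
```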

AAAI Conference 2020 Conference Paper

Loss-Based Attention for Deep Multiple Instance Learning

  • Xiaoshuang Shi
  • Fuyong Xing
  • Yuanpu Xie
  • Zizhao Zhang
  • Lei Cui
  • Lin Yang

Although attention mechanisms have been widely used in deep learning for many tasks, they are rarely utilized to solve multiple instance learning (MIL) problems, where only a general category label is given for the multiple instances contained in one bag. Additionally, previous deep MIL methods first utilize the attention mechanism to learn instance weights and then employ a fully connected layer to predict the bag label, so that the bag prediction is largely determined by the effectiveness of the learned instance weights. To alleviate this issue, in this paper, we propose a novel loss-based attention mechanism, which simultaneously learns instance weights, instance predictions, and bag predictions for deep multiple instance learning. Specifically, it calculates instance weights based on the loss function, e.g., softmax + cross-entropy, and shares the parameters with the fully connected layer that produces both instance and bag predictions. Additionally, a regularization term consisting of the learned weights and cross-entropy functions is utilized to boost the recall of instances, and a consistency cost is used to smooth the training of the neural networks and boost model generalization. Extensive experiments on multiple types of benchmark databases demonstrate that the proposed attention mechanism is a general, effective, and efficient framework, which achieves superior bag and image classification performance over other state-of-the-art MIL methods while obtaining higher instance precision and recall than previous attention mechanisms. Source code is available at https://github.com/xsshi2015/Loss-Attention.
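
A sketch of the parameter-sharing idea in the abstract: a single linear layer produces per-instance class scores, the instance weights are derived from those same scores (here via the maximum class probability), and the bag prediction is the weight-averaged instance prediction. The weighting choice is an assumption for illustration.

```python
import torch

class LossBasedAttentionMIL(torch.nn.Module):
    def __init__(self, in_dim: int = 512, num_classes: int = 2):
        super().__init__()
        self.fc = torch.nn.Linear(in_dim, num_classes)  # shared parameters

    def forward(self, instances):                # (num_instances, in_dim)
        logits = self.fc(instances)              # per-instance predictions
        probs = logits.softmax(dim=-1)
        weights = probs.max(dim=-1).values       # confidence as instance weight
        weights = weights / weights.sum()
        bag_logits = (weights.unsqueeze(-1) * logits).sum(dim=0)
        return logits, weights, bag_logits

model = LossBasedAttentionMIL()
inst_logits, w, bag = model(torch.randn(16, 512))   # one bag of 16 instances
```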

AAAI Conference 2019 Conference Paper

LiveBot: Generating Live Video Comments Based on Visual and Textual Contexts

  • Shuming Ma
  • Lei Cui
  • Damai Dai
  • Furu Wei
  • Xu Sun

We introduce the task of automatic live commenting. Live commenting, also called “video barrage”, is an emerging feature on online video sites that allows real-time comments from viewers to fly across the screen like bullets or roll along the right side of the screen. Live comments are a mixture of opinions about the video and chit-chat with other commenters. Automatic live commenting requires AI agents to comprehend the videos and interact with the human viewers who make the comments, making it a good testbed for an AI agent’s ability to deal with both dynamic vision and language. In this work, we construct a large-scale live comment dataset with 2,361 videos and 895,929 live comments. We then introduce two neural models to generate live comments based on the visual and textual contexts, which achieve better performance than previous neural baselines such as the sequence-to-sequence model. Finally, we provide a retrieval-based evaluation protocol for automatic live commenting in which the model is asked to sort a set of candidate comments by log-likelihood score and is evaluated on metrics such as mean reciprocal rank. Putting it all together, we demonstrate the first “LiveBot”. The datasets and code can be found at https://github.com/lancopku/livebot.
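
The retrieval protocol above reduces to ranking candidates by model score and computing mean reciprocal rank; here `score` is a hypothetical stand-in for any sequence log-likelihood function.

```python
def mean_reciprocal_rank(batches, score):
    """batches: list of (context, candidates, gold_index) triples."""
    total = 0.0
    for context, candidates, gold in batches:
        ranked = sorted(range(len(candidates)),
                        key=lambda i: score(context, candidates[i]),
                        reverse=True)                  # highest score first
        total += 1.0 / (ranked.index(gold) + 1)        # reciprocal rank of gold
    return total / len(batches)

# Toy usage with a fake scorer that prefers longer comments:
fake = [("video clip", ["lol", "great goal!", "hi"], 1)]
print(mean_reciprocal_rank(fake, lambda ctx, c: len(c)))   # -> 1.0
```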

IJCAI Conference 2018 Conference Paper

Attention-Fused Deep Matching Network for Natural Language Inference

  • Chaoqun Duan
  • Lei Cui
  • Xinchi Chen
  • Furu Wei
  • Conghui Zhu
  • Tiejun Zhao

Natural language inference aims to predict whether a premise sentence can infer another hypothesis sentence. Recent approaches to this task rely only on shallow interaction between sentence pairs, which is insufficient for modeling complex relations. In this paper, we present an attention-fused deep matching network (AF-DMN) for natural language inference. Unlike existing models, AF-DMN takes two sentences as input and iteratively learns attention-aware representations for each side through multi-level interactions. Moreover, we add a self-attention mechanism to fully exploit local context information within each sentence. Experimental results show that AF-DMN achieves state-of-the-art performance and outperforms strong baselines on the Stanford Natural Language Inference (SNLI), Multi-Genre Natural Language Inference (MultiNLI), and Quora duplicate questions datasets.
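
One cross-attention interaction of the kind the abstract stacks across levels (a generic sketch, not AF-DMN's exact fusion): each hypothesis token attends over premise tokens, and vice versa, yielding mutually aware representations.

```python
import torch

def cross_attend(queries, keys_values):
    """Scaled dot-product attention of queries over keys_values."""
    scores = queries @ keys_values.transpose(-1, -2) / queries.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ keys_values

premise = torch.randn(1, 12, 128)      # (batch, premise_len, dim)
hypothesis = torch.randn(1, 9, 128)    # (batch, hypothesis_len, dim)
h_aware = cross_attend(hypothesis, premise)   # premise-aware hypothesis
p_aware = cross_attend(premise, hypothesis)   # hypothesis-aware premise
```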

AAAI Conference 2014 Conference Paper

Machine Translation with Real-Time Web Search

  • Lei Cui
  • Ming Zhou
  • Qiming Chen
  • Dongdong Zhang
  • Mu Li

Contemporary machine translation systems usually rely on offline data retrieved from the web to train individual models, such as translation models and language models. In contrast to existing methods, we propose a novel approach that treats machine translation as a web search task and utilizes the web on the fly to acquire translation knowledge. This end-to-end approach takes advantage of fresh web search results, leveraging tremendous web knowledge to obtain phrase-level candidates on demand and then compose sentence-level translations. Experimental results show that our web-based machine translation method demonstrates very promising performance in leveraging fresh translation knowledge and making translation decisions. Furthermore, when combined with offline models, it significantly outperforms a state-of-the-art phrase-based statistical machine translation system.