Arrow Research search

Author name cluster

Sheng Shen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
1 author row

Possible papers

AAAI Conference 2026 Conference Paper

Rethinking Bias in Generative Data Augmentation for Medical AI: A Frequency Recalibration Method

  • Chi Liu
  • Jincheng Liu
  • Congcong Zhu
  • Minghao Wang
  • Sheng Shen
  • Jia Gu
  • Tianqing Zhu
  • Wanlei Zhou

Developing Medical AI relies on large datasets and easily suffers from data scarcity. Generative data augmentation (GDA) using AI generative models offers a solution to synthesize realistic medical images. However, the bias in GDA is often underestimated in medical domains, with concerns about the risk of introducing detrimental features generated by AI and harming downstream tasks. This paper identifies the frequency misalignment between real and synthesized images as one of the key factors underlying unreliable GDA and proposes the Frequency Recalibration (FreRec) method to reduce the frequency distributional discrepancy and thus improve GDA. FreRec involves (1) Statistical High-frequency Replacement (SHR) to roughly align high-frequency components and (2) Reconstructive High-frequency Mapping (RHM) to enhance image quality and reconstruct high-frequency details. Extensive experiments were conducted in various medical datasets, including brain MRIs, chest X-rays, and fundus images. The results show that FreRec significantly improves downstream medical image classification performance compared to uncalibrated AI-synthesized samples. FreRec is a standalone post-processing step that is compatible with any generative model and can integrate seamlessly with common medical GDA pipelines.

IJCAI Conference 2024 Conference Paper

Efficient Screen Content Image Compression via Superpixel-based Content Aggregation and Dynamic Feature Fusion

  • Sheng Shen
  • Huanjing Yue
  • Jingyu Yang

This paper addresses the challenge of efficiently compressing screen content images (SCIs) – computer generated images with unique attributes such as large uniform regions, sharp edges, and limited color palettes, which pose difficulties for conventional compression algorithms. We propose a Superpixel-based Content Aggregation Block (SCAB) to aggregate local pixels into one super-pixel and aggregate non-local information via super-pixel transformer. Such aggregation enables the dynamic assimilation of non-local information while maintaining manageable complexity. Furthermore, we enhance our channel-wise context entropy model with a Dynamic Feature Fusion (DFF) mechanism. This mechanism integrates decoded slices and side information dynamically based on their global correlation, allowing the network to dynamically learn the optimal weights for global information usage. Extensive experiments on three SCI datasets (SCID, CCT, and SIQAD) show our method’s superior RD performance and inference time, making it the first network comparable with the advanced VVC-SCC standard.

NeurIPS Conference 2024 Conference Paper

Enhancing Large Vision Language Models with Self-Training on Image Comprehension

  • Yihe Deng
  • Pan Lu
  • Fan Yin
  • Ziniu Hu
  • Sheng Shen
  • Quanquan Gu
  • James Zou
  • Kai-Wei Chang

Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the perception capability of the model to understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings to alleviate the need for labeled data by leveraging model's own generation. However, effective self-training remains a challenge regarding the unique visual perception and reasoning capability of LVLMs. To address this, we introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension. First, the model self-constructs a preference dataset for image descriptions using unlabeled images. Preferred responses are generated through a step-by-step prompt, while dis-preferred responses are generated from either corrupted images or misleading prompts. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data and append its self-generated image descriptions to the prompts. We validate the effectiveness of STIC across seven different benchmarks, demonstrating substantial performance gains of 4.0% on average while using 70% less supervised fine-tuning data than the current method. Further studies dive into various components of STIC and highlight its potential to leverage vast quantities of unlabeled images for self-training.

NeurIPS Conference 2024 Conference Paper

MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

  • Anas Awadalla
  • Le Xue
  • Oscar Lo
  • Manli Shu
  • Hannah Lee
  • Etash Guha
  • Matt Jordan
  • Sheng Shen

Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs). Despite the rapid progression of open-source LMMs, there remains a pronounced scarcity of large-scale, open-source multimodal interleaved datasets. In response, we introduce MINT-1T, the most extensive and diverse open-source Multimodal INTerleaved dataset to date. MINT-1T comprises one trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. As scaling multimodal interleaved datasets requires substantial engineering effort, sharing the data curation process and releasing the dataset greatly benefits the community. Our experiments show that LMMs trained on MINT-1T rival the performance of models trained on the previous leading dataset, OBELICS. We release our data at https://github.com/mlfoundations/MINT-1T.

NeurIPS Conference 2023 Conference Paper

Large Language Models are Visual Reasoning Coordinators

  • Liangyu Chen
  • Bo Li
  • Sheng Shen
  • Jingkang Yang
  • Chunyuan Li
  • Kurt Keutzer
  • Trevor Darrell
  • Ziwei Liu

Visual reasoning requires multimodal perception and commonsense cognition of the world. Recently, multiple vision-language models (VLMs) have been proposed with excellent commonsense reasoning ability in various domains. However, how to harness the collective power of these complementary VLMs is rarely explored. Existing methods like ensemble still struggle to aggregate these models with the desired higher-order communications. In this work, we propose Cola, a novel paradigm that coordinates multiple VLMs for visual reasoning. Our key insight is that a large language model (LLM) can efficiently coordinate multiple VLMs by facilitating natural language communication that leverages their distinct and complementary capabilities. Extensive experiments demonstrate that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering (VQA), outside knowledge VQA, visual entailment, and visual spatial reasoning tasks. Moreover, we show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero and few-shot settings, without finetuning. Through systematic ablation studies and visualizations, we validate that a coordinator LLM indeed comprehends the instruction prompts as well as the separate functionalities of VLMs; it then coordinates them to enable impressive visual reasoning capabilities.

IJCAI Conference 2023 Conference Paper

Towards Robust GAN-Generated Image Detection: A Multi-View Completion Representation

  • Chi Liu
  • Tianqing Zhu
  • Sheng Shen
  • Wanlei Zhou

GAN-generated image detection now becomes the first line of defense against the malicious uses of machine-synthesized image manipulations such as deepfakes. Although some existing detectors work well in detecting clean, known GAN samples, their success is largely attributable to overfitting unstable features such as frequency artifacts, which will cause failures when facing unknown GANs or perturbation attacks. To overcome the issue, we propose a robust detection framework based on a novel multi-view image completion representation. The framework first learns various view-to-image tasks to model the diverse distributions of genuine images. Frequency-irrelevant features can be represented from the distributional discrepancies characterized by the completion models, which are stable, generalized, and robust for detecting unknown fake patterns. Then, a multi-view classification is devised with elaborated intra- and inter-view learning strategies to enhance view-specific feature representation and cross-view feature aggregation, respectively. We evaluated the generalization ability of our framework across six popular GANs at different resolutions and its robustness against a broad range of perturbation attacks. The results confirm our method's improved effectiveness, generalization, and robustness over various baselines.

NeurIPS Conference 2022 Conference Paper

K-LITE: Learning Transferable Visual Models with External Knowledge

  • Sheng Shen
  • Chunyuan Li
  • Xiaowei Hu
  • Yujia Xie
  • Jianwei Yang
  • Pengchuan Zhang
  • Zhe Gan
  • Lijuan Wang

The new generation of state-of-the-art computer vision systems are trained from natural language supervision, ranging from simple object category names to descriptive captions. This form of supervision ensures high generality and usability of the learned visual models, based on the broad concept coverage achieved through large-scale data collection process. Alternatively, we argue that learning with external knowledge about images is a promising way which leverages a much more structured source of supervision and offers sample efficiency. In this paper, we propose K-LITE (Knowledge-augmented Language-Image Training and Evaluation), a simple strategy to leverage external knowledge for building transferable visual systems: In training, it enriches entities in natural language with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts; In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones) to enable zero-shot and few-shot transfer of the pre-trained models. We study the performance of K-LITE on two important computer vision problems, image classification and object detection, benchmarking on 20 and 13 different existing datasets, respectively. The proposed knowledge-augmented models show significant improvement in transfer learning performance over existing methods. Our code is released at https://github.com/microsoft/klite.

EAAI Journal 2021 Journal Article

A physics-informed deep learning approach for bearing fault detection

  • Sheng Shen
  • Hao Lu
  • Mohammadkazem Sadoughi
  • Chao Hu
  • Venkat Nemani
  • Adam Thelen
  • Keith Webster
  • Matthew Darr

In recent years, advances in computer technology and the emergence of big data have enabled deep learning to achieve impressive successes in bearing condition monitoring and fault detection. While existing deep learning approaches are able to efficiently detect and classify bearing faults, most of these approaches depend exclusively on data and do not incorporate physical knowledge into the learning and prediction processes—or more importantly, embed the physical knowledge of bearing faults into the model training process, which makes the model physically meaningful. To address this challenge, we propose a physics-informed deep learning approach that consists of a simple threshold model and a deep convolutional neural network (CNN) model for bearing fault detection. In the proposed physics-informed deep learning approach, the threshold model first assesses the health classes of bearings based on known physics of bearing faults. Then, the CNN model automatically extracts high-level characteristic features from the input data and makes full use of these features to predict the health class of a bearing. We designed a loss function for training and validating the CNN model that selectively amplifies the effect of the physical knowledge assimilated by the threshold model when embedding this knowledge into the CNN model. The proposed physics-informed deep learning approach was validated using (1) data from 18 bearings on an agricultural machine operating in the field, and (2) data from bearings on a laboratory test stand in the Case Western Reserve University (CWRU) Bearing Data Center.

AAAI Conference 2021 Conference Paper

ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning

  • Zhewei Yao
  • Amir Gholami
  • Sheng Shen
  • Mustafa Mustafa
  • Kurt Keutzer
  • Michael Mahoney

Incorporating second-order curvature information into machine learning optimization algorithms can be subtle, and doing so naïvely can lead to high per-iteration costs associated with forming the Hessian and performing the associated linear system solve. To address this, we introduce ADAHESSIAN, a new stochastic optimization algorithm. ADAHESSIAN directly incorporates approximate curvature information from the loss function, and it includes several novel performance-improving features, including: (i) a fast Hutchinson based method to approximate the curvature matrix with low computational overhead; (ii) a spatial averaging to reduce the variance of the second derivative; and (iii) a root-mean-square exponential moving average to smooth out variations of the second-derivative across different iterations. We perform extensive tests on NLP, CV, and recommendation system tasks, and ADAHESSIAN achieves state-of-the-art results. In particular, we find that ADAHESSIAN: (i) outperforms AdamW for transformers by 0.13/0.33 BLEU score on IWSLT14/WMT14, 2.7/1.0 PPL on PTB/Wikitext-103; (ii) outperforms AdamW for SqueezeBert by 0.41 points on GLUE; (iii) achieves 1.45%/5.55% higher accuracy on ResNet32/ResNet18 on Cifar10/ImageNet as compared to Adam; and (iv) achieves 0.032% better score than Adagrad for DLRM on the Criteo Ad Kaggle dataset. The cost per iteration of ADAHESSIAN is comparable to first-order methods, and ADAHESSIAN exhibits improved robustness towards variations in hyperparameter values. The code for ADAHESSIAN is open-sourced and publicly-available (Yao and Gholami 2020).
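Editorial illustration (not from the paper): the abstract above mentions a Hutchinson-based approximation of curvature. A minimal sketch of the general Hutchinson idea for estimating a Hessian diagonal is given here, assuming a toy quadratic loss f(x) = 0.5 * x^T A x so that the Hessian-vector product is simply A v. The function names and the matrix `A` are hypothetical; this is not the authors' implementation and omits the spatial averaging and moving-average steps.

```python
import itertools
import random


def hessian_vector_product(A, v):
    """Hessian-vector product for the toy loss f(x) = 0.5 * x^T A x (Hessian is A)."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]


def hutchinson_diagonal(hvp, dim, num_samples, rng):
    """Monte Carlo estimate of the Hessian diagonal: mean of z * (H z), z Rademacher."""
    estimate = [0.0] * dim
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hz = hvp(z)
        for i in range(dim):
            estimate[i] += z[i] * hz[i] / num_samples
    return estimate


def exact_expectation(hvp, dim):
    """Average z * (H z) over all 2^dim sign vectors (the exact expectation)."""
    total = [0.0] * dim
    count = 0
    for signs in itertools.product((-1.0, 1.0), repeat=dim):
        hz = hvp(list(signs))
        for i in range(dim):
            total[i] += signs[i] * hz[i]
        count += 1
    return [t / count for t in total]


# Hypothetical symmetric matrix standing in for a Hessian.
A = [
    [4.0, 1.0, 0.5],
    [1.0, 3.0, -2.0],
    [0.5, -2.0, 5.0],
]

expected_diag = exact_expectation(lambda v: hessian_vector_product(A, v), 3)
sampled_diag = hutchinson_diagonal(
    lambda v: hessian_vector_product(A, v), 3, 2000, random.Random(0)
)
```

The off-diagonal terms cancel in expectation because distinct Rademacher entries are independent with zero mean, which is why only Hessian-vector products are needed and the full Hessian is never formed.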

NeurIPS Conference 2021 Conference Paper

Implicit Transformer Network for Screen Content Image Continuous Super-Resolution

  • Jingyu Yang
  • Sheng Shen
  • Huanjing Yue
  • Kun Li

Nowadays, there is an explosive growth of screen contents due to the wide application of screen sharing, remote cooperation, and online education. To match the limited terminal bandwidth, high-resolution (HR) screen contents may be downsampled and compressed. At the receiver side, the super-resolution (SR) of low-resolution (LR) screen content images (SCIs) is highly demanded by the HR display or by the users to zoom in for detail observation. However, image SR methods mostly designed for natural images do not generalize well for SCIs due to the very different image characteristics as well as the requirement of SCI browsing at arbitrary scales. To this end, we propose a novel Implicit Transformer Super-Resolution Network (ITSRN) for SCISR. For high-quality continuous SR at arbitrary ratios, pixel values at query coordinates are inferred from image features at key coordinates by the proposed implicit transformer and an implicit position encoding scheme is proposed to aggregate similar neighboring pixel values to the query one. We construct benchmark SCI1K and SCI1K-compression datasets with LR and HR SCI pairs. Extensive experiments show that the proposed ITSRN significantly outperforms several competitive continuous and discrete SR methods for both compressed and uncompressed SCIs.

IJCAI Conference 2020 Conference Paper

Emoji-Powered Representation Learning for Cross-Lingual Sentiment Classification (Extended Abstract)

  • Zhenpeng Chen
  • Sheng Shen
  • Ziniu Hu
  • Xuan Lu
  • Qiaozhu Mei
  • Xuanzhe Liu

Sentiment classification typically relies on a large amount of labeled data. In practice, the availability of labels is highly imbalanced among different languages. To tackle this problem, cross-lingual sentiment classification approaches aim to transfer knowledge learned from one language that has abundant labeled examples (i.e., the source language, usually English) to another language with fewer labels (i.e., the target language). The source and the target languages are usually bridged through off-the-shelf machine translation tools. Through such a channel, cross-language sentiment patterns can be successfully learned from English and transferred into the target languages. This approach, however, often fails to capture sentiment knowledge specific to the target language. In this paper, we employ emojis, which are widely available in many languages, as a new channel to learn both the cross-language and the language-specific sentiment patterns. We propose a novel representation learning method that uses emoji prediction as an instrument to learn respective sentiment-aware representations for each language. The learned representations are then integrated to facilitate cross-lingual sentiment classification.

AAAI Conference 2020 Conference Paper

On the Generation of Medical Question-Answer Pairs

  • Sheng Shen
  • Yaliang Li
  • Nan Du
  • Xian Wu
  • Yusheng Xie
  • Shen Ge
  • Tao Yang
  • Kai Wang

Question answering (QA) has achieved promising progress recently. However, answering a question in real-world scenarios like the medical domain is still challenging, due to the requirement of external knowledge and the insufficient quantity of high-quality training data. In the light of these challenges, we study the task of generating medical QA pairs in this paper. With the insight that each medical question can be considered as a sample from the latent distribution of questions given answers, we propose an automated medical QA pair generation framework, consisting of an unsupervised key phrase detector that explores unstructured material for validity, and a generator that involves a multi-pass decoder to integrate structural knowledge for diversity. A series of experiments have been conducted on a real-world dataset collected from the National Medical Licensing Examination of China. Both automatic evaluation and human annotation demonstrate the effectiveness of the proposed method. Further investigation shows that, by incorporating the generated QA pairs for training, significant improvement in terms of accuracy can be achieved for the examination QA system.

AAAI Conference 2020 Conference Paper

Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

  • Sheng Shen
  • Zhen Dong
  • Jiayu Ye
  • Linjian Ma
  • Zhewei Yao
  • Amir Gholami
  • Michael W. Mahoney
  • Kurt Keutzer

Transformer based architectures have become de-facto models used for a range of Natural Language Processing tasks. In particular, the BERT based models achieved significant accuracy gain for GLUE tasks, CoNLL-03 and SQuAD. However, BERT based models have a prohibitive memory footprint and latency. As a result, deploying BERT based models in resource constrained environments has become a challenging task. In this work, we perform an extensive analysis of fine-tuned BERT models using second order Hessian information, and we use our results to propose a novel method for quantizing BERT models to ultra low precision. In particular, we propose a new group-wise quantization scheme, and we use Hessian-based mix-precision method to compress the model further. We extensively test our proposed method on BERT downstream tasks of SST-2, MNLI, CoNLL-03, and SQuAD. We can achieve comparable performance to baseline with at most 2.3% performance degradation, even with ultra-low precision quantization down to 2 bits, corresponding up to 13× compression of the model parameters, and up to 4× compression of the embedding table as well as activations. Among all tasks, we observed the highest performance loss for BERT fine-tuned on SQuAD. By probing into the Hessian based analysis as well as visualization, we show that this is related to the fact that current training/fine-tuning strategy of BERT does not converge for SQuAD.
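Editorial illustration (not from the paper): the abstract above refers to a group-wise quantization scheme. A minimal sketch of the general idea is given here, assuming uniform min-max quantization in which each group of weights gets its own range. Splitting a flat list into contiguous groups, the function names, and the example weights are all illustrative assumptions; the paper's actual grouping and bit allocation are not reproduced.

```python
def quantize_group(values, num_bits):
    """Uniform min-max quantization of one group; returns (codes, scale, lo)."""
    levels = (1 << num_bits) - 1
    lo, hi = min(values), max(values)
    # Guard against a constant group, where the range would be zero.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate real values."""
    return [lo + c * scale for c in codes]


def groupwise_quantize(weights, group_size, num_bits):
    """Quantize contiguous groups independently, each with its own range."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group(g, num_bits) for g in groups]


def groupwise_dequantize(quantized):
    """Reassemble the approximate weights from per-group codes and ranges."""
    out = []
    for codes, scale, lo in quantized:
        out.extend(dequantize_group(codes, scale, lo))
    return out


# Hypothetical weights: one small-magnitude group followed by a large-magnitude group.
weights = [0.01, -0.02, 0.03, 0.00, 5.0, -4.0, 2.0, 1.0]

per_group = groupwise_dequantize(groupwise_quantize(weights, group_size=4, num_bits=2))
single_range = groupwise_dequantize(groupwise_quantize(weights, group_size=8, num_bits=2))

# Largest reconstruction error on the small-magnitude group under each scheme.
err_grouped = max(abs(a - b) for a, b in zip(per_group[:4], weights[:4]))
err_single = max(abs(a - b) for a, b in zip(single_range[:4], weights[:4]))
```

With a single shared range, the step size is set by the large-magnitude values, so the small-magnitude weights collapse onto one or two levels. Giving each group its own range keeps the step size proportional to that group's spread, at the cost of storing one scale and offset per group.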