Author name cluster

Shaojun Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers · 2 author rows

Possible papers (17)

JBHI · 2025 · Journal Article

PubLabeler: Enhancing Automatic Classification of Publications in UniProtKB Using Protein Textual Description and PubMedBERT

  • Shaojun Wang
  • Junyi Bian
  • Xiaodi Huang
  • Hong Zhou
  • Shanfeng Zhu

In UniProtKB, each protein is linked to numerous publications covering topics such as sequence, function, and structure, which are annotated manually or through automated methods. Given the vast number of proteins and publications, manual annotation is time-consuming and labour-intensive. Although UniProtKB offers automated annotations, their quality often falls short. Therefore, developing an accurate automated classifier to identify the topics of publications associated with each protein is imperative for advancing biomedical knowledge discovery. Classifying publications in UniProtKB involves protein-publication pairs characterized by multi-label annotation, label co-occurrence, and class imbalance, which increases complexity. This paper proposes a novel method called PubLabeler, which simultaneously considers protein descriptions and scientific literature texts as input. PubLabeler employs the PubMedBERT model to encode input texts and integrates label co-occurrence information into the model parameters. Additionally, it uses focal loss to update parameters, allowing the model to focus more on classes with few instances. Using newly annotated literature from Swiss-Prot in 2023 as a test set, PubLabeler achieved superior results in both micro and macro metrics, showing a 28.5% improvement in macro-F1 compared to UniProtKB's automated annotation method, UPCLASS. Furthermore, we validated PubLabeler's effectiveness in TrEMBL annotation, showcasing its comprehensive prediction results compared to TrEMBL's automated annotations. These findings highlight PubLabeler's reliability and potential to advance protein-related information extraction and knowledge discovery.
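
For context, focal loss (Lin et al., 2017) down-weights well-classified instances so that training concentrates on rare labels. A minimal PyTorch sketch of its binary, multi-label form follows; the gamma value and tensor shapes are illustrative, not taken from the paper:

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0):
        # logits, targets: (batch, num_labels); targets are 0/1 floats
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p = torch.sigmoid(logits)
        p_t = torch.where(targets == 1, p, 1 - p)  # probability of the true outcome
        # (1 - p_t)^gamma shrinks the loss on easy, well-classified pairs
        return ((1 - p_t) ** gamma * ce).mean()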

ICML · 2024 · Conference Paper

DFlow: A Generative Model Combining Denoising AutoEncoder and Normalizing Flow for High Fidelity Waveform Generation

  • Chenfeng Miao
  • Qingying Zhu
  • Minchuan Chen
  • Wei Hu
  • Zijian Li
  • Shaojun Wang
  • Jing Xiao 0006

In this work, we present DFlow, a novel generative framework that combines Normalizing Flow (NF) with a Denoising AutoEncoder (DAE) for high-fidelity waveform generation. With a carefully designed structure, DFlow seamlessly integrates the capabilities of both NF and DAE, resulting in significantly improved performance compared to standard NF models. Experimental results showcase DFlow's superiority, achieving the highest MOS score among existing methods on commonly used datasets and the fastest synthesis speed among all likelihood-based models. We further demonstrate the generalization ability of DFlow by generating high-quality out-of-distribution audio samples, such as singing and music audio. Additionally, we extend the model capacity of DFlow by scaling up both the model size and the training set size. Our large-scale universal vocoder, DFlow-XL, achieves highly competitive performance against the best universal vocoder, BigVGAN.
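
As background on the NF side only, the sketch below shows a standard affine coupling layer and the exact log-determinant it contributes via the change of variables; DFlow's specific way of combining the flow with a denoising autoencoder is not reproduced here:

    import torch
    import torch.nn as nn

    class AffineCoupling(nn.Module):
        # one generic affine coupling layer (dim assumed even); stacking such
        # layers and summing their log-determinants, plus the standard normal
        # log-density of the final latent, gives the exact flow log-likelihood
        def __init__(self, dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim // 2, hidden), nn.ReLU(), nn.Linear(hidden, dim))

        def forward(self, x):
            xa, xb = x.chunk(2, dim=-1)            # split channels in half
            log_s, t = self.net(xa).chunk(2, dim=-1)
            yb = xb * log_s.exp() + t              # affine transform of one half
            return torch.cat([xa, yb], dim=-1), log_s.sum(-1)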

ICML · 2021 · Conference Paper

EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture

  • Chenfeng Miao
  • Shuang Liang
  • Zhengchen Liu
  • Minchuan Chen
  • Jun Ma 0018
  • Shaojun Wang
  • Jing Xiao 0006

In this work, we address the Text-to-Speech (TTS) task by proposing a non-autoregressive architecture called EfficientTTS. Unlike the dominant non-autoregressive TTS models, which require external aligners for training, EfficientTTS optimizes all its parameters with a stable, end-to-end training procedure, allowing it to synthesize high-quality speech quickly and efficiently. EfficientTTS is motivated by a new monotonic alignment modeling approach, which imposes monotonic constraints on the sequence alignment with almost no increase in computation. By combining EfficientTTS with different feed-forward network structures, we develop a family of TTS models, including both text-to-melspectrogram and text-to-waveform networks. We experimentally show that the proposed models significantly outperform counterpart models such as Tacotron 2 and Glow-TTS in terms of speech quality, training efficiency, and synthesis speed, while still producing speech with strong robustness and great diversity. In addition, we demonstrate that the proposed approach can be easily extended to autoregressive models such as Tacotron 2.
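
One generic way to picture a monotonic alignment constraint (an illustration, not the paper's exact index-mapping construction) is to compute the expected input position attended to by each output frame and penalize any backward movement:

    import torch

    def monotonicity_penalty(attn):
        # attn: (T_out, T_in) soft alignment, each row sums to 1
        positions = torch.arange(attn.size(1), dtype=attn.dtype)
        expected = attn @ positions          # expected input index per frame
        deltas = expected[1:] - expected[:-1]
        return torch.relu(-deltas).sum()     # zero iff alignment never moves back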

AAAI · 2020 · Conference Paper

An Iterative Polishing Framework Based on Quality Aware Masked Language Model for Chinese Poetry Generation

  • Liming Deng
  • Jie Wang
  • Hangming Liang
  • Hui Chen
  • Zhiqiang Xie
  • Bojin Zhuang
  • Shaojun Wang
  • Jing Xiao

Owing to its unique literary and aesthetic characteristics, automatic generation of Chinese poetry remains challenging in Artificial Intelligence and can hardly be realized by straightforward end-to-end methods. In this paper, we propose a novel iterative polishing framework for high-quality Chinese poetry generation. In the first stage, an encoder-decoder structure is used to generate a poem draft. Afterwards, our proposed Quality-Aware Masked Language Model (QA-MLM) is employed to polish the draft towards higher quality in terms of linguistics and literalness. Based on a multi-task learning scheme, QA-MLM is able to determine whether polishing is needed given the poem draft. Furthermore, QA-MLM can localize improper characters in the draft and substitute them with newly predicted ones. Benefiting from the masked language model structure, QA-MLM incorporates global context information into the polishing process, which yields more appropriate polishing results than unidirectional sequential decoding. Moreover, the iterative polishing process terminates automatically once QA-MLM regards the processed poem as qualified. Both human and automatic evaluations have been conducted, and the results demonstrate that our approach is effective in improving the output of the encoder-decoder structure.
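
The polishing loop can be pictured with any off-the-shelf masked LM: score each character by the probability the model assigns it in context, mask the least probable one, and substitute the model's prediction. The sketch below uses the public bert-base-chinese checkpoint purely for illustration; QA-MLM's quality-aware heads and stopping criterion are not reproduced:

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tok = AutoTokenizer.from_pretrained("bert-base-chinese")
    mlm = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")

    def polish_once(text):
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            probs = mlm(ids).logits.softmax(-1)
        # probability the model assigns to each existing character
        scores = probs[0].gather(-1, ids[0].unsqueeze(-1)).squeeze(-1)
        worst = scores[1:-1].argmin().item() + 1   # skip [CLS] and [SEP]
        ids[0, worst] = tok.mask_token_id
        with torch.no_grad():
            ids[0, worst] = mlm(ids).logits[0, worst].argmax()
        return tok.decode(ids[0], skip_special_tokens=True)

Repeating polish_once until the text stops changing gives a crude analogue of the iterative procedure described above.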

IJCAI · 2020 · Conference Paper

Generating Reasonable Legal Text through the Combination of Language Modeling and Question Answering

  • Weijing Huang
  • Xianfeng Liao
  • Zhiqiang Xie
  • Jiang Qian
  • Bojin Zhuang
  • Shaojun Wang
  • Jing Xiao

Thanks to improvements in Language Modeling, emerging NLP assistant tools for text generation greatly reduce the human workload of writing documents. However, the generation of legal text faces greater challenges than ordinary text because of its strict requirement for sound logic, which current Language Modeling cannot guarantee. To generate reasonable legal documents, we propose a novel method, CoLMQA, which (1) combines Language Modeling and Question Answering, (2) generates text with slots by Language Modeling, and (3) fills the slots by our proposed Question Answering method, named Transformer-based Key-Value Memory Networks. In CoLMQA, the slots represent the parts of the text that must be tightly constrained by logic, such as the name of the law and the number of the law article. The Question Answering component fills the slots in context with the help of a legal knowledge base to keep the logic sound. Experiments verify the quality of legal documents generated by CoLMQA, which surpass documents generated by pure Language Modeling.
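
A toy rendering of the slot idea (the dictionary lookup below stands in for the paper's Transformer-based Key-Value Memory Networks, and the knowledge-base entries are invented):

    import re

    LEGAL_KB = {  # hypothetical entries standing in for a real legal KB
        ("traffic accident", "LAW"): "Road Traffic Safety Law",
        ("traffic accident", "ARTICLE"): "Article 76",
    }

    def fill_slots(template, topic):
        # the language model emits text with typed slots; QA over the KB
        # resolves each slot so the hard, logic-bound facts stay correct
        resolve = lambda m: LEGAL_KB.get((topic, m.group(1)), m.group(0))
        return re.sub(r"\[(LAW|ARTICLE)\]", resolve, template)

    fill_slots("According to [ARTICLE] of the [LAW], ...", "traffic accident")
    # -> 'According to Article 76 of the Road Traffic Safety Law, ...'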

AAAI · 2018 · Conference Paper

Slim Embedding Layers for Recurrent Neural Language Models

  • Zhongliang Li
  • Raymond Kulhanek
  • Shaojun Wang
  • Yunxin Zhao
  • Shuang Wu

Recurrent neural language models are the state-of-the-art models for language modeling. When the vocabulary size is large, the space required to store the model parameters becomes the bottleneck for using recurrent neural language models. In this paper, we introduce a simple space compression method that randomly shares structured parameters at both the input and output embedding layers of the recurrent neural language model, significantly reducing the number of model parameters while still compactly representing the original input and output embedding layers. The method is easy to implement and tune. Experiments on several datasets show that the new method achieves similar perplexity and BLEU scores while using only a tiny fraction of the parameters.
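
The compression idea can be sketched as follows: each word's embedding is assembled by concatenating K sub-vectors selected, via a fixed random mapping, from a small shared pool, so the stored parameters scale with the pool size rather than the vocabulary. Sizes below are illustrative:

    import torch
    import torch.nn as nn

    class SlimEmbedding(nn.Module):
        def __init__(self, vocab_size, dim, pool_size=1000, k=8):
            super().__init__()
            assert dim % k == 0
            self.pool = nn.Embedding(pool_size, dim // k)  # shared sub-vectors
            # fixed random assignment of k pool rows to every word
            self.register_buffer(
                "assign", torch.randint(0, pool_size, (vocab_size, k)))

        def forward(self, ids):
            # (..., k, dim // k) -> concatenate along the feature axis
            return self.pool(self.assign[ids]).flatten(-2)

The table stores pool_size x (dim / k) weights instead of vocab_size x dim, yet every word still receives a distinct (with high probability) full-width vector.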

IJCAI · 2015 · Conference Paper

A Direct Boosting Approach for Semi-supervised Classification

  • Shaodan Zhai
  • Tian Xia
  • Zhongliang Li
  • Shaojun Wang

We introduce a semi-supervised boosting approach (SSDBoost) that directly minimizes classification errors and maximizes margins on both labeled and unlabeled samples, without resorting to any upper bounds or approximations. A two-step algorithm based on coordinate descent/ascent is proposed to implement SSDBoost. Experiments on a number of UCI datasets and synthetic data show that SSDBoost gives competitive or superior results over state-of-the-art supervised and semi-supervised boosting algorithms in cases where labeled data is limited, and that it is very robust in noisy settings.

NeurIPS · 2013 · Conference Paper

Direct 0-1 Loss Minimization and Margin Maximization with Boosting

  • Shaodan Zhai
  • Tian Xia
  • Ming Tan
  • Shaojun Wang

We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble of weak classifiers by directly minimizing the empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continues adding weak classifiers to maximize any targeted, arbitrarily defined margins until it reaches a local coordinatewise maximum of those margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives consistently better results than AdaBoost, LogitBoost, LPBoost with column generation, and BrownBoost, and is noise tolerant when it maximizes an n-th order bottom sample margin.
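
A toy stand-in for one coordinate step of the descent phase: fix the rest of the ensemble and choose one weak classifier's weight by direct search over the empirical 0-1 loss (the paper derives an exact piecewise line search; a grid is used here purely for illustration):

    import numpy as np

    def coordinate_step(F, h, y, grid=np.linspace(-2.0, 2.0, 401)):
        # F: current ensemble scores; h, y: weak-classifier outputs and
        # labels, both in {-1, +1}
        errs = [(np.sign(F + a * h) != y).mean() for a in grid]
        best = int(np.argmin(errs))
        return grid[best], errs[best]   # chosen weight and resulting 0-1 loss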

NeurIPS · 2009 · Conference Paper

A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

  • Yang Wang
  • Gholamreza Haffari
  • Shaojun Wang
  • Greg Mori

We propose a novel information-theoretic approach for semi-supervised learning of conditional random fields. Our approach defines a training objective that combines the conditional likelihood on labeled data with the mutual information on unlabeled data. Unlike previous minimum-conditional-entropy semi-supervised discriminative learning methods, our approach can be naturally cast into the rate distortion framework of information theory. We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm that circumvents the combinatorial explosion of terms in the sum over label configurations. Our experimental results show that the rate distortion approach outperforms standard L2 regularization and minimum conditional entropy regularization on both multi-class classification and sequence labeling problems.
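
Schematically, for a flat multi-class classifier the criterion pairs a labeled-data likelihood term with a mutual-information term on unlabeled data, using I(X;Y) = H(E_x[p(y|x)]) - E_x[H(p(y|x))]; the weighting and sign below are illustrative, and the paper works with structured CRFs rather than this scalar case:

    import torch
    import torch.nn.functional as F

    def rate_distortion_objective(logits_l, y_l, logits_u, lam=0.1):
        nll = F.cross_entropy(logits_l, y_l)          # labeled likelihood term
        p = logits_u.softmax(-1)
        h_marg = -(p.mean(0) * p.mean(0).clamp_min(1e-12).log()).sum()
        h_cond = -(p * p.clamp_min(1e-12).log()).sum(-1).mean()
        mi = h_marg - h_cond                          # I(X; Y) under the model
        # the mutual information plays the role of the "rate" in the trade-off
        return nll + lam * mi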

NeurIPS · 2006 · Conference Paper

Implicit Online Learning with Kernels

  • Li Cheng
  • Dale Schuurmans
  • Shaojun Wang
  • Terry Caelli
  • S. V. N. Vishwanathan

We present two new algorithms for online learning in reproducing kernel Hilbert spaces. Our first algorithm, ILK (implicit online learning with kernels), employs a new, implicit update technique that can be applied to a wide variety of convex loss functions. We then introduce a bounded-memory version, SILK (sparse ILK), that maintains a compact representation of the predictor without compromising solution quality, even in non-stationary environments. We prove loss bounds and analyze the convergence rate of both algorithms. Experimental evidence shows that our proposed algorithms outperform current methods on synthetic and real data.
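
For the special case of squared loss, an implicit update has a closed form, which the sketch below uses with an RBF kernel expansion; general convex losses, as treated in the paper, require a one-dimensional solve per step:

    import numpy as np

    def rbf(x, z, gamma=1.0):
        return np.exp(-gamma * np.sum((x - z) ** 2))

    class ImplicitKernelRegressor:
        def __init__(self, eta=0.5):
            self.eta, self.xs, self.alphas = eta, [], []

        def predict(self, x):
            return sum(a * rbf(x, z) for a, z in zip(self.alphas, self.xs))

        def update(self, x, y):
            # solve w' = w - eta * grad_loss(w') exactly for squared loss:
            # the new coefficient shrinks by 1 / (1 + eta * k(x, x)),
            # which is what makes the implicit step more stable than
            # an explicit gradient step at large eta
            a = self.eta * (y - self.predict(x)) / (1 + self.eta * rbf(x, x))
            self.xs.append(x)
            self.alphas.append(a)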

NeurIPS · 2006 · Conference Paper

Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields

  • Chi-Hoon Lee
  • Shaojun Wang
  • Feng Jiao
  • Dale Schuurmans
  • Russell Greiner

We present a novel semi-supervised approach to training discriminative random fields (DRFs) that efficiently exploits labeled and unlabeled training data to achieve improved accuracy in a variety of image processing tasks. We formulate DRF training as a form of MAP estimation that combines the conditional log-likelihood on labeled data, given a data-dependent prior, with a conditional entropy regularizer defined on unlabeled data. Although the training objective is no longer concave, we develop an efficient local optimization procedure that produces classifiers more accurate than those based on standard supervised DRF training. We then apply our semi-supervised approach to train DRFs to segment both synthetic and real data sets, and demonstrate significant improvements over supervised DRFs in each case.
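
The shape of the criterion, shown here for a flat classifier rather than a structured DRF, adds a conditional entropy term on unlabeled data to the labeled log-likelihood; lam is a generic trade-off weight:

    import torch
    import torch.nn.functional as F

    def semi_supervised_loss(logits_l, y_l, logits_u, lam=0.5):
        nll = F.cross_entropy(logits_l, y_l)          # labeled log-likelihood
        p = logits_u.softmax(-1)
        h = -(p * p.clamp_min(1e-12).log()).sum(-1).mean()
        # minimizing entropy pushes decision boundaries away from unlabeled data
        return nll + lam * h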

UAI · 2003 · Conference Paper

Boltzmann Machine Learning with the Latent Maximum Entropy Principle

  • Shaojun Wang
  • Dale Schuurmans
  • Fuchun Peng
  • Yunxin Zhao

We present a new statistical learning paradigm for Boltzmann machines based on a new inference principle we have proposed: the latent maximum entropy principle (LME). LME differs both from Jaynes' maximum entropy principle and from standard maximum likelihood estimation. We demonstrate the LME principle by deriving new algorithms for Boltzmann machine parameter estimation, and show how a robust and fast new variant of the EM algorithm can be developed. Our experiments show that estimation based on LME generally yields better results than maximum likelihood estimation, particularly when inferring hidden units from small amounts of data.
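
In schematic form (notation mine, not verbatim from the paper), LME selects a distribution p over complete data (x, z), with z latent, by

    \max_{p}\; H(p)
    \quad \text{s.t.} \quad
    \sum_{x,z} p(x,z)\, f_i(x,z)
    \;=\;
    \sum_{x} \tilde{p}(x) \sum_{z} p(z \mid x)\, f_i(x,z)

where \tilde{p} is the empirical distribution over the observed x. Unlike standard maximum entropy, the right-hand side depends on the model itself through p(z | x), which is what gives rise to the EM-like alternating estimation procedure.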