Arrow Research search

Author name cluster

Wei Bi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

18 papers
2 author rows

Possible papers (18)

ICLR Conference 2025 Conference Paper

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

  • Shansan Gong
  • Shivam Agarwal
  • Yizhe Zhang 0002
  • Jiacheng Ye
  • Lin Zheng
  • Mukai Li
  • Chenxin An
  • Peilin Zhao

Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. However, current DLMs have been studied at a smaller scale compared to their AR counterparts and lack fair comparison on language modeling benchmarks. Additionally, training diffusion models from scratch at scale remains challenging. Given the prevalence of open-source AR language models, we propose adapting these models to build text diffusion models. We demonstrate connections between AR and diffusion modeling objectives and introduce a simple continual pre-training approach for training diffusion models. Through systematic evaluation on language modeling, reasoning, and commonsense benchmarks, we show that we can convert AR models ranging from 127M to 7B parameters (GPT2 and LLaMA) into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training. Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts. We release a suite of DLMs (127M-355M-7B) capable of generating fluent text, performing in-context learning, filling in the middle without prompt re-ordering, and following instructions.
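
A minimal sketch of the masked (absorbing-state) diffusion language-modeling objective that such AR-to-diffusion adaptations build on: tokens are corrupted at a sampled rate and predicted back with non-causal attention. The MASK_ID, the 1/t loss weighting, and the model interface are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of one masked-diffusion training step, assuming an
# absorbing-state (mask-based) discrete diffusion objective and a
# bidirectional-attention transformer `model`; names are illustrative.
import torch
import torch.nn.functional as F

MASK_ID = 50256  # hypothetical [MASK] token id

def diffusion_lm_step(model, input_ids):
    """input_ids: (batch, seq_len) token ids."""
    b, n = input_ids.shape
    # Sample a corruption level t ~ U(0, 1] per sequence.
    t = torch.rand(b, 1, device=input_ids.device).clamp(min=1e-3)
    # Mask each token independently with probability t.
    is_masked = torch.rand(b, n, device=input_ids.device) < t
    corrupted = torch.where(is_masked, torch.full_like(input_ids, MASK_ID), input_ids)
    # Predict the original tokens at masked positions with full (non-causal) attention.
    logits = model(corrupted)                                    # (b, n, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), input_ids, reduction="none")
    # Reweight by 1/t so harder (more heavily masked) steps are not over-counted.
    loss = ((ce * is_masked) / t).sum() / is_masked.sum().clamp(min=1)
    return loss
```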

NeurIPS Conference 2024 Conference Paper

Diffusion of Thought: Chain-of-Thought Reasoning in Diffusion Language Models

  • Jiacheng Ye
  • Shansan Gong
  • Liheng Chen
  • Lin Zheng
  • Jiahui Gao
  • Han Shi
  • Chuan Wu
  • Xin Jiang

Recently, diffusion models have garnered significant interest in the field of text processing due to their many potential advantages compared to conventional autoregressive models. In this work, we propose Diffusion-of-Thought (DoT), a novel approach that integrates diffusion models with Chain-of-Thought, a well-established technique for improving the reasoning ability of autoregressive language models. In contrast to autoregressive language models that make decisions in a left-to-right, token-by-token manner, DoT allows reasoning steps to diffuse over time through a diffusion language model and offers greater flexibility in trading-off computation for reasoning performance. Our experimental results demonstrate the effectiveness of DoT in multi-digit multiplication, boolean logic, and grade school math problems. In addition to that, DoT showcases promising self-correction abilities and benefits from existing reasoning-enhancing techniques like self-consistency decoding. Our findings contribute to the understanding and development of reasoning with diffusion language models.

NeurIPS Conference 2024 Conference Paper

Gated Slot Attention for Efficient Linear-Time Sequence Modeling

  • Yu Zhang
  • Songlin Yang
  • Ruijie Zhu
  • Yue Zhang
  • Leyang Cui
  • Yiqiao Wang
  • Bolun Wang
  • Freda Shi

Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the softmax operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
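
A recurrence-level sketch, for a single head, of the mechanism the abstract describes: a gated, bounded-slot memory that is read through a softmax over slots. The gate parameterization, shapes, and the omission of the chunked hardware-efficient training form are simplifying assumptions.

```python
# Recurrent-form sketch of gated slot attention for one head; the gate
# parameterization and tensor shapes are assumptions for illustration.
import torch

def gsa_recurrent(q, k, v, alpha, num_slots):
    """
    q, k: (seq_len, d_k); v: (seq_len, d_v); alpha: (seq_len, num_slots) in (0, 1).
    Returns outputs of shape (seq_len, d_v).
    """
    d_k, d_v = q.shape[-1], v.shape[-1]
    K = torch.zeros(num_slots, d_k)   # bounded key memory
    V = torch.zeros(num_slots, d_v)   # bounded value memory
    outs = []
    for t in range(q.shape[0]):
        a = alpha[t].unsqueeze(-1)                 # (num_slots, 1) forget gate
        K = a * K + (1 - a) * k[t]                 # gated write of the new key
        V = a * V + (1 - a) * v[t]                 # gated write of the new value
        slot_scores = torch.softmax(K @ q[t], 0)   # softmax read over the slots
        outs.append(slot_scores @ V)               # context-aware memory read
    return torch.stack(outs)
```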

ICLR Conference 2024 Conference Paper

Knowledge Fusion of Large Language Models

  • Fanqi Wan
  • Xinting Huang
  • Deng Cai 0002
  • Xiaojun Quan
  • Wei Bi
  • Shuming Shi 0001

While training large language models (LLMs) from scratch can generate models with distinct functionalities and strengths, it comes at significant costs and may result in redundant capabilities. Alternatively, a cost-effective and compelling approach is to merge existing pre-trained LLMs into a more potent model. However, due to the varying architectures of these LLMs, directly blending their weights is impractical. In this paper, we introduce the notion of knowledge fusion for LLMs, aimed at combining the capabilities of existing LLMs and transferring them into a single LLM. By leveraging the generative distributions of source LLMs, we externalize their collective knowledge and unique strengths, thereby potentially elevating the capabilities of the target model beyond those of any individual source LLM. We validate our approach using three popular LLMs with different architectures (Llama-2, MPT, and OpenLLaMA) across various benchmarks and tasks. Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation. Our code, model weights, and data are public at https://github.com/fanqiwan/FuseLLM.
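
A hedged sketch of a distribution-level fusion objective in the spirit of the abstract: the target model is trained toward a fused distribution built from the source LLMs' next-token probabilities. The plain averaging fusion rule, the shared vocabulary, and the loss weighting are simplifying assumptions; the paper's method must additionally handle the source models' differing tokenizers, which is omitted here.

```python
# Sketch of a knowledge-fusion loss: the target LLM is pulled toward a fused
# distribution over next tokens built from several source LLMs. Averaging the
# source probabilities and assuming one shared vocabulary are simplifications.
import torch
import torch.nn.functional as F

def fusion_loss(target_logits, source_logits_list, labels, lam=0.9):
    """
    target_logits: (b, n, vocab); source_logits_list: list of (b, n, vocab);
    labels: (b, n) gold next-token ids.
    """
    log_p = F.log_softmax(target_logits, dim=-1)
    # Fuse the source predictions (simple probability average here).
    fused = torch.stack([F.softmax(s, dim=-1) for s in source_logits_list]).mean(0)
    kl = F.kl_div(log_p, fused, reduction="batchmean")            # match fused teachers
    ce = F.cross_entropy(target_logits.transpose(1, 2), labels)   # standard LM loss
    return lam * ce + (1 - lam) * kl
```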

ICLR Conference 2024 Conference Paper

Retrieval is Accurate Generation

  • Bowen Cao
  • Deng Cai 0002
  • Leyang Cui
  • Xuxin Cheng
  • Wei Bi
  • Yuexian Zou
  • Shuming Shi 0001

Standard language models generate text by selecting tokens from a fixed, finite, and standalone vocabulary. We introduce a novel method that selects context-aware phrases from a collection of supporting documents. One of the most significant challenges for this paradigm shift is determining the training oracles, because a string of text can be segmented in various ways and each segment can be retrieved from numerous possible documents. To address this, we propose to initialize the training oracles using linguistic heuristics and, more importantly, bootstrap the oracles through iterative self-reinforcement. Extensive experiments show that our model not only outperforms standard language models on a variety of knowledge-intensive tasks but also demonstrates improved generation quality in open-ended text generation. For instance, compared to the standard language model counterpart, our model raises the accuracy from 23.47% to 36.27% on OpenbookQA, and improves the MAUVE score from 42.61% to 81.58% in open-ended text generation. Remarkably, our model also achieves the best performance and the lowest latency among several retrieval-augmented baselines. In conclusion, we assert that retrieval is more accurate generation and hope that our work will encourage further research on this new paradigm shift.
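
A toy sketch of the decoding paradigm described above: instead of choosing a vocabulary token, each step retrieves the highest-scoring phrase from a pre-built index over supporting documents. The encoder, the index contents, and the stop condition are assumptions, and the training-oracle bootstrapping is not shown.

```python
# Toy sketch of generation by phrase retrieval: at each step the prefix is
# matched against an index of phrase embeddings drawn from supporting
# documents. The encoder and index contents are assumed placeholders.
import numpy as np

def generate_by_retrieval(encode_prefix, phrase_vecs, phrases, max_phrases=20):
    """
    encode_prefix: callable mapping a text prefix to a query vector (d,)
    phrase_vecs: (num_phrases, d) embeddings of candidate phrases
    phrases: list of the corresponding phrase strings
    """
    text = ""
    for _ in range(max_phrases):
        q = encode_prefix(text)
        scores = phrase_vecs @ q            # maximum inner-product search
        best = int(np.argmax(scores))
        if phrases[best] == "<eos>":        # assumed end-of-sequence phrase
            break
        text += phrases[best]               # append the retrieved phrase
    return text
```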

NeurIPS Conference 2021 Conference Paper

Efficient Training of Visual Transformers with Small Datasets

  • Yahui Liu
  • Enver Sangineto
  • Wei Bi
  • Nicu Sebe
  • Bruno Lepri
  • Marco De Nadai

Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional networks (CNNs). Differently from CNNs, VTs can capture global relations between image elements and they potentially have a larger representation capacity. However, the lack of the typical convolutional inductive bias makes these models more data hungry than common CNNs. In fact, some local properties of the visual domain which are embedded in the CNN architectural design, in VTs should be learned from samples. In this paper, we empirically analyse different VTs, comparing their robustness in a small training set regime, and we show that, despite having a comparable accuracy when trained on ImageNet, their performance on smaller datasets can be largely different. Moreover, we propose an auxiliary self-supervised task which can extract additional information from images with only a negligible computational overhead. This task encourages the VTs to learn spatial relations within an image and makes the VT training much more robust when training data is scarce. Our task is used jointly with the standard (supervised) training and it does not depend on specific architectural choices, thus it can be easily plugged in the existing VTs. Using an extensive evaluation with different VTs and datasets, we show that our method can improve (sometimes dramatically) the final accuracy of the VTs. Our code is available at: https://github.com/yhlleo/VTs-Drloc.
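
The auxiliary task described above can be illustrated as a relative-localization head: pairs of output patch tokens are sampled and a small MLP predicts their spatial offset. The head architecture, pair sampling, and loss below are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of a relative-localization auxiliary loss: pairs of patch tokens from
# the transformer's output grid are used to predict their relative offset.
# The MLP head, pair count, and normalization are assumptions.
import torch
import torch.nn as nn

class RelLocHead(nn.Module):
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))  # predicts (dx, dy)

    def forward(self, tokens, grid_size, num_pairs=64):
        """tokens: (b, grid_size*grid_size, dim) output patch embeddings."""
        b, n, d = tokens.shape
        i = torch.randint(0, n, (b, num_pairs), device=tokens.device)
        j = torch.randint(0, n, (b, num_pairs), device=tokens.device)
        ti = torch.gather(tokens, 1, i.unsqueeze(-1).expand(-1, -1, d))
        tj = torch.gather(tokens, 1, j.unsqueeze(-1).expand(-1, -1, d))
        pred = self.mlp(torch.cat([ti, tj], dim=-1))
        # Ground-truth normalized offsets between the two sampled grid positions.
        target = torch.stack([(i % grid_size) - (j % grid_size),
                              (i // grid_size) - (j // grid_size)], -1).float() / grid_size
        return nn.functional.l1_loss(pred, target)
```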

AAAI Conference 2021 Conference Paper

Learning from My Friends: Few-Shot Personalized Conversation Systems via Social Networks

  • Zhiliang Tian
  • Wei Bi
  • Zihan Zhang
  • Dongkyu Lee
  • Yiping Song
  • Nevin L. Zhang

Personalized conversation models (PCMs) generate responses according to speaker preferences. Existing personalized conversation tasks typically require models to extract speaker preferences from user descriptions or their conversation histories, which are scarce for newcomers and inactive users. In this paper, we propose a few-shot personalized conversation task with an auxiliary social network. The task requires models to generate personalized responses for a speaker given a few conversations from the speaker and a social network. Existing methods are mainly designed to incorporate descriptions or conversation histories. Those methods can hardly model speakers with so few conversations or connections between speakers. To better cater for newcomers with few resources, we propose a personalized conversation model (PCM) that learns to adapt to new speakers as well as enabling new speakers to learn from resource-rich speakers. Particularly, based on a meta-learning based PCM, we propose a task aggregator (TA) to collect other speakers’ information from the social network. The TA provides prior knowledge of the new speaker in its meta-learning. Experimental results show our methods outperform all baselines in appropriateness, diversity, and consistency with speakers.

JBHI Journal 2021 Journal Article

Trail-Traced Threshold Test (T4) With a Weighted Binomial Distribution for a Psychophysical Test

  • Yuxin Gong
  • Haogang Zhu
  • Marco Miranda
  • David P. Crabb
  • Haolan Yang
  • Wei Bi
  • David F. Garway-Heath

Clinical visual field testing is performed with commercial perimetric devices and employs psychophysical techniques to obtain thresholds of the differential light sensitivity (DLS) at multiple retinal locations. Current thresholding algorithms are relatively inefficient and struggle to deliver satisfactory test accuracy and stability at the same time. Thus, we propose a novel Bayesian perimetric threshold method called the Trail-Traced Threshold Test (T4), which better handles the dependence on the initial threshold estimate and achieves significant improvements in test accuracy and variability while also decreasing the number of presentations compared with Zippy Estimation by Sequential Testing (ZEST) and Full Threshold (FT). This study compares T4 with ZEST and FT in terms of presentation number, mean absolute difference (MAD) between the true visual field and the simulated result, and measurement variability. T4 uses the complete response sequence together with spatially weighted neighbor responses to achieve better accuracy and precision than ZEST, FT, and SWeLZ, with significantly fewer stimulus presentations. T4 is also more robust to inaccurate initial threshold estimates than the other methods, which is an advantage in subjective tests such as clinical perimetry. The method also has potential for use in other psychophysical tests.
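
The shared core of Bayesian thresholding procedures such as ZEST, which T4 builds on, can be sketched as a posterior update over candidate thresholds after each seen or not-seen response. T4's trail tracing, weighted binomial likelihood, and neighbor weighting are not reproduced here; the psychometric function and stopping rule below are assumed placeholders.

```python
# Generic ZEST-style Bayesian threshold estimation sketch (not T4 itself):
# a prior over candidate thresholds is updated by the likelihood of each
# seen / not-seen response to a presented stimulus.
import numpy as np

def psychometric(intensity, threshold, slope=1.0, guess=0.03, lapse=0.03):
    """Assumed logistic detection probability, rising as intensity exceeds threshold."""
    p = 1.0 / (1.0 + np.exp(-(intensity - threshold) / slope))
    return guess + (1 - guess - lapse) * p

def estimate_threshold(respond, candidates, prior, max_presentations=20):
    """
    respond: callable intensity -> bool (True if the subject saw the stimulus)
    candidates: (k,) grid of candidate thresholds; prior: (k,) probabilities
    """
    post = prior / prior.sum()
    for _ in range(max_presentations):
        stim = float((candidates * post).sum())    # present at the posterior mean
        seen = respond(stim)
        like = psychometric(stim, candidates)
        post = post * (like if seen else (1 - like))
        post = post / post.sum()
    return float((candidates * post).sum())
```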

AAAI Conference 2020 Conference Paper

Improving Knowledge-Aware Dialogue Generation via Knowledge Base Question Answering

  • Jian Wang
  • Junhao Liu
  • Wei Bi
  • Xiaojiang Liu
  • Kejing He
  • Ruifeng Xu
  • Min Yang

Neural network models usually suffer from the challenge of incorporating commonsense knowledge into open-domain dialogue systems. In this paper, we propose a novel knowledge-aware dialogue generation model (called TransDG), which transfers question representation and knowledge matching abilities from the knowledge base question answering (KBQA) task to facilitate utterance understanding and factual knowledge selection for dialogue generation. In addition, we propose a response guiding attention and a multi-step decoding strategy to steer our model to focus on relevant features for response generation. Experiments on two benchmark datasets demonstrate that our model has robust superiority over compared methods in generating informative and fluent dialogues. Our code is available at https://github.com/siat-nlp/TransDG.

AAAI Conference 2020 Conference Paper

Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation

  • Xiaocheng Feng
  • Yawei Sun
  • Bing Qin
  • Heng Gong
  • Yibo Sun
  • Wei Bi
  • Xiaojiang Liu
  • Ting Liu

In this paper, we focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer and aims to preserve text styles while altering the content. In detail, the input is a set of structured records and a reference text describing another recordset. The output is a summary that accurately describes the partial content in the source recordset with the same writing style as the reference. The task is unsupervised due to the lack of parallel data, and it is challenging to select suitable records and style words from the two kinds of inputs and to generate a high-fidelity long document. To tackle these problems, we first build a dataset based on a basketball game report corpus as our testbed, and present an unsupervised neural model with an interactive attention mechanism, which learns the semantic relationship between records and reference texts to achieve better content transfer and better style preservation. In addition, we explore the effectiveness of back-translation in our task for constructing pseudo-training pairs. Empirical results show the superiority of our approaches over competitive methods, and the models also yield a new state-of-the-art result on a sentence-level dataset.

AAAI Conference 2020 Conference Paper

Relevance-Promoting Language Model for Short-Text Conversation

  • Xin Li
  • Piji Li
  • Wei Bi
  • Xiaojiang Liu
  • Wai Lam

Despite the effectiveness of the sequence-to-sequence framework on the task of Short-Text Conversation (STC), the issue of under-exploitation of training data (i.e., the supervision signals from the query text are ignored) still remains unresolved. Also, the adopted maximization-based decoding strategies, inclined to generating generic responses or responses with repetition, are unsuited to the STC task. In this paper, we propose to formulate the STC task as a language modeling problem and tailor-make a training strategy to adapt a language model for response generation. To enhance generation performance, we design a relevance-promoting transformer language model, which performs additional supervised source attention after the self-attention to increase the importance of informative query tokens in calculating the token-level representation. The model further refines the query representation with relevance clues inferred from its multiple references during training. In testing, we adopt a randomization-over-maximization strategy to reduce the generation of generic responses. Experimental results on a large Chinese STC dataset demonstrate the superiority of the proposed model on relevance metrics and diversity metrics.
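
The randomization-over-maximization idea at decoding time can be illustrated with a plain top-k sampler, which draws from the k most probable tokens instead of always taking the argmax; k and the temperature are assumed hyperparameters, and the paper's exact strategy may differ.

```python
# Plain top-k sampling step: randomize among high-probability tokens rather
# than maximizing. k and temperature are assumed hyperparameters.
import torch

def sample_top_k(logits, k=10, temperature=1.0):
    """logits: (vocab,) next-token logits; returns a sampled token id."""
    topv, topi = torch.topk(logits / temperature, k)
    probs = torch.softmax(topv, dim=-1)
    choice = torch.multinomial(probs, 1)
    return int(topi[choice])
```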

AAAI Conference 2019 Conference Paper

Better Fine-Tuning via Instance Weighting for Text Classification

  • Zhi Wang
  • Wei Bi
  • Yan Wang
  • Xiaojiang Liu

Transfer learning for deep neural networks has achieved great success in many text classification applications. A simple yet effective transfer learning method is to fine-tune the pretrained model parameters. Previous fine-tuning works mainly focus on the pre-training stage and investigate how to pretrain a set of parameters that can help the target task most. In this paper, we propose an Instance Weighting based Fine-tuning (IW-Fit) method, which revises the fine-tuning stage to improve the final performance on the target domain. IW-Fit adjusts instance weights at each fine-tuning epoch dynamically to accomplish two goals: 1) identify and learn the specific knowledge of the target domain effectively; 2) well preserve the shared knowledge between the source and the target domains. The instance weighting metrics used in IW-Fit are model-agnostic and easy to implement for general DNN-based classifiers. Experimental results show that IW-Fit can consistently improve the classification accuracy on the target domain.
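
A sketch of a fine-tuning loop with dynamic instance weights: weights are recomputed from the current model at the start of every epoch and rescale the per-example loss. The weighting metric itself is left abstract (weight_fn is a placeholder), so this does not reproduce IW-Fit's specific metrics.

```python
# Instance-weighted fine-tuning sketch; `weight_fn` stands in for the
# instance-weighting metric and is an assumed placeholder.
import torch

def finetune_with_instance_weights(model, dataset, weight_fn, epochs=3, lr=2e-5):
    """
    dataset: list of (inputs, labels) mini-batches.
    weight_fn(model, inputs, labels) -> (batch,) non-negative weights.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
    for _ in range(epochs):
        # Refresh per-instance weights with the current model, once per epoch.
        with torch.no_grad():
            weights = [weight_fn(model, x, y) for x, y in dataset]
        for (inputs, labels), w in zip(dataset, weights):
            logits = model(inputs)
            loss = (w * loss_fn(logits, labels)).mean()  # weighted per-example loss
            opt.zero_grad()
            loss.backward()
            opt.step()
```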

AAAI Conference 2019 Conference Paper

Generating Multiple Diverse Responses for Short-Text Conversation

  • Jun Gao
  • Wei Bi
  • Xiaojiang Liu
  • Junhui Li
  • Shuming Shi

Neural generative models have become popular and achieved promising performance on short-text conversation tasks. They are generally trained to build a 1-to-1 mapping from the input post to its output response. However, a given post is often associated with multiple replies simultaneously in real applications. Previous research on this task mainly focuses on improving the relevance and informativeness of the top one generated response for each post. Very few works study generating multiple accurate and diverse responses for the same post. In this paper, we propose a novel response generation model, which considers a set of responses jointly and generates multiple diverse responses simultaneously. A reinforcement learning algorithm is designed to solve our model. Experiments on two short-text conversation tasks validate that the multiple responses generated by our model obtain higher quality and larger diversity compared with various state-of-the-art generative models.

UAI Conference 2014 Conference Paper

Learning to Predict from Crowdsourced Data

  • Wei Bi
  • Liwei Wang 0009
  • James T. Kwok
  • Zhuowen Tu

Crowdsourcing services like Amazon’s Mechanical Turk have facilitated and greatly expedited the manual labeling process from a large number of human workers. However, spammers are often unavoidable and the crowdsourced labels can be very noisy. In this paper, we explicitly account for four sources for a noisy crowdsourced label: worker’s dedication to the task, his/her expertise, his/her default labeling judgement, and sample difficulty. A novel mixture model is employed for worker annotations, which learns a prediction model directly from samples to labels for efficient out-of-sample testing. Experiments on both simulated and real-world crowdsourced data sets show that the proposed method achieves significant improvements over the state-of-the-art.

AAAI Conference 2014 Conference Paper

Multilabel Classification with Label Correlations and Missing Labels

  • Wei Bi
  • James Kwok

Many real-world applications involve multilabel classification, in which the labels can have strong interdependencies and some of them may even be missing. Existing multilabel algorithms are unable to handle both issues simultaneously. In this paper, we propose a probabilistic model that can automatically learn and exploit multilabel correlations. By integrating out the missing information, it also provides a disciplined approach to the handling of missing labels. The inference procedure is simple, and the optimization subproblems are convex. Experiments on a number of real-world data sets with both complete and missing labels demonstrate that the proposed algorithm can consistently outperform state-of-the-art multilabel classification algorithms.
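
As a rough operational analogue of handling missing labels (not the paper's probabilistic inference), the sketch below simply excludes unobserved entries of the label matrix from a binary cross-entropy loss.

```python
# Illustration only: unobserved label entries are masked out of the loss.
# This masking is not the paper's model, which integrates out missing labels.
import torch

def masked_multilabel_loss(logits, labels, observed):
    """
    logits, labels: (batch, num_labels); observed: same shape, 1 where the
    label is known and 0 where it is missing.
    """
    bce = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, labels, reduction="none")
    return (bce * observed).sum() / observed.sum().clamp(min=1)
```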

ICML Conference 2013 Conference Paper

Efficient Multi-label Classification with Many Labels

  • Wei Bi
  • James T. Kwok

Multi-label classification deals with the problem where each instance can be associated with a set of class labels. However, in many real-world applications, the number of class labels can be in the hundreds or even thousands, and existing multi-label classification methods often become computationally inefficient. In recent years, a number of remedies have been proposed. However, they are either based on simple dimension reduction techniques or involve expensive optimization problems. In this paper, we address this problem by selecting a small subset of class labels that can approximately span the original label space. This is performed by randomized sampling where the sampling probability of each class label reflects its importance among all the labels. Theoretical analysis shows that this randomized sampling approach is highly efficient. Experiments on a number of real-world multi-label datasets with many labels demonstrate the appealing performance and efficiency of the proposed algorithm.
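
A sketch of the label-subset selection step: columns of the label matrix are sampled with probability proportional to an importance score, and a linear map is fit to recover scores for the full label set from the sampled subset. Using squared column norms as the score and least-squares recovery are illustrative assumptions, not the paper's exact procedure.

```python
# Randomized label-column sampling sketch; the importance score and the
# linear recovery step are assumed for illustration.
import numpy as np

def sample_label_subset(Y, num_keep, seed=0):
    """Y: (n_samples, n_labels) binary label matrix."""
    rng = np.random.default_rng(seed)
    scores = (Y.astype(float) ** 2).sum(axis=0)
    probs = scores / scores.sum()
    keep = rng.choice(Y.shape[1], size=num_keep, replace=False, p=probs)
    # Linear map from the reduced label space back to all labels.
    reduced = Y[:, keep].astype(float)
    recover, *_ = np.linalg.lstsq(reduced, Y.astype(float), rcond=None)
    return keep, recover   # train on Y[:, keep]; full scores = preds @ recover
```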

NeurIPS Conference 2012 Conference Paper

Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification

  • Wei Bi
  • James Kwok

In hierarchical classification, the prediction paths may be required to always end at leaf nodes. This is called mandatory leaf node prediction (MLNP) and is particularly useful when the leaf nodes have much stronger semantic meaning than the internal nodes. However, while there have been a lot of MLNP methods in hierarchical multiclass classification, performing MLNP in hierarchical multilabel classification is much more difficult. In this paper, we propose a novel MLNP algorithm that (i) considers the global hierarchy structure; and (ii) can be used on hierarchies of both trees and DAGs. We show that one can efficiently maximize the joint posterior probability of all the node labels by a simple greedy algorithm. Moreover, this can be further extended to the minimization of the expected symmetric loss. Experiments are performed on a number of real-world data sets with tree- and DAG-structured label hierarchies. The proposed method consistently outperforms other hierarchical and flat multilabel classification methods.
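
A reduced illustration of greedy leaf prediction on a tree hierarchy: starting at the root, descend to the child with the highest posterior until a leaf is reached. The paper's algorithm instead maximizes the joint posterior over all node labels and also handles DAGs and multiple leaves, which this single-path sketch does not.

```python
# Simplified single-path greedy descent for mandatory leaf node prediction
# on a tree; a reduced illustration, not the paper's joint-posterior method.
def greedy_leaf_prediction(children, posterior, root=0):
    """
    children: dict node -> list of child nodes (empty list for leaves)
    posterior: dict node -> P(node label = 1 | x)
    """
    node = root
    while children.get(node):
        node = max(children[node], key=lambda c: posterior[c])
    return node
```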