Author name cluster

Can Xu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

19 papers

2 author rows

ICLR Conference 2025 Conference Paper

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

Haipeng Luo
Qingfeng Sun
Can Xu
Pu Zhao 0004
Jian-Guang Lou
Chongyang Tao
Xiubo Geng
Qingwei Lin

Large language models (LLMs), such as GPT-4, have shown remarkable performance in natural language processing (NLP) tasks, including challenging mathematical reasoning. However, most existing open-source models are only pre-trained on large-scale internet data and without math-related optimization. In this paper, we present WizardMath, which enhances the mathematical reasoning abilities of LLMs, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model. Remarkably, WizardMath-Mistral 7B surpasses all other open-source LLMs by a substantial margin. Furthermore, WizardMath 70B even outperforms ChatGPT-3.5, Claude Instant, Gemini Pro and Mistral Medium. Additionally, our preliminary exploration highlights the pivotal role of instruction evolution and process supervision in achieving exceptional math performance.

Details

AAAI Conference 2024 Conference Paper

Fine-Grained Distillation for Long Document Retrieval

Yucheng Zhou
Tao Shen
Xiubo Geng
Chongyang Tao
Jianbing Shen
Guodong Long
Can Xu
Daxin Jiang

Long document retrieval aims to fetch query-relevant documents from a large-scale collection, where knowledge distillation has become de facto to improve a retriever by mimicking a heterogeneous yet powerful cross-encoder. However, in contrast to passages or sentences, retrieval on long documents suffers from the \textit{scope hypothesis} that a long document may cover multiple topics. This maximizes their structure heterogeneity and poses a granular-mismatch issue, leading to an inferior distillation efficacy. In this work, we propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers. While preserving the conventional dense retrieval paradigm, it first produces global-consistent representations crossing different fine granularity and then applies multi-granular aligned distillation merely during training. In experiments, we evaluate our framework on two long-document retrieval benchmarks, which show state-of-the-art performance.

PDF Details DOI

AAAI Conference 2024 Conference Paper

Geometric-Facilitated Denoising Diffusion Model for 3D Molecule Generation

Can Xu
Haosen Wang
Weigang Wang
Pengfei Zheng
Hongyang Chen

Denoising diffusion models have shown great potential in multiple research areas. Existing diffusion-based generative methods on de novo 3D molecule generation face two major challenges. Since majority heavy atoms in molecules allow connections to multiple atoms through single bonds, solely using pair-wise distance to model molecule geometries is insufficient. Therefore, the first one involves proposing an effective neural network as the denoising kernel that is capable to capture complex multi-body interatomic relationships and learn high-quality features. Due to the discrete nature of graphs, mainstream diffusion-based methods for molecules heavily rely on predefined rules and generate edges in an indirect manner. The second challenge involves accommodating molecule generation to diffusion and accurately predicting the existence of bonds. In our research, we view the iterative way of updating molecule conformations in diffusion process is consistent with molecular dynamics and introduce a novel molecule generation method named Geometric-Facilitated Molecular Diffusion (GFMDiff). For the first challenge, we introduce a Dual-track Transformer Network (DTN) to fully excevate global spatial relationships and learn high quality representations which contribute to accurate predictions of features and geometries. As for the second challenge, we design Geometric-facilitated Loss (GFLoss) which intervenes the formation of bonds during the training period, instead of directly embedding edges into the latent space. Comprehensive experiments on current benchmarks demonstrate the superiority of GFMDiff.

PDF Details DOI

ICLR Conference 2024 Conference Paper

Training-free Multi-objective Diffusion Model for 3D Molecule Generation

Xu Han
Caihua Shan
Yifei Shen 0004
Can Xu
Han Yang
Xiang Li 0067
Dongsheng Li 0002

Searching for novel and diverse molecular candidates is a critical undertaking in drug and material discovery. Existing approaches have successfully adapted the diffusion model, the most effective generative model in image generation, to create 1D SMILES strings, 2D chemical graphs, or 3D molecular conformers. However, these methods are not efficient and flexible enough to generate 3D molecules with multiple desired properties, as they require additional training for the models for each new property or even a new combination of existing properties. Moreover, some properties may potentially conflict, making it impossible to find a molecule that satisfies all of them simultaneously. To address these challenges, we present a training-free conditional 3D molecular generation algorithm based on off-the-shelf unconditional diffusion models and property prediction models. The key techniques include modeling the loss of property prediction models as energy functions, considering the property relation between multiple conditions as a probabilistic graph, and developing a stable posterior estimation for computing the conditional score function. We conducted experiments on both single-objective and multi-objective 3D molecule generation, focusing on quantum properties, and compared our approach with the trained or fine-tuned diffusion models. Our proposed model achieves superior performance in generating molecules that meet the conditions, without any additional training cost.

Details

NeurIPS Conference 2024 Conference Paper

WizardArena: Post-training Large Language Models via Simulated Offline Chatbot Arena

Haipeng Luo
Qingfeng Sun
Can Xu
Pu Zhao
Qingwei Lin
Jianguang Lou
Shifeng Chen
Yansong Tang

Recent work demonstrates that, post-training large language models with open-domain instruction following data have achieved colossal success. Simultaneously, human Chatbot Arena has emerged as one of the most reasonable benchmarks for model evaluation and developmental guidance. However, the processes of manually curating high-quality training data and utilizing online human evaluation platforms are both expensive and limited. To mitigate the manual and temporal costs associated with post-training, this paper introduces a Simulated Chatbot Arena named WizardArena, which is fully based on and powered by open-source LLMs. For evaluation scenario, WizardArena can efficiently predict accurate performance rankings among different models based on offline test set. For training scenario, we simulate arena battles among various state-of-the-art models on a large scale of instruction data, subsequently leveraging the battle results to constantly enhance target model in both the supervised fine-tuning and reinforcement learning. Experimental results demonstrate that our WizardArena aligns closely with the online human arena rankings, and our models trained on offline extensive battle data exhibit significant performance improvements during SFT, DPO, and PPO stages.

PDF Details DOI

ICLR Conference 2024 Conference Paper

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Ziyang Luo
Can Xu
Pu Zhao 0004
Qingfeng Sun
Xiubo Geng
Wenxiang Hu
Chongyang Tao
Jing Ma 0004

Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated remarkable performance in various code-related tasks. However, different from their counterparts in the general language modeling field, the technique of instruction fine-tuning remains relatively under-researched in this domain. In this paper, we present Code Evol-Instruct, a novel approach that adapts the Evol-Instruct method to the realm of code, enhancing Code LLMs to create novel models, WizardCoder. Through comprehensive experiments on five prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, DS-1000, and MultiPL-E, our models showcase outstanding performance. They consistently outperform all other open-source Code LLMs by a significant margin. Remarkably, WizardCoder 15B even surpasses the well-known closed-source LLMs, including Anthropic's Claude and Google's Bard, on the HumanEval and HumanEval+ benchmarks. Additionally, WizardCoder 34B not only achieves a HumanEval score comparable to GPT3.5 (ChatGPT) but also surpasses it on the HumanEval+ benchmark. Furthermore, our preliminary exploration highlights the pivotal role of instruction complexity in achieving exceptional coding performance.

Details

ICLR Conference 2024 Conference Paper

WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions

Can Xu
Qingfeng Sun
Kai Zheng 0021
Xiubo Geng
Pu Zhao 0004
Jiazhan Feng
Chongyang Tao
Qingwei Lin

Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Both automatic and human evaluations consistently indicate that WizardLM outperforms baselines such as Alpaca (trained from Self-Instruct) and Vicuna (trained from human-created instructions). The experimental results demonstrate that the quality of instruction-following dataset crafted by Evol-Instruct can significantly improve the performance of LLMs.

Details

ICLR Conference 2023 Conference Paper

HypeR: Multitask Hyper-Prompted Training Enables Large-Scale Retrieval Generalization

Zefeng Cai
Chongyang Tao
Tao Shen 0001
Can Xu
Xiubo Geng
Xin Alex Lin
Liang He 0001
Daxin Jiang

Recently, large-scale text retrieval has made impressive progress, facilitating both information retrieval and downstream knowledge-intensive tasks (e.g., open-domain QA and dialogue). With a moderate amount of data, a neural text retriever can outperform traditional methods such as BM25 by a large step. However, while being applied to out-of-domain data, the performance of a neural retriever degrades considerably. Therefore, how to enable a retriever to perform more robustly across different domains or tasks and even show strong zero-shot transfer ability is critical for building scalable IR systems. To this end, we propose HypeR, a hyper-prompted training mechanism to enable uniform retrieval across tasks of different domains. Specifically, our approach jointly trains the query encoder with a shared prompt-based parameter pool and a prompt synthesizer that dynamically composes hyper-prompt for encoding each query from different tasks or domains. Besides, to avoid the mode collapse of prompt attention distribution for different queries, we design a contrastive prompt regularization that promotes the mode of prompt attention to be aligned and uniform. Through multi-task hyper-prompted training, our retriever can master the ability to dynamically represent different types of queries and transfer knowledge across different domains and tasks. Extensive experiments show our model attains better retrieval performance across different tasks and better zero-shot transfer ability compared with various previous methods.

Details

ECAI Conference 2023 Conference Paper

Investigating the Learning Behaviour of In-Context Learning: A Comparison with Supervised Learning

Xindi Wang 0001
Yufei Wang 0003
Can Xu
Xiubo Geng
Bowen Zhang
Chongyang Tao
Frank Rudzicz
Robert E. Mercer

Large language models (LLMs) have shown remarkable capacity for in-context learning (ICL), where learning a new task from just a few training examples is done without being explicitly pre-trained. However, despite the success of LLMs, there has been little understanding of how ICL learns the knowledge from the given prompts. In this paper, to make progress toward understanding the learning behaviour of ICL, we train the same LLMs with the same demonstration examples via ICL and supervised learning (SL), respectively, and investigate their performance under label perturbations (i. e. , noisy labels and label imbalance) on a range of classification tasks. First, via extensive experiments, we find that gold labels have significant impacts on the downstream in-context performance, especially for large language models; however, imbalanced labels matter little to ICL across all model sizes. Second, when comparing with SL, we show empirically that ICL is less sensitive to label perturbations than SL, and ICL gradually attains comparable performance to SL as the model size increases.

Details

ICLR Conference 2023 Conference Paper

KnowDA: All-in-One Knowledge Mixture Model for Data Augmentation in Low-Resource NLP

Yufei Wang 0003
Jiayi Zheng
Can Xu
Xiubo Geng
Tao Shen 0001
Chongyang Tao
Daxin Jiang

This paper focuses on data augmentation for low-resource NLP tasks where the training set is limited. The existing solutions either leverage task-independent heuristic rules (e.g., Synonym Replacement) or fine-tune general-purpose pre-trained language models (e.g., GPT2) using the limited training instances to produce new synthetic data. Consequently, they have trivial task-specific knowledge and are limited to yielding low-quality synthetic data. To combat this issue, we propose Knowledge Mixture Data Augmentation Model (KnowDA), a Seq2Seq language model pretrained on a mixture of diverse NLP tasks under a novel framework of Knowledge Mixture Training (KoMT). The goal of KoMT is to condense diverse NLP task-specific knowledge into the single KnowDA model (i.e., all-in-one). The resulting KnowDA could utilize these knowledge to quickly grasp the inherent synthesis law of the target task through limited training instances. Specifically, KoMT reformulates input examples from various heterogeneous NLP tasks into a unified text-to-text format and employs denoising training objectives in different granularity to learn to reconstruct partial or complete samples. To the best of our knowledge, we are the first to attempt to apply 100+ NLP multi-task training for data augmentation. Extensive experiments show that i) the synthetic data produced by KnowDA successfully improves the performance of the strong pre-trained language models (i.e., Bert, ALBert and Deberta) by a large margin on the low-resource NLP benchmark FewGLUE, CoNLL’03 and WikiAnn; ii) KnowDA successful transfer the task knowledge to NLP tasks whose types are seen and unseen in KoMT.

Details

ICLR Conference 2023 Conference Paper

LexMAE: Lexicon-Bottlenecked Pretraining for Large-Scale Retrieval

Tao Shen 0001
Xiubo Geng
Chongyang Tao
Can Xu
Xiaolong Huang
Binxing Jiao
Linjun Yang
Daxin Jiang

In large-scale retrieval, the lexicon-weighting paradigm, learning weighted sparse representations in vocabulary space, has shown promising results with high quality and low latency. Despite it deeply exploiting the lexicon-representing capability of pre-trained language models, a crucial gap remains between language modeling and lexicon-weighting retrieval -- the former preferring certain or low-entropy words whereas the latter favoring pivot or high-entropy words -- becoming the main barrier to lexicon-weighting performance for large-scale retrieval. To bridge this gap, we propose a brand-new pre-training framework, lexicon-bottlenecked masked autoencoder (LexMAE), to learn importance-aware lexicon representations. Essentially, we present a lexicon-bottlenecked module between a normal language modeling encoder and a weakened decoder, where a continuous bag-of-words bottleneck is constructed to learn a lexicon-importance distribution in an unsupervised fashion. The pre-trained LexMAE is readily transferred to the lexicon-weighting retrieval via fine-tuning. On the ad-hoc retrieval benchmark, MS-Marco, it achieves 42.6% MRR@10 with 45.8 QPS for the passage dataset and 44.4% MRR@100 with 134.8 QPS for the document dataset, by a CPU machine. And LexMAE shows state-of-the-art zero-shot transfer capability on BEIR benchmark with 12 datasets.

Details

NeurIPS Conference 2021 Conference Paper

Neural Rule-Execution Tracking Machine For Transformer-Based Text Generation

Yufei Wang
Can Xu
Huang Hu
Chongyang Tao
Stephen Wan
Mark Dras
Mark Johnson
Daxin Jiang

Sequence-to-Sequence (Seq2Seq) neural text generation models, especially the pre-trained ones (e. g. , BART and T5), have exhibited compelling performance on various natural language generation tasks. However, the black-box nature of these models limits their application in tasks where specific rules (e. g. , controllable constraints, prior knowledge) need to be executed. Previous works either design specific model structures (e. g. , Copy Mechanism corresponding to the rule "the generated output should include certain words in the source input'') or implement specialized inference algorithms (e. g. , Constrained Beam Search) to execute particular rules through the text generation. These methods require the careful design case-by-case and are difficult to support multiple rules concurrently. In this paper, we propose a novel module named Neural Rule-Execution Tracking Machine (NRETM) that can be equipped into various transformer-based generators to leverage multiple rules simultaneously to guide the neural generation model for superior generation performance in an unified and scalable way. Extensive experiments on several benchmarks verify the effectiveness of our proposed model in both controllable and general text generation tasks.

PDF Details

AAAI Conference 2021 Conference Paper

Open Domain Dialogue Generation with Latent Images

Ze Yang
Wei Wu
Huang Hu
Can Xu
Wei Wang
Zhoujun Li

We consider grounding open domain dialogues with images. Existing work assumes that both an image and a textual context are available, but image-grounded dialogues by nature are more difficult to obtain than textual dialogues. Thus, we propose learning a response generation model with both image-grounded dialogues and textual dialogues by assuming that the visual scene information at the time of a conversation can be represented by an image, and trying to recover the latent images of the textual dialogues through text-to-image generation techniques. The likelihood of the two types of dialogues is then formulated by a response generator and an image reconstructor that are learned within a conditional variational auto-encoding framework. Empirical studies are conducted in both image-grounded conversation and text-based conversation. In the first scenario, image-grounded dialogues, especially under a low-resource setting, can be effectively augmented by textual dialogues with latent images; while in the second scenario, latent images can enrich the content of responses and at the same time keep them relevant to contexts.

PDF Details

ICLR Conference 2020 Conference Paper

Low-Resource Knowledge-Grounded Dialogue Generation

Xueliang Zhao
Wei Wu 0014
Chongyang Tao
Can Xu
Dongyan Zhao 0001
Rui Yan 0001

Responding with knowledge has been recognized as an important capability for an intelligent conversational agent. Yet knowledge-grounded dialogues, as training data for learning such a response generation model, are difficult to obtain. Motivated by the challenge in practice, we consider knowledge-grounded dialogue generation under a natural assumption that only limited training examples are available. In such a low-resource setting, we devise a disentangled response decoder in order to isolate parameters that depend on knowledge-grounded dialogues from the entire generation model. By this means, the major part of the model can be learned from a large number of ungrounded dialogues and unstructured documents, while the remaining small parameters can be well fitted using the limited training examples. Evaluation results on two benchmarks indicate that with only $1/8$ training data, our model can achieve the state-of-the-art performance and generalize well on out-of-domain knowledge.

Details

NeurIPS Conference 2020 Conference Paper

Zero-Resource Knowledge-Grounded Dialogue Generation

Linxiao Li
Can Xu
Wei Wu
Yufan Zhao
Xueliang Zhao
Chongyang Tao

While neural conversation models have shown great potentials towards generating informative and engaging responses via introducing external knowledge, learning such a model often requires knowledge-grounded dialogues that are difficult to obtain. To overcome the data challenge and reduce the cost of building a knowledge-grounded dialogue system, we explore the problem under a zero-resource setting by assuming no context-knowledge-response triples are needed for training. To this end, we propose representing the knowledge that bridges a context and a response and the way that the knowledge is expressed as latent variables, and devise a variational approach that can effectively estimate a generation model from independent dialogue corpora and knowledge corpora. Evaluation results on three benchmarks of knowledge-grounded dialogue generation indicate that our model can achieve comparable performance with state-of-the-art methods that rely on knowledge-grounded dialogues for training, and exhibits a good generalization ability over different datasets.

PDF Details

IJCAI Conference 2019 Conference Paper

A Document-grounded Matching Network for Response Selection in Retrieval-based Chatbots

Xueliang Zhao
Chongyang Tao
Wei Wu
Can Xu
Dongyan Zhao
Rui Yan

We present a document-grounded matching network (DGMN) for response selection that can power a knowledge-aware retrieval-based chatbot system. The challenges of building such a model lie in how to ground conversation contexts with background documents and how to recognize important information in the documents for matching. To overcome the challenges, DGMN fuses information in a document and a context into representations of each other, and dynamically determines if grounding is necessary and importance of different parts of the document and the context through hierarchical interaction with a response at the matching step. Empirical studies on two public data sets indicate that DGMN can significantly improve upon state-of-the-art methods and at the same time enjoys good interpretability.

PDF Details

AAAI Conference 2018 Conference Paper

Knowledge Enhanced Hybrid Neural Network for Text Matching

Yu Wu
Wei Wu
Can Xu
Zhoujun Li

Long text brings a big challenge to neural network based text matching approaches due to their complicated structures. To tackle the challenge, we propose a knowledge enhanced hybrid neural network (KEHNN) that leverages prior knowledge to identify useful information and ﬁlter out noise in long text and performs matching from multiple perspectives. The model fuses prior knowledge into word representations by knowledge gates and establishes three matching channels with words, sequential structures of text given by Gated Recurrent Units (GRUs), and knowledge enhanced representations. The three channels are processed by a convolutional neural network to generate high level features for matching, and the features are synthesized as a matching score by a multilayer perceptron. In this paper, we focus on exploring the use of taxonomy knowledge for text matching. Evaluation results from extensive experiments on public data sets of question answering and conversation show that KEHNN can signiﬁcantly outperform state-of-the-art matching models and particularly improve matching accuracy on pairs with long text.

PDF Details

AAAI Conference 2018 Conference Paper

Neural Response Generation With Dynamic Vocabularies

Yu Wu
Wei Wu
Dejian Yang
Can Xu
Zhoujun Li

We study response generation for open domain conversation in chatbots. Existing methods assume that words in responses are generated from an identical vocabulary regardless of their inputs, which not only makes them vulnerable to generic patterns and irrelevant noise, but also causes a high cost in decoding. We propose a dynamic vocabulary sequence-tosequence (DVS2S) model which allows each input to possess their own vocabulary in decoding. In training, vocabulary construction and response generation are jointly learned by maximizing a lower bound of the true objective with a Monte Carlo sampling method. In inference, the model dynamically allocates a small vocabulary for an input with the word prediction model, and conducts decoding only with the small vocabulary. Because of the dynamic vocabulary mechanism, DVS2S eludes many generic patterns and irrelevant words in generation, and enjoys efﬁcient decoding at the same time. Experimental results on both automatic metrics and human annotations show that DVS2S can signiﬁcantly outperform state-of-the-art methods in terms of response quality, but only requires 60% decoding time compared to the most efﬁcient baseline.

PDF Details

NeurIPS Conference 2016 Conference Paper

Large Margin Discriminant Dimensionality Reduction in Prediction Space

Mohammad Saberian
Jose Costa Pereira
Can Xu
Jian Yang
Nuno Nvasconcelos

In this paper we establish a duality between boosting and SVM, and use this to derive a novel discriminant dimensionality reduction algorithm. In particular, using the multiclass formulation of boosting and SVM we note that both use a combination of mapping and linear classification to maximize the multiclass margin. In SVM this is implemented using a pre-defined mapping (induced by the kernel) and optimizing the linear classifiers. In boosting the linear classifiers are pre-defined and the mapping (predictor) is learned through combination of weak learners. We argue that the intermediate mapping, e. g. boosting predictor, is preserving the discriminant aspects of the data and by controlling the dimension of this mapping it is possible to achieve discriminant low dimensional representations for the data. We use the aforementioned duality and propose a new method, Large Margin Discriminant Dimensionality Reduction (LADDER) that jointly learns the mapping and the linear classifiers in an efficient manner. This leads to a data-driven mapping which can embed data into any number of dimensions. Experimental results show that this embedding can significantly improve performance on tasks such as hashing and image/scene classification.

PDF Details