Arrow Research search

Author name cluster

Daxin Jiang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

35 papers
2 author rows

Possible papers (35)

AAAI Conference 2026 Conference Paper

GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging

  • Ziyi Ni
  • Huacan Wang
  • Shuo Zhang
  • Shuo Lu
  • Ziyang He
  • Wang You
  • Zhenheng Tang
  • Sen Hu

Beyond scratch coding, exploiting large-scale code repositories (e.g., GitHub) for practical tasks is vital in real-world software development, yet current benchmarks rarely evaluate code agents in such authentic, workflow-driven scenarios. To bridge this gap, we introduce GitTaskBench, a benchmark designed to systematically assess this capability via 54 realistic tasks across 7 modalities and 7 domains. Each task pairs a relevant repository with an automated, human-curated evaluation harness specifying practical success criteria. Beyond measuring execution and task success, we also propose the alpha-value metric to quantify the economic benefit of agent performance, which integrates task success rates, token cost, and average developer salaries. Experiments across three state-of-the-art agent frameworks with multiple advanced LLMs show that leveraging code repositories for complex task solving remains challenging: even the best-performing system, OpenHands+Claude 3.7, solves only 48.15% of tasks. Error analysis attributes over half of failures to seemingly mundane yet critical steps like environment setup and dependency resolution, highlighting the need for more robust workflow management and increased timeout preparedness. By releasing GitTaskBench, we aim to drive progress and attention toward repository-aware code reasoning, execution, and deployment—moving agents closer to solving complex, end-to-end real-world tasks.
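
The alpha-value idea can be made concrete with a toy calculation. The abstract does not give the exact formula, so the sketch below is only illustrative: the function name and the salary/hour parameters are assumptions, and it simply nets token spend against the developer time an agent saves.

```python
# Hypothetical sketch of an alpha-value-style metric; the actual GitTaskBench
# formula is not specified in the abstract.

def alpha_value(success_rate: float, token_cost_usd: float,
                human_hours_per_task: float, hourly_salary_usd: float) -> float:
    """Economic benefit per task: value of saved developer time on solved
    tasks minus what the agent spends on tokens (all inputs assumed)."""
    saved_labor = success_rate * human_hours_per_task * hourly_salary_usd
    return saved_labor - token_cost_usd

# Example: 48.15% success, $2.10 in tokens, tasks worth 1.5 h at $60/h.
print(alpha_value(0.4815, 2.10, 1.5, 60.0))
```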

AAAI Conference 2026 Conference Paper

Learning to Compress: Unlocking the Potential of Large Language Models for Text Representation

  • Yeqin Zhang
  • Yizheng Zhao
  • Chen Hu
  • Binxing Jiao
  • Daxin Jiang
  • Ruihang Miao
  • Cam-Tu Nguyen

Text representation plays a critical role in tasks like clustering, retrieval, and other downstream applications. With the emergence of large language models (LLMs), there is increasing interest in harnessing their capabilities for this purpose. However, most of the LLMs are inherently causal and optimized for next-token prediction, making them suboptimal for producing holistic representations. To address this, recent studies introduced pretext tasks to adapt LLMs for text representation. Most of these tasks, however, rely on token-level prediction objectives, such as the masked next-token prediction (MNTP) used in LLM2Vec. In this work, we explore the untapped potential of context compression as a pretext task for unsupervised adaptation of LLMs. During compression pre-training, the model learns to generate compact memory tokens, which substitute the whole context for downstream sequence prediction. Experiments demonstrate that a well-designed compression objective can significantly enhance LLM-based text representations, outperforming models trained with token-level pretext tasks. Further improvements through contrastive learning produce a strong representation model (LLM2Comp) that outperforms contemporary LLM-based text encoders on a wide range of tasks while being more sample-efficient, requiring significantly less training data.
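
To make the compression pretext task concrete, here is a minimal PyTorch sketch under assumed shapes and architecture (the paper's actual model, pooling, and hyperparameters are not specified in the abstract): an encoder squeezes the context into k learned memory tokens, and a decoder must predict the continuation from those memory tokens alone.

```python
# Minimal sketch (assumed architecture) of a compression pretext objective.
import torch
import torch.nn as nn

d, k, vocab = 256, 8, 32000
enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(d, 4, batch_first=True), 2)
dec = nn.TransformerDecoder(nn.TransformerDecoderLayer(d, 4, batch_first=True), 2)
emb, head = nn.Embedding(vocab, d), nn.Linear(d, vocab)
mem_queries = nn.Parameter(torch.randn(1, k, d))   # learned memory tokens

ctx = torch.randint(0, vocab, (2, 128))            # context to be compressed
tgt = torch.randint(0, vocab, (2, 32))             # continuation to predict

# Encode [context ; memory queries] and keep only the k memory slots:
# a compact stand-in for the whole context.
h = enc(torch.cat([emb(ctx), mem_queries.expand(2, -1, -1)], dim=1))
memory = h[:, -k:]

# Decode the continuation conditioned on the memory tokens alone.
causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1) - 1)
logits = head(dec(emb(tgt[:, :-1]), memory, tgt_mask=causal))
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab),
                                   tgt[:, 1:].reshape(-1))
```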

TMLR Journal 2025 Journal Article

A Survey on Future Frame Synthesis: Bridging Deterministic and Generative Approaches

  • Ruibo Ming
  • Zhewei Huang
  • Jingwei Wu
  • Zhuoxuan Ju
  • Daxin Jiang
  • Jianming Hu
  • Lihui Peng
  • Shuchang Zhou

Future Frame Synthesis (FFS), the task of generating subsequent video frames from context, represents a core challenge in machine intelligence and a cornerstone for developing predictive world models. This survey provides a comprehensive analysis of the FFS landscape, charting its critical evolution from deterministic algorithms focused on pixel-level accuracy to modern generative paradigms that prioritize semantic coherence and dynamic plausibility. We introduce a novel taxonomy organized by algorithmic stochasticity, which not only categorizes existing methods but also reveals the fundamental drivers—advances in architectures, datasets, and computational scale—behind this paradigm shift. Critically, our analysis identifies a bifurcation in the field's trajectory: one path toward efficient, real-time prediction, and another toward large-scale, generative world simulation. By pinpointing key challenges and proposing concrete research questions for both frontiers, this survey serves as an essential guide for researchers aiming to advance the frontiers of visual dynamic modeling.

NeurIPS Conference 2025 Conference Paper

GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning

  • Haolong Yan
  • Yeqing Shen
  • Xin Huang
  • Jia Wang
  • Kaijun Tan
  • Zhixuan Liang
  • Hongxin Li
  • Zheng Ge

With the rapid development of Large Vision Language Models, the focus of Graphical User Interface (GUI) agent tasks has shifted from single-screen tasks to complex screen navigation challenges. However, real-world GUI environments, such as PC software and mobile apps, are often complex and proprietary, making it difficult to obtain the comprehensive environment information needed for agent training and evaluation. This limitation hinders systematic investigation and benchmarking of agent navigation capabilities. To address this limitation, we introduce GUI Exploration Lab, a simulation environment engine for GUI agent navigation research that enables flexible definition and composition of screens, icons, and navigation graphs, while providing full access to environment information for comprehensive agent training and evaluation. Through extensive experiments, we find that supervised fine-tuning enables effective memorization of fundamental knowledge, serving as a crucial foundation for subsequent training. Building on this, single-turn reinforcement learning further enhances generalization to unseen scenarios. Finally, multi-turn reinforcement learning encourages the development of exploration strategies through interactive trial and error, leading to further improvements in screen navigation performance. We validate our methods on both static and interactive benchmarks, demonstrating that our findings generalize effectively to real-world scenarios. These findings demonstrate the advantages of reinforcement learning approaches in GUI navigation and offer practical guidance for building more capable and generalizable GUI agents.
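
As a rough illustration of what a navigation-graph simulation engine provides, the toy environment below is entirely hypothetical (the released engine's API is not described in the abstract): screens are nodes, icons are labeled edges, and an episode is a walk to a target screen with a sparse success reward.

```python
# Illustrative sketch, not the released engine's API.
import random

class ToyGuiEnv:
    def __init__(self, nav_graph, target):
        # nav_graph: {screen: {icon_name: next_screen}}
        self.graph, self.target = nav_graph, target

    def reset(self, start):
        self.screen = start
        return self.screen

    def step(self, icon):
        # Clicking an icon absent from this screen is a no-op.
        self.screen = self.graph[self.screen].get(icon, self.screen)
        done = self.screen == self.target
        return self.screen, (1.0 if done else 0.0), done

env = ToyGuiEnv({"home": {"settings": "settings", "mail": "inbox"},
                 "settings": {"back": "home", "wifi": "wifi"},
                 "inbox": {"back": "home"}, "wifi": {}}, target="wifi")
s, done = env.reset("home"), False
while not done:  # random-exploration baseline; terminates with probability 1
    s, r, done = env.step(random.choice(["settings", "mail", "back", "wifi"]))
```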

NeurIPS Conference 2025 Conference Paper

Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

  • Yana Wei
  • Liang Zhao
  • Jianjian Sun
  • Kangheng Lin
  • Jisheng Yin
  • Jingcheng Hu
  • Yinmin Zhang
  • En Yu

The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps—surpassing all previous open-source efforts in scale. This pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early in cold start due to linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.

NeurIPS Conference 2025 Conference Paper

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

  • Jingcheng Hu
  • Yinmin Zhang
  • Qi Han
  • Daxin Jiang
  • Xiangyu Zhang
  • Heung-Yeung Shum

We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training on the base model focusing on scalability, simplicity and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE (λ=1, γ=1) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both benchmark performance and response length, replicating the scaling phenomenon observed in DeepSeek-R1-Zero. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance across AIME2024, MATH500, and GPQA Diamond, while demonstrating remarkable efficiency—requiring only 1/10 of the training steps compared to the DeepSeek-R1-Zero pipeline. We validate that this recipe generalizes well across diverse training domains and different model families without algorithmic modifications. Moreover, our analysis not only covers training dynamics and ablations for critical design choices, but also quantitatively shows how the learned critic in Reasoner-Zero training effectively identifies and devalues repetitive response patterns, yielding more robust advantage estimations and enhancing training stability. Embracing the principles of open source, we release our source code, parameter settings, training data, and model weights across various sizes, fostering reproducibility and encouraging further exploration of the properties of related models.
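
The minimalist recipe is easy to state in code. With λ=1 and γ=1 (and assuming a zero terminal value), GAE collapses to the undiscounted return-to-go minus the critic's estimate; a small sketch:

```python
# Minimal sketch of the advantage computation the abstract describes:
# with GAE(λ=1, γ=1) and a zero terminal value, A_t = G_t - V(s_t).
from typing import List

def gae_lambda1_gamma1(rewards: List[float], values: List[float]) -> List[float]:
    advantages, ret = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        ret += rewards[t]                 # undiscounted return-to-go G_t
        advantages[t] = ret - values[t]   # A_t = G_t - V(s_t)
    return advantages

# Sparse rule-based reward: 1.0 on the final token if the answer verifies.
print(gae_lambda1_gamma1([0, 0, 0, 1.0], [0.3, 0.4, 0.6, 0.8]))
```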

NeurIPS Conference 2025 Conference Paper

Perception-R1: Pioneering Perception Policy with Reinforcement Learning

  • En Yu
  • Kangheng Lin
  • Liang Zhao
  • Jisheng Yin
  • Yana Wei
  • Yuang Peng
  • Haoran Wei
  • Jianjian Sun

Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in MLLM post-training for perception policy learning. While promising, our initial experiments reveal that incorporating a thinking process through RL does not consistently lead to performance gains across all visual perception tasks. This leads us to delve into the essential role of RL in the context of visual perception. In this work, we return to the fundamentals and explore the effects of RL on different perception tasks. We observe that perceptual perplexity is a major factor in determining the effectiveness of RL. We also observe that reward design plays a crucial role in further approaching the upper limit of model perception. To leverage these findings, we propose Perception-R1, a scalable RL framework using GRPO during MLLM post-training. With a standard Qwen2-VL-2B-Instruct, Perception-R1 achieves +4.2% on RefCOCO+, +17.9% on PixMo-Count, +4.2% on PageOCR, and, notably, 31.9% AP on COCO2017 val for the first time, establishing a strong baseline for perception policy learning.
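
The GRPO ingredient mentioned here normalizes rule-based rewards within a group of responses sampled for the same prompt to form advantages; a minimal sketch (group size and reward values are illustrative):

```python
# Sketch of a GRPO-style advantage: normalize each sampled response's
# (rule-based) reward by the group mean and standard deviation.
import statistics

def grpo_advantages(group_rewards):
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards) or 1.0  # guard degenerate groups
    return [(r - mu) / sigma for r in group_rewards]

# e.g., IoU-style perception rewards for 4 sampled answers to one image/query
print(grpo_advantages([0.9, 0.2, 0.6, 0.2]))
```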

NeurIPS Conference 2025 Conference Paper

Predictable Scale (Part II) — Farseer: A Refined Scaling Law in LLMs

  • Houyi Li
  • Wenzhen Zheng
  • Qiufeng Wang
  • Zhenyu Ding
  • Haoying Wang
  • Zili Wang
  • Shijie Xuyang
  • Ning Ding

Training Large Language Models (LLMs) is prohibitively expensive, creating a critical scaling gap where insights from small-scale experiments often fail to transfer to resource-intensive production systems, thereby hindering efficient innovation. To bridge this, we introduce Farseer, a novel and refined scaling law offering enhanced predictive accuracy across scales. By systematically constructing a model loss surface L(N, D), Farseer achieves a significantly better fit to empirical data than prior laws (e.g., Chinchilla's law). Our methodology yields accurate, robust, and highly generalizable predictions, demonstrating excellent extrapolation capabilities, outperforming Chinchilla's law, whose extrapolation error is 433% higher. This allows for the reliable evaluation of competing training strategies across all (N, D) settings, enabling conclusions from small-scale ablation studies to be confidently extrapolated to predict large-scale performance. Furthermore, Farseer provides new insights into optimal compute allocation, better reflecting the nuanced demands of modern LLM training. To validate our approach, we trained an extensive suite of approximately 1,000 LLMs across diverse scales and configurations, consuming roughly 3 million NVIDIA H100 GPU hours. To foster further research, we are comprehensively open-sourcing all code, data, and results (https://github.com/Farseer-Scaling-Law/Farseer), all training logs (https://wandb.ai/billzid/Farseer?nw=nwuserbillzid), and all models used in scaling law fitting (https://huggingface.co/Farseer-Scaling-Law).
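
Farseer's own functional form is not given in the abstract, but the fitting workflow it refines can be sketched with the Chinchilla-style surface it is compared against. Below, synthetic (N, D, loss) points are fit with scipy; all constants are made-up placeholders.

```python
# Sketch: fit L(N, D) = E + A/N^alpha + B/D^beta (the Chinchilla form the
# abstract compares against; Farseer's exact form is not reproduced here).
import numpy as np
from scipy.optimize import curve_fit

def chinchilla(ND, E, A, alpha, B, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

N = np.array([1e8, 1e8, 1e9, 1e9, 1e10, 1e10])
D = np.array([1e9, 1e10, 1e10, 1e11, 1e11, 1e12])
loss = chinchilla((N, D), 1.7, 400.0, 0.34, 410.0, 0.28)  # synthetic data

params, _ = curve_fit(chinchilla, (N, D), loss,
                      p0=[1.5, 300.0, 0.3, 300.0, 0.3], maxfev=20000)
print(dict(zip(["E", "A", "alpha", "B", "beta"], params)))
```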

AAAI Conference 2024 Conference Paper

Fine-Grained Distillation for Long Document Retrieval

  • Yucheng Zhou
  • Tao Shen
  • Xiubo Geng
  • Chongyang Tao
  • Jianbing Shen
  • Guodong Long
  • Can Xu
  • Daxin Jiang

Long document retrieval aims to fetch query-relevant documents from a large-scale collection, where knowledge distillation has become the de facto approach to improving a retriever by mimicking a heterogeneous yet powerful cross-encoder. However, in contrast to passages or sentences, retrieval on long documents suffers from the "scope hypothesis" that a long document may cover multiple topics. This maximizes their structural heterogeneity and poses a granular-mismatch issue, leading to inferior distillation efficacy. In this work, we propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers. While preserving the conventional dense retrieval paradigm, it first produces globally consistent representations across different levels of granularity and then applies multi-granular aligned distillation merely during training. In experiments, we evaluate our framework on two long-document retrieval benchmarks, which show state-of-the-art performance.

ICLR Conference 2024 Conference Paper

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

  • Ziyang Luo
  • Can Xu
  • Pu Zhao 0004
  • Qingfeng Sun
  • Xiubo Geng
  • Wenxiang Hu
  • Chongyang Tao
  • Jing Ma 0004

Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated remarkable performance in various code-related tasks. However, different from their counterparts in the general language modeling field, the technique of instruction fine-tuning remains relatively under-researched in this domain. In this paper, we present Code Evol-Instruct, a novel approach that adapts the Evol-Instruct method to the realm of code, enhancing Code LLMs to create novel models, WizardCoder. Through comprehensive experiments on five prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, DS-1000, and MultiPL-E, our models showcase outstanding performance. They consistently outperform all other open-source Code LLMs by a significant margin. Remarkably, WizardCoder 15B even surpasses the well-known closed-source LLMs, including Anthropic's Claude and Google's Bard, on the HumanEval and HumanEval+ benchmarks. Additionally, WizardCoder 34B not only achieves a HumanEval score comparable to GPT3.5 (ChatGPT) but also surpasses it on the HumanEval+ benchmark. Furthermore, our preliminary exploration highlights the pivotal role of instruction complexity in achieving exceptional coding performance.

ICLR Conference 2024 Conference Paper

WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions

  • Can Xu
  • Qingfeng Sun
  • Kai Zheng 0021
  • Xiubo Geng
  • Pu Zhao 0004
  • Jiazhan Feng
  • Chongyang Tao
  • Qingwei Lin

Training large language models (LLMs) with open-domain instruction-following data has brought colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using an LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all the generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Both automatic and human evaluations consistently indicate that WizardLM outperforms baselines such as Alpaca (trained from Self-Instruct) and Vicuna (trained from human-created instructions). The experimental results demonstrate that the quality of the instruction-following dataset crafted by Evol-Instruct can significantly improve the performance of LLMs.
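
The Evol-Instruct loop can be sketched in a few lines: repeatedly ask an LLM to rewrite each instruction into a harder variant and pool all generations for fine-tuning. The `llm` callable and the rewriting prompt below are placeholders, not the paper's exact prompts.

```python
# Hypothetical sketch of the Evol-Instruct loop; prompts are illustrative.
def evolve(llm, seed_instructions, rounds=3):
    pool = list(seed_instructions)
    frontier = list(seed_instructions)
    for _ in range(rounds):
        frontier = [
            llm("Rewrite this instruction to make it more complex, e.g. by "
                "adding constraints or requiring deeper reasoning, while "
                "keeping it answerable:\n" + inst)
            for inst in frontier
        ]
        pool.extend(frontier)   # mix all generations for fine-tuning
    return pool

# usage: data = evolve(my_llm_call, ["Sort a list of numbers in Python."])
```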

AAAI Conference 2023 Conference Paper

A Graph Fusion Approach for Cross-Lingual Machine Reading Comprehension

  • Zenan Xu
  • Linjun Shou
  • Jian Pei
  • Ming Gong
  • Qinliang Su
  • Xiaojun Quan
  • Daxin Jiang

Although great progress has been made for Machine Reading Comprehension (MRC) in English, scaling out to a large number of languages remains a huge challenge due to the lack of large amounts of annotated training data in non-English languages. To address this challenge, some recent efforts in cross-lingual MRC employ machine translation to transfer knowledge from English to other languages, through either explicit alignment or implicit attention. For effective knowledge transfer, it is beneficial to leverage both semantic and syntactic information. However, the existing methods fail to explicitly incorporate syntax information in model learning. Consequently, the models are not robust to errors in alignment and noise in attention. In this work, we propose a novel approach, which jointly models the cross-lingual alignment information and the mono-lingual syntax information using a graph. We develop a series of algorithms, including graph construction, learning, and pre-training. The experiments on two benchmark datasets for cross-lingual MRC show that our approach outperforms all strong baselines, which verifies the effectiveness of syntax information for cross-lingual MRC.

ICLR Conference 2023 Conference Paper

HypeR: Multitask Hyper-Prompted Training Enables Large-Scale Retrieval Generalization

  • Zefeng Cai
  • Chongyang Tao
  • Tao Shen 0001
  • Can Xu
  • Xiubo Geng
  • Xin Alex Lin
  • Liang He 0001
  • Daxin Jiang

Recently, large-scale text retrieval has made impressive progress, facilitating both information retrieval and downstream knowledge-intensive tasks (e.g., open-domain QA and dialogue). With a moderate amount of data, a neural text retriever can outperform traditional methods such as BM25 by a large margin. However, when applied to out-of-domain data, the performance of a neural retriever degrades considerably. Therefore, how to enable a retriever to perform more robustly across different domains or tasks, and even show strong zero-shot transfer ability, is critical for building scalable IR systems. To this end, we propose HypeR, a hyper-prompted training mechanism to enable uniform retrieval across tasks of different domains. Specifically, our approach jointly trains the query encoder with a shared prompt-based parameter pool and a prompt synthesizer that dynamically composes a hyper-prompt for encoding each query from different tasks or domains. Besides, to avoid mode collapse of the prompt attention distribution for different queries, we design a contrastive prompt regularization that promotes alignment and uniformity of the prompt attention modes. Through multi-task hyper-prompted training, our retriever masters the ability to dynamically represent different types of queries and transfer knowledge across different domains and tasks. Extensive experiments show our model attains better retrieval performance across different tasks and better zero-shot transfer ability compared with various previous methods.

ECAI Conference 2023 Conference Paper

Investigating the Learning Behaviour of In-Context Learning: A Comparison with Supervised Learning

  • Xindi Wang 0001
  • Yufei Wang 0003
  • Can Xu
  • Xiubo Geng
  • Bowen Zhang
  • Chongyang Tao
  • Frank Rudzicz
  • Robert E. Mercer

Large language models (LLMs) have shown remarkable capacity for in-context learning (ICL), where learning a new task from just a few training examples is done without being explicitly pre-trained. However, despite the success of LLMs, there has been little understanding of how ICL learns the knowledge from the given prompts. In this paper, to make progress toward understanding the learning behaviour of ICL, we train the same LLMs with the same demonstration examples via ICL and supervised learning (SL), respectively, and investigate their performance under label perturbations (i.e., noisy labels and label imbalance) on a range of classification tasks. First, via extensive experiments, we find that gold labels have significant impacts on the downstream in-context performance, especially for large language models; however, imbalanced labels matter little to ICL across all model sizes. Second, when comparing with SL, we show empirically that ICL is less sensitive to label perturbations than SL, and ICL gradually attains comparable performance to SL as the model size increases.

ICLR Conference 2023 Conference Paper

KnowDA: All-in-One Knowledge Mixture Model for Data Augmentation in Low-Resource NLP

  • Yufei Wang 0003
  • Jiayi Zheng
  • Can Xu
  • Xiubo Geng
  • Tao Shen 0001
  • Chongyang Tao
  • Daxin Jiang

This paper focuses on data augmentation for low-resource NLP tasks where the training set is limited. The existing solutions either leverage task-independent heuristic rules (e.g., Synonym Replacement) or fine-tune general-purpose pre-trained language models (e.g., GPT2) using the limited training instances to produce new synthetic data. Consequently, they have trivial task-specific knowledge and are limited to yielding low-quality synthetic data. To combat this issue, we propose the Knowledge Mixture Data Augmentation Model (KnowDA), a Seq2Seq language model pretrained on a mixture of diverse NLP tasks under a novel framework of Knowledge Mixture Training (KoMT). The goal of KoMT is to condense diverse NLP task-specific knowledge into the single KnowDA model (i.e., all-in-one). The resulting KnowDA can utilize this knowledge to quickly grasp the inherent synthesis law of the target task through limited training instances. Specifically, KoMT reformulates input examples from various heterogeneous NLP tasks into a unified text-to-text format and employs denoising training objectives at different granularities to learn to reconstruct partial or complete samples. To the best of our knowledge, we are the first to attempt to apply 100+ NLP multi-task training for data augmentation. Extensive experiments show that i) the synthetic data produced by KnowDA successfully improves the performance of strong pre-trained language models (i.e., BERT, ALBERT and DeBERTa) by a large margin on the low-resource NLP benchmarks FewGLUE, CoNLL'03 and WikiAnn; ii) KnowDA successfully transfers task knowledge to NLP tasks whose types are seen and unseen in KoMT.

ICLR Conference 2023 Conference Paper

LexMAE: Lexicon-Bottlenecked Pretraining for Large-Scale Retrieval

  • Tao Shen 0001
  • Xiubo Geng
  • Chongyang Tao
  • Can Xu
  • Xiaolong Huang
  • Binxing Jiao
  • Linjun Yang
  • Daxin Jiang

In large-scale retrieval, the lexicon-weighting paradigm, learning weighted sparse representations in vocabulary space, has shown promising results with high quality and low latency. Although it deeply exploits the lexicon-representing capability of pre-trained language models, a crucial gap remains between language modeling and lexicon-weighting retrieval -- the former preferring certain or low-entropy words whereas the latter favoring pivot or high-entropy words -- and this gap is the main barrier to lexicon-weighting performance for large-scale retrieval. To bridge this gap, we propose a brand-new pre-training framework, the lexicon-bottlenecked masked autoencoder (LexMAE), to learn importance-aware lexicon representations. Essentially, we present a lexicon-bottlenecked module between a normal language-modeling encoder and a weakened decoder, where a continuous bag-of-words bottleneck is constructed to learn a lexicon-importance distribution in an unsupervised fashion. The pre-trained LexMAE is readily transferred to lexicon-weighting retrieval via fine-tuning. On the ad-hoc retrieval benchmark MS MARCO, it achieves 42.6% MRR@10 with 45.8 QPS for the passage dataset and 44.4% MRR@100 with 134.8 QPS for the document dataset, on a CPU machine. LexMAE also shows state-of-the-art zero-shot transfer capability on the BEIR benchmark with 12 datasets.
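
A rough sketch of the lexicon bottleneck follows; the pooling choice and shapes are assumptions and the paper's exact architecture may differ. Per-position vocabulary logits are pooled into one sparse lexicon-importance vector, and a continuous bag-of-words mixture of word embeddings is all the weakened decoder receives.

```python
# Illustrative sketch of a lexicon bottleneck (assumed pooling and shapes).
import torch
import torch.nn as nn

vocab, d = 30522, 128
encoder_logits = torch.randn(2, 64, vocab)   # per-position MLM logits (B, L, V)

# Pool into one sparse lexicon-importance vector per text (SPLADE-style
# saturation; LexMAE's exact pooling may differ).
lex = torch.log1p(torch.relu(encoder_logits)).max(dim=1).values   # (B, V)

# Continuous bag-of-words bottleneck: the weak decoder would see only this
# lexicon-weighted mixture of word embeddings, not the token sequence.
word_emb = nn.Embedding(vocab, d)
bottleneck = torch.softmax(lex, dim=-1) @ word_emb.weight          # (B, d)
print(bottleneck.shape)
```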

ICLR Conference 2023 Conference Paper

Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval

  • Shunyu Zhang
  • Yaobo Liang
  • Ming Gong 0001
  • Daxin Jiang
  • Nan Duan 0001

Recently, multilingual pre-trained language models (PLMs) such as mBERT and XLM-R have made impressive strides in cross-lingual dense retrieval. Despite their successes, they are general-purpose PLMs, while multilingual PLMs tailored for cross-lingual retrieval remain unexplored. Motivated by the observation that the sentences in parallel documents are approximately in the same order, which is universal across languages, we propose to model this sequential sentence relation to facilitate cross-lingual representation learning. Specifically, we propose a multilingual PLM called the masked sentence model (MSM), which consists of a sentence encoder to generate sentence representations and a document encoder applied to a sequence of sentence vectors from a document. The document encoder is shared across all languages to model the universal sequential sentence relation. To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives. Comprehensive experiments on four cross-lingual retrieval tasks show that MSM significantly outperforms existing advanced pre-training models, demonstrating the effectiveness and stronger cross-lingual retrieval capabilities of our approach.
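
The masked sentence prediction objective resembles an InfoNCE loss over sentence vectors. A minimal PyTorch sketch under assumed shapes follows; the paper's hierarchical variant and negative-sampling details are not reproduced here.

```python
# Sketch of a masked-sentence contrastive loss (assumed shapes/temperature).
import torch
import torch.nn.functional as F

def masked_sentence_loss(pred, target, negatives, tau=0.05):
    # pred: (B, d) document-encoder output at masked slots
    # target: (B, d) true sentence vectors; negatives: (B, K, d) sampled
    pred, target = F.normalize(pred, dim=-1), F.normalize(target, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (pred * target).sum(-1, keepdim=True)            # (B, 1)
    neg = torch.einsum("bd,bkd->bk", pred, negatives)      # (B, K)
    logits = torch.cat([pos, neg], dim=-1) / tau
    # the positive sits at index 0 of each row
    return F.cross_entropy(logits, torch.zeros(len(pred), dtype=torch.long))

loss = masked_sentence_loss(torch.randn(4, 128), torch.randn(4, 128),
                            torch.randn(4, 16, 128))
```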

AAAI Conference 2023 Conference Paper

WIERT: Web Information Extraction via Render Tree

  • Zimeng Li
  • Bo Shao
  • Linjun Shou
  • Ming Gong
  • Gen Li
  • Daxin Jiang

Web information extraction (WIE) is a fundamental problem in web document understanding, with a significant impact on various applications. Visual information plays a crucial role in WIE tasks as the nodes containing relevant information are often visually distinct, such as being in a larger font size or having a brighter color, from the other nodes. However, rendering visual information of a web page can be computationally expensive. Previous works have mainly focused on the Document Object Model (DOM) tree, which lacks visual information. To efficiently exploit visual information, we propose leveraging the render tree, which combines the DOM tree and Cascading Style Sheets Object Model (CSSOM) tree, and contains not only content and layout information but also rich visual information at little additional acquisition cost compared to the DOM tree. In this paper, we present WIERT, a method that effectively utilizes the render tree of a web page based on a pretrained language model. We evaluate WIERT on the Klarna product page dataset, a manually labeled dataset of renderable e-commerce web pages, demonstrating its effectiveness and robustness.

AAAI Conference 2022 System Paper

FORCE: A Framework of Rule-Based Conversational Recommender System

  • Jun Quan
  • Ze Wei
  • Qiang Gan
  • Jingqi Yao
  • Jingyi Lu
  • Yuchen Dong
  • Yiming Liu
  • Yi Zeng

The conversational recommender systems (CRSs) have received extensive attention in recent years. However, most of the existing works focus on various deep learning models, which are largely limited by the requirement of large-scale human-annotated datasets. Such methods are not able to deal with the cold-start scenarios in industrial products. To alleviate the problem, we propose FORCE, a Framework Of Rule-based Conversational rEcommender system that helps developers quickly build CRS bots by simple configuration. We conduct experiments on two datasets in different languages and domains to verify its effectiveness and usability.

IJCAI Conference 2022 Conference Paper

Unsupervised Context Aware Sentence Representation Pretraining for Multi-lingual Dense Retrieval

  • Ning Wu
  • Yaobo Liang
  • Houxing Ren
  • Linjun Shou
  • Nan Duan
  • Ming Gong
  • Daxin Jiang

Recent research demonstrates the effectiveness of using pretrained language models (PLM) to improve dense retrieval and multilingual dense retrieval. In this work, we present a simple but effective monolingual pretraining task called contrastive context prediction (CCP) to learn sentence representation by modeling sentence-level contextual relation. By pushing the embeddings of sentences in a local context closer and pushing random negative samples away, different languages can form isomorphic structures, and sentence pairs in two different languages will then be automatically aligned. Our experiments show that model collapse and information leakage can easily occur during contrastive training of the language model, but a language-specific memory bank and an asymmetric batch-normalization operation play essential roles in preventing collapse and information leakage, respectively. Besides, post-processing of sentence embeddings is also very effective for achieving better retrieval performance. On the multilingual sentence retrieval task Tatoeba, our model achieves new SOTA results among methods that do not use bilingual data. Our model also shows larger gains on Tatoeba when transferring between non-English pairs. On two multi-lingual query-passage retrieval tasks, XOR Retrieve and Mr. TyDi, our model even achieves SOTA results in both the zero-shot and supervised settings among all pretraining models using bilingual data.

IJCAI Conference 2021 Conference Paper

A Survey on Response Selection for Retrieval-based Dialogues

  • Chongyang Tao
  • Jiazhan Feng
  • Rui Yan
  • Wei Wu
  • Daxin Jiang

Building an intelligent dialogue system capable of naturally and coherently conversing with humans has been a long-standing goal of artificial intelligence. In the past decade, with the development of machine/deep learning technology and the explosive growth of available conversation data in social media, numerous neural models have been developed for context-response matching tasks in retrieval-based dialogue systems, with more fluent and informative responses compared with generative models. This paper presents a comprehensive survey of recent advances in response selection for retrieval-based dialogues. In particular, we first formulate the problem of response selection and review state-of-the-art context-response matching models categorized by their architecture. Then we summarize some recent advances on the research of response selection, including incorporation with extra knowledge and exploration on more effective model learning. Finally, we highlight the challenges which are not yet well addressed in this task and present future research directions.

NeurIPS Conference 2021 Conference Paper

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

  • Shuai Lu
  • Daya Guo
  • Shuo Ren
  • Junjie Huang
  • Alexey Svyatkovskiy
  • Ambrosio Blanco
  • Colin Clement
  • Dawn Drain

Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation. CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison. CodeXGLUE also features three baseline systems, including the BERT-style, GPT-style, and Encoder-Decoder models, to make it easy for researchers to use the platform. The availability of such data and baselines can help the development and validation of new methods that can be applied to various program understanding and generation problems.

ICLR Conference 2021 Conference Paper

Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning

  • Ruozi Huang
  • Huang Hu
  • Wei Wu 0014
  • Kei Sawada
  • Mi Zhang
  • Daxin Jiang

Dancing to music has been one of humans' innate abilities since ancient times. In machine learning research, however, synthesizing dance movements from music is a challenging problem. Recently, researchers have synthesized human motion sequences through autoregressive models like recurrent neural networks (RNNs). Such an approach often generates short sequences due to an accumulation of prediction errors that are fed back into the neural network. This problem becomes even more severe in long motion sequence generation. Besides, the consistency between dance and music in terms of style, rhythm and beat is yet to be taken into account during modeling. In this paper, we formalize music-driven dance generation as a sequence-to-sequence learning problem and devise a novel seq2seq architecture to efficiently process long sequences of music features and capture the fine-grained correspondence between music and dance. Furthermore, we propose a novel curriculum learning strategy to alleviate error accumulation of autoregressive models in long motion sequence generation, which gently changes the training process from a fully guided teacher-forcing scheme using the previous ground-truth movements, towards a less guided autoregressive scheme mostly using the generated movements instead. Extensive experiments show that our approach significantly outperforms existing state-of-the-art methods on automatic metrics and human evaluation. We also provide a demo video demonstrating the superior performance of our proposed approach at https://www.youtube.com/watch?v=lmE20MEheZ8.
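
The curriculum can be summarized as a schedule on how often the decoder consumes its own prediction instead of the ground-truth movement at each step; the linear ramp below is an assumed shape, not the paper's exact schedule.

```python
# Sketch of a teacher-forcing-to-autoregressive curriculum (assumed schedule).
import random

def use_own_prediction(epoch: int, total_epochs: int) -> bool:
    # Probability of feeding the model's generated movement grows linearly
    # until it reaches 1.0 at 80% of training (ramp shape is an assumption).
    p = min(1.0, epoch / (0.8 * total_epochs))
    return random.random() < p

# During decoding step t:
#   prev = model_output if use_own_prediction(e, E) else ground_truth[t - 1]
```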

ICLR Conference 2021 Conference Paper

GraphCodeBERT: Pre-training Code Representations with Data Flow

  • Daya Guo
  • Shuo Ren 0002
  • Shuai Lu
  • Zhangyin Feng
  • Duyu Tang
  • Shujie Liu 0001
  • Long Zhou
  • Nan Duan 0001

Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of taking the syntactic-level structure of code, such as the abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables. Such a semantic-level structure is neat and does not bring the unnecessarily deep hierarchy of an AST, a property that makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to using the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model in an efficient way with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement. Results show that code structure and the newly introduced pre-training tasks improve GraphCodeBERT, which achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search.
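
A graph-guided masked attention function can be illustrated as an allow-matrix over the concatenation [code tokens; data-flow nodes]; the index layout and masking convention below are assumptions for illustration, not the paper's exact implementation.

```python
# Illustrative sketch of a graph-guided attention mask (assumed layout).
import torch

def graph_guided_mask(n_code: int, n_vars: int, dataflow_edges):
    n = n_code + n_vars
    allow = torch.zeros(n, n, dtype=torch.bool)
    allow[:n_code, :] = True                    # code tokens see everything
    allow[:, :n_code] = True                    # everyone sees code tokens
    for u, v in dataflow_edges:                 # "where-the-value-comes-from"
        allow[n_code + u, n_code + v] = True    # variable nodes attend only
        allow[n_code + v, n_code + u] = True    # along data-flow edges
    allow.fill_diagonal_(True)
    return allow                                # use as an attention mask

mask = graph_guided_mask(6, 3, [(0, 1), (1, 2)])
```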

AAAI Conference 2021 System Paper

Integrating Pre-trained Model into Rule-based Dialogue Management

  • Jun Quan
  • Meng Yang
  • Qiang Gan
  • Deyi Xiong
  • Yiming Liu
  • Yuchen Dong
  • Fangxin Ouyang
  • Jun Tian

Rule-based dialogue management is still the most popular solution for industrial task-oriented dialogue systems because of its interpretability. However, it is hard for developers to maintain the dialogue logic when the scenarios get more and more complex. On the other hand, data-driven dialogue systems, usually with end-to-end structures, are popular in academic research and easier to deal with complex conversations, but such methods require plenty of training data and the behaviors are less interpretable. In this paper, we propose a method that leverages the strengths of both rule-based and data-driven dialogue managers (DM). We first introduce the DM of Carina Dialog System (CDS, an advanced industrial dialogue system built by Microsoft). Then we propose the "model-trigger" design to make the DM trainable and thus scalable to scenario changes. Furthermore, we integrate pre-trained models and empower the DM with few-shot capability. The experimental results demonstrate the effectiveness and strong few-shot capability of our method.

AAAI Conference 2021 Conference Paper

Learning an Effective Context-Response Matching Model with Self-Supervised Tasks for Retrieval-based Dialogues

  • Ruijian Xu
  • Chongyang Tao
  • Daxin Jiang
  • Xueliang Zhao
  • Dongyan Zhao
  • Rui Yan

Building an intelligent dialogue system with the ability to select a proper response according to a multi-turn context is a highly challenging task. Existing studies focus on building a context-response matching model with various neural architectures or pretrained language models (PLMs) and typically learn with a single response prediction task. These approaches overlook many potential training signals contained in dialogue data, which might be beneficial for context understanding and produce better features for response prediction. Besides, the responses retrieved by existing dialogue systems supervised in the conventional way still face some critical challenges, including incoherence and inconsistency. To address these issues, in this paper, we propose learning a context-response matching model with auxiliary self-supervised tasks designed for the dialogue data, based on pre-trained language models. Specifically, we introduce four self-supervised tasks, including next session prediction, utterance restoration, incoherence detection and consistency discrimination, and jointly train the PLM-based response selection model with these auxiliary tasks in a multi-task manner. By this means, the auxiliary tasks can guide the learning of the matching model to achieve a better local optimum and select a more proper response. Experiment results on two benchmarks indicate that the proposed auxiliary self-supervised tasks bring significant improvement for multi-turn response selection in retrieval-based dialogues, and our model achieves new state-of-the-art results on both datasets.

NeurIPS Conference 2021 Conference Paper

Neural Rule-Execution Tracking Machine For Transformer-Based Text Generation

  • Yufei Wang
  • Can Xu
  • Huang Hu
  • Chongyang Tao
  • Stephen Wan
  • Mark Dras
  • Mark Johnson
  • Daxin Jiang

Sequence-to-Sequence (Seq2Seq) neural text generation models, especially pre-trained ones (e.g., BART and T5), have exhibited compelling performance on various natural language generation tasks. However, the black-box nature of these models limits their application in tasks where specific rules (e.g., controllable constraints, prior knowledge) need to be executed. Previous works either design specific model structures (e.g., the Copy Mechanism corresponding to the rule "the generated output should include certain words in the source input") or implement specialized inference algorithms (e.g., Constrained Beam Search) to execute particular rules through the text generation. These methods require careful case-by-case design and are difficult to extend to multiple rules concurrently. In this paper, we propose a novel module named Neural Rule-Execution Tracking Machine (NRETM) that can be equipped into various transformer-based generators to leverage multiple rules simultaneously, guiding the neural generation model toward superior generation performance in a unified and scalable way. Extensive experiments on several benchmarks verify the effectiveness of our proposed model in both controllable and general text generation tasks.

AAAI Conference 2021 Conference Paper

Reinforced Multi-Teacher Selection for Knowledge Distillation

  • Fei Yuan
  • Linjun Shou
  • Jian Pei
  • Wutao Lin
  • Ming Gong
  • Yan Fu
  • Daxin Jiang

In natural language processing (NLP) tasks, slow inference speed and huge footprints in GPU usage remain the bottleneck of applying pre-trained deep models in production. As a popular method for model compression, knowledge distillation transfers knowledge from one or multiple large (teacher) models to a small (student) model. When multiple teacher models are available in distillation, the state-of-the-art methods assign a fixed weight to each teacher model throughout the distillation. Furthermore, most of the existing methods allocate an equal weight to every teacher model. In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models. We systematically develop a reinforced method to dynamically assign weights to teacher models for different training instances and optimize the performance of the student model. Our extensive experimental results on several NLP tasks clearly verify the feasibility and effectiveness of our approach.
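
One way to realize per-instance teacher weighting is a REINFORCE-style policy that samples a teacher per example and is rewarded by student quality. The sketch below is an assumed formulation for illustration, not the paper's exact algorithm; `policy_net` and the reward signal are placeholders.

```python
# Sketch of reinforced per-instance teacher selection (assumed formulation).
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def distill_step(policy_net, feats, student_logits, teacher_logits, reward):
    # feats: (B, F) instance features; student_logits: (B, C);
    # teacher_logits: (T, B, C) stacked outputs of T teachers.
    dist = Categorical(logits=policy_net(feats))        # per-instance policy
    pick = dist.sample()                                # teacher id per example
    chosen = teacher_logits[pick, torch.arange(len(pick))]      # (B, C)
    kd = F.kl_div(F.log_softmax(student_logits, -1),
                  F.softmax(chosen, -1), reduction="batchmean")
    policy_loss = -(reward * dist.log_prob(pick)).mean()        # REINFORCE
    return kd, policy_loss
```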

IJCAI Conference 2020 Conference Paper

Effective Search of Logical Forms for Weakly Supervised Knowledge-Based Question Answering

  • Tao Shen
  • Xiubo Geng
  • Guodong Long
  • Jing Jiang
  • Chengqi Zhang
  • Daxin Jiang

Many algorithms for Knowledge-Based Question Answering (KBQA) depend on semantic parsing, which translates a question to its logical form. When only weak supervision is provided, it is usually necessary to search valid logical forms for model training. However, a complex question typically involves a huge search space, which creates two main problems: 1) the solutions limited by computation time and memory usually reduce the success rate of the search, and 2) spurious logical forms in the search results degrade the quality of training data. These two problems lead to a poorly-trained semantic parsing model. In this work, we propose an effective search method for weakly supervised KBQA based on operator prediction for questions. With search space constrained by predicted operators, sufficient search paths can be explored, more valid logical forms can be derived, and operators possibly causing spurious logical forms can be avoided. As a result, a larger proportion of questions in a weakly supervised training set are equipped with logical forms, and fewer spurious logical forms are generated. Such high-quality training data directly contributes to a better semantic parsing model. Experimental results on one of the largest KBQA datasets (i.e., CSQA) verify the effectiveness of our approach and deliver a new state-of-the-art performance.

AAAI Conference 2020 Conference Paper

Graph-Based Reasoning over Heterogeneous External Knowledge for Commonsense Question Answering

  • Shangwen Lv
  • Daya Guo
  • Jingjing Xu
  • Duyu Tang
  • Nan Duan
  • Ming Gong
  • Linjun Shou
  • Daxin Jiang

Commonsense question answering aims to answer questions which require background knowledge that is not explicitly expressed in the question. The key challenge is how to obtain evidence from external knowledge and make predictions based on the evidence. Recent studies either learn to generate evidence from human-annotated evidence, which is expensive to collect, or extract evidence from either structured or unstructured knowledge bases, which fails to take advantage of both sources simultaneously. In this work, we propose to automatically extract evidence from heterogeneous knowledge sources, and answer questions based on the extracted evidence. Specifically, we extract evidence from both a structured knowledge base (i.e., ConceptNet) and Wikipedia plain texts. We construct graphs for both sources to obtain the relational structures of evidence. Based on these graphs, we propose a graph-based approach consisting of a graph-based contextual word representation learning module and a graph-based inference module. The first module utilizes graph structural information to re-define the distance between words for learning better contextual word representations. The second module adopts a graph convolutional network to encode neighbor information into the representations of nodes, and aggregates evidence with a graph attention mechanism for predicting the final answer. Experimental results on the CommonsenseQA dataset illustrate that our graph-based approach over both knowledge sources brings improvement over strong baselines. Our approach achieves state-of-the-art accuracy (75.3%) on the CommonsenseQA dataset.

AAAI Conference 2020 Conference Paper

Neural Semantic Parsing in Low-Resource Settings with Back-Translation and Meta-Learning

  • Yibo Sun
  • Duyu Tang
  • Nan Duan
  • Yeyun Gong
  • Xiaocheng Feng
  • Bing Qin
  • Daxin Jiang

Neural semantic parsing has achieved impressive results in recent years, yet its success relies on the availability of large amounts of supervised data. Our goal is to learn a neural semantic parser when only prior knowledge about a limited number of simple rules is available, without access to either annotated programs or execution results. Our approach is initialized by rules, and improved in a back-translation paradigm using generated question-program pairs from the semantic parser and the question generator. A phrase table with frequent mapping patterns is automatically derived, also updated as training progresses, to measure the quality of generated instances. We train the model with model-agnostic meta-learning to guarantee the accuracy and stability on examples covered by rules, and meanwhile acquire the versatility to generalize well on examples uncovered by rules. Results on three benchmark datasets with different domains and programs show that our approach incrementally improves the accuracy. On WikiSQL, our best model is comparable to the state-of-the-art system learned from denotations.

AAAI Conference 2020 Conference Paper

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training

  • Gen Li
  • Nan Duan
  • Yuejian Fang
  • Ming Gong
  • Daxin Jiang

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner. Borrowing ideas from cross-lingual pre-trained models, such as XLM (Lample and Conneau 2019) and Unicoder (Huang et al. 2019), both visual and linguistic contents are fed into a multi-layer Transformer (Vaswani et al. 2017) for cross-modal pre-training, where three pre-training tasks are employed: Masked Language Modeling (MLM), Masked Object Classification (MOC) and Visual-Linguistic Matching (VLM). The first two tasks learn context-aware representations for input tokens based on linguistic and visual contents jointly. The last task tries to predict whether an image and a text describe each other. After pre-training on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer. We achieve state-of-the-art or comparable results on both tasks and show the powerful ability of cross-modal pre-training.

AAAI Conference 2018 Conference Paper

Assertion-Based QA With Question-Aware Open Information Extraction

  • Zhao Yan
  • Duyu Tang
  • Nan Duan
  • Shujie Liu
  • Wendi Wang
  • Daxin Jiang
  • Ming Zhou
  • Zhoujun Li

We present assertion-based question answering (ABQA), an open-domain question answering task that takes a question and a passage as inputs, and outputs a semi-structured assertion consisting of a subject, a predicate and a list of arguments. An assertion conveys more evidence than a short answer span in reading comprehension, and it is more concise than a tedious passage in passage-based QA. These advantages make ABQA more suitable for human-computer interaction scenarios such as voice-controlled speakers. Further progress towards improving ABQA requires a richer supervised dataset and powerful models of text understanding. To remedy this, we introduce a new dataset called WebAssertions, which includes hand-annotated QA labels for 358,427 assertions in 55,960 web passages. To address ABQA, we develop both generative and extractive approaches. The backbone of our generative approach is sequence-to-sequence learning. In order to capture the structure of the output assertion, we introduce a hierarchical decoder that first generates the structure of the assertion and then generates the words of each field. The extractive approach is based on learning to rank. Features at different levels of granularity are designed to measure the semantic relevance between a question and an assertion. Experimental results show that our approaches have the ability to infer question-aware assertions from a passage. We further evaluate our approaches by incorporating the ABQA results as additional features in passage-based QA. Results on two datasets show that ABQA features significantly improve the accuracy of passage-based QA.

TIST Journal 2013 Journal Article

Mining search and browse logs for web search

  • Daxin Jiang
  • Jian Pei
  • Hang Li

Huge amounts of search log data have been accumulated at Web search engines. Currently, a popular Web search engine may receive billions of queries and collect terabytes of records about user search behavior daily. Besides search log data, huge amounts of browse log data have also been collected through client-side browser plugins. Such massive amounts of search and browse log data provide great opportunities for mining the wisdom of crowds and improving Web search. At the same time, designing effective and efficient methods to clean, process, and model log data also presents great challenges. In this survey, we focus on mining search and browse log data for Web search. We start with an introduction to search and browse log data and an overview of frequently-used data summarizations in log mining. We then elaborate on how log mining applications enhance the five major components of a search engine, namely, query understanding, document understanding, document ranking, user understanding, and monitoring and feedback. For each aspect, we survey the major tasks, fundamental principles, and state-of-the-art methods.

TIST Journal 2011 Journal Article

Mining Concept Sequences from Large-Scale Search Logs for Context-Aware Query Suggestion

  • Zhen Liao
  • Daxin Jiang
  • Enhong Chen
  • Jian Pei
  • Huanhuan Cao
  • Hang Li

Query suggestion plays an important role in improving the usability of search engines. Although some recently proposed methods provide query suggestions by mining query patterns from search logs, none of them models the immediately preceding queries as context systematically, and uses context information effectively in query suggestions. Context-aware query suggestion is challenging in both modeling context and scaling up query suggestion using context. In this article, we propose a novel context-aware query suggestion approach. To tackle the challenges, our approach consists of two stages. In the first, offline model-learning stage, to address data sparseness, queries are summarized into concepts by clustering a click-through bipartite graph. A concept sequence suffix tree is then constructed from session data as a context-aware query suggestion model. In the second, online query suggestion stage, a user’s search context is captured by mapping the query sequence submitted by the user to a sequence of concepts. By looking up the context in the concept sequence suffix tree, we suggest context-aware queries to the user. We test our approach on large-scale search logs of a commercial search engine containing 4.0 billion Web queries, 5.9 billion clicks, and 1.87 billion search sessions. The experimental results clearly show that our approach outperforms three baseline methods in both coverage and quality of suggestions.
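
The two-stage design can be miniaturized as follows: offline, sessions are mapped to concept sequences and every suffix of the context is indexed; online, the longest indexed suffix of the user's concept-sequence context yields suggestions. Plain dictionaries stand in for the paper's concept sequence suffix tree, and all names here are illustrative.

```python
# Toy sketch of context-aware suggestion via concept-sequence suffixes.
from collections import defaultdict

index = defaultdict(list)        # concept-sequence suffix -> follow-up queries

def train(sessions, query_to_concept):
    # sessions: iterable of (context_queries, followup_query) pairs
    for queries, followup in sessions:
        concepts = tuple(query_to_concept[q] for q in queries)
        for i in range(len(concepts)):     # index every suffix of the context
            index[concepts[i:]].append(followup)

def suggest(context_queries, query_to_concept):
    concepts = tuple(query_to_concept[q] for q in context_queries)
    for i in range(len(concepts)):         # try the longest suffix first
        if concepts[i:] in index:
            return index[concepts[i:]]
    return []
```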