Arrow Research search

Author name cluster

Fei Huang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

57 papers
2 author rows

Possible papers

57

JBHI Journal 2026 Journal Article

3D-CNN Enhanced Multiscale Progressive Vision Transformer for AD Diagnosis

  • Fei Huang
  • Nanguang Chen
  • Anqi Qiu

Vision Transformer (ViT) applied to structural magnetic resonance imaging (sMRI) has demonstrated success in the diagnosis of Alzheimer’s disease (AD) and mild cognitive impairment (MCI). However, three key challenges have yet to be well addressed: 1) ViT requires a large labeled dataset to mitigate overfitting, while most current AD-related sMRI datasets fall short in sample size. 2) ViT neglects within-patch feature learning, e.g., local brain atrophy, which is crucial for AD diagnosis. 3) While ViT can better capture local features by reducing the patch size and increasing the number of patches, its computational complexity increases quadratically with the number of patches, incurring unbearable overhead. To this end, this paper proposes a 3D-convolutional neural network (CNN) Enhanced Multiscale Progressive ViT (3D-CNN-MPVT). First, a 3D-CNN is pre-trained on sMRI data to extract detailed local image features and alleviate overfitting. Second, an MPVT module is proposed with an inner CNN module to explicitly characterize the within-patch interactions that are conducive to AD diagnosis. Third, a stitch operation is proposed to merge cross-patch features and progressively reduce the number of patches. The inner CNN alongside the stitch operation in the MPVT module enhances local feature characterization while mitigating computational costs. Evaluations using the Alzheimer’s Disease Neuroimaging Initiative dataset with 6610 scans and the Open Access Series of Imaging Studies-3 with 1866 scans demonstrated its superior performance. With minimal preprocessing, our approach achieved an impressive 90% and 80% accuracy in AD classification and MCI conversion prediction, respectively, surpassing recent baselines.
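
A plausible minimal realization of the stitch operation described in this abstract, under the assumption that "stitching" concatenates adjacent patch tokens and projects them back to the embedding dimension, halving the patch count per stage; the authors' exact design may differ.

```python
import torch
import torch.nn as nn

class StitchMerge(nn.Module):
    """Merge cross-patch features and halve the number of patch tokens (illustrative)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n, d = tokens.shape                    # assumes an even number of patches n
        pairs = tokens.view(b, n // 2, 2 * d)     # stitch each pair of adjacent patches
        return self.proj(pairs)                   # (b, n/2, d): half as many patches

print(StitchMerge(256)(torch.randn(2, 64, 256)).shape)  # torch.Size([2, 32, 256])
```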

AAAI Conference 2026 Conference Paper

Efficient and Effective In-context Demonstration Selection with Coreset

  • Zihua Wang
  • Jiarui Wang
  • Haiyang Xu
  • Ming Yan
  • Fei Huang
  • Xu Yang
  • Xiu-Shen Wei
  • Siya Mi

In-context learning (ICL) has emerged as a powerful paradigm for Large Visual Language Models (LVLMs), enabling them to leverage a few examples directly from input contexts. However, the effectiveness of this approach relies heavily on the selection of demonstrations, a process that is NP-hard. Traditional strategies, including random, similarity-based, and infoscore-based sampling, often lead to inefficiencies or suboptimal performance, struggling to balance efficiency and effectiveness in demonstration selection. In this paper, we propose a novel demonstration selection framework named Coreset-based Dual Retrieval (CoDR). We show that samples within a diverse subset achieve higher expected mutual information. To implement this, we introduce a cluster-pruning method to construct a diverse coreset that aligns more effectively with the query while maintaining diversity. Additionally, we develop a dual retrieval mechanism that enhances the selection process by achieving global demonstration selection while preserving efficiency. Experimental results demonstrate that our method significantly improves ICL performance compared to existing strategies, providing a robust solution for effective and efficient demonstration selection.
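
A short sketch of the two ideas above under simplifying assumptions: cluster pruning builds a diverse coreset by keeping the demonstration closest to each cluster centroid, and selection then retrieves the query's nearest coreset members (the dual retrieval mechanism is reduced here to a single similarity pass). The function names and the k-means choice are illustrative, not from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_coreset(embeddings: np.ndarray, n_clusters: int = 8) -> np.ndarray:
    """Cluster-pruning: keep one representative demonstration per cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    coreset = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        coreset.append(members[dists.argmin()])       # member closest to the centroid
    return np.array(coreset)

def select_demonstrations(query_emb: np.ndarray, embeddings: np.ndarray,
                          coreset: np.ndarray, k: int = 4) -> np.ndarray:
    sims = embeddings[coreset] @ query_emb            # cosine sim if inputs are normalized
    return coreset[np.argsort(-sims)[:k]]             # k most query-aligned coreset members

pool = np.random.randn(200, 64)
pool /= np.linalg.norm(pool, axis=1, keepdims=True)
core = build_coreset(pool)
print(select_demonstrations(pool[0], pool, core))
```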

AAAI Conference 2026 Conference Paper

Selective Weak-to-Strong Generalization

  • Hao Lang
  • Fei Huang
  • Yongbin Li

Future superhuman models will surpass the ability of humans, and humans will only be able to weakly supervise them. To alleviate the issue of lacking high-quality data for model alignment, some works on weak-to-strong generalization (W2SG) finetune a strong pretrained model with a weak supervisor so that it can generalize beyond weak supervision. However, the invariable use of weak supervision in existing methods exposes robustness issues, with a proportion of weak labels proving harmful to models. In this paper, we propose a selective W2SG framework that avoids using weak supervision when it is unnecessary. We train a binary classifier P(IK) to identify questions that a strong model can answer and use its self-generated labels for alignment. We further refine weak labels with a graph smoothing method. Extensive experiments on three benchmarks show that our method consistently outperforms competitive baselines. Further analyses show that P(IK) can generalize across tasks and difficulties, which indicates that selective W2SG can help superalignment.
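
A minimal sketch of the selection rule implied by this abstract: a P(IK) ("I know") score decides, per question, whether to keep the strong model's self-generated label or fall back to the (refined) weak label. The threshold and the stand-in callables are assumptions for illustration.

```python
from typing import Callable, List, Tuple

def build_training_labels(questions: List[str],
                          p_ik: Callable[[str], float],        # binary classifier P(IK)
                          self_label: Callable[[str], str],    # strong model's own answer
                          weak_label: Callable[[str], str],    # (refined) weak supervisor label
                          threshold: float = 0.5) -> List[Tuple[str, str]]:
    data = []
    for q in questions:
        # Use weak supervision only when the strong model is unlikely to know the answer.
        label = self_label(q) if p_ik(q) > threshold else weak_label(q)
        data.append((q, label))
    return data

demo = build_training_labels(
    ["easy question", "hard question"],
    p_ik=lambda q: 0.9 if "easy" in q else 0.2,
    self_label=lambda q: "strong model answer",
    weak_label=lambda q: "weak (graph-smoothed) label",
)
print(demo)
```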

NeurIPS Conference 2025 Conference Paper

CARE: Decoding-Time Safety Alignment via Rollback and Introspection Intervention

  • Xiaomeng Hu
  • Fei Huang
  • Chenhan Yuan
  • Junyang Lin
  • Tsung-Yi Ho

As large language models (LLMs) are increasingly deployed in real-world applications, ensuring the safety of their outputs during decoding has become a critical challenge. However, existing decoding-time interventions, such as Contrastive Decoding, often force a severe trade-off between safety and response quality. In this work, we propose CARE, a novel framework for decoding-time safety alignment that integrates three key components: (1) a guard model for real-time safety monitoring, enabling detection of potentially unsafe content; (2) a rollback mechanism with a token buffer to correct unsafe outputs efficiently at an earlier stage without disrupting the user experience; and (3) a novel introspection-based intervention strategy, where the model generates self-reflective critiques of its previous outputs and incorporates these reflections into the context to guide subsequent decoding steps. The framework achieves a superior safety-quality trade-off by using its guard model for precise interventions, its rollback mechanism for timely corrections, and our novel introspection method for effective self-correction. Experimental results demonstrate that our framework achieves a superior balance of safety, quality, and efficiency, attaining a low harmful response rate and minimal disruption to the user experience while maintaining high response quality.
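
A hedged sketch of the decoding loop this abstract describes: generated tokens sit in a small buffer, the guard model screens the buffer, and on an unsafe flag the buffer is rolled back while an introspective critique is appended to the context before decoding resumes. `generate_token`, `guard_is_unsafe`, and `introspect` are hypothetical stand-ins for the model, the guard model, and the critique step.

```python
from typing import Callable, List

def care_decode(generate_token: Callable[[List[str]], str],
                guard_is_unsafe: Callable[[List[str]], bool],
                introspect: Callable[[List[str]], str],
                context: List[str], max_tokens: int = 64, buffer_size: int = 8) -> List[str]:
    output, buffer = [], []
    for _ in range(max_tokens):
        buffer.append(generate_token(context + output + buffer))
        if guard_is_unsafe(buffer):                  # real-time safety monitoring
            critique = introspect(output + buffer)   # self-reflective critique
            context = context + [critique]           # reflection guides later decoding
            buffer = []                              # rollback: discard the unsafe draft
        elif len(buffer) >= buffer_size:
            output, buffer = output + buffer, []     # commit the screened buffer
    return output + buffer

# Toy demo with stand-in components.
toks = iter(["fine"] * 20 + ["risky"] + ["fine"] * 40)
print(" ".join(care_decode(lambda ctx: next(toks),
                           lambda buf: "risky" in buf,
                           lambda hist: "<critique>",
                           context=["<prompt>"], max_tokens=30)))
```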

AAAI Conference 2025 Conference Paper

Debate Helps Weak-to-Strong Generalization

  • Hao Lang
  • Fei Huang
  • Yongbin Li

Common methods for aligning already-capable models with desired behavior rely on the ability of humans to provide supervision. However, future superhuman models will surpass the capability of humans. Therefore, humans will only be able to weakly supervise superhuman models. This expected deficiency of human evaluation would weaken the safety of future AI systems. Scalable oversight and weak-to-strong generalization are two complementary approaches to tackle this issue. In this paper, we attempt to combine the strengths of these two approaches to further improve alignment. Specifically, we investigate ways of improving human supervision with a strong pretrained model and then supervise the strong model with enhanced weak human supervision. To make iterative empirical progress, we consider an analogy: can we use a strong model to improve weak model supervision and then use it to supervise the strong model? We empirically test it by finetuning a small weak model on ground truth labels with the additional help from a large strong model, and then finetuning the strong model on labels generated by the weak model. We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model, which provides leverage as context on samples when training a weak model. We also show that an ensemble of weak models helps exploit long arguments generated by strong model debaters and obtain a more robust supervision estimate. Extensive experiments on the OpenAI weak-to-strong NLP benchmarks show that the combination approach leads to better alignment, which indicates that debate has the potential to help weak-to-strong generalization.

NeurIPS Conference 2025 Conference Paper

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

  • Zihan Qiu
  • Zekun Wang
  • Bo Zheng
  • Zeyu Huang
  • Kaiyue Wen
  • Songlin Yang
  • Rui Men
  • Le Yu

Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention. Yet, existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we perform a comprehensive comparison over 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset. Our central finding is that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find this sparse gating mechanism mitigates massive activations and attention sinks, and enhances long-context extrapolation performance. We also release the related code (https://github.com/qiuzh20/gated_attention) and models (https://huggingface.co/QwQZh/gated_attention) to facilitate future research. Furthermore, the most effective SDPA output gating is used in the Qwen3-Next models (https://huggingface.co/collections/Qwen/qwen3-next).
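
A minimal sketch of the modification highlighted above, assuming the gate is a query-dependent linear-plus-sigmoid applied elementwise per head to the SDPA output; the exact parameterization in the released code may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSelfAttention(nn.Module):
    """Softmax attention with a head-specific sigmoid gate after SDPA (illustrative)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gate = nn.Linear(d_model, n_heads * self.d_head)  # query-dependent, per-head gate
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda z: z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(split(q), split(k), split(v), is_causal=True)
        # Sigmoid gate modulates the SDPA output elementwise, head by head.
        g = torch.sigmoid(self.gate(x)).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        attn = attn * g
        return self.out(attn.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(2, 16, 512)
print(GatedSelfAttention(512, 8)(x).shape)  # torch.Size([2, 16, 512])
```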

NeurIPS Conference 2025 Conference Paper

Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation

  • Yuyang Wanyan
  • Xi Zhang
  • Haiyang Xu
  • Haowei Liu
  • Junyang Wang
  • Jiabo Ye
  • Yutong Kou
  • Ming Yan

In recent years, Multimodal Large Language Models (MLLMs) have been extensively utilized for multimodal reasoning tasks, including Graphical User Interface (GUI) automation. Unlike general offline multimodal tasks, GUI automation is executed in online interactive environments, necessitating step-by-step decision-making based on the real-time status of the environment. This task has a lower tolerance for decision-making errors at each step, as any mistake may cumulatively disrupt the process and potentially lead to irreversible outcomes such as deletions or payments. To address these issues, we introduce a pre-operative critic mechanism that provides effective feedback prior to the actual execution by reasoning about the potential outcome and correctness of actions. Specifically, we propose a Suggestion-aware Group Relative Policy Optimization (S-GRPO) strategy to construct our pre-operative critic model GUI-Critic-R1, incorporating a novel suggestion reward to enhance the reliability of the model's feedback. Furthermore, we develop a reasoning-bootstrapping-based data collection pipeline to create the GUI-Critic-Train and GUI-Critic-Test datasets, filling existing gaps in GUI critic data. Static experiments on GUI-Critic-Test across both mobile and web domains reveal that GUI-Critic-R1 offers significant advantages in critic accuracy compared to current MLLMs. Dynamic evaluation on a GUI automation benchmark further highlights the effectiveness and superiority of our model, as evidenced by improved success rates and operational efficiency. The code is available at https://github.com/X-PLUG/MobileAgent/tree/main/GUI-Critic-R1.

NeurIPS Conference 2025 Conference Paper

OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-time Emotional Speech Synthesis

  • Run Luo
  • Ting-En Lin
  • Haonan Zhang
  • Yuchuan Wu
  • Xiong Liu
  • Yongbin Li
  • Longze Chen
  • Jiaming Li

Recent advancements in omnimodal learning have significantly improved understanding and generation across images, text, and speech, yet these developments remain predominantly confined to proprietary models. The lack of high-quality omnimodal datasets and the challenges of real-time emotional speech synthesis have notably hindered progress in open-source research. To address these limitations, we introduce OpenOmni, a two-stage training framework that integrates omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pretrained speech model undergoes further training on image-text tasks, enabling (near) zero-shot generalization from vision to speech and outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder is trained on speech tasks with direct preference optimization, which enables real-time emotional speech synthesis with high fidelity. Extensive experiments demonstrate that OpenOmni surpasses state-of-the-art models across omnimodal, vision-language, and speech-language benchmarks. It achieves a 4-point absolute improvement on OmniBench over the leading open-source model VITA, despite using 5× fewer training examples and a smaller model size (7B vs. 7×8B). In addition, OpenOmni achieves real-time speech generation with less than 1 second of latency in non-autoregressive mode, reducing inference time by 5× compared to autoregressive methods, and improves emotion classification accuracy by 7.7%. The codebase is available at https://github.com/RainBowLuoCS/OpenOmni.

NeurIPS Conference 2025 Conference Paper

PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts

  • Yiming Wang
  • Pei Zhang
  • Jialong Tang
  • Hao-Ran Wei
  • Baosong Yang
  • Rui Wang
  • Chenshu Sun
  • Feitong Sun

In this paper, we introduce PolyMath, a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs. We conduct a comprehensive evaluation of advanced LLMs and find that even Qwen3-235B-A22B-Thinking and Gemini-2.5-Pro achieve only 54.6 and 52.2 benchmark scores, respectively, with about 40% accuracy at the highest difficulty level. From a language perspective, our benchmark reveals several key challenges for LLMs in multilingual reasoning: (1) Reasoning performance varies widely across languages for current LLMs; (2) Input-output language consistency is low in reasoning LLMs and may be correlated with performance; (3) Thinking length differs significantly by language for current LLMs. Additionally, we demonstrate that controlling the output language in the instructions has the potential to affect reasoning performance, especially for some low-resource languages, suggesting a promising direction for improving multilingual capabilities in LLMs.

NeurIPS Conference 2025 Conference Paper

Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding

  • Yiming Wang
  • Pei Zhang
  • Siyuan Huang
  • Baosong Yang
  • Zhuosheng Zhang
  • Fei Huang
  • Rui Wang

Test-time scaling enhances large language model performance by allocating additional compute resources during decoding. Best-of-N (BoN) sampling serves as a common sampling-based scaling technique, broadening the search space in parallel to find better solutions from the model distribution. However, its cost–performance trade-off is still underexplored. Two main challenges limit the efficiency of BoN sampling: (1) Generating N full samples consumes substantial GPU memory, reducing inference capacity under limited resources. (2) Reward models add extra memory and latency overhead, and training strong reward models introduces potential training data costs. Although some studies have explored efficiency improvements, none have addressed both challenges at once. To address this gap, we propose Self-Truncation Best-of-N (ST-BoN), a decoding method that avoids fully generating all N samples and eliminates the need for reward models. It leverages early sampling consistency in the model’s internal states to identify the most promising path and truncate suboptimal ones. In terms of cost, ST-BoN reduces dynamic GPU memory usage by over 80% and inference latency by 50%. In terms of cost–performance trade-off, ST-BoN achieves the same performance as Full-BoN while saving computational cost by 70%–80%, and under the same cost, it can improve accuracy by 3–4 points.
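
A self-contained sketch of the self-estimation step described above: after decoding a short prefix for each of the N candidates, each candidate is scored by the consistency (mean cosine similarity) of its pooled hidden state with the others, and only the most consistent path is kept. Pooling the prefix hidden states into a single vector and the argmax selection rule are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def select_most_consistent(prefix_hidden: torch.Tensor) -> int:
    """prefix_hidden: (N, d), one pooled hidden-state vector per candidate prefix."""
    sims = F.cosine_similarity(prefix_hidden.unsqueeze(1), prefix_hidden.unsqueeze(0), dim=-1)
    sims.fill_diagonal_(0.0)                          # ignore self-similarity
    consistency = sims.sum(dim=1) / (len(prefix_hidden) - 1)
    return int(consistency.argmax())                  # index of the path to keep decoding

hidden = torch.randn(8, 4096)                         # e.g. N = 8 sampled prefixes
keep = select_most_consistent(hidden)
print(f"continue decoding only candidate {keep}, truncate the rest")
```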

NeurIPS Conference 2025 Conference Paper

VLM-R³: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought

  • Chaoya Jiang
  • Yongrui Heng
  • Wei Ye
  • Haiyang Xu
  • Ming Yan
  • Ji Zhang
  • Fei Huang
  • Shikun Zhang

Recently, reasoning-based MLLMs have achieved a degree of success in generating long-form textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on and revisiting of visual regions to achieve precise grounding of textual reasoning in visual evidence. We introduce VLM-R³ (Visual Language Model with Region Recognition, Reasoning, and Refinement), a framework that equips an MLLM with the ability to (i) decide when additional visual evidence is needed, (ii) determine where to ground within the image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved chain-of-thought. The core of our method is Region-Conditioned Reinforcement Policy Optimization (R-GRPO), a training paradigm that rewards the model for selecting informative regions, formulating appropriate transformations (e.g., crop, zoom), and integrating the resulting visual context into subsequent reasoning steps. To bootstrap this policy, we compile a modest but carefully curated Visuo-Lingual Interleaved Rationale (VLIR) corpus that provides step-level supervision on region selection and textual justification. Extensive experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R³ sets a new state of the art in zero-shot and few-shot settings, with the largest gains appearing on questions demanding subtle spatial reasoning or fine-grained visual cue extraction.

NeurIPS Conference 2025 Conference Paper

VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning

  • Qiuchen Wang
  • Ruixue Ding
  • Yu Zeng
  • Zehui Chen
  • Lin Chen
  • Shihang Wang
  • Pengjun Xie
  • Fei Huang

Effectively retrieving, reasoning over, and understanding visually rich information remains a challenge for traditional Retrieval-Augmented Generation (RAG) methods. On the one hand, traditional text-based methods cannot handle visual-related information. On the other hand, current vision-based RAG approaches are often limited by fixed pipelines and frequently struggle to reason effectively due to the insufficient activation of the fundamental capabilities of models. As reinforcement learning (RL) has been proven to be beneficial for model reasoning, we introduce VRAG-RL, a novel RL framework tailored for complex reasoning across visually rich information. With this framework, VLMs interact with search engines, autonomously sampling single-turn or multi-turn reasoning trajectories with the help of visual perception tokens and undergoing continual optimization based on these samples. Our approach highlights key limitations of RL in RAG domains: (i) Prior multi-modal RAG approaches tend to merely incorporate images into the context, leading to insufficient reasoning token allocation and neglecting visual-specific perception; and (ii) When models interact with search engines, their queries often fail to retrieve relevant information due to the inability to articulate requirements, thereby leading to suboptimal performance. To address these challenges, we define an action space tailored for visually rich inputs, with actions including cropping and scaling, allowing the model to gather information from a coarse-to-fine perspective. Furthermore, to bridge the gap between users' original inquiries and the retriever, we employ a simple yet effective reward that integrates query rewriting and retrieval performance with a model-based reward. Our VRAG-RL optimizes VLMs for RAG tasks using specially designed RL strategies, aligning the model with real-world applications. Extensive experiments on diverse and challenging benchmarks show that VRAG-RL outperforms existing methods by 20% (Qwen2.5-VL-7B) and 30% (Qwen2.5-VL-3B), demonstrating the effectiveness of our approach. The code is available at https://github.com/Alibaba-NLP/VRAG.

NeurIPS Conference 2025 Conference Paper

WebDancer: Towards Autonomous Information Seeking Agency

  • Jialong Wu
  • Baixuan Li
  • Runnan Fang
  • Wenbiao Yin
  • Liwen Zhang
  • Zhenglin Wang
  • Zhengwei Tao
  • Ding-Chu Zhang

Addressing intricate real-world problems necessitates in-depth information seeking and multi-step reasoning. Recent progress in agentic systems, exemplified by Deep Research, underscores the potential for autonomous multi-step research. In this work, we present a cohesive paradigm for building end-to-end agentic information seeking agents from a data-centric and training-stage perspective. Our approach consists of four key stages: (1) browsing data construction, (2) trajectory sampling, (3) supervised fine-tuning for an effective cold start, and (4) reinforcement learning for enhanced generalisation. We instantiate this framework in a web agent based on the ReAct format, WebDancer. Empirical evaluations on the challenging GAIA and WebWalkerQA benchmarks demonstrate the strong performance of WebDancer, achieving considerable results and highlighting the efficacy of our training paradigm. Further analysis of agent training provides valuable insights and actionable, systematic pathways for developing more capable agentic models.

NeurIPS Conference 2025 Conference Paper

WritingBench: A Comprehensive Benchmark for Generative Writing

  • Yuning Wu
  • Jiahao Mei
  • Ming Yan
  • Chenliang Li
  • Shaopeng Lai
  • Yuran Ren
  • Zijia Wang
  • Ji Zhang

Recent advancements in large language models (LLMs) have significantly enhanced text generation capabilities, yet evaluating their performance in generative writing remains a challenge. Existing benchmarks primarily focus on generic text generation or are limited to a narrow set of writing tasks, failing to capture the diverse requirements of high-quality written content across various domains. To bridge this gap, we present WritingBench, a comprehensive benchmark designed to evaluate LLMs across 6 core writing domains and 100 subdomains. We further propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluations in style, format, and length. The framework's validity is further demonstrated by its data curation capability, which enables a 7B-parameter model to outperform GPT-4o in writing. We open-source the benchmark, along with evaluation tools and modular framework components, to advance the development of LLMs in writing.

TMLR Journal 2024 Journal Article

A Survey on Out-of-Distribution Detection in NLP

  • Hao Lang
  • Yinhe Zheng
  • Yixuan Li
  • Jian Sun
  • Fei Huang
  • Yongbin Li

Out-of-distribution (OOD) detection is essential for the reliable and safe deployment of machine learning systems in the real world. Great progress has been made over the past years. This paper presents the first review of recent advances in OOD detection with a particular focus on natural language processing approaches. First, we provide a formal definition of OOD detection and discuss several related fields. We then categorize recent algorithms into three classes according to the data they used: (1) OOD data available, (2) OOD data unavailable + in-distribution (ID) label available, and (3) OOD data unavailable + ID label unavailable. Third, we introduce datasets, applications, and metrics. Finally, we summarize existing work and present potential future research topics.

NeurIPS Conference 2024 Conference Paper

Agent Planning with World Knowledge Model

  • Shuofei Qiao
  • Runnan Fang
  • Ningyu Zhang
  • Yuqi Zhu
  • Xiang Chen
  • Shumin Deng
  • Yong Jiang
  • Pengjun Xie

Recent endeavors towards directly using large language models (LLMs) as agent models to execute interactive planning tasks have shown commendable results. Despite their achievements, however, they still struggle with brainless trial-and-error in global planning and generating hallucinatory actions in local planning due to their poor understanding of the "real" physical world. Imitating humans' mental world knowledge model, which provides global prior knowledge before the task and maintains local dynamic knowledge during the task, in this paper we introduce a parametric World Knowledge Model (WKM) to facilitate agent planning. Concretely, we steer the agent model to self-synthesize knowledge from both expert and sampled trajectories. Then we develop WKM, providing prior task knowledge to guide global planning and dynamic state knowledge to assist local planning. Experimental results on three real-world simulated datasets with Mistral-7B, Gemma-7B, and Llama-3-8B demonstrate that our method can achieve superior performance compared to various strong baselines. We further analyze and illustrate that our WKM can effectively alleviate the blind trial-and-error and hallucinatory action issues, providing strong support for the agent's understanding of the world. Other interesting findings include: 1) our instance-level task knowledge can generalize better to unseen tasks, 2) a weak WKM can guide strong agent model planning, and 3) unified WKM training has promising potential for further development.

AAAI Conference 2024 Conference Paper

EcomGPT: Instruction-Tuning Large Language Models with Chain-of-Task Tasks for E-commerce

  • Yangning Li
  • Shirong Ma
  • Xiaobin Wang
  • Shen Huang
  • Chengyue Jiang
  • Hai-Tao Zheng
  • Pengjun Xie
  • Fei Huang

Recently, instruction-following Large Language Models (LLMs), represented by ChatGPT, have exhibited exceptional performance in general Natural Language Processing (NLP) tasks. However, the unique characteristics of E-commerce data pose significant challenges to general LLMs. An LLM tailored specifically for E-commerce scenarios, possessing robust cross-dataset/task generalization capabilities, is a pressing necessity. To solve this issue, in this work we propose the first E-commerce instruction dataset, EcomInstruct, with a total of 2.5 million instruction instances. EcomInstruct scales up the data size and task diversity by constructing atomic tasks from basic E-commerce data types, such as product information and user reviews. Atomic tasks are defined as intermediate tasks implicitly involved in solving a final task, which we also call Chain-of-Task tasks. We developed EcomGPT at different parameter scales by training the backbone model BLOOMZ with EcomInstruct. Benefiting from the fundamental semantic understanding capabilities acquired from the Chain-of-Task tasks, EcomGPT exhibits excellent zero-shot generalization capabilities. Extensive experiments and human evaluations demonstrate that EcomGPT outperforms ChatGPT in terms of cross-dataset/task generalization on E-commerce tasks. EcomGPT will be made public at https://github.com/Alibaba-NLP/EcomGPT.

JBHI Journal 2024 Journal Article

Ensemble Vision Transformer for Dementia Diagnosis

  • Fei Huang
  • Anqi Qiu

In recent years, deep learning has gained momentum in computer-aided Alzheimer's Disease (AD) diagnosis. This study introduces a novel approach, the Monte Carlo Ensemble Vision Transformer (MC-ViT), which develops an ensemble approach with the Vision Transformer (ViT). Instead of using traditional ensemble methods that deploy multiple learners, our approach employs a single Vision Transformer learner. By harnessing Monte Carlo sampling, this method produces a broad spectrum of classification decisions, enhancing MC-ViT performance. This technique adeptly overcomes the limitation of 3D patch convolutional neural networks that characterize only part of the whole brain anatomy, paving the way for a neural network adept at discerning 3D inter-feature correlations. Evaluations using the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset with 7199 scans and the Open Access Series of Imaging Studies-3 (OASIS-3) with 1992 scans showcased its performance. With minimal preprocessing, our approach achieved an impressive 90% accuracy in AD classification, surpassing both 2D-slice CNNs and 3D CNNs.
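
The abstract attributes the ensemble effect to Monte Carlo sampling from a single ViT learner. One common way to realize this is MC dropout, keeping dropout active at inference and averaging T stochastic forward passes; this is an illustrative reading rather than the paper's exact scheme.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_ensemble_predict(model: nn.Module, x: torch.Tensor, T: int = 20) -> torch.Tensor:
    """Average the predictions of T stochastic forward passes of a single learner."""
    model.eval()
    for m in model.modules():                      # re-enable only the dropout layers
        if isinstance(m, nn.Dropout):
            m.train()
    probs = torch.stack([model(x).softmax(dim=-1) for _ in range(T)])
    return probs.mean(dim=0)                       # averaged class probabilities

# Toy stand-in for a 3D-input classifier.
toy = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 32, 64), nn.ReLU(),
                    nn.Dropout(0.2), nn.Linear(64, 2))
print(mc_ensemble_predict(toy, torch.randn(4, 32, 32, 32)).shape)  # torch.Size([4, 2])
```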

NeurIPS Conference 2024 Conference Paper

EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations

  • Jia Li
  • Ge Li
  • Xuanming Zhang
  • YunFei Zhao
  • Yihong Dong
  • Zhi Jin
  • Binhua Li
  • Fei Huang

How to evaluate Large Language Models (LLMs) in code generation remains an open question. Many benchmarks have been proposed, but they have two limitations, i.e., data leakage and lack of domain-specific evaluation. The former hurts the fairness of benchmarks, and the latter hinders practitioners from selecting superior LLMs for specific programming domains. To address these two limitations, we propose a new benchmark - EvoCodeBench, which has the following advances: (1) Evolving data. EvoCodeBench will be dynamically updated every period (e.g., 6 months) to avoid data leakage. This paper releases the first version - EvoCodeBench-2403, containing 275 samples from 25 repositories. (2) A domain taxonomy and domain labels. Based on the statistics of open-source communities, we design a programming domain taxonomy consisting of 10 popular domains. Based on the taxonomy, we annotate each sample in EvoCodeBench with a domain label. EvoCodeBench provides a broad platform for domain-specific evaluations. (3) Domain-specific evaluations. Besides the Pass@k, we compute the Domain-Specific Improvement (DSI) and define LLMs' comfort and strange domains. These evaluations help practitioners select superior LLMs in specific domains and discover the shortcomings of existing LLMs. Besides, EvoCodeBench is collected by a rigorous pipeline and aligns with real-world repositories in multiple aspects (e.g., code distributions). We evaluate 8 popular LLMs (e.g., gpt-4, DeepSeek Coder, StarCoder 2) on EvoCodeBench and summarize some insights. EvoCodeBench reveals the actual abilities of these LLMs in real-world repositories. For example, the highest Pass@1 of gpt-4 on EvoCodeBench-2403 is only 20.74%. Besides, we evaluate LLMs in different domains and discover their comfort and strange domains. For example, gpt-4 performs best in most domains but falls behind others in the Internet domain. StarCoder 2-15B unexpectedly performs well in the Database domain and even outperforms 33B LLMs. We release EvoCodeBench, all prompts, and LLMs' completions for further community analysis.
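
The benchmark reports Pass@k alongside its domain-specific metrics; for reference, the standard unbiased Pass@k estimator (given n generated samples of which c pass the tests) can be computed as below. The DSI metric is defined in the paper and is not reproduced here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator: pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:          # every size-k draw must contain at least one passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=20, c=3, k=1), 4))  # 0.15
print(round(pass_at_k(n=20, c=3, k=5), 4))
```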

IJCAI Conference 2024 Conference Paper

FactCHD: Benchmarking Fact-Conflicting Hallucination Detection

  • Xiang Chen
  • Duanzheng Song
  • Honghao Gui
  • Chenxi Wang
  • Ningyu Zhang
  • Yong Jiang
  • Fei Huang
  • Chengfei Lyu

Despite their impressive generative capabilities, LLMs are hindered by fact-conflicting hallucinations in real-world applications. The accurate identification of hallucinations in texts generated by LLMs, especially in complex inferential scenarios, is a relatively unexplored area. To address this gap, we present FactCHD, a dedicated benchmark designed for the detection of fact-conflicting hallucinations from LLMs. FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation. A distinctive element of FactCHD is its integration of fact-based evidence chains, significantly enhancing the depth of evaluating the detectors' explanations. Experiments on different LLMs expose the shortcomings of current approaches in detecting factual errors accurately. Furthermore, we introduce TRUTH-TRIANGULATOR which synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2, aiming to yield more credible detection through the amalgamation of predictive results and evidence.

NeurIPS Conference 2024 Conference Paper

MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model

  • Chaoya Jiang
  • Hongrui Jia
  • Haiyang Xu
  • Wei Ye
  • Mengfan Dong
  • Ming Yan
  • Ji Zhang
  • Fei Huang

This paper presents MaVEn, an innovative Multi-granularity Visual Encoding framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning. Current MLLMs primarily focus on single-image visual understanding, limiting their ability to interpret and integrate information across multiple images. MaVEn addresses this limitation by combining discrete visual symbol sequences, which abstract coarse-grained semantic concepts, with traditional continuous representation sequences that model fine-grained features. This dual approach bridges the semantic gap between visual and textual data, thereby improving the model's ability to process and interpret information from multiple images effectively. Additionally, we design a dynamic reduction mechanism for long-sequence continuous features to enhance multi-image processing efficiency. Experimental results demonstrate that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.

NeurIPS Conference 2024 Conference Paper

Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

  • Junyang Wang
  • Haiyang Xu
  • Haitao Jia
  • Xi Zhang
  • Ming Yan
  • Weizhou Shen
  • Ji Zhang
  • Fei Huang

Mobile device operation tasks are increasingly becoming a popular multi-modal AI application scenario. Current Multi-modal Large Language Models (MLLMs), constrained by their training data, lack the capability to function effectively as operation assistants. Instead, MLLM-based agents, which enhance capabilities through tool invocation, are gradually being applied to this scenario. However, the two major navigation challenges in mobile device operation tasks, task progress navigation and focus content navigation, are difficult to solve effectively under the single-agent architecture of existing work. This is due to the overly long token sequences and the interleaved text-image data format, which limit performance. To address these navigation challenges effectively, we propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: a planning agent, a decision agent, and a reflection agent. The planning agent condenses lengthy, interleaved image-text operation histories and screen summaries into a pure-text task progress, which is then passed on to the decision agent. This reduction in context length makes it easier for the decision agent to navigate the task progress. To retain focus content, we design a memory unit that is updated with task progress by the decision agent. Additionally, to correct erroneous operations, the reflection agent observes the outcome of each operation and handles any mistakes accordingly. Experimental results indicate that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture of Mobile-Agent. The code is open-sourced at https://github.com/X-PLUG/MobileAgent.

AAAI Conference 2024 Conference Paper

Preference Ranking Optimization for Human Alignment

  • Feifan Song
  • Bowen Yu
  • Minghao Li
  • Haiyang Yu
  • Fei Huang
  • Yongbin Li
  • Houfeng Wang

Large language models (LLMs) often produce misleading content, emphasizing the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment. However, it has two main drawbacks: (1) RLHF exhibits complexity, instability, and sensitivity to hyperparameters in contrast to SFT. (2) Despite massive trial-and-error, multiple sampling is reduced to pair-wise contrast, thus lacking contrasts from a macro perspective. In this paper, we propose Preference Ranking Optimization (PRO) as an efficient SFT algorithm to directly fine-tune LLMs for human alignment. PRO extends the pair-wise contrast to accommodate preference rankings of any length. By iteratively contrasting candidates, PRO instructs the LLM to prioritize the best response while progressively ranking the remaining responses. In this manner, PRO effectively transforms human alignment into aligning the probability ranking of n responses generated by the LLM with the preference ranking of humans towards these responses. Experiments show that PRO outperforms baseline algorithms, achieving comparable results to ChatGPT and human responses through automatic, reward-based, GPT-4-based, and human evaluations.
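
A hedged sketch of the ranking objective described above: with model scores for n responses sorted best to worst, each step contrasts the current best against everything ranked below it, extending pair-wise contrast to a full ranking. Using the (length-normalized) log-likelihood as the score is an assumption for illustration.

```python
import torch

def pro_ranking_loss(scores: torch.Tensor) -> torch.Tensor:
    """scores: (n,) model scores for responses sorted from most to least preferred."""
    loss = 0.0
    for k in range(len(scores) - 1):
        # Softmax over the k-th response and everything ranked below it;
        # the k-th response should win that contrast.
        loss = loss - torch.log_softmax(scores[k:], dim=0)[0]
    return loss

scores = torch.tensor([-1.2, -1.5, -2.3, -3.0], requires_grad=True)
print(pro_ranking_loss(scores))
```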

NeurIPS Conference 2024 Conference Paper

Self-Retrieval: End-to-End Information Retrieval with One Large Language Model

  • Qiaoyu Tang
  • Jiawei Chen
  • Zhuoqun Li
  • Bowen Yu
  • Yaojie Lu
  • Cheng Fu
  • Haiyang Yu
  • Hongyu Lin

The rise of large language models (LLMs) has significantly transformed both the construction and application of information retrieval (IR) systems. However, current interactions between IR systems and LLMs remain limited, with LLMs merely serving as components within IR systems and IR systems being constructed independently of LLMs. This separated architecture restricts knowledge sharing and deep collaboration between them. In this paper, we introduce Self-Retrieval, a novel end-to-end LLM-driven information retrieval architecture. Self-Retrieval unifies all essential IR functions within a single LLM, leveraging the inherent capabilities of LLMs throughout the IR process. Specifically, Self-Retrieval internalizes the retrieval corpus through self-supervised learning, transforms the retrieval process into sequential passage generation, and performs relevance assessment for reranking. Experimental results demonstrate that Self-Retrieval not only outperforms existing retrieval approaches by a significant margin, but also substantially enhances the performance of LLM-driven downstream applications like retrieval-augmented generation.

AAAI Conference 2024 Conference Paper

SeqGPT: An Out-of-the-Box Large Language Model for Open Domain Sequence Understanding

  • Tianyu Yu
  • Chengyue Jiang
  • Chao Lou
  • Shen Huang
  • Xiaobin Wang
  • Wei Liu
  • Jiong Cai
  • Yangning Li

Large language models (LLMs) have shown impressive abilities for open-domain NLP tasks. However, LLMs are sometimes too footloose for natural language understanding (NLU) tasks, which always have restricted output and input formats. Their performance on NLU tasks is highly related to prompts or demonstrations, and they are shown to be poor at performing several representative NLU tasks, such as event extraction and entity typing. To this end, we present SeqGPT, a bilingual (i.e., English and Chinese) open-source autoregressive model specially enhanced for open-domain natural language understanding. We express all NLU tasks with two atomic tasks, which define fixed instructions to restrict the input and output format but remain "open" for arbitrarily varied label sets. The model is first instruction-tuned with extremely fine-grained labeled data synthesized by ChatGPT and then further fine-tuned on 233 different atomic tasks from 152 datasets across various domains. The experimental results show that SeqGPT has decent classification and extraction ability, and is capable of performing language understanding tasks on unseen domains. We also conduct empirical studies on the scaling of data and model size as well as on the transfer across tasks. Our models are accessible at https://github.com/Alibaba-NLP/SeqGPT.

NeurIPS Conference 2024 Conference Paper

WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models

  • Peng Wang
  • Zexi Li
  • Ningyu Zhang
  • Ziwen Xu
  • Yunzhi Yao
  • Yong Jiang
  • Pengjun Xie
  • Fei Huang

Large language models (LLMs) need knowledge updates to keep up with ever-growing world facts and to correct hallucinated responses, motivating methods for lifelong model editing. Where the updated knowledge resides in memories is a fundamental question for model editing. In this paper, we find that editing either long-term memory (direct model parameters) or working memory (non-parametric knowledge of neural network activations/representations obtained by retrieval) results in an impossible triangle: reliability, generalization, and locality cannot be realized together in the lifelong editing setting. For long-term memory, directly editing the parameters causes conflicts with irrelevant pretrained knowledge or previous edits (poor reliability and locality). For working memory, retrieval-based activations can hardly make the model understand the edits and generalize (poor generalization). Therefore, we propose WISE to bridge the gap between memories. In WISE, we design a dual parametric memory scheme, which consists of a main memory for the pretrained knowledge and a side memory for the edited knowledge. We only edit the knowledge in the side memory and train a router to decide which memory to go through when given a query. For continual editing, we devise a knowledge-sharding mechanism in which different sets of edits reside in distinct subspaces of parameters and are subsequently merged into a shared memory without conflicts. Extensive experiments show that WISE can outperform previous model editing methods and overcome the impossible triangle under lifelong model editing in question answering, hallucination, and out-of-distribution settings across trending LLM architectures, e.g., GPT, LLaMA, and Mistral.
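
A hedged sketch of the dual parametric memory idea: a side copy of an FFN holds the edited knowledge, and a router decides per query whether activations flow through the main (pretrained) or side (edited) memory. The scalar sigmoid router and hard threshold here are illustrative assumptions, not the paper's exact routing criterion.

```python
import torch
import torch.nn as nn

class DualMemoryFFN(nn.Module):
    """Main memory for pretrained knowledge, side memory for edits, router in between."""
    def __init__(self, dim: int, threshold: float = 0.5):
        super().__init__()
        self.main = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.side = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.router = nn.Linear(dim, 1)            # scores whether the query hits an edit
        self.threshold = threshold

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        score = torch.sigmoid(self.router(h.mean(dim=1)))           # (batch, 1)
        use_side = (score > self.threshold).float().unsqueeze(-1)   # hard routing decision
        return use_side * self.side(h) + (1 - use_side) * self.main(h)

print(DualMemoryFFN(128)(torch.randn(2, 10, 128)).shape)  # torch.Size([2, 10, 128])
```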

EAAI Journal 2023 Journal Article

A general motion controller based on deep reinforcement learning for an autonomous underwater vehicle with unknown disturbances

  • Fei Huang
  • Jian Xu
  • Di Wu
  • Yunfei Cui
  • Zheping Yan
  • Wen Xing
  • Xun Zhang

This paper studies the application of deep Reinforcement Learning (RL) in the motion control of an underactuated autonomous underwater vehicle (AUV) with unknown disturbances. Firstly, a general state space, action space and reward function are designed for motion control problems rather than each specific motion control task, which ensures the generality of our method. Furthermore, a virtual AUV model with partial random disturbances is established, and on this basis, a simulation training method is developed to solve the problems of extremely high risk and extremely low efficiency caused by training in actual experiments. Then, in order to directly deploy the optimal control policy obtained through simulation training to an actual AUV, we employ Extended State Observers (ESOs) to estimate the unknown disturbances in five degrees of freedom, and give a deployment method using the estimated values as the disturbance state vector and compensation vector. Combining the above training method and deployment method, a novel general motion controller is proposed. Finally, four different AUV motion control simulations are carried out, and the results confirm the generality and effectiveness of our proposed controller.

AAAI Conference 2023 Conference Paper

Adversarial Self-Attention for Language Understanding

  • Hongqiu Wu
  • Ruixue Ding
  • Hai Zhao
  • Pengjun Xie
  • Fei Huang
  • Min Zhang

Deep neural models (e.g., Transformer) naturally learn spurious features, which create a "shortcut" between the labels and inputs, thus impairing generalization and robustness. This paper advances the self-attention mechanism to a robust variant for Transformer-based pre-trained language models (e.g., BERT). We propose the Adversarial Self-Attention mechanism (ASA), which adversarially biases the attention to effectively suppress the model's reliance on features (e.g., specific keywords) and encourage its exploration of broader semantics. We conduct comprehensive evaluations across a wide range of tasks for both the pre-training and fine-tuning stages. For pre-training, ASA yields remarkable performance gains compared to naive training over longer steps. For fine-tuning, ASA-empowered models outperform naive models by a large margin in terms of both generalization and robustness.

NeurIPS Conference 2023 Conference Paper

Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs

  • Jinyang Li
  • Binyuan Hui
  • Ge Qu
  • Jiaxi Yang
  • Binhua Li
  • Bowen Li
  • Bailin Wang
  • Bowen Qin

Text-to-SQL parsing, which aims at converting natural language instructions into executable SQLs, has gained increasing attention in recent years. In particular, GPT-4 and Claude-2 have shown impressive results in this task. However, most of the prevalent benchmarks, i.e., Spider and WikiSQL, focus on database schemas with few rows of database content, leaving a gap between academic study and real-world applications. To mitigate this gap, we present BIRD, a BIg benchmark for laRge-scale Database grounded in text-to-SQL tasks, containing 12,751 pairs of text-to-SQL data and 95 databases with a total size of 33.4 GB, spanning 37 professional domains. Our emphasis on database values highlights the new challenges of dirty database contents, external knowledge between NL questions and database contents, and SQL efficiency, particularly in the context of massive databases. To solve these problems, text-to-SQL models must feature database value comprehension in addition to semantic parsing. The experimental results demonstrate the significance of database values in generating accurate text-to-SQLs for big databases. Furthermore, even the most popular and effective text-to-SQL model, i.e., GPT-4, only achieves 54.89% execution accuracy, which is still far from the human result of 92.96%, proving that challenges still stand. We also provide an efficiency analysis to offer insights into generating text-to-efficient-SQLs that are beneficial to industries. We believe that BIRD will contribute to advancing real-world applications of text-to-SQL research. The leaderboard and source code are available at https://bird-bench.github.io/.

NeurIPS Conference 2023 Conference Paper

Debiased and Denoised Entity Recognition from Distant Supervision

  • Haobo Wang
  • Yiwen Dong
  • Ruixuan Xiao
  • Fei Huang
  • Gang Chen
  • Junbo Zhao

While distant supervision has been extensively explored and exploited in NLP tasks like named entity recognition, a major obstacle stems from the inevitably noisy distant labels tagged unsupervisedly. A few past works approach this problem by adopting a self-training framework with a sample-selection mechanism. In this work, we innovatively identify two types of biases that were omitted by prior work, and these biases lead to inferior performance in the distant-supervised NER setup. First, we characterize the noise concealed in the distant labels as highly structural rather than fully randomized. Second, the self-training framework would ubiquitously introduce an inherent bias that causes erroneous behavior in both sample selection and, eventually, prediction. To cope with these problems, we propose a novel self-training framework, dubbed DesERT. This framework augments the conventional NER predictive pathway to a dual form that effectively adapts the sample-selection process to conform to its innate distributional-bias structure. The other crucial component of DesERT is a debiased module aiming to enhance the token representations, and hence the quality of the pseudo-labels. Extensive experiments are conducted to validate DesERT. The results show that our framework establishes a new state-of-the-art performance, achieving a +2.22% average F1 score improvement on five standardized benchmarking datasets. Lastly, DesERT demonstrates its effectiveness under a new DSNER benchmark where additional distant supervision comes from the ChatGPT model.

NeurIPS Conference 2023 Conference Paper

EMMA-X: An EM-like Multilingual Pre-training Algorithm for Cross-lingual Representation Learning

  • Ping Guo
  • Xiangpeng Wei
  • Yue Hu
  • Baosong Yang
  • Dayiheng Liu
  • Fei Huang
  • Jun Xie

Expressing universal semantics common to all languages is helpful to understand the meanings of complex and culture-specific sentences. The research theme underlying this scenario focuses on learning universal representations across languages with the usage of massive parallel corpora. However, due to the sparsity and scarcity of parallel data, there is still a big challenge in learning authentic "universals" for any two languages. In this paper, we propose Emma-X: an EM-like Multilingual pre-training Algorithm, to learn Cross-lingual universals with the aid of excessive multilingual non-parallel data. Emma-X unifies the cross-lingual representation learning task and an extra semantic relation prediction task within an EM framework. Both the extra semantic classifier and the cross-lingual sentence encoder approximate the semantic relation of two sentences, and supervise each other until convergence. To evaluate Emma-X, we conduct experiments on xrete, a newly introduced benchmark containing 12 widely studied cross-lingual tasks that fully depend on sentence-level representations. Results reveal that Emma-X achieves state-of-the-art performance. Further geometric analysis of the built representation space with three requirements demonstrates the superiority of Emma-X over advanced models.

AAAI Conference 2023 Conference Paper

Graphix-T5: Mixing Pre-trained Transformers with Graph-Aware Layers for Text-to-SQL Parsing

  • Jinyang Li
  • Binyuan Hui
  • Reynold Cheng
  • Bowen Qin
  • Chenhao Ma
  • Nan Huo
  • Fei Huang
  • Wenyu Du

The task of text-to-SQL parsing, which aims at converting natural language questions into executable SQL queries, has garnered increasing attention in recent years. One of the major challenges in text-to-SQL parsing is domain generalization, i.e., how to generalize well to unseen databases. Recently, the pre-trained text-to-text transformer model, namely T5, though not specialized for text-to-SQL parsing, has achieved state-of-the-art performance on standard benchmarks targeting domain generalization. In this work, we explore ways to further augment the pre-trained T5 model with specialized components for text-to-SQL parsing. Such components are expected to introduce structural inductive bias into text-to-SQL parsers, thus improving the model’s capacity for (potentially multi-hop) reasoning, which is critical for generating structure-rich SQLs. To this end, we propose a new architecture, GRAPHIX-T5, a mixed model with the standard pre-trained transformer model augmented by specially-designed graph-aware layers. Extensive experiments and analysis demonstrate the effectiveness of GRAPHIX-T5 across four text-to-SQL benchmarks: SPIDER, SYN, REALISTIC and DK. GRAPHIX-T5 surpasses all other T5-based parsers by a significant margin, achieving new state-of-the-art performance. Notably, GRAPHIX-T5-large reaches performance superior to the original T5-large by 5.7% on exact match (EM) accuracy and 6.6% on execution accuracy (EX). This even outperforms T5-3B by 1.2% on EM and 1.5% on EX.

IJCAI Conference 2023 Conference Paper

One Model for All Domains: Collaborative Domain-Prefix Tuning for Cross-Domain NER

  • Xiang Chen
  • Lei Li
  • Shuofei Qiao
  • Ningyu Zhang
  • Chuanqi Tan
  • Yong Jiang
  • Fei Huang
  • Huajun Chen

Cross-domain NER is a challenging task for addressing the low-resource problem in practical scenarios. Previous typical solutions mainly obtain an NER model with pre-trained language models (PLMs) using data from a rich-resource domain and adapt it to the target domain. Owing to the mismatch issue among entity types in different domains, previous approaches normally tune all parameters of the PLMs, ending up with an entirely new NER model for each domain. Moreover, current models only focus on leveraging knowledge in one general source domain while failing to successfully transfer knowledge from multiple sources to the target. To address these issues, we introduce Collaborative Domain-Prefix Tuning for cross-domain NER (CP-NER) based on text-to-text generative PLMs. Specifically, we present text-to-text generation grounded on domain-related instructors to transfer knowledge to new-domain NER tasks without structural modifications. We utilize frozen PLMs and conduct collaborative domain-prefix tuning to stimulate the potential of PLMs to handle NER tasks across various domains. Experimental results on the Cross-NER benchmark show that the proposed approach has flexible transfer ability and performs better on both one-source and multiple-source cross-domain NER tasks.

NeurIPS Conference 2023 Conference Paper

RRHF: Rank Responses to Align Language Models with Human Feedback

  • Hongyi Yuan
  • Zheng Yuan
  • Chuanqi Tan
  • Wei Wang
  • Songfang Huang
  • Fei Huang

Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models with human preferences, significantly enhancing the quality of interactions between humans and models. InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO). However, PPO is sensitive to hyperparameters and requires multiple models in its standard implementation, making it hard to train and scale up to larger parameter counts. In contrast, we propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via the logarithm of conditional probabilities and learns to align these probabilities with human preferences through a ranking loss. RRHF can leverage sampled responses from various sources, including the model's own responses, responses from other large language models, and human expert responses, and learns to rank them. RRHF only needs 1 to 2 models during tuning and can efficiently align language models with human preferences robustly without complex hyperparameter tuning. Additionally, RRHF can be considered an extension of SFT and reward model training while being simpler than PPO in terms of coding, model counts, and hyperparameters. We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance to PPO by reward model score and human labeling. Extensive experiments show that the performance of RRHF is highly related to sampling quality, which suggests RRHF is a best-of-n learner.
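
A minimal sketch of the ranking loss this abstract describes, assuming `logp` holds the length-normalized conditional log-probability the model assigns to each candidate response and `reward` holds the corresponding preference scores; pairs where a less-preferred response outscores a more-preferred one are penalized, and a cross-entropy (SFT-style) term on the best response is added. The tensor plumbing is illustrative.

```python
import torch

def rrhf_ranking_loss(logp: torch.Tensor, reward: torch.Tensor) -> torch.Tensor:
    """logp, reward: (n,) one entry per candidate response."""
    # worse[i, j] is True when response i is preferred over response j.
    worse = reward.unsqueeze(1) > reward.unsqueeze(0)
    # margin[i, j] = logp[j] - logp[i]; positive means the worse response scores higher.
    margin = logp.unsqueeze(0) - logp.unsqueeze(1)
    rank = torch.relu(margin)[worse].sum()
    sft = -logp[reward.argmax()]                   # keep likelihood high on the best response
    return rank + sft

logp = torch.tensor([-1.1, -0.8, -2.0], requires_grad=True)   # model scores
reward = torch.tensor([0.9, 0.2, 0.5])                         # preference scores
print(rrhf_ranking_loss(logp, reward))
```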

NeurIPS Conference 2023 Conference Paper

SPA: A Graph Spectral Alignment Perspective for Domain Adaptation

  • Zhiqing Xiao
  • Haobo Wang
  • Ying Jin
  • Lei Feng
  • Gang Chen
  • Fei Huang
  • Junbo Zhao

Unsupervised domain adaptation (UDA) is a pivotal problem in machine learning for extending an in-domain model to distinctive target domains where the data distributions differ. Most prior works focus on capturing inter-domain transferability but largely overlook rich intra-domain structures, which empirically results in even worse discriminability. In this work, we introduce a novel graph SPectral Alignment (SPA) framework to tackle this tradeoff. The core of our method is briefly condensed as follows: (i) by casting the DA problem into graph primitives, SPA composes a coarse graph alignment mechanism with a novel spectral regularizer towards aligning the domain graphs in eigenspaces; (ii) we further develop a fine-grained message propagation module, built upon a novel neighbor-aware self-training mechanism, for enhanced discriminability in the target domain. On standardized benchmarks, extensive experiments show that SPA surpasses existing cutting-edge DA methods. Coupled with dense model analysis, we conclude that our approach indeed possesses superior efficacy, robustness, discriminability, and transferability. Code and data are available at: https://github.com/CrownX/SPA.

NeurIPS Conference 2023 Conference Paper

SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents

  • Shuzheng Si
  • Wentao Ma
  • Haoyu Gao
  • Yuchuan Wu
  • Ting-En Lin
  • Yinpei Dai
  • Hangyu Li
  • Rui Yan

Task-oriented dialogue (TOD) models have made significant progress in recent years. However, previous studies primarily focus on datasets written by annotators, which has resulted in a gap between academic research and real-world spoken conversation scenarios. While several small-scale spoken TOD datasets are proposed to address robustness issues such as ASR errors, they ignore the unique challenges in spoken conversation. To tackle the limitations, we introduce SpokenWOZ, a large-scale speech-text dataset for spoken TOD, containing 8 domains, 203k turns, 5.7k dialogues and 249 hours of audio from human-to-human spoken conversations. SpokenWOZ further incorporates common spoken characteristics such as word-by-word processing and reasoning in spoken language. Based on these characteristics, we present cross-turn slot and reasoning slot detection as new challenges. We conduct experiments on various baselines, including text-modal models, newly proposed dual-modal models, and LLMs, e.g., ChatGPT. The results show that the current models still have substantial room for improvement in spoken conversation, where the most advanced dialogue state tracker only achieves 25.65% in joint goal accuracy and the SOTA end-to-end model only correctly completes the user request in 52.1% of dialogues. Our dataset, code, and leaderboard are available at https://spokenwoz.github.io/SpokenWOZ-github.io/.

NeurIPS Conference 2022 Conference Paper

Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning

  • Xiang Chen
  • Lei Li
  • Ningyu Zhang
  • Xiaozhuan Liang
  • Shumin Deng
  • Chuanqi Tan
  • Fei Huang
  • Luo Si

Prompt learning approaches have made waves in natural language processing by inducing better few-shot performance, yet they still follow a parametric-based learning paradigm in which forgetting and rote memorization can cause unstable generalization. Specifically, vanilla prompt learning may struggle to utilize atypical instances by rote during fully-supervised training or overfit shallow patterns with low-shot data. To alleviate such limitations, we develop RetroPrompt with the motivation of decoupling knowledge from memorization to help the model strike a balance between generalization and memorization. In contrast with vanilla prompt learning, RetroPrompt constructs an open-book knowledge-store from training instances and implements a retrieval mechanism during input, training and inference, thus equipping the model with the ability to retrieve related contexts from the training corpus as cues for enhancement. Extensive experiments demonstrate that RetroPrompt can obtain better performance in both few-shot and zero-shot settings. Besides, we further illustrate that our proposed RetroPrompt can yield better generalization abilities on new datasets. Detailed analysis of memorization indeed reveals that RetroPrompt can reduce the reliance of language models on memorization, thus improving generalization for downstream tasks. Code is available at https://github.com/zjunlp/PromptKG/tree/main/research/RetroPrompt.
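
The open-book retrieval step can be pictured with the minimal sketch below: a store of training-instance embeddings is queried by cosine similarity, and the nearest texts are prepended to the input as cues. The class names, separator token, and encoder are placeholders, not RetroPrompt's actual interface.

```python
# Hypothetical retrieval-augmented prompt construction (illustrative only).
import numpy as np

class RetrievalStore:
    def __init__(self, embeddings, texts):
        # embeddings: (N, d) float array for N training instances; texts: list of N strings
        self.emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.texts = texts

    def retrieve(self, query_emb, top_k=3):
        q = query_emb / np.linalg.norm(query_emb)
        scores = self.emb @ q                      # cosine similarity to every stored instance
        idx = np.argsort(-scores)[:top_k]
        return [self.texts[i] for i in idx]

def build_prompt(query_text, store, query_emb):
    cues = store.retrieve(query_emb)
    return " [SEP] ".join(cues + [query_text])     # retrieved contexts prepended as cues
```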

AAAI Conference 2022 Conference Paper

From Dense to Sparse: Contrastive Pruning for Better Pre-trained Language Model Compression

  • Runxin Xu
  • Fuli Luo
  • Chengyu Wang
  • Baobao Chang
  • Jun Huang
  • Songfang Huang
  • Fei Huang

Pre-trained Language Models (PLMs) have achieved great success in various Natural Language Processing (NLP) tasks under the pre-training and fine-tuning paradigm. With large quantities of parameters, PLMs are computation-intensive and resource-hungry. Hence, model pruning has been introduced to compress large-scale PLMs. However, most prior approaches only consider task-specific knowledge towards downstream tasks, but ignore the essential task-agnostic knowledge during pruning, which may cause catastrophic forgetting and lead to poor generalization ability. To maintain both task-agnostic and task-specific knowledge in our pruned model, we propose ContrAstive Pruning (CAP) under the paradigm of pre-training and fine-tuning. It is designed as a general framework, compatible with both structured and unstructured pruning. Unified in contrastive learning, CAP enables the pruned model to learn from the pre-trained model for task-agnostic knowledge, and from the fine-tuned model for task-specific knowledge. Besides, to better retain the performance of the pruned model, the snapshots (i.e., the intermediate models at each pruning iteration) also serve as effective supervision for pruning. Our extensive experiments show that adopting CAP consistently yields significant improvements, especially in extremely high sparsity scenarios. With only 3% of model parameters reserved (i.e., 97% sparsity), CAP successfully achieves 99.2% and 96.3% of the original BERT performance on the QQP and MNLI tasks. In addition, our probing experiments demonstrate that the model pruned by CAP tends to achieve better generalization ability.
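
A hypothetical reading of the contrastive part of CAP is sketched below: the pruned model's representation is pulled toward the frozen pre-trained model (task-agnostic knowledge) and the fine-tuned model (task-specific knowledge) with an InfoNCE-style loss added to the task loss. The snapshot supervision mentioned in the abstract is omitted, and all function names are my own.

```python
# Illustrative contrastive-pruning objective (not the released CAP code).
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.05):
    """anchor, positive: (B, d) representations; row i of `positive` is the
    positive for row i of `anchor`, the other rows act as in-batch negatives."""
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def cap_loss(pruned_repr, pretrained_repr, finetuned_repr, task_loss, lam=0.1):
    l_agnostic = info_nce(pruned_repr, pretrained_repr.detach())   # task-agnostic teacher
    l_specific = info_nce(pruned_repr, finetuned_repr.detach())    # task-specific teacher
    return task_loss + lam * (l_agnostic + l_specific)
```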

AAAI Conference 2022 Conference Paper

GALAXY: A Generative Pre-trained Model for Task-Oriented Dialog with Semi-supervised Learning and Explicit Policy Injection

  • Wanwei He
  • Yinpei Dai
  • Yinhe Zheng
  • Yuchuan Wu
  • Zheng Cao
  • Dermot Liu
  • Peng Jiang
  • Min Yang

Pre-trained models have proved to be powerful in enhancing task-oriented dialog systems. However, current pre-training methods mainly focus on enhancing dialog understanding and generation tasks while neglecting the exploitation of dialog policy. In this paper, we propose GALAXY, a novel pre-trained dialog model that explicitly learns dialog policy from limited labeled dialogs and large-scale unlabeled dialog corpora via semi-supervised learning. Specifically, we introduce a dialog act prediction task for policy optimization during pre-training and employ a consistency regularization term to refine the learned representation with the help of unlabeled dialogs. We also implement a gating mechanism to weigh suitable unlabeled dialog samples. Empirical results show that GALAXY substantially improves the performance of task-oriented dialog systems, and achieves new state-of-the-art results on benchmark datasets: In-Car, MultiWOZ 2.0 and MultiWOZ 2.1, improving their end-to-end combined scores by 2.5, 5.3 and 5.5 points, respectively. We also show that GALAXY has a stronger few-shot ability than existing models under various low-resource settings. For reproducibility, we release the code and data at https://github.com/siat-nlp/GALAXY.

AAAI Conference 2022 Short Paper

Learning to Ask for Data-Efficient Event Argument Extraction (Student Abstract)

  • Hongbin Ye
  • Ningyu Zhang
  • Zhen Bi
  • Shumin Deng
  • Chuanqi Tan
  • Hui Chen
  • Fei Huang
  • Huajun Chen

Event argument extraction (EAE) is an important task in information extraction to discover specific argument roles. In this study, we cast EAE as a question-based cloze task and empirically analyze the performance of fixed discrete token templates. As generating human-annotated question templates is often time-consuming and labor-intensive, we further propose a novel approach called “Learning to Ask,” which can learn optimized question templates for EAE without human annotations. Experiments using the ACE-2005 dataset demonstrate that our method based on optimized questions achieves state-of-the-art performance in both the few-shot and supervised settings.

IJCAI Conference 2022 Conference Paper

Meta-Learning Based Knowledge Extrapolation for Knowledge Graphs in the Federated Setting

  • Mingyang Chen
  • Wen Zhang
  • Zhen Yao
  • Xiangnan Chen
  • Mengxiao Ding
  • Fei Huang
  • Huajun Chen

We study the knowledge extrapolation problem to embed new components (i.e., entities and relations) that come with emerging knowledge graphs (KGs) in the federated setting. In this problem, a model trained on an existing KG needs to embed an emerging KG with unseen entities and relations. To solve this problem, we introduce the meta-learning setting, where a set of tasks are sampled on the existing KG to mimic the link prediction task on the emerging KG. Based on sampled tasks, we meta-train a graph neural network framework that can construct features for unseen components based on structural information and output embeddings for them. Experimental results show that our proposed method can effectively embed unseen components and outperforms models that consider inductive settings for KGs and baselines that directly use conventional KG embedding methods.

IJCAI Conference 2021 Conference Paper

Automatically Paraphrasing via Sentence Reconstruction and Round-trip Translation

  • Zilu Guo
  • Zhongqiang Huang
  • Kenny Q. Zhu
  • Guandan Chen
  • Kaibo Zhang
  • Boxing Chen
  • Fei Huang

Paraphrase generation plays key roles in NLP tasks such as question answering, machine translation, and information retrieval. In this paper, we propose a novel framework for paraphrase generation. It simultaneously decodes the output sentence using a pretrained wordset-to-sequence model and a round-trip translation model. We evaluate this framework on Quora, WikiAnswers, MSCOCO and Twitter, and show its advantage over previous state-of-the-art unsupervised methods and distantly-supervised methods by significant margins on all datasets. For Quora and WikiAnswers, our framework even performs better than some strongly supervised methods with domain adaptation. Further, we show that the generated paraphrases can be used to augment the training data for machine translation to achieve substantial improvements.

AAAI Conference 2021 Conference Paper

Bridging the Domain Gap: Improve Informal Language Translation via Counterfactual Domain Adaptation

  • Ke Wang
  • Guandan Chen
  • Zhongqiang Huang
  • Xiaojun Wan
  • Fei Huang

Despite the near-human performance already achieved on formal texts such as news articles, neural machine translation still has difficulty in dealing with “user-generated” texts that have diverse linguistic phenomena but lack large-scale high-quality parallel corpora. To address this problem, we propose a counterfactual domain adaptation method to better leverage both large-scale source-domain data (formal texts) and small-scale target-domain data (informal texts). Specifically, by considering effective counterfactual conditions (the concatenations of source-domain texts and the target-domain tag), we construct counterfactual representations to fill the sparse latent space of the target domain caused by a small amount of data, that is, bridging the gap between the source-domain data and the target-domain data. Experiments on English-to-Chinese and Chinese-to-English translation tasks show that our method outperforms the base model that is trained only on the informal corpus by a large margin, and consistently surpasses different baseline methods by +1.12 ∼ 4.34 BLEU points on different datasets. Furthermore, we also show that our method achieves competitive performance on cross-domain language translation on four language pairs.

AAAI Conference 2021 Conference Paper

Contrastive Triple Extraction with Generative Transformer

  • Hongbin Ye
  • Ningyu Zhang
  • Shumin Deng
  • Mosha Chen
  • Chuanqi Tan
  • Fei Huang
  • Huajun Chen

Triple extraction is an essential task in information extraction for natural language processing and knowledge graph construction. In this paper, we revisit the end-to-end triple extraction task for sequence generation. Since generative triple extraction may struggle to capture long-term dependencies and generate unfaithful triples, we introduce a novel model, contrastive triple extraction with a generative transformer. Specifically, we introduce a single shared transformer module for encoder-decoder-based generation. To generate faithful results, we propose a novel triplet contrastive training objective. Moreover, we introduce two mechanisms to further improve model performance (i.e., batch-wise dynamic attention masking and triple-wise calibration). Experimental results on three datasets (i.e., NYT, WebNLG, and MIE) show that our approach achieves better performance than that of the baselines.

IJCAI Conference 2021 Conference Paper

Document-level Relation Extraction as Semantic Segmentation

  • Ningyu Zhang
  • Xiang Chen
  • Xin Xie
  • Shumin Deng
  • Chuanqi Tan
  • Mosha Chen
  • Fei Huang
  • Luo Si

Document-level relation extraction aims to extract relations among multiple entity pairs from a document. Previously proposed graph-based or transformer-based models utilize the entities independently, regardless of global information among relational triples. This paper approaches the problem by predicting an entity-level relation matrix to capture local and global information, parallel to the semantic segmentation task in computer vision. Herein, we propose a Document U-shaped Network for document-level relation extraction. Specifically, we leverage an encoder module to capture the context information of entities and a U-shaped segmentation module over the image-style feature map to capture global interdependency among triples. Experimental results show that our approach can obtain state-of-the-art performance on three benchmark datasets DocRED, CDR, and GDA.
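
To visualize the "entity-level relation matrix as an image" framing, here is a toy pairwise-feature-map head with a plain convolutional stack standing in for the paper's U-shaped segmentation module; shapes and layer choices are my own assumptions, not the published architecture.

```python
# Toy relation-matrix head for document-level RE (illustrative only).
import torch
import torch.nn as nn

class RelationMatrixHead(nn.Module):
    def __init__(self, ent_dim, num_relations):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2 * ent_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, num_relations, kernel_size=1),
        )

    def forward(self, ent_repr):
        """ent_repr: (N, d) entity representations for one document."""
        n, d = ent_repr.shape
        pair = torch.cat(
            [ent_repr.unsqueeze(1).expand(n, n, d),     # head entity features
             ent_repr.unsqueeze(0).expand(n, n, d)],    # tail entity features
            dim=-1)                                      # (N, N, 2d) "image"
        feat = pair.permute(2, 0, 1).unsqueeze(0)        # (1, 2d, N, N)
        return self.conv(feat).squeeze(0).permute(1, 2, 0)  # (N, N, num_relations)
```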

AAAI Conference 2021 Conference Paper

Dynamic Hybrid Relation Exploration Network for Cross-Domain Context-Dependent Semantic Parsing

  • Binyuan Hui
  • Ruiying Geng
  • Qiyu Ren
  • Binhua Li
  • Yongbin Li
  • Jian Sun
  • Fei Huang
  • Luo Si

Semantic parsing has long been a fundamental problem in natural language processing. Recently, cross-domain context-dependent semantic parsing has become a new focus of research. Central to the problem is the challenge of leveraging contextual information of both natural language utterance and database schemas in the interaction history. In this paper, we present a dynamic graph framework that is capable of effectively modelling contextual utterances, tokens, database schemas, and their complicated interaction as the conversation proceeds. The framework employs a dynamic memory decay mechanism that incorporates inductive bias to integrate enriched contextual relation representation, which is further enhanced with a powerful reranking model. At the time of writing, we demonstrate that the proposed framework outperforms all existing models by large margins, achieving new state-of-the-art performance on two large-scale benchmarks, the SParC and CoSQL datasets. Specifically, the model attains a 55.8% question-match and 30.8% interaction-match accuracy on SParC, and a 46.8% question-match and 17.0% interaction-match accuracy on CoSQL.

AAAI Conference 2021 Conference Paper

Knowledge-aware Named Entity Recognition with Alleviating Heterogeneity

  • Binling Nie
  • Ruixue Ding
  • Pengjun Xie
  • Fei Huang
  • Chen Qian
  • Luo Si

Named Entity Recognition (NER) is a fundamental and important research topic for many downstream NLP tasks, aiming at detecting and classifying named entities (NEs) mentioned in unstructured text into pre-defined categories. Learning from labeled data only is far from enough when it comes to domain-specific or temporally-evolving entities (e.g., medical terminologies or restaurant names). Luckily, open-source Knowledge Bases (KBs) (e.g., Wikidata and Freebase) contain NEs that are manually labeled with predefined types in different domains, which is potentially beneficial to identify entity boundaries and recognize entity types more accurately. However, the type system of a domain-specific NER task is typically independent of that of current KBs and thus inevitably exhibits a heterogeneity issue, which makes matching between the original NER and KB types (e.g., Person in NER potentially matches President in KBs) less likely, or introduces unintended noise without considering domain-specific knowledge (e.g., Band in NER should be mapped to Out of Entity Types in the restaurant-related task). To better incorporate and denoise the abundant knowledge in KBs, we propose a new KB-aware NER framework (KaNa), which utilizes type-heterogeneous knowledge to improve NER. Specifically, for an entity mention along with a set of candidate entities that are linked from KBs, KaNa first uses a type projection mechanism that maps the mention type and entity types into a shared space to homogenize the heterogeneous entity types. Then, based on projected types, a noise detector filters out certain less-confident candidate entities in an unsupervised manner. Finally, the filtered mention-entity pairs are injected into a NER model as a graph to predict answers. The experimental results demonstrate KaNa’s state-of-the-art performance on five public benchmark datasets from different domains.

AAAI Conference 2021 Conference Paper

Nested Named Entity Recognition with Partially-Observed TreeCRFs

  • Yao Fu
  • Chuanqi Tan
  • Mosha Chen
  • Songfang Huang
  • Fei Huang

Named entity recognition (NER) is a well-studied task in natural language processing. However, the widely-used sequence labeling framework has difficulty detecting entities with nested structures. In this work, we view nested NER as constituency parsing with partially-observed trees and model it with partially-observed TreeCRFs. Specifically, we view all labeled entity spans as observed nodes in a constituency tree, and other spans as latent nodes. With the TreeCRF we achieve a uniform way to jointly model the observed and the latent nodes. To compute the probability of partial trees with partial marginalization, we propose a variant of the Inside algorithm, the MASKED INSIDE algorithm, which supports different inference operations for different nodes (evaluation for the observed, marginalization for the latent, and rejection for nodes incompatible with the observed) with an efficient parallelized implementation, thus significantly speeding up training and inference. Experiments show that our approach achieves state-of-the-art (SOTA) F1 scores on the ACE2004 and ACE2005 datasets, and shows comparable performance to SOTA models on the GENIA dataset. We release the code at https://github.com/FranxYao/Partially-Observed-TreeCRFs.
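
A much-simplified, unlabeled version of an inside pass with span masks is sketched below to show how partial observations can be folded into the recursion (disallowed spans simply contribute -inf). The actual MASKED INSIDE algorithm handles labels, the observed/latent/rejected distinction, and parallelization more carefully; this is only my own sequential approximation.

```python
# Simplified masked inside pass over unlabeled binary trees (illustrative only).
import torch

def masked_inside(span_score, allowed):
    """span_score: (n, n) log-potentials for spans (i, j) with i <= j;
    allowed: (n, n) bool mask, False for spans ruled out by the partial
    observation (e.g. spans crossing an observed entity)."""
    n = span_score.size(0)
    beta = torch.full((n, n), float("-inf"))
    for i in range(n):
        if allowed[i, i]:
            beta[i, i] = span_score[i, i]          # single-token spans
    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            if not allowed[i, j]:
                continue                           # rejected span stays -inf
            splits = torch.stack([beta[i, k] + beta[k + 1, j] for k in range(i, j)])
            beta[i, j] = span_score[i, j] + torch.logsumexp(splits, dim=0)
    return beta[0, n - 1]                          # log-partition over allowed trees
```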

AAAI Conference 2021 Conference Paper

Unsupervised Learning of Deterministic Dialogue Structure with Edge-Enhanced Graph Auto-Encoder

  • Yajing Sun
  • Yong Shan
  • Chengguang Tang
  • Yue Hu
  • Yinpei Dai
  • Jing Yu
  • Jian Sun
  • Fei Huang

It is important for task-oriented dialogue systems to discover the dialogue structure (i.e., the general dialogue flow) from dialogue corpora automatically. Previous work models dialogue structure by first extracting latent states for each utterance and then calculating the transition probabilities among states. These two-stage methods ignore the contextual information when calculating the probabilities, which makes the transitions between the states ambiguous. This paper proposes a conversational graph (CG) to represent deterministic dialogue structure, where nodes and edges represent the utterance and context information, respectively. An unsupervised Edge-Enhanced Graph Auto-Encoder (EGAE) architecture is designed to model local-contextual and global-structural information for conversational graph learning. Furthermore, a self-supervised objective is introduced with the response selection task to guide the unsupervised learning of the dialogue structure. Experimental results on several public datasets demonstrate that the novel model outperforms several alternatives in aggregating utterances with similar semantics. The effectiveness of the learned dialogue structure is also verified by more than 5% joint accuracy improvement in the downstream task of low-resource dialogue state tracking.

AAAI Conference 2020 Conference Paper

Boundary Enhanced Neural Span Classification for Nested Named Entity Recognition

  • Chuanqi Tan
  • Wei Qiu
  • Mosha Chen
  • Rui Wang
  • Fei Huang

Named entity recognition (NER) is a well-studied task in natural language processing. However, the widely-used sequence labeling framework usually has difficulty detecting entities with nested structures. The span-based method, which can easily detect nested entities in different subsequences, is naturally suitable for the nested NER problem. However, previous span-based methods have two main issues. First, classifying all subsequences is computationally expensive and very inefficient at inference. Second, span-based methods mainly focus on learning span representations but lack explicit boundary supervision. To tackle the above two issues, we propose a boundary enhanced neural span classification model. In addition to classifying the span, we propose incorporating an additional boundary detection task to predict those words that are boundaries of entities. The two tasks are jointly trained under a multitask learning framework, which enhances the span representation with additional boundary supervision. In addition, the boundary detection model has the ability to generate high-quality candidate spans, which greatly reduces the time complexity during inference. Experiments show that our approach outperforms all existing methods and achieves F1 scores of 85.3, 83.9, and 78.3 on the ACE2004, ACE2005, and GENIA datasets, respectively.
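
The joint boundary-and-span setup can be sketched as two heads trained together, as below. Layer names and the span-feature construction (endpoint concatenation) are placeholders rather than the paper's exact architecture.

```python
# Minimal sketch of jointly-trained boundary and span-classification heads
# (hypothetical names; not the paper's implementation).
import torch
import torch.nn as nn

class BoundarySpanModel(nn.Module):
    def __init__(self, hidden, num_types):
        super().__init__()
        self.start_head = nn.Linear(hidden, 2)        # token is an entity start / not
        self.end_head = nn.Linear(hidden, 2)          # token is an entity end / not
        self.span_head = nn.Linear(hidden * 2, num_types + 1)   # +1 for "no entity"

    def forward(self, token_repr, span_pairs):
        """token_repr: (T, H); span_pairs: non-empty list of (i, j) candidate spans."""
        start_logits = self.start_head(token_repr)    # (T, 2)
        end_logits = self.end_head(token_repr)        # (T, 2)
        span_feats = torch.stack(
            [torch.cat([token_repr[i], token_repr[j]]) for i, j in span_pairs])
        span_logits = self.span_head(span_feats)      # (num_spans, num_types + 1)
        return start_logits, end_logits, span_logits

# Training would sum three cross-entropy terms (boundary starts, boundary ends,
# span types); at inference, predicted starts/ends would propose the candidate spans.
```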

AAAI Conference 2020 Conference Paper

Knowing What, How and Why: A Near Complete Solution for Aspect-Based Sentiment Analysis

  • Haiyun Peng
  • Lu Xu
  • Lidong Bing
  • Fei Huang
  • Wei Lu
  • Luo Si

Target-based sentiment analysis or aspect-based sentiment analysis (ABSA) refers to addressing various sentiment analysis tasks at a fine-grained level, which includes but is not limited to aspect extraction, aspect sentiment classification, and opinion extraction. There exist many solvers of the above individual subtasks or a combination of two subtasks, and they can work together to tell a complete story, i.e., the discussed aspect, the sentiment on it, and the cause of the sentiment. However, no previous ABSA research has tried to provide a complete solution in one shot. In this paper, we introduce a new subtask under ABSA, named aspect sentiment triplet extraction (ASTE). Particularly, a solver of this task needs to extract triplets (What, How, Why) from the inputs, which show WHAT the targeted aspects are, HOW their sentiment polarities are and WHY they have such polarities (i.e., opinion reasons). For instance, one triplet from “Waiters are very friendly and the pasta is simply average” could be (‘Waiters’, positive, ‘friendly’). We propose a two-stage framework to address this task. The first stage predicts what, how and why in a unified model, and then the second stage pairs up the predicted what (how) and why from the first stage to output triplets. In the experiments, our framework has set a benchmark performance in this novel triplet extraction task. Meanwhile, it outperforms a few strong baselines adapted from state-of-the-art related methods.

IJCAI Conference 2020 Conference Paper

Learning with Noise: Improving Distantly-Supervised Fine-grained Entity Typing via Automatic Relabeling

  • Haoyu Zhang
  • Dingkun Long
  • Guangwei Xu
  • Muhua Zhu
  • Pengjun Xie
  • Fei Huang
  • Ji Wang

Fine-grained entity typing (FET) is a fundamental task for various entity-leveraging applications. Although great success has been made, existing systems still have challenges in handling noisy samples in training data introduced by distant supervision methods. To address this noise, previous studies either focus on processing the clean samples (i.e., those with only one label) and noisy samples (i.e., those with multiple labels) with different strategies, or filter the noisy labels based on the assumption that the distantly-supervised label set certainly contains the correct type label. In this paper, we propose a probabilistic automatic relabeling method which treats all training samples uniformly. Our method aims to estimate the pseudo-truth label distribution of each sample, and the pseudo-truth distribution will be treated as part of the trainable parameters which are jointly updated during the training process. The proposed approach does not rely on any prerequisite or extra supervision, making it effective in real applications. Experiments on several benchmarks show that our method outperforms previous approaches and alleviates the noisy labeling problem.
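
One way to picture the trainable pseudo-truth distribution is the sketch below, where each sample owns a row of logits initialized from its distant labels and updated jointly with the typing model. This is my illustration of the idea with placeholder names, not the authors' formulation.

```python
# Illustrative joint relabeling sketch (not the paper's code).
import torch
import torch.nn.functional as F

class PseudoLabels(torch.nn.Module):
    def __init__(self, distant_labels):
        """distant_labels: (N, C) multi-hot matrix from distant supervision."""
        super().__init__()
        init = torch.log(distant_labels.float() + 1e-6)   # favor the distant labels at start
        self.logits = torch.nn.Parameter(init)            # one trainable row per sample

    def forward(self, sample_ids):
        return F.softmax(self.logits[sample_ids], dim=-1)  # pseudo-truth distribution

def relabeling_loss(model_logits, pseudo, sample_ids):
    # Cross-entropy between the typing model's prediction and the soft pseudo-truth.
    log_p = F.log_softmax(model_logits, dim=-1)
    return -(pseudo(sample_ids) * log_p).sum(dim=-1).mean()
```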

TCS Journal 2019 Journal Article

Minimum degree condition for proper connection number 2

  • Fei Huang
  • Xueliang Li
  • Zhongmei Qin
  • Colton Magnant

A path in an edge-colored graph is called a proper path if no two adjacent edges of the path receive the same color. For a connected graph G, the proper connection number pc(G) of G is defined as the minimum number of colors needed to color its edges so that every pair of distinct vertices of G is connected by at least one proper path in G. Recently, Li and Magnant in [8] posed the following conjecture: If G is a connected noncomplete graph of order n ≥ 5 and minimum degree δ(G) ≥ n/4, then pc(G) = 2. In this paper, we show that this conjecture is true except for two small graphs on 7 and 8 vertices, respectively. As a byproduct we obtain that if G is a connected bipartite graph of order n ≥ 4 with δ(G) ≥ (n + 6)/8, then pc(G) = 2.

TCS Journal 2019 Journal Article

Paths and trails in edge-colored weighted graphs

  • Runjie Miao
  • Jinjiang Yuan
  • Fei Huang

Let (G, c, w) be an edge-colored weighted graph, where G is a nontrivial connected graph, c is an edge-coloring of G, and w is an edge-weighting of G. A path, a trail, a cycle, or a closed trail of G, say F, is called proper under the edge-coloring c if every two consecutive edges of F receive different colors in c. Let s and t be two specified nonadjacent vertices in G. In this paper, we study the problems of finding, in (G, c, w), the minimum weighted proper s-t-path, the minimum weighted proper s-t-trail, the minimum weighted proper cycle, the minimum weighted proper closed trail, the maximum weighted proper s-t-path, and the maximum weighted proper s-t-trail. When the minimization problems are considered we assume that (G, c, w) has no negative proper cycle, and when the maximization problems are considered we assume that (G, c, w) has no proper closed trail. We show that all these problems are solvable in polynomial time.

ECAI Conference 2016 Conference Paper

A Novel Cross-Modal Topic Correlation Model for Cross-Media Retrieval

  • Yong Cheng
  • Fei Huang
  • Cheng Jin 0001
  • Yuejie Zhang
  • Tao Zhang 0022

A novel cross-modal topic correlation model CMTCM is developed in this paper to facilitate more effective cross-modal analysis and cross-media retrieval for large-scale multimodal document collections. It can be modeled as a cross-modal topic correlation model which explores the inter-related correlation distribution over the deep representations of multimodal documents. It integrates the deep multimodal document representation, relational topic correlation modeling, and cross-modal topic correlation learning, which aims to characterize the correlations between the heterogeneous topic distributions of inter-related visual images and semantic texts, and measure their association degree more precisely. Very positive results were obtained in our experiments using a large quantity of public data.

ECAI Conference 2016 Conference Paper

Enhancing Sketch-Based Image Retrieval via Deep Discriminative Representation

  • Fei Huang
  • Yong Cheng
  • Cheng Jin 0001
  • Yuejie Zhang
  • Tao Zhang 0022

In this paper we aim to employ deep learning to enhance SBIR via deep discriminative representation. Our main contributions focus on: 1) The deep discriminative representation is established to bridge both the visual appearance gap and the semantic gap between sketches and images; 2) The deep learning pattern is applied to our SBIR model through training on our transformed sketch-like images to overcome the rarity of training sketches. Our experiments on a large number of public sketch and image data have obtained very positive results.

AAAI Conference 2005 Conference Paper

Clustering and Classifying Person Names by Origin

  • Fei Huang

In natural language processing, information about a person’s geographical origin is an important feature for named entity transliteration and question answering. We propose a language-independent name origin clustering and classification framework. Provided with a small amount of bilingual name translation pairs with labeled origins, we measure origin similarities based on the perplexities of name character language and translation models. We group similar origins into clusters, then train a Bayesian classifier with different features. It achieves 84% classification accuracy with source names only, and 91% with both source and target name pairs. We apply the origin clustering and classification technique to a name transliteration task. The cluster-specific transliteration model dramatically improves the transliteration accuracy from 3.8% to 55%, reducing the transliteration character error rate from 50.3 to 13.5. Adding more unlabeled name pairs to the cluster-specific name transliteration model further improves the transliteration accuracy.
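
The perplexity-based origin scoring can be illustrated with a toy character-bigram model per origin, as in the sketch below. Add-alpha smoothing and all function names are my own choices, and the translation-model perplexities also used in the paper are omitted.

```python
# Toy character-bigram origin scoring (illustrative only).
import math
from collections import Counter, defaultdict

def train_char_bigram(names, alpha=1.0):
    counts, totals, vocab = defaultdict(Counter), Counter(), set()
    for name in names:
        chars = ["<s>"] + list(name.lower()) + ["</s>"]
        vocab.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            counts[prev][cur] += 1
            totals[prev] += 1
    V = len(vocab)
    def logprob(prev, cur):                       # add-alpha smoothed bigram log-probability
        return math.log((counts[prev][cur] + alpha) / (totals[prev] + alpha * V))
    return logprob

def perplexity(name, logprob):
    chars = ["<s>"] + list(name.lower()) + ["</s>"]
    lp = sum(logprob(p, c) for p, c in zip(chars, chars[1:]))
    return math.exp(-lp / (len(chars) - 1))

def classify_origin(name, models):
    # models: dict mapping origin label -> trained logprob function
    return min(models, key=lambda origin: perplexity(name, models[origin]))
```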