Arrow Research search

Author name cluster

Benyou Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

30 papers
2 author rows

Possible papers

30

ICLR Conference 2025 Conference Paper

Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts

  • Guorui Zheng
  • Xidong Wang
  • Juhao Liang
  • Nuo Chen 0002
  • Yuping Zheng
  • Benyou Wang

Adapting medical Large Language Models to local languages can reduce barriers to accessing healthcare services, but data scarcity remains a significant challenge, particularly for low-resource languages. To address this, we first construct a high-quality medical dataset and conduct analysis to ensure its quality. In order to leverage the generalization capability of multilingual LLMs to efficiently scale to more resource-constrained languages, we explore the internal information flow of LLMs from a multilingual perspective using Mixture of Experts (MoE) modularity. Technically, we propose a novel MoE routing method that employs language-specific experts and cross-lingual routing. Inspired by circuit theory, our routing analysis revealed a "Spread Out in the End" information-flow mechanism: while earlier layers concentrate cross-lingual information flow, the later layers exhibit language-specific divergence. This insight directly led to the development of the Post-MoE architecture, which applies sparse routing only in the later layers while keeping the others dense. Experimental results demonstrate that this approach enhances the generalization of multilingual models to other languages while preserving interpretability. Finally, to efficiently scale the model to 50 languages, we introduce the concept of language-family experts, drawing on linguistic priors, which enables scaling the number of languages without adding additional parameters.
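
A minimal sketch of the Post-MoE idea described above: dense feed-forward blocks in the earlier layers, and sparse language-family experts with a learned router only in the last few layers. All module names, sizes, and the top-1 routing choice are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    def __init__(self, d, hidden):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, d))
    def forward(self, x):
        return self.net(x)

class LanguageFamilyMoE(nn.Module):
    """One expert per language family, top-1 cross-lingual routing (hypothetical)."""
    def __init__(self, d, hidden, n_families):
        super().__init__()
        self.experts = nn.ModuleList(FFN(d, hidden) for _ in range(n_families))
        self.router = nn.Linear(d, n_families)          # routing scores per token

    def forward(self, x):
        gates = self.router(x).softmax(-1)               # (batch, seq, n_families)
        top = gates.argmax(-1)                           # top-1 expert per token
        out = torch.zeros_like(x)
        # dense compute for clarity; a real MoE dispatches tokens sparsely
        for i, expert in enumerate(self.experts):
            mask = (top == i).unsqueeze(-1)
            out = out + mask * expert(x)
        return out

def build_layers(n_layers, d=512, hidden=2048, n_families=5, sparse_last=4):
    # Post-MoE: only the last `sparse_last` layers are sparse, the rest stay dense.
    return nn.ModuleList(
        LanguageFamilyMoE(d, hidden, n_families) if i >= n_layers - sparse_last
        else FFN(d, hidden)
        for i in range(n_layers)
    )
```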

ICRA Conference 2025 Conference Paper

NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants

  • Yiran Qin
  • Ao Sun
  • Yuze Hong
  • Benyou Wang
  • Ruimao Zhang

Navigating unfamiliar environments presents significant challenges for household robots, requiring the ability to recognize and reason about novel decoration and layout. Existing reinforcement learning methods cannot be directly transferred to new environments, as they typically rely on extensive mapping and exploration, making them time-consuming and inefficient. To address these challenges, we transfer the logical knowledge and generalization ability of pretrained foundation models to zero-shot navigation. By integrating a large vision-language model with a diffusion network, our approach, named NavigateDiff, constructs a visual predictor that continuously predicts the agent's potential observations at the next step, which can help robots generate robust actions. Furthermore, to adapt to the temporal nature of navigation, we introduce temporal historical information to ensure that the predicted image is aligned with the navigation scene. We then carefully design an information fusion framework that embeds the predicted future frames as guidance into a goal-reaching policy to solve downstream image navigation tasks. This approach enhances navigation control and generalization across both simulated and real-world environments. Through extensive experimentation, we demonstrate the robustness and versatility of our method, showcasing its potential to improve the efficiency and effectiveness of robotic navigation in diverse settings. Project Page: https://21styouth.github.io/NavigateDiff/.

ICLR Conference 2025 Conference Paper

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models

  • Bofei Gao
  • Feifan Song 0001
  • Zhe Yang 0013
  • Zefan Cai
  • Yibo Miao
  • Qingxiu Dong
  • Lei Li 0039
  • Chenghao Ma

Recent advancements in large language models (LLMs) have led to significant breakthroughs in mathematical reasoning capabilities. However, existing benchmarks like GSM8K or MATH are now being solved with high accuracy (e.g., OpenAI o1 achieves 94.8% on the MATH dataset), indicating their inadequacy for truly challenging these models. To bridge this gap, we propose a comprehensive and challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation. These problems are meticulously categorized into over 33 sub-domains and span more than 10 distinct difficulty levels, enabling a holistic assessment of model performance in Olympiad-level mathematical reasoning. Furthermore, we conducted an in-depth analysis based on this benchmark. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, achieving only 60.54% and 52.55% accuracy, respectively, highlighting significant challenges in Olympiad-level mathematical reasoning.

UAI Conference 2025 Conference Paper

Periodical Moving Average Accelerates Gradient Accumulation for Post-Training

  • Yumou Liu
  • An Li
  • Chaojie Li
  • Fei Yu 0017
  • Benyou Wang

High gradient variance presents a significant obstacle to efficient post-training of large language models (LLMs) on memory-constrained devices. Existing practical strategies, such as reducing batch sizes or adopting gradient accumulation (GA), suffer from an inherent trade-off: smaller batches exacerbate convergence issues due to increased gradient noise, while GA substantially prolongs training time owing to its sequential processing. In this work, we reveal that the Exponential Moving Average (EMA) in momentum-based optimizers exponentially discounts historical gradients, thereby limiting their effectiveness in stabilizing parameter updates, especially during post-training when parameter drift is minimal. Motivated by this, we propose integrating the core idea of GA directly into momentum updates via a novel Periodical Moving Average (PMA) mechanism, which structures training into fixed periods and replaces EMA with a uniform moving average within each period. We instantiate PMA within AdamW and Lion, resulting in the AdamW-PMA and Lion-PMA optimizers. Theoretical analysis establishes that AdamW-PMA matches the convergence guarantees of standard Adam. Extensive empirical evaluation on supervised fine-tuning and direct preference optimization tasks demonstrates that PMA-based methods achieve approximately 2× faster training compared to GA, while yielding consistently better performance on downstream evaluations.
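
A simplified sketch of how PMA could sit inside an AdamW-style update, following the abstract: within each fixed period, the first moment is a uniform average of that period's gradients instead of an exponential one. The period reset, the second-moment handling, and the omitted bias correction are assumptions of this sketch, not the paper's exact optimizer.

```python
import torch

class AdamWPMASketch(torch.optim.Optimizer):
    """Sketch only: AdamW with the first moment replaced by a per-period
    uniform average of gradients (PMA). Bias correction is omitted."""

    def __init__(self, params, lr=1e-4, period=8, beta2=0.999, eps=1e-8,
                 weight_decay=0.01):
        super().__init__(params, dict(lr=lr, period=period, beta2=beta2,
                                      eps=eps, weight_decay=weight_decay))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                s = self.state[p]
                if not s:
                    s["t"], s["m"], s["v"] = 0, torch.zeros_like(p), torch.zeros_like(p)
                k = s["t"] % group["period"]       # step index within the period
                if k == 0:
                    s["m"].zero_()                 # new period: restart the average
                # uniform running mean over the k+1 gradients seen this period
                s["m"].mul_(k / (k + 1)).add_(p.grad, alpha=1 / (k + 1))
                s["v"].mul_(group["beta2"]).addcmul_(p.grad, p.grad,
                                                     value=1 - group["beta2"])
                p.mul_(1 - group["lr"] * group["weight_decay"])   # decoupled decay
                p.addcdiv_(s["m"], s["v"].sqrt().add_(group["eps"]),
                           value=-group["lr"])
                s["t"] += 1
```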

NeurIPS Conference 2025 Conference Paper

QFFT, Question-Free Fine-Tuning for Adaptive Reasoning

  • Wanlong Liu
  • Junxiao Xu
  • Fei Yu
  • Yukang Lin
  • Ke Ji
  • Wenyu Chen
  • Lifeng Shang
  • Yasheng Wang

Recent advancements in Long Chain-of-Thought (CoT) reasoning models have improved performance on complex tasks, but they suffer from overthinking, which generates redundant reasoning steps, especially for simple questions. This paper revisits the reasoning patterns of Long and Short CoT models, observing that the Short CoT patterns offer concise reasoning efficiently, while the Long CoT patterns excel in challenging scenarios where the Short CoT patterns struggle. To enable models to leverage both patterns, we propose Question-Free Fine-Tuning (QFFT), a fine-tuning approach that removes the input question during training and learns exclusively from Long CoT responses. This approach enables the model to adaptively employ both reasoning patterns: it prioritizes the Short CoT patterns and activates the Long CoT patterns only when necessary. Experiments on various mathematical datasets demonstrate that QFFT reduces average response length by more than 50%, while achieving performance comparable to Supervised Fine-Tuning (SFT). Additionally, QFFT exhibits superior performance compared to SFT in noisy, out-of-domain, and low-resource scenarios.
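
A minimal sketch of the QFFT data preparation described above: the input question is dropped entirely, and the model is fine-tuned on the Long CoT response alone as ordinary next-token prediction. The tokenizer choice and field names are illustrative assumptions.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")  # any causal LM tokenizer

def qfft_example(sample, max_len=4096):
    # sample = {"question": ..., "long_cot_response": ...}; the question is unused.
    ids = tokenizer(sample["long_cot_response"], truncation=True,
                    max_length=max_len)["input_ids"]
    return {"input_ids": ids, "labels": ids}  # loss over the whole response
```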

TMLR Journal 2025 Journal Article

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

  • Haoran Li
  • Qingxiu Dong
  • Zhengyang Tang
  • Chaojun Wang
  • Xingxing Zhang
  • Haoyang Huang
  • Shaohan Huang
  • Xiaolong Huang

We introduce Generalized Instruction Tuning (called GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs). Unlike prior work that relies on seed examples or existing datasets to construct instruction-tuning data, GLAN exclusively utilizes a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines. Specifically, inspired by the systematic structure of the human education system, we build the taxonomy by semi-automatically decomposing human knowledge and capabilities into fields, sub-fields and, ultimately, distinct disciplines, facilitated by LLMs. Subsequently, we generate a comprehensive list of subjects for every discipline and proceed to design a syllabus tailored to each subject, again utilizing LLMs. With the fine-grained key concepts detailed in every class session of the syllabus, we are able to generate diverse instructions with broad coverage across the entire spectrum of human knowledge and skills. Extensive experiments on large language models (e.g., Mistral) demonstrate that GLAN excels in multiple dimensions, from mathematical reasoning, coding, academic exams, and logical reasoning to general instruction following, without using task-specific training data for these tasks. In addition, GLAN allows for easy customization: new fields or skills can be added by simply incorporating a new node into our taxonomy. While promising, our approach may inherit biases or inaccuracies from LLM-generated data, as in other synthetic-data work, and is primarily evaluated on exam-style benchmarks. Broader evaluations and data quality control are left for future work.
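
A schematic sketch of the generation pipeline the abstract describes (taxonomy node, then subjects, then syllabus, then instructions), with each step delegated to an LLM. `ask_llm` is a placeholder for any prompt-to-completion call, and the exact prompts are invented for illustration.

```python
def glan_instructions(ask_llm, discipline: str, n_questions: int = 3) -> list[str]:
    """ask_llm: any callable mapping a prompt string to a completion string."""
    instructions = []
    subjects = ask_llm(
        f"List the main subjects of {discipline}, one per line.").splitlines()
    for subject in subjects:
        syllabus = ask_llm(f"Design a syllabus for '{subject}' ({discipline}): "
                           "class sessions with key concepts per session.")
        concepts = ask_llm("Extract the key concepts, one per line, from:\n"
                           + syllabus).splitlines()
        for concept in concepts:
            instructions.append(ask_llm(
                f"Write {n_questions} diverse homework questions testing "
                f"'{concept}' in {subject}."))
    return instructions
```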

NeurIPS Conference 2025 Conference Paper

Teaching Language Models to Reason with Tools

  • Chengpeng Li
  • Zhengyang Tang
  • Ziniu Li
  • Mingfeng Xue
  • Keqin Bao
  • Tian Ding
  • Ruoyu Sun
  • Benyou Wang

Large reasoning models (LRMs) like OpenAI-o1 have shown impressive capabilities in natural language reasoning. However, these models frequently demonstrate inefficiencies or inaccuracies when tackling complex mathematical operations. While integrating computational tools such as Code Interpreters (CIs) offers a promising solution, it introduces a critical challenge: a conflict between the model's internal, probabilistic reasoning and the external, deterministic knowledge provided by the CI, which often leads models to unproductive deliberation. To overcome this, we introduce CoRT (Code-Optimized Reasoning Training), a post-training framework designed to teach LRMs to effectively utilize CIs. We propose Hint-Engineering, a new data synthesis strategy that strategically injects diverse hints at optimal points within reasoning paths. This approach generates high-quality, code-integrated reasoning data specifically tailored to optimize LRM-CI interaction. Using this method, we have synthesized 30 high-quality samples to post-train models ranging from 1.5B to 32B parameters through supervised fine-tuning. CoRT further refines the multi-round interleaving of external CI usage and internal thinking by employing rejection sampling and reinforcement learning. Our experimental evaluations demonstrate CoRT's effectiveness, yielding absolute improvements of 4% and 8% on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B, respectively, across five challenging mathematical reasoning datasets. Moreover, CoRT significantly enhances efficiency, reducing token usage by approximately 30% for the 32B model and 50% for the 1.5B model compared to pure natural language reasoning baselines. The models and code are available at: this url.

NeurIPS Conference 2025 Conference Paper

The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models

  • Ke Ji
  • Jiahao Xu
  • Tian Liang
  • Qiuzhi Liu
  • Zhiwei He
  • Xiaoyuan Liu
  • Xingyu Chen
  • Junying Chen

Improving the reasoning capabilities of large language models (LLMs) typically requires supervised fine-tuning with labeled data or computationally expensive sampling. We introduce Unsupervised Prefix Fine-Tuning (UPFT), which leverages the observation of Prefix Self-Consistency -- the shared initial reasoning steps across diverse solution trajectories -- to enhance LLM reasoning efficiency. By training exclusively on the initial prefix substrings (as few as 8 tokens), UPFT removes the need for labeled data or exhaustive sampling. Experiments on reasoning benchmarks show that UPFT matches the performance of supervised methods such as Rejection Sampling Fine-Tuning, while reducing training time by 75% and sampling cost by 99%. Further analysis reveals that errors tend to appear in later stages of the reasoning process and that prefix-based training preserves the model's structural knowledge. This work demonstrates how minimal unsupervised fine-tuning can unlock substantial reasoning gains in LLMs, offering a scalable and resource-efficient alternative to conventional approaches.
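
A minimal sketch of the prefix-based training-example construction: sample one solution per question, keep only its first few tokens, and supervise only that prefix. The 8-token default follows the abstract; including the question in the input and using the Hugging Face-style `-100` ignore index for label masking are assumptions.

```python
def upft_example(tokenizer, question: str, sampled_solution: str,
                 prefix_tokens: int = 8):
    q_ids = tokenizer(question)["input_ids"]
    prefix = tokenizer(sampled_solution)["input_ids"][:prefix_tokens]
    input_ids = q_ids + prefix
    # supervise only the solution prefix; question tokens are ignored in the loss
    labels = [-100] * len(q_ids) + prefix
    return {"input_ids": input_ids, "labels": labels}
```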

ICML Conference 2025 Conference Paper

Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis

  • Xu Wang 0033
  • Yan Hu
  • Wenyu Du
  • Reynold Cheng
  • Benyou Wang
  • Difan Zou

Fine-tuning significantly improves the performance of Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. This paper aims to provide an in-depth interpretation of the fine-tuning process through circuit analysis, a popular tool in Mechanistic Interpretability (MI). Unlike previous studies (Prakash et al. 2024, Chhabra et al. 2024) that focus on tasks where pre-trained models already perform well, we develop a set of mathematical tasks where fine-tuning yields substantial performance gains, bringing the setup closer to real-world scenarios. In our experiments, we identify circuits at various checkpoints during fine-tuning and examine the interplay between circuit analysis, fine-tuning methods, and task complexities. First, we find that while circuits maintain high node similarity before and after fine-tuning, their edges undergo significant changes, contrasting with previous work (Prakash et al. 2024, Chhabra et al. 2024) that reported only small circuit additions after fine-tuning. Based on these observations, we develop a circuit-aware Low-Rank Adaptation (LoRA) method that assigns ranks to layers according to edge changes in the circuits. Experimental results demonstrate that our circuit-based LoRA achieves an average improvement of 2.46% over standard LoRA with comparable parameter sizes. Furthermore, we explore how combining circuits from subtasks can enhance fine-tuning in compositional tasks, offering new insights into task design and deepening our understanding of circuit dynamics and fine-tuning mechanisms.
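
A small sketch of what circuit-aware rank assignment could look like: distribute a total LoRA rank budget across layers in proportion to measured circuit edge changes. The proportional rule and the `edge_change` scores are our reading of the abstract, not the paper's exact procedure.

```python
def allocate_lora_ranks(edge_change: dict[str, float], total_rank: int,
                        min_rank: int = 1) -> dict[str, int]:
    """edge_change: per-layer circuit edge-change score from any circuit tool."""
    total = sum(edge_change.values())
    # rounding means the allocated ranks may not sum exactly to total_rank
    return {layer: max(min_rank, round(total_rank * score / total))
            for layer, score in edge_change.items()}

# e.g. allocate_lora_ranks({"layer.0": 0.1, "layer.5": 0.6, "layer.11": 0.3}, 48)
```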

NeurIPS Conference 2025 Conference Paper

TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets

  • Yuzhe Yang
  • Yifei Zhang
  • Minghao Wu
  • Kaidi Zhang
  • Yunmiao Zhang
  • Honghai Yu
  • Yan Hu
  • Benyou Wang

The study of social emergence has long been a central focus in social science. Traditional modeling approaches, such as rule-based Agent-Based Models (ABMs), struggle to capture the diversity and complexity of human behavior, particularly the irrational factors emphasized in behavioral economics. Recently, large language model (LLM) agents have gained traction as simulation tools for modeling human behavior in social science and role-playing applications. Studies suggest that LLMs can account for cognitive biases, emotional fluctuations, and other non-rational influences, enabling more realistic simulations of socio-economic dynamics. In this work, we introduce TwinMarket, a novel multi-agent framework that leverages LLMs to simulate socio-economic systems. Specifically, we examine how individual behaviors, through interactions and feedback mechanisms, give rise to collective dynamics and emergent phenomena. Through experiments in a simulated stock market environment, we demonstrate how individual actions can trigger group behaviors, leading to emergent outcomes such as financial bubbles and recessions. Our approach provides valuable insights into the complex interplay between individual decision-making and collective socio-economic patterns.

NeurIPS Conference 2025 Conference Paper

Video-R1: Reinforcing Video Reasoning in MLLMs

  • Kaituo Feng
  • Kaixiong Gong
  • Bohao Li
  • Zonghao Guo
  • Yibing Wang
  • Tianshuo Peng
  • Junfei Wu
  • Xiaoying Zhang

Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We have constructed two datasets: Video-R1-CoT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass. Notably, Video-R1-7B attains 37.1% accuracy on the video spatial reasoning benchmark VSI-Bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data will be released.

NeurIPS Conference 2024 Conference Paper

Alignment at Pre-training! Towards Native Alignment for Arabic LLMs

  • Juhao Liang
  • Zhenyang Cai
  • Jianqing Zhu
  • Huang Huang
  • Kewei Zong
  • Bang An
  • Abdulmohsen Alharthi
  • Juncai He

The alignment of large language models (LLMs) is critical for developing effective and safe language models. Traditional approaches focus on aligning models during the instruction tuning or reinforcement learning stages, referred to in this paper as 'post alignment'. We argue that alignment during the pre-training phase, which we term 'native alignment', warrants investigation. Native alignment aims to prevent unaligned content from the beginning, rather than relying on post-hoc processing. This approach leverages extensively aligned pre-training data to enhance the effectiveness and usability of pre-trained models. Our study specifically explores the application of native alignment in the context of Arabic LLMs. We conduct comprehensive experiments and ablation studies to evaluate the impact of native alignment on model performance and alignment stability. Additionally, we release open-source Arabic LLMs that demonstrate state-of-the-art performance on various benchmarks, providing significant benefits to the Arabic LLM community.

JBHI Journal 2024 Journal Article

Few-Shot Class-Incremental Learning for Medical Time Series Classification

  • Le Sun
  • Mingyang Zhang
  • Benyou Wang
  • Prayag Tiwari

Continuously analyzing medical time series as new classes emerge is meaningful for health monitoring and medical decision-making. Few-shot class-incremental learning (FSCIL) explores the classification of few-shot new classes without forgetting old classes. However, little of the existing research on FSCIL focuses on medical time series classification, which is more challenging to learn due to its large intra-class variability. In this paper, we propose a framework, the Meta self-Attention Prototype Incrementer (MAPIC), to address these problems. MAPIC contains three main modules: an embedding encoder for feature extraction, a prototype enhancement module for increasing inter-class variation, and a distance-based classifier for reducing intra-class variation. To mitigate catastrophic forgetting, MAPIC adopts a parameter protection strategy in which the parameters of the embedding encoder module are frozen at incremental stages after being trained in the base stage. The prototype enhancement module is proposed to enhance the expressiveness of prototypes by perceiving inter-class relations using a self-attention mechanism. We design a composite loss function containing the sample classification loss, the prototype non-overlapping loss, and the knowledge distillation loss, which work together to reduce intra-class variations and resist catastrophic forgetting. Experimental results on three different time series datasets show that MAPIC significantly outperforms state-of-the-art approaches by 27.99%, 18.4%, and 3.95%, respectively.
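
A minimal sketch of the frozen-encoder, distance-based prototype classification the abstract describes. The self-attention prototype enhancement and the composite loss are omitted here, and the encoder interface is an assumption.

```python
import torch

@torch.no_grad()
def add_class_prototype(encoder, support_x, prototypes: list):
    encoder.eval()                      # encoder stays frozen at incremental stages
    z = encoder(support_x)              # (n_shot, d) embeddings for the new class
    prototypes.append(z.mean(0))        # one mean-embedding prototype per class

@torch.no_grad()
def classify(encoder, x, prototypes: list):
    z = encoder(x)                      # (batch, d)
    P = torch.stack(prototypes)         # (n_classes, d)
    dists = torch.cdist(z, P)           # Euclidean distance to each prototype
    return dists.argmin(-1)             # nearest-prototype label
```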

NeurIPS Conference 2024 Conference Paper

FinBen: A Holistic Financial Benchmark for Large Language Models

  • Qianqian Xie
  • Weiguang Han
  • Zhengyu Chen
  • Ruoyu Xiang
  • Xiao Zhang
  • Yueru He
  • Mengxi Xiao
  • Dong Li

LLMs have transformed NLP and shown promise in various fields, yet their potential in finance is underexplored due to a lack of comprehensive benchmarks, the rapid development of LLMs, and the complexity of financial tasks. In this paper, we introduce FinBen, the first extensive open-source evaluation benchmark, including 42 datasets spanning 24 financial tasks and covering eight critical aspects: information extraction (IE), textual analysis, question answering (QA), text generation, risk management, forecasting, decision-making, and bilingual tasks (English and Spanish). FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and two novel datasets for regulations and stock trading. Our evaluation of 21 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals several key findings: while LLMs excel in IE and textual analysis, they struggle with advanced reasoning and complex tasks like text generation and forecasting; GPT-4 excels in IE and stock trading, while Gemini is better at text generation and forecasting; instruction-tuned LLMs improve textual analysis but offer limited benefits for complex tasks such as QA. FinBen has been used to host the first financial LLMs shared task at the FinNLP-AgentScen workshop during IJCAI-2024, attracting 12 teams. Their novel solutions outperformed GPT-4, showcasing FinBen's potential to drive innovation in financial LLMs. All datasets and code are publicly available to the research community, with results shared and updated regularly on the Open Financial LLM Leaderboard.

NeurIPS Conference 2024 Conference Paper

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

  • Pengcheng Chen
  • Jin Ye
  • Guoan Wang
  • Yanjun Li
  • Zhongying Deng
  • Wei Li
  • Tianbin Li
  • Haodong Duan

Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals, and can be applied in various fields. In the medical field, LVLMs have high potential to offer substantial assistance for diagnosis and treatment. Before that can happen, it is crucial to develop benchmarks to evaluate LVLMs' effectiveness in various medical applications. Current benchmarks are often built upon specific academic literature, mainly focus on a single domain, and lack varying perceptual granularities. Thus, they face specific challenges, including limited clinical relevance, incomplete evaluations, and insufficient guidance for interactive LVLMs. To address these limitations, we developed GMAI-MMBench, the most comprehensive general medical AI benchmark to date, with a well-categorized data structure and multiple perceptual granularities. It is constructed from 284 datasets across 38 medical image modalities, 18 clinically related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format. Additionally, we implemented a lexical tree structure that allows users to customize evaluation tasks, accommodating various assessment needs and substantially supporting medical AI research and applications. We evaluated 50 LVLMs, and the results show that even the advanced GPT-4o only achieves an accuracy of 53.96%, indicating significant room for improvement. Moreover, we identified five key insufficiencies in current cutting-edge LVLMs that need to be addressed to advance the development of better medical applications. We believe that GMAI-MMBench will stimulate the community to build the next generation of LVLMs toward GMAI.

ICML Conference 2024 Conference Paper

MathScale: Scaling Instruction Tuning for Mathematical Reasoning

  • Zhengyang Tang
  • Xingxing Zhang 0002
  • Benyou Wang
  • Furu Wei

Large language models (LLMs) have demonstrated remarkable capabilities in problem-solving. However, their proficiency in solving mathematical problems remains inadequate. We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data using frontier LLMs (e.g., GPT-3.5). Inspired by the cognitive mechanism of human mathematical learning, it first extracts topics and knowledge points from seed math questions and then builds a concept graph, which is subsequently used to generate new math questions. MathScale exhibits effective scalability along the size axis of the math dataset that we generate. As a result, we create a mathematical reasoning dataset (MathScaleQA) containing two million math question-answer pairs. To evaluate the mathematical reasoning abilities of LLMs comprehensively, we construct MWPBench, a benchmark of Math Word Problems, which is a collection of 9 datasets (including GSM8K and MATH) covering K-12, college, and competition-level math problems. We apply MathScaleQA to fine-tune open-source LLMs (e.g., LLaMA-2 and Mistral), resulting in significantly improved capabilities in mathematical reasoning. Evaluated on MWPBench, MathScale-7B achieves state-of-the-art performance across all datasets, surpassing its best peers of equivalent size by 42.8% in micro average accuracy and 43.6% in macro average accuracy, respectively.
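
A schematic sketch of the concept-graph idea: concepts that co-occur in seed questions become graph edges, and new questions are generated by prompting an LLM with concepts sampled along a short walk. `ask_llm`, the prompts, and the walk-based sampling are illustrative assumptions, not the paper's exact procedure.

```python
import random
from itertools import combinations
from collections import defaultdict

def build_concept_graph(seed_concept_sets: list[set[str]]) -> dict[str, set[str]]:
    graph = defaultdict(set)
    for concepts in seed_concept_sets:   # concepts co-occurring in one seed question
        for a, b in combinations(concepts, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def sample_question(graph, ask_llm, walk_len=3):
    node = random.choice(list(graph))
    walk = [node]
    for _ in range(walk_len - 1):
        node = random.choice(list(graph[node] or {node}))  # stay put if isolated
        walk.append(node)
    return ask_llm(f"Write a new math word problem combining: {', '.join(walk)}. "
                   "Then give a step-by-step solution.")
```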

TMLR Journal 2024 Journal Article

Mixture of Latent Experts Using Tensor Products

  • Zhan Su
  • Fengran Mo
  • Prayag Tiwari
  • Benyou Wang
  • Qiuchi Li
  • Jian-Yun Nie
  • Jakob Grue Simonsen

In multi-task learning, the conventional approach involves training a model on multiple tasks simultaneously. However, the training signals from different tasks can interfere with one another, potentially leading to negative transfer. To mitigate this, we propose a novel latent-expert approach (TensorPoly) that balances parameter efficiency with nuanced routing methods. For the experts, we reparameterize Low-Rank Adaptation (LoRA) by employing an entangled tensor through the use of tensor product operations, and name the resulting approach TLoRA. For the routing function, we tailor two innovative routing functions according to granularity: TensorPoly-I directs to each rank within the entangled tensor, while TensorPoly-II offers a finer-grained routing approach targeting each order of the entangled tensor. The experimental results from the multi-task T0 benchmark demonstrate that: 1) all latent-expert approaches surpass the corresponding dense approaches, highlighting the potential of modular language models to mitigate negative transfer in multi-task learning and deliver superior outcomes; 2) TensorPoly-I achieves higher parameter efficiency in adaptation and outperforms other modular LMs, which shows the potential of our approach in multi-task transfer learning. The code is released at: https://github.com/microsoft/mttl.
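
An illustrative reparameterization of a LoRA-style update with tensor (Kronecker) products, in the spirit of TLoRA: the weight update is a sum of a few Kronecker products of small factors, so far fewer parameters are stored than for a dense update. The paper's entangled-tensor construction and routing functions are richer; this shows only the core trick under assumed shapes.

```python
import torch
import torch.nn as nn

class KroneckerUpdate(nn.Module):
    """Delta-W = sum_r kron(A_r, B_r); e.g. KroneckerUpdate(768, 768, 16, 16, rank=4)
    stores 4*(16*16 + 48*48) = 10,240 params vs. 589,824 for a dense 768x768."""

    def __init__(self, d_out, d_in, f_out, f_in, rank=4):
        assert d_out % f_out == 0 and d_in % f_in == 0
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, f_out, f_in) * 0.01)
        self.B = nn.Parameter(torch.randn(rank, d_out // f_out, d_in // f_in) * 0.01)

    def delta_w(self):
        # each kron(A_r, B_r) has shape (d_out, d_in)
        return sum(torch.kron(a, b) for a, b in zip(self.A, self.B))

    def forward(self, x, frozen_weight):
        return x @ (frozen_weight + self.delta_w()).T
```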

ICLR Conference 2024 Conference Paper

Rethinking the Uniformity Metric in Self-Supervised Learning

  • Xianghong Fang
  • Jian Li 0054
  • Qiang Sun
  • Benyou Wang

Uniformity plays an important role in evaluating learned representations, providing insights into self-supervised learning. In our quest for effective uniformity metrics, we pinpoint four principled properties that such metrics should possess. Namely, an effective uniformity metric should remain invariant to instance permutations and sample replications while accurately capturing feature redundancy and dimensional collapse. Surprisingly, we find that the uniformity metric proposed by Wang & Isola (2020) fails to satisfy the majority of these properties. Specifically, their metric is sensitive to sample replications and cannot correctly account for feature redundancy and dimensional collapse. To overcome these limitations, we introduce a new uniformity metric based on the Wasserstein distance, which satisfies all the aforementioned properties. Integrating this new metric into existing self-supervised learning methods effectively mitigates dimensional collapse and consistently improves their performance on downstream tasks involving the CIFAR-10 and CIFAR-100 datasets. Code is available at https://github.com/statsle/WassersteinSSL.
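
A sketch of a Wasserstein-style uniformity metric in the spirit of the abstract: approximate the L2-normalized features by a Gaussian and compute the closed-form 2-Wasserstein distance to N(0, I/d), whose first two moments match a uniform distribution on the unit sphere. The exact target distribution and normalization are assumptions here, not necessarily the paper's final formula.

```python
import torch

def wasserstein_uniformity(features: torch.Tensor) -> torch.Tensor:
    z = torch.nn.functional.normalize(features, dim=-1)  # points on the unit sphere
    d = z.shape[-1]
    mu, cov = z.mean(0), torch.cov(z.T)
    # closed-form W2 between N(mu, cov) and N(0, I/d); note Tr(I/d) = 1
    sqrt_cov_trace = torch.linalg.eigvalsh(cov).clamp(min=0).sqrt().sum()
    w2_sq = mu.pow(2).sum() + cov.trace() + 1.0 - (2.0 / d**0.5) * sqrt_cov_trace
    return w2_sq.clamp(min=0).sqrt()  # smaller means closer to uniform
```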

NeurIPS Conference 2023 Conference Paper

CMMA: Benchmarking Multi-Affection Detection in Chinese Multi-Modal Conversations

  • Yazhou Zhang
  • Yang Yu
  • Qing Guo
  • Benyou Wang
  • Dongming Zhao
  • Sagar Uprety
  • Dawei Song
  • Qiuchi Li

Human communication has a multi-modal and multi-affection nature. The inter-relatedness of different emotions and sentiments poses a challenge to jointly detecting multiple human affections with multi-modal clues. Recent advances in this field employed multi-task learning paradigms to render the inter-relatedness across tasks, but the scarcity of publicly available resources has limited the potential of such work. To fill this gap, we build the first Chinese Multi-modal Multi-Affection conversation (CMMA) dataset, which contains 3,000 multi-party conversations and 21,795 multi-modal utterances collected from various styles of TV series. CMMA contains a wide variety of affection labels, including sentiment, emotion, sarcasm and humor, as well as novel inter-correlation values between certain pairs of tasks. Moreover, it provides the topic and speaker information in conversations, which promotes better modeling of conversational context. On the dataset, we empirically analyze the influence of different data modalities and conversational contexts on different affection analysis tasks, and exhibit the practical benefit of inter-task correlations. The full dataset will be publicly available for research (https://github.com/annoymity2022/Chinese-Dataset).

AAAI Conference 2023 Conference Paper

Effective Open Intent Classification with K-center Contrastive Learning and Adjustable Decision Boundary

  • Xiaokang Liu
  • Jianquan Li
  • Jingjing Mu
  • Min Yang
  • Ruifeng Xu
  • Benyou Wang

Open intent classification, which aims to correctly classify the known intents into their corresponding classes while identifying the new unknown (open) intents, is an essential but challenging task in dialogue systems. In this paper, we introduce novel K-center contrastive learning and adjustable decision boundary learning (CLAB) to improve the effectiveness of open intent classification. First, we pre-train a feature encoder on the labeled training instances, which transfers knowledge from known intents to unknown intents. Specifically, we devise a K-center contrastive learning algorithm to learn discriminative and balanced intent features, improving the generalization of the model for recognizing open intents. Second, we devise an adjustable decision boundary learning method with expanding and shrinking (ADBES) to determine the suitable decision conditions. Concretely, we learn a decision boundary for each known intent class, which consists of a decision center and the radius of the decision boundary. We then expand the radius of the decision boundary to accommodate more in-class instances if the out-of-class instances are far from the decision boundary; otherwise, we shrink the radius of the decision boundary. Extensive experiments on three benchmark datasets clearly demonstrate the effectiveness of our method for open intent classification. For reproducibility, we submit the code at: https://github.com/lxk00/CLAP
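
A minimal sketch of the expand-and-shrink radius rule (ADBES) as described: each known class keeps a center and a radius, which grows to admit uncovered in-class points when out-of-class points are far, and shrinks when out-of-class points intrude. The step size and margin criterion are illustrative assumptions.

```python
import torch

def adjust_radius(radius, center, in_class, out_class, step=0.05, margin=1.0):
    d_in = (in_class - center).norm(dim=-1)    # distances of in-class points
    d_out = (out_class - center).norm(dim=-1)  # distances of out-of-class points
    if d_out.min() > radius + margin and (d_in > radius).any():
        return radius + step                   # expand: admit more in-class points
    if (d_out < radius).any():
        return radius - step                   # shrink: push out-of-class points out
    return radius
```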

NeurIPS Conference 2023 Conference Paper

Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias

  • Zhongwei Wan
  • Che Liu
  • Mi Zhang
  • Jie Fu
  • Benyou Wang
  • Sibo Cheng
  • Lei Ma
  • César Quilodrán-Casas

The scarcity of data presents a critical obstacle to the efficacy of medical vision-language pre-training (VLP). A potential solution lies in the combination of datasets from various language communities. Nevertheless, the main challenge stems from the complexity of integrating diverse syntax and semantics, language-specific medical terminology, and culture-specific implicit knowledge. Therefore, one crucial aspect to consider is the presence of community bias caused by different languages. This paper presents a novel framework named Unifying Cross-Lingual Medical Vision-Language Pre-Training (Med-UniC), designed to integrate multi-modal medical data from the two most prevalent languages, English and Spanish. Specifically, we propose Cross-lingual Text Alignment Regularization (CTR) to explicitly unify cross-lingual semantic representations of medical reports originating from diverse language communities. CTR is optimized through latent language disentanglement, so that our optimization objective does not depend on negative samples, thereby significantly mitigating the bias from determining positive-negative sample pairs within analogous medical reports. Furthermore, it ensures that the cross-lingual representation is not biased toward any specific language community. Med-UniC reaches superior performance across 5 medical image tasks and 10 datasets encompassing over 30 diseases, offering a versatile framework for unifying multi-modal medical data within diverse linguistic communities. The experimental outcomes highlight the presence of community bias in cross-lingual VLP. Reducing this bias enhances the performance not only in vision-language tasks but also in uni-modal visual tasks.

ICLR Conference 2022 Conference Paper

Exploring extreme parameter compression for pre-trained language models

  • Benyou Wang
  • Yuxin Ren
  • Lifeng Shang
  • Xin Jiang 0002
  • Qun Liu 0001

Recent work has explored the potential of large-scale Transformer-based pre-trained models, especially Pre-trained Language Models (PLMs), in natural language processing. This raises many concerns from various perspectives, e.g., financial costs and carbon emissions. Compressing PLMs like BERT with negligible performance loss for faster inference and cheaper deployment has attracted much attention. In this work, we aim to explore larger compression ratios for PLMs, among which tensor decomposition is a potential but under-investigated approach. By comparing existing decomposition methods, we find Tucker decomposition to be parameter-efficient for compression. Two decomposition and reconstruction protocols are further proposed to improve the effectiveness and efficiency of Tucker decomposition in parameter compression. Our compressed BERT with 1/7 of the parameters in Transformer layers performs on par with, and sometimes slightly better than, the original BERT on the GLUE benchmark. A tiny version achieves 96.7% of the performance of BERT-base with 1/48 of the encoder parameters (i.e., fewer than 2M parameters excluding the embedding layer) and 2.7× faster inference. To show that the proposed method is orthogonal to existing compression methods such as knowledge distillation, we also explore its benefit on a distilled BERT.
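
A sketch of the core compression idea: stack weight matrices from many Transformer layers into one 3-way tensor and Tucker-decompose it, so the layers share factor matrices and only small cores remain. The ranks and the stacking scheme here are illustrative; the paper's decomposition and reconstruction protocols are more refined.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

tl.set_backend("numpy")

# stand-in FFN weights; real BERT weights compress far better than random ones
layers = [np.random.randn(768, 3072) for _ in range(12)]
W = np.stack(layers)                                       # shape (12, 768, 3072)

core, factors = tucker(tl.tensor(W), rank=[12, 64, 256])   # layers share factors
approx = tl.tucker_to_tensor((core, factors))

orig_params = W.size
comp_params = core.size + sum(f.size for f in factors)
print(f"compression ~{orig_params / comp_params:.1f}x, "
      f"relative error {np.linalg.norm(W - approx) / np.linalg.norm(W):.3f}")
```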

NeurIPS Conference 2022 Conference Paper

MorphTE: Injecting Morphology in Tensorized Embeddings

  • Guobing Gan
  • Peng Zhang
  • Sunzhu Li
  • Xiuqing Lu
  • Benyou Wang

In the era of deep learning, word embeddings are essential when dealing with text tasks. However, storing and accessing these embeddings requires a large amount of space, which is not conducive to the deployment of these models on resource-limited devices. Leveraging the powerful compression capability of tensor products, we propose a word embedding compression method with morphological augmentation, Morphologically-enhanced Tensorized Embeddings (MorphTE). A word consists of one or more morphemes, the smallest units that bear meaning or have a grammatical function. MorphTE represents a word embedding as an entangled form of its morpheme vectors via the tensor product, which injects prior semantic and grammatical knowledge into the learning of embeddings. Furthermore, the dimensionality of the morpheme vectors and the number of morphemes are much smaller than those of words, which greatly reduces the parameters of the word embeddings. We conduct experiments on tasks such as machine translation and question answering. Experimental results on four translation datasets of different languages show that MorphTE can compress word embedding parameters by about 20 times without performance loss and significantly outperforms related embedding compression methods.
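
A toy sketch of the MorphTE idea: a word's vector is assembled as a tensor (Kronecker) product of small morpheme vectors, so only q-dimensional morpheme vectors are stored instead of full high-dimensional word vectors. The paper's rank and order handling is richer than this core trick.

```python
import torch
import torch.nn as nn

class MorphemeEmbedding(nn.Module):
    def __init__(self, n_morphemes, q=16):
        super().__init__()
        self.table = nn.Embedding(n_morphemes, q)  # only morpheme vectors stored

    def word_vector(self, morpheme_ids: list[int]) -> torch.Tensor:
        vecs = self.table(torch.tensor(morpheme_ids))
        out = vecs[0]
        for v in vecs[1:]:
            out = torch.kron(out, v)               # entangle morphemes
        return out                                 # dimension q ** len(morpheme_ids)

# e.g. "unkindly" -> ids for ["un", "kind", "ly"] gives a 16**3 = 4096-dim vector
```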

ICLR Conference 2021 Conference Paper

On Position Embeddings in BERT

  • Benyou Wang
  • Lifeng Shang
  • Christina Lioma
  • Xin Jiang 0002
  • Hao Yang 0006
  • Qun Liu 0001
  • Jakob Grue Simonsen

Various Position Embeddings (PEs) have been proposed in Transformer-based architectures (e.g., BERT) to model word order. These are empirically driven and perform well, but no formal framework exists to systematically study them. To address this, we present three properties of PEs that capture word distance in vector space: translation invariance, monotonicity, and symmetry. These properties formally capture the behaviour of PEs and allow us to reinterpret sinusoidal PEs in a principled way. Moreover, we propose a new probing test (called 'identical word probing') and mathematical indicators to quantitatively detect the general attention patterns with respect to the above properties. An empirical evaluation of seven PEs (and their combinations) for classification (GLUE) and span prediction (SQuAD) shows that: (1) both classification and span prediction benefit from translation invariance and local monotonicity, while symmetry slightly decreases performance; (2) the fully-learnable absolute PE performs better in classification, while relative PEs perform better in span prediction. We contribute the first formal and quantitative analysis of desiderata for PEs, and a principled discussion of their correlation to the performance of typical downstream tasks.
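
A quick numerical check in the spirit of the properties above: for sinusoidal PEs, the dot product of two position embeddings depends only on the offset between the positions, not on their absolute values, which is exactly translation invariance.

```python
import numpy as np

def sinusoidal_pe(pos, d=64):
    i = np.arange(d // 2)
    angles = pos / (10000 ** (2 * i / d))
    return np.concatenate([np.sin(angles), np.cos(angles)])

# <PE(p), PE(p+k)> = sum_i cos(k * w_i): constant in p for each fixed offset k
for k in (1, 5):
    dots = [sinusoidal_pe(p) @ sinusoidal_pe(p + k) for p in range(0, 50, 10)]
    print(k, np.allclose(dots, dots[0]))  # True: depends only on the offset k
```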

NeurIPS Conference 2021 Conference Paper

Word2Fun: Modelling Words as Functions for Diachronic Word Representation

  • Benyou Wang
  • Emanuele Di Buccio
  • Massimo Melucci

Word meaning may change over time as a reflection of changes in human society. Therefore, modeling time in word representation is necessary for some diachronic tasks. Most existing diachronic word representation approaches train the embeddings separately for each pre-grouped time-stamped corpus and align these embeddings, e.g., by orthogonal projections, vector initialization, temporal referencing, and compass. However, not only does word meaning change over short periods, it may also evolve over long timespans, resulting in a unified continuous process. A recent approach called 'DiffTime' models semantic evolution as functions parameterized by multi-layer nonlinear neural networks over time. In this paper, we carry on this line of work by learning explicit functions over time for each word. Our approach, called 'Word2Fun', reduces the space complexity from O(TVD) to O(kVD), where k is a small constant (k ≪ T). In particular, a specific instance based on polynomial functions can provably approximate any function modeling word evolution with arbitrarily small error, thanks to the Weierstrass Approximation Theorem. The effectiveness of the proposed approach is evaluated on diverse tasks including time-aware word clustering, temporal analogy, and semantic change detection. Code at: https://github.com/wabyking/Word2Fun.git
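
A minimal sketch of the Word2Fun idea: each word stores k polynomial coefficient vectors, and its embedding at time t is evaluated on the fly, giving O(kVD) storage instead of one embedding per time slice (O(TVD)). The parameterization details are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolynomialWordEmbedding(nn.Module):
    def __init__(self, vocab_size, dim, k=4):
        super().__init__()
        # coefficients c_0 .. c_{k-1} per word and per embedding dimension
        self.coeffs = nn.Parameter(torch.randn(vocab_size, k, dim) * 0.01)

    def forward(self, word_ids: torch.Tensor, t: float) -> torch.Tensor:
        k = self.coeffs.shape[1]
        powers = torch.tensor([t ** j for j in range(k)])  # [1, t, t^2, ...]
        # embedding(w, t) = sum_j c_j(w) * t^j
        return torch.einsum("wkd,k->wd", self.coeffs[word_ids], powers)
```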

ICLR Conference 2020 Conference Paper

Encoding word order in complex embeddings

  • Benyou Wang
  • Donghao Zhao
  • Christina Lioma
  • Qiuchi Li
  • Peng Zhang 0002
  • Jakob Grue Simonsen

Sequential word order is important when processing text. Currently, neural networks (NNs) address this by modeling word position using position embeddings. The problem is that position embeddings capture the position of individual words, but not the ordered relationship (e.g., adjacency or precedence) between individual word positions. We present a novel and principled solution for modeling both the global absolute positions of words and their order relationships. Our solution generalizes word embeddings, previously defined as independent vectors, to continuous word functions over a variable (position). The benefit of continuous functions over variable positions is that word representations shift smoothly with increasing positions. Hence, word representations in different positions can correlate with each other in a continuous function. The general solution of these functions can be extended to complex-valued variants. We extend CNN, RNN and Transformer NNs to complex-valued versions to incorporate our complex embedding (we make all code available). Experiments on text classification, machine translation and language modeling show gains over both classical word embeddings and position-enriched word embeddings. To our knowledge, this is the first work in NLP to link imaginary numbers in complex-valued representations to concrete meanings (i.e., word order).
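
A minimal sketch of such a complex-valued embedding: each word stores an amplitude, a frequency, and an initial phase per dimension, and its representation at position pos is amplitude * exp(i(freq * pos + phase)), so representations vary smoothly and stay correlated across nearby positions. The module layout is an illustrative assumption.

```python
import torch
import torch.nn as nn

class ComplexWordEmbedding(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.amp = nn.Embedding(vocab_size, dim)    # r: amplitude
        self.freq = nn.Embedding(vocab_size, dim)   # omega: frequency over position
        self.phase = nn.Embedding(vocab_size, dim)  # theta: initial phase

    def forward(self, word_ids: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # word_ids, positions: (seq,); returns a complex tensor of shape (seq, dim)
        angle = self.freq(word_ids) * positions.unsqueeze(-1) + self.phase(word_ids)
        return self.amp(word_ids) * torch.exp(1j * angle)
```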

IJCAI Conference 2018 Conference Paper

A Multi-task Learning Approach for Image Captioning

  • Wei Zhao
  • Benyou Wang
  • Jianbo Ye
  • Min Yang
  • Zhou Zhao
  • Ruotian Luo
  • Yu Qiao

In this paper, we propose a Multi-task Learning Approach for Image Captioning (MLAIC), motivated by the fact that humans have no difficulty performing such tasks because they possess capabilities in multiple domains. Specifically, MLAIC consists of three key components: (i) a multi-object classification model that learns rich category-aware image representations using a CNN image encoder; (ii) a syntax generation model that learns a better syntax-aware LSTM-based decoder; (iii) an image captioning model that generates image descriptions in text, sharing its CNN encoder and LSTM decoder with the object classification task and the syntax generation task, respectively. In particular, the image captioning model can benefit from the additional object categorization and syntax knowledge. To verify the effectiveness of our approach, we conduct extensive experiments on the MS-COCO dataset. The experimental results demonstrate that our model achieves impressive results compared to other strong competitors.

AAAI Conference 2018 Conference Paper

End-to-End Quantum-like Language Models with Application to Question Answering

  • Peng Zhang
  • Jiabin Niu
  • Zhan Su
  • Benyou Wang
  • Liqun Ma
  • Dawei Song

Language Modeling (LM) is a fundamental research topic in a range of areas. Recently, inspired by quantum theory, a novel Quantum Language Model (QLM) has been proposed for Information Retrieval (IR). In this paper, we aim to broaden the theoretical and practical basis of QLM. We develop a Neural Network based Quantum-like Language Model (NNQLM) and apply it to Question Answering. Specifically, based on word embeddings, we design a new density matrix, which represents a sentence (e.g., a question or an answer) and encodes a mixture of semantic subspaces. Such a density matrix, together with a joint representation of the question and the answer, can be integrated into neural network architectures (e.g., 2-dimensional convolutional neural networks). Experiments on the TREC-QA and WIKIQA datasets have verified the effectiveness of our proposed models.
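
A minimal sketch of the density-matrix construction described above: L2-normalize each word vector, take its outer product with itself, and average over the sentence, yielding a unit-trace positive semi-definite matrix that encodes a mixture of semantic subspaces. The comparison at the end is one plausible use, not necessarily the paper's scoring function.

```python
import torch

def sentence_density_matrix(word_vectors: torch.Tensor) -> torch.Tensor:
    v = torch.nn.functional.normalize(word_vectors, dim=-1)  # (n_words, d)
    rho = torch.einsum("nd,ne->de", v, v) / v.shape[0]       # mean outer product
    return rho  # trace(rho) == 1 and rho is positive semi-definite

# a question/answer pair can then be compared via, e.g., trace(rho_q @ rho_a)
```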

IJCAI Conference 2018 Conference Paper

PLASTIC: Prioritize Long and Short-term Information in Top-n Recommendation using Adversarial Training

  • Wei Zhao
  • Benyou Wang
  • Jianbo Ye
  • Yongqiang Gao
  • Min Yang
  • Xiaojun Chen

Recommender systems provide users with ranked lists of items based on individuals' preferences and constraints. Two types of models are commonly used to generate ranking results: long-term models and session-based models. While long-term models represent interactions between users and items that are expected to change slowly over time, session-based models encode information about users' interests and the changing dynamics of items' attributes in the short term. In this paper, we propose the PLASTIC model, Prioritizing Long And Short-Term Information in top-n reCommendation using adversarial training. In the adversarial process, we train a generator as a reinforcement learning agent that recommends the next item to a user sequentially. We also train a discriminator that attempts to distinguish the generated list of items from the real recorded list. Extensive experiments show that our model achieves significantly better performance on two widely used real-world datasets.