Arrow Research search

Author name cluster

Bailin Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers

16

EAAI Journal 2026 Journal Article

An enhanced memetic algorithm for a novel two-stage flow shop scheduling problem with group processing features

  • Shuaipeng Yuan
  • Bailin Wang
  • Yihan Pei
  • Tieke Li

This study addresses a novel two-stage flow shop scheduling problem motivated by hot rolling production management in modern steel manufacturing. The first stage is modeled as a simultaneous processing machine subject to a first-in-first-out rule and capacity constraints, while the second stage features a group processing mechanism. This problem fundamentally differs from conventional batch scheduling and group scheduling, and it requires simultaneously determining the job sequence and the processing manner of adjacent jobs on the second machine. To tackle this NP-hard problem, a mixed-integer linear programming model is formulated to minimize the makespan, and three fundamental properties are analytically derived. Leveraging these insights, an enhanced memetic algorithm is proposed, incorporating a greedy rule-based decoding strategy, a heuristic-driven initialization, a global search framework with five specialized neighborhood operators, and a local search strategy comprising four tailored operators. A reset mechanism is incorporated to avoid premature convergence. Extensive experiments demonstrate that the proposed memetic algorithm outperforms three other meta-heuristics and an industry-based heuristic, confirming its effectiveness for intelligent scheduling in practical settings.

NeurIPS Conference 2024 Conference Paper

Bias Amplification in Language Model Evolution: An Iterated Learning Perspective

  • Yi Ren
  • Shangmin Guo
  • Linlu Qiu
  • Bailin Wang
  • Danica J. Sutherland

With the widespread adoption of Large Language Models (LLMs), the prevalence of iterative interactions among these models is anticipated to increase. Notably, recent advancements in multi-round on-policy self-improving methods allow LLMs to generate new examples for training subsequent models. At the same time, multi-agent LLM systems, involving automated interactions among agents, are also increasing in prominence. Thus, in both short and long terms, LLMs may actively engage in an evolutionary process. We draw parallels between the behavior of LLMs and the evolution of human culture, as the latter has been extensively studied by cognitive scientists for decades. Our approach involves leveraging Iterated Learning (IL), a Bayesian framework that elucidates how subtle biases are magnified during human cultural evolution, to explain some behaviors of LLMs. This paper outlines key characteristics of agents' behavior in the Bayesian-IL framework, including predictions that are supported by experimental verification with various LLMs. This theoretical framework could help to more effectively predict and guide the evolution of LLMs in desired directions.

ICML Conference 2024 Conference Paper

Gated Linear Attention Transformers with Hardware-Efficient Training

  • Songlin Yang
  • Bailin Wang
  • Yikang Shen
  • Rameswar Panda
  • Yoon Kim

Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear-time inference complexity. However, linear attention generally underperforms ordinary softmax attention. Moreover, current implementations of linear attention lack I/O-awareness and are thus slower than highly optimized implementations of softmax attention. This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability. The resulting implementation, dubbed FlashLinearAttention, is faster than FlashAttention-2 as a standalone layer even on short sequence lengths (e. g. , 1K). We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates. When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention (GLA) Transformer is found to perform competitively against the LLaMA-architecture Transformer as well recent linear-time-inference baselines such as RetNet and Mamba on moderate-scale language modeling experiments. GLA Transformer is especially effective at length generalization, enabling a model trained on 2K to generalize to sequences longer than 20K without significant perplexity degradations. For training speed, the GLA Transformer has higher throughput than a similarly-sized Mamba model.

NeurIPS Conference 2024 Conference Paper

Gated Slot Attention for Efficient Linear-Time Sequence Modeling

  • Yu Zhang
  • Songlin Yang
  • Ruijie Zhu
  • Yue Zhang
  • Leyang Cui
  • Yiqiao Wang
  • Bolun Wang
  • Freda Shi

Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via $\operatorname{softmax}$, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the $\operatorname{softmax}$ operation is particularly beneficial in ``finetuning pretrained Transformers to RNNs'' (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.

ICLR Conference 2024 Conference Paper

GenSim: Generating Robotic Simulation Tasks via Large Language Models

  • Lirui Wang
  • Yiyang Ling
  • Zhecheng Yuan
  • Mohit Shridhar
  • Chen Bao
  • Yuzhe Qin
  • Bailin Wang
  • Huazhe Xu

Collecting large amounts of real-world interaction data to train general robotic policies is often prohibitively expensive, thus motivating the use of simulation data. However, existing methods for data generation have generally focused on scene-level diversity (e.g., object instances and poses) rather than task-level diversity, due to the human effort required to come up with and verify novel tasks. This has made it challenging for policies trained on simulation data to demonstrate significant task-level generalization. In this paper, we propose to automatically generate rich simulation environments and expert demonstrations by exploiting a large language models' (LLM) grounding and coding ability. Our approach, dubbed GenSim, has two modes: goal-directed generation, wherein a target task is given to the LLM and the LLM proposes a task curriculum to solve the target task, and exploratory generation, wherein the LLM bootstraps from previous tasks and iteratively proposes novel tasks that would be helpful in solving more complex tasks. We use GPT4 to expand the existing benchmark by ten times to over 100 tasks, on which we conduct supervised finetuning and evaluate several LLMs including finetuned GPTs and Code Llama on code generation for robotic simulation tasks. Furthermore, we observe that LLMs-generated simulation programs can enhance task-level generalization significantly when used for multitask policy training. We further find that with minimal sim-to-real adaptation, the multitask policies pretrained on GPT4-generated simulation tasks exhibit stronger transfer to unseen long-horizon tasks in the real world and outperform baselines by 25%. See our project website (https://gen-sim.github.io) and demo (https://huggingface.co/spaces/Gen-Sim/Gen-Sim) for visualizations and open-source models and datasets.

ICML Conference 2024 Conference Paper

In-Context Language Learning: Architectures and Algorithms

  • Ekin Akyürek
  • Bailin Wang
  • Yoon Kim
  • Jacob Andreas

Some neural language models (LMs) exhibit a remarkable capacity for in-context learning (ICL): they can fit predictors to datasets provided as input. While the mechanisms underlying ICL are well-studied in the context of synthetic problems like in-context linear regression, there is still some divergence between these model problems and the “real” ICL exhibited by LMs trained on large text corpora. In this paper, we study ICL through the lens of a new family of model problems we term in context language learning (ICLL). In ICLL, LMs are presented with a set of strings from a formal language, and must generate additional strings from the same language. We focus on in- context learning of regular languages generated by random finite automata. We evaluate a diverse set of neural sequence models on regular ICLL tasks. We first show that Transformers significantly outperform neural sequence models with recurrent or convolutional representations on ICLL tasks. Next, we provide evidence that they do so by computing in-context n-gram statistics using specialized attention heads. Finally, we show that hard-wiring these heads into neural models improves performance not just on synthetic ICLL, but natural language modeling, reducing the perplexity of 340M-parameter Transformers by up to 1. 14 points (6. 7%) on the SlimPajama dataset. Our results highlight the usefulness of in-context formal language learning as a tool for understanding ICL in models of natural text.

ICLR Conference 2024 Conference Paper

Lemur: Harmonizing Natural Language and Code for Language Agents

  • Yiheng Xu
  • Hongjin Su
  • Chen Xing
  • Boyu Mi
  • Qian Liu 0033
  • Weijia Shi
  • Binyuan Hui
  • Fan Zhou

We introduce Lemur and Lemur-Chat, openly accessible language models optimized for both natural language and coding capabilities to serve as the backbone of versatile language agents. The evolution from language chat models to functional language agents demands that models not only master human interaction, reasoning, and planning but also ensure grounding in the relevant environments. This calls for a harmonious blend of language and coding capabilities in the models. Lemur and Lemur-Chat are proposed to address this necessity, demonstrating balanced proficiencies in both domains, unlike existing open-source models that tend to specialize in either. Through meticulous pretraining using a code-intensive corpus and instruction fine-tuning on text and code data, our models achieve state-of-the-art averaged performance across diverse text and coding benchmarks. Comprehensive experiments demonstrate Lemur’s superiority over existing open-source models and its proficiency across various agent tasks involving human communication, tool usage, and interaction under fully- and partially- observable environments. The harmonization between natural and programming languages enables Lemur-Chat to significantly narrow the gap with proprietary models on agent abilities, providing key insights into developing advanced open-source agents adept at reasoning, planning, and operating seamlessly across environments. Our model and code have been open-sourced at https://github.com/OpenLemur/Lemur.

NeurIPS Conference 2024 Conference Paper

MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs

  • Zhongshen Zeng
  • Yinhong Liu
  • Yingjia Wan
  • Jingyao Li
  • Pengguang Chen
  • Jianbo Dai
  • Yuxuan Yao
  • Rongwu Xu

Large language models (LLMs) have shown increasing capability in problem-solving and decision-making, largely based on the step-by-step chain-of-thought reasoning processes. However, evaluating these reasoning abilities has become increasingly challenging. Existing outcome-based benchmarks are beginning to saturate, becoming less effective in tracking meaningful progress. To address this, we present a process-based benchmark MR-Ben that demands a meta-reasoning skill, where LMs are asked to locate and analyse potential errors in automatically generated reasoning steps. Our meta-reasoning paradigm is especially suited for system-2 slow thinking, mirroring the human cognitive process of carefully examining assumptions, conditions, calculations, and logic to identify mistakes. MR-Ben comprises 5, 975 questions curated by human experts across a wide range of subjects, including physics, chemistry, logic, coding, and more. Through our designed metrics for assessing meta-reasoning on this benchmark, we identify interesting limitations and weaknesses of current LLMs (open-source and closed-source models). For example, with models like the o1 series from OpenAI demonstrating strong performance by effectively scrutinizing the solution space, many other state-of-the-art models fall significantly behind on MR-Ben, exposing potential shortcomings in their training strategies and inference methodologies.

NeurIPS Conference 2024 Conference Paper

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

  • Songlin Yang
  • Bailin Wang
  • Yu Zhang
  • Yikang Shen
  • Yoon Kim

Transformers with linear attention (i. e. , linear transformers) and state-space models have recently been suggested as a viable linear-time alternative to transformers with softmax attention. However, these models still underperform transformers especially on tasks that require in-context retrieval. While more expressive variants of linear transformers which replace the additive update in linear transformers with the delta rule (DeltaNet) have been found to be more effective at associative recall, existing algorithms for training such models do not parallelize over sequence length and are thus inefficient to train on modern hardware. This work describes a hardware-efficient algorithm for training linear transformers with the delta rule, which exploits a memory-efficient representation for computing products of Householder matrices. This algorithm allows us to scale up DeltaNet to standard language modeling settings. We train a 1. 3B model for 100B tokens and find that it outperforms recent linear-time baselines such as Mamba and GLA in terms of perplexity and zero-shot performance on downstream tasks. We also experiment with two hybrid models which combine DeltaNet layers with (1) sliding-window attention layers every other layer or (2) two global attention layers, and find that these hybrids outperform strong transformer baselines.

ICLR Conference 2024 Conference Paper

Phenomenal Yet Puzzling: Testing Inductive Reasoning Capabilities of Language Models with Hypothesis Refinement

  • Linlu Qiu
  • Liwei Jiang
  • Ximing Lu
  • Melanie Sclar
  • Valentina Pyatkin
  • Chandra Bhagavatula
  • Bailin Wang
  • Yoon Kim

The ability to derive underlying principles from a handful of observations and then generalize to novel situations---known as inductive reasoning---is central to human intelligence. Prior work suggests that language models (LMs) often fall short on inductive reasoning, despite achieving impressive success on research benchmarks. In this work, we conduct a systematic study of the inductive reasoning capabilities of LMs through $\textit{iterative hypothesis refinement}$, a technique that more closely mirrors the human inductive process than standard input-output prompting. Iterative hypothesis refinement employs a three-step process: proposing, selecting, and refining hypotheses in the form of textual rules. By examining the intermediate rules, we observe that LMs are phenomenal $\textit{hypothesis proposers}$ (i.e., generating candidate rules), and when coupled with a (task-specific) symbolic interpreter that is able to systematically filter the proposed set of rules, this hybrid approach achieves strong results across inductive reasoning benchmarks that require inducing causal relations, language-like instructions, and symbolic concepts. However, they also behave as puzzling $\textit{inductive reasoners}$, showing notable performance gaps between rule induction (i.e., identifying plausible rules) and rule application (i.e., applying proposed rules to instances), suggesting that LMs are proposing hypotheses without being able to actually apply the rules. Through empirical and human analyses, we further reveal several discrepancies between the inductive reasoning processes of LMs and humans, shedding light on both the potentials and limitations of using LMs in inductive reasoning tasks.

NeurIPS Conference 2023 Conference Paper

Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs

  • Jinyang Li
  • Binyuan Hui
  • Ge Qu
  • Jiaxi Yang
  • Binhua Li
  • Bowen Li
  • Bailin Wang
  • Bowen Qin

Text-to-SQL parsing, which aims at converting natural language instructions into executable SQLs, has gained increasing attention in recent years. In particular, GPT-4 and Claude-2 have shown impressive results in this task. However, most of the prevalent benchmarks, i. e. , Spider, and WikiSQL, focus on database schema with few rows of database contents leaving the gap between academic study and real-world applications. To mitigate this gap, we present BIRD, a BIg benchmark for laRge-scale Database grounded in text-to-SQL tasks, containing 12, 751 pairs of text-to-SQL data and 95 databases with a total size of 33. 4 GB, spanning 37 professional domains. Our emphasis on database values highlights the new challenges of dirty database contents, external knowledge between NL questions and database contents, and SQL efficiency, particularly in the context of massive databases. To solve these problems, text-to-SQL models must feature database value comprehension in addition to semantic parsing. The experimental results demonstrate the significance of database values in generating accurate text-to-SQLs for big databases. Furthermore, even the most popular and effective text-to-SQL models, i. e. GPT-4, only achieve 54. 89% in execution accuracy, which is still far from the human result of 92. 96%, proving that challenges still stand. We also provide an efficiency analysis to offer insights into generating text-to-efficient-SQLs that are beneficial to industries. We believe that BIRD will contribute to advancing real-world applications of text-to-SQL research. The leaderboard and source code are available: https: //bird-bench. github. io/.

AAAI Conference 2023 Short Paper

Explaining Large Language Model-Based Neural Semantic Parsers (Student Abstract)

  • Daking Rai
  • Yilun Zhou
  • Bailin Wang
  • Ziyu Yao

While large language models (LLMs) have demonstrated strong capability in structured prediction tasks such as semantic parsing, few amounts of research have explored the underlying mechanisms of their success. Our work studies different methods for explaining an LLM-based semantic parser and qualitatively discusses the explained model behaviors, hoping to inspire future research toward better understanding them.

NeurIPS Conference 2023 Conference Paper

Grammar Prompting for Domain-Specific Language Generation with Large Language Models

  • Bailin Wang
  • Zi Wang
  • Xuezhi Wang
  • Yuan Cao
  • Rif A. Saurous
  • Yoon Kim

Large language models (LLMs) can learn to perform a wide range of natural language tasks from just a handful of in-context examples. However, for generating strings from highly structured languages (e. g. , semantic parsing to complex domain-specific languages), it is challenging for the LLM to generalize from just a few exemplars. We propose \emph{grammar prompting}, a simple approach to enable LLMs to use external knowledge and domain-specific constraints, expressed through a grammar in Backus--Naur Form (BNF), during in-context learning. Grammar prompting augments each demonstration example with a specialized grammar that is minimally sufficient for generating the particular output example, where the specialized grammar is a subset of the full DSL grammar. For inference, the LLM first predicts a BNF grammar given a test input, and then generates the output according to the rules of the grammar. Experiments demonstrate that grammar prompting can enable LLMs to perform competitively on a diverse set of DSL generation tasks, including semantic parsing (SMCalFlow, Overnight, GeoQuery), PDDL planning, and SMILES-based molecule generation.

ICLR Conference 2021 Conference Paper

GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing

  • Tao Yu 0009
  • Chien-Sheng Wu
  • Xi Victoria Lin
  • Bailin Wang
  • Yi Chern Tan
  • Xinyi Yang 0002
  • Dragomir Radev
  • Richard Socher

We present GraPPa, an effective pre-training approach for table semantic parsing that learns a compositional inductive bias in the joint representations of textual and tabular data. We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar (SCFG). We pre-train our model on the synthetic data to inject important structural properties commonly found in semantic parsing into the pre-training language model. To maintain the model's ability to represent real-world data, we also include masked language modeling (MLM) on several existing table-related datasets to regularize our pre-training process. Our proposed pre-training strategy is much data-efficient. When incorporated with strong base semantic parsers, GraPPa achieves new state-of-the-art results on four popular fully supervised and weakly supervised table semantic parsing tasks.

NeurIPS Conference 2021 Conference Paper

Structured Reordering for Modeling Latent Alignments in Sequence Transduction

  • Bailin Wang
  • Mirella Lapata
  • Ivan Titov

Despite success in many domains, neural models struggle in settings where train and test examples are drawn from different distributions. In particular, in contrast to humans, conventional sequence-to-sequence (seq2seq) models fail to generalize systematically, i. e. , interpret sentences representing novel combinations of concepts (e. g. , text segments) seen in training. Traditional grammar formalisms excel in such settings by implicitly encoding alignments between input and output segments, but are hard to scale and maintain. Instead of engineering a grammar, we directly model segment-to-segment alignments as discrete structured latent variables within a neural seq2seq model. To efficiently explore the large space of alignments, we introduce a reorder-first align-later framework whose central component is a neural reordering module producing separable permutations. We present an efficient dynamic programming algorithm performing exact marginal inference of separable permutations, and, thus, enabling end-to-end differentiable training of our model. The resulting seq2seq model exhibits better systematic generalization than standard models on synthetic problems and NLP tasks (i. e. , semantic parsing and machine translation).

AAAI Conference 2018 Conference Paper

Learning Latent Opinions for Aspect-level Sentiment Classification

  • Bailin Wang
  • Wei Lu

Aspect-level sentiment classification aims at detecting the sentiment expressed towards a particular target in a sentence. Based on the observation that the sentiment polarity is often related to specific spans in the given sentence, it is possible to make use of such information for better classification. On the other hand, such information can also serve as justifications associated with the predictions. We propose a segmentation attention based LSTM model which can effectively capture the structural dependencies between the target and the sentiment expressions with a linear-chain conditional random field (CRF) layer. The model simulates human’s process of inferring sentiment information when reading: when given a target, humans tend to search for surrounding relevant text spans in the sentence before making an informed decision on the underlying sentiment information. We perform sentiment classification tasks on publicly available datasets on online reviews across different languages from SemEval tasks and social comments from Twitter. Extensive experiments show that our model achieves the state-of-the-art performance while extracting interpretable sentiment expressions.