Arrow Research search

Author name cluster

Junyang Lin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

24 papers
2 author rows

Possible papers (24)

AAAI Conference 2026 Conference Paper

Towards Better Correctness and Efficiency in Code Generation

  • Yunlong Feng
  • Yang Xu
  • Xiao Xu
  • Binyuan Hui
  • Junyang Lin

While code large language models have demonstrated remarkable progress in code generation, the generated code often exhibits poor runtime efficiency, limiting its practical application in performance-sensitive scenarios. To address this limitation, we propose an efficiency-oriented reinforcement learning framework guided by a novel performance reward. Based on this framework, we take a deeper dive into the code efficiency problem, identifying key bottlenecks and proposing methods to overcome them: (1) dynamic exploration overcomes the static data constraints of offline fine-tuning, enabling the discovery of more efficient code implementations; (2) an error-insensitive reinforcement learning method and high-contrast efficiency signals are crucial for mitigating systematic errors and achieving effective optimization; (3) online exploration is most effective when starting from a high-correctness baseline, as this allows efficiency improvements without sacrificing accuracy. Building on these findings, we propose a two-stage tuning method that achieves high and balanced performance across correctness and efficiency. Experimental results show the effectiveness of the method, which improves code correctness by 10.18% and runtime efficiency by 7.75% on a 7B model, achieving performance comparable to much larger models.
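
A minimal sketch of how a correctness-gated runtime-efficiency reward could look; the function names, timing scheme, and clipping are illustrative assumptions, not the paper's actual reward design.

```python
import time
from typing import Callable, List, Tuple

def efficiency_reward(candidate: Callable, reference: Callable,
                      tests: List[Tuple[tuple, object]]) -> float:
    """Hypothetical performance reward: 0 if any test fails, otherwise the
    candidate's speedup over a reference solution, clipped to [0, 1]."""
    t_cand, t_ref = 0.0, 0.0
    for args, expected in tests:
        start = time.perf_counter()
        out = candidate(*args)
        t_cand += time.perf_counter() - start
        if out != expected:                      # correctness gate: wrong code earns nothing
            return 0.0
        start = time.perf_counter()
        reference(*args)
        t_ref += time.perf_counter() - start
    return min(1.0, t_ref / max(t_cand, 1e-9))   # near 1 when candidate matches or beats the reference
```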

ICLR Conference 2025 Conference Paper

A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

  • Liang Chen 0024
  • Sinan Tan
  • Zefan Cai
  • Weichu Xie
  • Haozhe Zhao
  • Yichi Zhang 0010
  • Junyang Lin
  • Jinze Bai

This work tackles the information loss bottleneck of vector-quantization (VQ) autoregressive image generation by introducing a novel model architecture called the 2-Dimensional Autoregression (DnD) Transformer. The DnD-Transformer predicts more codes for an image by introducing a new direction, **model depth**, along with the sequence length. Compared to 1D autoregression and previous work using similar 2D image decomposition such as RQ-Transformer, the DnD-Transformer is an end-to-end model that can generate higher quality images with the same backbone model size and sequence length, opening a new optimization perspective for autoregressive image generation. Furthermore, our experiments reveal that the DnD-Transformer's potential extends beyond generating natural images. It can even generate images with rich text and graphical elements in a self-supervised manner, demonstrating an understanding of these combined modalities. This has not been previously demonstrated for popular vision generative models such as diffusion models, showing a spark of vision-language intelligence when trained solely on images. Code, datasets, and models are open-sourced at https://github.com/chenllliang/DnD-Transformer.

NeurIPS Conference 2025 Conference Paper

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

  • Shenzhi Wang
  • Le Yu
  • Chang Gao
  • Chujie Zheng
  • Shixuan Liu
  • Rui Lu
  • Kai Dang
  • Xiong-Hui Chen

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), yet its underlying mechanisms remain insufficiently understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction (approximately 20%) of tokens exhibit high entropy, and these tokens semantically act as critical forks that steer the model toward diverse reasoning pathways. We further demonstrate that moderately increasing the entropy of these high-entropy tokens via decoding temperature adjustments leads to improved performance, quantitatively confirming their role as decision points in reasoning. We ultimately refine RLVR by restricting policy gradient updates to these forking tokens. Despite utilizing only 20% of tokens, our approach achieves comparable performance to full-gradient updates on the Qwen3-8B base model. Moreover, it demonstrates remarkable improvements on the larger Qwen3-32B base model, boosting AIME'25 scores by 11.04 and AIME'24 scores by 7.71. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that dictate key reasoning directions. Collectively, our results suggest promising avenues for optimizing RLVR algorithms by strategically leveraging the potential of these high-entropy minority tokens to further enhance the reasoning abilities of LLMs.
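
A small sketch of the core selection step, assuming per-token policy logits are available for a sampled rollout; the resulting top-20% entropy mask could then weight a policy-gradient loss so only "forking" tokens receive updates. This is an illustration of the idea, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def forking_token_mask(logits: torch.Tensor, keep_ratio: float = 0.2) -> torch.Tensor:
    """Select the top `keep_ratio` highest-entropy positions in a rollout.

    logits: [seq_len, vocab] per-step policy logits for the generated tokens.
    Returns a boolean mask of shape [seq_len]; multiplying a per-token
    policy-gradient loss by this mask restricts updates to high-entropy tokens.
    """
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)   # [seq_len]
    k = max(1, int(keep_ratio * entropy.numel()))
    threshold = torch.topk(entropy, k).values.min()                      # entropy cutoff for the top-k tokens
    return entropy >= threshold
```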

NeurIPS Conference 2025 Conference Paper

CARE: Decoding-Time Safety Alignment via Rollback and Introspection Intervention

  • Xiaomeng Hu
  • Fei Huang
  • Chenhan Yuan
  • Junyang Lin
  • Tsung-Yi Ho

As large language models (LLMs) are increasingly deployed in real-world applications, ensuring the safety of their outputs during decoding has become a critical challenge. However, existing decoding-time interventions, such as Contrastive Decoding, often force a severe trade-off between safety and response quality. In this work, we propose CARE, a novel framework for decoding-time safety alignment that integrates three key components: (1) a guard model for real-time safety monitoring, enabling detection of potentially unsafe content; (2) a rollback mechanism with a token buffer to correct unsafe outputs efficiently at an earlier stage without disrupting the user experience; and (3) a novel introspection-based intervention strategy, where the model generates self-reflective critiques of its previous outputs and incorporates these reflections into the context to guide subsequent decoding steps. The framework achieves a superior safety-quality trade-off by using its guard model for precise interventions, its rollback mechanism for timely corrections, and our novel introspection method for effective self-correction. Experimental results demonstrate that our framework achieves a superior balance of safety, quality, and efficiency, attaining a low harmful response rate and minimal disruption to the user experience while maintaining high response quality.
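
The decoding loop below is a schematic reading of the three components (guard model, rollback buffer, introspection); every API on `model`, `guard`, and `tokenizer` is an assumption made for illustration, not CARE's released interface.

```python
def care_decode(model, guard, tokenizer, prompt, buffer_size=16, max_new=256, max_rollbacks=4):
    """Schematic decoding loop: generate into a small token buffer, let a guard
    model screen it, and on an unsafe flag roll the buffer back while appending
    an introspective critique to the context to steer subsequent decoding."""
    context = list(tokenizer.encode(prompt))
    output, buffer, rollbacks = [], [], 0
    while len(output) + len(buffer) < max_new:
        buffer.append(model.next_token(context + output + buffer))      # assumed helper
        if len(buffer) < buffer_size:
            continue
        if rollbacks < max_rollbacks and guard.is_unsafe(tokenizer.decode(output + buffer)):
            critique = model.reflect(output + buffer)                   # self-critique of the unsafe draft
            context = context + critique                                # introspection guides the retry
            buffer, rollbacks = [], rollbacks + 1                       # roll back the unsafe span
        else:
            output, buffer = output + buffer, []                        # commit the safe buffer
    return tokenizer.decode(output + buffer)
```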

ICML Conference 2025 Conference Paper

CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration

  • Haoyun Jiang
  • Haolin Li 0001
  • Jianwei Zhang 0012
  • Fei Huang 0005
  • Qiang Hu 0003
  • Minmin Sun
  • Shuai Xiao 0002
  • Yong Li 0020

Large language models (LLMs) have demonstrated strong capabilities in handling long-context tasks, but processing such long contexts remains challenging due to the substantial memory requirements and inference latency. In this work, we discover that certain attention heads exhibit sequential consistency in their attention patterns, which can be persistently identified using a coefficient-of-variation-based algorithm. Inspired by this observation, we propose CateKV, a hybrid KV cache method that retains only critical token information for consistent heads, thereby reducing KV cache size and computational overhead, while preserving the majority of KV pairs in adaptive heads to ensure high accuracy. We show the unique characteristics of our algorithm and its extension with existing acceleration methods. Comprehensive evaluations on long-context benchmarks show that, while maintaining accuracy comparable to full attention, CateKV reduces memory usage by up to $2.72\times$ and accelerates decoding by $2.18\times$ in single-sample inputs, and boosts throughput by $3.96\times$ in batch scenarios.
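
One way the coefficient-of-variation idea could be instantiated, assuming softmaxed attention maps are available for calibration; the statistic and threshold here are placeholders rather than the paper's exact criterion.

```python
import torch

def consistent_heads(attn_scores: torch.Tensor, cv_threshold: float = 0.5) -> torch.Tensor:
    """Flag heads whose attention pattern is stable across query positions.

    attn_scores: [num_heads, num_queries, num_keys] softmaxed attention weights.
    Returns a boolean tensor [num_heads]; True marks a 'consistent' head whose
    KV cache could be pruned to critical tokens only, while other heads keep
    most of their KV pairs.
    """
    per_key_mass = attn_scores.mean(dim=1)                     # [heads, keys] average mass per key
    cv = attn_scores.std(dim=1) / (per_key_mass + 1e-9)        # variation of that mass across queries
    return cv.mean(dim=-1) < cv_threshold                      # low variation => sequentially consistent head
```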

NeurIPS Conference 2025 Conference Paper

Chain of Execution Supervision Promotes General Reasoning in Large Language Models

  • Nuo Chen
  • Zehua Li
  • Keqin Bao
  • Junyang Lin
  • Dayiheng Liu

Building robust and general reasoning ability is a central goal in the development of large language models (LLMs). Recent efforts increasingly turn to code as a rich training source, given its inherent logical structure and diverse reasoning paradigms such as divide-and-conquer, topological ordering, and enumeration. However, reasoning in code is often expressed implicitly and entangled with syntactic or implementation noise, making direct training on raw code suboptimal. To address this, we introduce TracePile, a large-scale corpus of 2.6 million samples that transforms code execution into explicit, step-by-step chain-of-thought-style rationales, which we call Chain of Execution (CoE). The corpus spans domains including mathematics, classical algorithms, and algorithmic competition, and is enriched with variable-tracing questions and code rewritings to enhance logical granularity and code diversity. We evaluate TracePile using three training setups: continued pre-training, instruction tuning after pre-training, and two-stage fine-tuning. Experiments across four base models (LLaMA 3, LLaMA 3.1, Qwen-2.5, and Qwen-2.5 Coder) and 20 benchmarks covering math, code, logic, and algorithms demonstrate consistent improvements. Notably, TracePile boosts LLaMA3-8B by 9.2% on average across nine math datasets and delivers clear gains on LiveCodeBench, CRUX, and Zebra Logic under two-stage fine-tuning.

ICLR Conference 2025 Conference Paper

DataMan: Data Manager for Pre-training Large Language Models

  • Ru Peng
  • Kexin Yang 0002
  • Yawen Zeng
  • Junyang Lin
  • Dayiheng Liu
  • Junbo Zhao 0002

The performance emergence of large language models (LLMs) driven by data scaling laws makes the selection of pre-training data increasingly important. However, existing methods rely on limited heuristics and human intuition, lacking comprehensive and clear guidelines. To address this, we are inspired by *"reverse thinking"*: prompting LLMs to self-identify which criteria benefit their performance. As pre-training capability is related to perplexity (PPL), we derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing. In this paper, we train a **Data Man**ager (**DataMan**) to learn quality ratings and domain recognition from pointwise ratings, and use it to annotate a 447B-token pre-training corpus with 14 quality ratings and a domain type. Our experiments validate the approach: using DataMan to select 30B tokens to train a 1.3B-parameter language model yields significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline. The best-performing model, based on the *Overall Score l=5*, surpasses a model trained on 50% more data with uniform sampling. We continue pre-training with high-rated, domain-specific data annotated by DataMan to enhance domain-specific ICL performance, verifying DataMan's domain-mixing ability. Our findings emphasize the importance of quality ranking, the complementary nature of the quality criteria, and their low correlation with perplexity, and we analyze the misalignment between PPL and ICL performance. We also thoroughly analyze our pre-training dataset, examining its composition, the distribution of quality ratings, and the original document sources.

ICML Conference 2025 Conference Paper

Efficient Long Context Fine-tuning with Chunk Flow

  • Xiulong Yuan
  • Hongtao Xu
  • Wenting Shen
  • Ang Wang
  • Xiafei Qiu
  • Jie Zhang 0135
  • Yuqiong Liu
  • Bowen Yu 0002

Long context fine-tuning of large language models (LLMs) involves training on datasets that are predominantly composed of short sequences and a small proportion of longer sequences. However, existing approaches overlook this long-tail distribution and employ training strategies designed specifically for long sequences. Moreover, these approaches also fail to address the challenges posed by variable sequence lengths during distributed training, such as load imbalance in data parallelism and severe pipeline bubbles in pipeline parallelism. These issues lead to suboptimal training performance and poor GPU resource utilization. To tackle these problems, we propose a chunk-centric training method named ChunkFlow. ChunkFlow reorganizes input sequences into uniformly sized chunks by consolidating short sequences and splitting longer ones. This approach achieves optimal computational efficiency and balance among training inputs. Additionally, ChunkFlow incorporates a state-aware chunk scheduling mechanism to ensure that the peak memory usage during training is primarily determined by the chunk size rather than the maximum sequence length in the dataset. Integrating this scheduling mechanism with existing pipeline scheduling algorithms further enhances the performance of distributed training. Experimental results demonstrate that, compared with Megatron-LM, ChunkFlow can be up to 4.53x faster in the long context fine-tuning of LLMs. Furthermore, we believe that ChunkFlow serves as an effective solution for a broader range of scenarios, such as long context continual pre-training, where datasets contain variable-length sequences.
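
A simplified sketch of the chunk-reorganization step alone (packing short sequences together and splitting long ones into fixed-size chunks); the state-aware scheduling and pipeline integration described above are not shown, and padding/attention-boundary handling is omitted.

```python
def to_chunks(sequences, chunk_size):
    """Repack variable-length token sequences into fixed-size chunks.

    sequences: list of token lists of arbitrary length.
    Short sequences are packed together; long sequences are split across chunks.
    """
    chunks, current = [], []
    for seq in sequences:
        start = 0
        while start < len(seq):
            room = chunk_size - len(current)
            current.extend(seq[start:start + room])
            start += room
            if len(current) == chunk_size:
                chunks.append(current)
                current = []
    if current:
        chunks.append(current)   # last partial chunk (would be padded in practice)
    return chunks
```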

AAAI Conference 2025 Conference Paper

Fine-Tuning Language Models with Collaborative and Semantic Experts

  • Jiaxi Yang
  • Binyuan Hui
  • Min Yang
  • Jian Yang
  • Lei Zhang
  • Qiang Qu
  • Junyang Lin

Recent advancements in large language models (LLMs) have broadened their application scope but revealed challenges in balancing capabilities across general knowledge, coding, and mathematics. To address this, we introduce a Collaborative and Semantic Experts (CoE) approach for supervised fine-tuning (SFT), which employs a two-phase training strategy. Initially, expert training fine-tunes the feed-forward network on specialized datasets, developing distinct experts in targeted domains. Subsequently, expert leveraging synthesizes these trained experts into a structured model with semantic guidance to activate specific experts, enhancing performance and interpretability. Evaluations on comprehensive benchmarks across MMLU, HumanEval, GSM8K, MT-Bench, and AlpacaEval confirm CoE's efficacy, demonstrating improved performance and expert collaboration in diverse tasks, significantly outperforming traditional SFT methods.

NeurIPS Conference 2025 Conference Paper

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

  • Zihan Qiu
  • Zekun Wang
  • Bo Zheng
  • Zeyu Huang
  • Kaiyue Wen
  • Songlin Yang
  • Rui Men
  • Le Yu

Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and softmax attention. Yet, the existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we perform a comprehensive comparison over 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5-trillion-token dataset. Our central finding is that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find this sparse gating mechanism mitigates massive activations and attention sinks, and enhances long-context extrapolation performance. We also release related code (https://github.com/qiuzh20/gated_attention) and models (https://huggingface.co/QwQZh/gated_attention) to facilitate future research. Furthermore, the most effective SDPA output gating is used in the Qwen3-Next models (https://huggingface.co/collections/Qwen/qwen3-next).
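
A minimal sketch of the head-specific sigmoid gate applied to the SDPA output, the variant the abstract identifies as most effective; the module layout and the choice of deriving the gate from the layer input are assumptions for illustration, so consult the released code for the exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSDPA(nn.Module):
    """Sketch: a query-dependent, head-specific sigmoid gate on the SDPA output."""
    def __init__(self, num_heads: int, head_dim: int, hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, num_heads * head_dim)  # gate computed from the layer input
        self.num_heads, self.head_dim = num_heads, head_dim

    def forward(self, q, k, v, x):
        # q, k, v: [batch, heads, seq, head_dim]; x: [batch, seq, hidden] layer input
        attn_out = F.scaled_dot_product_attention(q, k, v)
        gate = torch.sigmoid(self.gate_proj(x))                   # [batch, seq, heads * head_dim]
        gate = gate.view(x.shape[0], x.shape[1], self.num_heads, self.head_dim).transpose(1, 2)
        return attn_out * gate                                     # sparse, query-dependent gating per head
```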

ICML Conference 2025 Conference Paper

MARGE: Improving Math Reasoning with Guided Exploration

  • Jingyue Gao
  • Runji Lin
  • Keming Lu
  • Bowen Yu 0002
  • Junyang Lin
  • Jianyu Chen 0002

Large Language Models (LLMs) exhibit strong potential in mathematical reasoning, yet their effectiveness is often limited by a shortage of high-quality queries. This limitation necessitates scaling up computation on responses through self-generation, yet current methods struggle with spuriously correlated data caused by ineffective exploration across all reasoning stages. To address this challenge, we introduce MARGE (improving Math Reasoning with Guided Exploration), a novel method that enhances mathematical reasoning through hit-guided exploration. MARGE systematically explores intermediate reasoning states derived from self-generated solutions, enabling adequate exploration and improved credit assignment throughout the reasoning process. Notably, MARGE improves both single-shot accuracy and exploration diversity, mitigating a common trade-off in alignment methods. These results demonstrate MARGE's effectiveness in enhancing mathematical reasoning capabilities and unlocking the potential of scaling self-generated training data.

ICLR Conference 2025 Conference Paper

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

  • Xingyao Wang 0002
  • Boxuan Li
  • Yufan Song
  • Frank F. Xu
  • Xiangru Tang
  • Mingchen Zhuge
  • Jiayi Pan
  • Yueqi Song

Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and effect change in their surrounding environments. In this paper, we introduce OpenHands, a platform for the development of powerful and flexible AI agents that interact with the world in similar ways to a human developer: by writing code, interacting with a command line, and browsing the web. We describe how the platform allows for the implementation of new agents, utilization of various LLMs, safe interaction with sandboxed environments for code execution, and incorporation of evaluation benchmarks. Based on our currently incorporated benchmarks, we perform an evaluation of agents over 13 challenging tasks, including software engineering (e.g., SWE-Bench) and web browsing (e.g., WebArena), amongst others. Released under the permissive MIT license, OpenHands is a community project spanning academia and industry with more than 2K contributions from over 186 contributors in less than six months of development, and will improve going forward.

NeurIPS Conference 2025 Conference Paper

Parallel Scaling Law for Language Models

  • Mouxiang Chen
  • Binyuan Hui
  • Zeyu Cui
  • Jiaxi Yang
  • Dayiheng Liu
  • Jianling Sun
  • Junyang Lin
  • Zhongxin Liu

It is commonly believed that scaling language models requires committing significant space or time costs, by increasing the parameters (parameter scaling) or the number of output tokens (inference-time scaling). We introduce another, more inference-efficient scaling paradigm: increasing the model's parallel computation during both training and inference. We apply $P$ diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the $P$ outputs. This method, parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with $P$ parallel streams is comparable to scaling the parameters by $\mathcal{O}(\log P)$ while showing superior inference efficiency. For example, ParScale can use up to 22$\times$ less memory increase and 6$\times$ less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small number of tokens, further reducing the training budget. The new scaling law we discover potentially facilitates the deployment of more powerful models in low-resource scenarios, and provides an alternative perspective on the role of computation in machine learning. Our code and 67 trained model checkpoints are publicly available at https://github.com/QwenLM/ParScale and https://huggingface.co/ParScale.
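
An illustrative module showing the three ingredients of parallel scaling as described above: $P$ learnable input transforms, $P$ parallel forward passes through shared parameters, and dynamic aggregation of the outputs. The backbone interface and the aggregation head are assumptions for this sketch, not the released implementation.

```python
import torch
import torch.nn as nn

class ParallelScaled(nn.Module):
    """Sketch: P input transforms, P forward passes over one backbone, learned aggregation."""
    def __init__(self, backbone: nn.Module, hidden: int, p: int):
        super().__init__()
        self.backbone = backbone                                   # assumed to map [B, S, H] -> [B, S, H]
        self.transforms = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(p)])
        self.mix = nn.Linear(hidden, p)                            # produces dynamic aggregation weights

    def forward(self, x):
        # x: [batch, seq, hidden] embedded input
        outs = torch.stack([self.backbone(t(x)) for t in self.transforms], dim=0)  # [P, B, S, H]
        weights = torch.softmax(self.mix(x).mean(dim=1), dim=-1)                    # [B, P] per-sample weights
        return torch.einsum('pbsh,bp->bsh', outs, weights)                          # weighted sum of P streams
```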

NeurIPS Conference 2025 Conference Paper

PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts

  • Yiming Wang
  • Pei Zhang
  • Jialong Tang
  • Hao-Ran Wei
  • Baosong Yang
  • Rui Wang
  • Chenshu Sun
  • Feitong Sun

In this paper, we introduce PolyMath, a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs. We conduct a comprehensive evaluation of advanced LLMs and find that even Qwen3-235B-A22B-Thinking and Gemini-2.5-Pro achieve benchmark scores of only 54.6 and 52.2, with about 40% accuracy at the highest difficulty level. From a language perspective, our benchmark reveals several key challenges of LLMs in multilingual reasoning: (1) reasoning performance varies widely across languages for current LLMs; (2) input-output language consistency is low in reasoning LLMs and may be correlated with performance; (3) thinking length differs significantly by language for current LLMs. Additionally, we demonstrate that controlling the output language in the instructions can affect reasoning performance, especially for some low-resource languages, suggesting a promising direction for improving multilingual capabilities in LLMs.

ICLR Conference 2025 Conference Paper

Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference

  • Ke Yi 0003
  • Zengke Liu
  • Jianwei Zhang 0012
  • Chengyuan Li
  • Tong Zhang 0015
  • Junyang Lin
  • Jingren Zhou 0001

Large language models have demonstrated promising capabilities upon scaling up parameters. However, serving large language models incurs substantial computation and memory movement costs due to their large scale. Quantization methods have been employed to reduce serving costs and latency. Nevertheless, outliers in activations hinder the development of INT4 weight-activation quantization. Existing approaches separate outliers and normal values into two matrices or migrate outliers from activations to weights, suffering from high latency or accuracy degradation. Based on our observation of activations in large language models, outliers can be classified into channel-wise and spike outliers. In this work, we propose Rotated Runtime Smooth (**RRS**), a plug-and-play activation smoother for quantization, consisting of Runtime Smooth and a rotation operation. Runtime Smooth (**RS**) eliminates **channel-wise outliers** by smoothing activations with channel-wise maximums during runtime. The rotation operation narrows the gap between **spike outliers** and normal values, alleviating the effect of victims caused by channel-wise smoothing. The proposed method outperforms the state-of-the-art method on the LLaMA and Qwen families and improves WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.
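
A sketch of the Runtime Smooth step alone, assuming a linear layer computing `y = x @ W.T`: dividing activations by their per-channel runtime maxima and folding the same scale into the weight leaves the product unchanged while flattening channel-wise outliers. The rotation step for spike outliers is omitted, and this is an illustration rather than the paper's kernel-level implementation.

```python
import torch

def runtime_smooth(activations: torch.Tensor, weight: torch.Tensor):
    """Smooth channel-wise outliers before quantization.

    activations: [tokens, in_features]; weight: [out_features, in_features].
    Returns (smoothed activations, adjusted weight) such that
    smoothed_act @ adjusted_weight.T == activations @ weight.T.
    """
    scale = activations.abs().amax(dim=0).clamp_min(1e-5)   # per-channel runtime maximum
    smoothed_act = activations / scale                        # outlier channels compressed toward normal range
    adjusted_weight = weight * scale                          # fold the scale into the weight columns
    return smoothed_act, adjusted_weight
```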

ICML Conference 2025 Conference Paper

Synthesizing Software Engineering Data in a Test-Driven Manner

  • Lei Zhang 0201
  • Jiaxi Yang 0004
  • Min Yang 0007
  • Jian Yang 0003
  • Mouxiang Chen
  • Jiajun Zhang 0012
  • Zeyu Cui
  • Binyuan Hui

We introduce SWE-Flow, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering data that rely on human-submitted issues, SWE-Flow automatically infers incremental development steps directly from unit tests, which inherently encapsulate high-level requirements. The core of SWE-Flow is the construction of a Runtime Dependency Graph (RDG), which precisely captures function interactions, enabling the generation of a structured, step-by-step development schedule. At each step, SWE-Flow produces a partial codebase, the corresponding unit tests, and the necessary code modifications, resulting in fully verifiable TDD tasks. With this approach, we generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the SWE-Flow-Eval benchmark. Our experiments show that fine-tuning open models on this dataset significantly improves performance in TDD-based coding. To facilitate further research, we release all code, datasets, models, and Docker images on GitHub.

NeurIPS Conference 2025 Conference Paper

Teaching Language Models to Reason with Tools

  • Chengpeng Li
  • Zhengyang Tang
  • Ziniu Li
  • Mingfeng Xue
  • Keqin Bao
  • Tian Ding
  • Ruoyu Sun
  • Benyou Wang

Large reasoning models (LRMs) like OpenAI-o1 have shown impressive capabilities in natural language reasoning. However, these models frequently demonstrate inefficiencies or inaccuracies when tackling complex mathematical operations. While integrating computational tools such as Code Interpreters (CIs) offers a promising solution, it introduces a critical challenge: a conflict between the model's internal, probabilistic reasoning and the external, deterministic knowledge provided by the CI, which often leads models to unproductive deliberation. To overcome this, we introduce CoRT (Code-Optimized Reasoning Training), a post-training framework designed to teach LRMs to effectively utilize CIs. We propose Hint-Engineering, a new data synthesis strategy that strategically injects diverse hints at optimal points within reasoning paths. This approach generates high-quality, code-integrated reasoning data specifically tailored to optimize LRM-CI interaction. Using this method, we have synthesized 30 high-quality samples to post-train models ranging from 1.5B to 32B parameters through supervised fine-tuning. CoRT further refines the multi-round interleaving of external CI usage and internal thinking by employing rejection sampling and reinforcement learning. Our experimental evaluations demonstrate CoRT's effectiveness, yielding absolute improvements of 4% and 8% on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B, respectively, across five challenging mathematical reasoning datasets. Moreover, CoRT significantly enhances efficiency, reducing token usage by approximately 30% for the 32B model and 50% for the 1.5B model compared to pure natural language reasoning baselines. The models and code are available at: this url.

ICLR Conference 2024 Conference Paper

#InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models

  • Keming Lu
  • Hongyi Yuan
  • Zheng Yuan 0002
  • Runji Lin
  • Junyang Lin
  • Chuanqi Tan
  • Chang Zhou 0005
  • Jingren Zhou 0001

Pre-trained large language models (LLMs) can understand and align with human instructions by supervised fine-tuning (SFT). It is commonly believed that diverse and complex SFT data are of the essence to enable good instruction-following abilities. However, such diversity and complexity are obscure and lack quantitative analyses. In this work, we propose InsTag, an open-set instruction tagging method, to identify semantics and intentions of human instructions by tags that provide access to definitions and quantified analyses of instruction diversity and complexity. We obtain 6.6K fine-grained tags to describe instructions from popular open-sourced SFT datasets comprehensively. We find that the abilities of aligned LLMs benefit from more diverse and complex instructions in SFT data. Based on this observation, we propose a data sampling procedure based on InsTag, and select 6K diverse and complex samples from open-source datasets for SFT. The resulting models, TagLM, outperform open-source models based on considerably larger SFT data evaluated by MT-Bench, echoing the importance of instruction diversity and complexity and the effectiveness of InsTag. InsTag has robust potential to be extended to more applications beyond the data selection as it provides an effective way to analyze the distribution of instructions.
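
A greedy toy version of tag-based data selection in the spirit of the sampling procedure above; the tagging model itself is not shown, and the two-pass coverage-then-complexity heuristic is an assumption for illustration, not InsTag's published algorithm.

```python
def diverse_complex_sample(examples, tags, k):
    """Select k instructions, favoring unseen tags (diversity) and tag count (complexity).

    examples: list of instruction records; tags[i]: set of tags assigned to examples[i].
    """
    order = sorted(range(len(examples)), key=lambda i: len(tags[i]), reverse=True)
    chosen, covered = [], set()
    for i in order:                       # first pass: grow tag coverage
        if len(chosen) >= k:
            break
        if tags[i] - covered:
            chosen.append(i)
            covered |= tags[i]
    for i in order:                       # second pass: fill up with the most complex remaining items
        if len(chosen) >= k:
            break
        if i not in chosen:
            chosen.append(i)
    return [examples[i] for i in chosen]
```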

ICML Conference 2022 Conference Paper

Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably)

  • Yu Huang 0023
  • Junyang Lin
  • Chang Zhou
  • Hongxia Yang
  • Longbo Huang

Despite the remarkable success of deep multi-modal learning in practice, it has not been well explained in theory. Recently, it has been observed that the best uni-modal network outperforms the jointly trained multi-modal network across different combinations of modalities on various tasks, which is counter-intuitive since multiple signals would bring more information (Wang et al., 2020). This work provides a theoretical explanation for the emergence of such a performance gap in neural networks for the prevalent joint training framework. Based on a simplified data distribution that captures realistic properties of multi-modal data, we prove that for multi-modal late-fusion networks with (smoothed) ReLU activation trained jointly by gradient descent, different modalities will compete with each other and only a subset of modalities will be learned by their corresponding encoder networks. We refer to this phenomenon as modality competition, and the losing modalities, which fail to be discovered, are the origin of the sub-optimality of joint training. In contrast, for uni-modal networks with similar learning settings, we provably show that the networks will focus on learning modality-associated features. Experimentally, we illustrate that modality competition matches the intrinsic behavior of late-fusion joint training, supplementing our theoretical results. To the best of our knowledge, our work is the first theoretical treatment of the degenerating aspect of multi-modal learning in neural networks.

ICML Conference 2022 Conference Paper

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

  • Peng Wang 0028
  • An Yang
  • Rui Men
  • Junyang Lin
  • Shuai Bai
  • Zhikang Li
  • Jianxin Ma
  • Chang Zhou 0005

In this work, we pursue a unified paradigm for multimodal pretraining to break the shackles of complex task- and modality-specific customization. We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, etc., in a simple sequence-to-sequence learning framework. OFA follows instruction-based learning in both pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In comparison with recent state-of-the-art vision-and-language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new SOTAs in a series of cross-modal tasks while attaining highly competitive performance on uni-modal tasks. Our further analysis indicates that OFA can also effectively transfer to unseen tasks and unseen domains. Our code and models are publicly available at https://github.com/OFA-Sys/OFA.

NeurIPS Conference 2021 Conference Paper

CogView: Mastering Text-to-Image Generation via Transformers

  • Ming Ding
  • Zhuoyi Yang
  • Wenyi Hong
  • Wendi Zheng
  • Chang Zhou
  • Da Yin
  • Junyang Lin
  • Xu Zou

Text-to-image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer to advance this problem. We also demonstrate finetuning strategies for various downstream tasks, e.g., style learning, super-resolution, text-image ranking, and fashion design, and methods to stabilize pretraining, e.g., eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work, DALL-E.

ICML Conference 2021 Conference Paper

KNAS: Green Neural Architecture Search

  • Jingjing Xu 0001
  • Liang Zhao
  • Junyang Lin
  • Rundong Gao
  • Xu Sun 0001
  • Hongxia Yang

Many existing neural architecture search (NAS) solutions rely on downstream training for architecture evaluation, which takes enormous computation. Considering that these computations bring a large carbon footprint, this paper aims to explore a green (i.e., environmentally friendly) NAS solution that evaluates architectures without training. Intuitively, gradients, induced by the architecture itself, directly decide the convergence and generalization results. This motivates us to propose the gradient kernel hypothesis: gradients can be used as a coarse-grained proxy for downstream training to evaluate randomly initialized networks. To support the hypothesis, we conduct a theoretical analysis and find a practical gradient kernel that correlates well with training loss and validation performance. Based on this hypothesis, we propose a new kernel-based architecture search approach, KNAS. Experiments show that KNAS achieves competitive results orders of magnitude faster than "train-then-test" paradigms on image classification tasks. Furthermore, the extremely low search cost enables its wide application. The searched network also outperforms the strong baseline RoBERTa-large on two text classification tasks.
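
A rough sketch of a gradient-kernel proxy score computed at random initialization, in the spirit of the hypothesis above; the averaged Gram-matrix statistic here is a simplification for illustration, not KNAS's exact kernel.

```python
import torch

def gradient_kernel_score(model, loss_fn, batches):
    """Training-free architecture proxy: mean pairwise inner product of per-batch gradients.

    model: a randomly initialized torch.nn.Module; loss_fn: e.g. cross-entropy;
    batches: iterable of (inputs, targets). Higher scores are taken to suggest
    faster convergence under the gradient kernel hypothesis.
    """
    grads = []
    for inputs, targets in batches:
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        grads.append(torch.cat([p.grad.flatten() for p in model.parameters()
                                if p.grad is not None]))
    g = torch.stack(grads)            # [num_batches, num_params]
    kernel = g @ g.T                  # Gram matrix of per-batch gradients
    return kernel.mean().item()
```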

NeurIPS Conference 2019 Conference Paper

Understanding and Improving Layer Normalization

  • Jingjing Xu
  • Xu Sun
  • Zhiyuan Zhang
  • Guangxiang Zhao
  • Junyang Lin

Layer normalization (LayerNorm) is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where its effectiveness stems from. In this paper, our main contribution is to take a step further in understanding LayerNorm. Many previous studies believe that the success of LayerNorm comes from forward normalization. Unlike them, we find that the derivatives of the mean and variance are more important than forward normalization, in that they re-center and re-scale backward gradients. Furthermore, we find that the parameters of LayerNorm, including the bias and gain, increase the risk of over-fitting and do not work in most cases. Experiments show that a simple version of LayerNorm (LayerNorm-simple) without the bias and gain outperforms LayerNorm on four datasets. It obtains state-of-the-art performance on En-Vi machine translation. To address the over-fitting problem, we propose a new normalization method, Adaptive Normalization (AdaNorm), which replaces the bias and gain with a new transformation function. Experiments show that AdaNorm achieves better results than LayerNorm on seven out of eight datasets.
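
LayerNorm-simple as described, normalization without the learnable bias and gain, is easy to state precisely; the snippet below is a direct sketch of that variant.

```python
import torch

def layernorm_simple(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """LayerNorm without bias and gain: zero mean, unit variance over the last dimension."""
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)
```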

IJCAI Conference 2018 Conference Paper

A Hierarchical End-to-End Model for Jointly Improving Text Summarization and Sentiment Classification

  • Shuming Ma
  • Xu Sun
  • Junyang Lin
  • Xuancheng Ren

Text summarization and sentiment classification both aim to capture the main ideas of a text but at different levels. Text summarization describes the text within a few sentences, while sentiment classification can be regarded as a special type of summarization which "summarizes" the text in an even more abstract fashion, i.e., into a sentiment class. Based on this idea, we propose a hierarchical end-to-end model for joint learning of text summarization and sentiment classification, where the sentiment classification label is treated as a further "summarization" of the text summarization output. Hence, the sentiment classification layer is placed on top of the text summarization layer, and a hierarchical structure is derived. Experimental results on Amazon online review datasets show that our model achieves better performance than strong baseline systems on both abstractive summarization and sentiment classification.