Author name cluster

Hang Yan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers

1 author row

AAAI Conference 2026 Conference Paper

What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study

Xiaoran Fan
Zhichao Sun
Yangfan Gao
Jingfei Xiong
Hang Yan
Yifei Cao
Jiajun Sun
Shuo Li

Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling. We compare coupled, semi-decoupled, and fully decoupled speech tokenizers under a fair SLM framework and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information density mismatch between speech and text, we introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens. This leads to up to 12× faster decoding and a substantial drop in word error rate (from 6.07 to 3.01). Furthermore, we propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities. Experiments demonstrate that our methods enhance both knowledge understanding and speaker consistency.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data

Zifan Song
Yudong Wang
Wenwei Zhang
Kuikun Liu
Chengqi Lyu
Demin Song
Qipeng Guo
Hang Yan

Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality and diversity, which may insufficiently elicit the potential of pre-trained Code LLMs. In this paper, we present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data. To achieve this, we pioneer to unveil inherent conflicts among the various styles and qualities in multi-source code corpora and introduce data-specific prompts with hindsight relabeling, termed AlchemistPrompts, to harmonize different data sources and instruction-response pairs. Additionally, we propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review. Extensive experiments demonstrate that AlchemistCoder holds a clear lead among all models of the same size (6. 7B/7B) and rivals or even surpasses larger models (15B/33B/70B), showcasing the efficacy of our method in refining instruction-following capabilities and advancing the boundaries of code intelligence. Source code and models are available at https: //github. com/InternLM/AlchemistCoder.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Empowering and Assessing the Utility of Large Language Models in Crop Science

Hang Zhang
Jiawei Sun
Renqi Chen
Wei Liu
Zhonghang Yuan
Xinzhe Zheng
Zhefan Wang
Zhiyuan Yang

Large language models (LLMs) have demonstrated remarkable efficacy across knowledge-intensive tasks. Nevertheless, their untapped potential in crop science presents an opportunity for advancement. To narrow this gap, we introduce CROP, which includes a novel instruction tuning dataset specifically designed to enhance LLMs’ professional capabilities in the crop science sector, along with a benchmark that serves as a comprehensive evaluation of LLMs’ understanding of the domain knowledge. The CROP dataset is curated through a task-oriented and LLM-human integrated pipeline, comprising 210, 038 single-turn and 1, 871 multi-turn dialogues related to crop science scenarios. The CROP benchmark includes 5, 045 multiple-choice questions covering three difficulty levels. Our experiments based on the CROP benchmark demonstrate notable enhancements in crop science-related tasks when LLMs are fine-tuned with the CROP dataset. To the best of our knowledge, CROP dataset is the first-ever instruction tuning dataset in the crop science domain. We anticipate that CROP will accelerate the adoption of LLMs in the domain of crop science, ultimately contributing to global food production.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

Xiaoyi Dong
Pan Zhang
Yuhang Zang
Yuhang Cao
Bin Wang
Linke Ouyang
Songyang Zhang
Haodong Duan

The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 $\times$ 1500 pixels and constrained to a relatively narrow resolution range. This paper represents InternLM-XComposer2-4KHD, a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K HD (3840 × 1600) and beyond. Concurrently, considering the ultra-high resolution may not be necessary in all scenarios, it supports a wide range of diverse resolutions from 336 pixels to 4K standard, significantly broadening its scope of applicability. Specifically, this research advances the patch division paradigm by introducing a novel extension: dynamic resolution with automatic patch configuration. It maintains the training image aspect ratios while automatically varying patch counts and configuring layouts based on a pre-trained Vision Transformer (ViT) (336 $\times$ 336), leading to dynamic training resolution from 336 pixels to 4K standard. Our research demonstrates that scaling training resolution up to 4K HD leads to consistent performance enhancements without hitting the ceiling of potential improvements. InternLM-XComposer2-4KHD shows superb capability that matches or even surpasses GPT-4V and Gemini Pro in 10 of the 16 benchmarks.

PDF Details DOI

AAAI Conference 2022 Conference Paper

Contrast and Generation Make BART a Good Dialogue Emotion Recognizer

Shimin Li
Hang Yan
Xipeng Qiu

In dialogue systems, utterances with similar semantics may have distinctive emotions under different contexts. Therefore, modeling long-range contextual emotional relationships with speaker dependency plays a crucial part in dialogue emotion recognition. Meanwhile, distinguishing the different emotion categories is non-trivial since they usually have semantically similar sentiments. To this end, we adopt supervised contrastive learning to make different emotions mutually exclusive to identify similar emotions better. Meanwhile, we utilize an auxiliary response generation task to enhance the model’s ability of handling context information, thereby forcing the model to recognize emotions with similar semantics in diverse contexts. To achieve these objectives, we use the pretrained encoder-decoder model BART as our backbone model since it is very suitable for both understanding and generation tasks. The experiments on four datasets demonstrate that our proposed model obtains significantly more favorable results than the state-of-the-art model in dialogue emotion recognition. The ablation study further demonstrates the effectiveness of supervised contrastive loss and generative loss.

PDF Details

AAAI Conference 2020 Conference Paper

Constructing Multiple Tasks for Augmentation: Improving Neural Image Classification with K-Means Features

Tao Gui
Lizhi Qing
Qi Zhang
Jiacheng Ye
Hang Yan
Zichu Fei
Xuanjing Huang

Multi-task learning (MTL) has received considerable attention, and numerous deep learning applications beneﬁt from MTL with multiple objectives. However, constructing multiple related tasks is difﬁcult, and sometimes only a single task is available for training in a dataset. To tackle this problem, we explored the idea of using unsupervised clustering to construct a variety of auxiliary tasks from unlabeled data or existing labeled data. We found that some of these newly constructed tasks could exhibit semantic meanings corresponding to certain human-speciﬁc attributes, but some were non-ideal. In order to effectively reduce the impact of non-ideal auxiliary tasks on the main task, we further proposed a novel meta-learning-based multi-task learning approach, which trained the shared hidden layers on auxiliary tasks, while the meta-optimization objective was to minimize the loss on the main task, ensuring that the optimizing direction led to an improvement on the main task. Experimental results across ﬁve image datasets demonstrated that the proposed method signiﬁcantly outperformed existing single task learning, semi-supervised learning, and some data augmentation methods, including an improvement of more than 9% on the Omniglot dataset.

PDF Details

AAAI Conference 2020 Conference Paper

Learning Sparse Sharing Architectures for Multiple Tasks

Tianxiang Sun
Yunfan Shao
Xiaonan Li
Pengfei Liu
Hang Yan
Xipeng Qiu
Xuanjing Huang

Most existing deep multi-task learning models are based on parameter sharing, such as hard sharing, hierarchical sharing, and soft sharing. How choosing a suitable sharing mechanism depends on the relations among the tasks, which is not easy since it is difﬁcult to understand the underlying shared factors among these tasks. In this paper, we propose a novel parameter sharing mechanism, named Sparse Sharing. Given multiple tasks, our approach automatically ﬁnds a sparse sharing structure. We start with an over-parameterized base network, from which each task extracts a subnetwork. The subnetworks of multiple tasks are partially overlapped and trained in parallel. We show that both hard sharing and hierarchical sharing can be formulated as particular instances of the sparse sharing framework. We conduct extensive experiments on three sequence labeling tasks. Compared with single-task models and three typical multi-task learning baselines, our proposed approach achieves consistent improvement while requiring fewer parameters.

PDF Details