Arrow Research search

Author name cluster

Kai Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

33 papers
2 author rows

Possible papers


AAAI Conference 2026 Conference Paper

AHAMask: Reliable Task Specification for Large Audio Language Models Without Instructions

  • Yiwei Guo
  • Bohan Li
  • Hankun Wang
  • Zhihan Li
  • Shuai Wang
  • Xie Chen
  • Kai Yu

Although current large audio language models (LALMs) extend text large language models (LLMs) with generic acoustic understanding abilities, they usually suffer from prompt sensitivity, where different instructions of the same intention can yield drastically different outcomes. In this work, we propose AHAMask, where we simply mask some of the attention heads in the decoder-only LLM backbone of LALMs, to trigger specific acoustic task functionalities without instructions. These masks are efficiently obtained by training on an LALM, with the number of trainable parameters equal to the attention head count in its LLM backbone. We show by experiments that applying such selective attention head masks achieves comparable or even better performance than using instructions, either on single or composite tasks. Besides achieving reliable acoustic task specification for LALMs, this also reveals that LALMs exhibit certain "functional pathways" in their attention heads.
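
The masking mechanism itself is simple to sketch. The snippet below is a minimal toy, not the paper's implementation: a learned binary per-head gate zeroes selected heads before the output projection of one attention layer (shapes, the gate pattern, and the function name are all illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_head_combine(head_outputs, head_mask, w_out):
    """Combine per-head attention outputs into the model dimension,
    silencing masked heads before the output projection.

    head_outputs: (n_heads, seq_len, d_head)
    head_mask:    (n_heads,) gate in {0, 1}; 1 = keep, 0 = mask
    w_out:        (n_heads * d_head, d_model) output projection
    """
    gated = head_outputs * head_mask[:, None, None]          # silence masked heads
    n_heads, seq_len, d_head = gated.shape
    concat = gated.transpose(1, 0, 2).reshape(seq_len, n_heads * d_head)
    return concat @ w_out

n_heads, seq_len, d_head, d_model = 4, 3, 8, 16
heads = rng.normal(size=(n_heads, seq_len, d_head))
w_out = rng.normal(size=(n_heads * d_head, d_model))

# Hypothetical task-specific mask: silence heads 1 and 3.
mask = np.array([1.0, 0.0, 1.0, 0.0])
out = masked_head_combine(heads, mask, w_out)
```

The trainable-parameter count then equals the number of heads, matching the abstract's claim: one scalar gate per head, nothing else.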

AAAI Conference 2026 Conference Paper

MergeDNA: Context-Aware Genome Modeling with Dynamic Tokenization Through Token Merging

  • Siyuan Li
  • Kai Yu
  • Anna Wang
  • Zicheng Liu
  • Chang Yu
  • Jingbo Zhou
  • Qirong Yang
  • Yucheng Guo

Modeling genomic sequences faces two unsolved challenges: the information density varies widely across different regions, while there is no clearly defined minimum vocabulary unit. Relying on either four primitive bases or independently designed DNA tokenizers, existing approaches with naive masked language modeling pre-training often fail to adapt to the varying complexities of genomic sequences. Leveraging Token Merging techniques, this paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. Architecturally, the tokenization module automatically chunks adjacent bases into words by stacking multiple layers of differentiable token-merging blocks with local-window constraints; a Latent Encoder then captures the global context of these merged words with full-attention blocks. Symmetrically employing a Latent Decoder and a Local Decoder, MergeDNA learns with two pre-training tasks: Merged Token Reconstruction simultaneously trains the dynamic tokenization module and adaptively filters important tokens, while Adaptive Masked Token Modeling learns to predict these filtered tokens to capture informative contents. Extensive experiments show that MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks with fine-tuning or zero-shot evaluation, outperforming typical tokenization methods and large-scale DNA foundation models.
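
The core intuition of token merging on DNA can be illustrated with a hard, greedy variant (the paper's blocks are differentiable and local-window constrained; this toy is neither, and the averaging merge rule is an assumption):

```python
import numpy as np

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def merge_adjacent(tokens, n_merge):
    """Greedy hard token merging: repeatedly average-merge the most
    similar adjacent pair, so low-information runs collapse first.
    A non-differentiable toy stand-in for differentiable token merging."""
    toks = list(tokens)
    for _ in range(n_merge):
        sims = [
            float(toks[i] @ toks[i + 1])
            / (np.linalg.norm(toks[i]) * np.linalg.norm(toks[i + 1]) + 1e-8)
            for i in range(len(toks) - 1)
        ]
        i = int(np.argmax(sims))                      # most redundant adjacent pair
        toks[i:i + 2] = [(toks[i] + toks[i + 1]) / 2]  # merge into one "word"
    return np.stack(toks)

# One-hot base embeddings: the repetitive run of A's merges first,
# leaving more tokens for the information-dense CGT tail.
seq = "AAAACGT"
emb = np.eye(4)[[BASE[b] for b in seq]]
words = merge_adjacent(emb, 3)
```

This shows how a merging tokenizer spends its token budget adaptively: identical adjacent bases have similarity 1 and merge away, which is the behavior the varying-information-density argument calls for.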

AAAI Conference 2026 Conference Paper

Phased One-Step Adversarial Equilibrium for Video Diffusion Models

  • Jiaxiang Cheng
  • Bing Ma
  • Xuhua Ren
  • Hongyi Henry Jin
  • Kai Yu
  • Peng Zhang
  • Wenyue Li
  • Yuan Zhou

Video diffusion generation suffers from critical sampling efficiency bottlenecks, particularly for large-scale models and long contexts. Existing video acceleration methods, adapted from image-based techniques, lack a single-step distillation ability for large-scale video models and task generalization for conditional downstream tasks. To bridge this gap, we propose the Video Phased Adversarial Equilibrium (V-PAE), a distillation framework that enables high-quality, single-step video generation from large-scale video models. Our approach employs a two-phase process. (i) Stability priming is a warm-up process to align the distributions of real and generated videos. It improves the stability of single-step adversarial distillation in the following process. (ii) Unified adversarial equilibrium is a flexible self-adversarial process that reuses generator parameters for the discriminator backbone. It achieves a co-evolutionary adversarial equilibrium in the Gaussian noise space. For conditional tasks, we primarily address the loss of video-image subject consistency caused by semantic degradation and conditional frame collapse during distillation training in image-to-video (I2V) generation. Comprehensive experiments on VBench-I2V demonstrate that V-PAE outperforms existing acceleration methods by an average of 5.8% in the overall quality score, including semantic alignment, temporal coherence, and frame quality. In addition, our approach reduces the diffusion latency of the large-scale video model (e.g., Wan2.1-I2V-14B) by 100 times, while preserving competitive performance.

NeurIPS Conference 2025 Conference Paper

Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism

  • Kunyun Wang
  • Bohan Li
  • Kai Yu
  • Minyi Guo
  • Jieru Zhao

Diffusion models have emerged as a powerful class of generative models across various modalities, including image, video, and audio synthesis. However, their deployment is often limited by significant inference latency, primarily due to the inherently sequential nature of the denoising process. While existing parallelization strategies attempt to accelerate inference by distributing computation across multiple devices, they typically incur high communication overhead, hindering deployment on commercial hardware. To address this challenge, we propose ParaStep, a novel parallelization method based on a reuse-then-predict mechanism that parallelizes diffusion inference by exploiting similarity between adjacent denoising steps. Unlike prior approaches that rely on layer-wise or stage-wise communication, ParaStep employs lightweight, step-wise communication, substantially reducing overhead. ParaStep achieves end-to-end speedups of up to 3.88× on SVD, 2.43× on CogVideoX-2b, and 6.56× on AudioLDM2-large, while maintaining generation quality. These results highlight ParaStep as a scalable and communication-efficient solution for accelerating diffusion inference, particularly in bandwidth-constrained environments.
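
The reuse idea behind adjacent-step similarity can be sketched in a few lines. Everything below is a toy: `eps` stands in for the diffusion network, and the refresh-every-other-step schedule and sequential loop are assumptions for illustration — ParaStep actually runs reused and refreshed steps on separate devices with step-wise communication.

```python
import numpy as np

def eps(x, t):
    # Toy "noise predictor" standing in for the expensive diffusion network.
    return 0.1 * x + 0.01 * t

def denoise_sequential(x, steps):
    # Baseline: one network call per denoising step.
    for t in range(steps, 0, -1):
        x = x - eps(x, t)
    return x

def denoise_reuse(x, steps):
    """Reuse-then-predict sketch: call the network on every other step and
    reuse the cached prediction in between, exploiting the similarity of
    adjacent denoising steps."""
    cached = None
    for t in range(steps, 0, -1):
        if cached is None or t % 2 == 0:
            cached = eps(x, t)        # real network call
        x = x - cached                # cheap reuse on the skipped step
    return x

x0 = np.array([1.0, -2.0, 0.5])
exact = denoise_sequential(x0, 20)
approx = denoise_reuse(x0, 20)
```

Half the network calls are skipped, yet the result stays close to the sequential trajectory — the property ParaStep exploits to hand alternate steps to other devices.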

NeurIPS Conference 2025 Conference Paper

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

  • Ziyang Ma
  • Yinghao Ma
  • Yanqiao Zhu
  • Chen Yang
  • Yi-Wen Chao
  • Ruiyang Xu
  • Wenxi Chen
  • Yuanzhe Chen

We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends them to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale to promote future advancements in audio reasoning. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding. Moreover, a part of the questions requires graduate-level perceptual and domain-specific knowledge, elevating the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), Large Language Models (LLMs), and Large Reasoning Models (LRMs), with audio caption inputs. The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis further reveals critical limitations of understanding and reasoning capabilities among current models. These findings underscore the urgent need for greater research attention in audio-language reasoning, including both data and algorithm innovation. We hope MMAR will serve as a catalyst for future advances in this important but little-explored area.

NeurIPS Conference 2025 Conference Paper

MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation

  • Yang Han
  • Pengyu Wang
  • Kai Yu
  • Xin Chen
  • Lu Chen

Mass spectrometry (MS) plays a critical role in molecular identification, significantly advancing scientific discovery. However, structure elucidation from MS data remains challenging due to the scarcity of annotated spectra. While large-scale pretraining has proven effective in addressing data scarcity in other domains, applying this paradigm to mass spectrometry is hindered by the complexity and heterogeneity of raw spectral signals. To address this, we propose MS-BART, a unified modeling framework that maps mass spectra and molecular structures into a shared token vocabulary, enabling cross-modal learning through large-scale pretraining on reliably computed fingerprint–molecule datasets. Multi-task pretraining objectives further enhance MS-BART's generalization by jointly optimizing denoising and translation tasks. The pretrained model is subsequently transferred to experimental spectra through finetuning on fingerprint predictions generated with MIST, a pre-trained spectral inference model, thereby enhancing robustness to real-world spectral variability. While finetuning alleviates the distributional difference, MS-BART still suffers from molecular hallucination and requires further alignment. We therefore introduce a chemical feedback mechanism that guides the model toward generating molecules closer to the reference structure. Extensive evaluations demonstrate that MS-BART achieves SOTA performance across 5/12 key metrics on MassSpecGym and NPLIB1 and runs an order of magnitude faster than competing diffusion-based methods, while comprehensive ablation studies systematically validate the model's effectiveness and robustness. We provide the data and code at https://github.com/OpenDFM/MS-BART.

NeurIPS Conference 2025 Conference Paper

Task-Specific Data Selection for Instruction Tuning via Monosemantic Neuronal Activations

  • Da Ma
  • Gonghu Shang
  • Zhi Chen
  • Libo Qin
  • Yijie LUO
  • Hongshen Xu
  • Lei Pan
  • Shuai Fan

Instruction tuning improves the ability of large language models (LLMs) to follow diverse human instructions, but achieving strong performance on specific target tasks remains challenging. A critical bottleneck is selecting the most relevant data to maximize task-specific performance. Existing data selection approaches include unstable influence-based methods and more stable distribution alignment methods, the latter of which critically rely on the underlying sample representation. In practice, most distribution alignment methods, from shallow features (e.g., BM25) to neural embeddings (e.g., BGE, LLM2Vec), may fail to capture how the model internally processes samples. To bridge this gap, we adopt a model-centric strategy in which each sample is represented by its neuronal activation pattern in the model, directly reflecting internal computation. However, directly using raw neuron activations leads to spurious similarity between unrelated samples due to neuron polysemanticity, where a single neuron may respond to multiple, unrelated concepts. To address this, we employ sparse autoencoders to disentangle polysemantic activations into sparse, monosemantic representations, and introduce a dedicated similarity metric for this space to better identify task-relevant data. Comprehensive experiments across multiple instruction datasets, models, tasks, and selection ratios show that our approach consistently outperforms existing data selection baselines in both stability and task-specific performance.
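
Why sparsification helps similarity can be shown with a toy example. The top-k projection below is a crude stand-in for a trained sparse autoencoder (the paper's SAE features and its dedicated metric are more involved; the vectors and k are hypothetical):

```python
import numpy as np

def topk_sparse(a, k):
    """Keep only the k largest-magnitude coordinates: a crude stand-in for
    the sparse, monosemantic features a trained SAE would provide."""
    out = np.zeros_like(a)
    idx = np.argsort(np.abs(a))[-k:]
    out[idx] = a[idx]
    return out

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

# Two activation vectors that overlap only in weak "background" coordinates,
# the kind of spurious overlap polysemantic neurons produce:
a = np.array([5.0, 0.0, 0.3, 0.3, 0.3])
b = np.array([0.0, 5.0, 0.3, 0.3, 0.3])

raw_sim = cosine(a, b)                                 # inflated by shared background
sparse_sim = cosine(topk_sparse(a, 1), topk_sparse(b, 1))  # background removed
```

After sparsification the samples are correctly judged unrelated, while the raw activations report a nonzero spurious similarity.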

AAAI Conference 2025 Conference Paper

VQTalker: Towards Multilingual Talking Avatars Through Facial Motion Tokenization

  • Tao Liu
  • Ziyang Ma
  • Qi Chen
  • Feilong Chen
  • Shuai Fan
  • Xie Chen
  • Kai Yu

We present VQTalker, a Vector Quantization-based framework for multilingual talking head generation that addresses the challenges of lip synchronization and natural motion across diverse languages. Our approach is grounded in the phonetic principle that human speech comprises a finite set of distinct sound units (phonemes) and corresponding visual articulations (visemes), which often share commonalities across languages. We introduce a facial motion tokenizer based on Group Residual Finite Scalar Quantization (GRFSQ), which creates a discretized representation of facial features. This method enables comprehensive capture of facial movements while improving generalization to multiple languages, even with limited training data. Building on this quantized representation, we implement a coarse-to-fine motion generation process that progressively refines facial animations. Extensive experiments demonstrate that VQTalker achieves state-of-the-art performance in both video-driven and speech-driven scenarios, particularly in multilingual settings. Notably, our method achieves high-quality results at a resolution of 512 × 512 pixels while maintaining a lower bitrate of approximately 11 kbps. Our work opens new possibilities for cross-lingual talking face generation.
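
The scalar-quantization core of GRFSQ is small enough to show directly. This is one plain FSQ stage only; the grouping and residual stages implied by the name GRFSQ are omitted, and the level count is an illustrative choice:

```python
import numpy as np

def fsq(z, levels=5):
    """One finite-scalar-quantization stage: bound each dimension, then
    snap it to a fixed grid of `levels` values per dimension. No codebook
    is learned; the implicit codebook size is levels ** n_dims."""
    z = np.tanh(z)                     # bound to (-1, 1)
    half = (levels - 1) / 2
    return np.round(z * half) / half   # grid {-1, -0.5, 0, 0.5, 1} for levels=5

codes = fsq(np.array([0.0, 0.2, 10.0]))
```

Because each dimension is quantized independently to a few levels, the representation stays compact, which is consistent with the low ~11 kbps bitrate the abstract reports.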

AAAI Conference 2024 Conference Paper

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

  • Liangtai Sun
  • Yang Han
  • Zihan Zhao
  • Da Ma
  • Zhennan Shen
  • Baocai Chen
  • Lu Chen
  • Kai Yu

Recently, there has been growing interest in using Large Language Models (LLMs) for scientific research. Numerous benchmarks have been proposed to evaluate the ability of LLMs for scientific research. However, current benchmarks are mostly based on pre-collected objective questions. This design suffers from the data leakage problem and lacks evaluation of subjective Q/A ability. In this paper, we propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark to address these issues. Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability. In particular, we design a "dynamic" subset based on scientific principles to prevent evaluation from potential data leakage. Both objective and subjective questions are included in SciEval. These characteristics make SciEval a more effective benchmark for scientific research ability evaluation of LLMs. Comprehensive experiments on most advanced LLMs show that, although GPT-4 achieves SOTA performance compared to other LLMs, there is still substantial room for improvement, especially for dynamic questions. The codes and data are publicly available at https://github.com/OpenDFM/SciEval.

NeurIPS Conference 2024 Conference Paper

Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

  • Ruisheng Cao
  • Fangyu Lei
  • Haoyuan Wu
  • Jixuan Chen
  • Yeqiao Fu
  • Hongcheng Gao
  • Xinzhuang Xiong
  • Hanchong Zhang

Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI operations. This automation can improve the productivity of experts while democratizing access to large-scale data analysis. In this paper, we introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering workflows, featuring 494 real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks, derived from real-world use cases, evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems. To balance realistic simulation with evaluation simplicity, we devote significant effort to developing automatic configurations for task setup and carefully crafting evaluation metrics for each task. Furthermore, we supplement multimodal agents with comprehensive documents of these enterprise data software systems. Our empirical evaluation reveals that existing state-of-the-art LLM/VLM-based agents do not reliably automate full data workflows (14.0% success). Even with step-by-step guidance, these agents still underperform in tasks that require fine-grained, knowledge-intensive GUI actions (16.2%) and involve remote cloud-hosted workspaces (10.6%). We hope that Spider2-V paves the way for autonomous multimodal agents to transform the automation of data science and engineering workflows. Our code and data are available at https://spider2-v.github.io.

AAAI Conference 2024 Conference Paper

UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

  • Chenpeng Du
  • Yiwei Guo
  • Feiyu Shen
  • Zhijun Liu
  • Zheng Liang
  • Xie Chen
  • Shuai Wang
  • Hui Zhang

The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens, has been proven superior to traditional acoustic feature mel-spectrograms in terms of naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted from a short speech prompt. However, these AR models are restricted to generate speech only in a left-to-right direction, making them unsuitable for speech editing where both preceding and following contexts are provided. Furthermore, these models rely on acoustic tokens, which have audio quality limitations imposed by the performance of audio codec models. In this study, we propose a unified context-aware TTS framework called UniCATS, which is capable of both speech continuation and editing. UniCATS comprises two components, an acoustic model CTX-txt2vec and a vocoder CTX-vec2wav. CTX-txt2vec employs contextual VQ-diffusion to predict semantic tokens from the input text, enabling it to incorporate the semantic context and maintain seamless concatenation with the surrounding context. Following that, CTX-vec2wav utilizes contextual vocoding to convert these semantic tokens into waveforms, taking into consideration the acoustic context. Our experimental results demonstrate that CTX-vec2wav outperforms HifiGAN and AudioLM in terms of speech resynthesis from semantic tokens. Moreover, we show that UniCATS achieves state-of-the-art performance in both speech continuation and editing. Audio samples are available at https://cpdu.github.io/unicats.

NeurIPS Conference 2023 Conference Paper

Large Language Models Are Semi-Parametric Reinforcement Learning Agents

  • Danyang Zhang
  • Lu Chen
  • Situo Zhang
  • Hongshen Xu
  • Zihan Zhao
  • Kai Yu

Inspired by insights from cognitive science on human memory and reasoning mechanisms, we propose Rememberer, a novel evolvable LLM-based (Large Language Model) agent framework. By equipping the LLM with a long-term experience memory, Rememberer can exploit experiences from past episodes even for different task goals, giving it an edge over LLM-based agents with fixed exemplars or a transient working memory. We further introduce Reinforcement Learning with Experience Memory (RLEM) to update the memory. The whole system can thus learn from the experiences of both success and failure, and evolve its capability without fine-tuning the parameters of the LLM. In this way, the proposed Rememberer constitutes a semi-parametric RL agent. Extensive experiments are conducted on two RL task sets to evaluate the proposed framework. The average results with different initialization and training sets exceed the prior SOTA by 4% and 2% in success rate on the two task sets, demonstrating the superiority and robustness of Rememberer.
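
A minimal experience memory of this flavor fits in one class. The structure below (per-(task, observation, action) values, a soft value update from episode reward, best-action retrieval to build exemplars for a frozen LLM) is an illustrative assumption, not the paper's exact design:

```python
class ExperienceMemory:
    """Toy RLEM-style memory: values are updated from episode rewards, and
    the best-valued past action for a similar observation is retrieved to
    serve as an exemplar — no LLM parameters are touched."""

    def __init__(self, lr=0.5):
        self.lr = lr
        self.values = {}                  # (task, obs, action) -> estimated value

    def update(self, task, obs, action, reward):
        key = (task, obs, action)
        old = self.values.get(key, 0.0)
        self.values[key] = old + self.lr * (reward - old)   # soft value update

    def best_action(self, task, obs):
        cands = {a: v for (t, o, a), v in self.values.items()
                 if t == task and o == obs}
        return max(cands, key=cands.get) if cands else None

mem = ExperienceMemory()
mem.update("webshop", "search page", "click[item A]", reward=0.2)  # weak outcome
mem.update("webshop", "search page", "click[item B]", reward=1.0)  # success
```

Both successes and failures are stored, so retrieval can surface what worked and implicitly rank what did not — the "learn from both" property the abstract emphasizes.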

NeurIPS Conference 2023 Conference Paper

PointGPT: Auto-regressively Generative Pre-training from Point Clouds

  • Guangyan Chen
  • Meiling Wang
  • Yi Yang
  • Kai Yu
  • Li Yuan
  • Yufeng Yue

Large language models (LLMs) based on the generative pre-training transformer (GPT) have demonstrated remarkable effectiveness across a diverse range of downstream tasks. Inspired by the advancements of the GPT, we present PointGPT, a novel approach that extends the concept of GPT to point clouds, addressing the challenges associated with disorder properties, low information density, and task gaps. Specifically, a point cloud auto-regressive generation task is proposed to pre-train transformer models. Our method partitions the input point cloud into multiple point patches and arranges them in an ordered sequence based on their spatial proximity. Then, an extractor-generator based transformer decoder, with a dual masking strategy, learns latent representations conditioned on the preceding point patches, aiming to predict the next one in an auto-regressive manner. To explore scalability and enhance performance, a larger pre-training dataset is collected. Additionally, a subsequent post-pre-training stage is introduced, incorporating a labeled hybrid dataset. Our scalable approach allows for learning high-capacity models that generalize well, achieving state-of-the-art performance on various downstream tasks. In particular, our approach achieves classification accuracies of 94.9% on the ModelNet40 dataset and 93.4% on the ScanObjectNN dataset, outperforming all other transformer models. Furthermore, our method also attains new state-of-the-art accuracies on all four few-shot learning benchmarks. Codes are available at https://github.com/CGuangyan-BIT/PointGPT.
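
Serializing unordered patches "based on spatial proximity" is commonly done with a Morton (Z-order) curve; the abstract does not name the exact ordering, so treat the sketch below as one plausible realization rather than PointGPT's implementation:

```python
import numpy as np

def morton_key(x, y, z, bits=8):
    """Interleave the bits of quantized 3-D coordinates (Morton/Z-order),
    so sorting by key places spatial neighbors near each other in the
    1-D sequence fed to the auto-regressive decoder."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key

def order_patches(centers, bits=8):
    # centers: (n, 3) patch centers normalized to [0, 1]; returns patch order
    q = np.clip((np.asarray(centers) * (2 ** bits - 1)).astype(int),
                0, 2 ** bits - 1)
    return np.argsort([morton_key(*c) for c in q], kind="stable")

centers = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.01, 0.0, 0.0]]
order = order_patches(centers)
```

The two near-origin patches end up adjacent in the sequence and the far corner comes last, which is exactly what "ordered by spatial proximity" needs before next-patch prediction makes sense.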

JBHI Journal 2022 Journal Article

Speckle Noise Reduction for OCT Images Based on Image Style Transfer and Conditional GAN

  • Yi Zhou
  • Kai Yu
  • Meng Wang
  • Yuhui Ma
  • Yuanyuan Peng
  • Zhongyue Chen
  • Weifang Zhu
  • Fei Shi

Raw optical coherence tomography (OCT) images typically are of low quality because speckle noise blurs retinal structures, severely compromising visual quality and degrading performances of subsequent image analysis tasks. In our previous study (Ma et al., 2018), we have developed a Conditional Generative Adversarial Network (cGAN) for speckle noise removal in OCT images collected by several commercial OCT scanners, which we collectively refer to as scanner T. In this paper, we improve the cGAN model and apply it to our in-house OCT scanner (scanner B) for speckle noise suppression. The proposed model consists of two steps: 1) We train a Cycle-Consistent GAN (CycleGAN) to learn style transfer between two OCT image datasets collected by different scanners. The purpose of the CycleGAN is to leverage the ground truth dataset created in our previous study. 2) We train a mini-cGAN model based on the PatchGAN mechanism with the ground truth dataset to suppress speckle noise in OCT images. After training, we first apply the CycleGAN model to convert raw images collected by scanner B to match the style of the images from scanner T, and subsequently use the mini-cGAN model to suppress speckle noise in the style transferred images. We evaluate the proposed method on a dataset collected by scanner B. Experimental results show that the improved model outperforms our previous method and other state-of-the-art models in speckle noise removal, retinal structure preservation and contrast enhancement.

AAAI Conference 2021 Conference Paper

LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short Text Matching

  • Boer Lyu
  • Lu Chen
  • Su Zhu
  • Kai Yu

Chinese short text matching is a fundamental task in natural language processing. Existing approaches usually take Chinese characters or words as input tokens. They have two limitations: 1) Some Chinese words are polysemous, and semantic information is not fully utilized. 2) Some models suffer potential issues caused by word segmentation. Here we introduce HowNet as an external knowledge base and propose a Linguistic knowledge Enhanced graph Transformer (LET) to deal with word ambiguity. Additionally, we adopt the word lattice graph as input to maintain multi-granularity information. Our model is also complementary to pre-trained language models. Experimental results on two Chinese datasets show that our models outperform various typical text matching approaches. Ablation study also indicates that both semantic information and multi-granularity information are important for text matching modeling.

AAAI Conference 2020 Conference Paper

Schema-Guided Multi-Domain Dialogue State Tracking with Graph Attention Neural Networks

  • Lu Chen
  • Boer Lv
  • Chi Wang
  • Su Zhu
  • Bowen Tan
  • Kai Yu

Dialogue state tracking (DST) aims at estimating the current dialogue state given all the preceding conversation. For multi-domain DST, the data sparsity problem is also a major obstacle due to the increased number of state candidates. Existing approaches generally predict the value for each slot independently and do not consider slot relations, which may aggravate the data sparsity problem. In this paper, we propose a Schema-guided multi-domain dialogue State Tracker with graph attention networks (SST) that predicts dialogue states from dialogue utterances and schema graphs which contain slot relations in edges. We also introduce a graph attention matching network to fuse information from utterances and graphs, and a recurrent graph attention network to control state updating. Experiment results show that our approach obtains new state-of-the-art performance on both MultiWOZ 2.0 and MultiWOZ 2.1 benchmarks.

AAAI Conference 2020 Conference Paper

Semi-Supervised Text Simplification with Back-Translation and Asymmetric Denoising Autoencoders

  • Yanbin Zhao
  • Lu Chen
  • Zhi Chen
  • Kai Yu

Text simplification (TS) rephrases long sentences into simplified variants while preserving inherent semantics. Traditional sequence-to-sequence models heavily rely on the quantity and quality of parallel sentences, which limits their applicability in different languages and domains. This work investigates how to leverage large amounts of unpaired corpora in the TS task. We adopt the back-translation architecture from unsupervised neural machine translation (NMT), including denoising autoencoders for language modeling and automatic generation of parallel data by iterative back-translation. However, it is non-trivial to generate appropriate complex-simple pairs if we directly treat the sets of simple and complex corpora as two different languages, since the two types of sentences are quite similar and it is hard for the model to capture the characteristics of each type. To tackle this problem, we propose asymmetric denoising methods for sentences of differing complexity. When modeling simple and complex sentences with autoencoders, we introduce different types of noise into the training process. Such a method can significantly improve the simplification performance. Our model can be trained in both unsupervised and semi-supervised manners. Automatic and human evaluations show that our unsupervised model outperforms the previous systems, and with limited supervision, our model can perform competitively with multiple state-of-the-art simplification systems.
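
A denoising-autoencoder corruption function makes the "asymmetric noise" idea concrete. The noise types below (random drops plus local order jitter) and the per-side rates are illustrative assumptions only; the paper's actual asymmetric noise design differs:

```python
import random

def corrupt(tokens, p_drop, max_jitter, rng):
    """DAE-style corruption: randomly drop tokens, then locally shuffle the
    survivors by sorting on jittered positions. Asymmetry means the
    simple-sentence and complex-sentence autoencoders get differently
    tuned noise, so each learns its side's characteristics."""
    kept = [t for t in tokens if rng.random() > p_drop]
    keys = [i + rng.uniform(0, max_jitter) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda kv: kv[0])]

rng = random.Random(0)
sent = "the committee adjourned the hearing until next month".split()
simple_noised = corrupt(sent, p_drop=0.3, max_jitter=3.0, rng=rng)   # heavier noise
complex_noised = corrupt(sent, p_drop=0.1, max_jitter=1.0, rng=rng)  # lighter noise
```

Training each autoencoder to undo its own noise profile is what lets the back-translation loop keep "simple" and "complex" from collapsing into one indistinguishable style.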

JBHI Journal 2019 Journal Article

Surrogate-Assisted Retinal OCT Image Classification Based on Convolutional Neural Networks

  • Yibiao Rong
  • Dehui Xiang
  • Weifang Zhu
  • Kai Yu
  • Fei Shi
  • Zhun Fan
  • Xinjian Chen

Optical Coherence Tomography (OCT) is becoming one of the most important modalities for the noninvasive assessment of retinal eye diseases. As the number of acquired OCT volumes increases, automating the OCT image analysis is becoming increasingly relevant. In this paper, we propose a surrogate-assisted classification method to classify retinal OCT images automatically based on convolutional neural networks (CNNs). Image denoising is first performed to reduce the noise. Thresholding and morphological dilation are applied to extract the masks. The denoised images and the masks are then employed to generate a large number of surrogate images, which are used to train the CNN model. Finally, the prediction for a test image is determined by the average of the outputs from the trained CNN model on the surrogate images. The proposed method has been evaluated on different databases. The results (AUC of 0.9783 on the local database and AUC of 0.9856 on the Duke database) show that the proposed method is a very promising tool for classifying retinal OCT images automatically.
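
The test-time aggregation step is a one-liner worth seeing. The averaging over surrogates comes straight from the abstract; the toy "model" and the surrogate images below are purely illustrative:

```python
import numpy as np

def predict_with_surrogates(model, surrogates):
    """Average the classifier's class scores over the surrogate images
    generated for one test scan; `model` is any callable mapping an
    image to a vector of class scores."""
    return np.mean([model(s) for s in surrogates], axis=0)

# Stand-in "CNN": scores derived from mean intensity (illustrative only).
toy_model = lambda img: np.array([img.mean(), 1.0 - img.mean()])
surrogates = [np.full((4, 4), v) for v in (0.2, 0.4, 0.6)]
scores = predict_with_surrogates(toy_model, surrogates)
```

Averaging over many surrogates of the same scan smooths out per-image noise in the predictions, which is the rationale for the surrogate-assisted design.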

JBHI Journal 2017 Journal Article

Single-Channel Sparse Non-Negative Blind Source Separation Method for Automatic 3-D Delineation of Lung Tumor in PET Images

  • Ivica Kopriva
  • Wei Ju
  • Bin Zhang
  • Fei Shi
  • Dehui Xiang
  • Kai Yu
  • Ximing Wang
  • Ulas Bagci

In this paper, we propose a novel method for single-channel blind separation of nonoverlapped sources and, to the best of our knowledge, apply it for the first time to automatic segmentation of lung tumors in positron emission tomography (PET) images. Our approach first converts a 3-D PET image into a pseudo-multichannel image. Afterward, regularization-free sparseness-constrained non-negative matrix factorization is used to separate tumor from other tissues. Using a complexity-based criterion, we select the tumor component as the one with minimal complexity. We have compared the proposed method with thresholding at 40% and 50% of the maximum standardized uptake value (SUV), graph cuts (GC), random walks (RW), and affinity propagation (AP) algorithms on 18 non-small-cell lung cancer datasets with respect to ground truth (GT) provided by two radiologists. The Dice similarity coefficient averaged over the two GTs is: 0.78 ± 0.12 for the proposed algorithm, 0.78 ± 0.1 for GC, 0.77 ± 0.13 for AP, 0.77 ± 0.07 for RW, and 0.75 ± 0.13 for the 50% maximum SUV threshold. Since the proposed method achieved performance comparable with interactive methods, considering the unique challenges of lung tumor segmentation from PET images, our findings support the possibility of using our fully automated method in routine clinics. The source codes will be available at www.mipav.net/English/research/research.html.

NeurIPS Conference 2014 Conference Paper

Communication Efficient Distributed Machine Learning with the Parameter Server

  • Mu Li
  • David Andersen
  • Alexander Smola
  • Kai Yu

This paper describes a third-generation parameter server framework for distributed machine learning. This framework offers two relaxations to balance system performance and algorithm efficiency. We propose a new algorithm that takes advantage of this framework to solve non-convex non-smooth problems with convergence guarantees. We present an in-depth analysis of two large scale machine learning problems ranging from ℓ1-regularized logistic regression on CPUs to reconstruction ICA on GPUs, using 636 TB of real data with hundreds of billions of samples and dimensions. We demonstrate using these examples that the parameter server framework is an effective and straightforward way to scale machine learning to larger problems and systems than have been previously achieved.
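
The push/pull interface at the heart of any parameter server can be sketched in a few lines. This single-process toy omits exactly what the paper contributes — the two relaxations (bounded-delay asynchronous execution and user-defined filters on communicated values) — so read it as the baseline abstraction, not the framework:

```python
import numpy as np

class ParameterServer:
    """Minimal push/pull sketch: the server holds the global weights,
    workers pull a copy, compute gradients on their data shard, and
    push the gradients back."""

    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.w.copy()              # worker fetches current weights

    def push(self, grad):
        self.w -= self.lr * grad          # server applies a worker's gradient

server = ParameterServer(dim=3)
shard_grads = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 2.0, 0.0])]
for grad in shard_grads:                  # two "workers", run here in turn
    w = server.pull()                     # in the real system: over the network
    server.push(grad)                     # gradient computed from w and the shard
```

In the real framework these calls cross the network asynchronously, and the relaxations decide how stale a pulled `w` may be and which values are worth sending at all.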

NeurIPS Conference 2012 Conference Paper

Deep Learning of Invariant Features via Simulated Fixations in Video

  • Will Zou
  • Shenghuo Zhu
  • Kai Yu
  • Andrew Ng

We apply salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision. With tracked sequences as input, a hierarchical network of modules learns invariant features using a temporal slowness constraint. The network encodes invariances which become increasingly complex with hierarchy. Although learned from videos, our features are spatial rather than spatio-temporal, and are well suited for extracting features from still images. We applied our features to four datasets (COIL-100, Caltech 101, STL-10, PubFig) and observe a consistent improvement of 4% to 5% in classification accuracy. With this approach, we achieve a state-of-the-art recognition accuracy of 61% on the STL-10 dataset.

NeurIPS Conference 2010 Conference Paper

Deep Coding Network

  • Yuanqing Lin
  • Tong Zhang
  • Shenghuo Zhu
  • Kai Yu

This paper proposes a principled extension of the traditional single-layer flat sparse coding scheme, where a two-layer coding scheme is derived based on theoretical analysis of nonlinear functional approximation that extends recent results for local coordinate coding. The two-layer approach can be easily generalized to deeper structures in a hierarchical multiple-layer manner. Empirically, it is shown that the deep coding approach yields improved performance in benchmark datasets.

NeurIPS Conference 2009 Conference Paper

Nonlinear Learning using Local Coordinate Coding

  • Kai Yu
  • Tong Zhang
  • Yihong Gong

This paper introduces a new method for semi-supervised learning on high dimensional nonlinear manifolds, which includes a phase of unsupervised basis learning and a phase of supervised function learning. The learned bases provide a set of anchor points to form a local coordinate system, such that each data point x on the manifold can be locally approximated by a linear combination of its nearby anchor points, and the linear weights become its local coordinate coding. We show that a high dimensional nonlinear function can be approximated by a global linear function with respect to this coding scheme, and the approximation quality is ensured by the locality of such coding. The method turns a difficult nonlinear learning problem into a simple global linear learning problem, which overcomes some drawbacks of traditional local learning methods.
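The coding step described above can be sketched with the standard analytic solution for locality-constrained coding. This simplified version assumes the anchors are already given (the paper learns them), and substitutes a hard k-nearest-anchor cutoff plus a small ridge term for the paper's locality penalty; the function name and constants are illustrative.

```python
import numpy as np

def lcc_encode(x, anchors, k=3):
    """Encode x as an affine combination of its k nearest anchors
    (a simplified local-coordinate-coding sketch)."""
    d = np.linalg.norm(anchors - x, axis=1)
    idx = np.argsort(d)[:k]
    B = anchors[idx]                        # (k, dim) nearest anchors
    # Solve min ||x - w @ B||^2  s.t.  sum(w) = 1.
    Bc = B - x                              # center anchors on x
    G = Bc @ Bc.T                           # local Gram matrix
    G += 1e-8 * np.trace(G) * np.eye(k)     # ridge for numerical stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()                            # enforce sum-to-one
    code = np.zeros(len(anchors))
    code[idx] = w                           # sparse local coordinates
    return code

rng = np.random.default_rng(1)
anchors = rng.standard_normal((10, 2))      # stand-in for learned bases
x = anchors[:3].mean(axis=0)
code = lcc_encode(x, anchors, k=3)
recon = code @ anchors                      # local linear reconstruction
```

The point of the scheme is the last line: once every point carries such a sparse local code, a nonlinear function of x can be fitted as a *linear* function of the code, which is what turns the nonlinear learning problem into a global linear one.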

NeurIPS Conference 2008 Conference Paper

Deep Learning with Kernel Regularization for Visual Recognition

  • Kai Yu
  • Wei Xu
  • Yihong Gong

In this paper we focus on training deep neural networks for visual recognition tasks. One challenge is the lack of an informative regularization on the network parameters, to imply a meaningful control on the computed function. We propose a training strategy that takes advantage of kernel methods, where an existing kernel function represents useful prior knowledge about the learning task of interest. We derive an efficient algorithm using stochastic gradient descent, and demonstrate very positive results in a wide range of visual recognition tasks.

NeurIPS Conference 2008 Conference Paper

Stochastic Relational Models for Large-scale Dyadic Data using MCMC

  • Shenghuo Zhu
  • Kai Yu
  • Yihong Gong

Stochastic relational models provide a rich family of choices for learning and predicting dyadic data between two sets of entities. They generalize matrix factorization to a supervised learning problem that utilizes attributes of objects in a hierarchical Bayesian framework. Previously, empirical Bayesian inference was applied, which is, however, not scalable when the size of either object set reaches tens of thousands. In this paper, we introduce a Markov chain Monte Carlo (MCMC) algorithm to scale the model to very large-scale dyadic data. Both superior scalability and predictive accuracy are demonstrated on a collaborative filtering problem involving tens of thousands of users and half a million items.

NeurIPS Conference 2007 Conference Paper

Gaussian Process Models for Link Analysis and Transfer Learning

  • Kai Yu
  • Wei Chu

In this paper we develop a Gaussian process (GP) framework to model a collection of reciprocal random variables defined on the \emph{edges} of a network. We show how to construct GP priors, i.e., covariance functions, on the edges of directed, undirected, and bipartite graphs. The model suggests an intimate connection between \emph{link prediction} and \emph{transfer learning}, which were traditionally considered two separate research topics. Though a straightforward GP inference has a very high complexity, we develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity.

NeurIPS Conference 2007 Conference Paper

Predictive Matrix-Variate t Models

  • Shenghuo Zhu
  • Kai Yu
  • Yihong Gong

It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrix-variate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper-bound of the log-likelihood, which is very efficient and scalable. The experiments on toy data and the EachMovie dataset show good predictive accuracy of the model.

NeurIPS Conference 2006 Conference Paper

Stochastic Relational Models for Discriminative Link Prediction

  • Kai Yu
  • Wei Chu
  • Shipeng Yu
  • Volker Tresp
  • Zhao Xu

We introduce a Gaussian process (GP) framework, stochastic relational models (SRM), for learning social, physical, and other relational phenomena where interactions between entities are observed. The key idea is to model the stochastic structure of entity relationships (i.e., links) via a tensor interaction of multiple GPs, each defined on one type of entities. These models in fact define a set of nonparametric priors on infinite dimensional tensor matrices, where each element represents a relationship between a tuple of entities. By maximizing the marginalized likelihood, information is exchanged between the participating GPs through the entire relational network, so that the dependency structure of links is messaged to the dependency of entities, reflected by the adapted GP kernels. The framework offers a discriminative approach to link prediction, namely, predicting the existence, strength, or type of relationships based on the partially observed linkage network as well as the attributes of entities (if given). We discuss properties and variants of SRM and derive an efficient learning algorithm. Very encouraging experimental results are achieved on a toy problem and a user-movie preference link prediction task. In the end we discuss extensions of SRM to general relational learning tasks.
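The tensor interaction of multiple GPs induces a covariance over links that factorizes across the participating entity kernels. The sketch below shows only that Kronecker structure for two entity types (the RBF kernels, attribute dimensions, and variable names are illustrative assumptions; the full SRM additionally learns the kernels by maximizing the marginalized likelihood).

```python
import numpy as np

def rbf(X, gamma=1.0):
    """Squared-exponential kernel matrix over rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(2)
U = rng.standard_normal((4, 3))   # attributes of 4 "users"
V = rng.standard_normal((5, 3))   # attributes of 5 "movies"

Ku, Kv = rbf(U), rbf(V)
# Covariance between link (i, j) and link (k, l) factorizes as
# Ku[i, k] * Kv[j, l]: the link kernel is the Kronecker product
# of the two entity kernels.
K_link = np.kron(Ku, Kv)          # (20, 20) covariance over all user-movie pairs
```

A GP with `K_link` as its prior covariance over the flattened link matrix is what makes link prediction discriminative here: observing some links reshapes the posterior over all others through the shared entity kernels.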

NeurIPS Conference 2005 Conference Paper

Soft Clustering on Graphs

  • Kai Yu
  • Shipeng Yu
  • Volker Tresp

We propose a simple clustering framework on graphs encoding pairwise data similarities. Unlike usual similarity-based methods, the approach softly assigns data to clusters in a probabilistic way. More importantly, a hierarchical clustering is naturally derived in this framework to gradually merge lower-level clusters into higher-level ones. A random walk analysis indicates that the algorithm exposes clustering structures in various resolutions, i.e., a higher level statistically models a longer-term diffusion on graphs and thus discovers a more global clustering structure. Finally we provide very encouraging experimental results.
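The random-walk view can be made concrete on a toy graph: a row-stochastic transition matrix raised to the t-th power gives t-step diffusion probabilities, and summing those probabilities over node groups yields a soft two-cluster membership, with larger t exposing more global structure. This is an illustrative stand-in for the diffusion analysis, not the paper's probabilistic assignment algorithm; the graph and cluster labels are assumed for the example.

```python
import numpy as np

# Toy similarity graph: two 3-node cliques joined by one weak edge.
W = np.array([
    [0.00, 1, 1, 0.05, 0, 0],
    [1.00, 0, 1, 0.00, 0, 0],
    [1.00, 1, 0, 0.00, 0, 0],
    [0.05, 0, 0, 0.00, 1, 1],
    [0.00, 0, 0, 1.00, 0, 1],
    [0.00, 0, 0, 1.00, 1, 0],
])

P = W / W.sum(axis=1, keepdims=True)   # row-stochastic random walk
P_t = np.linalg.matrix_power(P, 4)     # 4-step diffusion probabilities

# Soft membership sketch: probability of sitting in each clique
# after t steps, interpreted as a two-cluster soft assignment.
clusters = [list(range(3)), list(range(3, 6))]
member = np.stack([P_t[:, c].sum(axis=1) for c in clusters], axis=1)
```

Raising `P` to a larger power plays the role of moving up the hierarchy: short walks resolve the two cliques sharply, while very long walks blur them into one global cluster.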

NeurIPS Conference 2004 Conference Paper

Learning Gaussian Process Kernels via Hierarchical Bayes

  • Anton Schwaighofer
  • Volker Tresp
  • Kai Yu

We present a novel method for learning with Gaussian process regression in a hierarchical Bayesian framework. In a first step, kernel matrices on a fixed set of input points are learned from data using a simple and efficient EM algorithm. This step is nonparametric, in that it does not require a parametric form of covariance function. In a second step, kernel functions are fitted to approximate the learned covariance matrix using a generalized Nystrom method, which results in a complex, data-driven kernel. We evaluate our approach as a recommendation engine for art images, where the proposed hierarchical Bayesian method leads to excellent prediction performance.